[RFC PATCH 0/6] Configurable fair allocation zone policy v3

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* [RFC PATCH 0/6] Configurable fair allocation zone policy v3
@ 2013-12-17 16:48 Mel Gorman
  2013-12-17 16:48 ` [PATCH 1/6] mm: page_alloc: exclude unreclaimable allocations from zone fairness policy Mel Gorman
                   ` (6 more replies)
  0 siblings, 7 replies; 21+ messages in thread
From: Mel Gorman @ 2013-12-17 16:48 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Dave Hansen, Rik van Riel, Linux-MM, LKML,
	Mel Gorman

This series is currently untested and is being posted to sync up discussions
on the treatment of page cache pages, particularly the sysv part. I have
not thought it through in detail but postings patches is the easiest way
to highlight where I think a problem might be.

Changelog since v2
o Drop an accounting patch, behaviour is deliberate
o Special case tmpfs and shmem pages for discussion

Changelog since v1
o Fix lot of brain damage in the configurable policy patch
o Yoink a page cache annotation patch
o Only account batch pages against allocations eligible for the fair policy
o Add patch that default distributes file pages on remote nodes

Commit 81c0a2bb ("mm: page_alloc: fair zone allocator policy") solved a
bug whereby new pages could be reclaimed before old pages because of how
the page allocator and kswapd interacted on the per-zone LRU lists.

Unfortunately a side-effect missed during review was that it's now very
easy to allocate remote memory on NUMA machines. The problem is that
it is not a simple case of just restoring local allocation policies as
there are genuine reasons why global page aging may be prefereable. It's
still a major change to default behaviour so this patch makes the policy
configurable and sets what I think is a sensible default.

The patches are on top of some NUMA balancing patches currently in -mm.
It's untested and posted to discuss patches 4 and 6.

 Documentation/sysctl/vm.txt |  29 ++++++++++
 include/linux/gfp.h         |   4 +-
 include/linux/mmzone.h      |   2 +
 include/linux/pagemap.h     |   2 +-
 include/linux/swap.h        |   2 +
 kernel/sysctl.c             |   8 +++
 mm/filemap.c                |   2 +
 mm/page_alloc.c             | 136 +++++++++++++++++++++++++++++++++++++-------
 mm/shmem.c                  |  14 +++++
 9 files changed, 176 insertions(+), 23 deletions(-)

-- 
1.8.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH 1/6] mm: page_alloc: exclude unreclaimable allocations from zone fairness policy
  2013-12-17 16:48 [RFC PATCH 0/6] Configurable fair allocation zone policy v3 Mel Gorman
@ 2013-12-17 16:48 ` Mel Gorman
  2013-12-17 16:48 ` [PATCH 2/6] mm: page_alloc: Break out zone page aging distribution into its own helper Mel Gorman
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 21+ messages in thread
From: Mel Gorman @ 2013-12-17 16:48 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Dave Hansen, Rik van Riel, Linux-MM, LKML,
	Mel Gorman

From: Johannes Weiner <hannes@cmpxchg.org>

Dave Hansen noted a regression in a microbenchmark that loops around
open() and close() on an 8-node NUMA machine and bisected it down to
81c0a2bb515f ("mm: page_alloc: fair zone allocator policy").  That
change forces the slab allocations of the file descriptor to spread
out to all 8 nodes, causing remote references in the page allocator
and slab.

The round-robin policy is only there to provide fairness among memory
allocations that are reclaimed involuntarily based on pressure in each
zone.  It does not make sense to apply it to unreclaimable kernel
allocations that are freed manually, in this case instantly after the
allocation, and incur the remote reference costs twice for no reason.

Only round-robin allocations that are usually freed through page
reclaim or slab shrinking.

Cc: <stable@kernel.org>
Bisected-by: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
---
 mm/page_alloc.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 580a5f0..f861d02 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1920,7 +1920,8 @@ zonelist_scan:
 		 * back to remote zones that do not partake in the
 		 * fairness round-robin cycle of this zonelist.
 		 */
-		if (alloc_flags & ALLOC_WMARK_LOW) {
+		if ((alloc_flags & ALLOC_WMARK_LOW) &&
+		    (gfp_mask & GFP_MOVABLE_MASK)) {
 			if (zone_page_state(zone, NR_ALLOC_BATCH) <= 0)
 				continue;
 			if (zone_reclaim_mode &&
-- 
1.8.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 2/6] mm: page_alloc: Break out zone page aging distribution into its own helper
  2013-12-17 16:48 [RFC PATCH 0/6] Configurable fair allocation zone policy v3 Mel Gorman
  2013-12-17 16:48 ` [PATCH 1/6] mm: page_alloc: exclude unreclaimable allocations from zone fairness policy Mel Gorman
@ 2013-12-17 16:48 ` Mel Gorman
  2013-12-17 16:48 ` [PATCH 3/6] mm: page_alloc: Use zone node IDs to approximate locality Mel Gorman
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 21+ messages in thread
From: Mel Gorman @ 2013-12-17 16:48 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Dave Hansen, Rik van Riel, Linux-MM, LKML,
	Mel Gorman

This patch moves the decision on whether to round-robin allocations between
zones and nodes into its own helper functions. It'll make some later patches
easier to understand and it will be automatically inlined.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/page_alloc.c | 63 ++++++++++++++++++++++++++++++++++++++-------------------
 1 file changed, 42 insertions(+), 21 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index f861d02..64020eb 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1872,6 +1872,42 @@ static inline void init_zone_allows_reclaim(int nid)
 #endif	/* CONFIG_NUMA */
 
 /*
+ * Distribute pages in proportion to the individual zone size to ensure fair
+ * page aging.  The zone a page was allocated in should have no effect on the
+ * time the page has in memory before being reclaimed.
+ * 
+ * Returns true if this zone should be skipped to spread the page ages to
+ * other zones.
+ */
+static bool zone_distribute_age(gfp_t gfp_mask, struct zone *preferred_zone,
+				struct zone *zone, int alloc_flags)
+{
+	/* Only round robin in the allocator fast path */
+	if (!(alloc_flags & ALLOC_WMARK_LOW))
+		return false;
+
+	/* Only round robin pages likely to be LRU or reclaimable slab */
+	if (!(gfp_mask & GFP_MOVABLE_MASK))
+		return false;
+
+	/* Distribute to the next zone if this zone has exhausted its batch */
+	if (zone_page_state(zone, NR_ALLOC_BATCH) <= 0)
+		return true;
+
+	/*
+	 * When zone_reclaim_mode is enabled, try to stay in local zones in the
+	 * fastpath.  If that fails, the slowpath is entered, which will do
+	 * another pass starting with the local zones, but ultimately fall back
+	 * back to remote zones that do not partake in the fairness round-robin
+	 * cycle of this zonelist.
+	 */
+	if (zone_reclaim_mode && !zone_local(preferred_zone, zone))
+		return true;
+
+	return false;
+}
+
+/*
  * get_page_from_freelist goes through the zonelist trying to allocate
  * a page.
  */
@@ -1907,27 +1943,12 @@ zonelist_scan:
 		BUILD_BUG_ON(ALLOC_NO_WATERMARKS < NR_WMARK);
 		if (unlikely(alloc_flags & ALLOC_NO_WATERMARKS))
 			goto try_this_zone;
-		/*
-		 * Distribute pages in proportion to the individual
-		 * zone size to ensure fair page aging.  The zone a
-		 * page was allocated in should have no effect on the
-		 * time the page has in memory before being reclaimed.
-		 *
-		 * When zone_reclaim_mode is enabled, try to stay in
-		 * local zones in the fastpath.  If that fails, the
-		 * slowpath is entered, which will do another pass
-		 * starting with the local zones, but ultimately fall
-		 * back to remote zones that do not partake in the
-		 * fairness round-robin cycle of this zonelist.
-		 */
-		if ((alloc_flags & ALLOC_WMARK_LOW) &&
-		    (gfp_mask & GFP_MOVABLE_MASK)) {
-			if (zone_page_state(zone, NR_ALLOC_BATCH) <= 0)
-				continue;
-			if (zone_reclaim_mode &&
-			    !zone_local(preferred_zone, zone))
-				continue;
-		}
+
+		/* Distribute pages to ensure fair page aging */
+		if (zone_distribute_age(gfp_mask, preferred_zone, zone,
+					alloc_flags))
+			continue;
+
 		/*
 		 * When allocating a page cache page for writing, we
 		 * want to get it from a zone that is within its dirty
-- 
1.8.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 3/6] mm: page_alloc: Use zone node IDs to approximate locality
  2013-12-17 16:48 [RFC PATCH 0/6] Configurable fair allocation zone policy v3 Mel Gorman
  2013-12-17 16:48 ` [PATCH 1/6] mm: page_alloc: exclude unreclaimable allocations from zone fairness policy Mel Gorman
  2013-12-17 16:48 ` [PATCH 2/6] mm: page_alloc: Break out zone page aging distribution into its own helper Mel Gorman
@ 2013-12-17 16:48 ` Mel Gorman
  2013-12-17 16:48 ` [PATCH 4/6] mm: Annotate page cache allocations Mel Gorman
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 21+ messages in thread
From: Mel Gorman @ 2013-12-17 16:48 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Dave Hansen, Rik van Riel, Linux-MM, LKML,
	Mel Gorman

zone_local is using node_distance which is a more expensive call than
necessary. On x86, it's another function call in the allocator fast path
and increases cache footprint. This patch makes the assumption zones on a
local node will share the same node ID. The necessary information should
already be cache hot.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
---
 mm/page_alloc.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 64020eb..fd9677e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1816,7 +1816,7 @@ static void zlc_clear_zones_full(struct zonelist *zonelist)
 
 static bool zone_local(struct zone *local_zone, struct zone *zone)
 {
-	return node_distance(local_zone->node, zone->node) == LOCAL_DISTANCE;
+	return zone_to_nid(zone) == numa_node_id();
 }
 
 static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone)
-- 
1.8.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 4/6] mm: Annotate page cache allocations
  2013-12-17 16:48 [RFC PATCH 0/6] Configurable fair allocation zone policy v3 Mel Gorman
                   ` (2 preceding siblings ...)
  2013-12-17 16:48 ` [PATCH 3/6] mm: page_alloc: Use zone node IDs to approximate locality Mel Gorman
@ 2013-12-17 16:48 ` Mel Gorman
  2013-12-17 16:48 ` [PATCH 5/6] mm: page_alloc: Make zone distribution page aging policy configurable Mel Gorman
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 21+ messages in thread
From: Mel Gorman @ 2013-12-17 16:48 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Dave Hansen, Rik van Riel, Linux-MM, LKML,
	Mel Gorman

The fair zone allocation policy needs to distinguish between anonymous,
slab and file-backed pages. This patch annotates many of the page cache
allocations by adjusting __page_cache_alloc. This does not guarantee
that all page cache allocations are being properly annotated. One case
for special consideration is shmem. sysv shared memory and MAP_SHARED
anonymous pages are backed by this and they should be treated as anon by
the fair allocation policy. It is also used by tmpfs which arguably should
be treated as file by the fair allocation policy.

The primary top-level shmem allocation function is shmem_getpage_gfp
which ultimately uses alloc_pages_vma() and not __page_cache_alloc. This
is correct for sysv and MAP_SHARED but tmpfs is still treated as anonymous.
This patch special cases shmem to annotate tmpfs allocations as files for
the fair zone allocation policy.

NOTE: At time of writing it has not been double checked that it annotates
	the different shmem request types. Furthermore, this patch was
	originally base on a patch from Johannes and does not have his
	signed-off-by. Without his signed-off, I cannot sign it off

Cannot-sign-off-without-Johannes
---
 include/linux/gfp.h     |  4 +++-
 include/linux/pagemap.h |  2 +-
 mm/filemap.c            |  2 ++
 mm/shmem.c              | 14 ++++++++++++++
 4 files changed, 20 insertions(+), 2 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 9b4dd49..f69e4cb 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -35,6 +35,7 @@ struct vm_area_struct;
 #define ___GFP_NO_KSWAPD	0x400000u
 #define ___GFP_OTHER_NODE	0x800000u
 #define ___GFP_WRITE		0x1000000u
+#define ___GFP_PAGECACHE	0x2000000u
 /* If the above are modified, __GFP_BITS_SHIFT may need updating */
 
 /*
@@ -92,6 +93,7 @@ struct vm_area_struct;
 #define __GFP_OTHER_NODE ((__force gfp_t)___GFP_OTHER_NODE) /* On behalf of other node */
 #define __GFP_KMEMCG	((__force gfp_t)___GFP_KMEMCG) /* Allocation comes from a memcg-accounted resource */
 #define __GFP_WRITE	((__force gfp_t)___GFP_WRITE)	/* Allocator intends to dirty page */
+#define __GFP_PAGECACHE ((__force gfp_t)___GFP_PAGECACHE)   /* Page cache allocation */
 
 /*
  * This may seem redundant, but it's a way of annotating false positives vs.
@@ -99,7 +101,7 @@ struct vm_area_struct;
  */
 #define __GFP_NOTRACK_FALSE_POSITIVE (__GFP_NOTRACK)
 
-#define __GFP_BITS_SHIFT 25	/* Room for N __GFP_FOO bits */
+#define __GFP_BITS_SHIFT 26	/* Room for N __GFP_FOO bits */
 #define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))
 
 /* This equals 0, but use constants in case they ever change */
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index e3dea75..bda4845 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -221,7 +221,7 @@ extern struct page *__page_cache_alloc(gfp_t gfp);
 #else
 static inline struct page *__page_cache_alloc(gfp_t gfp)
 {
-	return alloc_pages(gfp, 0);
+	return alloc_pages(gfp | __GFP_PAGECACHE, 0);
 }
 #endif
 
diff --git a/mm/filemap.c b/mm/filemap.c
index b7749a9..5bb9225 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -517,6 +517,8 @@ struct page *__page_cache_alloc(gfp_t gfp)
 	int n;
 	struct page *page;
 
+	gfp |= __GFP_PAGECACHE;
+
 	if (cpuset_do_page_mem_spread()) {
 		unsigned int cpuset_mems_cookie;
 		do {
diff --git a/mm/shmem.c b/mm/shmem.c
index 8297623..02d7a9c 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -929,6 +929,17 @@ static struct page *shmem_swapin(swp_entry_t swap, gfp_t gfp,
 	return page;
 }
 
+/* Fugly method of distinguishing sysv/MAP_SHARED anon from tmpfs */
+static bool shmem_inode_on_tmpfs(struct shmem_inode_info *info)
+{
+	/* If no internal shm_mount then it must be tmpfs */
+	if (IS_ERR(shm_mnt))
+		return true;
+
+	/* Consider it to be tmpfs if the superblock is not the internal mount */
+	return info->vfs_inode.i_sb != shm_mnt->mnt_sb;
+}
+
 static struct page *shmem_alloc_page(gfp_t gfp,
 			struct shmem_inode_info *info, pgoff_t index)
 {
@@ -942,6 +953,9 @@ static struct page *shmem_alloc_page(gfp_t gfp,
 	pvma.vm_ops = NULL;
 	pvma.vm_policy = mpol_shared_policy_lookup(&info->policy, index);
 
+	if (shmem_inode_on_tmpfs(info))
+		gfp |= __GFP_PAGECACHE;
+
 	page = alloc_page_vma(gfp, &pvma, 0);
 
 	/* Drop reference taken by mpol_shared_policy_lookup() */
-- 
1.8.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 5/6] mm: page_alloc: Make zone distribution page aging policy configurable
  2013-12-17 16:48 [RFC PATCH 0/6] Configurable fair allocation zone policy v3 Mel Gorman
                   ` (3 preceding siblings ...)
  2013-12-17 16:48 ` [PATCH 4/6] mm: Annotate page cache allocations Mel Gorman
@ 2013-12-17 16:48 ` Mel Gorman
  2013-12-17 16:48 ` [PATCH 6/6] mm: page_alloc: add vm.pagecache_interleave to control default mempolicy for page cache Mel Gorman
  2013-12-17 20:02 ` [RFC PATCH 0/6] Configurable fair allocation zone policy v3 Johannes Weiner
  6 siblings, 0 replies; 21+ messages in thread
From: Mel Gorman @ 2013-12-17 16:48 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Dave Hansen, Rik van Riel, Linux-MM, LKML,
	Mel Gorman

Commit 81c0a2bb ("mm: page_alloc: fair zone allocator policy") solved a
bug whereby new pages could be reclaimed before old pages because of
how the page allocator and kswapd interacted on the per-zone LRU lists.
Unfortunately it was missed during review that a consequence is that
we also round-robin between NUMA nodes. This is bad for two reasons

1. It alters the semantics of MPOL_LOCAL without telling anyone
2. It incurs an immediate remote memory performance hit in exchange
   for a potential performance gain when memory needs to be reclaimed
   later

No cookies for the reviewers on this one.

This patch makes the behaviour of the fair zone allocator policy
configurable.  By default it will only distribute pages that are going
to exist on the LRU between zones local to the allocating process. This
preserves the historical semantics of MPOL_LOCAL.

By default, slab pages are not distributed between zones after this patch is
applied. It can be argued that they should get similar treatment but they
have different lifecycles to LRU pages, the shrinkers are not zone-aware
and the interaction between the page allocator and kswapd is different
for slabs. If it turns out to be an almost universal win, we can change
the default.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 Documentation/sysctl/vm.txt |  32 ++++++++++++++
 include/linux/mmzone.h      |   2 +
 include/linux/swap.h        |   2 +
 kernel/sysctl.c             |   8 ++++
 mm/page_alloc.c             | 102 ++++++++++++++++++++++++++++++++++++++------
 5 files changed, 134 insertions(+), 12 deletions(-)

diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
index 1fbd4eb..8eaa562 100644
--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.txt
@@ -56,6 +56,7 @@ Currently, these files are in /proc/sys/vm:
 - swappiness
 - user_reserve_kbytes
 - vfs_cache_pressure
+- zone_distribute_mode
 - zone_reclaim_mode
 
 ==============================================================
@@ -724,6 +725,37 @@ causes the kernel to prefer to reclaim dentries and inodes.
 
 ==============================================================
 
+zone_distribute_mode
+
+Pages allocation and reclaim are managed on a per-zone basis. When the
+system needs to reclaim memory, candidate pages are selected from these
+per-zone lists.  Historically, a potential consequence was that recently
+allocated pages were considered reclaim candidates. From a zone-local
+perspective, page aging was preserved but from a system-wide perspective
+there was an age inversion problem.
+
+A similar problem occurs on a node level where young pages may be reclaimed
+from the local node instead of allocating remote memory. Unforuntately, the
+cost of accessing remote nodes is higher so the system must choose by default
+between favouring page aging or node locality. zone_distribute_mode controls
+how the system will distribute page ages between zones.
+
+0	= Never round-robin based on age
+
+Otherwise the values are ORed together
+
+1	= Distribute anon pages between zones local to the allocating node
+2	= Distribute file pages between zones local to the allocating node
+4	= Distribute slab pages between zones local to the allocating node
+
+The following three flags effectively alter MPOL_DEFAULT, be careful.
+
+8	= Distribute anon pages between zones remote to the allocating node
+16	= Distribute file pages between zones remote to the allocating node
+32	= Distribute slab pages between zones remote to the allocating node
+
+==============================================================
+
 zone_reclaim_mode:
 
 Zone_reclaim_mode allows someone to set more or less aggressive approaches to
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index b835d3f..20a75e3 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -897,6 +897,8 @@ int sysctl_min_unmapped_ratio_sysctl_handler(struct ctl_table *, int,
 			void __user *, size_t *, loff_t *);
 int sysctl_min_slab_ratio_sysctl_handler(struct ctl_table *, int,
 			void __user *, size_t *, loff_t *);
+int sysctl_zone_distribute_mode_handler(struct ctl_table *, int,
+			void __user *, size_t *, loff_t *);
 
 extern int numa_zonelist_order_handler(struct ctl_table *, int,
 			void __user *, size_t *, loff_t *);
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 46ba0c6..44329b0 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -318,6 +318,8 @@ extern int vm_swappiness;
 extern int remove_mapping(struct address_space *mapping, struct page *page);
 extern unsigned long vm_total_pages;
 
+extern unsigned __bitwise__ zone_distribute_mode;
+
 #ifdef CONFIG_NUMA
 extern int zone_reclaim_mode;
 extern int sysctl_min_unmapped_ratio;
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 34a6047..b75c08f 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1349,6 +1349,14 @@ static struct ctl_table vm_table[] = {
 		.extra1		= &zero,
 	},
 #endif
+	{
+		.procname	= "zone_distribute_mode",
+		.data		= &zone_distribute_mode,
+		.maxlen		= sizeof(zone_distribute_mode),
+		.mode		= 0644,
+		.proc_handler	= sysctl_zone_distribute_mode_handler,
+		.extra1		= &zero,
+	},
 #ifdef CONFIG_NUMA
 	{
 		.procname	= "zone_reclaim_mode",
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index fd9677e..c2a2229 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1871,6 +1871,49 @@ static inline void init_zone_allows_reclaim(int nid)
 }
 #endif	/* CONFIG_NUMA */
 
+/* Controls how page ages are distributed across zones automatically */
+unsigned __bitwise__ zone_distribute_mode __read_mostly;
+
+/* See zone_distribute_mode docmentation in Documentation/sysctl/vm.txt */
+#define DISTRIBUTE_DISABLE	(0)
+#define DISTRIBUTE_LOCAL_ANON	(1UL << 0)
+#define DISTRIBUTE_LOCAL_FILE	(1UL << 1)
+#define DISTRIBUTE_LOCAL_SLAB	(1UL << 2)
+#define DISTRIBUTE_REMOTE_ANON	(1UL << 3)
+#define DISTRIBUTE_REMOTE_FILE	(1UL << 4)
+#define DISTRIBUTE_REMOTE_SLAB	(1UL << 5)
+
+#define DISTRIBUTE_STUPID_ANON	(DISTRIBUTE_LOCAL_ANON|DISTRIBUTE_REMOTE_ANON)
+#define DISTRIBUTE_STUPID_FILE	(DISTRIBUTE_LOCAL_FILE|DISTRIBUTE_REMOTE_FILE)
+#define DISTRIBUTE_STUPID_SLAB	(DISTRIBUTE_LOCAL_SLAB|DISTRIBUTE_REMOTE_SLAB)
+#define DISTRIBUTE_DEFAULT	(DISTRIBUTE_LOCAL_ANON|DISTRIBUTE_LOCAL_FILE|DISTRIBUTE_LOCAL_SLAB)
+
+/* Only these GFP flags are affected by the fair zone allocation policy */
+#define DISTRIBUTE_GFP_MASK	((GFP_MOVABLE_MASK|__GFP_PAGECACHE))
+
+int sysctl_zone_distribute_mode_handler(ctl_table *table, int write,
+	void __user *buffer, size_t *length, loff_t *ppos)
+{
+	int rc;
+
+	rc = proc_dointvec_minmax(table, write, buffer, length, ppos);
+	if (rc)
+		return rc;
+
+	/* If you are an admin reading this comment, what were you thinking? */
+	if (WARN_ON_ONCE((zone_distribute_mode & DISTRIBUTE_STUPID_ANON) ==
+							DISTRIBUTE_STUPID_ANON))
+		zone_distribute_mode &= ~DISTRIBUTE_REMOTE_ANON;
+	if (WARN_ON_ONCE((zone_distribute_mode & DISTRIBUTE_STUPID_FILE) ==
+							DISTRIBUTE_STUPID_FILE))
+		zone_distribute_mode &= ~DISTRIBUTE_REMOTE_FILE;
+	if (WARN_ON_ONCE((zone_distribute_mode & DISTRIBUTE_STUPID_SLAB) ==
+							DISTRIBUTE_STUPID_SLAB))
+		zone_distribute_mode &= ~DISTRIBUTE_REMOTE_SLAB;
+
+	return 0;
+}
+
 /*
  * Distribute pages in proportion to the individual zone size to ensure fair
  * page aging.  The zone a page was allocated in should have no effect on the
@@ -1882,26 +1925,60 @@ static inline void init_zone_allows_reclaim(int nid)
 static bool zone_distribute_age(gfp_t gfp_mask, struct zone *preferred_zone,
 				struct zone *zone, int alloc_flags)
 {
+	bool zone_is_local;
+	bool is_file, is_slab, is_anon;
+
 	/* Only round robin in the allocator fast path */
 	if (!(alloc_flags & ALLOC_WMARK_LOW))
 		return false;
 
-	/* Only round robin pages likely to be LRU or reclaimable slab */
-	if (!(gfp_mask & GFP_MOVABLE_MASK))
+	/* Only a subset of GFP flags are considered for fair zone policy */
+	if (!(gfp_mask & DISTRIBUTE_GFP_MASK))
 		return false;
 
-	/* Distribute to the next zone if this zone has exhausted its batch */
-	if (zone_page_state(zone, NR_ALLOC_BATCH) <= 0)
-		return true;
-
 	/*
-	 * When zone_reclaim_mode is enabled, try to stay in local zones in the
-	 * fastpath.  If that fails, the slowpath is entered, which will do
-	 * another pass starting with the local zones, but ultimately fall back
-	 * back to remote zones that do not partake in the fairness round-robin
-	 * cycle of this zonelist.
+	 * Classify the type of allocation. From this point on, the fair zone
+	 * allocation policy is being applied. If the allocation does not meet
+	 * the criteria the zone must be skipped.
 	 */
-	if (zone_reclaim_mode && !zone_local(preferred_zone, zone))
+	is_file = gfp_mask & __GFP_PAGECACHE;
+	is_slab = gfp_mask & __GFP_RECLAIMABLE;
+	is_anon = (!is_file && !is_slab);
+	WARN_ON_ONCE(is_slab && is_file);
+
+	zone_is_local = zone_local(preferred_zone, zone);
+	if (zone_local(preferred_zone, zone)) {
+		/* Distribute between zones local to the node if requested */
+		if (is_anon && (zone_distribute_mode & DISTRIBUTE_LOCAL_ANON))
+			goto check_batch;
+		if (is_file && (zone_distribute_mode & DISTRIBUTE_LOCAL_FILE))
+			goto check_batch;
+		if (is_file && (zone_distribute_mode & DISTRIBUTE_LOCAL_FILE))
+			goto check_batch;
+	} else {
+		/*
+		 * When zone_reclaim_mode is enabled, stick to local zones. If
+		 * that fails, the slowpath is entered, which will do another
+		 * pass starting with the local zones, but ultimately fall back
+		 * back to remote zones that do not partake in the fairness
+		 * round-robin cycle of this zonelist.
+		 */
+		if (zone_reclaim_mode)
+			return false;
+
+		if (is_anon && (zone_distribute_mode & DISTRIBUTE_REMOTE_ANON))
+			goto check_batch;
+		if (is_file && (zone_distribute_mode & DISTRIBUTE_REMOTE_FILE))
+			goto check_batch;
+		if (is_file && (zone_distribute_mode & DISTRIBUTE_REMOTE_FILE))
+			goto check_batch;
+	}
+
+	return true;
+
+check_batch:
+	/* Distribute to the next zone if this zone has exhausted its batch */
+	if (zone_page_state(zone, NR_ALLOC_BATCH) <= 0)
 		return true;
 
 	return false;
@@ -3797,6 +3874,7 @@ void __ref build_all_zonelists(pg_data_t *pgdat, struct zone *zone)
 		__build_all_zonelists(NULL);
 		mminit_verify_zonelist();
 		cpuset_init_current_mems_allowed();
+		zone_distribute_mode = DISTRIBUTE_DEFAULT;
 	} else {
 #ifdef CONFIG_MEMORY_HOTPLUG
 		if (zone)
-- 
1.8.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 6/6] mm: page_alloc: add vm.pagecache_interleave to control default mempolicy for page cache
  2013-12-17 16:48 [RFC PATCH 0/6] Configurable fair allocation zone policy v3 Mel Gorman
                   ` (4 preceding siblings ...)
  2013-12-17 16:48 ` [PATCH 5/6] mm: page_alloc: Make zone distribution page aging policy configurable Mel Gorman
@ 2013-12-17 16:48 ` Mel Gorman
  2013-12-17 20:02 ` [RFC PATCH 0/6] Configurable fair allocation zone policy v3 Johannes Weiner
  6 siblings, 0 replies; 21+ messages in thread
From: Mel Gorman @ 2013-12-17 16:48 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Dave Hansen, Rik van Riel, Linux-MM, LKML,
	Mel Gorman

This patch introduces a vm.pagecache_interleave sysctl that allows the
administrator to alter the default memory allocation policy for file-backed
pages. It removes a more configurable interface that is expected to be
too complex to expose to users and give an unnecessarily level of control.

By default it is disabled but there is strong evidence that users on NUMA
machines will want to enable this. The default is expected to change
once the documention is in sync. Ideally it would also be possible to
control on a per-process basis by allowing processes to select either an
MPOL_LOCAL or MPOL_INTERLEAVE_PAGECACHE memory policy as memory policies
are the traditional way for controlling allocation behaviour.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 Documentation/sysctl/vm.txt | 61 +++++++++++++++++++++------------------------
 include/linux/mmzone.h      |  2 +-
 include/linux/swap.h        |  2 +-
 kernel/sysctl.c             |  8 +++---
 mm/page_alloc.c             | 18 +++++--------
 5 files changed, 41 insertions(+), 50 deletions(-)

diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
index 8eaa562..655ed0a 100644
--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.txt
@@ -49,6 +49,7 @@ Currently, these files are in /proc/sys/vm:
 - oom_kill_allocating_task
 - overcommit_memory
 - overcommit_ratio
+- pagecache_interleave
 - page-cluster
 - panic_on_oom
 - percpu_pagelist_fraction
@@ -56,7 +57,6 @@ Currently, these files are in /proc/sys/vm:
 - swappiness
 - user_reserve_kbytes
 - vfs_cache_pressure
-- zone_distribute_mode
 - zone_reclaim_mode
 
 ==============================================================
@@ -608,6 +608,34 @@ of physical RAM.  See above.
 
 ==============================================================
 
+pagecache_interleave:
+
+This setting is only relevant to NUMA machines.
+
+Historically, the default behaviour of the system is to allocate memory
+local to the process. The behaviour is usually modified through the use
+of memory policies while zone_reclaim_mode controls how strict the local
+memory allocation policy is.
+
+Issues arise when the allocating process is frequently running on the same
+node. The kernels memory reclaim daemon runs one instance per NUMA node.
+A consequence is that relatively new memory may be reclaimed by kswapd when
+the allocating process is running on a specific node. The user-visible
+impact is that the system appears to do more IO than necessary when a
+workload is accessing files that are larger than a given NUMA node.
+
+One way of addressing this is to use the interleave memory policy but that
+is not always possible.
+
+Another option is to enable this setting. When enabled, the default
+memory allocation changes from MPOL_LOCAL to interleaving file-backed
+pages by default. The downside is that some file accesses will now be
+to remote memory even though the local node had available resources.
+The upside is that workloads working on files larger than a NUMA node
+will not reclaim active pages prematurely.
+
+==============================================================
+
 page-cluster
 
 page-cluster controls the number of pages up to which consecutive pages
@@ -725,37 +753,6 @@ causes the kernel to prefer to reclaim dentries and inodes.
 
 ==============================================================
 
-zone_distribute_mode
-
-Pages allocation and reclaim are managed on a per-zone basis. When the
-system needs to reclaim memory, candidate pages are selected from these
-per-zone lists.  Historically, a potential consequence was that recently
-allocated pages were considered reclaim candidates. From a zone-local
-perspective, page aging was preserved but from a system-wide perspective
-there was an age inversion problem.
-
-A similar problem occurs on a node level where young pages may be reclaimed
-from the local node instead of allocating remote memory. Unforuntately, the
-cost of accessing remote nodes is higher so the system must choose by default
-between favouring page aging or node locality. zone_distribute_mode controls
-how the system will distribute page ages between zones.
-
-0	= Never round-robin based on age
-
-Otherwise the values are ORed together
-
-1	= Distribute anon pages between zones local to the allocating node
-2	= Distribute file pages between zones local to the allocating node
-4	= Distribute slab pages between zones local to the allocating node
-
-The following three flags effectively alter MPOL_DEFAULT, be careful.
-
-8	= Distribute anon pages between zones remote to the allocating node
-16	= Distribute file pages between zones remote to the allocating node
-32	= Distribute slab pages between zones remote to the allocating node
-
-==============================================================
-
 zone_reclaim_mode:
 
 Zone_reclaim_mode allows someone to set more or less aggressive approaches to
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 20a75e3..2fb9e2d 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -897,7 +897,7 @@ int sysctl_min_unmapped_ratio_sysctl_handler(struct ctl_table *, int,
 			void __user *, size_t *, loff_t *);
 int sysctl_min_slab_ratio_sysctl_handler(struct ctl_table *, int,
 			void __user *, size_t *, loff_t *);
-int sysctl_zone_distribute_mode_handler(struct ctl_table *, int,
+int sysctl_zone_pagecache_interleave_handler(struct ctl_table *, int,
 			void __user *, size_t *, loff_t *);
 
 extern int numa_zonelist_order_handler(struct ctl_table *, int,
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 44329b0..2b522cf 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -318,7 +318,7 @@ extern int vm_swappiness;
 extern int remove_mapping(struct address_space *mapping, struct page *page);
 extern unsigned long vm_total_pages;
 
-extern unsigned __bitwise__ zone_distribute_mode;
+extern unsigned int zone_pagecache_interleave;
 
 #ifdef CONFIG_NUMA
 extern int zone_reclaim_mode;
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index b75c08f..385d7cb 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1350,11 +1350,11 @@ static struct ctl_table vm_table[] = {
 	},
 #endif
 	{
-		.procname	= "zone_distribute_mode",
-		.data		= &zone_distribute_mode,
-		.maxlen		= sizeof(zone_distribute_mode),
+		.procname	= "pagecache_interleave",
+		.data		= &zone_pagecache_interleave,
+		.maxlen		= sizeof(zone_pagecache_interleave),
 		.mode		= 0644,
-		.proc_handler	= sysctl_zone_distribute_mode_handler,
+		.proc_handler	= sysctl_zone_pagecache_interleave_handler,
 		.extra1		= &zero,
 	},
 #ifdef CONFIG_NUMA
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c2a2229..b6c8e63 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1872,7 +1872,8 @@ static inline void init_zone_allows_reclaim(int nid)
 #endif	/* CONFIG_NUMA */
 
 /* Controls how page ages are distributed across zones automatically */
-unsigned __bitwise__ zone_distribute_mode __read_mostly;
+static unsigned __bitwise__ zone_distribute_mode __read_mostly;
+unsigned int zone_pagecache_interleave;
 
 /* See zone_distribute_mode docmentation in Documentation/sysctl/vm.txt */
 #define DISTRIBUTE_DISABLE	(0)
@@ -1891,7 +1892,7 @@ unsigned __bitwise__ zone_distribute_mode __read_mostly;
 /* Only these GFP flags are affected by the fair zone allocation policy */
 #define DISTRIBUTE_GFP_MASK	((GFP_MOVABLE_MASK|__GFP_PAGECACHE))
 
-int sysctl_zone_distribute_mode_handler(ctl_table *table, int write,
+int sysctl_zone_pagecache_interleave_handler(ctl_table *table, int write,
 	void __user *buffer, size_t *length, loff_t *ppos)
 {
 	int rc;
@@ -1900,16 +1901,9 @@ int sysctl_zone_distribute_mode_handler(ctl_table *table, int write,
 	if (rc)
 		return rc;
 
-	/* If you are an admin reading this comment, what were you thinking? */
-	if (WARN_ON_ONCE((zone_distribute_mode & DISTRIBUTE_STUPID_ANON) ==
-							DISTRIBUTE_STUPID_ANON))
-		zone_distribute_mode &= ~DISTRIBUTE_REMOTE_ANON;
-	if (WARN_ON_ONCE((zone_distribute_mode & DISTRIBUTE_STUPID_FILE) ==
-							DISTRIBUTE_STUPID_FILE))
-		zone_distribute_mode &= ~DISTRIBUTE_REMOTE_FILE;
-	if (WARN_ON_ONCE((zone_distribute_mode & DISTRIBUTE_STUPID_SLAB) ==
-							DISTRIBUTE_STUPID_SLAB))
-		zone_distribute_mode &= ~DISTRIBUTE_REMOTE_SLAB;
+	zone_distribute_mode = DISTRIBUTE_LOCAL_ANON|DISTRIBUTE_LOCAL_FILE|DISTRIBUTE_LOCAL_SLAB;
+	if (zone_pagecache_interleave)
+		zone_distribute_mode |= DISTRIBUTE_REMOTE_FILE;
 
 	return 0;
 }
-- 
1.8.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH 0/6] Configurable fair allocation zone policy v3
  2013-12-17 16:48 [RFC PATCH 0/6] Configurable fair allocation zone policy v3 Mel Gorman
                   ` (5 preceding siblings ...)
  2013-12-17 16:48 ` [PATCH 6/6] mm: page_alloc: add vm.pagecache_interleave to control default mempolicy for page cache Mel Gorman
@ 2013-12-17 20:02 ` Johannes Weiner
  2013-12-18  6:17   ` Johannes Weiner
  2013-12-18 14:51   ` Michal Hocko
  6 siblings, 2 replies; 21+ messages in thread
From: Johannes Weiner @ 2013-12-17 20:02 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Andrew Morton, Dave Hansen, Rik van Riel, Linux-MM, LKML

Hi Mel,

On Tue, Dec 17, 2013 at 04:48:18PM +0000, Mel Gorman wrote:
> This series is currently untested and is being posted to sync up discussions
> on the treatment of page cache pages, particularly the sysv part. I have
> not thought it through in detail but postings patches is the easiest way
> to highlight where I think a problem might be.
>
> Changelog since v2
> o Drop an accounting patch, behaviour is deliberate
> o Special case tmpfs and shmem pages for discussion
> 
> Changelog since v1
> o Fix lot of brain damage in the configurable policy patch
> o Yoink a page cache annotation patch
> o Only account batch pages against allocations eligible for the fair policy
> o Add patch that default distributes file pages on remote nodes
> 
> Commit 81c0a2bb ("mm: page_alloc: fair zone allocator policy") solved a
> bug whereby new pages could be reclaimed before old pages because of how
> the page allocator and kswapd interacted on the per-zone LRU lists.

Not just that, it was about ensuring predictable cache replacement and
maximizing the cache's effectiveness.  This implicitely fixed the
kswapd interaction bug, but that was not the sole reason (I realize
that the original changelog is incomplete and I apologize for that).

I have had offline discussions with Andrea back then and his first
suggestion was too to make this a zone fairness placement that is
exclusive to the local node, but eventually he agreed that the problem
applies just as much on the global level and that we should apply
fairness throughout the system as long as we honor zone_reclaim_mode
and hard bindings.  During our discussions now, it turned out that
zone_reclaim_mode is a terrible predictor for preferred locality, but
we also more or less agreed that the locality issues in the first
place are not really applicable to cache loads dominated by IO cost.

So I think the main discrepancy between the original patch and what we
truly want is that aging fairness is really only relevant for actual
cache backed by secondary storage, because cache replacement is an
ongoing operation that involves IO.  As opposed to memory types that
involve IO only in extreme cases (anon, tmpfs, shmem) or no IO at all
(slab, kernel allocations), in which case we prefer NUMA locality.

> Unfortunately a side-effect missed during review was that it's now very
> easy to allocate remote memory on NUMA machines. The problem is that
> it is not a simple case of just restoring local allocation policies as
> there are genuine reasons why global page aging may be prefereable. It's
> still a major change to default behaviour so this patch makes the policy
> configurable and sets what I think is a sensible default.
> 
> The patches are on top of some NUMA balancing patches currently in -mm.
> It's untested and posted to discuss patches 4 and 6.

It might be easier in dealing with -stable if we start with the
critical fix(es) to restore sane functionality as much and as compact
as possible and then place the cleanups on top?

In my local tree, I have the following as the first patch:

---
From: Johannes Weiner <hannes@cmpxchg.org>
Subject: [patch] mm: page_alloc: restrict fair allocator policy to page cache

81c0a2bb515f ("mm: page_alloc: fair zone allocator policy") was merged
in order to ensure predictable page cache replacement and to maximize
the cache's effectiveness of reducing IO regardless of zone or node
topology.

However, it was overzealous in round-robin placing every type of
allocation over all allowable nodes, instead of preferring locality,
which resulted in severe regressions on certain NUMA workloads that
have nothing to do with page cache.

This patch drastically reduces the impact of the original change by
having the round-robin placement policy only apply to page cache
backed by secondary storage, and no longer to anonymous memory, shmem,
tmpfs, slab allocations.

This still changes the long-standing behavior of page cache adhering
to the configured memory policy and preferring local allocations per
default, so make it configurable in case somebody relies on it.
However, we also expect the majority of users to prefer maximium cache
effectiveness and a predictable replacement behavior over memory
locality, so reflect this in the default setting of the sysctl.
---
 Documentation/sysctl/vm.txt             | 21 +++++++++++++++++
 Documentation/vm/numa_memory_policy.txt |  8 +++++++
 include/linux/gfp.h                     |  4 +++-
 include/linux/pagemap.h                 |  2 +-
 include/linux/swap.h                    |  2 ++
 kernel/sysctl.c                         |  8 +++++++
 mm/filemap.c                            |  2 ++
 mm/page_alloc.c                         | 41 +++++++++++++++++++++++++--------
 8 files changed, 76 insertions(+), 12 deletions(-)

diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
index 1fbd4eb7b64a..50d250f7470f 100644
--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.txt
@@ -38,6 +38,7 @@ Currently, these files are in /proc/sys/vm:
 - memory_failure_early_kill
 - memory_failure_recovery
 - min_free_kbytes
+- pagecache_mempolicy_mode
 - min_slab_ratio
 - min_unmapped_ratio
 - mmap_min_addr
@@ -404,6 +405,26 @@ Setting this too high will OOM your machine instantly.
 
 =============================================================
 
+pagecache_mempolicy_mode:
+
+This is available only on NUMA kernels.
+
+Per default, the configured memory policy is applicable to anonymous
+memory, shmem, tmpfs, etc., whereas pagecache is allocated in an
+interleaving fashion over all allowed nodes (hardbindings and
+zone_reclaim_mode excluded).
+
+The assumption is that, when it comes to pagecache, users generally
+prefer predictable replacement behavior regardless of NUMA topology
+and maximizing the cache's effectiveness in reducing IO over memory
+locality.
+
+This behavior can be changed by enabling pagecache_mempolicy_mode, in
+which case page cache allocations will be placed according to the
+configured memory policy (Documentation/vm/numa_memory_policy.txt).
+
+=============================================================
+
 min_slab_ratio:
 
 This is available only on NUMA kernels.
diff --git a/Documentation/vm/numa_memory_policy.txt b/Documentation/vm/numa_memory_policy.txt
index 4e7da6543424..64d48b6378db 100644
--- a/Documentation/vm/numa_memory_policy.txt
+++ b/Documentation/vm/numa_memory_policy.txt
@@ -16,6 +16,14 @@ programming interface that a NUMA-aware application can take advantage of.  When
 both cpusets and policies are applied to a task, the restrictions of the cpuset
 takes priority.  See "MEMORY POLICIES AND CPUSETS" below for more details.
 
+Note that, per default, the memory policies as described below apply to process
+memory and shmem/tmpfs/ramfs only.  Pagecache backed by secondary storage will
+be interleaved fairly over all allowable nodes (respecting hardbindings and
+zone_reclaim_mode) in order to maximize the cache's effectiveness in reducing IO
+and to ensure predictable cache replacement.  Special setups that require
+pagecache to adhere to the configured memory policy can change this behavior by
+enabling pagecache_mempolicy_mode (see Documentation/sysctl/vm.txt).
+
 MEMORY POLICY CONCEPTS
 
 Scope of Memory Policies
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 9b4dd491f7e8..f69e4cb78ccf 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -35,6 +35,7 @@ struct vm_area_struct;
 #define ___GFP_NO_KSWAPD	0x400000u
 #define ___GFP_OTHER_NODE	0x800000u
 #define ___GFP_WRITE		0x1000000u
+#define ___GFP_PAGECACHE	0x2000000u
 /* If the above are modified, __GFP_BITS_SHIFT may need updating */
 
 /*
@@ -92,6 +93,7 @@ struct vm_area_struct;
 #define __GFP_OTHER_NODE ((__force gfp_t)___GFP_OTHER_NODE) /* On behalf of other node */
 #define __GFP_KMEMCG	((__force gfp_t)___GFP_KMEMCG) /* Allocation comes from a memcg-accounted resource */
 #define __GFP_WRITE	((__force gfp_t)___GFP_WRITE)	/* Allocator intends to dirty page */
+#define __GFP_PAGECACHE ((__force gfp_t)___GFP_PAGECACHE)   /* Page cache allocation */
 
 /*
  * This may seem redundant, but it's a way of annotating false positives vs.
@@ -99,7 +101,7 @@ struct vm_area_struct;
  */
 #define __GFP_NOTRACK_FALSE_POSITIVE (__GFP_NOTRACK)
 
-#define __GFP_BITS_SHIFT 25	/* Room for N __GFP_FOO bits */
+#define __GFP_BITS_SHIFT 26	/* Room for N __GFP_FOO bits */
 #define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))
 
 /* This equals 0, but use constants in case they ever change */
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index e3dea75a078b..bda48453af8e 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -221,7 +221,7 @@ extern struct page *__page_cache_alloc(gfp_t gfp);
 #else
 static inline struct page *__page_cache_alloc(gfp_t gfp)
 {
-	return alloc_pages(gfp, 0);
+	return alloc_pages(gfp | __GFP_PAGECACHE, 0);
 }
 #endif
 
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 46ba0c6c219f..3458994b0881 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -320,11 +320,13 @@ extern unsigned long vm_total_pages;
 
 #ifdef CONFIG_NUMA
 extern int zone_reclaim_mode;
+extern int pagecache_mempolicy_mode;
 extern int sysctl_min_unmapped_ratio;
 extern int sysctl_min_slab_ratio;
 extern int zone_reclaim(struct zone *, gfp_t, unsigned int);
 #else
 #define zone_reclaim_mode 0
+#define pagecache_mempolicy_mode 0
 static inline int zone_reclaim(struct zone *z, gfp_t mask, unsigned int order)
 {
 	return 0;
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 34a604726d0b..a8c56c1dc98e 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1359,6 +1359,14 @@ static struct ctl_table vm_table[] = {
 		.extra1		= &zero,
 	},
 	{
+		.procname	= "pagecache_mempolicy_mode",
+		.data		= &pagecache_mempolicy_mode,
+		.maxlen		= sizeof(pagecache_mempolicy_mode),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec,
+		.extra1		= &zero,
+	},
+	{
 		.procname	= "min_unmapped_ratio",
 		.data		= &sysctl_min_unmapped_ratio,
 		.maxlen		= sizeof(sysctl_min_unmapped_ratio),
diff --git a/mm/filemap.c b/mm/filemap.c
index b7749a92021c..5bb922506906 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -517,6 +517,8 @@ struct page *__page_cache_alloc(gfp_t gfp)
 	int n;
 	struct page *page;
 
+	gfp |= __GFP_PAGECACHE;
+
 	if (cpuset_do_page_mem_spread()) {
 		unsigned int cpuset_mems_cookie;
 		do {
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 580a5f075ed0..b28370932950 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1547,7 +1547,15 @@ again:
 					  get_pageblock_migratetype(page));
 	}
 
+	/*
+	 * All allocations eat into the round-robin batch, even
+	 * allocations that are not subject to round-robin placement
+	 * themselves.  This makes sure that allocations that ARE
+	 * subject to round-robin placement compensate for the
+	 * allocations that aren't, to have equal placement overall.
+	 */
 	__mod_zone_page_state(zone, NR_ALLOC_BATCH, -(1 << order));
+
 	__count_zone_vm_events(PGALLOC, zone, 1 << order);
 	zone_statistics(preferred_zone, zone, gfp_flags);
 	local_irq_restore(flags);
@@ -1699,6 +1707,15 @@ bool zone_watermark_ok_safe(struct zone *z, int order, unsigned long mark,
 
 #ifdef CONFIG_NUMA
 /*
+ * pagecache_mempolicy_mode - whether page cache should honor the
+ * configured memory policy and allocate from the zonelist in order of
+ * preference, or whether it should be interleaved fairly over all
+ * allowed zones in the given zonelist to maximize cache effects and
+ * ensure predictable cache replacement.
+ */
+int pagecache_mempolicy_mode __read_mostly;
+
+/*
  * zlc_setup - Setup for "zonelist cache".  Uses cached zone data to
  * skip over zones that are not allowed by the cpuset, or that have
  * been recently (in last second) found to be nearly full.  See further
@@ -1816,7 +1833,7 @@ static void zlc_clear_zones_full(struct zonelist *zonelist)
 
 static bool zone_local(struct zone *local_zone, struct zone *zone)
 {
-	return node_distance(local_zone->node, zone->node) == LOCAL_DISTANCE;
+	return local_zone->node == zone->node;
 }
 
 static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone)
@@ -1908,22 +1925,25 @@ zonelist_scan:
 		if (unlikely(alloc_flags & ALLOC_NO_WATERMARKS))
 			goto try_this_zone;
 		/*
-		 * Distribute pages in proportion to the individual
-		 * zone size to ensure fair page aging.  The zone a
-		 * page was allocated in should have no effect on the
-		 * time the page has in memory before being reclaimed.
+		 * Distribute page cache pages in proportion to the
+		 * individual zone size to ensure fair page aging.
+		 * The zone a page was allocated in should have no
+		 * effect on the time the page has in memory before
+		 * being reclaimed.
 		 *
-		 * When zone_reclaim_mode is enabled, try to stay in
-		 * local zones in the fastpath.  If that fails, the
+		 * When pagecache_mempolicy_mode or zone_reclaim_mode
+		 * is enabled, try to allocate from zones within the
+		 * preferred node in the fastpath.  If that fails, the
 		 * slowpath is entered, which will do another pass
 		 * starting with the local zones, but ultimately fall
 		 * back to remote zones that do not partake in the
 		 * fairness round-robin cycle of this zonelist.
 		 */
-		if (alloc_flags & ALLOC_WMARK_LOW) {
+		if ((alloc_flags & ALLOC_WMARK_LOW) &&
+		    (gfp_mask & __GFP_PAGECACHE)) {
 			if (zone_page_state(zone, NR_ALLOC_BATCH) <= 0)
 				continue;
-			if (zone_reclaim_mode &&
+			if ((zone_reclaim_mode || pagecache_mempolicy_mode) &&
 			    !zone_local(preferred_zone, zone))
 				continue;
 		}
@@ -2390,7 +2410,8 @@ static void prepare_slowpath(gfp_t gfp_mask, unsigned int order,
 		 * thrash fairness information for zones that are not
 		 * actually part of this zonelist's round-robin cycle.
 		 */
-		if (zone_reclaim_mode && !zone_local(preferred_zone, zone))
+		if ((zone_reclaim_mode || pagecache_mempolicy_mode) &&
+		    !zone_local(preferred_zone, zone))
 			continue;
 		mod_zone_page_state(zone, NR_ALLOC_BATCH,
 				    high_wmark_pages(zone) -
-- 
1.8.4.2

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH 0/6] Configurable fair allocation zone policy v3
  2013-12-17 20:02 ` [RFC PATCH 0/6] Configurable fair allocation zone policy v3 Johannes Weiner
@ 2013-12-18  6:17   ` Johannes Weiner
  2013-12-18 13:47     ` Rik van Riel
  2013-12-18 15:00     ` Mel Gorman
  2013-12-18 14:51   ` Michal Hocko
  1 sibling, 2 replies; 21+ messages in thread
From: Johannes Weiner @ 2013-12-18  6:17 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Andrew Morton, Dave Hansen, Rik van Riel, Linux-MM, LKML

On Tue, Dec 17, 2013 at 03:02:10PM -0500, Johannes Weiner wrote:
> Hi Mel,
> 
> On Tue, Dec 17, 2013 at 04:48:18PM +0000, Mel Gorman wrote:
> > This series is currently untested and is being posted to sync up discussions
> > on the treatment of page cache pages, particularly the sysv part. I have
> > not thought it through in detail but postings patches is the easiest way
> > to highlight where I think a problem might be.
> >
> > Changelog since v2
> > o Drop an accounting patch, behaviour is deliberate
> > o Special case tmpfs and shmem pages for discussion
> > 
> > Changelog since v1
> > o Fix lot of brain damage in the configurable policy patch
> > o Yoink a page cache annotation patch
> > o Only account batch pages against allocations eligible for the fair policy
> > o Add patch that default distributes file pages on remote nodes
> > 
> > Commit 81c0a2bb ("mm: page_alloc: fair zone allocator policy") solved a
> > bug whereby new pages could be reclaimed before old pages because of how
> > the page allocator and kswapd interacted on the per-zone LRU lists.
> 
> Not just that, it was about ensuring predictable cache replacement and
> maximizing the cache's effectiveness.  This implicitely fixed the
> kswapd interaction bug, but that was not the sole reason (I realize
> that the original changelog is incomplete and I apologize for that).
> 
> I have had offline discussions with Andrea back then and his first
> suggestion was too to make this a zone fairness placement that is
> exclusive to the local node, but eventually he agreed that the problem
> applies just as much on the global level and that we should apply
> fairness throughout the system as long as we honor zone_reclaim_mode
> and hard bindings.  During our discussions now, it turned out that
> zone_reclaim_mode is a terrible predictor for preferred locality, but
> we also more or less agreed that the locality issues in the first
> place are not really applicable to cache loads dominated by IO cost.
> 
> So I think the main discrepancy between the original patch and what we
> truly want is that aging fairness is really only relevant for actual
> cache backed by secondary storage, because cache replacement is an
> ongoing operation that involves IO.  As opposed to memory types that
> involve IO only in extreme cases (anon, tmpfs, shmem) or no IO at all
> (slab, kernel allocations), in which case we prefer NUMA locality.
> 
> > Unfortunately a side-effect missed during review was that it's now very
> > easy to allocate remote memory on NUMA machines. The problem is that
> > it is not a simple case of just restoring local allocation policies as
> > there are genuine reasons why global page aging may be prefereable. It's
> > still a major change to default behaviour so this patch makes the policy
> > configurable and sets what I think is a sensible default.
> > 
> > The patches are on top of some NUMA balancing patches currently in -mm.
> > It's untested and posted to discuss patches 4 and 6.
> 
> It might be easier in dealing with -stable if we start with the
> critical fix(es) to restore sane functionality as much and as compact
> as possible and then place the cleanups on top?
> 
> In my local tree, I have the following as the first patch:

Updated version with your tmpfs __GFP_PAGECACHE parts added and
documentation, changelog updated as necessary.  I remain unconvinced
that tmpfs pages should be round-robined, but I agree with you that it
is the conservative change to do for 3.12 and 3.12 and we can figure
out the rest later.  I sure hope that this doesn't drive most people
on NUMA to disable pagecache interleaving right away as I expect most
tmpfs workloads to see little to no reclaim and prefer locality... :/

---
From: Johannes Weiner <hannes@cmpxchg.org>
Subject: [patch] mm: page_alloc: restrict fair allocator policy to pagecache

81c0a2bb515f ("mm: page_alloc: fair zone allocator policy") was merged
in order to ensure predictable pagecache replacement and to maximize
the cache's effectiveness of reducing IO regardless of zone or node
topology.

However, it was overzealous in round-robin placing every type of
allocation over all allowable nodes, instead of preferring locality,
which resulted in severe regressions on certain NUMA workloads that
have nothing to do with pagecache.

This patch drastically reduces the impact of the original change by
having the round-robin placement policy only apply to pagecache
allocations and no longer to anonymous memory, shmem, slab and other
types of kernel allocations.

This still changes the long-standing behavior of pagecache adhering to
the configured memory policy and preferring local allocations per
default, so make it configurable in case somebody relies on it.
However, we also expect the majority of users to prefer maximium cache
effectiveness and a predictable replacement behavior over memory
locality, so reflect this in the default setting of the sysctl.

No-signoff-without-Mel's
Cc: <stable@kernel.org> # 3.12
---
 Documentation/sysctl/vm.txt             | 20 ++++++++++++++++
 Documentation/vm/numa_memory_policy.txt |  7 ++++++
 include/linux/gfp.h                     |  4 +++-
 include/linux/pagemap.h                 |  2 +-
 include/linux/swap.h                    |  2 ++
 kernel/sysctl.c                         |  8 +++++++
 mm/filemap.c                            |  2 ++
 mm/page_alloc.c                         | 41 +++++++++++++++++++++++++--------
 mm/shmem.c                              | 14 +++++++++++
 9 files changed, 88 insertions(+), 12 deletions(-)

diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
index 1fbd4eb7b64a..308c342f62ad 100644
--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.txt
@@ -38,6 +38,7 @@ Currently, these files are in /proc/sys/vm:
 - memory_failure_early_kill
 - memory_failure_recovery
 - min_free_kbytes
+- pagecache_mempolicy_mode
 - min_slab_ratio
 - min_unmapped_ratio
 - mmap_min_addr
@@ -404,6 +405,25 @@ Setting this too high will OOM your machine instantly.
 
 =============================================================
 
+pagecache_mempolicy_mode:
+
+This is available only on NUMA kernels.
+
+Per default, pagecache is allocated in an interleaving fashion over
+all allowed nodes (hardbindings and zone_reclaim_mode excluded),
+regardless of the selected memory policy.
+
+The assumption is that, when it comes to pagecache, users generally
+prefer predictable replacement behavior regardless of NUMA topology
+and maximizing the cache's effectiveness in reducing IO over memory
+locality.
+
+This behavior can be changed by enabling pagecache_mempolicy_mode, in
+which case page cache allocations will be placed according to the
+configured memory policy (Documentation/vm/numa_memory_policy.txt).
+
+=============================================================
+
 min_slab_ratio:
 
 This is available only on NUMA kernels.
diff --git a/Documentation/vm/numa_memory_policy.txt b/Documentation/vm/numa_memory_policy.txt
index 4e7da6543424..72247e565908 100644
--- a/Documentation/vm/numa_memory_policy.txt
+++ b/Documentation/vm/numa_memory_policy.txt
@@ -16,6 +16,13 @@ programming interface that a NUMA-aware application can take advantage of.  When
 both cpusets and policies are applied to a task, the restrictions of the cpuset
 takes priority.  See "MEMORY POLICIES AND CPUSETS" below for more details.
 
+Note that, per default, the memory policies do not apply to pagecache.  Instead
+it will be interleaved fairly over all allowable nodes (respecting hardbindings
+and zone_reclaim_mode) in order to maximize the cache's effectiveness in
+reducing IO and to ensure predictable cache replacement.  Special setups that
+require pagecache to adhere to the configured memory policy can change this
+behavior by enabling pagecache_mempolicy_mode (see Documentation/sysctl/vm.txt).
+
 MEMORY POLICY CONCEPTS
 
 Scope of Memory Policies
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 9b4dd491f7e8..f69e4cb78ccf 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -35,6 +35,7 @@ struct vm_area_struct;
 #define ___GFP_NO_KSWAPD	0x400000u
 #define ___GFP_OTHER_NODE	0x800000u
 #define ___GFP_WRITE		0x1000000u
+#define ___GFP_PAGECACHE	0x2000000u
 /* If the above are modified, __GFP_BITS_SHIFT may need updating */
 
 /*
@@ -92,6 +93,7 @@ struct vm_area_struct;
 #define __GFP_OTHER_NODE ((__force gfp_t)___GFP_OTHER_NODE) /* On behalf of other node */
 #define __GFP_KMEMCG	((__force gfp_t)___GFP_KMEMCG) /* Allocation comes from a memcg-accounted resource */
 #define __GFP_WRITE	((__force gfp_t)___GFP_WRITE)	/* Allocator intends to dirty page */
+#define __GFP_PAGECACHE ((__force gfp_t)___GFP_PAGECACHE)   /* Page cache allocation */
 
 /*
  * This may seem redundant, but it's a way of annotating false positives vs.
@@ -99,7 +101,7 @@ struct vm_area_struct;
  */
 #define __GFP_NOTRACK_FALSE_POSITIVE (__GFP_NOTRACK)
 
-#define __GFP_BITS_SHIFT 25	/* Room for N __GFP_FOO bits */
+#define __GFP_BITS_SHIFT 26	/* Room for N __GFP_FOO bits */
 #define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))
 
 /* This equals 0, but use constants in case they ever change */
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index e3dea75a078b..bda48453af8e 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -221,7 +221,7 @@ extern struct page *__page_cache_alloc(gfp_t gfp);
 #else
 static inline struct page *__page_cache_alloc(gfp_t gfp)
 {
-	return alloc_pages(gfp, 0);
+	return alloc_pages(gfp | __GFP_PAGECACHE, 0);
 }
 #endif
 
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 46ba0c6c219f..3458994b0881 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -320,11 +320,13 @@ extern unsigned long vm_total_pages;
 
 #ifdef CONFIG_NUMA
 extern int zone_reclaim_mode;
+extern int pagecache_mempolicy_mode;
 extern int sysctl_min_unmapped_ratio;
 extern int sysctl_min_slab_ratio;
 extern int zone_reclaim(struct zone *, gfp_t, unsigned int);
 #else
 #define zone_reclaim_mode 0
+#define pagecache_mempolicy_mode 0
 static inline int zone_reclaim(struct zone *z, gfp_t mask, unsigned int order)
 {
 	return 0;
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 34a604726d0b..a8c56c1dc98e 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1359,6 +1359,14 @@ static struct ctl_table vm_table[] = {
 		.extra1		= &zero,
 	},
 	{
+		.procname	= "pagecache_mempolicy_mode",
+		.data		= &pagecache_mempolicy_mode,
+		.maxlen		= sizeof(pagecache_mempolicy_mode),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec,
+		.extra1		= &zero,
+	},
+	{
 		.procname	= "min_unmapped_ratio",
 		.data		= &sysctl_min_unmapped_ratio,
 		.maxlen		= sizeof(sysctl_min_unmapped_ratio),
diff --git a/mm/filemap.c b/mm/filemap.c
index b7749a92021c..5bb922506906 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -517,6 +517,8 @@ struct page *__page_cache_alloc(gfp_t gfp)
 	int n;
 	struct page *page;
 
+	gfp |= __GFP_PAGECACHE;
+
 	if (cpuset_do_page_mem_spread()) {
 		unsigned int cpuset_mems_cookie;
 		do {
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 580a5f075ed0..f7c0ecb5bb8b 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1547,7 +1547,15 @@ again:
 					  get_pageblock_migratetype(page));
 	}
 
+	/*
+	 * All allocations eat into the round-robin batch, even
+	 * allocations that are not subject to round-robin placement
+	 * themselves.  This makes sure that allocations that ARE
+	 * subject to round-robin placement compensate for the
+	 * allocations that aren't, to have equal placement overall.
+	 */
 	__mod_zone_page_state(zone, NR_ALLOC_BATCH, -(1 << order));
+
 	__count_zone_vm_events(PGALLOC, zone, 1 << order);
 	zone_statistics(preferred_zone, zone, gfp_flags);
 	local_irq_restore(flags);
@@ -1699,6 +1707,15 @@ bool zone_watermark_ok_safe(struct zone *z, int order, unsigned long mark,
 
 #ifdef CONFIG_NUMA
 /*
+ * pagecache_mempolicy_mode - whether pagecache allocations should
+ * honor the configured memory policy and allocate from the zonelist
+ * in order of preference, or whether they should interleave fairly
+ * over all allowed zones in the given zonelist to maximize cache
+ * effects and ensure predictable cache replacement.
+ */
+int pagecache_mempolicy_mode __read_mostly;
+
+/*
  * zlc_setup - Setup for "zonelist cache".  Uses cached zone data to
  * skip over zones that are not allowed by the cpuset, or that have
  * been recently (in last second) found to be nearly full.  See further
@@ -1816,7 +1833,7 @@ static void zlc_clear_zones_full(struct zonelist *zonelist)
 
 static bool zone_local(struct zone *local_zone, struct zone *zone)
 {
-	return node_distance(local_zone->node, zone->node) == LOCAL_DISTANCE;
+	return local_zone->node == zone->node;
 }
 
 static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone)
@@ -1908,22 +1925,25 @@ zonelist_scan:
 		if (unlikely(alloc_flags & ALLOC_NO_WATERMARKS))
 			goto try_this_zone;
 		/*
-		 * Distribute pages in proportion to the individual
-		 * zone size to ensure fair page aging.  The zone a
-		 * page was allocated in should have no effect on the
-		 * time the page has in memory before being reclaimed.
+		 * Distribute pagecache pages in proportion to the
+		 * individual zone size to ensure fair page aging.
+		 * The zone a page was allocated in should have no
+		 * effect on the time the page has in memory before
+		 * being reclaimed.
 		 *
-		 * When zone_reclaim_mode is enabled, try to stay in
-		 * local zones in the fastpath.  If that fails, the
+		 * When pagecache_mempolicy_mode or zone_reclaim_mode
+		 * is enabled, try to allocate from zones within the
+		 * preferred node in the fastpath.  If that fails, the
 		 * slowpath is entered, which will do another pass
 		 * starting with the local zones, but ultimately fall
 		 * back to remote zones that do not partake in the
 		 * fairness round-robin cycle of this zonelist.
 		 */
-		if (alloc_flags & ALLOC_WMARK_LOW) {
+		if ((alloc_flags & ALLOC_WMARK_LOW) &&
+		    (gfp_mask & __GFP_PAGECACHE)) {
 			if (zone_page_state(zone, NR_ALLOC_BATCH) <= 0)
 				continue;
-			if (zone_reclaim_mode &&
+			if ((zone_reclaim_mode || pagecache_mempolicy_mode) &&
 			    !zone_local(preferred_zone, zone))
 				continue;
 		}
@@ -2390,7 +2410,8 @@ static void prepare_slowpath(gfp_t gfp_mask, unsigned int order,
 		 * thrash fairness information for zones that are not
 		 * actually part of this zonelist's round-robin cycle.
 		 */
-		if (zone_reclaim_mode && !zone_local(preferred_zone, zone))
+		if ((zone_reclaim_mode || pagecache_mempolicy_mode) &&
+		    !zone_local(preferred_zone, zone))
 			continue;
 		mod_zone_page_state(zone, NR_ALLOC_BATCH,
 				    high_wmark_pages(zone) -
diff --git a/mm/shmem.c b/mm/shmem.c
index 8297623fcaed..02d7a9c03463 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -929,6 +929,17 @@ static struct page *shmem_swapin(swp_entry_t swap, gfp_t gfp,
 	return page;
 }
 
+/* Fugly method of distinguishing sysv/MAP_SHARED anon from tmpfs */
+static bool shmem_inode_on_tmpfs(struct shmem_inode_info *info)
+{
+	/* If no internal shm_mount then it must be tmpfs */
+	if (IS_ERR(shm_mnt))
+		return true;
+
+	/* Consider it to be tmpfs if the superblock is not the internal mount */
+	return info->vfs_inode.i_sb != shm_mnt->mnt_sb;
+}
+
 static struct page *shmem_alloc_page(gfp_t gfp,
 			struct shmem_inode_info *info, pgoff_t index)
 {
@@ -942,6 +953,9 @@ static struct page *shmem_alloc_page(gfp_t gfp,
 	pvma.vm_ops = NULL;
 	pvma.vm_policy = mpol_shared_policy_lookup(&info->policy, index);
 
+	if (shmem_inode_on_tmpfs(info))
+		gfp |= __GFP_PAGECACHE;
+
 	page = alloc_page_vma(gfp, &pvma, 0);
 
 	/* Drop reference taken by mpol_shared_policy_lookup() */
-- 
1.8.4.2

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH 0/6] Configurable fair allocation zone policy v3
  2013-12-18  6:17   ` Johannes Weiner
@ 2013-12-18 13:47     ` Rik van Riel
  2013-12-18 14:17       ` Johannes Weiner
  2013-12-18 15:00     ` Mel Gorman
  1 sibling, 1 reply; 21+ messages in thread
From: Rik van Riel @ 2013-12-18 13:47 UTC (permalink / raw)
  To: Johannes Weiner; +Cc: Mel Gorman, Andrew Morton, Dave Hansen, Linux-MM, LKML

On 12/18/2013 01:17 AM, Johannes Weiner wrote:

> Updated version with your tmpfs __GFP_PAGECACHE parts added and
> documentation, changelog updated as necessary.  I remain unconvinced
> that tmpfs pages should be round-robined, but I agree with you that it
> is the conservative change to do for 3.12 and 3.12 and we can figure
> out the rest later.  I sure hope that this doesn't drive most people
> on NUMA to disable pagecache interleaving right away as I expect most
> tmpfs workloads to see little to no reclaim and prefer locality... :/

Actually, I suspect most tmpfs heavy workloads will be things like
databases with shared memory segments. Those tend to benefit from
having all of the system's memory bandwidth available. The worker
threads/processes tend to live all over the system, too...

-- 
All rights reversed

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH 0/6] Configurable fair allocation zone policy v3
  2013-12-18 13:47     ` Rik van Riel
@ 2013-12-18 14:17       ` Johannes Weiner
  0 siblings, 0 replies; 21+ messages in thread
From: Johannes Weiner @ 2013-12-18 14:17 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Mel Gorman, Andrew Morton, Dave Hansen, Linux-MM, LKML

On Wed, Dec 18, 2013 at 08:47:45AM -0500, Rik van Riel wrote:
> On 12/18/2013 01:17 AM, Johannes Weiner wrote:
> 
> > Updated version with your tmpfs __GFP_PAGECACHE parts added and
> > documentation, changelog updated as necessary.  I remain unconvinced
> > that tmpfs pages should be round-robined, but I agree with you that it
> > is the conservative change to do for 3.12 and 3.12 and we can figure
> > out the rest later.  I sure hope that this doesn't drive most people
> > on NUMA to disable pagecache interleaving right away as I expect most
> > tmpfs workloads to see little to no reclaim and prefer locality... :/
> 
> Actually, I suspect most tmpfs heavy workloads will be things like
> databases with shared memory segments. Those tend to benefit from
> having all of the system's memory bandwidth available. The worker
> threads/processes tend to live all over the system, too...

Shared memory segments are explicitely excluded from the interleaving,
though.  The distinction is between the internal tmpfs mount that sysv
shmem uses (mempolicy) and tmpfs mounts that use the actual filesystem
interface (pagecache interleave).

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH 0/6] Configurable fair allocation zone policy v3
  2013-12-18  6:17   ` Johannes Weiner
  2013-12-18 13:47     ` Rik van Riel
@ 2013-12-18 15:00     ` Mel Gorman
  2013-12-18 16:09       ` Mel Gorman
  2013-12-18 19:48       ` Johannes Weiner
  1 sibling, 2 replies; 21+ messages in thread
From: Mel Gorman @ 2013-12-18 15:00 UTC (permalink / raw)
  To: Johannes Weiner; +Cc: Andrew Morton, Dave Hansen, Rik van Riel, Linux-MM, LKML

On Wed, Dec 18, 2013 at 01:17:50AM -0500, Johannes Weiner wrote:
> On Tue, Dec 17, 2013 at 03:02:10PM -0500, Johannes Weiner wrote:
> > Hi Mel,
> > 
> > On Tue, Dec 17, 2013 at 04:48:18PM +0000, Mel Gorman wrote:
> > > This series is currently untested and is being posted to sync up discussions
> > > on the treatment of page cache pages, particularly the sysv part. I have
> > > not thought it through in detail but postings patches is the easiest way
> > > to highlight where I think a problem might be.
> > >
> > > Changelog since v2
> > > o Drop an accounting patch, behaviour is deliberate
> > > o Special case tmpfs and shmem pages for discussion
> > > 
> > > Changelog since v1
> > > o Fix lot of brain damage in the configurable policy patch
> > > o Yoink a page cache annotation patch
> > > o Only account batch pages against allocations eligible for the fair policy
> > > o Add patch that default distributes file pages on remote nodes
> > > 
> > > Commit 81c0a2bb ("mm: page_alloc: fair zone allocator policy") solved a
> > > bug whereby new pages could be reclaimed before old pages because of how
> > > the page allocator and kswapd interacted on the per-zone LRU lists.
> > 
> > Not just that, it was about ensuring predictable cache replacement and
> > maximizing the cache's effectiveness.  This implicitely fixed the
> > kswapd interaction bug, but that was not the sole reason (I realize
> > that the original changelog is incomplete and I apologize for that).
> > 
> > I have had offline discussions with Andrea back then and his first
> > suggestion was too to make this a zone fairness placement that is
> > exclusive to the local node, but eventually he agreed that the problem
> > applies just as much on the global level and that we should apply
> > fairness throughout the system as long as we honor zone_reclaim_mode
> > and hard bindings.  During our discussions now, it turned out that
> > zone_reclaim_mode is a terrible predictor for preferred locality, but
> > we also more or less agreed that the locality issues in the first
> > place are not really applicable to cache loads dominated by IO cost.
> > 
> > So I think the main discrepancy between the original patch and what we
> > truly want is that aging fairness is really only relevant for actual
> > cache backed by secondary storage, because cache replacement is an
> > ongoing operation that involves IO.  As opposed to memory types that
> > involve IO only in extreme cases (anon, tmpfs, shmem) or no IO at all
> > (slab, kernel allocations), in which case we prefer NUMA locality.
> > 
> > > Unfortunately a side-effect missed during review was that it's now very
> > > easy to allocate remote memory on NUMA machines. The problem is that
> > > it is not a simple case of just restoring local allocation policies as
> > > there are genuine reasons why global page aging may be prefereable. It's
> > > still a major change to default behaviour so this patch makes the policy
> > > configurable and sets what I think is a sensible default.
> > > 
> > > The patches are on top of some NUMA balancing patches currently in -mm.
> > > It's untested and posted to discuss patches 4 and 6.
> > 
> > It might be easier in dealing with -stable if we start with the
> > critical fix(es) to restore sane functionality as much and as compact
> > as possible and then place the cleanups on top?
> > 
> > In my local tree, I have the following as the first patch:
> 
> Updated version with your tmpfs __GFP_PAGECACHE parts added and
> documentation, changelog updated as necessary.  I remain unconvinced
> that tmpfs pages should be round-robined, but I agree with you that it
> is the conservative change to do for 3.12 and 3.12 and we can figure
> out the rest later. 

Assume you with 3.12 and 3.13 here.

> I sure hope that this doesn't drive most people
> on NUMA to disable pagecache interleaving right away as I expect most
> tmpfs workloads to see little to no reclaim and prefer locality... :/
> 

I hope you're right but I expect the experience will be like
zone_reclaim_mode. We're going to be looking out for bug reports that
are "fixed" by disabling pagecache locality and pushing back on them by
fixing the real problem.

This was the experience with zone_reclaim_mode when it started going
wrong. It was also the experience with THP for a very long time.
Disabling THP was a workaround for all sorts of problems and it was very
important to fix them and push back on anyone documenting disabling THP
as a standard workaround.

> ---
> From: Johannes Weiner <hannes@cmpxchg.org>
> Subject: [patch] mm: page_alloc: restrict fair allocator policy to pagecache
> 

Monolithic patch with multiple changes but meh. I'm not pushed because I
know what the breakout looks like. FWIW, I had intended the entire of my
broken-out series for 3.12 and 3.13 once it got ironed out. I find the
series easier to understand but of course I would.

> 81c0a2bb515f ("mm: page_alloc: fair zone allocator policy") was merged
> in order to ensure predictable pagecache replacement and to maximize
> the cache's effectiveness of reducing IO regardless of zone or node
> topology.
> 
> However, it was overzealous in round-robin placing every type of
> allocation over all allowable nodes, instead of preferring locality,
> which resulted in severe regressions on certain NUMA workloads that
> have nothing to do with pagecache.
> 
> This patch drastically reduces the impact of the original change by
> having the round-robin placement policy only apply to pagecache
> allocations and no longer to anonymous memory, shmem, slab and other
> types of kernel allocations.
> 
> This still changes the long-standing behavior of pagecache adhering to
> the configured memory policy and preferring local allocations per
> default, so make it configurable in case somebody relies on it.
> However, we also expect the majority of users to prefer maximium cache
> effectiveness and a predictable replacement behavior over memory
> locality, so reflect this in the default setting of the sysctl.
> 
> No-signoff-without-Mel's
> Cc: <stable@kernel.org> # 3.12
> ---
>  Documentation/sysctl/vm.txt             | 20 ++++++++++++++++
>  Documentation/vm/numa_memory_policy.txt |  7 ++++++
>  include/linux/gfp.h                     |  4 +++-
>  include/linux/pagemap.h                 |  2 +-
>  include/linux/swap.h                    |  2 ++
>  kernel/sysctl.c                         |  8 +++++++
>  mm/filemap.c                            |  2 ++
>  mm/page_alloc.c                         | 41 +++++++++++++++++++++++++--------
>  mm/shmem.c                              | 14 +++++++++++
>  9 files changed, 88 insertions(+), 12 deletions(-)
> 
> diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
> index 1fbd4eb7b64a..308c342f62ad 100644
> --- a/Documentation/sysctl/vm.txt
> +++ b/Documentation/sysctl/vm.txt
> @@ -38,6 +38,7 @@ Currently, these files are in /proc/sys/vm:
>  - memory_failure_early_kill
>  - memory_failure_recovery
>  - min_free_kbytes
> +- pagecache_mempolicy_mode
>  - min_slab_ratio
>  - min_unmapped_ratio
>  - mmap_min_addr

Sure about the name?

This is a boolean and "mode" implies it might be a bitmask. That said, I
recognise that my own naming also sucked because complaining about yours
I can see that mine also sucks.

> @@ -404,6 +405,25 @@ Setting this too high will OOM your machine instantly.
>  
>  =============================================================
>  
> +pagecache_mempolicy_mode:
> +
> +This is available only on NUMA kernels.
> +
> +Per default, pagecache is allocated in an interleaving fashion over
> +all allowed nodes (hardbindings and zone_reclaim_mode excluded),
> +regardless of the selected memory policy.
> +
> +The assumption is that, when it comes to pagecache, users generally
> +prefer predictable replacement behavior regardless of NUMA topology
> +and maximizing the cache's effectiveness in reducing IO over memory
> +locality.
> +
> +This behavior can be changed by enabling pagecache_mempolicy_mode, in
> +which case page cache allocations will be placed according to the
> +configured memory policy (Documentation/vm/numa_memory_policy.txt).
> +

Ok this indicates that pagecache will still be interleaved on zones local
to the node the process is allocating on. Good because that preserves a
very important aspect of your original patch.

The current description feels a little backwards though -- "Enable this
to *not* interleave pagecache". This documented behaviour says to me
that pagecache_obey_mempolicy might be a better name if enabling it uses
the system default memory policy.  However, even that might put us in a
corner. Ultimately we want this to be controllable on a per-process basis
using memory policies.

Merging what I have in v3, unreleased v4 and this thing I ended up with
this. The observation about cpusets was raised by Michal Hocko on IRC.

---8<---
mpol_interleave_files

This is available only on NUMA kernels.

Historically, the default behaviour of the system is to allocate memory
local to the process. The behaviour was usually modified through the use
of memory policies while zone_reclaim_mode controls how strict the local
memory allocation policy is.

Issues arise when the allocating process is frequently running on the same
node. The kernels memory reclaim daemon runs one instance per NUMA node.
A consequence is that relatively new memory may be reclaimed by kswapd when
the allocating process is running on a specific node. The user-visible
impact is that the system appears to do more IO than necessary when a
workload is accessing files that are larger than a given NUMA node.

To address this problem, the default system memory policy is modified by
this tunable.

When this tunable is enabled, the system default memory policy will
interleave batches of file-backed pages over all allowed zones and nodes.
The assumption is that, when it comes to file pages that users generally
prefer predictable replacement behavior regardless of NUMA topology and
maximizing the page cache's effectiveness in reducing IO over memory
locality.

The tunable zone_reclaim_mode overrides this and enabling zone_reclaim_mode
functionally disables mpol_interleave_pagecache.

A process running within a memory cpuset will obey the cpuset policy and
ignore mpol_interleave_files.

At the time of writing, this parameter cannot be overridden by a process
using set_mempolicy to set the task memory policy. Similarly, numactl
setting the task memory policy will not override this setting. This may
change in the future.

The tunable is default enabled and has two recognised parameters;

0: Use the MPOL_LOCAL policy as the system-wide default
1: Batch interleave file-backed allocations over all allowed nodes

One enabled, the downside is that some file accesses will now be to remote
memory even though the local node had available resources. This will hurt
workloads with small or short lived files that fit easily within one node.
The upside is that workloads working on files larger than a NUMA node will
not reclaim active pages prematurely.
---8<---

> +=============================================================
> +
>  min_slab_ratio:
>  
>  This is available only on NUMA kernels.
> diff --git a/Documentation/vm/numa_memory_policy.txt b/Documentation/vm/numa_memory_policy.txt
> index 4e7da6543424..72247e565908 100644
> --- a/Documentation/vm/numa_memory_policy.txt
> +++ b/Documentation/vm/numa_memory_policy.txt
> @@ -16,6 +16,13 @@ programming interface that a NUMA-aware application can take advantage of.  When
>  both cpusets and policies are applied to a task, the restrictions of the cpuset
>  takes priority.  See "MEMORY POLICIES AND CPUSETS" below for more details.
>  
> +Note that, per default, the memory policies do not apply to pagecache.  Instead
> +it will be interleaved fairly over all allowable nodes (respecting hardbindings
> +and zone_reclaim_mode) in order to maximize the cache's effectiveness in
> +reducing IO and to ensure predictable cache replacement.  Special setups that
> +require pagecache to adhere to the configured memory policy can change this
> +behavior by enabling pagecache_mempolicy_mode (see Documentation/sysctl/vm.txt).
> +

Manual pages should also be updated.

>  MEMORY POLICY CONCEPTS
>  
>  Scope of Memory Policies
> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index 9b4dd491f7e8..f69e4cb78ccf 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -35,6 +35,7 @@ struct vm_area_struct;
>  #define ___GFP_NO_KSWAPD	0x400000u
>  #define ___GFP_OTHER_NODE	0x800000u
>  #define ___GFP_WRITE		0x1000000u
> +#define ___GFP_PAGECACHE	0x2000000u
>  /* If the above are modified, __GFP_BITS_SHIFT may need updating */
>  
>  /*
> @@ -92,6 +93,7 @@ struct vm_area_struct;
>  #define __GFP_OTHER_NODE ((__force gfp_t)___GFP_OTHER_NODE) /* On behalf of other node */
>  #define __GFP_KMEMCG	((__force gfp_t)___GFP_KMEMCG) /* Allocation comes from a memcg-accounted resource */
>  #define __GFP_WRITE	((__force gfp_t)___GFP_WRITE)	/* Allocator intends to dirty page */
> +#define __GFP_PAGECACHE ((__force gfp_t)___GFP_PAGECACHE)   /* Page cache allocation */
>  
>  /*
>   * This may seem redundant, but it's a way of annotating false positives vs.
> @@ -99,7 +101,7 @@ struct vm_area_struct;
>   */
>  #define __GFP_NOTRACK_FALSE_POSITIVE (__GFP_NOTRACK)
>  
> -#define __GFP_BITS_SHIFT 25	/* Room for N __GFP_FOO bits */
> +#define __GFP_BITS_SHIFT 26	/* Room for N __GFP_FOO bits */
>  #define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))
>  
>  /* This equals 0, but use constants in case they ever change */
> diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> index e3dea75a078b..bda48453af8e 100644
> --- a/include/linux/pagemap.h
> +++ b/include/linux/pagemap.h
> @@ -221,7 +221,7 @@ extern struct page *__page_cache_alloc(gfp_t gfp);
>  #else
>  static inline struct page *__page_cache_alloc(gfp_t gfp)
>  {
> -	return alloc_pages(gfp, 0);
> +	return alloc_pages(gfp | __GFP_PAGECACHE, 0);
>  }
>  #endif
>  
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 46ba0c6c219f..3458994b0881 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -320,11 +320,13 @@ extern unsigned long vm_total_pages;
>  
>  #ifdef CONFIG_NUMA
>  extern int zone_reclaim_mode;
> +extern int pagecache_mempolicy_mode;
>  extern int sysctl_min_unmapped_ratio;
>  extern int sysctl_min_slab_ratio;
>  extern int zone_reclaim(struct zone *, gfp_t, unsigned int);
>  #else
>  #define zone_reclaim_mode 0
> +#define pagecache_mempolicy_mode 0
>  static inline int zone_reclaim(struct zone *z, gfp_t mask, unsigned int order)
>  {
>  	return 0;
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index 34a604726d0b..a8c56c1dc98e 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -1359,6 +1359,14 @@ static struct ctl_table vm_table[] = {
>  		.extra1		= &zero,
>  	},
>  	{
> +		.procname	= "pagecache_mempolicy_mode",
> +		.data		= &pagecache_mempolicy_mode,
> +		.maxlen		= sizeof(pagecache_mempolicy_mode),
> +		.mode		= 0644,
> +		.proc_handler	= proc_dointvec,
> +		.extra1		= &zero,
> +	},
> +	{
>  		.procname	= "min_unmapped_ratio",
>  		.data		= &sysctl_min_unmapped_ratio,
>  		.maxlen		= sizeof(sysctl_min_unmapped_ratio),
> diff --git a/mm/filemap.c b/mm/filemap.c
> index b7749a92021c..5bb922506906 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -517,6 +517,8 @@ struct page *__page_cache_alloc(gfp_t gfp)
>  	int n;
>  	struct page *page;
>  
> +	gfp |= __GFP_PAGECACHE;
> +
>  	if (cpuset_do_page_mem_spread()) {
>  		unsigned int cpuset_mems_cookie;
>  		do {
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 580a5f075ed0..f7c0ecb5bb8b 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1547,7 +1547,15 @@ again:
>  					  get_pageblock_migratetype(page));
>  	}
>  
> +	/*
> +	 * All allocations eat into the round-robin batch, even
> +	 * allocations that are not subject to round-robin placement
> +	 * themselves.  This makes sure that allocations that ARE
> +	 * subject to round-robin placement compensate for the
> +	 * allocations that aren't, to have equal placement overall.
> +	 */
>  	__mod_zone_page_state(zone, NR_ALLOC_BATCH, -(1 << order));
> +
>  	__count_zone_vm_events(PGALLOC, zone, 1 << order);
>  	zone_statistics(preferred_zone, zone, gfp_flags);
>  	local_irq_restore(flags);

Thanks.

> @@ -1699,6 +1707,15 @@ bool zone_watermark_ok_safe(struct zone *z, int order, unsigned long mark,
>  
>  #ifdef CONFIG_NUMA
>  /*
> + * pagecache_mempolicy_mode - whether pagecache allocations should
> + * honor the configured memory policy and allocate from the zonelist
> + * in order of preference, or whether they should interleave fairly
> + * over all allowed zones in the given zonelist to maximize cache
> + * effects and ensure predictable cache replacement.
> + */
> +int pagecache_mempolicy_mode __read_mostly;
> +
> +/*
>   * zlc_setup - Setup for "zonelist cache".  Uses cached zone data to
>   * skip over zones that are not allowed by the cpuset, or that have
>   * been recently (in last second) found to be nearly full.  See further
> @@ -1816,7 +1833,7 @@ static void zlc_clear_zones_full(struct zonelist *zonelist)
>  
>  static bool zone_local(struct zone *local_zone, struct zone *zone)
>  {
> -	return node_distance(local_zone->node, zone->node) == LOCAL_DISTANCE;
> +	return local_zone->node == zone->node;
>  }

Does that not break on !CONFIG_NUMA?

It's why I used zone_to_nid

>  
>  static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone)
> @@ -1908,22 +1925,25 @@ zonelist_scan:
>  		if (unlikely(alloc_flags & ALLOC_NO_WATERMARKS))
>  			goto try_this_zone;
>  		/*
> -		 * Distribute pages in proportion to the individual
> -		 * zone size to ensure fair page aging.  The zone a
> -		 * page was allocated in should have no effect on the
> -		 * time the page has in memory before being reclaimed.
> +		 * Distribute pagecache pages in proportion to the
> +		 * individual zone size to ensure fair page aging.
> +		 * The zone a page was allocated in should have no
> +		 * effect on the time the page has in memory before
> +		 * being reclaimed.
>  		 *
> -		 * When zone_reclaim_mode is enabled, try to stay in
> -		 * local zones in the fastpath.  If that fails, the
> +		 * When pagecache_mempolicy_mode or zone_reclaim_mode
> +		 * is enabled, try to allocate from zones within the
> +		 * preferred node in the fastpath.  If that fails, the
>  		 * slowpath is entered, which will do another pass
>  		 * starting with the local zones, but ultimately fall
>  		 * back to remote zones that do not partake in the
>  		 * fairness round-robin cycle of this zonelist.
>  		 */
> -		if (alloc_flags & ALLOC_WMARK_LOW) {
> +		if ((alloc_flags & ALLOC_WMARK_LOW) &&
> +		    (gfp_mask & __GFP_PAGECACHE)) {
>  			if (zone_page_state(zone, NR_ALLOC_BATCH) <= 0)
>  				continue;

NR_ALLOC_BATCH is updated regardless of zone_reclaim_mode or
pagecache_mempolicy_mode. We only reset batch in the prepare_slowpath in
some cases. Looks a bit fishy even though I can't quite put my finger on it.

I also got details wrong here in the v3 of the series. In an unreleased
v4 of the series I had corrected the treatment of slab pages in line
with your wishes and reused the broken out helper in prepare_slowpath to
keep the decision in sync.

It's still in development but even if it gets rejected it'll act as a
comparison point to yours.

> -			if (zone_reclaim_mode &&
> +			if ((zone_reclaim_mode || pagecache_mempolicy_mode) &&
>  			    !zone_local(preferred_zone, zone))
>  				continue;
>  		}

Documention says "enabling pagecache_mempolicy_mode, in which case page cache
allocations will be placed according to the configured memory policy". Should
that be !pagecache_mempolicy_mode? I'm getting confused with the double nots.

Breaking this out would be more comprehensible.

On a semi-related note, we might encounter a problem later where the
interleaving causes us to skip over usable zones and zones with available
batches are !zone_dirty_ok. We'd fall back to the slowpatch resetting the
batches so it will not be particularly visible but there might be some
interactions there.

> @@ -2390,7 +2410,8 @@ static void prepare_slowpath(gfp_t gfp_mask, unsigned int order,
>  		 * thrash fairness information for zones that are not
>  		 * actually part of this zonelist's round-robin cycle.
>  		 */
> -		if (zone_reclaim_mode && !zone_local(preferred_zone, zone))
> +		if ((zone_reclaim_mode || pagecache_mempolicy_mode) &&
> +		    !zone_local(preferred_zone, zone))
>  			continue;
>  		mod_zone_page_state(zone, NR_ALLOC_BATCH,
>  				    high_wmark_pages(zone) -
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 8297623fcaed..02d7a9c03463 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -929,6 +929,17 @@ static struct page *shmem_swapin(swp_entry_t swap, gfp_t gfp,
>  	return page;
>  }
>  
> +/* Fugly method of distinguishing sysv/MAP_SHARED anon from tmpfs */
> +static bool shmem_inode_on_tmpfs(struct shmem_inode_info *info)
> +{
> +	/* If no internal shm_mount then it must be tmpfs */
> +	if (IS_ERR(shm_mnt))
> +		return true;
> +
> +	/* Consider it to be tmpfs if the superblock is not the internal mount */
> +	return info->vfs_inode.i_sb != shm_mnt->mnt_sb;
> +}
> +
>  static struct page *shmem_alloc_page(gfp_t gfp,
>  			struct shmem_inode_info *info, pgoff_t index)
>  {
> @@ -942,6 +953,9 @@ static struct page *shmem_alloc_page(gfp_t gfp,
>  	pvma.vm_ops = NULL;
>  	pvma.vm_policy = mpol_shared_policy_lookup(&info->policy, index);
>  
> +	if (shmem_inode_on_tmpfs(info))
> +		gfp |= __GFP_PAGECACHE;
> +
>  	page = alloc_page_vma(gfp, &pvma, 0);
>  
>  	/* Drop reference taken by mpol_shared_policy_lookup() */

For what it's worth, this is what I've currently kicked off testes for

git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux-balancenuma.git mm-pgalloc-interleave-zones-v4r12

-- 
Mel Gorman
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH 0/6] Configurable fair allocation zone policy v3
  2013-12-18 15:00     ` Mel Gorman
@ 2013-12-18 16:09       ` Mel Gorman
  2013-12-18 19:48       ` Johannes Weiner
  1 sibling, 0 replies; 21+ messages in thread
From: Mel Gorman @ 2013-12-18 16:09 UTC (permalink / raw)
  To: Johannes Weiner; +Cc: Andrew Morton, Dave Hansen, Rik van Riel, Linux-MM, LKML

On Wed, Dec 18, 2013 at 03:00:38PM +0000, Mel Gorman wrote:
> 
> For what it's worth, this is what I've currently kicked off testes for
> 
> git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux-balancenuma.git mm-pgalloc-interleave-zones-v4r12
> 

Pushed a dirty tree by accident. Now mm-pgalloc-interleave-zones-v4r13

-- 
Mel Gorman
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH 0/6] Configurable fair allocation zone policy v3
  2013-12-18 15:00     ` Mel Gorman
  2013-12-18 16:09       ` Mel Gorman
@ 2013-12-18 19:48       ` Johannes Weiner
  2013-12-19 11:20         ` Mel Gorman
  1 sibling, 1 reply; 21+ messages in thread
From: Johannes Weiner @ 2013-12-18 19:48 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Andrew Morton, Dave Hansen, Rik van Riel, Linux-MM, LKML

On Wed, Dec 18, 2013 at 03:00:38PM +0000, Mel Gorman wrote:
> On Wed, Dec 18, 2013 at 01:17:50AM -0500, Johannes Weiner wrote:
> > On Tue, Dec 17, 2013 at 03:02:10PM -0500, Johannes Weiner wrote:
> > > Hi Mel,
> > > 
> > > On Tue, Dec 17, 2013 at 04:48:18PM +0000, Mel Gorman wrote:
> > > > This series is currently untested and is being posted to sync up discussions
> > > > on the treatment of page cache pages, particularly the sysv part. I have
> > > > not thought it through in detail but postings patches is the easiest way
> > > > to highlight where I think a problem might be.
> > > >
> > > > Changelog since v2
> > > > o Drop an accounting patch, behaviour is deliberate
> > > > o Special case tmpfs and shmem pages for discussion
> > > > 
> > > > Changelog since v1
> > > > o Fix lot of brain damage in the configurable policy patch
> > > > o Yoink a page cache annotation patch
> > > > o Only account batch pages against allocations eligible for the fair policy
> > > > o Add patch that default distributes file pages on remote nodes
> > > > 
> > > > Commit 81c0a2bb ("mm: page_alloc: fair zone allocator policy") solved a
> > > > bug whereby new pages could be reclaimed before old pages because of how
> > > > the page allocator and kswapd interacted on the per-zone LRU lists.
> > > 
> > > Not just that, it was about ensuring predictable cache replacement and
> > > maximizing the cache's effectiveness.  This implicitely fixed the
> > > kswapd interaction bug, but that was not the sole reason (I realize
> > > that the original changelog is incomplete and I apologize for that).
> > > 
> > > I have had offline discussions with Andrea back then and his first
> > > suggestion was too to make this a zone fairness placement that is
> > > exclusive to the local node, but eventually he agreed that the problem
> > > applies just as much on the global level and that we should apply
> > > fairness throughout the system as long as we honor zone_reclaim_mode
> > > and hard bindings.  During our discussions now, it turned out that
> > > zone_reclaim_mode is a terrible predictor for preferred locality, but
> > > we also more or less agreed that the locality issues in the first
> > > place are not really applicable to cache loads dominated by IO cost.
> > > 
> > > So I think the main discrepancy between the original patch and what we
> > > truly want is that aging fairness is really only relevant for actual
> > > cache backed by secondary storage, because cache replacement is an
> > > ongoing operation that involves IO.  As opposed to memory types that
> > > involve IO only in extreme cases (anon, tmpfs, shmem) or no IO at all
> > > (slab, kernel allocations), in which case we prefer NUMA locality.
> > > 
> > > > Unfortunately a side-effect missed during review was that it's now very
> > > > easy to allocate remote memory on NUMA machines. The problem is that
> > > > it is not a simple case of just restoring local allocation policies as
> > > > there are genuine reasons why global page aging may be prefereable. It's
> > > > still a major change to default behaviour so this patch makes the policy
> > > > configurable and sets what I think is a sensible default.
> > > > 
> > > > The patches are on top of some NUMA balancing patches currently in -mm.
> > > > It's untested and posted to discuss patches 4 and 6.
> > > 
> > > It might be easier in dealing with -stable if we start with the
> > > critical fix(es) to restore sane functionality as much and as compact
> > > as possible and then place the cleanups on top?
> > > 
> > > In my local tree, I have the following as the first patch:
> > 
> > Updated version with your tmpfs __GFP_PAGECACHE parts added and
> > documentation, changelog updated as necessary.  I remain unconvinced
> > that tmpfs pages should be round-robined, but I agree with you that it
> > is the conservative change to do for 3.12 and 3.12 and we can figure
> > out the rest later. 
> 
> Assume you with 3.12 and 3.13 here.

Yes :)

> > ---
> > From: Johannes Weiner <hannes@cmpxchg.org>
> > Subject: [patch] mm: page_alloc: restrict fair allocator policy to pagecache
> > 
> 
> Monolithic patch with multiple changes but meh. I'm not pushed because I
> know what the breakout looks like. FWIW, I had intended the entire of my
> broken-out series for 3.12 and 3.13 once it got ironed out. I find the
> series easier to understand but of course I would.

And of course I can live without the cleanups to make code I wrote
more readable ;-) I'm happy to defer on this, let's keep logical
changes separated.

> > 81c0a2bb515f ("mm: page_alloc: fair zone allocator policy") was merged
> > in order to ensure predictable pagecache replacement and to maximize
> > the cache's effectiveness of reducing IO regardless of zone or node
> > topology.
> > 
> > However, it was overzealous in round-robin placing every type of
> > allocation over all allowable nodes, instead of preferring locality,
> > which resulted in severe regressions on certain NUMA workloads that
> > have nothing to do with pagecache.
> > 
> > This patch drastically reduces the impact of the original change by
> > having the round-robin placement policy only apply to pagecache
> > allocations and no longer to anonymous memory, shmem, slab and other
> > types of kernel allocations.
> > 
> > This still changes the long-standing behavior of pagecache adhering to
> > the configured memory policy and preferring local allocations per
> > default, so make it configurable in case somebody relies on it.
> > However, we also expect the majority of users to prefer maximium cache
> > effectiveness and a predictable replacement behavior over memory
> > locality, so reflect this in the default setting of the sysctl.
> > 
> > No-signoff-without-Mel's
> > Cc: <stable@kernel.org> # 3.12
> > ---
> >  Documentation/sysctl/vm.txt             | 20 ++++++++++++++++
> >  Documentation/vm/numa_memory_policy.txt |  7 ++++++
> >  include/linux/gfp.h                     |  4 +++-
> >  include/linux/pagemap.h                 |  2 +-
> >  include/linux/swap.h                    |  2 ++
> >  kernel/sysctl.c                         |  8 +++++++
> >  mm/filemap.c                            |  2 ++
> >  mm/page_alloc.c                         | 41 +++++++++++++++++++++++++--------
> >  mm/shmem.c                              | 14 +++++++++++
> >  9 files changed, 88 insertions(+), 12 deletions(-)
> > 
> > diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
> > index 1fbd4eb7b64a..308c342f62ad 100644
> > --- a/Documentation/sysctl/vm.txt
> > +++ b/Documentation/sysctl/vm.txt
> > @@ -38,6 +38,7 @@ Currently, these files are in /proc/sys/vm:
> >  - memory_failure_early_kill
> >  - memory_failure_recovery
> >  - min_free_kbytes
> > +- pagecache_mempolicy_mode
> >  - min_slab_ratio
> >  - min_unmapped_ratio
> >  - mmap_min_addr
> 
> Sure about the name?
> 
> This is a boolean and "mode" implies it might be a bitmask. That said, I
> recognise that my own naming also sucked because complaining about yours
> I can see that mine also sucks.

Is it because of how we use zone_reclaim_mode?  I don't see anything
wrong with a "mode" toggle that switches between only two modes of
operation instead of three or more.  But English being a second
language and all...

> > @@ -1816,7 +1833,7 @@ static void zlc_clear_zones_full(struct zonelist *zonelist)
> >  
> >  static bool zone_local(struct zone *local_zone, struct zone *zone)
> >  {
> > -	return node_distance(local_zone->node, zone->node) == LOCAL_DISTANCE;
> > +	return local_zone->node == zone->node;
> >  }
> 
> Does that not break on !CONFIG_NUMA?
> 
> It's why I used zone_to_nid

There is a separate definition for !CONFIG_NUMA, it fit nicely next to
the zlc stuff.

> >  static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone)
> > @@ -1908,22 +1925,25 @@ zonelist_scan:
> >  		if (unlikely(alloc_flags & ALLOC_NO_WATERMARKS))
> >  			goto try_this_zone;
> >  		/*
> > -		 * Distribute pages in proportion to the individual
> > -		 * zone size to ensure fair page aging.  The zone a
> > -		 * page was allocated in should have no effect on the
> > -		 * time the page has in memory before being reclaimed.
> > +		 * Distribute pagecache pages in proportion to the
> > +		 * individual zone size to ensure fair page aging.
> > +		 * The zone a page was allocated in should have no
> > +		 * effect on the time the page has in memory before
> > +		 * being reclaimed.
> >  		 *
> > -		 * When zone_reclaim_mode is enabled, try to stay in
> > -		 * local zones in the fastpath.  If that fails, the
> > +		 * When pagecache_mempolicy_mode or zone_reclaim_mode
> > +		 * is enabled, try to allocate from zones within the
> > +		 * preferred node in the fastpath.  If that fails, the
> >  		 * slowpath is entered, which will do another pass
> >  		 * starting with the local zones, but ultimately fall
> >  		 * back to remote zones that do not partake in the
> >  		 * fairness round-robin cycle of this zonelist.
> >  		 */
> > -		if (alloc_flags & ALLOC_WMARK_LOW) {
> > +		if ((alloc_flags & ALLOC_WMARK_LOW) &&
> > +		    (gfp_mask & __GFP_PAGECACHE)) {
> >  			if (zone_page_state(zone, NR_ALLOC_BATCH) <= 0)
> >  				continue;
> 
> NR_ALLOC_BATCH is updated regardless of zone_reclaim_mode or
> pagecache_mempolicy_mode. We only reset batch in the prepare_slowpath in
> some cases. Looks a bit fishy even though I can't quite put my finger on it.
> 
> I also got details wrong here in the v3 of the series. In an unreleased
> v4 of the series I had corrected the treatment of slab pages in line
> with your wishes and reused the broken out helper in prepare_slowpath to
> keep the decision in sync.
> 
> It's still in development but even if it gets rejected it'll act as a
> comparison point to yours.
> 
> > -			if (zone_reclaim_mode &&
> > +			if ((zone_reclaim_mode || pagecache_mempolicy_mode) &&
> >  			    !zone_local(preferred_zone, zone))
> >  				continue;
> >  		}
> 
> Documention says "enabling pagecache_mempolicy_mode, in which case page cache
> allocations will be placed according to the configured memory policy". Should
> that be !pagecache_mempolicy_mode? I'm getting confused with the double nots.

Yes, it's a bit weird.

We want to consider the round-robin batches for local zones but at the
same time avoid exhausted batches from pushing the allocation off-node
when either of those modes are enabled.  So in the fastpath we filter
for both and in the slowpath, once kswapd has been woken at the same
time that the batches have been reset to launch the new aging cycle,
we try in order of zonelist preference.

However, to answer your question above, if the slowpath still has to
fall back to a remote zone, we don't want to reset its batch because
we didn't verify it was actually exhausted in the fastpath and we
could risk cutting short the aging cycle for that particular zone.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH 0/6] Configurable fair allocation zone policy v3
  2013-12-18 19:48       ` Johannes Weiner
@ 2013-12-19 11:20         ` Mel Gorman
  0 siblings, 0 replies; 21+ messages in thread
From: Mel Gorman @ 2013-12-19 11:20 UTC (permalink / raw)
  To: Johannes Weiner; +Cc: Andrew Morton, Dave Hansen, Rik van Riel, Linux-MM, LKML

On Wed, Dec 18, 2013 at 02:48:13PM -0500, Johannes Weiner wrote:
> > <SNIP>
> > 
> > Sure about the name?
> > 
> > This is a boolean and "mode" implies it might be a bitmask. That said, I
> > recognise that my own naming also sucked because complaining about yours
> > I can see that mine also sucks.
> 
> Is it because of how we use zone_reclaim_mode? I don't see anything
> wrong with a "mode" toggle that switches between only two modes of
> operation instead of three or more.  But English being a second
> language and all...
> 

It's not just zone_reclaim_mode. Most references to mode in the VM (but
not all because who needs consistentcy) refer to either a mask or multiple
potential values. isolate_mode_t, gfp masks referred to as mode, memory
policies described as mode, migration modes etc.

Intuitively, I expect "mode" to not be a binary value.

> > > @@ -1816,7 +1833,7 @@ static void zlc_clear_zones_full(struct zonelist *zonelist)
> > >  
> > >  static bool zone_local(struct zone *local_zone, struct zone *zone)
> > >  {
> > > -	return node_distance(local_zone->node, zone->node) == LOCAL_DISTANCE;
> > > +	return local_zone->node == zone->node;
> > >  }
> > 
> > Does that not break on !CONFIG_NUMA?
> > 
> > It's why I used zone_to_nid
> 
> There is a separate definition for !CONFIG_NUMA, it fit nicely next to
> the zlc stuff.
> 

Ah, fair enough.

> > >  static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone)
> > > @@ -1908,22 +1925,25 @@ zonelist_scan:
> > >  		if (unlikely(alloc_flags & ALLOC_NO_WATERMARKS))
> > >  			goto try_this_zone;
> > >  		/*
> > > -		 * Distribute pages in proportion to the individual
> > > -		 * zone size to ensure fair page aging.  The zone a
> > > -		 * page was allocated in should have no effect on the
> > > -		 * time the page has in memory before being reclaimed.
> > > +		 * Distribute pagecache pages in proportion to the
> > > +		 * individual zone size to ensure fair page aging.
> > > +		 * The zone a page was allocated in should have no
> > > +		 * effect on the time the page has in memory before
> > > +		 * being reclaimed.
> > >  		 *
> > > -		 * When zone_reclaim_mode is enabled, try to stay in
> > > -		 * local zones in the fastpath.  If that fails, the
> > > +		 * When pagecache_mempolicy_mode or zone_reclaim_mode
> > > +		 * is enabled, try to allocate from zones within the
> > > +		 * preferred node in the fastpath.  If that fails, the
> > >  		 * slowpath is entered, which will do another pass
> > >  		 * starting with the local zones, but ultimately fall
> > >  		 * back to remote zones that do not partake in the
> > >  		 * fairness round-robin cycle of this zonelist.
> > >  		 */
> > > -		if (alloc_flags & ALLOC_WMARK_LOW) {
> > > +		if ((alloc_flags & ALLOC_WMARK_LOW) &&
> > > +		    (gfp_mask & __GFP_PAGECACHE)) {
> > >  			if (zone_page_state(zone, NR_ALLOC_BATCH) <= 0)
> > >  				continue;
> > 
> > NR_ALLOC_BATCH is updated regardless of zone_reclaim_mode or
> > pagecache_mempolicy_mode. We only reset batch in the prepare_slowpath in
> > some cases. Looks a bit fishy even though I can't quite put my finger on it.
> > 
> > I also got details wrong here in the v3 of the series. In an unreleased
> > v4 of the series I had corrected the treatment of slab pages in line
> > with your wishes and reused the broken out helper in prepare_slowpath to
> > keep the decision in sync.
> > 
> > It's still in development but even if it gets rejected it'll act as a
> > comparison point to yours.
> > 
> > > -			if (zone_reclaim_mode &&
> > > +			if ((zone_reclaim_mode || pagecache_mempolicy_mode) &&
> > >  			    !zone_local(preferred_zone, zone))
> > >  				continue;
> > >  		}
> > 
> > Documention says "enabling pagecache_mempolicy_mode, in which case page cache
> > allocations will be placed according to the configured memory policy". Should
> > that be !pagecache_mempolicy_mode? I'm getting confused with the double nots.
> 
> Yes, it's a bit weird.
> 
> We want to consider the round-robin batches for local zones but at the
> same time avoid exhausted batches from pushing the allocation off-node
> when either of those modes are enabled.  So in the fastpath we filter
> for both and in the slowpath, once kswapd has been woken at the same
> time that the batches have been reset to launch the new aging cycle,
> we try in order of zonelist preference.
> 
> However, to answer your question above, if the slowpath still has to
> fall back to a remote zone, we don't want to reset its batch because
> we didn't verify it was actually exhausted in the fastpath and we
> could risk cutting short the aging cycle for that particular zone.

Understood, thanks.

-- 
Mel Gorman
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH 0/6] Configurable fair allocation zone policy v3
  2013-12-17 20:02 ` [RFC PATCH 0/6] Configurable fair allocation zone policy v3 Johannes Weiner
  2013-12-18  6:17   ` Johannes Weiner
@ 2013-12-18 14:51   ` Michal Hocko
  2013-12-18 15:18     ` Johannes Weiner
  1 sibling, 1 reply; 21+ messages in thread
From: Michal Hocko @ 2013-12-18 14:51 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Mel Gorman, Andrew Morton, Dave Hansen, Rik van Riel, Linux-MM,
	LKML

On Tue 17-12-13 15:02:10, Johannes Weiner wrote:
[...]
> +pagecache_mempolicy_mode:
> +
> +This is available only on NUMA kernels.
> +
> +Per default, the configured memory policy is applicable to anonymous
> +memory, shmem, tmpfs, etc., whereas pagecache is allocated in an
> +interleaving fashion over all allowed nodes (hardbindings and
> +zone_reclaim_mode excluded).
> +
> +The assumption is that, when it comes to pagecache, users generally
> +prefer predictable replacement behavior regardless of NUMA topology
> +and maximizing the cache's effectiveness in reducing IO over memory
> +locality.

Isn't page spreading (PF_SPREAD_PAGE) intended to do the same thing
semantically? The setting is per-cpuset rather than global which makes
it harder to use but essentially it tries to distribute page cache pages
across all the nodes.

This is really getting confusing. We have zone_reclaim_mode to keep
memory local in general, pagecache_mempolicy_mode to keep page cache
local and PF_SPREAD_PAGE to spread the page cache around nodes.

> +
> +This behavior can be changed by enabling pagecache_mempolicy_mode, in
> +which case page cache allocations will be placed according to the
> +configured memory policy (Documentation/vm/numa_memory_policy.txt).
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH 0/6] Configurable fair allocation zone policy v3
  2013-12-18 14:51   ` Michal Hocko
@ 2013-12-18 15:18     ` Johannes Weiner
  2013-12-18 16:20       ` Michal Hocko
  0 siblings, 1 reply; 21+ messages in thread
From: Johannes Weiner @ 2013-12-18 15:18 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Mel Gorman, Andrew Morton, Dave Hansen, Rik van Riel, Linux-MM,
	LKML

On Wed, Dec 18, 2013 at 03:51:11PM +0100, Michal Hocko wrote:
> On Tue 17-12-13 15:02:10, Johannes Weiner wrote:
> [...]
> > +pagecache_mempolicy_mode:
> > +
> > +This is available only on NUMA kernels.
> > +
> > +Per default, the configured memory policy is applicable to anonymous
> > +memory, shmem, tmpfs, etc., whereas pagecache is allocated in an
> > +interleaving fashion over all allowed nodes (hardbindings and
> > +zone_reclaim_mode excluded).
> > +
> > +The assumption is that, when it comes to pagecache, users generally
> > +prefer predictable replacement behavior regardless of NUMA topology
> > +and maximizing the cache's effectiveness in reducing IO over memory
> > +locality.
> 
> Isn't page spreading (PF_SPREAD_PAGE) intended to do the same thing
> semantically? The setting is per-cpuset rather than global which makes
> it harder to use but essentially it tries to distribute page cache pages
> across all the nodes.
>
> This is really getting confusing. We have zone_reclaim_mode to keep
> memory local in general, pagecache_mempolicy_mode to keep page cache
> local and PF_SPREAD_PAGE to spread the page cache around nodes.

zone_reclaim_mode is a global setting to go through great lengths to
stay on local nodes, intended to be used depending on the hardware,
not the workload.

Mempolicy on the other hand is to optimize placement for maximum
locality depending on access patterns of a workload or even just the
subset of a workload.  I'm trying to change whether this applies to
page cache (due to different locality / cache effectiveness tradeoff)
and we want to provide pagecache_mempolicy_mode to revert in the field
in case this is a mistake.

PF_SPREAD_PAGE becomes implied per default and should eventually be
removed.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH 0/6] Configurable fair allocation zone policy v3
  2013-12-18 15:18     ` Johannes Weiner
@ 2013-12-18 16:20       ` Michal Hocko
  2013-12-18 19:20         ` Johannes Weiner
  0 siblings, 1 reply; 21+ messages in thread
From: Michal Hocko @ 2013-12-18 16:20 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Mel Gorman, Andrew Morton, Dave Hansen, Rik van Riel, Linux-MM,
	LKML

On Wed 18-12-13 10:18:46, Johannes Weiner wrote:
> On Wed, Dec 18, 2013 at 03:51:11PM +0100, Michal Hocko wrote:
> > On Tue 17-12-13 15:02:10, Johannes Weiner wrote:
> > [...]
> > > +pagecache_mempolicy_mode:
> > > +
> > > +This is available only on NUMA kernels.
> > > +
> > > +Per default, the configured memory policy is applicable to anonymous
> > > +memory, shmem, tmpfs, etc., whereas pagecache is allocated in an
> > > +interleaving fashion over all allowed nodes (hardbindings and
> > > +zone_reclaim_mode excluded).
> > > +
> > > +The assumption is that, when it comes to pagecache, users generally
> > > +prefer predictable replacement behavior regardless of NUMA topology
> > > +and maximizing the cache's effectiveness in reducing IO over memory
> > > +locality.
> > 
> > Isn't page spreading (PF_SPREAD_PAGE) intended to do the same thing
> > semantically? The setting is per-cpuset rather than global which makes
> > it harder to use but essentially it tries to distribute page cache pages
> > across all the nodes.
> >
> > This is really getting confusing. We have zone_reclaim_mode to keep
> > memory local in general, pagecache_mempolicy_mode to keep page cache
> > local and PF_SPREAD_PAGE to spread the page cache around nodes.
> 
> zone_reclaim_mode is a global setting to go through great lengths to
> stay on local nodes, intended to be used depending on the hardware,
> not the workload.
> 
> Mempolicy on the other hand is to optimize placement for maximum
> locality depending on access patterns of a workload or even just the
> subset of a workload.  I'm trying to change whether this applies to
> page cache (due to different locality / cache effectiveness tradeoff)
> and we want to provide pagecache_mempolicy_mode to revert in the field
> in case this is a mistake.
> 
> PF_SPREAD_PAGE becomes implied per default and should eventually be
> removed.

I guess many loads do not care about page cache locality and the default
spreading would be OK for them but what about those that do care?
Currently we have a per-process (cpuset in fact) flag but this will
change it to all or nothing. Is this really a good step?
Btw. I do not mind having PF_SPREAD_PAGE enabled by default.
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH 0/6] Configurable fair allocation zone policy v3
  2013-12-18 16:20       ` Michal Hocko
@ 2013-12-18 19:20         ` Johannes Weiner
  2013-12-19 12:59           ` Michal Hocko
  0 siblings, 1 reply; 21+ messages in thread
From: Johannes Weiner @ 2013-12-18 19:20 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Mel Gorman, Andrew Morton, Dave Hansen, Rik van Riel, Linux-MM,
	LKML

On Wed, Dec 18, 2013 at 05:20:50PM +0100, Michal Hocko wrote:
> On Wed 18-12-13 10:18:46, Johannes Weiner wrote:
> > On Wed, Dec 18, 2013 at 03:51:11PM +0100, Michal Hocko wrote:
> > > On Tue 17-12-13 15:02:10, Johannes Weiner wrote:
> > > [...]
> > > > +pagecache_mempolicy_mode:
> > > > +
> > > > +This is available only on NUMA kernels.
> > > > +
> > > > +Per default, the configured memory policy is applicable to anonymous
> > > > +memory, shmem, tmpfs, etc., whereas pagecache is allocated in an
> > > > +interleaving fashion over all allowed nodes (hardbindings and
> > > > +zone_reclaim_mode excluded).
> > > > +
> > > > +The assumption is that, when it comes to pagecache, users generally
> > > > +prefer predictable replacement behavior regardless of NUMA topology
> > > > +and maximizing the cache's effectiveness in reducing IO over memory
> > > > +locality.
> > > 
> > > Isn't page spreading (PF_SPREAD_PAGE) intended to do the same thing
> > > semantically? The setting is per-cpuset rather than global which makes
> > > it harder to use but essentially it tries to distribute page cache pages
> > > across all the nodes.
> > >
> > > This is really getting confusing. We have zone_reclaim_mode to keep
> > > memory local in general, pagecache_mempolicy_mode to keep page cache
> > > local and PF_SPREAD_PAGE to spread the page cache around nodes.

You are right that the user interface we are exposing is kind of
cruddy and I'm less and less convinced that this is the right
direction.

> > zone_reclaim_mode is a global setting to go through great lengths to
> > stay on local nodes, intended to be used depending on the hardware,
> > not the workload.
> > 
> > Mempolicy on the other hand is to optimize placement for maximum
> > locality depending on access patterns of a workload or even just the
> > subset of a workload.  I'm trying to change whether this applies to
> > page cache (due to different locality / cache effectiveness tradeoff)
> > and we want to provide pagecache_mempolicy_mode to revert in the field
> > in case this is a mistake.
> > 
> > PF_SPREAD_PAGE becomes implied per default and should eventually be
> > removed.
> 
> I guess many loads do not care about page cache locality and the default
> spreading would be OK for them but what about those that do care?

Mel suggested that the page cache spreading be implemented as just
another memory policy and I rejected it on the grounds that we have
can have strange aging artifacts if it's not the default.

But you are right that there might be usecases that really have high
cache locality and don't incur any reclaim.  The aging artifacts are
non-existent to them but they would care about the NUMA locality.

And basically, the same aging artifacts apply to anon e.g., just that
the trade-off balance is different, as reclaim is much less common.
And we do offer interleaving for anon as well.  So the situation is
not all that different that I had myself convinced it would be...

So the more I'm thinking about it, the more I'm leaning towards making
it a mempolicy after all, provided that we can set a sane default.

Maybe we can make the new default a hybrid policy that keeps anon,
shmem, slab, kernel, etc. local but interleaves pagecache.  This
should make sense to most usecases while providing the ability for
custom placement policies per-process or per-VMA without having to
make the decision on a global level or through an unusual interface.

> Currently we have a per-process (cpuset in fact) flag but this will
> change it to all or nothing. Is this really a good step?
> Btw. I do not mind having PF_SPREAD_PAGE enabled by default.

I don't want to muck around with cpusets too much, tbh...  but I agree
that the behavior of PF_SPREAD_PAGE should be the default.  Except it
should honor zone_reclaim_mode and round-robin nodes that are within
RECLAIM_DISTANCE of the local one.

I will have spotty access to internet starting tomorrow night until
New Year's.  Is there a chance we can maybe revert the NUMA aspects of
the original patch for now and leave it as a node-local zone fairness
thing?  The NUMA behavior was so broken on 3.12 that I doubt that
people have come to rely on the cache fairness on such machines in
that one release.  So we should be able to release 3.12-stable and
3.13 with node-local zone fairness without regressing anybody, and
then give the NUMA aspect of it another try in 3.14.

Something like the following should restore NUMA behavior while still
fixing the kswapd vs. page allocator interaction bug of thrashing on
the highest zone.  PS: zone_local() is in a CONFIG_NUMA block, which
is why accessing zone->node is safe :-)

---

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index dd886fac451a..317ea747d2cd 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1822,7 +1822,7 @@ static void zlc_clear_zones_full(struct zonelist *zonelist)

 static bool zone_local(struct zone *local_zone, struct zone *zone)
 {
-	return node_distance(local_zone->node, zone->node) == LOCAL_DISTANCE;
+	return local_zone->node == zone->node;
 }

 static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone)
@@ -1919,18 +1919,17 @@ get_page_from_freelist(gfp_t gfp_mask, nodemask_t *nodemask, unsigned int order,
 		 * page was allocated in should have no effect on the
 		 * time the page has in memory before being reclaimed.
 		 *
-		 * When zone_reclaim_mode is enabled, try to stay in
-		 * local zones in the fastpath.  If that fails, the
-		 * slowpath is entered, which will do another pass
-		 * starting with the local zones, but ultimately fall
-		 * back to remote zones that do not partake in the
-		 * fairness round-robin cycle of this zonelist.
+		 * Try to stay in local zones in the fastpath.  If
+		 * that fails, the slowpath is entered, which will do
+		 * another pass starting with the local zones, but
+		 * ultimately fall back to remote zones that do not
+		 * partake in the fairness round-robin cycle of this
+		 * zonelist.
 		 */
 		if (alloc_flags & ALLOC_WMARK_LOW) {
 			if (zone_page_state(zone, NR_ALLOC_BATCH) <= 0)
 				continue;
-			if (zone_reclaim_mode &&
-			    !zone_local(preferred_zone, zone))
+			if (!zone_local(preferred_zone, zone))
 				continue;
 		}
 		/*
@@ -2396,7 +2395,7 @@ static void prepare_slowpath(gfp_t gfp_mask, unsigned int order,
 		 * thrash fairness information for zones that are not
 		 * actually part of this zonelist's round-robin cycle.
 		 */
-		if (zone_reclaim_mode && !zone_local(preferred_zone, zone))
+		if (!zone_local(preferred_zone, zone))
 			continue;
 		mod_zone_page_state(zone, NR_ALLOC_BATCH,
 				    high_wmark_pages(zone) -

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH 0/6] Configurable fair allocation zone policy v3
  2013-12-18 19:20         ` Johannes Weiner
@ 2013-12-19 12:59           ` Michal Hocko
  0 siblings, 0 replies; 21+ messages in thread
From: Michal Hocko @ 2013-12-19 12:59 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Mel Gorman, Andrew Morton, Dave Hansen, Rik van Riel, Linux-MM,
	LKML

On Wed 18-12-13 14:20:15, Johannes Weiner wrote:
> On Wed, Dec 18, 2013 at 05:20:50PM +0100, Michal Hocko wrote:
[...]
> > Currently we have a per-process (cpuset in fact) flag but this will
> > change it to all or nothing. Is this really a good step?
> > Btw. I do not mind having PF_SPREAD_PAGE enabled by default.
> 
> I don't want to muck around with cpusets too much, tbh...  but I agree
> that the behavior of PF_SPREAD_PAGE should be the default.  Except it
> should honor zone_reclaim_mode and round-robin nodes that are within
> RECLAIM_DISTANCE of the local one.

Agreed.

> I will have spotty access to internet starting tomorrow night until
> New Year's.  Is there a chance we can maybe revert the NUMA aspects of
> the original patch for now and leave it as a node-local zone fairness
> thing?

Yes, that sounds perfectly reasonable to me.

> The NUMA behavior was so broken on 3.12 that I doubt that
> people have come to rely on the cache fairness on such machines in
> that one release.  So we should be able to release 3.12-stable and
> 3.13 with node-local zone fairness without regressing anybody, and
> then give the NUMA aspect of it another try in 3.14.
> 
> Something like the following should restore NUMA behavior while still
> fixing the kswapd vs. page allocator interaction bug of thrashing on
> the highest zone. 

Yes, it looks good to me. I guess zone_local could have stayed as it
was because it shouldn't be a big deal to fall-back to a different node
if the distance is LOCAL, but taking a conservative approach is not
harmfull.

> PS: zone_local() is in a CONFIG_NUMA block, which
> is why accessing zone->node is safe :-)
> 
> ---
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index dd886fac451a..317ea747d2cd 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1822,7 +1822,7 @@ static void zlc_clear_zones_full(struct zonelist *zonelist)
>  
>  static bool zone_local(struct zone *local_zone, struct zone *zone)
>  {
> -	return node_distance(local_zone->node, zone->node) == LOCAL_DISTANCE;
> +	return local_zone->node == zone->node;
>  }
>  
>  static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone)
> @@ -1919,18 +1919,17 @@ get_page_from_freelist(gfp_t gfp_mask, nodemask_t *nodemask, unsigned int order,
>  		 * page was allocated in should have no effect on the
>  		 * time the page has in memory before being reclaimed.
>  		 *
> -		 * When zone_reclaim_mode is enabled, try to stay in
> -		 * local zones in the fastpath.  If that fails, the
> -		 * slowpath is entered, which will do another pass
> -		 * starting with the local zones, but ultimately fall
> -		 * back to remote zones that do not partake in the
> -		 * fairness round-robin cycle of this zonelist.
> +		 * Try to stay in local zones in the fastpath.  If
> +		 * that fails, the slowpath is entered, which will do
> +		 * another pass starting with the local zones, but
> +		 * ultimately fall back to remote zones that do not
> +		 * partake in the fairness round-robin cycle of this
> +		 * zonelist.
>  		 */
>  		if (alloc_flags & ALLOC_WMARK_LOW) {
>  			if (zone_page_state(zone, NR_ALLOC_BATCH) <= 0)
>  				continue;
> -			if (zone_reclaim_mode &&
> -			    !zone_local(preferred_zone, zone))
> +			if (!zone_local(preferred_zone, zone))
>  				continue;
>  		}
>  		/*
> @@ -2396,7 +2395,7 @@ static void prepare_slowpath(gfp_t gfp_mask, unsigned int order,
>  		 * thrash fairness information for zones that are not
>  		 * actually part of this zonelist's round-robin cycle.
>  		 */
> -		if (zone_reclaim_mode && !zone_local(preferred_zone, zone))
> +		if (!zone_local(preferred_zone, zone))
>  			continue;
>  		mod_zone_page_state(zone, NR_ALLOC_BATCH,
>  				    high_wmark_pages(zone) -
> 
> 

-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [RFC PATCH 0/6] Configurable fair allocation zone policy v4
@ 2013-12-18 19:41 Mel Gorman
  2013-12-18 19:41 ` [PATCH 1/6] mm: page_alloc: exclude unreclaimable allocations from zone fairness policy Mel Gorman
  0 siblings, 1 reply; 21+ messages in thread
From: Mel Gorman @ 2013-12-18 19:41 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Dave Hansen, Rik van Riel, Michal Hocko, Linux-MM,
	LKML, Mel Gorman

This is still a work in progress. I know Johannes is maintaining his own
patch which takes a different approach and has different priorities. Michal
Hocko has raised concerns potentially affecting both. I'm releasing this so
there is a basis of comparison with Johannes' patch. It's not necessarily
the final shape of what we want to merge but the test results highlight
the current behaviour has regressed performance for basic workloads.

A big concern is the semantics and tradeoffs of the tunable are quite
involved.  Basically no matter what workload you get right, there will be
a workload that will be wrong. This might indicate that this really needs
to be controlled via memory policies or some means of detecting online
which policy should be used on a per-process basis.

By default, this series does *not* interleave pagecache across nodes but
it will interleave between local zones.

Changelog since V3
o Add documentation
o Bring tunable in line with Johannes
o Common code when deciding to update the batch count and skip zones

Changelog since v2
o Drop an accounting patch, behaviour is deliberate
o Special case tmpfs and shmem pages for discussion

Changelog since v1
o Fix lot of brain damage in the configurable policy patch
o Yoink a page cache annotation patch
o Only account batch pages against allocations eligible for the fair policy
o Add patch that default distributes file pages on remote nodes

Commit 81c0a2bb ("mm: page_alloc: fair zone allocator policy") solved a
bug whereby new pages could be reclaimed before old pages because of how
the page allocator and kswapd interacted on the per-zone LRU lists.

Unfortunately a side-effect missed during review was that it's now very
easy to allocate remote memory on NUMA machines. The problem is that
it is not a simple case of just restoring local allocation policies as
there are genuine reasons why global page aging may be prefereable. It's
still a major change to default behaviour so this patch makes the policy
configurable and sets what I think is a sensible default.

The patches are on top of some NUMA balancing patches currently in -mm.

3.13-rc3	 vanilla
instrument-v4	 NUMA balancing patches currently in mmotm
configuratble-v4 This series

Benchmarks are just on some basic workloads, the simple stuff we normally
expect to get right.

kernbench
                          3.13.0-rc3            3.13.0-rc3            3.13.0-rc3
                             vanilla       instrument-v4r1    configurable-v4r13
User    min        1417.32 (  0.00%)     1410.24 (  0.50%)     1410.78 (  0.46%)
User    mean       1419.10 (  0.00%)     1416.85 (  0.16%)     1412.68 (  0.45%)
User    stddev        2.25 (  0.00%)        4.51 (-100.53%)        1.03 ( 54.04%)
User    max        1422.92 (  0.00%)     1420.94 (  0.14%)     1413.68 (  0.65%)
User    range         5.60 (  0.00%)       10.70 (-91.07%)        2.90 ( 48.21%)
System  min         114.83 (  0.00%)      114.40 (  0.37%)      111.02 (  3.32%)
System  mean        115.89 (  0.00%)      116.25 ( -0.31%)      111.97 (  3.38%)
System  stddev        0.63 (  0.00%)        0.95 (-49.62%)        0.72 (-13.87%)
System  max         116.81 (  0.00%)      116.90 ( -0.08%)      113.15 (  3.13%)
System  range         1.98 (  0.00%)        2.50 (-26.26%)        2.13 ( -7.58%)
Elapsed min          42.90 (  0.00%)       43.23 ( -0.77%)       43.07 ( -0.40%)
Elapsed mean         43.58 (  0.00%)       44.09 ( -1.17%)       43.85 ( -0.62%)
Elapsed stddev        0.74 (  0.00%)        0.53 ( 28.21%)        0.41 ( 44.30%)
Elapsed max          44.52 (  0.00%)       44.67 ( -0.34%)       44.25 (  0.61%)
Elapsed range         1.62 (  0.00%)        1.44 ( 11.11%)        1.18 ( 27.16%)
CPU     min        3451.00 (  0.00%)     3434.00 (  0.49%)     3440.00 (  0.32%)
CPU     mean       3522.40 (  0.00%)     3477.20 (  1.28%)     3476.40 (  1.31%)
CPU     stddev       54.34 (  0.00%)       50.01 (  7.98%)       35.54 ( 34.59%)
CPU     max        3570.00 (  0.00%)     3556.00 (  0.39%)     3542.00 (  0.78%)
CPU     range       119.00 (  0.00%)      122.00 ( -2.52%)      102.00 ( 14.29%)

          3.13.0-rc3  3.13.0-rc3  3.13.0-rc3
             vanillainstrument-v4r1configurable-v4r13
User         8540.49     8519.33     8500.40
System        706.31      708.21      682.39
Elapsed       307.58      312.84      308.71

Elapsed time is roughly flat but there is a big reduction in system CPU time. Page
allocation is a smallish component of this workload but the cost of zeroing remote
pages still hurts. 


                            3.13.0-rc3  3.13.0-rc3  3.13.0-rc3
                               vanillainstrument-v4r1configurable-v4r13
NUMA alloc hit                73783951    73924585    93533721
NUMA alloc miss               20013534    19894773           0
NUMA interleave hit                  0           0           0
NUMA alloc local              73783935    73924577    93533714
NUMA page range updates        5933989     5922672     5903323
NUMA huge PMD updates               89         117         184
NUMA PTE updates               5888510     5862885     5809299
NUMA hint faults               2436205     2420282     2384045
NUMA hint local faults         1877025     1814204     2182988
NUMA hint local percent             77          74          91
NUMA pages migrated             457186      480840      198041
AutoNUMA cost                    12231       12152       11965

Note that NUMA alloc misses are reduced to 0. This is consistent throughout for all
benchmarks so I will not mention it again.

It is interesting to note the NUMA hinting faults that were local. It's increased
quite a lot by just allocating the memory local.

While the original intent of the patch was to improve caching when the
workload consumes all of memory, it's worth remembering that some workloads
can be kept cache hot on local nodes.

vmr-stream
                          3.13.0-rc3           3.13.0-rc3           3.13.0-rc3
                            vanilla        instrument-v4r1     configurable-v4r13
Add      5M        3809.80 (  0.00%)     3803.54 ( -0.16%)     3985.49 (  4.61%)
Copy     5M        3360.75 (  0.00%)     3367.38 (  0.20%)     3473.86 (  3.37%)
Scale    5M        3160.39 (  0.00%)     3160.14 ( -0.01%)     3392.39 (  7.34%)
Triad    5M        3533.04 (  0.00%)     3529.78 ( -0.09%)     3851.26 (  9.01%)
Add      7M        3789.82 (  0.00%)     3796.19 (  0.17%)     4004.91 (  5.68%)
Copy     7M        3345.85 (  0.00%)     3368.66 (  0.68%)     3478.26 (  3.96%)
Scale    7M        3176.00 (  0.00%)     3173.44 ( -0.08%)     3393.89 (  6.86%)
Triad    7M        3528.85 (  0.00%)     3531.91 (  0.09%)     3856.74 (  9.29%)
Add      8M        3801.60 (  0.00%)     3774.43 ( -0.71%)     3995.96 (  5.11%)
Copy     8M        3364.64 (  0.00%)     3345.99 ( -0.55%)     3478.86 (  3.39%)
Scale    8M        3169.34 (  0.00%)     3146.35 ( -0.73%)     3396.65 (  7.17%)
Triad    8M        3531.38 (  0.00%)     3509.46 ( -0.62%)     3857.63 (  9.24%)
Add      10M       3807.95 (  0.00%)     3808.50 (  0.01%)     3970.14 (  4.26%)
Copy     10M       3365.64 (  0.00%)     3377.66 (  0.36%)     3473.70 (  3.21%)
Scale    10M       3172.71 (  0.00%)     3166.78 ( -0.19%)     3396.09 (  7.04%)
Triad    10M       3536.15 (  0.00%)     3534.14 ( -0.06%)     3857.11 (  9.08%)
Add      14M       3787.56 (  0.00%)     3805.37 (  0.47%)     4002.47 (  5.67%)
Copy     14M       3345.19 (  0.00%)     3359.46 (  0.43%)     3484.76 (  4.17%)
Scale    14M       3154.55 (  0.00%)     3166.89 (  0.39%)     3400.32 (  7.79%)
Triad    14M       3522.09 (  0.00%)     3534.64 (  0.36%)     3864.75 (  9.73%)
Add      17M       3806.34 (  0.00%)     3796.89 ( -0.25%)     3956.47 (  3.94%)
Copy     17M       3368.39 (  0.00%)     3357.90 ( -0.31%)     3469.40 (  3.00%)
Scale    17M       3169.18 (  0.00%)     3168.00 ( -0.04%)     3390.47 (  6.98%)
Triad    17M       3535.05 (  0.00%)     3529.55 ( -0.16%)     3849.59 (  8.90%)
Add      21M       3795.31 (  0.00%)     3788.25 ( -0.19%)     4042.81 (  6.52%)
Copy     21M       3353.43 (  0.00%)     3360.22 (  0.20%)     3486.10 (  3.96%)
Scale    21M       3160.96 (  0.00%)     3164.97 (  0.13%)     3399.16 (  7.54%)
Triad    21M       3530.45 (  0.00%)     3519.72 ( -0.30%)     3860.71 (  9.35%)
Add      28M       3803.11 (  0.00%)     3788.38 ( -0.39%)     3996.59 (  5.09%)
Copy     28M       3361.16 (  0.00%)     3357.71 ( -0.10%)     3477.19 (  3.45%)
Scale    28M       3160.43 (  0.00%)     3155.64 ( -0.15%)     3396.09 (  7.46%)
Triad    28M       3533.66 (  0.00%)     3522.53 ( -0.32%)     3856.54 (  9.14%)
Add      35M       3792.86 (  0.00%)     3803.38 (  0.28%)     3990.04 (  5.20%)
Copy     35M       3344.24 (  0.00%)     3370.46 (  0.78%)     3476.15 (  3.94%)
Scale    35M       3160.14 (  0.00%)     3174.84 (  0.47%)     3397.43 (  7.51%)
Triad    35M       3531.94 (  0.00%)     3534.84 (  0.08%)     3857.45 (  9.22%)
Add      42M       3803.39 (  0.00%)     3790.25 ( -0.35%)     4017.97 (  5.64%)
Copy     42M       3360.64 (  0.00%)     3354.22 ( -0.19%)     3481.62 (  3.60%)
Scale    42M       3158.64 (  0.00%)     3157.63 ( -0.03%)     3397.54 (  7.56%)
Triad    42M       3529.99 (  0.00%)     3523.89 ( -0.17%)     3860.30 (  9.36%)
Add      56M       3778.07 (  0.00%)     3808.17 (  0.80%)     3964.91 (  4.95%)
Copy     56M       3348.68 (  0.00%)     3367.71 (  0.57%)     3470.75 (  3.65%)
Scale    56M       3169.25 (  0.00%)     3168.97 ( -0.01%)     3390.06 (  6.97%)
Triad    56M       3517.62 (  0.00%)     3537.10 (  0.55%)     3849.49 (  9.43%)
Add      71M       3811.71 (  0.00%)     3814.51 (  0.07%)     4002.20 (  5.00%)
Copy     71M       3370.59 (  0.00%)     3369.28 ( -0.04%)     3482.09 (  3.31%)
Scale    71M       3168.70 (  0.00%)     3174.30 (  0.18%)     3401.24 (  7.34%)
Triad    71M       3536.14 (  0.00%)     3538.14 (  0.06%)     3866.68 (  9.35%)
Add      85M       3805.94 (  0.00%)     3794.21 ( -0.31%)     4017.41 (  5.56%)
Copy     85M       3354.76 (  0.00%)     3352.10 ( -0.08%)     3481.19 (  3.77%)
Scale    85M       3162.20 (  0.00%)     3156.13 ( -0.19%)     3397.90 (  7.45%)
Triad    85M       3538.76 (  0.00%)     3529.06 ( -0.27%)     3859.79 (  9.07%)
Add      113M      3803.66 (  0.00%)     3797.09 ( -0.17%)     4024.26 (  5.80%)
Copy     113M      3348.32 (  0.00%)     3361.90 (  0.41%)     3482.30 (  4.00%)
Scale    113M      3177.09 (  0.00%)     3161.76 ( -0.48%)     3396.35 (  6.90%)
Triad    113M      3536.06 (  0.00%)     3527.87 ( -0.23%)     3858.76 (  9.13%)
Add      142M      3814.65 (  0.00%)     3800.76 ( -0.36%)     3971.52 (  4.11%)
Copy     142M      3353.31 (  0.00%)     3355.70 (  0.07%)     3476.61 (  3.68%)
Scale    142M      3186.05 (  0.00%)     3179.90 ( -0.19%)     3393.11 (  6.50%)
Triad    142M      3545.41 (  0.00%)     3537.84 ( -0.21%)     3855.31 (  8.74%)
Add      170M      3787.71 (  0.00%)     3793.38 (  0.15%)     3996.19 (  5.50%)
Copy     170M      3351.50 (  0.00%)     3355.90 (  0.13%)     3479.89 (  3.83%)
Scale    170M      3158.38 (  0.00%)     3162.04 (  0.12%)     3395.01 (  7.49%)
Triad    170M      3521.84 (  0.00%)     3524.25 (  0.07%)     3856.88 (  9.51%)
Add      227M      3794.46 (  0.00%)     3727.56 ( -1.76%)     3985.47 (  5.03%)
Copy     227M      3368.15 (  0.00%)     3277.24 ( -2.70%)     3471.96 (  3.08%)
Scale    227M      3160.18 (  0.00%)     3091.04 ( -2.19%)     3391.20 (  7.31%)
Triad    227M      3525.39 (  0.00%)     3494.23 ( -0.88%)     3850.97 (  9.24%)
Add      284M      3804.29 (  0.00%)     3810.34 (  0.16%)     3945.32 (  3.71%)
Copy     284M      3366.21 (  0.00%)     3349.90 ( -0.48%)     3464.13 (  2.91%)
Scale    284M      3174.61 (  0.00%)     3164.87 ( -0.31%)     3388.84 (  6.75%)
Triad    284M      3538.50 (  0.00%)     3541.71 (  0.09%)     3846.27 (  8.70%)
Add      341M      3805.26 (  0.00%)     3803.54 ( -0.05%)     4043.00 (  6.25%)
Copy     341M      3366.98 (  0.00%)     3357.66 ( -0.28%)     3485.41 (  3.52%)
Scale    341M      3159.11 (  0.00%)     3171.15 (  0.38%)     3401.61 (  7.68%)
Triad    341M      3530.80 (  0.00%)     3536.21 (  0.15%)     3863.89 (  9.43%)
Add      455M      3791.15 (  0.00%)     3781.78 ( -0.25%)     4002.92 (  5.59%)
Copy     455M      3353.30 (  0.00%)     3341.88 ( -0.34%)     3477.74 (  3.71%)
Scale    455M      3161.21 (  0.00%)     3157.15 ( -0.13%)     3395.44 (  7.41%)
Triad    455M      3527.90 (  0.00%)     3522.57 ( -0.15%)     3855.85 (  9.30%)
Add      568M      3779.79 (  0.00%)     3794.61 (  0.39%)     4001.91 (  5.88%)
Copy     568M      3349.93 (  0.00%)     3353.04 (  0.09%)     3483.56 (  3.99%)
Scale    568M      3163.69 (  0.00%)     3156.21 ( -0.24%)     3399.73 (  7.46%)
Triad    568M      3518.65 (  0.00%)     3524.57 (  0.17%)     3863.36 (  9.80%)
Add      682M      3801.06 (  0.00%)     3786.21 ( -0.39%)     3988.66 (  4.94%)
Copy     682M      3363.64 (  0.00%)     3354.10 ( -0.28%)     3478.52 (  3.42%)
Scale    682M      3151.89 (  0.00%)     3161.41 (  0.30%)     3396.46 (  7.76%)
Triad    682M      3528.97 (  0.00%)     3526.30 ( -0.08%)     3858.57 (  9.34%)
Add      910M      3778.97 (  0.00%)     3783.73 (  0.13%)     4015.78 (  6.27%)
Copy     910M      3345.09 (  0.00%)     3354.44 (  0.28%)     3481.58 (  4.08%)
Scale    910M      3164.46 (  0.00%)     3160.67 ( -0.12%)     3398.66 (  7.40%)
Triad    910M      3516.19 (  0.00%)     3525.40 (  0.26%)     3861.40 (  9.82%)
Add      1137M     3812.17 (  0.00%)     3781.70 ( -0.80%)     3992.81 (  4.74%)
Copy     1137M     3367.52 (  0.00%)     3343.53 ( -0.71%)     3477.47 (  3.27%)
Scale    1137M     3158.62 (  0.00%)     3160.31 (  0.05%)     3395.51 (  7.50%)
Triad    1137M     3536.97 (  0.00%)     3517.86 ( -0.54%)     3815.71 (  7.88%)
Add      1365M     3806.51 (  0.00%)     3807.60 (  0.03%)     3983.99 (  4.66%)
Copy     1365M     3360.43 (  0.00%)     3365.77 (  0.16%)     3470.61 (  3.28%)
Scale    1365M     3155.95 (  0.00%)     3160.24 (  0.14%)     3393.61 (  7.53%)
Triad    1365M     3534.18 (  0.00%)     3533.07 ( -0.03%)     3853.01 (  9.02%)
Add      1820M     3797.86 (  0.00%)     3804.61 (  0.18%)     4000.76 (  5.34%)
Copy     1820M     3362.09 (  0.00%)     3356.15 ( -0.18%)     3483.01 (  3.60%)
Scale    1820M     3170.20 (  0.00%)     3169.60 ( -0.02%)     3403.05 (  7.34%)
Triad    1820M     3531.00 (  0.00%)     3538.39 (  0.21%)     3864.64 (  9.45%)
Add      2275M     3810.31 (  0.00%)     3789.25 ( -0.55%)     3995.44 (  4.86%)
Copy     2275M     3373.60 (  0.00%)     3359.38 ( -0.42%)     3478.62 (  3.11%)
Scale    2275M     3174.64 (  0.00%)     3170.69 ( -0.12%)     3395.45 (  6.96%)
Triad    2275M     3537.57 (  0.00%)     3525.99 ( -0.33%)     3855.61 (  8.99%)
Add      2730M     3801.09 (  0.00%)     3792.84 ( -0.22%)     3961.20 (  4.21%)
Copy     2730M     3357.18 (  0.00%)     3346.35 ( -0.32%)     3457.14 (  2.98%)
Scale    2730M     3177.66 (  0.00%)     3172.64 ( -0.16%)     3371.72 (  6.11%)
Triad    2730M     3539.59 (  0.00%)     3527.66 ( -0.34%)     3824.14 (  8.04%)
Add      3640M     3816.88 (  0.00%)     3789.31 ( -0.72%)     4004.54 (  4.92%)
Copy     3640M     3375.91 (  0.00%)     3356.54 ( -0.57%)     3477.94 (  3.02%)
Scale    3640M     3167.22 (  0.00%)     3150.31 ( -0.53%)     3394.38 (  7.17%)
Triad    3640M     3546.45 (  0.00%)     3524.38 ( -0.62%)     3854.74 (  8.69%)
Add      4551M     3799.05 (  0.00%)     3784.66 ( -0.38%)     3974.45 (  4.62%)
Copy     4551M     3355.66 (  0.00%)     3351.35 ( -0.13%)     3471.85 (  3.46%)
Scale    4551M     3171.91 (  0.00%)     3160.79 ( -0.35%)     3393.06 (  6.97%)
Triad    4551M     3531.61 (  0.00%)     3518.76 ( -0.36%)     3855.49 (  9.17%)
Add      5461M     3801.60 (  0.00%)     3797.07 ( -0.12%)     3996.19 (  5.12%)
Copy     5461M     3360.29 (  0.00%)     3352.70 ( -0.23%)     3479.24 (  3.54%)
Scale    5461M     3161.18 (  0.00%)     3162.49 (  0.04%)     3396.26 (  7.44%)
Triad    5461M     3532.35 (  0.00%)     3529.30 ( -0.09%)     3853.59 (  9.09%)
Add      7281M     3800.80 (  0.00%)     3790.72 ( -0.27%)     3995.88 (  5.13%)
Copy     7281M     3359.99 (  0.00%)     3345.74 ( -0.42%)     3478.84 (  3.54%)
Scale    7281M     3168.68 (  0.00%)     3146.70 ( -0.69%)     3395.73 (  7.17%)
Triad    7281M     3533.59 (  0.00%)     3520.89 ( -0.36%)     3856.70 (  9.14%)
Add      9102M     3790.67 (  0.00%)     3797.54 (  0.18%)     4002.46 (  5.59%)
Copy     9102M     3345.80 (  0.00%)     3348.46 (  0.08%)     3481.31 (  4.05%)
Scale    9102M     3174.65 (  0.00%)     3161.95 ( -0.40%)     3401.10 (  7.13%)
Triad    9102M     3529.51 (  0.00%)     3529.58 (  0.00%)     3864.64 (  9.49%)
Add      10922M     3807.96 (  0.00%)     3796.07 ( -0.31%)     3947.40 (  3.66%)
Copy     10922M     3350.99 (  0.00%)     3357.85 (  0.20%)     3434.74 (  2.50%)
Scale    10922M     3164.74 (  0.00%)     3157.76 ( -0.22%)     3351.72 (  5.91%)
Triad    10922M     3536.69 (  0.00%)     3535.49 ( -0.03%)     3797.59 (  7.38%)
Add      14563M     3786.28 (  0.00%)     3809.36 (  0.61%)     4003.59 (  5.74%)
Copy     14563M     3352.51 (  0.00%)     3371.05 (  0.55%)     3477.73 (  3.73%)
Scale    14563M     3171.95 (  0.00%)     3175.20 (  0.10%)     3394.29 (  7.01%)
Triad    14563M     3522.50 (  0.00%)     3542.29 (  0.56%)     3854.10 (  9.41%)
Add      18204M     3809.56 (  0.00%)     3798.86 ( -0.28%)     3966.36 (  4.12%)
Copy     18204M     3365.06 (  0.00%)     3351.54 ( -0.40%)     3467.60 (  3.05%)
Scale    18204M     3171.25 (  0.00%)     3162.93 ( -0.26%)     3388.46 (  6.85%)
Triad    18204M     3539.90 (  0.00%)     3531.01 ( -0.25%)     3848.78 (  8.73%)
Add      21845M     3798.46 (  0.00%)     3796.16 ( -0.06%)     4024.09 (  5.94%)
Copy     21845M     3362.14 (  0.00%)     3365.10 (  0.09%)     3481.89 (  3.56%)
Scale    21845M     3170.99 (  0.00%)     3152.22 ( -0.59%)     3396.85 (  7.12%)
Triad    21845M     3534.49 (  0.00%)     3528.31 ( -0.17%)     3859.74 (  9.20%)
Add      29127M     3819.69 (  0.00%)     3794.61 ( -0.66%)     3958.76 (  3.64%)
Copy     29127M     3384.67 (  0.00%)     3364.16 ( -0.61%)     3461.19 (  2.26%)
Scale    29127M     3158.68 (  0.00%)     3155.56 ( -0.10%)     3377.98 (  6.94%)
Triad    29127M     3538.17 (  0.00%)     3525.06 ( -0.37%)     3833.34 (  8.34%)
Add      36408M     3806.95 (  0.00%)     3793.50 ( -0.35%)     3980.34 (  4.55%)
Copy     36408M     3361.11 (  0.00%)     3359.84 ( -0.04%)     3477.96 (  3.48%)
Scale    36408M     3165.87 (  0.00%)     3161.61 ( -0.13%)     3398.84 (  7.36%)
Triad    36408M     3536.86 (  0.00%)     3530.39 ( -0.18%)     3862.01 (  9.19%)
Add      43690M     3799.39 (  0.00%)     3799.45 (  0.00%)     4021.62 (  5.85%)
Copy     43690M     3359.26 (  0.00%)     3355.00 ( -0.13%)     3481.97 (  3.65%)
Scale    43690M     3175.35 (  0.00%)     3167.00 ( -0.26%)     3398.43 (  7.03%)
Triad    43690M     3535.26 (  0.00%)     3533.82 ( -0.04%)     3858.45 (  9.14%)
Add      58254M     3799.66 (  0.00%)     3788.90 ( -0.28%)     3986.25 (  4.91%)
Copy     58254M     3355.12 (  0.00%)     3348.89 ( -0.19%)     3473.90 (  3.54%)
Scale    58254M     3170.94 (  0.00%)     3148.96 ( -0.69%)     3392.93 (  7.00%)
Triad    58254M     3537.26 (  0.00%)     3519.99 ( -0.49%)     3853.76 (  8.95%)
Add      72817M     3815.26 (  0.00%)     3801.86 ( -0.35%)     4011.21 (  5.14%)
Copy     72817M     3362.18 (  0.00%)     3363.51 (  0.04%)     3478.56 (  3.46%)
Scale    72817M     3175.73 (  0.00%)     3162.25 ( -0.42%)     3392.44 (  6.82%)
Triad    72817M     3546.44 (  0.00%)     3534.43 ( -0.34%)     3851.14 (  8.59%)
Add      87381M     3519.93 (  0.00%)     3515.94 ( -0.11%)     3838.55 (  9.05%)
Copy     87381M     3175.29 (  0.00%)     3182.78 (  0.24%)     3262.29 (  2.74%)
Scale    87381M     2848.76 (  0.00%)     2836.37 ( -0.43%)     3172.85 ( 11.38%)
Triad    87381M     3465.19 (  0.00%)     3463.32 ( -0.05%)     3773.55 (  8.90%)

          3.13.0-rc3  3.13.0-rc3  3.13.0-rc3
             vanillainstrument-v4r1configurable-v4r13
User         1144.35     1154.47     1086.52
System         55.28       56.98       49.29
E lapsed      1207.64     1220.98     1145.16

This is a memory streaming benchmark. It benefits heavily from using local memory
so there are fairly sizable gains throughout.

                            3.13.0-rc3  3.13.0-rc3  3.13.0-rc3
                               vanillainstrument-v4r1configurable-v4r13
NUMA alloc hit                 1238820     1341467     2102140
NUMA alloc miss                 691541      787339           0
NUMA interleave hit                  0           0           0
NUMA alloc local               1238815     1341465     2102133
NUMA page range updates       24916702    24981923    24979446
NUMA huge PMD updates            48025       48138       48138
NUMA PTE updates                375927      383405      380928
NUMA hint faults                373397      380787      378304
NUMA hint local faults          142051      133068      368551
NUMA hint local percent             38          34          97
NUMA pages migrated              83407      105492       12060
AutoNUMA cost                     2042        2080        2066

Similarly the NUMA hinting faults were mostly local

pft
                        3.13.0-rc3            3.13.0-rc3            3.13.0-rc3
                           vanilla       instrument-v4r1    configurable-v4r13
User       1       0.6980 (  0.00%)       0.6830 (  2.15%)       0.6830 (  2.15%)
User       2       0.7040 (  0.00%)       0.7220 ( -2.56%)       0.7260 ( -3.12%)
User       3       0.6910 (  0.00%)       0.7200 ( -4.20%)       0.6800 (  1.59%)
User       4       0.7250 (  0.00%)       0.7290 ( -0.55%)       0.7370 ( -1.66%)
User       5       0.7590 (  0.00%)       0.7810 ( -2.90%)       0.7520 (  0.92%)
User       6       0.8130 (  0.00%)       0.8130 (  0.00%)       0.7400 (  8.98%)
User       7       0.8210 (  0.00%)       0.7990 (  2.68%)       0.7960 (  3.05%)
User       8       0.8390 (  0.00%)       0.8110 (  3.34%)       0.8020 (  4.41%)
System     1       9.1230 (  0.00%)       9.1630 ( -0.44%)       8.2640 (  9.42%)
System     2       9.3990 (  0.00%)       9.3730 (  0.28%)       8.4570 ( 10.02%)
System     3       9.1460 (  0.00%)       9.1070 (  0.43%)       8.6270 (  5.67%)
System     4       8.9160 (  0.00%)       8.7960 (  1.35%)       8.7380 (  2.00%)
System     5       9.5900 (  0.00%)       9.5420 (  0.50%)       8.9600 (  6.57%)
System     6       9.8640 (  0.00%)       9.8200 (  0.45%)       9.2530 (  6.19%)
System     7       9.9860 (  0.00%)       9.8140 (  1.72%)       9.3720 (  6.15%)
System     8       9.8570 (  0.00%)       9.8380 (  0.19%)       9.2860 (  5.79%)
Elapsed    1       9.8240 (  0.00%)       9.8500 ( -0.26%)       8.9530 (  8.87%)
Elapsed    2       5.0870 (  0.00%)       5.0670 (  0.39%)       4.6120 (  9.34%)
Elapsed    3       3.3220 (  0.00%)       3.3070 (  0.45%)       3.1320 (  5.72%)
Elapsed    4       2.4440 (  0.00%)       2.4080 (  1.47%)       2.4030 (  1.68%)
Elapsed    5       2.1500 (  0.00%)       2.1550 ( -0.23%)       1.9970 (  7.12%)
Elapsed    6       1.8290 (  0.00%)       1.8230 (  0.33%)       1.7040 (  6.83%)
Elapsed    7       1.5760 (  0.00%)       1.5470 (  1.84%)       1.4910 (  5.39%)
Elapsed    8       1.3660 (  0.00%)       1.3440 (  1.61%)       1.2830 (  6.08%)
Faults/cpu 1  336505.5875 (  0.00%)  335646.1191 ( -0.26%)  369269.9491 (  9.74%)
Faults/cpu 2  327139.2186 (  0.00%)  327337.4309 (  0.06%)  359879.3760 ( 10.01%)
Faults/cpu 3  336004.1324 (  0.00%)  336283.8915 (  0.08%)  355077.6062 (  5.68%)
Faults/cpu 4  342824.1564 (  0.00%)  346956.7616 (  1.21%)  348805.5389 (  1.74%)
Faults/cpu 5  319553.7707 (  0.00%)  320266.6891 (  0.22%)  340232.2510 (  6.47%)
Faults/cpu 6  309614.5554 (  0.00%)  310923.0881 (  0.42%)  330752.5617 (  6.83%)
Faults/cpu 7  306159.2969 (  0.00%)  311474.2294 (  1.74%)  325141.8868 (  6.20%)
Faults/cpu 8  309077.4966 (  0.00%)  310491.4673 (  0.46%)  327802.7845 (  6.06%)
Faults/sec 1  336364.5575 (  0.00%)  335493.4899 ( -0.26%)  369125.5968 (  9.74%)
Faults/sec 2  649713.2290 (  0.00%)  652201.7336 (  0.38%)  716621.5272 ( 10.30%)
Faults/sec 3  994812.3119 (  0.00%)  999330.4234 (  0.45%) 1055728.0701 (  6.12%)
Faults/sec 4 1352137.4832 (  0.00%) 1372667.4485 (  1.52%) 1375014.2113 (  1.69%)
Faults/sec 5 1538115.0421 (  0.00%) 1533647.8496 ( -0.29%) 1654330.8228 (  7.56%)
Faults/sec 6 1807211.7324 (  0.00%) 1814037.7599 (  0.38%) 1940810.7735 (  7.39%)
Faults/sec 7 2101840.1872 (  0.00%) 2132966.7624 (  1.48%) 2220066.9942 (  5.62%)
Faults/sec 8 2421813.7208 (  0.00%) 2458797.5104 (  1.53%) 2580986.1397 (  6.57%)

This is page fault microbenchmark. Again, heavy gains from having local memory.

          3.13.0-rc3  3.13.0-rc3  3.13.0-rc3
             vanillainstrument-v4r1configurable-v4r13
User           60.57       61.53       60.20
System        868.16      861.36      809.22
Elapsed       336.19      335.31      315.19

And big reduction in system CPU time.

                            3.13.0-rc3  3.13.0-rc3  3.13.0-rc3
                               vanillainstrument-v4r1configurable-v4r13
NUMA alloc hit               187243902   187334720   264999707
NUMA alloc miss               77736695    77665156           0
NUMA interleave hit                  0           0           0
NUMA alloc local             187243902   187334720   264999706
NUMA page range updates      136246380   134086920   124386461
NUMA huge PMD updates                0           0           0
NUMA PTE updates             136246380   134086920   124386461
NUMA hint faults                   512         804         524
NUMA hint local faults             248         339         349
NUMA hint local percent             48          42          66
NUMA pages migrated                169          53         115
AutoNUMA cost                      956         942         873

NUMA hinting faults were not as local as I'd hope but it's a separate issue.

ebizzy
                     3.13.0-rc3            3.13.0-rc3            3.13.0-rc3
                        vanilla       instrument-v4r1    configurable-v4r13
Mean   1      3213.33 (  0.00%)     3170.67 ( -1.33%)     3215.33 (  0.06%)
Mean   2      2291.33 (  0.00%)     2310.00 (  0.81%)     2364.00 (  3.17%)
Mean   3      2234.67 (  0.00%)     2300.33 (  2.94%)     2289.00 (  2.43%)
Mean   4      2224.33 (  0.00%)     2237.00 (  0.57%)     2238.00 (  0.61%)
Mean   5      2256.33 (  0.00%)     2249.00 ( -0.33%)     2319.67 (  2.81%)
Mean   6      2233.00 (  0.00%)     2253.67 (  0.93%)     2228.67 ( -0.19%)
Mean   7      2212.33 (  0.00%)     2258.33 (  2.08%)     2237.00 (  1.11%)
Mean   8      2224.67 (  0.00%)     2232.33 (  0.34%)     2235.33 (  0.48%)
Mean   12     2213.33 (  0.00%)     2223.33 (  0.45%)     2230.00 (  0.75%)
Mean   16     2221.00 (  0.00%)     2247.67 (  1.20%)     2237.00 (  0.72%)
Mean   20     2215.00 (  0.00%)     2247.33 (  1.46%)     2244.33 (  1.32%)
Mean   24     2175.00 (  0.00%)     2181.00 (  0.28%)     2225.00 (  2.30%)
Mean   28     2110.00 (  0.00%)     2156.67 (  2.21%)     2140.33 (  1.44%)
Mean   32     2077.67 (  0.00%)     2081.33 (  0.18%)     2092.33 (  0.71%)
Mean   36     2016.33 (  0.00%)     2042.67 (  1.31%)     2086.33 (  3.47%)
Mean   40     1984.00 (  0.00%)     1988.00 (  0.20%)     2032.67 (  2.45%)
Mean   44     1943.33 (  0.00%)     1960.00 (  0.86%)     1993.33 (  2.57%)
Mean   48     1925.00 (  0.00%)     1935.00 (  0.52%)     1990.67 (  3.41%)
Range  1        62.00 (  0.00%)       74.00 (-19.35%)       59.00 (  4.84%)
Range  2        70.00 (  0.00%)       32.00 ( 54.29%)      146.00 (-108.57%)
Range  3        39.00 (  0.00%)       48.00 (-23.08%)       70.00 (-79.49%)
Range  4       100.00 (  0.00%)      127.00 (-27.00%)       74.00 ( 26.00%)
Range  5        65.00 (  0.00%)       52.00 ( 20.00%)      100.00 (-53.85%)
Range  6        25.00 (  0.00%)       17.00 ( 32.00%)       81.00 (-224.00%)
Range  7        55.00 (  0.00%)       19.00 ( 65.45%)       44.00 ( 20.00%)
Range  8         9.00 (  0.00%)       43.00 (-377.78%)       15.00 (-66.67%)
Range  12       52.00 (  0.00%)       10.00 ( 80.77%)       22.00 ( 57.69%)
Range  16       47.00 (  0.00%)       55.00 (-17.02%)       28.00 ( 40.43%)
Range  20        9.00 (  0.00%)       68.00 (-655.56%)       27.00 (-200.00%)
Range  24       44.00 (  0.00%)       81.00 (-84.09%)       54.00 (-22.73%)
Range  28       28.00 (  0.00%)       46.00 (-64.29%)       80.00 (-185.71%)
Range  32       23.00 (  0.00%)       22.00 (  4.35%)       11.00 ( 52.17%)
Range  36        9.00 (  0.00%)       20.00 (-122.22%)       75.00 (-733.33%)
Range  40       31.00 (  0.00%)        5.00 ( 83.87%)       10.00 ( 67.74%)
Range  44       16.00 (  0.00%)       15.00 (  6.25%)       18.00 (-12.50%)
Range  48        7.00 (  0.00%)       18.00 (-157.14%)       16.00 (-128.57%)
Stddev 1        25.42 (  0.00%)       30.83 (-21.26%)       24.50 (  3.62%)
Stddev 2        29.68 (  0.00%)       13.37 ( 54.96%)       61.26 (-106.40%)
Stddev 3        18.15 (  0.00%)       19.87 ( -9.46%)       28.89 (-59.14%)
Stddev 4        41.28 (  0.00%)       55.76 (-35.06%)       34.65 ( 16.06%)
Stddev 5        27.18 (  0.00%)       21.65 ( 20.36%)       44.03 (-61.99%)
Stddev 6        10.80 (  0.00%)        7.04 ( 34.83%)       37.28 (-245.12%)
Stddev 7        23.10 (  0.00%)        8.73 ( 62.20%)       18.02 ( 21.99%)
Stddev 8         3.68 (  0.00%)       19.60 (-432.39%)        6.13 (-66.45%)
Stddev 12       23.84 (  0.00%)        4.19 ( 82.42%)        9.93 ( 58.33%)
Stddev 16       20.22 (  0.00%)       23.47 (-16.10%)       11.52 ( 43.02%)
Stddev 20        3.74 (  0.00%)       27.86 (-644.61%)       11.15 (-197.88%)
Stddev 24       18.18 (  0.00%)       35.19 (-93.49%)       22.23 (-22.23%)
Stddev 28       11.78 (  0.00%)       20.81 (-76.69%)       32.66 (-177.38%)
Stddev 32        9.74 (  0.00%)        9.74 (  0.00%)        4.64 ( 52.34%)
Stddev 36        3.86 (  0.00%)        8.99 (-133.08%)       33.83 (-776.65%)
Stddev 40       14.17 (  0.00%)        2.16 ( 84.75%)        4.19 ( 70.42%)
Stddev 44        7.54 (  0.00%)        7.07 (  6.25%)        8.26 ( -9.51%)
Stddev 48        2.94 (  0.00%)        7.48 (-154.20%)        6.94 (-135.88%)

Performance is only slightly improved here, it's doing a lot of remote copies
anyway as a side-effect of the type of workload it is. There is high spread
on the performance of individual threads but that bug is known and being handled
elsewhere.

          3.13.0-rc3  3.13.0-rc3  3.13.0-rc3
             vanillainstrument-v4r1configurable-v4r13
User          491.24      494.38      500.23
System        874.62      874.43      870.83
Elapsed      1082.00     1082.29     1082.14

                            3.13.0-rc3  3.13.0-rc3  3.13.0-rc3
                               vanillainstrument-v4r1configurable-v4r13
NUMA alloc hit               238904205   238016877   315741530
NUMA alloc miss               71969773    75135651           0
NUMA interleave hit                  0           0           0
NUMA alloc local             238904198   238016873   315741524
NUMA page range updates         157577      171845      163950
NUMA huge PMD updates               33          38          17
NUMA PTE updates                140714      152427      155263
NUMA hint faults                 39395       60301       56019
NUMA hint local faults           17294       30974       33723
NUMA hint local percent             43          51          60
NUMA pages migrated               7183        8818       11221
AutoNUMA cost                      198         302         281

Local hinting is not great again but the workload is doing a lot of remote
references so it's somewhat expected.

 Documentation/sysctl/vm.txt |  51 +++++++++++++
 include/linux/gfp.h         |   4 +-
 include/linux/mmzone.h      |   2 +
 include/linux/pagemap.h     |   2 +-
 include/linux/swap.h        |   1 +
 kernel/sysctl.c             |   8 ++
 mm/filemap.c                |   3 +-
 mm/page_alloc.c             | 182 ++++++++++++++++++++++++++++++++++++++------
 mm/shmem.c                  |  14 ++++
 9 files changed, 239 insertions(+), 28 deletions(-)

-- 
1.8.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH 1/6] mm: page_alloc: exclude unreclaimable allocations from zone fairness policy
  2013-12-18 19:41 [RFC PATCH 0/6] Configurable fair allocation zone policy v4 Mel Gorman
@ 2013-12-18 19:41 ` Mel Gorman
  0 siblings, 0 replies; 21+ messages in thread
From: Mel Gorman @ 2013-12-18 19:41 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Dave Hansen, Rik van Riel, Michal Hocko, Linux-MM,
	LKML, Mel Gorman

From: Johannes Weiner <hannes@cmpxchg.org>

Dave Hansen noted a regression in a microbenchmark that loops around
open() and close() on an 8-node NUMA machine and bisected it down to
81c0a2bb515f ("mm: page_alloc: fair zone allocator policy").  That
change forces the slab allocations of the file descriptor to spread
out to all 8 nodes, causing remote references in the page allocator
and slab.

The round-robin policy is only there to provide fairness among memory
allocations that are reclaimed involuntarily based on pressure in each
zone.  It does not make sense to apply it to unreclaimable kernel
allocations that are freed manually, in this case instantly after the
allocation, and incur the remote reference costs twice for no reason.

Only round-robin allocations that are usually freed through page
reclaim or slab shrinking.

Cc: <stable@kernel.org> # 3.12
Bisected-by: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
---
 mm/page_alloc.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 580a5f0..f861d02 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1920,7 +1920,8 @@ zonelist_scan:
 		 * back to remote zones that do not partake in the
 		 * fairness round-robin cycle of this zonelist.
 		 */
-		if (alloc_flags & ALLOC_WMARK_LOW) {
+		if ((alloc_flags & ALLOC_WMARK_LOW) &&
+		    (gfp_mask & GFP_MOVABLE_MASK)) {
 			if (zone_page_state(zone, NR_ALLOC_BATCH) <= 0)
 				continue;
 			if (zone_reclaim_mode &&
-- 
1.8.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2013-12-19 12:59 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2013-12-17 16:48 [RFC PATCH 0/6] Configurable fair allocation zone policy v3 Mel Gorman
2013-12-17 16:48 ` [PATCH 1/6] mm: page_alloc: exclude unreclaimable allocations from zone fairness policy Mel Gorman
2013-12-17 16:48 ` [PATCH 2/6] mm: page_alloc: Break out zone page aging distribution into its own helper Mel Gorman
2013-12-17 16:48 ` [PATCH 3/6] mm: page_alloc: Use zone node IDs to approximate locality Mel Gorman
2013-12-17 16:48 ` [PATCH 4/6] mm: Annotate page cache allocations Mel Gorman
2013-12-17 16:48 ` [PATCH 5/6] mm: page_alloc: Make zone distribution page aging policy configurable Mel Gorman
2013-12-17 16:48 ` [PATCH 6/6] mm: page_alloc: add vm.pagecache_interleave to control default mempolicy for page cache Mel Gorman
2013-12-17 20:02 ` [RFC PATCH 0/6] Configurable fair allocation zone policy v3 Johannes Weiner
2013-12-18  6:17   ` Johannes Weiner
2013-12-18 13:47     ` Rik van Riel
2013-12-18 14:17       ` Johannes Weiner
2013-12-18 15:00     ` Mel Gorman
2013-12-18 16:09       ` Mel Gorman
2013-12-18 19:48       ` Johannes Weiner
2013-12-19 11:20         ` Mel Gorman
2013-12-18 14:51   ` Michal Hocko
2013-12-18 15:18     ` Johannes Weiner
2013-12-18 16:20       ` Michal Hocko
2013-12-18 19:20         ` Johannes Weiner
2013-12-19 12:59           ` Michal Hocko
  -- strict thread matches above, loose matches on Subject: below --
2013-12-18 19:41 [RFC PATCH 0/6] Configurable fair allocation zone policy v4 Mel Gorman
2013-12-18 19:41 ` [PATCH 1/6] mm: page_alloc: exclude unreclaimable allocations from zone fairness policy Mel Gorman

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).