* [PATCH 1/6] mm: page_alloc: exclude unreclaimable allocations from zone fairness policy
2013-12-17 16:48 [RFC PATCH 0/6] Configurable fair allocation zone policy v3 Mel Gorman
@ 2013-12-17 16:48 ` Mel Gorman
2013-12-17 16:48 ` [PATCH 2/6] mm: page_alloc: Break out zone page aging distribution into its own helper Mel Gorman
` (5 subsequent siblings)
6 siblings, 0 replies; 21+ messages in thread
From: Mel Gorman @ 2013-12-17 16:48 UTC (permalink / raw)
To: Johannes Weiner
Cc: Andrew Morton, Dave Hansen, Rik van Riel, Linux-MM, LKML,
Mel Gorman
From: Johannes Weiner <hannes@cmpxchg.org>
Dave Hansen noted a regression in a microbenchmark that loops around
open() and close() on an 8-node NUMA machine and bisected it down to
81c0a2bb515f ("mm: page_alloc: fair zone allocator policy"). That
change forces the slab allocations of the file descriptor to spread
out to all 8 nodes, causing remote references in the page allocator
and slab.
The round-robin policy is only there to provide fairness among memory
allocations that are reclaimed involuntarily based on pressure in each
zone. It does not make sense to apply it to unreclaimable kernel
allocations that are freed manually, in this case instantly after the
allocation, and incur the remote reference costs twice for no reason.
Only round-robin allocations that are usually freed through page
reclaim or slab shrinking.
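As a point of reference, here is a minimal sketch of the distinction the hunk
below relies on. The helper name is illustrative only; the mask and GFP bits
are the existing ones from include/linux/gfp.h:

	/*
	 * Sketch only: GFP_MOVABLE_MASK is __GFP_RECLAIMABLE|__GFP_MOVABLE, so
	 * the fairness policy keeps applying to LRU pages (allocated with
	 * __GFP_MOVABLE, e.g. GFP_HIGHUSER_MOVABLE) and to reclaimable slab
	 * caches (SLAB_RECLAIM_ACCOUNT sets __GFP_RECLAIMABLE), while plain
	 * GFP_KERNEL allocations like the file descriptor slabs mentioned
	 * above stay node-local.
	 */
	static inline bool gfp_fair_policy_eligible(gfp_t gfp_mask)
	{
		return !!(gfp_mask & GFP_MOVABLE_MASK);
	}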
Cc: <stable@kernel.org>
Bisected-by: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
---
mm/page_alloc.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 580a5f0..f861d02 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1920,7 +1920,8 @@ zonelist_scan:
* back to remote zones that do not partake in the
* fairness round-robin cycle of this zonelist.
*/
- if (alloc_flags & ALLOC_WMARK_LOW) {
+ if ((alloc_flags & ALLOC_WMARK_LOW) &&
+ (gfp_mask & GFP_MOVABLE_MASK)) {
if (zone_page_state(zone, NR_ALLOC_BATCH) <= 0)
continue;
if (zone_reclaim_mode &&
--
1.8.4
* [PATCH 2/6] mm: page_alloc: Break out zone page aging distribution into its own helper
2013-12-17 16:48 [RFC PATCH 0/6] Configurable fair allocation zone policy v3 Mel Gorman
2013-12-17 16:48 ` [PATCH 1/6] mm: page_alloc: exclude unreclaimable allocations from zone fairness policy Mel Gorman
@ 2013-12-17 16:48 ` Mel Gorman
2013-12-17 16:48 ` [PATCH 3/6] mm: page_alloc: Use zone node IDs to approximate locality Mel Gorman
` (4 subsequent siblings)
6 siblings, 0 replies; 21+ messages in thread
From: Mel Gorman @ 2013-12-17 16:48 UTC (permalink / raw)
To: Johannes Weiner
Cc: Andrew Morton, Dave Hansen, Rik van Riel, Linux-MM, LKML,
Mel Gorman
This patch moves the decision on whether to round-robin allocations between
zones and nodes into its own helper function. It makes some later patches
easier to understand, and the helper will be automatically inlined.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
---
mm/page_alloc.c | 63 ++++++++++++++++++++++++++++++++++++++-------------------
1 file changed, 42 insertions(+), 21 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index f861d02..64020eb 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1872,6 +1872,42 @@ static inline void init_zone_allows_reclaim(int nid)
#endif /* CONFIG_NUMA */
/*
+ * Distribute pages in proportion to the individual zone size to ensure fair
+ * page aging. The zone a page was allocated in should have no effect on the
+ * time the page has in memory before being reclaimed.
+ *
+ * Returns true if this zone should be skipped to spread the page ages to
+ * other zones.
+ */
+static bool zone_distribute_age(gfp_t gfp_mask, struct zone *preferred_zone,
+ struct zone *zone, int alloc_flags)
+{
+ /* Only round robin in the allocator fast path */
+ if (!(alloc_flags & ALLOC_WMARK_LOW))
+ return false;
+
+ /* Only round robin pages likely to be LRU or reclaimable slab */
+ if (!(gfp_mask & GFP_MOVABLE_MASK))
+ return false;
+
+ /* Distribute to the next zone if this zone has exhausted its batch */
+ if (zone_page_state(zone, NR_ALLOC_BATCH) <= 0)
+ return true;
+
+ /*
+ * When zone_reclaim_mode is enabled, try to stay in local zones in the
+ * fastpath. If that fails, the slowpath is entered, which will do
+ * another pass starting with the local zones, but ultimately fall
+ * back to remote zones that do not partake in the fairness round-robin
+ * cycle of this zonelist.
+ */
+ if (zone_reclaim_mode && !zone_local(preferred_zone, zone))
+ return true;
+
+ return false;
+}
+
+/*
* get_page_from_freelist goes through the zonelist trying to allocate
* a page.
*/
@@ -1907,27 +1943,12 @@ zonelist_scan:
BUILD_BUG_ON(ALLOC_NO_WATERMARKS < NR_WMARK);
if (unlikely(alloc_flags & ALLOC_NO_WATERMARKS))
goto try_this_zone;
- /*
- * Distribute pages in proportion to the individual
- * zone size to ensure fair page aging. The zone a
- * page was allocated in should have no effect on the
- * time the page has in memory before being reclaimed.
- *
- * When zone_reclaim_mode is enabled, try to stay in
- * local zones in the fastpath. If that fails, the
- * slowpath is entered, which will do another pass
- * starting with the local zones, but ultimately fall
- * back to remote zones that do not partake in the
- * fairness round-robin cycle of this zonelist.
- */
- if ((alloc_flags & ALLOC_WMARK_LOW) &&
- (gfp_mask & GFP_MOVABLE_MASK)) {
- if (zone_page_state(zone, NR_ALLOC_BATCH) <= 0)
- continue;
- if (zone_reclaim_mode &&
- !zone_local(preferred_zone, zone))
- continue;
- }
+
+ /* Distribute pages to ensure fair page aging */
+ if (zone_distribute_age(gfp_mask, preferred_zone, zone,
+ alloc_flags))
+ continue;
+
/*
* When allocating a page cache page for writing, we
* want to get it from a zone that is within its dirty
--
1.8.4
* [PATCH 3/6] mm: page_alloc: Use zone node IDs to approximate locality
2013-12-17 16:48 [RFC PATCH 0/6] Configurable fair allocation zone policy v3 Mel Gorman
2013-12-17 16:48 ` [PATCH 1/6] mm: page_alloc: exclude unreclaimable allocations from zone fairness policy Mel Gorman
2013-12-17 16:48 ` [PATCH 2/6] mm: page_alloc: Break out zone page aging distribution into its own helper Mel Gorman
@ 2013-12-17 16:48 ` Mel Gorman
2013-12-17 16:48 ` [PATCH 4/6] mm: Annotate page cache allocations Mel Gorman
` (3 subsequent siblings)
6 siblings, 0 replies; 21+ messages in thread
From: Mel Gorman @ 2013-12-17 16:48 UTC (permalink / raw)
To: Johannes Weiner
Cc: Andrew Morton, Dave Hansen, Rik van Riel, Linux-MM, LKML,
Mel Gorman
zone_local() uses node_distance(), which is a more expensive call than
necessary. On x86 it is another function call in the allocator fast path
and increases the cache footprint. This patch makes the assumption that
zones on the local node share the same node ID. The necessary information
should already be cache hot.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
---
mm/page_alloc.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 64020eb..fd9677e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1816,7 +1816,7 @@ static void zlc_clear_zones_full(struct zonelist *zonelist)
static bool zone_local(struct zone *local_zone, struct zone *zone)
{
- return node_distance(local_zone->node, zone->node) == LOCAL_DISTANCE;
+ return zone_to_nid(zone) == numa_node_id();
}
static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone)
--
1.8.4
* [PATCH 4/6] mm: Annotate page cache allocations
2013-12-17 16:48 [RFC PATCH 0/6] Configurable fair allocation zone policy v3 Mel Gorman
` (2 preceding siblings ...)
2013-12-17 16:48 ` [PATCH 3/6] mm: page_alloc: Use zone node IDs to approximate locality Mel Gorman
@ 2013-12-17 16:48 ` Mel Gorman
2013-12-17 16:48 ` [PATCH 5/6] mm: page_alloc: Make zone distribution page aging policy configurable Mel Gorman
` (2 subsequent siblings)
6 siblings, 0 replies; 21+ messages in thread
From: Mel Gorman @ 2013-12-17 16:48 UTC (permalink / raw)
To: Johannes Weiner
Cc: Andrew Morton, Dave Hansen, Rik van Riel, Linux-MM, LKML,
Mel Gorman
The fair zone allocation policy needs to distinguish between anonymous,
slab and file-backed pages. This patch annotates many of the page cache
allocations by adjusting __page_cache_alloc. This does not guarantee
that all page cache allocations are being properly annotated. One case
for special consideration is shmem. SysV shared memory and MAP_SHARED
anonymous pages are backed by it and should be treated as anon by the
fair allocation policy. Shmem is also used by tmpfs, which arguably should
be treated as file by the fair allocation policy.
The primary top-level shmem allocation function is shmem_getpage_gfp,
which ultimately uses alloc_pages_vma() and not __page_cache_alloc. That
is correct for SysV and MAP_SHARED, but it means tmpfs is still treated as
anonymous. This patch special-cases shmem to annotate tmpfs allocations as
file pages for the fair zone allocation policy.
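For regular files the annotation should propagate to most allocation sites
without further changes because the common paths funnel through
__page_cache_alloc(). A sketch of one existing wrapper, shown only to
illustrate how callers pick up the new flag (the wrapper itself is not
touched by this patch):

	/*
	 * One of the existing wrappers in include/linux/pagemap.h. The
	 * readahead and fault paths use similar page_cache_alloc_*()
	 * variants; all of them funnel through __page_cache_alloc(), so
	 * they pick up the __GFP_PAGECACHE bit that is ORed in there.
	 */
	static inline struct page *page_cache_alloc(struct address_space *x)
	{
		return __page_cache_alloc(mapping_gfp_mask(x));
	}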
NOTE: At the time of writing it has not been double-checked that this
annotates the different shmem request types correctly. Furthermore, this
patch was originally based on a patch from Johannes and does not have
his Signed-off-by. Without it, I cannot sign the patch off either.
Cannot-sign-off-without-Johannes
---
include/linux/gfp.h | 4 +++-
include/linux/pagemap.h | 2 +-
mm/filemap.c | 2 ++
mm/shmem.c | 14 ++++++++++++++
4 files changed, 20 insertions(+), 2 deletions(-)
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 9b4dd49..f69e4cb 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -35,6 +35,7 @@ struct vm_area_struct;
#define ___GFP_NO_KSWAPD 0x400000u
#define ___GFP_OTHER_NODE 0x800000u
#define ___GFP_WRITE 0x1000000u
+#define ___GFP_PAGECACHE 0x2000000u
/* If the above are modified, __GFP_BITS_SHIFT may need updating */
/*
@@ -92,6 +93,7 @@ struct vm_area_struct;
#define __GFP_OTHER_NODE ((__force gfp_t)___GFP_OTHER_NODE) /* On behalf of other node */
#define __GFP_KMEMCG ((__force gfp_t)___GFP_KMEMCG) /* Allocation comes from a memcg-accounted resource */
#define __GFP_WRITE ((__force gfp_t)___GFP_WRITE) /* Allocator intends to dirty page */
+#define __GFP_PAGECACHE ((__force gfp_t)___GFP_PAGECACHE) /* Page cache allocation */
/*
* This may seem redundant, but it's a way of annotating false positives vs.
@@ -99,7 +101,7 @@ struct vm_area_struct;
*/
#define __GFP_NOTRACK_FALSE_POSITIVE (__GFP_NOTRACK)
-#define __GFP_BITS_SHIFT 25 /* Room for N __GFP_FOO bits */
+#define __GFP_BITS_SHIFT 26 /* Room for N __GFP_FOO bits */
#define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))
/* This equals 0, but use constants in case they ever change */
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index e3dea75..bda4845 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -221,7 +221,7 @@ extern struct page *__page_cache_alloc(gfp_t gfp);
#else
static inline struct page *__page_cache_alloc(gfp_t gfp)
{
- return alloc_pages(gfp, 0);
+ return alloc_pages(gfp | __GFP_PAGECACHE, 0);
}
#endif
diff --git a/mm/filemap.c b/mm/filemap.c
index b7749a9..5bb9225 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -517,6 +517,8 @@ struct page *__page_cache_alloc(gfp_t gfp)
int n;
struct page *page;
+ gfp |= __GFP_PAGECACHE;
+
if (cpuset_do_page_mem_spread()) {
unsigned int cpuset_mems_cookie;
do {
diff --git a/mm/shmem.c b/mm/shmem.c
index 8297623..02d7a9c 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -929,6 +929,17 @@ static struct page *shmem_swapin(swp_entry_t swap, gfp_t gfp,
return page;
}
+/* Fugly method of distinguishing sysv/MAP_SHARED anon from tmpfs */
+static bool shmem_inode_on_tmpfs(struct shmem_inode_info *info)
+{
+ /* If no internal shm_mount then it must be tmpfs */
+ if (IS_ERR(shm_mnt))
+ return true;
+
+ /* Consider it to be tmpfs if the superblock is not the internal mount */
+ return info->vfs_inode.i_sb != shm_mnt->mnt_sb;
+}
+
static struct page *shmem_alloc_page(gfp_t gfp,
struct shmem_inode_info *info, pgoff_t index)
{
@@ -942,6 +953,9 @@ static struct page *shmem_alloc_page(gfp_t gfp,
pvma.vm_ops = NULL;
pvma.vm_policy = mpol_shared_policy_lookup(&info->policy, index);
+ if (shmem_inode_on_tmpfs(info))
+ gfp |= __GFP_PAGECACHE;
+
page = alloc_page_vma(gfp, &pvma, 0);
/* Drop reference taken by mpol_shared_policy_lookup() */
--
1.8.4
* [PATCH 5/6] mm: page_alloc: Make zone distribution page aging policy configurable
2013-12-17 16:48 [RFC PATCH 0/6] Configurable fair allocation zone policy v3 Mel Gorman
` (3 preceding siblings ...)
2013-12-17 16:48 ` [PATCH 4/6] mm: Annotate page cache allocations Mel Gorman
@ 2013-12-17 16:48 ` Mel Gorman
2013-12-17 16:48 ` [PATCH 6/6] mm: page_alloc: add vm.pagecache_interleave to control default mempolicy for page cache Mel Gorman
2013-12-17 20:02 ` [RFC PATCH 0/6] Configurable fair allocation zone policy v3 Johannes Weiner
6 siblings, 0 replies; 21+ messages in thread
From: Mel Gorman @ 2013-12-17 16:48 UTC (permalink / raw)
To: Johannes Weiner
Cc: Andrew Morton, Dave Hansen, Rik van Riel, Linux-MM, LKML,
Mel Gorman
Commit 81c0a2bb ("mm: page_alloc: fair zone allocator policy") solved a
bug whereby new pages could be reclaimed before old pages because of
how the page allocator and kswapd interacted on the per-zone LRU lists.
Unfortunately it was missed during review that a consequence is that
we also round-robin between NUMA nodes. This is bad for two reasons:
1. It alters the semantics of MPOL_LOCAL without telling anyone
2. It incurs an immediate remote memory performance hit in exchange
for a potential performance gain when memory needs to be reclaimed
later
No cookies for the reviewers on this one.
This patch makes the behaviour of the fair zone allocator policy
configurable. By default it will only distribute pages that are going
to exist on the LRU between zones local to the allocating process. This
preserves the historical semantics of MPOL_LOCAL.
By default, slab pages are not distributed between zones after this patch is
applied. It can be argued that they should get similar treatment but they
have different lifecycles to LRU pages, the shrinkers are not zone-aware
and the interaction between the page allocator and kswapd is different
for slabs. If it turns out to be an almost universal win, we can change
the default.
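As a worked example of how the values documented below combine: the flags are
ORed together, so distributing anon, file and slab pages between local zones
only corresponds to writing 1|2|4 = 7 to /proc/sys/vm/zone_distribute_mode,
while additionally spreading file pages over remote nodes would be 7|16 = 23.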
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
Documentation/sysctl/vm.txt | 32 ++++++++++++++
include/linux/mmzone.h | 2 +
include/linux/swap.h | 2 +
kernel/sysctl.c | 8 ++++
mm/page_alloc.c | 102 ++++++++++++++++++++++++++++++++++++++------
5 files changed, 134 insertions(+), 12 deletions(-)
diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
index 1fbd4eb..8eaa562 100644
--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.txt
@@ -56,6 +56,7 @@ Currently, these files are in /proc/sys/vm:
- swappiness
- user_reserve_kbytes
- vfs_cache_pressure
+- zone_distribute_mode
- zone_reclaim_mode
==============================================================
@@ -724,6 +725,37 @@ causes the kernel to prefer to reclaim dentries and inodes.
==============================================================
+zone_distribute_mode
+
+Page allocation and reclaim are managed on a per-zone basis. When the
+system needs to reclaim memory, candidate pages are selected from these
+per-zone lists. Historically, a potential consequence was that recently
+allocated pages were considered reclaim candidates. From a zone-local
+perspective, page aging was preserved but from a system-wide perspective
+there was an age inversion problem.
+
+A similar problem occurs on a node level where young pages may be reclaimed
+from the local node instead of allocating remote memory. Unfortunately, the
+cost of accessing remote nodes is higher so the system must choose by default
+between favouring page aging or node locality. zone_distribute_mode controls
+how the system will distribute page ages between zones.
+
+0 = Never round-robin based on age
+
+Otherwise the values are ORed together
+
+1 = Distribute anon pages between zones local to the allocating node
+2 = Distribute file pages between zones local to the allocating node
+4 = Distribute slab pages between zones local to the allocating node
+
+The following three flags effectively alter MPOL_DEFAULT, be careful.
+
+8 = Distribute anon pages between zones remote to the allocating node
+16 = Distribute file pages between zones remote to the allocating node
+32 = Distribute slab pages between zones remote to the allocating node
+
+==============================================================
+
zone_reclaim_mode:
Zone_reclaim_mode allows someone to set more or less aggressive approaches to
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index b835d3f..20a75e3 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -897,6 +897,8 @@ int sysctl_min_unmapped_ratio_sysctl_handler(struct ctl_table *, int,
void __user *, size_t *, loff_t *);
int sysctl_min_slab_ratio_sysctl_handler(struct ctl_table *, int,
void __user *, size_t *, loff_t *);
+int sysctl_zone_distribute_mode_handler(struct ctl_table *, int,
+ void __user *, size_t *, loff_t *);
extern int numa_zonelist_order_handler(struct ctl_table *, int,
void __user *, size_t *, loff_t *);
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 46ba0c6..44329b0 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -318,6 +318,8 @@ extern int vm_swappiness;
extern int remove_mapping(struct address_space *mapping, struct page *page);
extern unsigned long vm_total_pages;
+extern unsigned __bitwise__ zone_distribute_mode;
+
#ifdef CONFIG_NUMA
extern int zone_reclaim_mode;
extern int sysctl_min_unmapped_ratio;
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 34a6047..b75c08f 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1349,6 +1349,14 @@ static struct ctl_table vm_table[] = {
.extra1 = &zero,
},
#endif
+ {
+ .procname = "zone_distribute_mode",
+ .data = &zone_distribute_mode,
+ .maxlen = sizeof(zone_distribute_mode),
+ .mode = 0644,
+ .proc_handler = sysctl_zone_distribute_mode_handler,
+ .extra1 = &zero,
+ },
#ifdef CONFIG_NUMA
{
.procname = "zone_reclaim_mode",
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index fd9677e..c2a2229 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1871,6 +1871,49 @@ static inline void init_zone_allows_reclaim(int nid)
}
#endif /* CONFIG_NUMA */
+/* Controls how page ages are distributed across zones automatically */
+unsigned __bitwise__ zone_distribute_mode __read_mostly;
+
+/* See zone_distribute_mode documentation in Documentation/sysctl/vm.txt */
+#define DISTRIBUTE_DISABLE (0)
+#define DISTRIBUTE_LOCAL_ANON (1UL << 0)
+#define DISTRIBUTE_LOCAL_FILE (1UL << 1)
+#define DISTRIBUTE_LOCAL_SLAB (1UL << 2)
+#define DISTRIBUTE_REMOTE_ANON (1UL << 3)
+#define DISTRIBUTE_REMOTE_FILE (1UL << 4)
+#define DISTRIBUTE_REMOTE_SLAB (1UL << 5)
+
+#define DISTRIBUTE_STUPID_ANON (DISTRIBUTE_LOCAL_ANON|DISTRIBUTE_REMOTE_ANON)
+#define DISTRIBUTE_STUPID_FILE (DISTRIBUTE_LOCAL_FILE|DISTRIBUTE_REMOTE_FILE)
+#define DISTRIBUTE_STUPID_SLAB (DISTRIBUTE_LOCAL_SLAB|DISTRIBUTE_REMOTE_SLAB)
+#define DISTRIBUTE_DEFAULT (DISTRIBUTE_LOCAL_ANON|DISTRIBUTE_LOCAL_FILE|DISTRIBUTE_LOCAL_SLAB)
+
+/* Only these GFP flags are affected by the fair zone allocation policy */
+#define DISTRIBUTE_GFP_MASK ((GFP_MOVABLE_MASK|__GFP_PAGECACHE))
+
+int sysctl_zone_distribute_mode_handler(ctl_table *table, int write,
+ void __user *buffer, size_t *length, loff_t *ppos)
+{
+ int rc;
+
+ rc = proc_dointvec_minmax(table, write, buffer, length, ppos);
+ if (rc)
+ return rc;
+
+ /* If you are an admin reading this comment, what were you thinking? */
+ if (WARN_ON_ONCE((zone_distribute_mode & DISTRIBUTE_STUPID_ANON) ==
+ DISTRIBUTE_STUPID_ANON))
+ zone_distribute_mode &= ~DISTRIBUTE_REMOTE_ANON;
+ if (WARN_ON_ONCE((zone_distribute_mode & DISTRIBUTE_STUPID_FILE) ==
+ DISTRIBUTE_STUPID_FILE))
+ zone_distribute_mode &= ~DISTRIBUTE_REMOTE_FILE;
+ if (WARN_ON_ONCE((zone_distribute_mode & DISTRIBUTE_STUPID_SLAB) ==
+ DISTRIBUTE_STUPID_SLAB))
+ zone_distribute_mode &= ~DISTRIBUTE_REMOTE_SLAB;
+
+ return 0;
+}
+
/*
* Distribute pages in proportion to the individual zone size to ensure fair
* page aging. The zone a page was allocated in should have no effect on the
@@ -1882,26 +1925,60 @@ static inline void init_zone_allows_reclaim(int nid)
static bool zone_distribute_age(gfp_t gfp_mask, struct zone *preferred_zone,
struct zone *zone, int alloc_flags)
{
+ bool zone_is_local;
+ bool is_file, is_slab, is_anon;
+
/* Only round robin in the allocator fast path */
if (!(alloc_flags & ALLOC_WMARK_LOW))
return false;
- /* Only round robin pages likely to be LRU or reclaimable slab */
- if (!(gfp_mask & GFP_MOVABLE_MASK))
+ /* Only a subset of GFP flags are considered for fair zone policy */
+ if (!(gfp_mask & DISTRIBUTE_GFP_MASK))
return false;
- /* Distribute to the next zone if this zone has exhausted its batch */
- if (zone_page_state(zone, NR_ALLOC_BATCH) <= 0)
- return true;
-
/*
- * When zone_reclaim_mode is enabled, try to stay in local zones in the
- * fastpath. If that fails, the slowpath is entered, which will do
- * another pass starting with the local zones, but ultimately fall
- * back to remote zones that do not partake in the fairness round-robin
- * cycle of this zonelist.
+ * Classify the type of allocation. From this point on, the fair zone
+ * allocation policy is being applied. If the allocation does not meet
+ * the criteria the zone must be skipped.
*/
- if (zone_reclaim_mode && !zone_local(preferred_zone, zone))
+ is_file = gfp_mask & __GFP_PAGECACHE;
+ is_slab = gfp_mask & __GFP_RECLAIMABLE;
+ is_anon = (!is_file && !is_slab);
+ WARN_ON_ONCE(is_slab && is_file);
+
+ zone_is_local = zone_local(preferred_zone, zone);
+ if (zone_is_local) {
+ /* Distribute between zones local to the node if requested */
+ if (is_anon && (zone_distribute_mode & DISTRIBUTE_LOCAL_ANON))
+ goto check_batch;
+ if (is_file && (zone_distribute_mode & DISTRIBUTE_LOCAL_FILE))
+ goto check_batch;
+ if (is_slab && (zone_distribute_mode & DISTRIBUTE_LOCAL_SLAB))
+ goto check_batch;
+ } else {
+ /*
+ * When zone_reclaim_mode is enabled, stick to local zones. If
+ * that fails, the slowpath is entered, which will do another
+ * pass starting with the local zones, but ultimately fall
+ * back to remote zones that do not partake in the fairness
+ * round-robin cycle of this zonelist.
+ */
+ if (zone_reclaim_mode)
+ return false;
+
+ if (is_anon && (zone_distribute_mode & DISTRIBUTE_REMOTE_ANON))
+ goto check_batch;
+ if (is_file && (zone_distribute_mode & DISTRIBUTE_REMOTE_FILE))
+ goto check_batch;
+ if (is_slab && (zone_distribute_mode & DISTRIBUTE_REMOTE_SLAB))
+ goto check_batch;
+ }
+
+ return true;
+
+check_batch:
+ /* Distribute to the next zone if this zone has exhausted its batch */
+ if (zone_page_state(zone, NR_ALLOC_BATCH) <= 0)
return true;
return false;
@@ -3797,6 +3874,7 @@ void __ref build_all_zonelists(pg_data_t *pgdat, struct zone *zone)
__build_all_zonelists(NULL);
mminit_verify_zonelist();
cpuset_init_current_mems_allowed();
+ zone_distribute_mode = DISTRIBUTE_DEFAULT;
} else {
#ifdef CONFIG_MEMORY_HOTPLUG
if (zone)
--
1.8.4
* [PATCH 6/6] mm: page_alloc: add vm.pagecache_interleave to control default mempolicy for page cache
2013-12-17 16:48 [RFC PATCH 0/6] Configurable fair allocation zone policy v3 Mel Gorman
` (4 preceding siblings ...)
2013-12-17 16:48 ` [PATCH 5/6] mm: page_alloc: Make zone distribution page aging policy configurable Mel Gorman
@ 2013-12-17 16:48 ` Mel Gorman
2013-12-17 20:02 ` [RFC PATCH 0/6] Configurable fair allocation zone policy v3 Johannes Weiner
6 siblings, 0 replies; 21+ messages in thread
From: Mel Gorman @ 2013-12-17 16:48 UTC (permalink / raw)
To: Johannes Weiner
Cc: Andrew Morton, Dave Hansen, Rik van Riel, Linux-MM, LKML,
Mel Gorman
This patch introduces a vm.pagecache_interleave sysctl that allows the
administrator to alter the default memory allocation policy for file-backed
pages. It removes the more configurable interface introduced by the previous
patch, which is expected to be too complex to expose to users and to give an
unnecessary level of control. By default it is disabled, but there is strong
evidence that users on NUMA machines will want to enable this. The default
is expected to change once the documentation is in sync. Ideally it would
also be possible to control this on a per-process basis by allowing processes
to select either an MPOL_LOCAL or MPOL_INTERLEAVE_PAGECACHE memory policy,
as memory policies are the traditional way of controlling allocation
behaviour.
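For reference, given the sysctl handler below the knob is exposed as
/proc/sys/vm/pagecache_interleave: writing 1 adds remote-node distribution of
file-backed pages (DISTRIBUTE_REMOTE_FILE) on top of the local-only default,
and writing 0 goes back to distributing only between zones of the local node.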
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
Documentation/sysctl/vm.txt | 61 +++++++++++++++++++++------------------------
include/linux/mmzone.h | 2 +-
include/linux/swap.h | 2 +-
kernel/sysctl.c | 8 +++---
mm/page_alloc.c | 18 +++++--------
5 files changed, 41 insertions(+), 50 deletions(-)
diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
index 8eaa562..655ed0a 100644
--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.txt
@@ -49,6 +49,7 @@ Currently, these files are in /proc/sys/vm:
- oom_kill_allocating_task
- overcommit_memory
- overcommit_ratio
+- pagecache_interleave
- page-cluster
- panic_on_oom
- percpu_pagelist_fraction
@@ -56,7 +57,6 @@ Currently, these files are in /proc/sys/vm:
- swappiness
- user_reserve_kbytes
- vfs_cache_pressure
-- zone_distribute_mode
- zone_reclaim_mode
==============================================================
@@ -608,6 +608,34 @@ of physical RAM. See above.
==============================================================
+pagecache_interleave:
+
+This setting is only relevant to NUMA machines.
+
+Historically, the default behaviour of the system is to allocate memory
+local to the process. The behaviour is usually modified through the use
+of memory policies while zone_reclaim_mode controls how strict the local
+memory allocation policy is.
+
+Issues arise when the allocating process is frequently running on the same
+node. The kernel's memory reclaim daemon runs one instance per NUMA node.
+A consequence is that relatively new memory may be reclaimed by kswapd when
+the allocating process is running on a specific node. The user-visible
+impact is that the system appears to do more IO than necessary when a
+workload is accessing files that are larger than a given NUMA node.
+
+One way of addressing this is to use the interleave memory policy but that
+is not always possible.
+
+Another option is to enable this setting. When enabled, the default
+memory allocation policy changes from MPOL_LOCAL to interleaving
+file-backed pages across the allowed nodes. The downside is that some file
+accesses will now be to remote memory even though the local node has
+available resources. The upside is that workloads working on files larger
+than a NUMA node will not reclaim active pages prematurely.
+
+==============================================================
+
page-cluster
page-cluster controls the number of pages up to which consecutive pages
@@ -725,37 +753,6 @@ causes the kernel to prefer to reclaim dentries and inodes.
==============================================================
-zone_distribute_mode
-
-Page allocation and reclaim are managed on a per-zone basis. When the
-system needs to reclaim memory, candidate pages are selected from these
-per-zone lists. Historically, a potential consequence was that recently
-allocated pages were considered reclaim candidates. From a zone-local
-perspective, page aging was preserved but from a system-wide perspective
-there was an age inversion problem.
-
-A similar problem occurs on a node level where young pages may be reclaimed
-from the local node instead of allocating remote memory. Unfortunately, the
-cost of accessing remote nodes is higher so the system must choose by default
-between favouring page aging or node locality. zone_distribute_mode controls
-how the system will distribute page ages between zones.
-
-0 = Never round-robin based on age
-
-Otherwise the values are ORed together
-
-1 = Distribute anon pages between zones local to the allocating node
-2 = Distribute file pages between zones local to the allocating node
-4 = Distribute slab pages between zones local to the allocating node
-
-The following three flags effectively alter MPOL_DEFAULT, be careful.
-
-8 = Distribute anon pages between zones remote to the allocating node
-16 = Distribute file pages between zones remote to the allocating node
-32 = Distribute slab pages between zones remote to the allocating node
-
-==============================================================
-
zone_reclaim_mode:
Zone_reclaim_mode allows someone to set more or less aggressive approaches to
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 20a75e3..2fb9e2d 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -897,7 +897,7 @@ int sysctl_min_unmapped_ratio_sysctl_handler(struct ctl_table *, int,
void __user *, size_t *, loff_t *);
int sysctl_min_slab_ratio_sysctl_handler(struct ctl_table *, int,
void __user *, size_t *, loff_t *);
-int sysctl_zone_distribute_mode_handler(struct ctl_table *, int,
+int sysctl_zone_pagecache_interleave_handler(struct ctl_table *, int,
void __user *, size_t *, loff_t *);
extern int numa_zonelist_order_handler(struct ctl_table *, int,
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 44329b0..2b522cf 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -318,7 +318,7 @@ extern int vm_swappiness;
extern int remove_mapping(struct address_space *mapping, struct page *page);
extern unsigned long vm_total_pages;
-extern unsigned __bitwise__ zone_distribute_mode;
+extern unsigned int zone_pagecache_interleave;
#ifdef CONFIG_NUMA
extern int zone_reclaim_mode;
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index b75c08f..385d7cb 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1350,11 +1350,11 @@ static struct ctl_table vm_table[] = {
},
#endif
{
- .procname = "zone_distribute_mode",
- .data = &zone_distribute_mode,
- .maxlen = sizeof(zone_distribute_mode),
+ .procname = "pagecache_interleave",
+ .data = &zone_pagecache_interleave,
+ .maxlen = sizeof(zone_pagecache_interleave),
.mode = 0644,
- .proc_handler = sysctl_zone_distribute_mode_handler,
+ .proc_handler = sysctl_zone_pagecache_interleave_handler,
.extra1 = &zero,
},
#ifdef CONFIG_NUMA
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c2a2229..b6c8e63 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1872,7 +1872,8 @@ static inline void init_zone_allows_reclaim(int nid)
#endif /* CONFIG_NUMA */
/* Controls how page ages are distributed across zones automatically */
-unsigned __bitwise__ zone_distribute_mode __read_mostly;
+static unsigned __bitwise__ zone_distribute_mode __read_mostly;
+unsigned int zone_pagecache_interleave;
/* See zone_distribute_mode documentation in Documentation/sysctl/vm.txt */
#define DISTRIBUTE_DISABLE (0)
@@ -1891,7 +1892,7 @@ unsigned __bitwise__ zone_distribute_mode __read_mostly;
/* Only these GFP flags are affected by the fair zone allocation policy */
#define DISTRIBUTE_GFP_MASK ((GFP_MOVABLE_MASK|__GFP_PAGECACHE))
-int sysctl_zone_distribute_mode_handler(ctl_table *table, int write,
+int sysctl_zone_pagecache_interleave_handler(ctl_table *table, int write,
void __user *buffer, size_t *length, loff_t *ppos)
{
int rc;
@@ -1900,16 +1901,9 @@ int sysctl_zone_distribute_mode_handler(ctl_table *table, int write,
if (rc)
return rc;
- /* If you are an admin reading this comment, what were you thinking? */
- if (WARN_ON_ONCE((zone_distribute_mode & DISTRIBUTE_STUPID_ANON) ==
- DISTRIBUTE_STUPID_ANON))
- zone_distribute_mode &= ~DISTRIBUTE_REMOTE_ANON;
- if (WARN_ON_ONCE((zone_distribute_mode & DISTRIBUTE_STUPID_FILE) ==
- DISTRIBUTE_STUPID_FILE))
- zone_distribute_mode &= ~DISTRIBUTE_REMOTE_FILE;
- if (WARN_ON_ONCE((zone_distribute_mode & DISTRIBUTE_STUPID_SLAB) ==
- DISTRIBUTE_STUPID_SLAB))
- zone_distribute_mode &= ~DISTRIBUTE_REMOTE_SLAB;
+ zone_distribute_mode = DISTRIBUTE_LOCAL_ANON|DISTRIBUTE_LOCAL_FILE|DISTRIBUTE_LOCAL_SLAB;
+ if (zone_pagecache_interleave)
+ zone_distribute_mode |= DISTRIBUTE_REMOTE_FILE;
return 0;
}
--
1.8.4
* Re: [RFC PATCH 0/6] Configurable fair allocation zone policy v3
2013-12-17 16:48 [RFC PATCH 0/6] Configurable fair allocation zone policy v3 Mel Gorman
` (5 preceding siblings ...)
2013-12-17 16:48 ` [PATCH 6/6] mm: page_alloc: add vm.pagecache_interleave to control default mempolicy for page cache Mel Gorman
@ 2013-12-17 20:02 ` Johannes Weiner
2013-12-18 6:17 ` Johannes Weiner
2013-12-18 14:51 ` Michal Hocko
6 siblings, 2 replies; 21+ messages in thread
From: Johannes Weiner @ 2013-12-17 20:02 UTC (permalink / raw)
To: Mel Gorman; +Cc: Andrew Morton, Dave Hansen, Rik van Riel, Linux-MM, LKML
Hi Mel,
On Tue, Dec 17, 2013 at 04:48:18PM +0000, Mel Gorman wrote:
> This series is currently untested and is being posted to sync up discussions
> on the treatment of page cache pages, particularly the sysv part. I have
> not thought it through in detail but posting patches is the easiest way
> to highlight where I think a problem might be.
>
> Changelog since v2
> o Drop an accounting patch, behaviour is deliberate
> o Special case tmpfs and shmem pages for discussion
>
> Changelog since v1
> o Fix a lot of brain damage in the configurable policy patch
> o Yoink a page cache annotation patch
> o Only account batch pages against allocations eligible for the fair policy
> o Add patch that default distributes file pages on remote nodes
>
> Commit 81c0a2bb ("mm: page_alloc: fair zone allocator policy") solved a
> bug whereby new pages could be reclaimed before old pages because of how
> the page allocator and kswapd interacted on the per-zone LRU lists.
Not just that, it was about ensuring predictable cache replacement and
maximizing the cache's effectiveness. This implicitely fixed the
kswapd interaction bug, but that was not the sole reason (I realize
that the original changelog is incomplete and I apologize for that).
I had offline discussions with Andrea back then and his first
suggestion was also to make this a zone fairness placement that is
exclusive to the local node, but eventually he agreed that the problem
applies just as much on the global level and that we should apply
fairness throughout the system as long as we honor zone_reclaim_mode
and hard bindings. During our discussions now, it turned out that
zone_reclaim_mode is a terrible predictor for preferred locality, but
we also more or less agreed that the locality issues in the first
place are not really applicable to cache loads dominated by IO cost.
So I think the main discrepancy between the original patch and what we
truly want is that aging fairness is really only relevant for actual
cache backed by secondary storage, because cache replacement is an
ongoing operation that involves IO. As opposed to memory types that
involve IO only in extreme cases (anon, tmpfs, shmem) or no IO at all
(slab, kernel allocations), in which case we prefer NUMA locality.
> Unfortunately a side-effect missed during review was that it's now very
> easy to allocate remote memory on NUMA machines. The problem is that
> it is not a simple case of just restoring local allocation policies as
> there are genuine reasons why global page aging may be preferable. It's
> still a major change to default behaviour so this patch makes the policy
> configurable and sets what I think is a sensible default.
>
> The patches are on top of some NUMA balancing patches currently in -mm.
> It's untested and posted to discuss patches 4 and 6.
It might be easier in dealing with -stable if we start with the
critical fix(es) to restore sane functionality as much and as compact
as possible and then place the cleanups on top?
In my local tree, I have the following as the first patch:
---
From: Johannes Weiner <hannes@cmpxchg.org>
Subject: [patch] mm: page_alloc: restrict fair allocator policy to page cache
81c0a2bb515f ("mm: page_alloc: fair zone allocator policy") was merged
in order to ensure predictable page cache replacement and to maximize
the cache's effectiveness in reducing IO regardless of zone or node
topology.
However, it was overzealous in round-robin placing every type of
allocation over all allowable nodes, instead of preferring locality,
which resulted in severe regressions on certain NUMA workloads that
have nothing to do with page cache.
This patch drastically reduces the impact of the original change by
having the round-robin placement policy only apply to page cache
backed by secondary storage, and no longer to anonymous memory, shmem,
tmpfs and slab allocations.
This still changes the long-standing behavior of page cache adhering
to the configured memory policy and preferring local allocations per
default, so make it configurable in case somebody relies on it.
However, we also expect the majority of users to prefer maximum cache
effectiveness and a predictable replacement behavior over memory
locality, so reflect this in the default setting of the sysctl.
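In other words, with the sysctl table entry below, a setup that depends on
the old behaviour would write 1 to /proc/sys/vm/pagecache_mempolicy_mode to
make page cache follow the configured memory policy again; the default of 0
keeps the fair interleaving.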
---
Documentation/sysctl/vm.txt | 21 +++++++++++++++++
Documentation/vm/numa_memory_policy.txt | 8 +++++++
include/linux/gfp.h | 4 +++-
include/linux/pagemap.h | 2 +-
include/linux/swap.h | 2 ++
kernel/sysctl.c | 8 +++++++
mm/filemap.c | 2 ++
mm/page_alloc.c | 41 +++++++++++++++++++++++++--------
8 files changed, 76 insertions(+), 12 deletions(-)
diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
index 1fbd4eb7b64a..50d250f7470f 100644
--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.txt
@@ -38,6 +38,7 @@ Currently, these files are in /proc/sys/vm:
- memory_failure_early_kill
- memory_failure_recovery
- min_free_kbytes
+- pagecache_mempolicy_mode
- min_slab_ratio
- min_unmapped_ratio
- mmap_min_addr
@@ -404,6 +405,26 @@ Setting this too high will OOM your machine instantly.
=============================================================
+pagecache_mempolicy_mode:
+
+This is available only on NUMA kernels.
+
+Per default, the configured memory policy is applicable to anonymous
+memory, shmem, tmpfs, etc., whereas pagecache is allocated in an
+interleaving fashion over all allowed nodes (hardbindings and
+zone_reclaim_mode excluded).
+
+The assumption is that, when it comes to pagecache, users generally
+prefer predictable replacement behavior regardless of NUMA topology
+and maximizing the cache's effectiveness in reducing IO over memory
+locality.
+
+This behavior can be changed by enabling pagecache_mempolicy_mode, in
+which case page cache allocations will be placed according to the
+configured memory policy (Documentation/vm/numa_memory_policy.txt).
+
+=============================================================
+
min_slab_ratio:
This is available only on NUMA kernels.
diff --git a/Documentation/vm/numa_memory_policy.txt b/Documentation/vm/numa_memory_policy.txt
index 4e7da6543424..64d48b6378db 100644
--- a/Documentation/vm/numa_memory_policy.txt
+++ b/Documentation/vm/numa_memory_policy.txt
@@ -16,6 +16,14 @@ programming interface that a NUMA-aware application can take advantage of. When
both cpusets and policies are applied to a task, the restrictions of the cpuset
takes priority. See "MEMORY POLICIES AND CPUSETS" below for more details.
+Note that, per default, the memory policies as described below apply to process
+memory and shmem/tmpfs/ramfs only. Pagecache backed by secondary storage will
+be interleaved fairly over all allowable nodes (respecting hardbindings and
+zone_reclaim_mode) in order to maximize the cache's effectiveness in reducing IO
+and to ensure predictable cache replacement. Special setups that require
+pagecache to adhere to the configured memory policy can change this behavior by
+enabling pagecache_mempolicy_mode (see Documentation/sysctl/vm.txt).
+
MEMORY POLICY CONCEPTS
Scope of Memory Policies
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 9b4dd491f7e8..f69e4cb78ccf 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -35,6 +35,7 @@ struct vm_area_struct;
#define ___GFP_NO_KSWAPD 0x400000u
#define ___GFP_OTHER_NODE 0x800000u
#define ___GFP_WRITE 0x1000000u
+#define ___GFP_PAGECACHE 0x2000000u
/* If the above are modified, __GFP_BITS_SHIFT may need updating */
/*
@@ -92,6 +93,7 @@ struct vm_area_struct;
#define __GFP_OTHER_NODE ((__force gfp_t)___GFP_OTHER_NODE) /* On behalf of other node */
#define __GFP_KMEMCG ((__force gfp_t)___GFP_KMEMCG) /* Allocation comes from a memcg-accounted resource */
#define __GFP_WRITE ((__force gfp_t)___GFP_WRITE) /* Allocator intends to dirty page */
+#define __GFP_PAGECACHE ((__force gfp_t)___GFP_PAGECACHE) /* Page cache allocation */
/*
* This may seem redundant, but it's a way of annotating false positives vs.
@@ -99,7 +101,7 @@ struct vm_area_struct;
*/
#define __GFP_NOTRACK_FALSE_POSITIVE (__GFP_NOTRACK)
-#define __GFP_BITS_SHIFT 25 /* Room for N __GFP_FOO bits */
+#define __GFP_BITS_SHIFT 26 /* Room for N __GFP_FOO bits */
#define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))
/* This equals 0, but use constants in case they ever change */
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index e3dea75a078b..bda48453af8e 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -221,7 +221,7 @@ extern struct page *__page_cache_alloc(gfp_t gfp);
#else
static inline struct page *__page_cache_alloc(gfp_t gfp)
{
- return alloc_pages(gfp, 0);
+ return alloc_pages(gfp | __GFP_PAGECACHE, 0);
}
#endif
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 46ba0c6c219f..3458994b0881 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -320,11 +320,13 @@ extern unsigned long vm_total_pages;
#ifdef CONFIG_NUMA
extern int zone_reclaim_mode;
+extern int pagecache_mempolicy_mode;
extern int sysctl_min_unmapped_ratio;
extern int sysctl_min_slab_ratio;
extern int zone_reclaim(struct zone *, gfp_t, unsigned int);
#else
#define zone_reclaim_mode 0
+#define pagecache_mempolicy_mode 0
static inline int zone_reclaim(struct zone *z, gfp_t mask, unsigned int order)
{
return 0;
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 34a604726d0b..a8c56c1dc98e 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1359,6 +1359,14 @@ static struct ctl_table vm_table[] = {
.extra1 = &zero,
},
{
+ .procname = "pagecache_mempolicy_mode",
+ .data = &pagecache_mempolicy_mode,
+ .maxlen = sizeof(pagecache_mempolicy_mode),
+ .mode = 0644,
+ .proc_handler = proc_dointvec,
+ .extra1 = &zero,
+ },
+ {
.procname = "min_unmapped_ratio",
.data = &sysctl_min_unmapped_ratio,
.maxlen = sizeof(sysctl_min_unmapped_ratio),
diff --git a/mm/filemap.c b/mm/filemap.c
index b7749a92021c..5bb922506906 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -517,6 +517,8 @@ struct page *__page_cache_alloc(gfp_t gfp)
int n;
struct page *page;
+ gfp |= __GFP_PAGECACHE;
+
if (cpuset_do_page_mem_spread()) {
unsigned int cpuset_mems_cookie;
do {
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 580a5f075ed0..b28370932950 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1547,7 +1547,15 @@ again:
get_pageblock_migratetype(page));
}
+ /*
+ * All allocations eat into the round-robin batch, even
+ * allocations that are not subject to round-robin placement
+ * themselves. This makes sure that allocations that ARE
+ * subject to round-robin placement compensate for the
+ * allocations that aren't, to have equal placement overall.
+ */
__mod_zone_page_state(zone, NR_ALLOC_BATCH, -(1 << order));
+
__count_zone_vm_events(PGALLOC, zone, 1 << order);
zone_statistics(preferred_zone, zone, gfp_flags);
local_irq_restore(flags);
@@ -1699,6 +1707,15 @@ bool zone_watermark_ok_safe(struct zone *z, int order, unsigned long mark,
#ifdef CONFIG_NUMA
/*
+ * pagecache_mempolicy_mode - whether page cache should honor the
+ * configured memory policy and allocate from the zonelist in order of
+ * preference, or whether it should be interleaved fairly over all
+ * allowed zones in the given zonelist to maximize cache effects and
+ * ensure predictable cache replacement.
+ */
+int pagecache_mempolicy_mode __read_mostly;
+
+/*
* zlc_setup - Setup for "zonelist cache". Uses cached zone data to
* skip over zones that are not allowed by the cpuset, or that have
* been recently (in last second) found to be nearly full. See further
@@ -1816,7 +1833,7 @@ static void zlc_clear_zones_full(struct zonelist *zonelist)
static bool zone_local(struct zone *local_zone, struct zone *zone)
{
- return node_distance(local_zone->node, zone->node) == LOCAL_DISTANCE;
+ return local_zone->node == zone->node;
}
static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone)
@@ -1908,22 +1925,25 @@ zonelist_scan:
if (unlikely(alloc_flags & ALLOC_NO_WATERMARKS))
goto try_this_zone;
/*
- * Distribute pages in proportion to the individual
- * zone size to ensure fair page aging. The zone a
- * page was allocated in should have no effect on the
- * time the page has in memory before being reclaimed.
+ * Distribute page cache pages in proportion to the
+ * individual zone size to ensure fair page aging.
+ * The zone a page was allocated in should have no
+ * effect on the time the page has in memory before
+ * being reclaimed.
*
- * When zone_reclaim_mode is enabled, try to stay in
- * local zones in the fastpath. If that fails, the
+ * When pagecache_mempolicy_mode or zone_reclaim_mode
+ * is enabled, try to allocate from zones within the
+ * preferred node in the fastpath. If that fails, the
* slowpath is entered, which will do another pass
* starting with the local zones, but ultimately fall
* back to remote zones that do not partake in the
* fairness round-robin cycle of this zonelist.
*/
- if (alloc_flags & ALLOC_WMARK_LOW) {
+ if ((alloc_flags & ALLOC_WMARK_LOW) &&
+ (gfp_mask & __GFP_PAGECACHE)) {
if (zone_page_state(zone, NR_ALLOC_BATCH) <= 0)
continue;
- if (zone_reclaim_mode &&
+ if ((zone_reclaim_mode || pagecache_mempolicy_mode) &&
!zone_local(preferred_zone, zone))
continue;
}
@@ -2390,7 +2410,8 @@ static void prepare_slowpath(gfp_t gfp_mask, unsigned int order,
* thrash fairness information for zones that are not
* actually part of this zonelist's round-robin cycle.
*/
- if (zone_reclaim_mode && !zone_local(preferred_zone, zone))
+ if ((zone_reclaim_mode || pagecache_mempolicy_mode) &&
+ !zone_local(preferred_zone, zone))
continue;
mod_zone_page_state(zone, NR_ALLOC_BATCH,
high_wmark_pages(zone) -
--
1.8.4.2
* Re: [RFC PATCH 0/6] Configurable fair allocation zone policy v3
2013-12-17 20:02 ` [RFC PATCH 0/6] Configurable fair allocation zone policy v3 Johannes Weiner
@ 2013-12-18 6:17 ` Johannes Weiner
2013-12-18 13:47 ` Rik van Riel
2013-12-18 15:00 ` Mel Gorman
2013-12-18 14:51 ` Michal Hocko
1 sibling, 2 replies; 21+ messages in thread
From: Johannes Weiner @ 2013-12-18 6:17 UTC (permalink / raw)
To: Mel Gorman; +Cc: Andrew Morton, Dave Hansen, Rik van Riel, Linux-MM, LKML
On Tue, Dec 17, 2013 at 03:02:10PM -0500, Johannes Weiner wrote:
> Hi Mel,
>
> On Tue, Dec 17, 2013 at 04:48:18PM +0000, Mel Gorman wrote:
> > This series is currently untested and is being posted to sync up discussions
> > on the treatment of page cache pages, particularly the sysv part. I have
> > not thought it through in detail but posting patches is the easiest way
> > to highlight where I think a problem might be.
> >
> > Changelog since v2
> > o Drop an accounting patch, behaviour is deliberate
> > o Special case tmpfs and shmem pages for discussion
> >
> > Changelog since v1
> > o Fix a lot of brain damage in the configurable policy patch
> > o Yoink a page cache annotation patch
> > o Only account batch pages against allocations eligible for the fair policy
> > o Add patch that default distributes file pages on remote nodes
> >
> > Commit 81c0a2bb ("mm: page_alloc: fair zone allocator policy") solved a
> > bug whereby new pages could be reclaimed before old pages because of how
> > the page allocator and kswapd interacted on the per-zone LRU lists.
>
> Not just that, it was about ensuring predictable cache replacement and
> maximizing the cache's effectiveness. This implicitly fixed the
> kswapd interaction bug, but that was not the sole reason (I realize
> that the original changelog is incomplete and I apologize for that).
>
> I had offline discussions with Andrea back then and his first
> suggestion was also to make this a zone fairness placement that is
> exclusive to the local node, but eventually he agreed that the problem
> applies just as much on the global level and that we should apply
> fairness throughout the system as long as we honor zone_reclaim_mode
> and hard bindings. During our discussions now, it turned out that
> zone_reclaim_mode is a terrible predictor for preferred locality, but
> we also more or less agreed that the locality issues in the first
> place are not really applicable to cache loads dominated by IO cost.
>
> So I think the main discrepancy between the original patch and what we
> truly want is that aging fairness is really only relevant for actual
> cache backed by secondary storage, because cache replacement is an
> ongoing operation that involves IO. As opposed to memory types that
> involve IO only in extreme cases (anon, tmpfs, shmem) or no IO at all
> (slab, kernel allocations), in which case we prefer NUMA locality.
>
> > Unfortunately a side-effect missed during review was that it's now very
> > easy to allocate remote memory on NUMA machines. The problem is that
> > it is not a simple case of just restoring local allocation policies as
> > there are genuine reasons why global page aging may be preferable. It's
> > still a major change to default behaviour so this patch makes the policy
> > configurable and sets what I think is a sensible default.
> >
> > The patches are on top of some NUMA balancing patches currently in -mm.
> > It's untested and posted to discuss patches 4 and 6.
>
> It might be easier in dealing with -stable if we start with the
> critical fix(es) to restore sane functionality as much and as compact
> as possible and then place the cleanups on top?
>
> In my local tree, I have the following as the first patch:
Updated version with your tmpfs __GFP_PAGECACHE parts added and
documentation, changelog updated as necessary. I remain unconvinced
that tmpfs pages should be round-robined, but I agree with you that it
is the conservative change to do for 3.13 and 3.12 and we can figure
out the rest later. I sure hope that this doesn't drive most people
on NUMA to disable pagecache interleaving right away as I expect most
tmpfs workloads to see little to no reclaim and prefer locality... :/
---
From: Johannes Weiner <hannes@cmpxchg.org>
Subject: [patch] mm: page_alloc: restrict fair allocator policy to pagecache
81c0a2bb515f ("mm: page_alloc: fair zone allocator policy") was merged
in order to ensure predictable pagecache replacement and to maximize
the cache's effectiveness in reducing IO regardless of zone or node
topology.
However, it was overzealous in round-robin placing every type of
allocation over all allowable nodes, instead of preferring locality,
which resulted in severe regressions on certain NUMA workloads that
have nothing to do with pagecache.
This patch drastically reduces the impact of the original change by
having the round-robin placement policy only apply to pagecache
allocations and no longer to anonymous memory, shmem, slab and other
types of kernel allocations.
This still changes the long-standing behavior of pagecache adhering to
the configured memory policy and preferring local allocations per
default, so make it configurable in case somebody relies on it.
However, we also expect the majority of users to prefer maximum cache
effectiveness and a predictable replacement behavior over memory
locality, so reflect this in the default setting of the sysctl.
No-signoff-without-Mel's
Cc: <stable@kernel.org> # 3.12
---
Documentation/sysctl/vm.txt | 20 ++++++++++++++++
Documentation/vm/numa_memory_policy.txt | 7 ++++++
include/linux/gfp.h | 4 +++-
include/linux/pagemap.h | 2 +-
include/linux/swap.h | 2 ++
kernel/sysctl.c | 8 +++++++
mm/filemap.c | 2 ++
mm/page_alloc.c | 41 +++++++++++++++++++++++++--------
mm/shmem.c | 14 +++++++++++
9 files changed, 88 insertions(+), 12 deletions(-)
diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
index 1fbd4eb7b64a..308c342f62ad 100644
--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.txt
@@ -38,6 +38,7 @@ Currently, these files are in /proc/sys/vm:
- memory_failure_early_kill
- memory_failure_recovery
- min_free_kbytes
+- pagecache_mempolicy_mode
- min_slab_ratio
- min_unmapped_ratio
- mmap_min_addr
@@ -404,6 +405,25 @@ Setting this too high will OOM your machine instantly.
=============================================================
+pagecache_mempolicy_mode:
+
+This is available only on NUMA kernels.
+
+Per default, pagecache is allocated in an interleaving fashion over
+all allowed nodes (hardbindings and zone_reclaim_mode excluded),
+regardless of the selected memory policy.
+
+The assumption is that, when it comes to pagecache, users generally
+prefer predictable replacement behavior regardless of NUMA topology
+and maximizing the cache's effectiveness in reducing IO over memory
+locality.
+
+This behavior can be changed by enabling pagecache_mempolicy_mode, in
+which case page cache allocations will be placed according to the
+configured memory policy (Documentation/vm/numa_memory_policy.txt).
+
+=============================================================
+
min_slab_ratio:
This is available only on NUMA kernels.
diff --git a/Documentation/vm/numa_memory_policy.txt b/Documentation/vm/numa_memory_policy.txt
index 4e7da6543424..72247e565908 100644
--- a/Documentation/vm/numa_memory_policy.txt
+++ b/Documentation/vm/numa_memory_policy.txt
@@ -16,6 +16,13 @@ programming interface that a NUMA-aware application can take advantage of. When
both cpusets and policies are applied to a task, the restrictions of the cpuset
takes priority. See "MEMORY POLICIES AND CPUSETS" below for more details.
+Note that, per default, the memory policies do not apply to pagecache. Instead
+it will be interleaved fairly over all allowable nodes (respecting hardbindings
+and zone_reclaim_mode) in order to maximize the cache's effectiveness in
+reducing IO and to ensure predictable cache replacement. Special setups that
+require pagecache to adhere to the configured memory policy can change this
+behavior by enabling pagecache_mempolicy_mode (see Documentation/sysctl/vm.txt).
+
MEMORY POLICY CONCEPTS
Scope of Memory Policies
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 9b4dd491f7e8..f69e4cb78ccf 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -35,6 +35,7 @@ struct vm_area_struct;
#define ___GFP_NO_KSWAPD 0x400000u
#define ___GFP_OTHER_NODE 0x800000u
#define ___GFP_WRITE 0x1000000u
+#define ___GFP_PAGECACHE 0x2000000u
/* If the above are modified, __GFP_BITS_SHIFT may need updating */
/*
@@ -92,6 +93,7 @@ struct vm_area_struct;
#define __GFP_OTHER_NODE ((__force gfp_t)___GFP_OTHER_NODE) /* On behalf of other node */
#define __GFP_KMEMCG ((__force gfp_t)___GFP_KMEMCG) /* Allocation comes from a memcg-accounted resource */
#define __GFP_WRITE ((__force gfp_t)___GFP_WRITE) /* Allocator intends to dirty page */
+#define __GFP_PAGECACHE ((__force gfp_t)___GFP_PAGECACHE) /* Page cache allocation */
/*
* This may seem redundant, but it's a way of annotating false positives vs.
@@ -99,7 +101,7 @@ struct vm_area_struct;
*/
#define __GFP_NOTRACK_FALSE_POSITIVE (__GFP_NOTRACK)
-#define __GFP_BITS_SHIFT 25 /* Room for N __GFP_FOO bits */
+#define __GFP_BITS_SHIFT 26 /* Room for N __GFP_FOO bits */
#define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))
/* This equals 0, but use constants in case they ever change */
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index e3dea75a078b..bda48453af8e 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -221,7 +221,7 @@ extern struct page *__page_cache_alloc(gfp_t gfp);
#else
static inline struct page *__page_cache_alloc(gfp_t gfp)
{
- return alloc_pages(gfp, 0);
+ return alloc_pages(gfp | __GFP_PAGECACHE, 0);
}
#endif
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 46ba0c6c219f..3458994b0881 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -320,11 +320,13 @@ extern unsigned long vm_total_pages;
#ifdef CONFIG_NUMA
extern int zone_reclaim_mode;
+extern int pagecache_mempolicy_mode;
extern int sysctl_min_unmapped_ratio;
extern int sysctl_min_slab_ratio;
extern int zone_reclaim(struct zone *, gfp_t, unsigned int);
#else
#define zone_reclaim_mode 0
+#define pagecache_mempolicy_mode 0
static inline int zone_reclaim(struct zone *z, gfp_t mask, unsigned int order)
{
return 0;
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 34a604726d0b..a8c56c1dc98e 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1359,6 +1359,14 @@ static struct ctl_table vm_table[] = {
.extra1 = &zero,
},
{
+ .procname = "pagecache_mempolicy_mode",
+ .data = &pagecache_mempolicy_mode,
+ .maxlen = sizeof(pagecache_mempolicy_mode),
+ .mode = 0644,
+ .proc_handler = proc_dointvec,
+ .extra1 = &zero,
+ },
+ {
.procname = "min_unmapped_ratio",
.data = &sysctl_min_unmapped_ratio,
.maxlen = sizeof(sysctl_min_unmapped_ratio),
diff --git a/mm/filemap.c b/mm/filemap.c
index b7749a92021c..5bb922506906 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -517,6 +517,8 @@ struct page *__page_cache_alloc(gfp_t gfp)
int n;
struct page *page;
+ gfp |= __GFP_PAGECACHE;
+
if (cpuset_do_page_mem_spread()) {
unsigned int cpuset_mems_cookie;
do {
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 580a5f075ed0..f7c0ecb5bb8b 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1547,7 +1547,15 @@ again:
get_pageblock_migratetype(page));
}
+ /*
+ * All allocations eat into the round-robin batch, even
+ * allocations that are not subject to round-robin placement
+ * themselves. This makes sure that allocations that ARE
+ * subject to round-robin placement compensate for the
+ * allocations that aren't, to have equal placement overall.
+ */
__mod_zone_page_state(zone, NR_ALLOC_BATCH, -(1 << order));
+
__count_zone_vm_events(PGALLOC, zone, 1 << order);
zone_statistics(preferred_zone, zone, gfp_flags);
local_irq_restore(flags);
@@ -1699,6 +1707,15 @@ bool zone_watermark_ok_safe(struct zone *z, int order, unsigned long mark,
#ifdef CONFIG_NUMA
/*
+ * pagecache_mempolicy_mode - whether pagecache allocations should
+ * honor the configured memory policy and allocate from the zonelist
+ * in order of preference, or whether they should interleave fairly
+ * over all allowed zones in the given zonelist to maximize cache
+ * effects and ensure predictable cache replacement.
+ */
+int pagecache_mempolicy_mode __read_mostly;
+
+/*
* zlc_setup - Setup for "zonelist cache". Uses cached zone data to
* skip over zones that are not allowed by the cpuset, or that have
* been recently (in last second) found to be nearly full. See further
@@ -1816,7 +1833,7 @@ static void zlc_clear_zones_full(struct zonelist *zonelist)
static bool zone_local(struct zone *local_zone, struct zone *zone)
{
- return node_distance(local_zone->node, zone->node) == LOCAL_DISTANCE;
+ return local_zone->node == zone->node;
}
static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone)
@@ -1908,22 +1925,25 @@ zonelist_scan:
if (unlikely(alloc_flags & ALLOC_NO_WATERMARKS))
goto try_this_zone;
/*
- * Distribute pages in proportion to the individual
- * zone size to ensure fair page aging. The zone a
- * page was allocated in should have no effect on the
- * time the page has in memory before being reclaimed.
+ * Distribute pagecache pages in proportion to the
+ * individual zone size to ensure fair page aging.
+ * The zone a page was allocated in should have no
+ * effect on the time the page has in memory before
+ * being reclaimed.
*
- * When zone_reclaim_mode is enabled, try to stay in
- * local zones in the fastpath. If that fails, the
+ * When pagecache_mempolicy_mode or zone_reclaim_mode
+ * is enabled, try to allocate from zones within the
+ * preferred node in the fastpath. If that fails, the
* slowpath is entered, which will do another pass
* starting with the local zones, but ultimately fall
* back to remote zones that do not partake in the
* fairness round-robin cycle of this zonelist.
*/
- if (alloc_flags & ALLOC_WMARK_LOW) {
+ if ((alloc_flags & ALLOC_WMARK_LOW) &&
+ (gfp_mask & __GFP_PAGECACHE)) {
if (zone_page_state(zone, NR_ALLOC_BATCH) <= 0)
continue;
- if (zone_reclaim_mode &&
+ if ((zone_reclaim_mode || pagecache_mempolicy_mode) &&
!zone_local(preferred_zone, zone))
continue;
}
@@ -2390,7 +2410,8 @@ static void prepare_slowpath(gfp_t gfp_mask, unsigned int order,
* thrash fairness information for zones that are not
* actually part of this zonelist's round-robin cycle.
*/
- if (zone_reclaim_mode && !zone_local(preferred_zone, zone))
+ if ((zone_reclaim_mode || pagecache_mempolicy_mode) &&
+ !zone_local(preferred_zone, zone))
continue;
mod_zone_page_state(zone, NR_ALLOC_BATCH,
high_wmark_pages(zone) -
diff --git a/mm/shmem.c b/mm/shmem.c
index 8297623fcaed..02d7a9c03463 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -929,6 +929,17 @@ static struct page *shmem_swapin(swp_entry_t swap, gfp_t gfp,
return page;
}
+/* Fugly method of distinguishing sysv/MAP_SHARED anon from tmpfs */
+static bool shmem_inode_on_tmpfs(struct shmem_inode_info *info)
+{
+ /* If no internal shm_mount then it must be tmpfs */
+ if (IS_ERR(shm_mnt))
+ return true;
+
+ /* Consider it to be tmpfs if the superblock is not the internal mount */
+ return info->vfs_inode.i_sb != shm_mnt->mnt_sb;
+}
+
static struct page *shmem_alloc_page(gfp_t gfp,
struct shmem_inode_info *info, pgoff_t index)
{
@@ -942,6 +953,9 @@ static struct page *shmem_alloc_page(gfp_t gfp,
pvma.vm_ops = NULL;
pvma.vm_policy = mpol_shared_policy_lookup(&info->policy, index);
+ if (shmem_inode_on_tmpfs(info))
+ gfp |= __GFP_PAGECACHE;
+
page = alloc_page_vma(gfp, &pvma, 0);
/* Drop reference taken by mpol_shared_policy_lookup() */
--
1.8.4.2
* Re: [RFC PATCH 0/6] Configurable fair allocation zone policy v3
2013-12-18 6:17 ` Johannes Weiner
@ 2013-12-18 13:47 ` Rik van Riel
2013-12-18 14:17 ` Johannes Weiner
2013-12-18 15:00 ` Mel Gorman
1 sibling, 1 reply; 21+ messages in thread
From: Rik van Riel @ 2013-12-18 13:47 UTC (permalink / raw)
To: Johannes Weiner; +Cc: Mel Gorman, Andrew Morton, Dave Hansen, Linux-MM, LKML
On 12/18/2013 01:17 AM, Johannes Weiner wrote:
> Updated version with your tmpfs __GFP_PAGECACHE parts added and
> documentation, changelog updated as necessary. I remain unconvinced
> that tmpfs pages should be round-robined, but I agree with you that it
> is the conservative change to do for 3.12 and 3.12 and we can figure
> out the rest later. I sure hope that this doesn't drive most people
> on NUMA to disable pagecache interleaving right away as I expect most
> tmpfs workloads to see little to no reclaim and prefer locality... :/
Actually, I suspect most tmpfs heavy workloads will be things like
databases with shared memory segments. Those tend to benefit from
having all of the system's memory bandwidth available. The worker
threads/processes tend to live all over the system, too...
--
All rights reversed
* Re: [RFC PATCH 0/6] Configurable fair allocation zone policy v3
2013-12-18 13:47 ` Rik van Riel
@ 2013-12-18 14:17 ` Johannes Weiner
0 siblings, 0 replies; 21+ messages in thread
From: Johannes Weiner @ 2013-12-18 14:17 UTC (permalink / raw)
To: Rik van Riel; +Cc: Mel Gorman, Andrew Morton, Dave Hansen, Linux-MM, LKML
On Wed, Dec 18, 2013 at 08:47:45AM -0500, Rik van Riel wrote:
> On 12/18/2013 01:17 AM, Johannes Weiner wrote:
>
> > Updated version with your tmpfs __GFP_PAGECACHE parts added and
> > documentation, changelog updated as necessary. I remain unconvinced
> > that tmpfs pages should be round-robined, but I agree with you that it
> > is the conservative change to do for 3.12 and 3.12 and we can figure
> > out the rest later. I sure hope that this doesn't drive most people
> > on NUMA to disable pagecache interleaving right away as I expect most
> > tmpfs workloads to see little to no reclaim and prefer locality... :/
>
> Actually, I suspect most tmpfs heavy workloads will be things like
> databases with shared memory segments. Those tend to benefit from
> having all of the system's memory bandwidth available. The worker
> threads/processes tend to live all over the system, too...
Shared memory segments are explicitly excluded from the interleaving,
though. The distinction is between the internal tmpfs mount that sysv
shmem uses (mempolicy) and tmpfs mounts that use the actual filesystem
interface (pagecache interleave).
* Re: [RFC PATCH 0/6] Configurable fair allocation zone policy v3
2013-12-18 6:17 ` Johannes Weiner
2013-12-18 13:47 ` Rik van Riel
@ 2013-12-18 15:00 ` Mel Gorman
2013-12-18 16:09 ` Mel Gorman
2013-12-18 19:48 ` Johannes Weiner
1 sibling, 2 replies; 21+ messages in thread
From: Mel Gorman @ 2013-12-18 15:00 UTC (permalink / raw)
To: Johannes Weiner; +Cc: Andrew Morton, Dave Hansen, Rik van Riel, Linux-MM, LKML
On Wed, Dec 18, 2013 at 01:17:50AM -0500, Johannes Weiner wrote:
> On Tue, Dec 17, 2013 at 03:02:10PM -0500, Johannes Weiner wrote:
> > Hi Mel,
> >
> > On Tue, Dec 17, 2013 at 04:48:18PM +0000, Mel Gorman wrote:
> > > This series is currently untested and is being posted to sync up discussions
> > > on the treatment of page cache pages, particularly the sysv part. I have
> > > not thought it through in detail but postings patches is the easiest way
> > > to highlight where I think a problem might be.
> > >
> > > Changelog since v2
> > > o Drop an accounting patch, behaviour is deliberate
> > > o Special case tmpfs and shmem pages for discussion
> > >
> > > Changelog since v1
> > > o Fix lot of brain damage in the configurable policy patch
> > > o Yoink a page cache annotation patch
> > > o Only account batch pages against allocations eligible for the fair policy
> > > o Add patch that default distributes file pages on remote nodes
> > >
> > > Commit 81c0a2bb ("mm: page_alloc: fair zone allocator policy") solved a
> > > bug whereby new pages could be reclaimed before old pages because of how
> > > the page allocator and kswapd interacted on the per-zone LRU lists.
> >
> > Not just that, it was about ensuring predictable cache replacement and
> > maximizing the cache's effectiveness. This implicitely fixed the
> > kswapd interaction bug, but that was not the sole reason (I realize
> > that the original changelog is incomplete and I apologize for that).
> >
> > I have had offline discussions with Andrea back then and his first
> > suggestion was too to make this a zone fairness placement that is
> > exclusive to the local node, but eventually he agreed that the problem
> > applies just as much on the global level and that we should apply
> > fairness throughout the system as long as we honor zone_reclaim_mode
> > and hard bindings. During our discussions now, it turned out that
> > zone_reclaim_mode is a terrible predictor for preferred locality, but
> > we also more or less agreed that the locality issues in the first
> > place are not really applicable to cache loads dominated by IO cost.
> >
> > So I think the main discrepancy between the original patch and what we
> > truly want is that aging fairness is really only relevant for actual
> > cache backed by secondary storage, because cache replacement is an
> > ongoing operation that involves IO. As opposed to memory types that
> > involve IO only in extreme cases (anon, tmpfs, shmem) or no IO at all
> > (slab, kernel allocations), in which case we prefer NUMA locality.
> >
> > > Unfortunately a side-effect missed during review was that it's now very
> > > easy to allocate remote memory on NUMA machines. The problem is that
> > > it is not a simple case of just restoring local allocation policies as
> > > there are genuine reasons why global page aging may be prefereable. It's
> > > still a major change to default behaviour so this patch makes the policy
> > > configurable and sets what I think is a sensible default.
> > >
> > > The patches are on top of some NUMA balancing patches currently in -mm.
> > > It's untested and posted to discuss patches 4 and 6.
> >
> > It might be easier in dealing with -stable if we start with the
> > critical fix(es) to restore sane functionality as much and as compact
> > as possible and then place the cleanups on top?
> >
> > In my local tree, I have the following as the first patch:
>
> Updated version with your tmpfs __GFP_PAGECACHE parts added and
> documentation, changelog updated as necessary. I remain unconvinced
> that tmpfs pages should be round-robined, but I agree with you that it
> is the conservative change to do for 3.12 and 3.12 and we can figure
> out the rest later.
Assume you mean 3.12 and 3.13 here.
> I sure hope that this doesn't drive most people
> on NUMA to disable pagecache interleaving right away as I expect most
> tmpfs workloads to see little to no reclaim and prefer locality... :/
>
I hope you're right but I expect the experience will be like
zone_reclaim_mode. We're going to be looking out for bug reports that
are "fixed" by disabling pagecache locality and pushing back on them by
fixing the real problem.
This was the experience with zone_reclaim_mode when it started going
wrong. It was also the experience with THP for a very long time.
Disabling THP was a workaround for all sorts of problems and it was very
important to fix them and push back on anyone documenting disabling THP
as a standard workaround.
> ---
> From: Johannes Weiner <hannes@cmpxchg.org>
> Subject: [patch] mm: page_alloc: restrict fair allocator policy to pagecache
>
Monolithic patch with multiple changes but meh. I'm not pushed because I
know what the breakout looks like. FWIW, I had intended the entire of my
broken-out series for 3.12 and 3.13 once it got ironed out. I find the
series easier to understand but of course I would.
> 81c0a2bb515f ("mm: page_alloc: fair zone allocator policy") was merged
> in order to ensure predictable pagecache replacement and to maximize
> the cache's effectiveness of reducing IO regardless of zone or node
> topology.
>
> However, it was overzealous in round-robin placing every type of
> allocation over all allowable nodes, instead of preferring locality,
> which resulted in severe regressions on certain NUMA workloads that
> have nothing to do with pagecache.
>
> This patch drastically reduces the impact of the original change by
> having the round-robin placement policy only apply to pagecache
> allocations and no longer to anonymous memory, shmem, slab and other
> types of kernel allocations.
>
> This still changes the long-standing behavior of pagecache adhering to
> the configured memory policy and preferring local allocations per
> default, so make it configurable in case somebody relies on it.
> However, we also expect the majority of users to prefer maximum cache
> effectiveness and a predictable replacement behavior over memory
> locality, so reflect this in the default setting of the sysctl.
>
> No-signoff-without-Mel's
> Cc: <stable@kernel.org> # 3.12
> ---
> Documentation/sysctl/vm.txt | 20 ++++++++++++++++
> Documentation/vm/numa_memory_policy.txt | 7 ++++++
> include/linux/gfp.h | 4 +++-
> include/linux/pagemap.h | 2 +-
> include/linux/swap.h | 2 ++
> kernel/sysctl.c | 8 +++++++
> mm/filemap.c | 2 ++
> mm/page_alloc.c | 41 +++++++++++++++++++++++++--------
> mm/shmem.c | 14 +++++++++++
> 9 files changed, 88 insertions(+), 12 deletions(-)
>
> diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
> index 1fbd4eb7b64a..308c342f62ad 100644
> --- a/Documentation/sysctl/vm.txt
> +++ b/Documentation/sysctl/vm.txt
> @@ -38,6 +38,7 @@ Currently, these files are in /proc/sys/vm:
> - memory_failure_early_kill
> - memory_failure_recovery
> - min_free_kbytes
> +- pagecache_mempolicy_mode
> - min_slab_ratio
> - min_unmapped_ratio
> - mmap_min_addr
Sure about the name?
This is a boolean and "mode" implies it might be a bitmask. That said, I
recognise that my own naming also sucked because complaining about yours
I can see that mine also sucks.
> @@ -404,6 +405,25 @@ Setting this too high will OOM your machine instantly.
>
> =============================================================
>
> +pagecache_mempolicy_mode:
> +
> +This is available only on NUMA kernels.
> +
> +Per default, pagecache is allocated in an interleaving fashion over
> +all allowed nodes (hardbindings and zone_reclaim_mode excluded),
> +regardless of the selected memory policy.
> +
> +The assumption is that, when it comes to pagecache, users generally
> +prefer predictable replacement behavior regardless of NUMA topology
> +and maximizing the cache's effectiveness in reducing IO over memory
> +locality.
> +
> +This behavior can be changed by enabling pagecache_mempolicy_mode, in
> +which case page cache allocations will be placed according to the
> +configured memory policy (Documentation/vm/numa_memory_policy.txt).
> +
Ok this indicates that pagecache will still be interleaved on zones local
to the node the process is allocating on. Good because that preserves a
very important aspect of your original patch.
The current description feels a little backwards though -- "Enable this
to *not* interleave pagecache". This documented behaviour says to me
that pagecache_obey_mempolicy might be a better name if enabling it uses
the system default memory policy. However, even that might put us in a
corner. Ultimately we want this to be controllable on a per-process basis
using memory policies.
Merging what I have in v3, the unreleased v4 and this patch, I ended up with
the following. The observation about cpusets was raised by Michal Hocko on IRC.
---8<---
mpol_interleave_files
This is available only on NUMA kernels.
Historically, the default behaviour of the system is to allocate memory
local to the process. The behaviour was usually modified through the use
of memory policies while zone_reclaim_mode controls how strict the local
memory allocation policy is.
Issues arise when the allocating process is frequently running on the same
node. The kernel's memory reclaim daemon runs one instance per NUMA node.
A consequence is that relatively new memory may be reclaimed by kswapd when
the allocating process is running on a specific node. The user-visible
impact is that the system appears to do more IO than necessary when a
workload is accessing files that are larger than a given NUMA node.
To address this problem, the default system memory policy is modified by
this tunable.
When this tunable is enabled, the system default memory policy will
interleave batches of file-backed pages over all allowed zones and nodes.
The assumption is that, when it comes to file pages, users generally
prefer predictable replacement behavior regardless of NUMA topology and
maximizing the page cache's effectiveness in reducing IO over memory
locality.
The tunable zone_reclaim_mode overrides this and enabling zone_reclaim_mode
functionally disables mpol_interleave_files.
A process running within a memory cpuset will obey the cpuset policy and
ignore mpol_interleave_files.
At the time of writing, this parameter cannot be overridden by a process
using set_mempolicy to set the task memory policy. Similarly, numactl
setting the task memory policy will not override this setting. This may
change in the future.
The tunable is enabled by default and has two recognised values:
0: Use the MPOL_LOCAL policy as the system-wide default
1: Batch interleave file-backed allocations over all allowed nodes
Once enabled, the downside is that some file accesses will now be to remote
memory even though the local node had available resources. This will hurt
workloads with small or short lived files that fit easily within one node.
The upside is that workloads working on files larger than a NUMA node will
not reclaim active pages prematurely.
---8<---
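As an untested illustration only (interleave_file_alloc() is a name invented
here, not part of any posted patch, and mpol_interleave_files is just the
draft sysctl name above), the fastpath gate under such a tunable might look
roughly like:

static inline bool interleave_file_alloc(gfp_t gfp_mask, int alloc_flags)
{
	/* Only the watermark-constrained fastpath round-robins at all */
	if (!(alloc_flags & ALLOC_WMARK_LOW))
		return false;

	/* Tunable set to 0: plain MPOL_LOCAL placement, no batch checks */
	if (!mpol_interleave_files)
		return false;

	/* Only file-backed pagecache takes part in the fair cycle */
	return !!(gfp_mask & __GFP_PAGECACHE);
}

The point being that the tunable only changes the default placement of
__GFP_PAGECACHE allocations; everything else stays local-first.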
> +=============================================================
> +
> min_slab_ratio:
>
> This is available only on NUMA kernels.
> diff --git a/Documentation/vm/numa_memory_policy.txt b/Documentation/vm/numa_memory_policy.txt
> index 4e7da6543424..72247e565908 100644
> --- a/Documentation/vm/numa_memory_policy.txt
> +++ b/Documentation/vm/numa_memory_policy.txt
> @@ -16,6 +16,13 @@ programming interface that a NUMA-aware application can take advantage of. When
> both cpusets and policies are applied to a task, the restrictions of the cpuset
> takes priority. See "MEMORY POLICIES AND CPUSETS" below for more details.
>
> +Note that, per default, the memory policies do not apply to pagecache. Instead
> +it will be interleaved fairly over all allowable nodes (respecting hardbindings
> +and zone_reclaim_mode) in order to maximize the cache's effectiveness in
> +reducing IO and to ensure predictable cache replacement. Special setups that
> +require pagecache to adhere to the configured memory policy can change this
> +behavior by enabling pagecache_mempolicy_mode (see Documentation/sysctl/vm.txt).
> +
Manual pages should also be updated.
> MEMORY POLICY CONCEPTS
>
> Scope of Memory Policies
> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index 9b4dd491f7e8..f69e4cb78ccf 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -35,6 +35,7 @@ struct vm_area_struct;
> #define ___GFP_NO_KSWAPD 0x400000u
> #define ___GFP_OTHER_NODE 0x800000u
> #define ___GFP_WRITE 0x1000000u
> +#define ___GFP_PAGECACHE 0x2000000u
> /* If the above are modified, __GFP_BITS_SHIFT may need updating */
>
> /*
> @@ -92,6 +93,7 @@ struct vm_area_struct;
> #define __GFP_OTHER_NODE ((__force gfp_t)___GFP_OTHER_NODE) /* On behalf of other node */
> #define __GFP_KMEMCG ((__force gfp_t)___GFP_KMEMCG) /* Allocation comes from a memcg-accounted resource */
> #define __GFP_WRITE ((__force gfp_t)___GFP_WRITE) /* Allocator intends to dirty page */
> +#define __GFP_PAGECACHE ((__force gfp_t)___GFP_PAGECACHE) /* Page cache allocation */
>
> /*
> * This may seem redundant, but it's a way of annotating false positives vs.
> @@ -99,7 +101,7 @@ struct vm_area_struct;
> */
> #define __GFP_NOTRACK_FALSE_POSITIVE (__GFP_NOTRACK)
>
> -#define __GFP_BITS_SHIFT 25 /* Room for N __GFP_FOO bits */
> +#define __GFP_BITS_SHIFT 26 /* Room for N __GFP_FOO bits */
> #define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))
>
> /* This equals 0, but use constants in case they ever change */
> diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> index e3dea75a078b..bda48453af8e 100644
> --- a/include/linux/pagemap.h
> +++ b/include/linux/pagemap.h
> @@ -221,7 +221,7 @@ extern struct page *__page_cache_alloc(gfp_t gfp);
> #else
> static inline struct page *__page_cache_alloc(gfp_t gfp)
> {
> - return alloc_pages(gfp, 0);
> + return alloc_pages(gfp | __GFP_PAGECACHE, 0);
> }
> #endif
>
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 46ba0c6c219f..3458994b0881 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -320,11 +320,13 @@ extern unsigned long vm_total_pages;
>
> #ifdef CONFIG_NUMA
> extern int zone_reclaim_mode;
> +extern int pagecache_mempolicy_mode;
> extern int sysctl_min_unmapped_ratio;
> extern int sysctl_min_slab_ratio;
> extern int zone_reclaim(struct zone *, gfp_t, unsigned int);
> #else
> #define zone_reclaim_mode 0
> +#define pagecache_mempolicy_mode 0
> static inline int zone_reclaim(struct zone *z, gfp_t mask, unsigned int order)
> {
> return 0;
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index 34a604726d0b..a8c56c1dc98e 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -1359,6 +1359,14 @@ static struct ctl_table vm_table[] = {
> .extra1 = &zero,
> },
> {
> + .procname = "pagecache_mempolicy_mode",
> + .data = &pagecache_mempolicy_mode,
> + .maxlen = sizeof(pagecache_mempolicy_mode),
> + .mode = 0644,
> + .proc_handler = proc_dointvec,
> + .extra1 = &zero,
> + },
> + {
> .procname = "min_unmapped_ratio",
> .data = &sysctl_min_unmapped_ratio,
> .maxlen = sizeof(sysctl_min_unmapped_ratio),
> diff --git a/mm/filemap.c b/mm/filemap.c
> index b7749a92021c..5bb922506906 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -517,6 +517,8 @@ struct page *__page_cache_alloc(gfp_t gfp)
> int n;
> struct page *page;
>
> + gfp |= __GFP_PAGECACHE;
> +
> if (cpuset_do_page_mem_spread()) {
> unsigned int cpuset_mems_cookie;
> do {
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 580a5f075ed0..f7c0ecb5bb8b 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1547,7 +1547,15 @@ again:
> get_pageblock_migratetype(page));
> }
>
> + /*
> + * All allocations eat into the round-robin batch, even
> + * allocations that are not subject to round-robin placement
> + * themselves. This makes sure that allocations that ARE
> + * subject to round-robin placement compensate for the
> + * allocations that aren't, to have equal placement overall.
> + */
> __mod_zone_page_state(zone, NR_ALLOC_BATCH, -(1 << order));
> +
> __count_zone_vm_events(PGALLOC, zone, 1 << order);
> zone_statistics(preferred_zone, zone, gfp_flags);
> local_irq_restore(flags);
Thanks.
> @@ -1699,6 +1707,15 @@ bool zone_watermark_ok_safe(struct zone *z, int order, unsigned long mark,
>
> #ifdef CONFIG_NUMA
> /*
> + * pagecache_mempolicy_mode - whether pagecache allocations should
> + * honor the configured memory policy and allocate from the zonelist
> + * in order of preference, or whether they should interleave fairly
> + * over all allowed zones in the given zonelist to maximize cache
> + * effects and ensure predictable cache replacement.
> + */
> +int pagecache_mempolicy_mode __read_mostly;
> +
> +/*
> * zlc_setup - Setup for "zonelist cache". Uses cached zone data to
> * skip over zones that are not allowed by the cpuset, or that have
> * been recently (in last second) found to be nearly full. See further
> @@ -1816,7 +1833,7 @@ static void zlc_clear_zones_full(struct zonelist *zonelist)
>
> static bool zone_local(struct zone *local_zone, struct zone *zone)
> {
> - return node_distance(local_zone->node, zone->node) == LOCAL_DISTANCE;
> + return local_zone->node == zone->node;
> }
Does that not break on !CONFIG_NUMA?
It's why I used zone_to_nid
>
> static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone)
> @@ -1908,22 +1925,25 @@ zonelist_scan:
> if (unlikely(alloc_flags & ALLOC_NO_WATERMARKS))
> goto try_this_zone;
> /*
> - * Distribute pages in proportion to the individual
> - * zone size to ensure fair page aging. The zone a
> - * page was allocated in should have no effect on the
> - * time the page has in memory before being reclaimed.
> + * Distribute pagecache pages in proportion to the
> + * individual zone size to ensure fair page aging.
> + * The zone a page was allocated in should have no
> + * effect on the time the page has in memory before
> + * being reclaimed.
> *
> - * When zone_reclaim_mode is enabled, try to stay in
> - * local zones in the fastpath. If that fails, the
> + * When pagecache_mempolicy_mode or zone_reclaim_mode
> + * is enabled, try to allocate from zones within the
> + * preferred node in the fastpath. If that fails, the
> * slowpath is entered, which will do another pass
> * starting with the local zones, but ultimately fall
> * back to remote zones that do not partake in the
> * fairness round-robin cycle of this zonelist.
> */
> - if (alloc_flags & ALLOC_WMARK_LOW) {
> + if ((alloc_flags & ALLOC_WMARK_LOW) &&
> + (gfp_mask & __GFP_PAGECACHE)) {
> if (zone_page_state(zone, NR_ALLOC_BATCH) <= 0)
> continue;
NR_ALLOC_BATCH is updated regardless of zone_reclaim_mode or
pagecache_mempolicy_mode. We only reset batch in the prepare_slowpath in
some cases. Looks a bit fishy even though I can't quite put my finger on it.
I also got details wrong here in the v3 of the series. In an unreleased
v4 of the series I had corrected the treatment of slab pages in line
with your wishes and reused the broken out helper in prepare_slowpath to
keep the decision in sync.
It's still in development but even if it gets rejected it'll act as a
comparison point to yours.
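Roughly, the idea is something like the following untested sketch (the helper
name is invented here rather than taken from the unreleased v4):

static bool alloc_fair_eligible(struct zone *preferred_zone,
				struct zone *zone, gfp_t gfp_mask)
{
	/* Only pagecache allocations are round-robined at all */
	if (!(gfp_mask & __GFP_PAGECACHE))
		return false;

	/* Remote zones drop out when locality is requested */
	if ((zone_reclaim_mode || pagecache_mempolicy_mode) &&
	    !zone_local(preferred_zone, zone))
		return false;

	return true;
}

with both get_page_from_freelist() and prepare_slowpath() calling it, so the
eligibility decision cannot drift between the two sites.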
> - if (zone_reclaim_mode &&
> + if ((zone_reclaim_mode || pagecache_mempolicy_mode) &&
> !zone_local(preferred_zone, zone))
> continue;
> }
Documentation says "enabling pagecache_mempolicy_mode, in which case page cache
allocations will be placed according to the configured memory policy". Should
that be !pagecache_mempolicy_mode? I'm getting confused with the double nots.
Breaking this out would be more comprehensible.
On a semi-related note, we might encounter a problem later where the
interleaving causes us to skip over usable zones while the zones with available
batches are !zone_dirty_ok. We'd fall back to the slowpath resetting the
batches so it will not be particularly visible but there might be some
interactions there.
> @@ -2390,7 +2410,8 @@ static void prepare_slowpath(gfp_t gfp_mask, unsigned int order,
> * thrash fairness information for zones that are not
> * actually part of this zonelist's round-robin cycle.
> */
> - if (zone_reclaim_mode && !zone_local(preferred_zone, zone))
> + if ((zone_reclaim_mode || pagecache_mempolicy_mode) &&
> + !zone_local(preferred_zone, zone))
> continue;
> mod_zone_page_state(zone, NR_ALLOC_BATCH,
> high_wmark_pages(zone) -
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 8297623fcaed..02d7a9c03463 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -929,6 +929,17 @@ static struct page *shmem_swapin(swp_entry_t swap, gfp_t gfp,
> return page;
> }
>
> +/* Fugly method of distinguishing sysv/MAP_SHARED anon from tmpfs */
> +static bool shmem_inode_on_tmpfs(struct shmem_inode_info *info)
> +{
> + /* If no internal shm_mount then it must be tmpfs */
> + if (IS_ERR(shm_mnt))
> + return true;
> +
> + /* Consider it to be tmpfs if the superblock is not the internal mount */
> + return info->vfs_inode.i_sb != shm_mnt->mnt_sb;
> +}
> +
> static struct page *shmem_alloc_page(gfp_t gfp,
> struct shmem_inode_info *info, pgoff_t index)
> {
> @@ -942,6 +953,9 @@ static struct page *shmem_alloc_page(gfp_t gfp,
> pvma.vm_ops = NULL;
> pvma.vm_policy = mpol_shared_policy_lookup(&info->policy, index);
>
> + if (shmem_inode_on_tmpfs(info))
> + gfp |= __GFP_PAGECACHE;
> +
> page = alloc_page_vma(gfp, &pvma, 0);
>
> /* Drop reference taken by mpol_shared_policy_lookup() */
For what it's worth, this is what I've currently kicked off tests for
git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux-balancenuma.git mm-pgalloc-interleave-zones-v4r12
--
Mel Gorman
SUSE Labs
* Re: [RFC PATCH 0/6] Configurable fair allocation zone policy v3
2013-12-18 15:00 ` Mel Gorman
@ 2013-12-18 16:09 ` Mel Gorman
2013-12-18 19:48 ` Johannes Weiner
1 sibling, 0 replies; 21+ messages in thread
From: Mel Gorman @ 2013-12-18 16:09 UTC (permalink / raw)
To: Johannes Weiner; +Cc: Andrew Morton, Dave Hansen, Rik van Riel, Linux-MM, LKML
On Wed, Dec 18, 2013 at 03:00:38PM +0000, Mel Gorman wrote:
>
> For what it's worth, this is what I've currently kicked off tests for
>
> git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux-balancenuma.git mm-pgalloc-interleave-zones-v4r12
>
Pushed a dirty tree by accident. Now mm-pgalloc-interleave-zones-v4r13
--
Mel Gorman
SUSE Labs
* Re: [RFC PATCH 0/6] Configurable fair allocation zone policy v3
2013-12-18 15:00 ` Mel Gorman
2013-12-18 16:09 ` Mel Gorman
@ 2013-12-18 19:48 ` Johannes Weiner
2013-12-19 11:20 ` Mel Gorman
1 sibling, 1 reply; 21+ messages in thread
From: Johannes Weiner @ 2013-12-18 19:48 UTC (permalink / raw)
To: Mel Gorman; +Cc: Andrew Morton, Dave Hansen, Rik van Riel, Linux-MM, LKML
On Wed, Dec 18, 2013 at 03:00:38PM +0000, Mel Gorman wrote:
> On Wed, Dec 18, 2013 at 01:17:50AM -0500, Johannes Weiner wrote:
> > On Tue, Dec 17, 2013 at 03:02:10PM -0500, Johannes Weiner wrote:
> > > Hi Mel,
> > >
> > > On Tue, Dec 17, 2013 at 04:48:18PM +0000, Mel Gorman wrote:
> > > > This series is currently untested and is being posted to sync up discussions
> > > > on the treatment of page cache pages, particularly the sysv part. I have
> > > > not thought it through in detail but postings patches is the easiest way
> > > > to highlight where I think a problem might be.
> > > >
> > > > Changelog since v2
> > > > o Drop an accounting patch, behaviour is deliberate
> > > > o Special case tmpfs and shmem pages for discussion
> > > >
> > > > Changelog since v1
> > > > o Fix lot of brain damage in the configurable policy patch
> > > > o Yoink a page cache annotation patch
> > > > o Only account batch pages against allocations eligible for the fair policy
> > > > o Add patch that default distributes file pages on remote nodes
> > > >
> > > > Commit 81c0a2bb ("mm: page_alloc: fair zone allocator policy") solved a
> > > > bug whereby new pages could be reclaimed before old pages because of how
> > > > the page allocator and kswapd interacted on the per-zone LRU lists.
> > >
> > > Not just that, it was about ensuring predictable cache replacement and
> > > maximizing the cache's effectiveness. This implicitely fixed the
> > > kswapd interaction bug, but that was not the sole reason (I realize
> > > that the original changelog is incomplete and I apologize for that).
> > >
> > > I have had offline discussions with Andrea back then and his first
> > > suggestion was too to make this a zone fairness placement that is
> > > exclusive to the local node, but eventually he agreed that the problem
> > > applies just as much on the global level and that we should apply
> > > fairness throughout the system as long as we honor zone_reclaim_mode
> > > and hard bindings. During our discussions now, it turned out that
> > > zone_reclaim_mode is a terrible predictor for preferred locality, but
> > > we also more or less agreed that the locality issues in the first
> > > place are not really applicable to cache loads dominated by IO cost.
> > >
> > > So I think the main discrepancy between the original patch and what we
> > > truly want is that aging fairness is really only relevant for actual
> > > cache backed by secondary storage, because cache replacement is an
> > > ongoing operation that involves IO. As opposed to memory types that
> > > involve IO only in extreme cases (anon, tmpfs, shmem) or no IO at all
> > > (slab, kernel allocations), in which case we prefer NUMA locality.
> > >
> > > > Unfortunately a side-effect missed during review was that it's now very
> > > > easy to allocate remote memory on NUMA machines. The problem is that
> > > > it is not a simple case of just restoring local allocation policies as
> > > > there are genuine reasons why global page aging may be prefereable. It's
> > > > still a major change to default behaviour so this patch makes the policy
> > > > configurable and sets what I think is a sensible default.
> > > >
> > > > The patches are on top of some NUMA balancing patches currently in -mm.
> > > > It's untested and posted to discuss patches 4 and 6.
> > >
> > > It might be easier in dealing with -stable if we start with the
> > > critical fix(es) to restore sane functionality as much and as compact
> > > as possible and then place the cleanups on top?
> > >
> > > In my local tree, I have the following as the first patch:
> >
> > Updated version with your tmpfs __GFP_PAGECACHE parts added and
> > documentation, changelog updated as necessary. I remain unconvinced
> > that tmpfs pages should be round-robined, but I agree with you that it
> > is the conservative change to do for 3.12 and 3.12 and we can figure
> > out the rest later.
>
> Assume you mean 3.12 and 3.13 here.
Yes :)
> > ---
> > From: Johannes Weiner <hannes@cmpxchg.org>
> > Subject: [patch] mm: page_alloc: restrict fair allocator policy to pagecache
> >
>
> Monolithic patch with multiple changes but meh. I'm not pushed because I
> know what the breakout looks like. FWIW, I had intended the entire of my
> broken-out series for 3.12 and 3.13 once it got ironed out. I find the
> series easier to understand but of course I would.
And of course I can live without the cleanups to make code I wrote
more readable ;-) I'm happy to defer on this, let's keep logical
changes separated.
> > 81c0a2bb515f ("mm: page_alloc: fair zone allocator policy") was merged
> > in order to ensure predictable pagecache replacement and to maximize
> > the cache's effectiveness of reducing IO regardless of zone or node
> > topology.
> >
> > However, it was overzealous in round-robin placing every type of
> > allocation over all allowable nodes, instead of preferring locality,
> > which resulted in severe regressions on certain NUMA workloads that
> > have nothing to do with pagecache.
> >
> > This patch drastically reduces the impact of the original change by
> > having the round-robin placement policy only apply to pagecache
> > allocations and no longer to anonymous memory, shmem, slab and other
> > types of kernel allocations.
> >
> > This still changes the long-standing behavior of pagecache adhering to
> > the configured memory policy and preferring local allocations per
> > default, so make it configurable in case somebody relies on it.
> > However, we also expect the majority of users to prefer maximum cache
> > effectiveness and a predictable replacement behavior over memory
> > locality, so reflect this in the default setting of the sysctl.
> >
> > No-signoff-without-Mel's
> > Cc: <stable@kernel.org> # 3.12
> > ---
> > Documentation/sysctl/vm.txt | 20 ++++++++++++++++
> > Documentation/vm/numa_memory_policy.txt | 7 ++++++
> > include/linux/gfp.h | 4 +++-
> > include/linux/pagemap.h | 2 +-
> > include/linux/swap.h | 2 ++
> > kernel/sysctl.c | 8 +++++++
> > mm/filemap.c | 2 ++
> > mm/page_alloc.c | 41 +++++++++++++++++++++++++--------
> > mm/shmem.c | 14 +++++++++++
> > 9 files changed, 88 insertions(+), 12 deletions(-)
> >
> > diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
> > index 1fbd4eb7b64a..308c342f62ad 100644
> > --- a/Documentation/sysctl/vm.txt
> > +++ b/Documentation/sysctl/vm.txt
> > @@ -38,6 +38,7 @@ Currently, these files are in /proc/sys/vm:
> > - memory_failure_early_kill
> > - memory_failure_recovery
> > - min_free_kbytes
> > +- pagecache_mempolicy_mode
> > - min_slab_ratio
> > - min_unmapped_ratio
> > - mmap_min_addr
>
> Sure about the name?
>
> This is a boolean and "mode" implies it might be a bitmask. That said, I
> recognise that my own naming also sucked because complaining about yours
> I can see that mine also sucks.
Is it because of how we use zone_reclaim_mode? I don't see anything
wrong with a "mode" toggle that switches between only two modes of
operation instead of three or more. But English being a second
language and all...
> > @@ -1816,7 +1833,7 @@ static void zlc_clear_zones_full(struct zonelist *zonelist)
> >
> > static bool zone_local(struct zone *local_zone, struct zone *zone)
> > {
> > - return node_distance(local_zone->node, zone->node) == LOCAL_DISTANCE;
> > + return local_zone->node == zone->node;
> > }
>
> Does that not break on !CONFIG_NUMA?
>
> It's why I used zone_to_nid
There is a separate definition for !CONFIG_NUMA, it fit nicely next to
the zlc stuff.
> > static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone)
> > @@ -1908,22 +1925,25 @@ zonelist_scan:
> > if (unlikely(alloc_flags & ALLOC_NO_WATERMARKS))
> > goto try_this_zone;
> > /*
> > - * Distribute pages in proportion to the individual
> > - * zone size to ensure fair page aging. The zone a
> > - * page was allocated in should have no effect on the
> > - * time the page has in memory before being reclaimed.
> > + * Distribute pagecache pages in proportion to the
> > + * individual zone size to ensure fair page aging.
> > + * The zone a page was allocated in should have no
> > + * effect on the time the page has in memory before
> > + * being reclaimed.
> > *
> > - * When zone_reclaim_mode is enabled, try to stay in
> > - * local zones in the fastpath. If that fails, the
> > + * When pagecache_mempolicy_mode or zone_reclaim_mode
> > + * is enabled, try to allocate from zones within the
> > + * preferred node in the fastpath. If that fails, the
> > * slowpath is entered, which will do another pass
> > * starting with the local zones, but ultimately fall
> > * back to remote zones that do not partake in the
> > * fairness round-robin cycle of this zonelist.
> > */
> > - if (alloc_flags & ALLOC_WMARK_LOW) {
> > + if ((alloc_flags & ALLOC_WMARK_LOW) &&
> > + (gfp_mask & __GFP_PAGECACHE)) {
> > if (zone_page_state(zone, NR_ALLOC_BATCH) <= 0)
> > continue;
>
> NR_ALLOC_BATCH is updated regardless of zone_reclaim_mode or
> pagecache_mempolicy_mode. We only reset batch in the prepare_slowpath in
> some cases. Looks a bit fishy even though I can't quite put my finger on it.
>
> I also got details wrong here in the v3 of the series. In an unreleased
> v4 of the series I had corrected the treatment of slab pages in line
> with your wishes and reused the broken out helper in prepare_slowpath to
> keep the decision in sync.
>
> It's still in development but even if it gets rejected it'll act as a
> comparison point to yours.
>
> > - if (zone_reclaim_mode &&
> > + if ((zone_reclaim_mode || pagecache_mempolicy_mode) &&
> > !zone_local(preferred_zone, zone))
> > continue;
> > }
>
> Documentation says "enabling pagecache_mempolicy_mode, in which case page cache
> allocations will be placed according to the configured memory policy". Should
> that be !pagecache_mempolicy_mode? I'm getting confused with the double nots.
Yes, it's a bit weird.
We want to consider the round-robin batches for local zones but at the
same time avoid exhausted batches from pushing the allocation off-node
when either of those modes are enabled. So in the fastpath we filter
for both and in the slowpath, once kswapd has been woken at the same
time that the batches have been reset to launch the new aging cycle,
we try in order of zonelist preference.
However, to answer your question above, if the slowpath still has to
fall back to a remote zone, we don't want to reset its batch because
we didn't verify it was actually exhausted in the fastpath and we
could risk cutting short the aging cycle for that particular zone.
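Restated as an untested, purely illustrative sketch (both helper names are
invented here, not taken from either series):

/* Is this zone ever part of the fair round-robin cycle? */
static bool fair_zone_in_cycle(struct zone *preferred_zone, struct zone *zone)
{
	return !(zone_reclaim_mode || pagecache_mempolicy_mode) ||
	       zone_local(preferred_zone, zone);
}

/* Fastpath: usable only if in the cycle and its batch is not exhausted */
static bool fair_zone_usable(struct zone *preferred_zone, struct zone *zone)
{
	if (zone_page_state(zone, NR_ALLOC_BATCH) <= 0)
		return false;
	return fair_zone_in_cycle(preferred_zone, zone);
}

The slowpath batch reset would then only consider zones for which
fair_zone_in_cycle() is true, which is why a remote fallback allocation never
clobbers that zone's aging cycle.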
* Re: [RFC PATCH 0/6] Configurable fair allocation zone policy v3
2013-12-18 19:48 ` Johannes Weiner
@ 2013-12-19 11:20 ` Mel Gorman
0 siblings, 0 replies; 21+ messages in thread
From: Mel Gorman @ 2013-12-19 11:20 UTC (permalink / raw)
To: Johannes Weiner; +Cc: Andrew Morton, Dave Hansen, Rik van Riel, Linux-MM, LKML
On Wed, Dec 18, 2013 at 02:48:13PM -0500, Johannes Weiner wrote:
> > <SNIP>
> >
> > Sure about the name?
> >
> > This is a boolean and "mode" implies it might be a bitmask. That said, I
> > recognise that my own naming also sucked because complaining about yours
> > I can see that mine also sucks.
>
> Is it because of how we use zone_reclaim_mode? I don't see anything
> wrong with a "mode" toggle that switches between only two modes of
> operation instead of three or more. But English being a second
> language and all...
>
It's not just zone_reclaim_mode. Most references to mode in the VM (but
not all because who needs consistency) refer to either a mask or multiple
potential values. isolate_mode_t, gfp masks referred to as mode, memory
policies described as mode, migration modes etc.
Intuitively, I expect "mode" to not be a binary value.
> > > @@ -1816,7 +1833,7 @@ static void zlc_clear_zones_full(struct zonelist *zonelist)
> > >
> > > static bool zone_local(struct zone *local_zone, struct zone *zone)
> > > {
> > > - return node_distance(local_zone->node, zone->node) == LOCAL_DISTANCE;
> > > + return local_zone->node == zone->node;
> > > }
> >
> > Does that not break on !CONFIG_NUMA?
> >
> > It's why I used zone_to_nid
>
> There is a separate definition for !CONFIG_NUMA, it fit nicely next to
> the zlc stuff.
>
Ah, fair enough.
> > > static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone)
> > > @@ -1908,22 +1925,25 @@ zonelist_scan:
> > > if (unlikely(alloc_flags & ALLOC_NO_WATERMARKS))
> > > goto try_this_zone;
> > > /*
> > > - * Distribute pages in proportion to the individual
> > > - * zone size to ensure fair page aging. The zone a
> > > - * page was allocated in should have no effect on the
> > > - * time the page has in memory before being reclaimed.
> > > + * Distribute pagecache pages in proportion to the
> > > + * individual zone size to ensure fair page aging.
> > > + * The zone a page was allocated in should have no
> > > + * effect on the time the page has in memory before
> > > + * being reclaimed.
> > > *
> > > - * When zone_reclaim_mode is enabled, try to stay in
> > > - * local zones in the fastpath. If that fails, the
> > > + * When pagecache_mempolicy_mode or zone_reclaim_mode
> > > + * is enabled, try to allocate from zones within the
> > > + * preferred node in the fastpath. If that fails, the
> > > * slowpath is entered, which will do another pass
> > > * starting with the local zones, but ultimately fall
> > > * back to remote zones that do not partake in the
> > > * fairness round-robin cycle of this zonelist.
> > > */
> > > - if (alloc_flags & ALLOC_WMARK_LOW) {
> > > + if ((alloc_flags & ALLOC_WMARK_LOW) &&
> > > + (gfp_mask & __GFP_PAGECACHE)) {
> > > if (zone_page_state(zone, NR_ALLOC_BATCH) <= 0)
> > > continue;
> >
> > NR_ALLOC_BATCH is updated regardless of zone_reclaim_mode or
> > pagecache_mempolicy_mode. We only reset batch in the prepare_slowpath in
> > some cases. Looks a bit fishy even though I can't quite put my finger on it.
> >
> > I also got details wrong here in the v3 of the series. In an unreleased
> > v4 of the series I had corrected the treatment of slab pages in line
> > with your wishes and reused the broken out helper in prepare_slowpath to
> > keep the decision in sync.
> >
> > It's still in development but even if it gets rejected it'll act as a
> > comparison point to yours.
> >
> > > - if (zone_reclaim_mode &&
> > > + if ((zone_reclaim_mode || pagecache_mempolicy_mode) &&
> > > !zone_local(preferred_zone, zone))
> > > continue;
> > > }
> >
> > Documentation says "enabling pagecache_mempolicy_mode, in which case page cache
> > allocations will be placed according to the configured memory policy". Should
> > that be !pagecache_mempolicy_mode? I'm getting confused with the double nots.
>
> Yes, it's a bit weird.
>
> We want to consider the round-robin batches for local zones but at the
> same time avoid exhausted batches from pushing the allocation off-node
> when either of those modes are enabled. So in the fastpath we filter
> for both and in the slowpath, once kswapd has been woken at the same
> time that the batches have been reset to launch the new aging cycle,
> we try in order of zonelist preference.
>
> However, to answer your question above, if the slowpath still has to
> fall back to a remote zone, we don't want to reset its batch because
> we didn't verify it was actually exhausted in the fastpath and we
> could risk cutting short the aging cycle for that particular zone.
Understood, thanks.
--
Mel Gorman
SUSE Labs
* Re: [RFC PATCH 0/6] Configurable fair allocation zone policy v3
2013-12-17 20:02 ` [RFC PATCH 0/6] Configurable fair allocation zone policy v3 Johannes Weiner
2013-12-18 6:17 ` Johannes Weiner
@ 2013-12-18 14:51 ` Michal Hocko
2013-12-18 15:18 ` Johannes Weiner
1 sibling, 1 reply; 21+ messages in thread
From: Michal Hocko @ 2013-12-18 14:51 UTC (permalink / raw)
To: Johannes Weiner
Cc: Mel Gorman, Andrew Morton, Dave Hansen, Rik van Riel, Linux-MM,
LKML
On Tue 17-12-13 15:02:10, Johannes Weiner wrote:
[...]
> +pagecache_mempolicy_mode:
> +
> +This is available only on NUMA kernels.
> +
> +Per default, the configured memory policy is applicable to anonymous
> +memory, shmem, tmpfs, etc., whereas pagecache is allocated in an
> +interleaving fashion over all allowed nodes (hardbindings and
> +zone_reclaim_mode excluded).
> +
> +The assumption is that, when it comes to pagecache, users generally
> +prefer predictable replacement behavior regardless of NUMA topology
> +and maximizing the cache's effectiveness in reducing IO over memory
> +locality.
Isn't page spreading (PF_SPREAD_PAGE) intended to do the same thing
semantically? The setting is per-cpuset rather than global which makes
it harder to use but essentially it tries to distribute page cache pages
across all the nodes.
This is really getting confusing. We have zone_reclaim_mode to keep
memory local in general, pagecache_mempolicy_mode to keep page cache
local and PF_SPREAD_PAGE to spread the page cache around nodes.
> +
> +This behavior can be changed by enabling pagecache_mempolicy_mode, in
> +which case page cache allocations will be placed according to the
> +configured memory policy (Documentation/vm/numa_memory_policy.txt).
--
Michal Hocko
SUSE Labs
* Re: [RFC PATCH 0/6] Configurable fair allocation zone policy v3
2013-12-18 14:51 ` Michal Hocko
@ 2013-12-18 15:18 ` Johannes Weiner
2013-12-18 16:20 ` Michal Hocko
0 siblings, 1 reply; 21+ messages in thread
From: Johannes Weiner @ 2013-12-18 15:18 UTC (permalink / raw)
To: Michal Hocko
Cc: Mel Gorman, Andrew Morton, Dave Hansen, Rik van Riel, Linux-MM,
LKML
On Wed, Dec 18, 2013 at 03:51:11PM +0100, Michal Hocko wrote:
> On Tue 17-12-13 15:02:10, Johannes Weiner wrote:
> [...]
> > +pagecache_mempolicy_mode:
> > +
> > +This is available only on NUMA kernels.
> > +
> > +Per default, the configured memory policy is applicable to anonymous
> > +memory, shmem, tmpfs, etc., whereas pagecache is allocated in an
> > +interleaving fashion over all allowed nodes (hardbindings and
> > +zone_reclaim_mode excluded).
> > +
> > +The assumption is that, when it comes to pagecache, users generally
> > +prefer predictable replacement behavior regardless of NUMA topology
> > +and maximizing the cache's effectiveness in reducing IO over memory
> > +locality.
>
> Isn't page spreading (PF_SPREAD_PAGE) intended to do the same thing
> semantically? The setting is per-cpuset rather than global which makes
> it harder to use but essentially it tries to distribute page cache pages
> across all the nodes.
>
> This is really getting confusing. We have zone_reclaim_mode to keep
> memory local in general, pagecache_mempolicy_mode to keep page cache
> local and PF_SPREAD_PAGE to spread the page cache around nodes.
zone_reclaim_mode is a global setting to go through great lengths to
stay on local nodes, intended to be used depending on the hardware,
not the workload.
Mempolicy on the other hand is to optimize placement for maximum
locality depending on access patterns of a workload or even just the
subset of a workload. I'm trying to change whether this applies to
page cache (due to different locality / cache effectiveness tradeoff)
and we want to provide pagecache_mempolicy_mode to revert in the field
in case this is a mistake.
PF_SPREAD_PAGE becomes implied per default and should eventually be
removed.
* Re: [RFC PATCH 0/6] Configurable fair allocation zone policy v3
2013-12-18 15:18 ` Johannes Weiner
@ 2013-12-18 16:20 ` Michal Hocko
2013-12-18 19:20 ` Johannes Weiner
0 siblings, 1 reply; 21+ messages in thread
From: Michal Hocko @ 2013-12-18 16:20 UTC (permalink / raw)
To: Johannes Weiner
Cc: Mel Gorman, Andrew Morton, Dave Hansen, Rik van Riel, Linux-MM,
LKML
On Wed 18-12-13 10:18:46, Johannes Weiner wrote:
> On Wed, Dec 18, 2013 at 03:51:11PM +0100, Michal Hocko wrote:
> > On Tue 17-12-13 15:02:10, Johannes Weiner wrote:
> > [...]
> > > +pagecache_mempolicy_mode:
> > > +
> > > +This is available only on NUMA kernels.
> > > +
> > > +Per default, the configured memory policy is applicable to anonymous
> > > +memory, shmem, tmpfs, etc., whereas pagecache is allocated in an
> > > +interleaving fashion over all allowed nodes (hardbindings and
> > > +zone_reclaim_mode excluded).
> > > +
> > > +The assumption is that, when it comes to pagecache, users generally
> > > +prefer predictable replacement behavior regardless of NUMA topology
> > > +and maximizing the cache's effectiveness in reducing IO over memory
> > > +locality.
> >
> > Isn't page spreading (PF_SPREAD_PAGE) intended to do the same thing
> > semantically? The setting is per-cpuset rather than global which makes
> > it harder to use but essentially it tries to distribute page cache pages
> > across all the nodes.
> >
> > This is really getting confusing. We have zone_reclaim_mode to keep
> > memory local in general, pagecache_mempolicy_mode to keep page cache
> > local and PF_SPREAD_PAGE to spread the page cache around nodes.
>
> zone_reclaim_mode is a global setting to go through great lengths to
> stay on local nodes, intended to be used depending on the hardware,
> not the workload.
>
> Mempolicy on the other hand is to optimize placement for maximum
> locality depending on access patterns of a workload or even just the
> subset of a workload. I'm trying to change whether this applies to
> page cache (due to different locality / cache effectiveness tradeoff)
> and we want to provide pagecache_mempolicy_mode to revert in the field
> in case this is a mistake.
>
> PF_SPREAD_PAGE becomes implied per default and should eventually be
> removed.
I guess many loads do not care about page cache locality and the default
spreading would be OK for them, but what about those that do care?
Currently we have a per-process (per-cpuset, in fact) flag, but this
will change it to all or nothing. Is this really a good step?
Btw. I do not mind having PF_SPREAD_PAGE enabled by default.
--
Michal Hocko
SUSE Labs
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [RFC PATCH 0/6] Configurable fair allocation zone policy v3
2013-12-18 16:20 ` Michal Hocko
@ 2013-12-18 19:20 ` Johannes Weiner
2013-12-19 12:59 ` Michal Hocko
0 siblings, 1 reply; 21+ messages in thread
From: Johannes Weiner @ 2013-12-18 19:20 UTC (permalink / raw)
To: Michal Hocko
Cc: Mel Gorman, Andrew Morton, Dave Hansen, Rik van Riel, Linux-MM,
LKML
On Wed, Dec 18, 2013 at 05:20:50PM +0100, Michal Hocko wrote:
> On Wed 18-12-13 10:18:46, Johannes Weiner wrote:
> > On Wed, Dec 18, 2013 at 03:51:11PM +0100, Michal Hocko wrote:
> > > On Tue 17-12-13 15:02:10, Johannes Weiner wrote:
> > > [...]
> > > > +pagecache_mempolicy_mode:
> > > > +
> > > > +This is available only on NUMA kernels.
> > > > +
> > > > +Per default, the configured memory policy is applicable to anonymous
> > > > +memory, shmem, tmpfs, etc., whereas pagecache is allocated in an
> > > > +interleaving fashion over all allowed nodes (hardbindings and
> > > > +zone_reclaim_mode excluded).
> > > > +
> > > > +The assumption is that, when it comes to pagecache, users generally
> > > > +prefer predictable replacement behavior regardless of NUMA topology
> > > > +and maximizing the cache's effectiveness in reducing IO over memory
> > > > +locality.
> > >
> > > Isn't page spreading (PF_SPREAD_PAGE) intended to do the same thing
> > > semantically? The setting is per-cpuset rather than global which makes
> > > it harder to use but essentially it tries to distribute page cache pages
> > > across all the nodes.
> > >
> > > This is really getting confusing. We have zone_reclaim_mode to keep
> > > memory local in general, pagecache_mempolicy_mode to keep page cache
> > > local and PF_SPREAD_PAGE to spread the page cache around nodes.
You are right that the user interface we are exposing is kind of
cruddy and I'm less and less convinced that this is the right
direction.
> > zone_reclaim_mode is a global setting to go through great lengths to
> > stay on local nodes, intended to be used depending on the hardware,
> > not the workload.
> >
> > Mempolicy on the other hand is to optimize placement for maximum
> > locality depending on access patterns of a workload or even just the
> > subset of a workload. I'm trying to change whether this applies to
> > page cache (due to different locality / cache effectiveness tradeoff)
> > and we want to provide pagecache_mempolicy_mode to revert in the field
> > in case this is a mistake.
> >
> > PF_SPREAD_PAGE becomes implied per default and should eventually be
> > removed.
>
> I guess many loads do not care about page cache locality and the default
> spreading would be OK for them but what about those that do care?
Mel suggested that the page cache spreading be implemented as just
another memory policy, and I rejected it on the grounds that we can
have strange aging artifacts if it's not the default.
But you are right that there might be use cases that really have high
cache locality and don't incur any reclaim. The aging artifacts are
non-existent for them, but they would care about the NUMA locality.
And basically, the same aging artifacts apply to anon, for example;
it's just that the trade-off balance is different, as reclaim is much
less common. And we do offer interleaving for anon as well. So the
situation is not all that different from what I had convinced myself
it would be...
So the more I'm thinking about it, the more I'm leaning towards making
it a mempolicy after all, provided that we can set a sane default.
Maybe we can make the new default a hybrid policy that keeps anon,
shmem, slab, kernel, etc. local but interleaves pagecache. This
should make sense for most use cases while providing the ability for
custom placement policies per-process or per-VMA, without having to
make the decision on a global level or through an unusual interface.
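Roughly, the decision such a hybrid default would have to make could
look like the sketch below (plain C, names and types made up for
illustration): local placement for anon/shmem/slab/kernel, interleave
for pagecache, and any explicit per-process or per-VMA policy simply
overrides the default.

/*
 * Toy model of a hybrid default policy; not kernel code.
 */
#include <stdbool.h>
#include <stdio.h>

enum alloc_class { ALLOC_ANON, ALLOC_SHMEM, ALLOC_SLAB, ALLOC_PAGECACHE };
enum placement { PLACE_LOCAL, PLACE_INTERLEAVE };

struct task_policy {
	bool explicit_policy;	/* e.g. set up via set_mempolicy()/mbind() */
	enum placement mode;	/* what that explicit policy asks for */
};

static enum placement default_placement(enum alloc_class class)
{
	/* only pagecache deviates from node-local placement */
	return class == ALLOC_PAGECACHE ? PLACE_INTERLEAVE : PLACE_LOCAL;
}

static enum placement pick_placement(const struct task_policy *pol,
				     enum alloc_class class)
{
	if (pol && pol->explicit_policy)
		return pol->mode;	/* custom placement always wins */
	return default_placement(class);
}

int main(void)
{
	struct task_policy bound = { .explicit_policy = true, .mode = PLACE_LOCAL };

	printf("default pagecache -> %s\n",
	       pick_placement(NULL, ALLOC_PAGECACHE) == PLACE_INTERLEAVE ?
	       "interleave" : "local");
	printf("bound pagecache   -> %s\n",
	       pick_placement(&bound, ALLOC_PAGECACHE) == PLACE_INTERLEAVE ?
	       "interleave" : "local");
	return 0;
}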
> Currently we have a per-process (cpuset in fact) flag but this will
> change it to all or nothing. Is this really a good step?
> Btw. I do not mind having PF_SPREAD_PAGE enabled by default.
I don't want to muck around with cpusets too much, tbh... but I agree
that the behavior of PF_SPREAD_PAGE should be the default. Except that
it should honor zone_reclaim_mode and only round-robin over nodes that
are within RECLAIM_DISTANCE of the local one.
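In other words, something along these lines (again a standalone C model
with a made-up distance table, not the kernel helpers): the round-robin
cursor only ever lands on nodes whose distance from the local node is
within RECLAIM_DISTANCE.

/*
 * Toy model of distance-bounded pagecache spreading; not kernel code.
 * Two "close" node pairs: (0,1) and (2,3).
 */
#include <stdio.h>

#define NR_NODES	 4
#define RECLAIM_DISTANCE 30

static const int distance[NR_NODES][NR_NODES] = {
	{ 10, 20, 40, 40 },
	{ 20, 10, 40, 40 },
	{ 40, 40, 10, 20 },
	{ 40, 40, 20, 10 },
};

static int spread_node(int local, unsigned int *cursor)
{
	int i;

	for (i = 0; i < NR_NODES; i++) {
		int nid = (*cursor)++ % NR_NODES;

		if (distance[local][nid] <= RECLAIM_DISTANCE)
			return nid;	/* eligible for round-robin */
	}
	return local;			/* nothing close enough: stay local */
}

int main(void)
{
	unsigned int cursor = 0;
	int i;

	/* pagecache allocated from node 0 alternates between nodes 0 and 1 */
	for (i = 0; i < 6; i++)
		printf("pagecache from node 0 -> node %d\n",
		       spread_node(0, &cursor));
	return 0;
}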
I will have spotty access to the internet starting tomorrow night until
New Year's. Is there a chance we can maybe revert the NUMA aspects of
the original patch for now and leave it as a node-local zone fairness
thing? The NUMA behavior was so broken on 3.12 that I doubt that
people have come to rely on the cache fairness on such machines in
that one release. So we should be able to release 3.12-stable and
3.13 with node-local zone fairness without regressing anybody, and
then give the NUMA aspect of it another try in 3.14.
Something like the following should restore the previous NUMA placement
behavior while still fixing the kswapd vs. page allocator interaction
bug of thrashing on the highest zone. PS: zone_local() is in a
CONFIG_NUMA block, which is why accessing zone->node is safe :-)
---
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index dd886fac451a..317ea747d2cd 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1822,7 +1822,7 @@ static void zlc_clear_zones_full(struct zonelist *zonelist)
static bool zone_local(struct zone *local_zone, struct zone *zone)
{
- return node_distance(local_zone->node, zone->node) == LOCAL_DISTANCE;
+ return local_zone->node == zone->node;
}
static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone)
@@ -1919,18 +1919,17 @@ get_page_from_freelist(gfp_t gfp_mask, nodemask_t *nodemask, unsigned int order,
* page was allocated in should have no effect on the
* time the page has in memory before being reclaimed.
*
- * When zone_reclaim_mode is enabled, try to stay in
- * local zones in the fastpath. If that fails, the
- * slowpath is entered, which will do another pass
- * starting with the local zones, but ultimately fall
- * back to remote zones that do not partake in the
- * fairness round-robin cycle of this zonelist.
+ * Try to stay in local zones in the fastpath. If
+ * that fails, the slowpath is entered, which will do
+ * another pass starting with the local zones, but
+ * ultimately fall back to remote zones that do not
+ * partake in the fairness round-robin cycle of this
+ * zonelist.
*/
if (alloc_flags & ALLOC_WMARK_LOW) {
if (zone_page_state(zone, NR_ALLOC_BATCH) <= 0)
continue;
- if (zone_reclaim_mode &&
- !zone_local(preferred_zone, zone))
+ if (!zone_local(preferred_zone, zone))
continue;
}
/*
@@ -2396,7 +2395,7 @@ static void prepare_slowpath(gfp_t gfp_mask, unsigned int order,
* thrash fairness information for zones that are not
* actually part of this zonelist's round-robin cycle.
*/
- if (zone_reclaim_mode && !zone_local(preferred_zone, zone))
+ if (!zone_local(preferred_zone, zone))
continue;
mod_zone_page_state(zone, NR_ALLOC_BATCH,
high_wmark_pages(zone) -
^ permalink raw reply related [flat|nested] 21+ messages in thread
* Re: [RFC PATCH 0/6] Configurable fair allocation zone policy v3
2013-12-18 19:20 ` Johannes Weiner
@ 2013-12-19 12:59 ` Michal Hocko
0 siblings, 0 replies; 21+ messages in thread
From: Michal Hocko @ 2013-12-19 12:59 UTC (permalink / raw)
To: Johannes Weiner
Cc: Mel Gorman, Andrew Morton, Dave Hansen, Rik van Riel, Linux-MM,
LKML
On Wed 18-12-13 14:20:15, Johannes Weiner wrote:
> On Wed, Dec 18, 2013 at 05:20:50PM +0100, Michal Hocko wrote:
[...]
> > Currently we have a per-process (cpuset in fact) flag but this will
> > change it to all or nothing. Is this really a good step?
> > Btw. I do not mind having PF_SPREAD_PAGE enabled by default.
>
> I don't want to muck around with cpusets too much, tbh... but I agree
> that the behavior of PF_SPREAD_PAGE should be the default. Except it
> should honor zone_reclaim_mode and round-robin nodes that are within
> RECLAIM_DISTANCE of the local one.
Agreed.
> I will have spotty access to internet starting tomorrow night until
> New Year's. Is there a chance we can maybe revert the NUMA aspects of
> the original patch for now and leave it as a node-local zone fairness
> thing?
Yes, that sounds perfectly reasonable to me.
> The NUMA behavior was so broken on 3.12 that I doubt that
> people have come to rely on the cache fairness on such machines in
> that one release. So we should be able to release 3.12-stable and
> 3.13 with node-local zone fairness without regressing anybody, and
> then give the NUMA aspect of it another try in 3.14.
>
> Something like the following should restore NUMA behavior while still
> fixing the kswapd vs. page allocator interaction bug of thrashing on
> the highest zone.
Yes, it looks good to me. I guess zone_local could have stayed as it
was, because it shouldn't be a big deal to fall back to a different node
if the distance is LOCAL_DISTANCE, but taking a conservative approach is
not harmful.
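As a quick sanity check on that, here is a standalone C model (made-up
SLIT, not kernel code) of the two zone_local() variants: with a typical
distance table, LOCAL_DISTANCE is only reported for a node against
itself, so the distance-based check and the plain node comparison agree
and nothing is printed.

/*
 * Toy comparison of the old and new zone_local() checks; not kernel code.
 */
#include <stdbool.h>
#include <stdio.h>

#define NR_NODES	3
#define LOCAL_DISTANCE	10

static const int distance[NR_NODES][NR_NODES] = {
	{ 10, 21, 21 },
	{ 21, 10, 21 },
	{ 21, 21, 10 },
};

static bool local_by_distance(int a, int b)
{
	return distance[a][b] == LOCAL_DISTANCE;	/* old check */
}

static bool local_by_node(int a, int b)
{
	return a == b;					/* new check */
}

int main(void)
{
	int a, b;

	for (a = 0; a < NR_NODES; a++)
		for (b = 0; b < NR_NODES; b++)
			if (local_by_distance(a, b) != local_by_node(a, b))
				printf("nodes %d and %d: checks disagree\n", a, b);
	return 0;
}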
> PS: zone_local() is in a CONFIG_NUMA block, which
> is why accessing zone->node is safe :-)
>
> ---
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index dd886fac451a..317ea747d2cd 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1822,7 +1822,7 @@ static void zlc_clear_zones_full(struct zonelist *zonelist)
>
> static bool zone_local(struct zone *local_zone, struct zone *zone)
> {
> - return node_distance(local_zone->node, zone->node) == LOCAL_DISTANCE;
> + return local_zone->node == zone->node;
> }
>
> static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone)
> @@ -1919,18 +1919,17 @@ get_page_from_freelist(gfp_t gfp_mask, nodemask_t *nodemask, unsigned int order,
> * page was allocated in should have no effect on the
> * time the page has in memory before being reclaimed.
> *
> - * When zone_reclaim_mode is enabled, try to stay in
> - * local zones in the fastpath. If that fails, the
> - * slowpath is entered, which will do another pass
> - * starting with the local zones, but ultimately fall
> - * back to remote zones that do not partake in the
> - * fairness round-robin cycle of this zonelist.
> + * Try to stay in local zones in the fastpath. If
> + * that fails, the slowpath is entered, which will do
> + * another pass starting with the local zones, but
> + * ultimately fall back to remote zones that do not
> + * partake in the fairness round-robin cycle of this
> + * zonelist.
> */
> if (alloc_flags & ALLOC_WMARK_LOW) {
> if (zone_page_state(zone, NR_ALLOC_BATCH) <= 0)
> continue;
> - if (zone_reclaim_mode &&
> - !zone_local(preferred_zone, zone))
> + if (!zone_local(preferred_zone, zone))
> continue;
> }
> /*
> @@ -2396,7 +2395,7 @@ static void prepare_slowpath(gfp_t gfp_mask, unsigned int order,
> * thrash fairness information for zones that are not
> * actually part of this zonelist's round-robin cycle.
> */
> - if (zone_reclaim_mode && !zone_local(preferred_zone, zone))
> + if (!zone_local(preferred_zone, zone))
> continue;
> mod_zone_page_state(zone, NR_ALLOC_BATCH,
> high_wmark_pages(zone) -
>
>
--
Michal Hocko
SUSE Labs
^ permalink raw reply [flat|nested] 21+ messages in thread