[PATCH/RFC 4/4] VM: automatic reclaim through mempolicy

From: Martin Hicks <mort@sgi.com>
To: Andrew Morton <akpm@osdl.org>, Linux-MM <linux-mm@kvack.org>
Cc: Ray Bryant <raybry@sgi.com>, ak@suse.de
Subject: [PATCH/RFC 4/4] VM: automatic reclaim through mempolicy
Date: Wed, 27 Apr 2005 11:10:10 -0400
Message-ID: <20050427151010.GV8018@localhost>
In-Reply-To: <20050427145734.GL8018@localhost>

This implements a set of flags that modify the behavior
of the mempolicies to allow reclaiming of preferred
memory (as defined by the mempolicy) before spilling
onto remote nodes.  It also adds a new mempolicy,
"localreclaim", which is just the default mempolicy with
non-zero reclaim flags.
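
As a rough illustration (not part of the patch), the promotion of the
default policy to localreclaim can be modelled in userspace as below.
The policy numbers are the ones from include/linux/mempolicy.h in this
patch; effective_mode() and NO_POLICY are hypothetical names that stand
in for the check at the top of mpol_new() and its NULL return.

```c
/* Policy numbers as defined in include/linux/mempolicy.h by this patch. */
#define MPOL_DEFAULT		0
#define MPOL_PREFERRED		1
#define MPOL_BIND		2
#define MPOL_INTERLEAVE		3
#define MPOL_LOCALRECLAIM	4

/* Hypothetical: stands in for mpol_new() returning NULL (no policy object). */
#define NO_POLICY		(-1)

/*
 * Hypothetical helper mirroring the mode check in mpol_new():
 * MPOL_DEFAULT with no flags needs no policy object at all, while
 * MPOL_DEFAULT with any flag set is promoted to MPOL_LOCALRECLAIM.
 * Every other mode is passed through unchanged.
 */
static int effective_mode(int mode, unsigned int flags)
{
	if (mode == MPOL_DEFAULT)
		return flags ? MPOL_LOCALRECLAIM : NO_POLICY;
	return mode;
}
```

So effective_mode(MPOL_DEFAULT, 0) yields NO_POLICY, while the same mode
with any reclaim flag set yields MPOL_LOCALRECLAIM.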

The change required adding a "flags" argument to sys_set_mempolicy()
to give hints about what kind of memory you're willing to sacrifice.
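
A minimal sketch of how the new flags argument is encoded and checked,
for readers who want to try the bit layout outside the kernel.  The
MPOL_* macros follow include/linux/mempolicy.h from this patch (with
extra parentheses added).  The RECLAIM_* values are an assumption: they
come from <linux/swap.h> in patch 2/4, which is not shown here, so the
numbers below are placeholders.  mpol_flags_invalid() is a hypothetical
helper wrapping the check done in sys_set_mempolicy().

```c
/*
 * ASSUMPTION: placeholder values.  The real RECLAIM_* bits are defined
 * in <linux/swap.h> by patch 2/4 of this series.
 */
#define RECLAIM_UNMAPPED		(1 << 0)
#define RECLAIM_MAPPED			(1 << 1)
#define RECLAIM_ACTIVE_UNMAPPED		(1 << 2)
#define RECLAIM_ACTIVE_MAPPED		(1 << 3)
#define RECLAIM_SLAB			(1 << 4)

/* From include/linux/mempolicy.h as changed by this patch. */
#define MPOL_F_NODE		(1 << 0)
#define MPOL_F_ADDR		(1 << 1)
#define MPOL_MF_STRICT		(1 << 2)

/* Reclaim hints live above the three existing flag bits. */
#define mpol_reclaim_shift(x)	((x) << 3)
#define MPOL_LR_UNMAPPED	mpol_reclaim_shift(RECLAIM_UNMAPPED)
#define MPOL_LR_MAPPED		mpol_reclaim_shift(RECLAIM_MAPPED)
#define MPOL_LR_ACTIVE_UNMAPPED	mpol_reclaim_shift(RECLAIM_ACTIVE_UNMAPPED)
#define MPOL_LR_ACTIVE_MAPPED	mpol_reclaim_shift(RECLAIM_ACTIVE_MAPPED)
#define MPOL_LR_SLAB		mpol_reclaim_shift(RECLAIM_SLAB)

#define MPOL_LR_FLAGS	(MPOL_LR_UNMAPPED | MPOL_LR_MAPPED | \
			 MPOL_LR_ACTIVE_MAPPED | MPOL_LR_ACTIVE_UNMAPPED | \
			 MPOL_LR_SLAB)
#define MPOL_FLAGS	(MPOL_F_NODE | MPOL_F_ADDR | MPOL_MF_STRICT | \
			 MPOL_LR_FLAGS)
#define MPOL_FLAG_MASK	(~MPOL_FLAGS)

/* Strip the non-reclaim bits and shift back down to RECLAIM_* values. */
#define mpol_to_reclaim_flags(flags)	(((flags) & MPOL_LR_FLAGS) >> 3)

/* Hypothetical helper: nonzero if flags contains a bit the syscall rejects. */
static int mpol_flags_invalid(int flags)
{
	return (flags & MPOL_FLAG_MASK) != 0;
}
```

With these placeholder values, passing MPOL_LR_UNMAPPED | MPOL_LR_MAPPED
decodes back to RECLAIM_UNMAPPED | RECLAIM_MAPPED, and any bit outside
MPOL_FLAGS makes the call fail with -EINVAL.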

A patch for numactl-0.6.4 to support these new flags is at
http://www.bork.org/~mort/sgi/localreclaim/numactl-localreclaim.patch
This patch breaks compatibility, but I just needed something to
test with.  I did update numactl's usage message with the
new bits.  Essentially, just add "--localreclaim=[umUM]" to get
the allocator to use localreclaim.

I'm sure that better tuning of the rate-limiting code in
vmscan.c::reclaim_clean_pages() could help performance further,
but at this stage I was fairly happy to keep the system time
at a reasonable level.  The obvious difficulty with this patch
is ensuring that it doesn't scan the LRU lists to death looking
for nonexistent clean pages.
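
The shape of the local allocation path, and why it terminates, can be
sketched with a toy userspace model.  This is only an illustration of
the two-pass loop in __alloc_pages_localreclaim() below: struct toy_zone
and all toy_* functions are hypothetical, the watermark test is heavily
simplified, and the assumption that reclaim_clean_pages() returns zero
once it makes no progress is inferred from how the patch uses it
(the function itself is in patch 2/4).

```c
/* Toy stand-in for a zone: only the counters this sketch needs. */
struct toy_zone {
	unsigned long free_pages;	/* pages currently free */
	unsigned long pages_low;	/* low watermark */
	unsigned long clean_pages;	/* easily reclaimable page cache */
};

/*
 * Hypothetical model of reclaim_clean_pages(): frees up to nr clean
 * pages and returns how many it freed, 0 once none are left.
 */
static unsigned long toy_reclaim_clean_pages(struct toy_zone *z,
					     unsigned long nr)
{
	unsigned long freed = nr < z->clean_pages ? nr : z->clean_pages;

	z->clean_pages -= freed;
	z->free_pages += freed;
	return freed;
}

/* Simplified watermark test: the allocation must not dip below pages_low. */
static int toy_watermark_ok(const struct toy_zone *z, unsigned int order)
{
	return z->free_pages >= z->pages_low + (1UL << order);
}

/*
 * Returns 1 if the request was satisfied from the local zone, 0 if the
 * caller should fall back to the normal allocator.
 */
static int toy_alloc_local(struct toy_zone *z, unsigned int order)
{
	/* Pass 1: take the pages if the zone is already above the watermark. */
	if (toy_watermark_ok(z, order)) {
		z->free_pages -= 1UL << order;
		return 1;
	}

	/* Pass 2: reclaim in batches; stop once reclaim makes no progress. */
	while (toy_reclaim_clean_pages(z, 1UL << order)) {
		if (toy_watermark_ok(z, order)) {
			z->free_pages -= 1UL << order;
			return 1;
		}
	}
	return 0;
}
```

In this model the loop is bounded by the number of clean pages in the
zone.  The real code depends on the rate limiting inside
reclaim_clean_pages() for the same guarantee.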

Here are some kernbench runs that show that things don't get out of
control under heavy VM pressure.  I think kernbench's "Maximal" run
is a fairly stressful test for this code because it allocates all
of the memory out of the system and still must do disk IO during
the compiles.

I haven't yet had time to do a run in a situation where I think the
patches will make a real difference.  I'm going to do some runs
with a big HPC app this week.

The test machine was a 4-way 8GB Altix.  The "minimal" (make -j3) and
"optimal" (make -j16) results are uninteresting.  All three runs
show almost exactly the same results because we never actually invoke
any of this new code.  There is no VM pressure.

		Wall	User	System	%CPU	Ctx Sw	Sleeps
		-----	----	------	----	------	------
2.6.12-rc2-mm2	1296	1375	387	160	252333	388268
noreclaim	1111	1370	319	195	216259	318279
reclaim=um	1251	1373	312	160	223148	371875

This is just the average of two runs.  There seems to be large
variance in the first two, but the reclaim=um run is quite
consistent.

2.6.12-rc2-mm2 is kernbench run on a pristine tree.
noreclaim is with the patches, but no use of numactl.
reclaim=um is kernbench invoked with:

./numactl --localreclaim=um ../kernbench-0.3.0/kernbench



Signed-off-by: Martin Hicks <mort@sgi.com>
---


 include/linux/gfp.h       |    3 +
 include/linux/mempolicy.h |   33 ++++++++++++++---
 mm/mempolicy.c            |   68 ++++++++++++++++++++++++++---------
 mm/page_alloc.c           |   87 ++++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 168 insertions(+), 23 deletions(-)

Index: linux-2.6.12-rc2.wk/mm/mempolicy.c
===================================================================
--- linux-2.6.12-rc2.wk.orig/mm/mempolicy.c	2005-04-27 06:27:38.000000000 -0700
+++ linux-2.6.12-rc2.wk/mm/mempolicy.c	2005-04-27 07:09:09.000000000 -0700
@@ -19,7 +19,7 @@
  *                is used.
  * bind           Only allocate memory on a specific set of nodes,
  *                no fallback.
- * preferred       Try a specific node first before normal fallback.
+ * preferred      Try a specific node first before normal fallback.
  *                As a special case node -1 here means do the allocation
  *                on the local CPU. This is normally identical to default,
  *                but useful to set in a VMA when you have a non default
@@ -27,6 +27,9 @@
  * default        Allocate on the local node first, or when on a VMA
  *                use the process policy. This is what Linux always did
  *		  in a NUMA aware kernel and still does by, ahem, default.
+ * localreclaim   This is a special case of default.  The allocator
+ *                will try very hard to get a local allocation.  It
+ *                invokes page cache cleaners and slab cleaners.
  *
  * The process policy is applied for most non interrupt memory allocations
  * in that process' context. Interrupts ignore the policies and always
@@ -113,6 +116,7 @@ static int mpol_check_policy(int mode, u
 
 	switch (mode) {
 	case MPOL_DEFAULT:
+	case MPOL_LOCALRECLAIM:
 		if (!empty)
 			return -EINVAL;
 		break;
@@ -205,13 +209,19 @@ static struct zonelist *bind_zonelist(un
 }
 
 /* Create a new policy */
-static struct mempolicy *mpol_new(int mode, unsigned long *nodes)
+static struct mempolicy *mpol_new(int mode, unsigned long *nodes,
+				  unsigned int flags)
 {
 	struct mempolicy *policy;
+	int mpol_flags = mpol_to_reclaim_flags(flags);
 
 	PDprintk("setting mode %d nodes[0] %lx\n", mode, nodes[0]);
-	if (mode == MPOL_DEFAULT)
-		return NULL;
+	if (mode == MPOL_DEFAULT) {
+		if (!flags)
+			return NULL;
+		else
+			mode = MPOL_LOCALRECLAIM;
+	}
 	policy = kmem_cache_alloc(policy_cache, GFP_KERNEL);
 	if (!policy)
 		return ERR_PTR(-ENOMEM);
@@ -234,6 +244,7 @@ static struct mempolicy *mpol_new(int mo
 		break;
 	}
 	policy->policy = mode;
+	policy->flags = mpol_flags;
 	return policy;
 }
 
@@ -384,7 +395,7 @@ asmlinkage long sys_mbind(unsigned long 
 	if (err)
 		return err;
 
-	new = mpol_new(mode, nodes);
+	new = mpol_new(mode, nodes, flags);
 	if (IS_ERR(new))
 		return PTR_ERR(new);
 
@@ -403,7 +414,7 @@ asmlinkage long sys_mbind(unsigned long 
 
 /* Set the process memory policy */
 asmlinkage long sys_set_mempolicy(int mode, unsigned long __user *nmask,
-				   unsigned long maxnode)
+				  unsigned long maxnode, int flags)
 {
 	int err;
 	struct mempolicy *new;
@@ -411,10 +422,12 @@ asmlinkage long sys_set_mempolicy(int mo
 
 	if (mode > MPOL_MAX)
 		return -EINVAL;
+	if (flags & MPOL_FLAG_MASK)
+		return -EINVAL;
 	err = get_nodes(nodes, nmask, maxnode, mode);
 	if (err)
 		return err;
-	new = mpol_new(mode, nodes);
+	new = mpol_new(mode, nodes, flags);
 	if (IS_ERR(new))
 		return PTR_ERR(new);
 	mpol_free(current->mempolicy);
@@ -436,6 +449,7 @@ static void get_zonemask(struct mempolic
 			__set_bit(p->v.zonelist->zones[i]->zone_pgdat->node_id, nodes);
 		break;
 	case MPOL_DEFAULT:
+	case MPOL_LOCALRECLAIM:
 		break;
 	case MPOL_INTERLEAVE:
 		bitmap_copy(nodes, p->v.nodes, MAX_NUMNODES);
@@ -600,7 +614,7 @@ asmlinkage long compat_sys_set_mempolicy
 	if (err)
 		return -EFAULT;
 
-	return sys_set_mempolicy(mode, nm, nr_bits+1);
+	return sys_set_mempolicy(mode, nm, nr_bits+1, 0);
 }
 
 asmlinkage long compat_sys_mbind(compat_ulong_t start, compat_ulong_t len,
@@ -666,6 +680,7 @@ static struct zonelist *zonelist_policy(
 				return policy->v.zonelist;
 		/*FALL THROUGH*/
 	case MPOL_INTERLEAVE: /* should not happen */
+	case MPOL_LOCALRECLAIM:
 	case MPOL_DEFAULT:
 		nd = numa_node_id();
 		break;
@@ -712,14 +727,17 @@ static unsigned offset_il_node(struct me
 
 /* Allocate a page in interleaved policy.
    Own path because it needs to do special accounting. */
-static struct page *alloc_page_interleave(unsigned int __nocast gfp, unsigned order, unsigned nid)
+static struct page *alloc_page_interleave(unsigned int __nocast gfp, unsigned order, unsigned nid, int flags)
 {
 	struct zonelist *zl;
 	struct page *page;
 
 	BUG_ON(!node_online(nid));
 	zl = NODE_DATA(nid)->node_zonelists + (gfp & GFP_ZONEMASK);
-	page = __alloc_pages(gfp, order, zl);
+	if (flags)
+		page = __alloc_pages_localreclaim(gfp, order, zl, flags);
+	else
+		page = __alloc_pages(gfp, order, zl);
 	if (page && page_zone(page) == zl->zones[0]) {
 		zl->zones[0]->pageset[get_cpu()].interleave_hit++;
 		put_cpu();
@@ -769,8 +787,12 @@ alloc_page_vma(unsigned int __nocast gfp
 			/* fall back to process interleaving */
 			nid = interleave_nodes(pol);
 		}
-		return alloc_page_interleave(gfp, 0, nid);
+		return alloc_page_interleave(gfp, 0, nid, pol->flags);
 	}
+
+	if (pol->flags)
+		return __alloc_pages_localreclaim(gfp, 0,
+				zonelist_policy(gfp, pol), pol->flags);
 	return __alloc_pages(gfp, 0, zonelist_policy(gfp, pol));
 }
 
@@ -802,7 +824,11 @@ struct page *alloc_pages_current(unsigne
 	if (!pol || in_interrupt())
 		pol = &default_policy;
 	if (pol->policy == MPOL_INTERLEAVE)
-		return alloc_page_interleave(gfp, order, interleave_nodes(pol));
+		return alloc_page_interleave(gfp, order, interleave_nodes(pol),
+					     pol->flags);
+	if (pol->flags)
+		return __alloc_pages_localreclaim(gfp, order,
+				zonelist_policy(gfp, pol), pol->flags);
 	return __alloc_pages(gfp, order, zonelist_policy(gfp, pol));
 }
 EXPORT_SYMBOL(alloc_pages_current);
@@ -831,23 +857,29 @@ struct mempolicy *__mpol_copy(struct mem
 /* Slow path of a mempolicy comparison */
 int __mpol_equal(struct mempolicy *a, struct mempolicy *b)
 {
+	int flags;
+
 	if (!a || !b)
 		return 0;
 	if (a->policy != b->policy)
 		return 0;
+	flags = a->flags == b->flags;
 	switch (a->policy) {
 	case MPOL_DEFAULT:
 		return 1;
+	case MPOL_LOCALRECLAIM:
+		return a->flags == b->flags;
 	case MPOL_INTERLEAVE:
-		return bitmap_equal(a->v.nodes, b->v.nodes, MAX_NUMNODES);
+		return flags && bitmap_equal(a->v.nodes, b->v.nodes,
+					     MAX_NUMNODES);
 	case MPOL_PREFERRED:
-		return a->v.preferred_node == b->v.preferred_node;
+		return flags && a->v.preferred_node == b->v.preferred_node;
 	case MPOL_BIND: {
 		int i;
 		for (i = 0; a->v.zonelist->zones[i]; i++)
 			if (a->v.zonelist->zones[i] != b->v.zonelist->zones[i])
 				return 0;
-		return b->v.zonelist->zones[i] == NULL;
+		return flags && b->v.zonelist->zones[i] == NULL;
 	}
 	default:
 		BUG();
@@ -878,6 +910,7 @@ int mpol_first_node(struct vm_area_struc
 
 	switch (pol->policy) {
 	case MPOL_DEFAULT:
+	case MPOL_LOCALRECLAIM:
 		return numa_node_id();
 	case MPOL_BIND:
 		return pol->v.zonelist->zones[0]->zone_pgdat->node_id;
@@ -900,6 +933,7 @@ int mpol_node_valid(int nid, struct vm_a
 	case MPOL_PREFERRED:
 	case MPOL_DEFAULT:
 	case MPOL_INTERLEAVE:
+	case MPOL_LOCALRECLAIM:
 		return 1;
 	case MPOL_BIND: {
 		struct zone **z;
@@ -1126,7 +1160,7 @@ void __init numa_policy_init(void)
 	   the data structures allocated at system boot end up in node zero. */
 
 	if (sys_set_mempolicy(MPOL_INTERLEAVE, nodes_addr(node_online_map),
-							MAX_NUMNODES) < 0)
+					MAX_NUMNODES, 0) < 0)
 		printk("numa_policy_init: interleaving failed\n");
 }
 
@@ -1134,5 +1168,5 @@ void __init numa_policy_init(void)
  * Assumes fs == KERNEL_DS */
 void numa_default_policy(void)
 {
-	sys_set_mempolicy(MPOL_DEFAULT, NULL, 0);
+	sys_set_mempolicy(MPOL_DEFAULT, NULL, 0, 0);
 }
Index: linux-2.6.12-rc2.wk/include/linux/gfp.h
===================================================================
--- linux-2.6.12-rc2.wk.orig/include/linux/gfp.h	2005-04-27 06:27:38.000000000 -0700
+++ linux-2.6.12-rc2.wk/include/linux/gfp.h	2005-04-27 07:09:09.000000000 -0700
@@ -81,6 +81,9 @@ static inline void arch_free_page(struct
 
 extern struct page *
 FASTCALL(__alloc_pages(unsigned int, unsigned int, struct zonelist *));
+extern struct page *
+FASTCALL(__alloc_pages_localreclaim(unsigned int, unsigned int,
+				    struct zonelist *, int));
 
 static inline struct page *alloc_pages_node(int nid, unsigned int __nocast gfp_mask,
 						unsigned int order)
Index: linux-2.6.12-rc2.wk/mm/page_alloc.c
===================================================================
--- linux-2.6.12-rc2.wk.orig/mm/page_alloc.c	2005-04-27 06:56:57.000000000 -0700
+++ linux-2.6.12-rc2.wk/mm/page_alloc.c	2005-04-27 07:09:09.000000000 -0700
@@ -958,6 +958,93 @@ got_pg:
 
 EXPORT_SYMBOL(__alloc_pages);
 
+#ifdef CONFIG_NUMA
+
+/*
+ * A function that tries to allocate memory from the local
+ * node by trying really hard, including trying to free up
+ * easily-freed memory from the page cache and (perhaps in the
+ * future) the slab
+ */
+struct page * fastcall
+__alloc_pages_localreclaim(unsigned int gfp_mask, unsigned int order,
+			   struct zonelist *zonelist, int flags)
+{
+	struct zone **zones, *z;
+	struct page *page = NULL;
+	int classzone_idx;
+	int i;
+
+	/*
+	 * Never try local reclaim with GFP_ATOMIC and friends, because
+	 * this path might sleep.
+	 */
+	if (!(gfp_mask & __GFP_WAIT))
+		return __alloc_pages(gfp_mask, order, zonelist);
+
+	zones = zonelist->zones;
+	if (unlikely(zones[0] == NULL))
+		return NULL;
+
+	classzone_idx = zone_idx(zones[0]);
+
+	/*
+	 * Go through the zonelist once, looking for a local zone
+	 * with enough free memory.
+	 */
+	for (i = 0; (z = zones[i]) != NULL; i++) {
+		if (NODE_DATA(numa_node_id()) != z->zone_pgdat)
+			continue;
+		if (!cpuset_zone_allowed(z))
+			continue;
+
+		if (zone_watermark_ok(z, order, z->pages_low,
+				      classzone_idx, 0, 0)) {
+			page = buffered_rmqueue(z, order, gfp_mask);
+			if (page)
+				goto got_pg;
+		}
+	}
+
+	/* Go through again trying to free memory from the zone */
+	for (i = 0; (z = zones[i]) != NULL; i++) {
+		if (NODE_DATA(numa_node_id()) != z->zone_pgdat)
+			continue;
+		if (!cpuset_zone_allowed(z))
+			continue;
+
+		while (reclaim_clean_pages(z, 1<<order, flags)) {
+		       if (zone_watermark_ok(z, order, z->pages_low,
+					     classzone_idx, 0, 0)) {
+			       page = buffered_rmqueue(z, order, gfp_mask);
+			       if (page)
+				       goto got_pg;
+		       }
+		}
+	}
+
+	/* Didn't get a local page - invoke the normal allocator */
+	return __alloc_pages(gfp_mask, order, zonelist);
+ got_pg:
+
+#ifdef CONFIG_PAGE_OWNER /* huga... */
+ 	{
+	unsigned long address, bp;
+#ifdef X86_64
+	asm ("movq %%rbp, %0" : "=r" (bp) : );
+#else
+        asm ("movl %%ebp, %0" : "=r" (bp) : );
+#endif
+        page->order = (int) order;
+        __stack_trace(page, &address, bp);
+	}
+#endif /* CONFIG_PAGE_OWNER */
+	zone_statistics(zonelist, z);
+	return page;
+}
+
+#endif /* CONFIG_NUMA */
+
 /*
  * Common helper functions.
  */
Index: linux-2.6.12-rc2.wk/include/linux/mempolicy.h
===================================================================
--- linux-2.6.12-rc2.wk.orig/include/linux/mempolicy.h	2005-04-27 06:27:38.000000000 -0700
+++ linux-2.6.12-rc2.wk/include/linux/mempolicy.h	2005-04-27 07:09:09.000000000 -0700
@@ -2,6 +2,7 @@
 #define _LINUX_MEMPOLICY_H 1
 
 #include <linux/errno.h>
+#include <linux/swap.h>
 
 /*
  * NUMA memory policies for Linux.
@@ -9,19 +10,38 @@
  */
 
 /* Policies */
-#define MPOL_DEFAULT	0
-#define MPOL_PREFERRED	1
-#define MPOL_BIND	2
-#define MPOL_INTERLEAVE	3
+#define MPOL_DEFAULT		0
+#define MPOL_PREFERRED		1
+#define MPOL_BIND		2
+#define MPOL_INTERLEAVE		3
+#define MPOL_LOCALRECLAIM	4
 
-#define MPOL_MAX MPOL_INTERLEAVE
+#define MPOL_MAX MPOL_LOCALRECLAIM
 
 /* Flags for get_mem_policy */
 #define MPOL_F_NODE	(1<<0)	/* return next IL mode instead of node mask */
 #define MPOL_F_ADDR	(1<<1)	/* look up vma using address */
 
 /* Flags for mbind */
-#define MPOL_MF_STRICT	(1<<0)	/* Verify existing pages in the mapping */
+#define MPOL_MF_STRICT	(1<<2)	/* Verify existing pages in the mapping */
+
+/* Flags for set_mempolicy */
+#define mpol_reclaim_shift(x)	((x)<<3)
+#define MPOL_LR_UNMAPPED	mpol_reclaim_shift(RECLAIM_UNMAPPED)
+#define MPOL_LR_MAPPED		mpol_reclaim_shift(RECLAIM_MAPPED)
+#define MPOL_LR_ACTIVE_UNMAPPED	mpol_reclaim_shift(RECLAIM_ACTIVE_UNMAPPED)
+#define MPOL_LR_ACTIVE_MAPPED	mpol_reclaim_shift(RECLAIM_ACTIVE_MAPPED)
+#define MPOL_LR_SLAB		mpol_reclaim_shift(RECLAIM_SLAB)
+
+#define MPOL_LR_FLAGS	(MPOL_LR_UNMAPPED | MPOL_LR_MAPPED | \
+			 MPOL_LR_ACTIVE_MAPPED | MPOL_LR_ACTIVE_UNMAPPED | \
+			 MPOL_LR_SLAB)
+#define MPOL_LR_MASK	(~MPOL_LR_FLAGS)
+#define MPOL_FLAGS	(MPOL_F_NODE | MPOL_F_ADDR | MPOL_MF_STRICT | \
+			 MPOL_LR_FLAGS)
+#define MPOL_FLAG_MASK	(~MPOL_FLAGS)
+#define mpol_to_reclaim_flags(flags)	(((flags) & MPOL_LR_FLAGS) >> 3)
+
 
 #ifdef __KERNEL__
 
@@ -60,6 +80,7 @@ struct vm_area_struct;
 struct mempolicy {
 	atomic_t refcnt;
 	short policy; 	/* See MPOL_* above */
+	int flags;
 	union {
 		struct zonelist  *zonelist;	/* bind */
 		short 		 preferred_node; /* preferred */

Thread overview: 10+ messages
     [not found] <20050427145734.GL8018@localhost>
2005-04-27 15:09 ` [PATCH/RFC 1/4] VM: merge_lru_pages Martin Hicks
2005-04-27 15:09 ` [PATCH/RFC 2/4] VM: page cache reclaim core Martin Hicks
2005-04-27 23:32   ` Andrew Morton
2005-04-27 15:09 ` [PATCH/RFC 3/4] VM: toss_page_cache_node syscall Martin Hicks
2005-04-27 23:33   ` Andrew Morton
2005-04-27 15:10 ` Martin Hicks [this message]
2005-04-27 23:35   ` [PATCH/RFC 4/4] VM: automatic reclaim through mempolicy Andrew Morton
2005-04-28 12:56     ` Martin Hicks
2005-04-27 23:50   ` Andrew Morton
2005-04-28 17:41     ` Martin Hicks
