From: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
To: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: linux-mm <linux-mm@kvack.org>, Paul Mundt <lethal@linux-sh.org>,
	Christoph Lameter <clameter@sgi.com>,
	Nishanth Aravamudan <nacc@us.ibm.com>,
	kxr@sgi.com, ak@suse.de, akpm@linux-foundation.org,
	Eric Whitney <eric.whitney@hp.com>
Subject: Re: [PATCH/RFC] Allow selected nodes to be excluded from MPOL_INTERLEAVE masks
Date: Mon, 30 Jul 2007 12:13:48 -0400
Message-ID: <1185812028.5492.79.camel@localhost>
In-Reply-To: <20070728151912.c541aec0.kamezawa.hiroyu@jp.fujitsu.com>

On Sat, 2007-07-28 at 15:19 +0900, KAMEZAWA Hiroyuki wrote:
> On Fri, 27 Jul 2007 16:07:57 -0400
> Lee Schermerhorn <Lee.Schermerhorn@hp.com> wrote:
> 
> > Questions:
> > 
> > * do we need/want a sysctl for run time modifications?  IMO, no.
> > 
> 
> I can agree that runtime modification is not necessary. But applications or
> libnuma will not use this information ? Doing all in implicit way is enough ?
> (maybe enough)

I think it's enough.  But, maybe we should export this info as a node
attribute in sysfs?  Would be easy enough to do, if demand exists.

> 
> BTW, could you print "nodes of XXXX are ignored in INTERLEAVE mempolicy" to
> /var/log/messages at boot ?

Good idea.  It also prompts me to consider better error handling. 

How about this?

---

Introduce mask of nodes to exclude from MPOL_INTERLEAVE masks - V2

Against:  2.6.23-rc1-mm1 atop Christoph Lameter's memoryless
	  node patch set.

V1 -> V2:
+ issue KERN_NOTICE for successful parse of nodelist.
  Suggestion by Kamezawa Hiroyuki.
+ clear no_interleave_nodes nodemask and issue KERN_ERR for
  invalid nodelist argument.

This patch implements a new node state, N_INTERLEAVE, to specify
the subset of nodes with memory [state N_MEMORY] that are valid
for MPOL_INTERLEAVE node masks.  The new state mask is populated
from the N_MEMORY state mask, less any nodes excluded by a new
command line option, no_interleave_nodes.

Rationale:  some architectures and platforms include nodes with
memory that, in some cases, should never appear in MPOL_INTERLEAVE
node masks.  For example, the 'sh' architecture contains a small
amount of SRAM that is local to each cpu.  In some applications,
this memory should be reserved for explicit usage.  Another example
is the pseudo-node on HP ia64 platforms that is already interleaved
on a cache-line granularity by hardware.  Again, in some cases, we
want to reserve this for explicit usage, as it has bandwidth and
[average] latency characteristics quite different from the "real"
nodes.

Note that allocation of fresh hugepages in response to increases
in /proc/sys/vm/nr_hugepages is a form of interleaving.  I would
like to propose that allocate_fresh_huge_page() use the
N_INTERLEAVE state, as MPOL_INTERLEAVE does.  Then, one can
explicitly allocate hugepages on the excluded nodes, when needed,
using Nish Aravamudan's per node huge page sysfs attribute.
NOT in this patch.

Questions:

* do we need/want a sysctl for run time modifications?  IMO, no.
	Kame-san votes "No".

Signed-off-by:  Lee Schermerhorn <lee.schermerhorn@hp.com>

 Documentation/kernel-parameters.txt |    9 +++++++++
 include/linux/nodemask.h            |    1 +
 mm/mempolicy.c                      |    9 +++++----
 mm/page_alloc.c                     |   34 +++++++++++++++++++++++++++++++++-
 4 files changed, 48 insertions(+), 5 deletions(-)

Index: Linux/include/linux/nodemask.h
===================================================================
--- Linux.orig/include/linux/nodemask.h	2007-07-27 15:23:53.000000000 -0400
+++ Linux/include/linux/nodemask.h	2007-07-27 15:23:53.000000000 -0400
@@ -345,6 +345,7 @@ enum node_states {
 	N_ONLINE,	/* The node is online */
 	N_MEMORY,	/* The node has memory */
 	N_CPU,		/* The node has cpus */
+	N_INTERLEAVE,	/* The node is valid for MPOL_INTERLEAVE */
 	NR_NODE_STATES
 };
 
Index: Linux/mm/page_alloc.c
===================================================================
--- Linux.orig/mm/page_alloc.c	2007-07-27 15:23:53.000000000 -0400
+++ Linux/mm/page_alloc.c	2007-07-30 10:25:38.000000000 -0400
@@ -2003,6 +2003,31 @@ static char zonelist_order_name[3][8] = 
 
 
 #ifdef CONFIG_NUMA
+/*
+ * Command line:  no_interleave_nodes=<NodeList>
+ * Specify nodes to exclude from MPOL_INTERLEAVE masks.
+ */
+static nodemask_t no_interleave_nodes;	/* default:  none */
+
+static __init int setup_no_interleave_nodes(char *nodelist)
+{
+	if (nodelist) {
+		int err = nodelist_parse(nodelist, no_interleave_nodes);
+		if (err) {
+			printk(KERN_ERR
+				"Ignoring invalid no_interleave_nodes nodelist:"
+				"  %s\n", nodelist);
+			nodes_clear(no_interleave_nodes); /* all or nothing */
+			return err;
+		}
+		printk(KERN_NOTICE
+			"Nodes ignored for INTERLEAVE memory policy: %s\n",
+			nodelist);
+	}
+	return 0;
+}
+early_param("no_interleave_nodes", setup_no_interleave_nodes);
+
 /* The value user specified ....changed by config */
 static int user_zonelist_order = ZONELIST_ORDER_DEFAULT;
 /* string for sysctl */
@@ -2410,8 +2435,15 @@ static int __build_all_zonelists(void *d
 		build_zonelists(pgdat);
 		build_zonelist_cache(pgdat);
 
-		if (pgdat->node_present_pages)
+		if (pgdat->node_present_pages) {
 			node_set_state(nid, N_MEMORY);
+			/*
+			 * Only nodes with memory are valid for MPOL_INTERLEAVE,
+			 * but maybe not all of them?
+			 */
+			if (!node_isset(nid, no_interleave_nodes))
+				node_set_state(nid, N_INTERLEAVE);
+		}
 	}
 	return 0;
 }
Index: Linux/mm/mempolicy.c
===================================================================
--- Linux.orig/mm/mempolicy.c	2007-07-27 15:23:53.000000000 -0400
+++ Linux/mm/mempolicy.c	2007-07-30 11:09:20.000000000 -0400
@@ -184,7 +184,7 @@ static struct mempolicy *mpol_new(int mo
 	case MPOL_INTERLEAVE:
 		policy->v.nodes = *nodes;
 		nodes_and(policy->v.nodes, policy->v.nodes,
-					node_states[N_MEMORY]);
+					node_states[N_INTERLEAVE]);
 		if (nodes_weight(policy->v.nodes) == 0) {
 			kmem_cache_free(policy_cache, policy);
 			return ERR_PTR(-EINVAL);
@@ -1612,11 +1612,12 @@ void __init numa_policy_init(void)
 
 	/*
 	 * Set interleaving policy for system init. Interleaving is only
-	 * enabled across suitably sized nodes (default is >= 16MB), or
-	 * fall back to the largest node if they're all smaller.
+	 * enabled across suitably sized nodes (hard coded >= 16MB) on which
+	 * interleaving is allowed.  Fall back to the largest node if all
+	 * allowable nodes are smaller than the hard coded limit.
 	 */
 	nodes_clear(interleave_nodes);
-	for_each_node_state(nid, N_MEMORY) {
+	for_each_node_state(nid, N_INTERLEAVE) {
 		unsigned long total_pages = node_present_pages(nid);
 
 		/* Preserve the largest node */
Index: Linux/Documentation/kernel-parameters.txt
===================================================================
--- Linux.orig/Documentation/kernel-parameters.txt	2007-07-27 15:22:41.000000000 -0400
+++ Linux/Documentation/kernel-parameters.txt	2007-07-27 15:23:53.000000000 -0400
@@ -1181,6 +1181,15 @@ and is between 256 and 4096 characters. 
 	noinitrd	[RAM] Tells the kernel not to load any configured
 			initial RAM disk.
 
+	no_interleave_nodes [KNL, BOOT] Specifies a list of nodes to exclude
+			[remove] from any nodemask specified with the
+			MPOL_INTERLEAVE policy.  Some platforms have nodes
+			that are "special" in some way and should not be
+			used for policy based interleaving.
+			Format:  no_interleave_nodes=<NodeList>
+			NodeList format is described in
+				Documentation/filesystems/tmpfs.txt
+
 	nointroute	[IA-64]
 
 	nojitter	[IA64] Disables jitter checking for ITC timers.
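
For readers unfamiliar with the <NodeList> format, the "all or nothing"
handling of an invalid no_interleave_nodes argument can be modeled in
userspace roughly as follows.  This is a simplified sketch, not the
kernel's nodelist_parse(): it assumes at most 64 nodes, accepts only
comma-separated decimal numbers and ranges such as "0-3,7", rejects an
empty string, and reports every error as -EINVAL.  The names
parse_nodelist_sketch() and SKETCH_MAX_NODES are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

#define SKETCH_MAX_NODES 64	/* assumption: nodes fit in one uint64_t */

/*
 * Parse a nodelist such as "0-3,7" into *mask.
 * On any error *mask is left cleared and -EINVAL is returned, so a
 * partially parsed list never takes effect.
 */
int parse_nodelist_sketch(const char *s, uint64_t *mask)
{
	uint64_t m = 0;

	assert(s != NULL && mask != NULL);
	*mask = 0;			/* all or nothing */
	if (*s == '\0')
		return -EINVAL;

	for (;;) {
		char *end;
		unsigned long lo, hi, n;

		if (*s < '0' || *s > '9')
			return -EINVAL;
		lo = strtoul(s, &end, 10);
		hi = lo;
		s = end;
		if (*s == '-') {	/* range "lo-hi" */
			s++;
			if (*s < '0' || *s > '9')
				return -EINVAL;
			hi = strtoul(s, &end, 10);
			s = end;
		}
		if (lo > hi || hi >= SKETCH_MAX_NODES)
			return -EINVAL;
		for (n = lo; n <= hi; n++)
			m |= (uint64_t)1 << n;
		if (*s == '\0')
			break;
		if (*s != ',')
			return -EINVAL;
		s++;
	}
	*mask = m;
	return 0;
}
```

Parsing "0-3,7" sets bits 0-3 and 7, while "0-3,x" fails and leaves
the mask empty, mirroring the nodes_clear() call in
setup_no_interleave_nodes().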

