Re: [PATCH/RFC] Allow selected nodes to be excluded from MPOL_INTERLEAVE masks

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
To: Paul Mundt <lethal@linux-sh.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>,
	linux-mm <linux-mm@kvack.org>,
	Christoph Lameter <clameter@sgi.com>,
	Nishanth Aravamudan <nacc@us.ibm.com>,
	kxr@sgi.com, ak@suse.de, akpm@linux-foundation.org,
	Eric Whitney <eric.whitney@hp.com>
Subject: Re: [PATCH/RFC] Allow selected nodes to be excluded from MPOL_INTERLEAVE masks
Date: Wed, 01 Aug 2007 09:39:18 -0400	[thread overview]
Message-ID: <1185975558.5059.18.camel@localhost> (raw)
In-Reply-To: <20070801101651.GA9113@linux-sh.org>

On Wed, 2007-08-01 at 19:16 +0900, Paul Mundt wrote:
> On Mon, Jul 30, 2007 at 12:13:48PM -0400, Lee Schermerhorn wrote:
> > Rationale:  some architectures and platforms include nodes with
> > memory that, in some cases, should never appear in MPOL_INTERLEAVE
> > node masks.  For example, the 'sh' architecture contains a small
> > amount of SRAM that is local to each cpu.  In some applications,
> > this memory should be reserved for explicit usage.  Another example
> > is the pseudo-node on HP ia64 platforms that is already interleaved
> > on a cache-line granularity by hardware.  Again, in some cases, we
> > want to reserve this for explicit usage, as it has bandwidth and
> > [average] latency characteristics quite different from the "real"
> > nodes.
> > 
> Well, it's not so much the interleave that's the problem so much as
> _when_ we interleave. The problem with the interleave node mask at system
> init is that the kernel attempts to spread out data structures across
> these nodes, which results in us being completely out of memory by the
> time we get to userspace. After we've booted, supporting MPOL_INTERLEAVE
> is not so much of a problem, applications just have to be careful with
> their allocations.
> 
> The main thing is keeping the kernel away from these nodes unless it's
> been specifically asked to fetch some memory from there. Every page does
> count.
> 
> The real problem is how we want to deal with the node avoidance mask. In
> SLOB things presently work quite well in this regard, Christoph's
> slub_nodes= patch did a similar thing:
> 
> 	http://marc.info/?l=linux-mm&m=118127465421877&w=2
> 	http://marc.info/?l=linux-mm&m=118127688911359&w=2
> 
> > Note that allocation of fresh hugepages in response to increases
> > in /proc/sys/vm/nr_hugepages is a form of interleaving.  I would
> > like to propose that allocate_fresh_huge_page() use the 
> > N_INTERLEAVE state as well as MPOL_INTERLEAVE.  Then, one can
> > explicity allocate hugepages on the excluded nodes, when needed,
> > using Nish Aravamundan's per node huge page sysfs attribute.
> > NOT in this patch.
> > 
> If we can differentiate between MPOL_INTERLEAVE from the kernel's point
> of view, and explicit MPOL_INTERLEAVE specifiers via mbind() from
> userspace, that works fine for my case. However, the mpol_new() changes
> in this patch deny small nodes the ability to ever be included in an
> MPOL_INTERLEAVE policy, when it's only the kernel policy that I have a
> problem with.

Ah, but it would only "deny small nodes" if you nominate them in the
boot option.  I haven't changed your heuristic in numa_policy_init.  So,
it will still eliminate small nodes from the boot time interleave
nodemask, independent of whether or not you specify them in the
no_interleave_nodes list.

Or am I missing your point?
> 
> Having said that, I do like the node states and using that to exclude a
> node from the system init interleave nodelist, but this still won't
> completely solve the tiny node problems.

Right, so we should keep your boot time heuristic.

> 
> > @@ -184,7 +184,7 @@ static struct mempolicy *mpol_new(int mo
> >  	case MPOL_INTERLEAVE:
> >  		policy->v.nodes = *nodes;
> >  		nodes_and(policy->v.nodes, policy->v.nodes,
> > -					node_states[N_MEMORY]);
> > +					node_states[N_INTERLEAVE]);
> >  		if (nodes_weight(policy->v.nodes) == 0) {
> >  			kmem_cache_free(policy_cache, policy);
> >  			return ERR_PTR(-EINVAL);
> 
> Leaving this as node_states[N_MEMORY] combined with the rest of the patch
> would work for me, but that sort of changes the scope of the entire patch
> ;-)

Yeah, it breaks one of my main reasons for proposing this.  I still have
no way to keep user requested interleaving off my "special" hardware
interleaved nodes in the case where we don't want this.  I should
mention that I'm assuming that the current "best practice" is to
interleave across "all available nodes" in the applications current
context.

[more follow up to later messages]

Lee

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

next prev parent reply	other threads:[~2007-08-01 13:39 UTC|newest]

Thread overview: 17+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2007-07-27 20:07 [PATCH/RFC] Allow selected nodes to be excluded from MPOL_INTERLEAVE masks Lee Schermerhorn
2007-07-28  6:19 ` KAMEZAWA Hiroyuki
2007-07-30 16:13   ` Lee Schermerhorn
2007-07-30 18:29     ` Christoph Lameter
2007-07-30 20:32       ` Lee Schermerhorn
2007-07-30 21:57         ` Christoph Lameter
2007-08-01 10:16     ` Paul Mundt
2007-08-01 10:33       ` Andi Kleen
2007-08-01 11:01         ` Paul Mundt
2007-08-01 11:07           ` Andi Kleen
2007-08-01 11:21             ` Paul Mundt
2007-08-01 13:54               ` Lee Schermerhorn
2007-08-02 17:38                 ` Mark Gross
2007-08-02 18:46                   ` Lee Schermerhorn
2007-08-06 16:42                     ` Mark Gross
2007-08-01 13:39       ` Lee Schermerhorn [this message]
2007-08-03  7:53         ` Paul Mundt

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1185975558.5059.18.camel@localhost \
    --to=lee.schermerhorn@hp.com \
    --cc=ak@suse.de \
    --cc=akpm@linux-foundation.org \
    --cc=clameter@sgi.com \
    --cc=eric.whitney@hp.com \
    --cc=kamezawa.hiroyu@jp.fujitsu.com \
    --cc=kxr@sgi.com \
    --cc=lethal@linux-sh.org \
    --cc=linux-mm@kvack.org \
    --cc=nacc@us.ibm.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.