From: Ingo Molnar <mingo@kernel.org>
To: hpa@zytor.com, linux-kernel@vger.kernel.org,
	a.p.zijlstra@chello.nl, torvalds@linux-foundation.org,
	pjt@google.com, cl@linux.com, riel@redhat.com,
	bharata.rao@gmail.com, akpm@linux-foundation.org,
	Lee.Schermerhorn@hp.com, aarcange@redhat.com, danms@us.ibm.com,
	suresh.b.siddha@intel.com, tglx@linutronix.de
Cc: linux-tip-commits@vger.kernel.org, bburns@redhat.com
Subject: [FEATURE TREE] sched, mm: Introduce the 'home node' affinity concept
Date: Fri, 18 May 2012 13:57:43 +0200
Message-ID: <20120518115742.GA19785@gmail.com>
In-Reply-To: <tip-41h2mswllhkskd4bnxpoi388@git.kernel.org>


* tip-bot for Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:

> Commit-ID:  84213e2b6e2166083c3d06e91dcf54f8e136bd78
> Gitweb:     http://git.kernel.org/tip/84213e2b6e2166083c3d06e91dcf54f8e136bd78
> Author:     Peter Zijlstra <a.p.zijlstra@chello.nl>
> AuthorDate: Sat, 3 Mar 2012 17:05:16 +0100
> Committer:  Ingo Molnar <mingo@kernel.org>
> CommitDate: Fri, 18 May 2012 08:16:20 +0200
> 
> sched, mm: Introduce tsk_home_node()

So, I wanted to see some progress on this issue and committed 
Peter's 'home node NUMA affinity' changes to the tip:sched/numa 
tree.

Basically the scheme Peter implemented is an extended notion of
NUMA affinity, one that both the scheduler and the MM honor -
but one that is flexible and treats affinity as a preference,
not as a hard mask:

 - For example if there's significant idle time on distant CPUs
   then the scheduler will still utilize those CPUs and fill the
   whole machine - but otherwise the scheduler and the MM will
   try to maintain good NUMA locality.

 - Similarly, memory allocations will go to the home node even if
   the task is running on another node temporarily. [as long as
   the allocation can be satisfied.]
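
The two rules above can be sketched as a toy model. This is not code
from the patches: the struct, the function names and the "first node
that fits" fallback are all assumptions made for illustration, and the
real scheduler weighs idle time far more carefully than a simple
idle-CPU count.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model only: none of these names exist in the kernel. */
struct toy_node {
	long free_pages;	/* pages still available on this node */
	int idle_cpus;		/* CPUs on this node with nothing to run */
};

/*
 * Node for a memory allocation: the home node whenever it can satisfy
 * the request, no matter where the task currently runs. Otherwise the
 * first node that can, or -1 if no node can.
 */
static int toy_alloc_node(const struct toy_node *nodes, size_t nr_nodes,
			  size_t home, long pages)
{
	assert(home < nr_nodes);
	if (nodes[home].free_pages >= pages)
		return (int)home;
	for (size_t n = 0; n < nr_nodes; n++)
		if (nodes[n].free_pages >= pages)
			return (int)n;
	return -1;
}

/*
 * Node to run on: the home node if it has an idle CPU, otherwise any
 * node with an idle CPU (fill the machine), otherwise stay home.
 */
static int toy_run_node(const struct toy_node *nodes, size_t nr_nodes,
			size_t home)
{
	assert(home < nr_nodes);
	if (nodes[home].idle_cpus > 0)
		return (int)home;
	for (size_t n = 0; n < nr_nodes; n++)
		if (nodes[n].idle_cpus > 0)
			return (int)n;
	return (int)home;
}
```

With node 0 as the home node, busy but with free memory, and node 1
idle, the task would run on node 1 while its allocations still land on
node 0. The preference is soft in both directions.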

This is a more dynamic, more intelligent version of hard
partitioning the system and workloads between NUMA nodes - yet
it is pretty simple, and the existing MM and scheduling code
mostly supported this scheme and needed only a small
reorganization.

When home node awareness is active then applications can use new 
system calls to group themselves into affinity groups, via:

     sys_numa_tbind(tid, -1, 0);         // create new group, return new ng_id
     sys_numa_tbind(tid, -2, 0);         // return existing ng_id
     sys_numa_tbind(tid, ng_id, 0);      // set ng_id

... and to assign memory to a NUMA group:

     sys_numa_mbind(addr, len, ng_id, 0);
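
From user space, the calls above might be wrapped roughly as follows.
This is only a sketch: the mail gives neither syscall numbers nor
argument types, so the HYP_* constants, the hyp_* function names and
the C types below are all placeholders. With the placeholder numbers
left negative, the wrappers simply fail with ENOSYS instead of
entering the kernel.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * ASSUMPTION: the mail gives no syscall numbers, so these are
 * placeholders. A negative value means "not available here".
 */
#define HYP_NR_NUMA_TBIND (-1L)
#define HYP_NR_NUMA_MBIND (-1L)

/* Special ng_id values, taken from the three calls shown above. */
#define HYP_NG_CREATE (-1)	/* create a new group, return its ng_id */
#define HYP_NG_QUERY  (-2)	/* return the task's existing ng_id */

/* Argument types are guesses; the mail only shows the call shapes. */
static long hyp_numa_tbind(int tid, int ng_id, unsigned long flags)
{
	if (HYP_NR_NUMA_TBIND < 0) {
		errno = ENOSYS;
		return -1;
	}
	return syscall(HYP_NR_NUMA_TBIND, tid, ng_id, flags);
}

static long hyp_numa_mbind(void *addr, size_t len, int ng_id,
			   unsigned long flags)
{
	if (HYP_NR_NUMA_MBIND < 0) {
		errno = ENOSYS;
		return -1;
	}
	return syscall(HYP_NR_NUMA_MBIND, addr, len, ng_id, flags);
}

/*
 * Put task 'tid' into a fresh group and assign [addr, addr + len)
 * to that group. Returns the new ng_id, or -1 with errno set.
 */
static long hyp_group_task_and_memory(int tid, void *addr, size_t len)
{
	long ng_id = hyp_numa_tbind(tid, HYP_NG_CREATE, 0);

	if (ng_id < 0)
		return -1;
	if (hyp_numa_mbind(addr, len, (int)ng_id, 0) < 0)
		return -1;
	return ng_id;
}
```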
 
We are seeing user-space daemons trying to achieve something
similar; for example, there's "numad":

   https://fedoraproject.org/w/index.php?title=Features/numad&oldid=272815

The kernel is in a much better position to handle affinities and
resource allocation preferences, especially ones that change and
mix as dynamically as scheduling and memory allocation do - so
maybe "numad" could make use of the new syscalls and map
application/package policies into NUMA groups the kernel
recognizes.

(There are more such daemons out there, in the HPC area.)

One configurability detail I'd like to suggest to Peter: could
we make this NUMA affinity grouping capability unconditionally
available to apps, i.e. enable apps to opt into a home node
aware policy even if the sysctl is off?

That way this capability would always be available on NUMA 
systems in an opt-in fashion, just like regular affinities are 
available. The sysctl would merely control whether all tasks on 
the system are scheduled in a home node aware fashion or not. 
(and it would still default to off)
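
The suggested semantics boil down to a single predicate. The sketch
below uses made-up names (there is no toy_task or
toy_home_node_aware() in the kernel) and only restates the proposal:
the sysctl covers every task, while an individual opt-in works
regardless of the sysctl.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names, for illustration only. */
struct toy_task {
	bool opted_in;	/* task joined a NUMA group via the new syscalls */
};

/* Is this task scheduled in a home node aware fashion? */
static bool toy_home_node_aware(const struct toy_task *t, bool sysctl_on)
{
	assert(t);
	return sysctl_on || t->opted_in;
}
```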

Thanks,

	Ingo
