* [tip:sched/numa] sched, mm: Introduce tsk_home_node()
@ 2012-05-18 10:31 tip-bot for Peter Zijlstra
From: tip-bot for Peter Zijlstra @ 2012-05-18 10:31 UTC
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, torvalds, a.p.zijlstra, pjt, cl, riel,
	akpm, bharata.rao, aarcange, Lee.Schermerhorn, suresh.b.siddha,
	danms, tglx

Commit-ID:  84213e2b6e2166083c3d06e91dcf54f8e136bd78
Gitweb:     http://git.kernel.org/tip/84213e2b6e2166083c3d06e91dcf54f8e136bd78
Author:     Peter Zijlstra <a.p.zijlstra@chello.nl>
AuthorDate: Sat, 3 Mar 2012 17:05:16 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Fri, 18 May 2012 08:16:20 +0200

sched, mm: Introduce tsk_home_node()

Introduce the home-node concept for tasks. In order to keep memory
locality we need something to stay local to, so we define the
home-node of a task as the node we prefer to allocate memory from and
prefer to execute on.

These are not hard guarantees, merely preferences. This allows for
optimal resource usage: we can run a task away from the home-node,
because the remote memory hit -- while expensive -- is less expensive
than not running at all, or very little, due to severe CPU overload.

Similarly, we can allocate memory from another node if our home-node
is depleted; again, some memory is better than no memory.

This patch merely introduces the basic infrastructure; all policy
comes later.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: Paul Turner <pjt@google.com>
Cc: Dan Smith <danms@us.ibm.com>
Cc: Bharata B Rao <bharata.rao@gmail.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/n/tip-41h2mswllhkskd4bnxpoi388@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 include/linux/init_task.h |    8 ++++++++
 include/linux/sched.h     |   10 ++++++++++
 kernel/sched/core.c       |   32 ++++++++++++++++++++++++++++++++
 3 files changed, 50 insertions(+), 0 deletions(-)

diff --git a/include/linux/init_task.h b/include/linux/init_task.h
index e4baff5..d3029fa 100644
--- a/include/linux/init_task.h
+++ b/include/linux/init_task.h
@@ -134,6 +134,13 @@ extern struct cred init_cred;
 
 #define INIT_TASK_COMM "swapper"
 
+#ifdef CONFIG_NUMA
+# define INIT_TASK_NUMA(tsk)						\
+	.node = -1,
+#else
+# define INIT_TASK_NUMA(tsk)
+#endif
+
 /*
  *  INIT_TASK is used to set up the first task table, touch at
  * your own risk!. Base=0, limit=0x1fffff (=2MB)
@@ -200,6 +207,7 @@ extern struct cred init_cred;
 	INIT_TRACE_RECURSION						\
 	INIT_TASK_RCU_PREEMPT(tsk)					\
 	INIT_CPUSET_SEQ							\
+	INIT_TASK_NUMA(tsk)						\
 }
 
 
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 3d64480..49378f0 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1505,6 +1505,7 @@ struct task_struct {
 	struct mempolicy *mempolicy;	/* Protected by alloc_lock */
 	short il_next;
 	short pref_node_fork;
+	int node;
 #endif
 	struct rcu_head rcu;
 
@@ -1575,6 +1576,15 @@ struct task_struct {
 /* Future-safe accessor for struct task_struct's cpus_allowed. */
 #define tsk_cpus_allowed(tsk) (&(tsk)->cpus_allowed)
 
+static inline int tsk_home_node(struct task_struct *p)
+{
+#ifdef CONFIG_NUMA
+	return p->node;
+#else
+	return -1;
+#endif
+}
+
 /*
  * Priority of a process goes from 0..MAX_PRIO-1, valid RT
  * priority is 0..MAX_RT_PRIO-1, and SCHED_NORMAL/SCHED_BATCH
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index eb9bee9..8fd0325 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6279,6 +6279,38 @@ static struct sched_domain_topology_level *sched_domain_topology = default_topol
 
 #ifdef CONFIG_NUMA
 
+/*
+ * Requeues a task ensuring it's on the right load-balance list so
+ * that it might get migrated to its new home.
+ *
+ * Note that we cannot actively migrate ourselves since our callers
+ * can be from atomic context. We rely on the regular load-balance
+ * mechanisms to move us around -- it's all preference anyway.
+ */
+void sched_setnode(struct task_struct *p, int node)
+{
+	unsigned long flags;
+	int on_rq, running;
+	struct rq *rq;
+
+	rq = task_rq_lock(p, &flags);
+	on_rq = p->on_rq;
+	running = task_current(rq, p);
+
+	if (on_rq)
+		dequeue_task(rq, p, 0);
+	if (running)
+		p->sched_class->put_prev_task(rq, p);
+
+	p->node = node;
+
+	if (running)
+		p->sched_class->set_curr_task(rq);
+	if (on_rq)
+		enqueue_task(rq, p, 0);
+	task_rq_unlock(rq, p, &flags);
+}
+
 static int sched_domains_numa_levels;
 static int sched_domains_numa_scale;
 static int *sched_domains_numa_distance;
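
To illustrate how later policy code might consume tsk_home_node(),
here is a minimal userspace sketch of the "prefer the home node, fall
back when it is depleted" rule from the changelog. It is not taken from
the patch: struct task_model, model_home_node(), pick_alloc_node() and
the free_pages[] array are hypothetical stand-ins, and the fallback
order (lowest node id first) is an assumption; a real allocator would
presumably consult zonelists and NUMA distances instead.

```c
/* Hypothetical stand-in for task_struct; only the ->node field matters. */
struct task_model {
	int node;	/* preferred (home) node, or -1 for "no home node" */
};

/* Mirrors the shape of tsk_home_node() for the CONFIG_NUMA=y case. */
static int model_home_node(const struct task_model *p)
{
	return p->node;
}

/*
 * Pick a node to allocate from. The home node is a preference, not a
 * guarantee: use it if it still has free pages, otherwise fall back to
 * the first node that does. Returns -1 if every node is depleted.
 */
static int pick_alloc_node(const struct task_model *p,
			   const unsigned long *free_pages, int nr_nodes)
{
	int home = model_home_node(p);
	int nid;

	if (home >= 0 && home < nr_nodes && free_pages[home] > 0)
		return home;

	/* Assumed fallback order: lowest node id first. */
	for (nid = 0; nid < nr_nodes; nid++)
		if (free_pages[nid] > 0)
			return nid;

	return -1;
}
```

With a task whose home node is 1 and free_pages = {5, 7, 0}, the sketch
picks node 1; once node 1 is depleted it falls back to node 0.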


* [FEATURE TREE] sched, mm: Introduce the 'home node' affinity concept
@ 2012-05-18 11:57 ` Ingo Molnar
From: Ingo Molnar @ 2012-05-18 11:57 UTC
  To: hpa, linux-kernel, a.p.zijlstra, torvalds, pjt, cl, riel,
	bharata.rao, akpm, Lee.Schermerhorn, aarcange, danms,
	suresh.b.siddha, tglx
  Cc: linux-tip-commits, bburns


* tip-bot for Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:

> Commit-ID:  84213e2b6e2166083c3d06e91dcf54f8e136bd78
> Gitweb:     http://git.kernel.org/tip/84213e2b6e2166083c3d06e91dcf54f8e136bd78
> Author:     Peter Zijlstra <a.p.zijlstra@chello.nl>
> AuthorDate: Sat, 3 Mar 2012 17:05:16 +0100
> Committer:  Ingo Molnar <mingo@kernel.org>
> CommitDate: Fri, 18 May 2012 08:16:20 +0200
> 
> sched, mm: Introduce tsk_home_node()

So, I wanted to see some progress on this issue and committed 
Peter's 'home node NUMA affinity' changes to the tip:sched/numa 
tree.

Basically the scheme Peter implemented is an extended notion of
NUMA affinity, one that both the scheduler and the MM honor -
but one that is flexible and treats affinity as a preference,
not as a hard mask:

 - For example if there's significant idle time on distant CPUs
   then the scheduler will still utilize those CPUs and fill the
   whole machine - but otherwise the scheduler and the MM will
   try to maintain good NUMA locality.

 - Similarly, memory allocations will go to the home node even if
   the task is running on another node temporarily. [as long as
   the allocation can be satisfied.]

This is a more dynamic, more intelligent version of hard-partitioning
the system and its workloads between NUMA nodes - yet it is pretty
simple: the existing MM and scheduling code mostly supported this
scheme already and needed only a small reorganization.

When home node awareness is active then applications can use new 
system calls to group themselves into affinity groups, via:

     sys_numa_tbind(tid, -1, 0);         // create new group, return new ng_id
     sys_numa_tbind(tid, -2, 0);         // returns existing ng_id
     sys_numa_tbind(tid, ng_id, 0);      // set ng_id

... and to assign memory to a NUMA group:

     sys_numa_mbind(addr, len, ng_id, 0);
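
The three sys_numa_tbind() call forms above can be modelled in plain C
as follows. This is only a sketch of the semantics as described in this
mail, not the kernel implementation: model_numa_tbind(), the
task_group[] table and the NG_CREATE/NG_QUERY names are hypothetical,
and the return values for "no group yet" (0), a successful set (0) and
errors (-1) are assumptions. sys_numa_mbind() is not modelled.

```c
#define MAX_TASKS	64
#define NG_CREATE	(-1)	/* "create new group, return new ng_id" */
#define NG_QUERY	(-2)	/* "returns existing ng_id" */

static int task_group[MAX_TASKS];	/* ng_id per tid; 0 = no group */
static int next_ng_id = 1;		/* next unused group id */

/* Userspace model of the tbind call forms; flags is unused here. */
static int model_numa_tbind(int tid, int ng_id, unsigned long flags)
{
	(void)flags;

	if (tid < 0 || tid >= MAX_TASKS)
		return -1;

	if (ng_id == NG_CREATE) {
		/* Put the task into a freshly created group. */
		task_group[tid] = next_ng_id++;
		return task_group[tid];
	}

	if (ng_id == NG_QUERY)
		return task_group[tid];

	/* Otherwise join an existing group; reject unknown ids. */
	if (ng_id <= 0 || ng_id >= next_ng_id)
		return -1;

	task_group[tid] = ng_id;
	return 0;
}
```

In this model one thread creates a group with NG_CREATE, and its peers
then pass the returned ng_id to join it.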
 
We are seeing user-space daemons trying to achieve something 
similar, for example there's "numad":

   https://fedoraproject.org/w/index.php?title=Features/numad&oldid=272815

The kernel is in a much better position to handle affinities and
resource allocation preferences, especially ones that change and
mix as dynamically as scheduling and memory allocation do - so
maybe "numad" could make use of the new syscalls and map
application/package policies into NUMA groups the kernel
recognizes.

(There are more such daemons out there, in the HPC area.)

One configurability detail I'd like to suggest to Peter: could
we make this NUMA affinity grouping capability unconditionally
available to apps, i.e. enable apps to put themselves on home node
aware policy even if the sysctl is off?

That way this capability would always be available on NUMA 
systems in an opt-in fashion, just like regular affinities are 
available. The sysctl would merely control whether all tasks on 
the system are scheduled in a home node aware fashion or not. 
(and it would still default to off)
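
Put as code, the suggestion amounts to something like the predicate
below. This is a sketch of the proposal only, not existing kernel code:
struct task_policy_model, its opted_in flag and home_node_aware() are
hypothetical names, and sysctl_on stands for whatever the system-wide
switch ends up being called.

```c
#include <stdbool.h>

/* Hypothetical per-task state: has the task joined a NUMA group? */
struct task_policy_model {
	bool opted_in;
};

/*
 * A task is scheduled in a home node aware fashion if the system-wide
 * sysctl is on, or if the task opted in by itself. With the sysctl
 * off (the default), only opted-in tasks are affected.
 */
static bool home_node_aware(bool sysctl_on,
			    const struct task_policy_model *t)
{
	return sysctl_on || t->opted_in;
}
```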

Thanks,

	Ingo


* Re: [FEATURE TREE] sched, mm: Introduce the 'home node' affinity concept
@ 2012-05-18 12:33   ` Bill Burns
From: Bill Burns @ 2012-05-18 12:33 UTC
  To: Ingo Molnar
  Cc: hpa, linux-kernel, a.p.zijlstra, torvalds, pjt, cl, riel,
	bharata.rao, akpm, Lee.Schermerhorn, aarcange, danms,
	suresh.b.siddha, tglx, linux-tip-commits, bburns, Bill Gray

On 05/18/2012 07:57 AM, Ingo Molnar wrote:
> * tip-bot for Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
>
>> Commit-ID:  84213e2b6e2166083c3d06e91dcf54f8e136bd78
>> Gitweb:     http://git.kernel.org/tip/84213e2b6e2166083c3d06e91dcf54f8e136bd78
>> Author:     Peter Zijlstra <a.p.zijlstra@chello.nl>
>> AuthorDate: Sat, 3 Mar 2012 17:05:16 +0100
>> Committer:  Ingo Molnar <mingo@kernel.org>
>> CommitDate: Fri, 18 May 2012 08:16:20 +0200
>>
>> sched, mm: Introduce tsk_home_node()
> So, I wanted to see some progress on this issue and committed
> Peter's 'home node NUMA affinity' changes to the tip:sched/numa
> tree.
>
> Basically the scheme Peter implemented is an extended notion of
> NUMA affinity, one that both the scheduler and the MM honors -
> but one that is flexible and treats affinity as a preference,
> not as a hard mask:
>
>   - For example if there's significant idle time on distant CPUs
>     then the scheduler will still utilize those CPUs and fill the
>     whole machine - but otherwise the scheduler and the MM will
>     try to maintain good NUMA locality.
>
>   - Similary, memory allocations will go to the home node even if
>     the task is running on another node temporarily. [as long as
>     the allocation can be satisfied.]
>
> This is a more dynamic, more intelligent version of hard
> partitioning the system and workloads between NUMA nodes - yet
> it is pretty simple and existing MM and scheduling code mostly
> support this scheme and needed only small reorganization.
>
> When home node awareness is active then applications can use new
> system calls to group themselves into affinity groups, via:
>
>       sys_numa_tbind(tid, -1, 0);         // create new group, return new ng_id
>       sys_numa_tbind(tid, -2, 0);         // returns existing ng_id
>       sys_numa_tbind(tid, ng_id, 0);      // set ng_id
>
> ... and to assign memory to a NUMA group:
>
>       sys_numa_mbind(addr, len, ng_id, 0);
>
> We are seeing user-space daemons trying to achieve something
> similar, for example there's "numad":
>
>     https://fedoraproject.org/w/index.php?title=Features/numad&oldid=272815
>
> the kernel is in a much better position to handle affinities and
> resource allocation preferences, especially ones that change and
> mix so dynamically as scheduling and memory allocation - so
> maybe "numad" could make use of the new syscalls and map
> application/package policies into NUMA groups the kernel
> recognizes.
Thanks for the cc on this. I have cc'ed Bill Gray who authored
numad and will leave it to him to comment on the interfaces,
but indeed this has been something that we have been looking
for, the concept of a home node (or nodes).

cheers,
  Bill (Burns)

> (There's more such daemons out there, in the HPC area.)
>
> One configurability detail I'd like to suggest to Peter: could
> we make this NUMA affinity grouping capability unconditional to
> apps, i.e. enable apps to put themselves on home node aware
> policy even if the sysctl is off?
>
> That way this capability would always be available on NUMA
> systems in an opt-in fashion, just like regular affinities are
> available. The sysctl would merely control whether all tasks on
> the system are scheduled in a home node aware fashion or not.
> (and it would still default to off)
>
> Thanks,
>
> 	Ingo

