[PATCH 07/10] sched: Track quality and strength of convergence

From: Ingo Molnar <mingo@kernel.org>
To: linux-kernel@vger.kernel.org, linux-mm@kvack.org
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>,
	Paul Turner <pjt@google.com>,
	Lee Schermerhorn <Lee.Schermerhorn@hp.com>,
	Christoph Lameter <cl@linux.com>, Rik van Riel <riel@redhat.com>,
	Mel Gorman <mgorman@suse.de>,
	Andrew Morton <akpm@linux-foundation.org>,
	Andrea Arcangeli <aarcange@redhat.com>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Thomas Gleixner <tglx@linutronix.de>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Hugh Dickins <hughd@google.com>
Subject: [PATCH 07/10] sched: Track quality and strength of convergence
Date: Fri, 30 Nov 2012 20:58:38 +0100
Message-ID: <1354305521-11583-8-git-send-email-mingo@kernel.org>
In-Reply-To: <1354305521-11583-1-git-send-email-mingo@kernel.org>

Track strength of convergence, which is a value between 1 and 1024.
This will be used by the placement logic later on.

A strength value of 1024 means that the workload has fully
converged: all faults since the last scan period came from a
single node.

A value of 1024/nr_nodes means a totally spread out working set.

'max_faults' is the number of faults observed on the highest-faulting node.
'sum_faults' is the total number of faults from the last scan, averaged
over ~16 periods.
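
As a rough illustration of the calculation described above, here is a
userspace sketch, not the kernel code itself. The struct and function
names are hypothetical; the 1024 scale and the 921 threshold (about 90%
of faults on one node) are taken from the fair.c hunk of this patch.

```c
/* Hypothetical userspace model of the convergence calculation. */
#define STRENGTH_MAX		1024UL
#define CONVERGED_THRESHOLD	 921UL	/* threshold used in the patch, ~90% of 1024 */

struct convergence {
	unsigned long strength;	/* models p->convergence_strength */
	int node;		/* models p->convergence_node, -1 if not converged */
};

static void calc_convergence(struct convergence *c, int max_node,
			     unsigned long max_faults, unsigned long sum_faults)
{
	/* No faults recorded: leave the previous values alone. */
	if (!sum_faults)
		return;

	/* Share of faults on the busiest node, scaled to 0..1024. */
	c->strength = STRENGTH_MAX * max_faults / sum_faults;

	/* Only name a convergence node once the share is high enough. */
	c->node = (c->strength >= CONVERGED_THRESHOLD) ? max_node : -1;
}
```

With four nodes and evenly spread faults (25 of 100 on the busiest node)
this yields 256, i.e. 1024/4, and no convergence node. With 90 of 100
faults on one node it yields 921, which just reaches the threshold.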

The goal of the scheduler is to maximize convergence system-wide.
Once a task has converged, it carries with it a non-trivial amount
of working set. If such a task is later migrated to another node,
its working set will migrate there as well, which is costly.

So the ultimate goal of NUMA scheduling is to let as many tasks
converge as possible, and to run them as close to their memory
as possible.

( Note: we could also sample migration activities to directly measure
  how much convergence influx there is. )

Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 include/linux/sched.h |  2 ++
 kernel/sched/core.c   |  2 ++
 kernel/sched/fair.c   | 46 ++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 50 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 8eeb866..5b2cf2e 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1509,6 +1509,8 @@ struct task_struct {
 	unsigned long numa_scan_ts_secs;
 	unsigned int numa_scan_period;
 	u64 node_stamp;			/* migration stamp  */
+	unsigned long convergence_strength;
+	int convergence_node;
 	unsigned long *numa_faults;
 	unsigned long *numa_faults_curr;
 	struct callback_head numa_scan_work;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c5a707c..47b14d1 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1555,6 +1555,8 @@ static void __sched_fork(struct task_struct *p)
 
 	p->numa_shared = -1;
 	p->node_stamp = 0ULL;
+	p->convergence_strength		= 0;
+	p->convergence_node		= -1;
 	p->numa_scan_seq = p->mm ? p->mm->numa_scan_seq : 0;
 	p->numa_faults = NULL;
 	p->numa_scan_period = sysctl_sched_numa_scan_delay;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7af89b7..1f6104a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1934,6 +1934,50 @@ clear_buddy:
 }
 
 /*
+ * Update the p->convergence_strength info, which is a value between 1 and 1024.
+ *
+ * A strength value of 1024 means that the workload has fully
+ * converged, all faults after the last scan period came from a
+ * single node.
+ *
+ * A value of 1024/nr_nodes means a totally spread out working set.
+ *
+ * 'max_faults' is the number of faults observed on the highest-faulting node.
+ * 'sum_faults' are all faults from the last scan, averaged over ~8 periods.
+ *
+ * The goal of the scheduler is to maximize convergence system-wide.
+ * Once a task has converged, it carries with it a non-trivial amount
+ * of working set. If such a task is migrated to another node later
+ * on then its working set will migrate there as well, which is a
+ * non-trivial cost.
+ *
+ * So the ultimate goal of NUMA scheduling is to let as many tasks
+ * converge as possible, and to run them as close to their memory
+ * as possible.
+ *
+ * ( Note: we could also sample migration activities to directly measure
+ *   how much convergence influx there is. )
+ */
+static void
+shared_fault_calc_convergence(struct task_struct *p, int max_node,
+			      unsigned long max_faults, unsigned long sum_faults)
+{
+	/*
+	 * If sum_faults is 0 then leave the convergence alone:
+	 */
+	if (sum_faults) {
+		p->convergence_strength = 1024L * max_faults / sum_faults;
+
+		if (p->convergence_strength >= 921) {
+			WARN_ON_ONCE(max_node == -1);
+			p->convergence_node = max_node;
+		} else {
+			p->convergence_node = -1;
+		}
+	}
+}
+
+/*
  * Called every couple of hundred milliseconds in the task's
  * execution life-time, this function decides whether to
  * change placement parameters:
@@ -1974,6 +2018,8 @@ static void task_numa_placement_tick(struct task_struct *p)
 		}
 	}
 
+	shared_fault_calc_convergence(p, ideal_node, max_faults, total[0] + total[1]);
+
 	shared_fault_full_scan_done(p);
 
 	/*
-- 
1.7.11.7


