[PATCH 0/6] pseudo-interleaving for automatic NUMA balancing

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH 0/6] pseudo-interleaving for automatic NUMA balancing
@ 2014-01-17  6:17 riel
  2014-01-17  6:17 ` [PATCH 1/6] numa,sched,mm: remove p->numa_migrate_deferred riel
                   ` (5 more replies)
  0 siblings, 6 replies; 19+ messages in thread
From: riel @ 2014-01-17  6:17 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-mm, mingo, peterz, chegu_vinod, mgorman

The current automatic NUMA balancing code base has issues with
workloads that do not fit on one NUMA load. Page migration is
slowed down, but memory distribution between the nodes where
the workload runs is essentially random, often resulting in a
suboptimal amount of memory bandwidth being available to the
workload.

In order to maximize performance of workloads that do not fit in one NUMA
node, we want to satisfy the following criteria:
1) keep private memory local to each thread
2) avoid excessive NUMA migration of pages
3) distribute shared memory across the active nodes, to
   maximize memory bandwidth available to the workload

This patch series identifies the NUMA nodes on which the workload
is actively running, and balances (somewhat lazily) the memory
between those nodes, satisfying the criteria above.

As usual, the series has had some performance testing, but it
could always benefit from more testing, on other systems.

Some performance numbers, with two 40-warehouse specjbb instances
on an 8 node system with 10 CPU cores per node, using a pre-cleanup
version of these patches, courtesy of Chegu Vinod:

numactl manual pinning
spec1.txt:           throughput =     755900.20 SPECjbb2005 bops
spec2.txt:           throughput =     754914.40 SPECjbb2005 bops

NO-pinning results (Automatic NUMA balancing, with patches)
spec1.txt:           throughput =     706439.84 SPECjbb2005 bops
spec2.txt:           throughput =     729347.75 SPECjbb2005 bops

NO-pinning results (Automatic NUMA balancing, without patches)
spec1.txt:           throughput =     667988.47 SPECjbb2005 bops
spec2.txt:           throughput =     638220.45 SPECjbb2005 bops

No Automatic NUMA and NO-pinning results
spec1.txt:           throughput =     544120.97 SPECjbb2005 bops
spec2.txt:           throughput =     453553.41 SPECjbb2005 bops

My own performance numbers are not as relevant, since I have been
running with a more hostile workload on purpose, and I have run
into a scheduler issue that caused the workload to run on only
two of the four NUMA nodes on my test system...

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH 1/6] numa,sched,mm: remove p->numa_migrate_deferred
  2014-01-17  6:17 [PATCH 0/6] pseudo-interleaving for automatic NUMA balancing riel
@ 2014-01-17  6:17 ` riel
  2014-01-17  6:17 ` [PATCH 2/6] numa,sched: track from which nodes NUMA faults are triggered riel
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 19+ messages in thread
From: riel @ 2014-01-17  6:17 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, mingo, peterz, chegu_vinod, mgorman, Rik van Riel,
	Rik van Riel

From: Rik van Riel <riel@surriel.com>

Excessive migration of pages can hurt the performance of workloads
that span multiple NUMA nodes.  However, it turns out that the
p->numa_migrate_deferred knob is a really big hammer, which does
reduce migration rates, but does not actually help performance.

Now that the second stage of the automatic numa balancing code
has stabilized, it is time to replace the simplistic migration
deferral code with something smarter.

Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Chegu Vinod <chegu_vinod@hp.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Rik van Riel <riel@surriel.com>
---
 include/linux/sched.h |  1 -
 kernel/sched/fair.c   |  8 --------
 kernel/sysctl.c       |  7 -------
 mm/mempolicy.c        | 45 ---------------------------------------------
 4 files changed, 61 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 68a0e84..97efba4 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1469,7 +1469,6 @@ struct task_struct {
 	unsigned int numa_scan_period;
 	unsigned int numa_scan_period_max;
 	int numa_preferred_nid;
-	int numa_migrate_deferred;
 	unsigned long numa_migrate_retry;
 	u64 node_stamp;			/* migration stamp  */
 	struct callback_head numa_work;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 867b0a4..41e2176 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -819,14 +819,6 @@ unsigned int sysctl_numa_balancing_scan_size = 256;
 /* Scan @scan_size MB every @scan_period after an initial @scan_delay in ms */
 unsigned int sysctl_numa_balancing_scan_delay = 1000;
 
-/*
- * After skipping a page migration on a shared page, skip N more numa page
- * migrations unconditionally. This reduces the number of NUMA migrations
- * in shared memory workloads, and has the effect of pulling tasks towards
- * where their memory lives, over pulling the memory towards the task.
- */
-unsigned int sysctl_numa_balancing_migrate_deferred = 16;
-
 static unsigned int task_nr_scan_windows(struct task_struct *p)
 {
 	unsigned long rss = 0;
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 096db74..4d19492 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -384,13 +384,6 @@ static struct ctl_table kern_table[] = {
 		.proc_handler	= proc_dointvec,
 	},
 	{
-		.procname       = "numa_balancing_migrate_deferred",
-		.data           = &sysctl_numa_balancing_migrate_deferred,
-		.maxlen         = sizeof(unsigned int),
-		.mode           = 0644,
-		.proc_handler   = proc_dointvec,
-	},
-	{
 		.procname	= "numa_balancing",
 		.data		= NULL, /* filled in by handler */
 		.maxlen		= sizeof(unsigned int),
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 36cb46c..052abac 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2301,35 +2301,6 @@ static void sp_free(struct sp_node *n)
 	kmem_cache_free(sn_cache, n);
 }
 
-#ifdef CONFIG_NUMA_BALANCING
-static bool numa_migrate_deferred(struct task_struct *p, int last_cpupid)
-{
-	/* Never defer a private fault */
-	if (cpupid_match_pid(p, last_cpupid))
-		return false;
-
-	if (p->numa_migrate_deferred) {
-		p->numa_migrate_deferred--;
-		return true;
-	}
-	return false;
-}
-
-static inline void defer_numa_migrate(struct task_struct *p)
-{
-	p->numa_migrate_deferred = sysctl_numa_balancing_migrate_deferred;
-}
-#else
-static inline bool numa_migrate_deferred(struct task_struct *p, int last_cpupid)
-{
-	return false;
-}
-
-static inline void defer_numa_migrate(struct task_struct *p)
-{
-}
-#endif /* CONFIG_NUMA_BALANCING */
-
 /**
  * mpol_misplaced - check whether current page node is valid in policy
  *
@@ -2432,24 +2403,8 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long
 		 */
 		last_cpupid = page_cpupid_xchg_last(page, this_cpupid);
 		if (!cpupid_pid_unset(last_cpupid) && cpupid_to_nid(last_cpupid) != thisnid) {
-
-			/* See sysctl_numa_balancing_migrate_deferred comment */
-			if (!cpupid_match_pid(current, last_cpupid))
-				defer_numa_migrate(current);
-
 			goto out;
 		}
-
-		/*
-		 * The quadratic filter above reduces extraneous migration
-		 * of shared pages somewhat. This code reduces it even more,
-		 * reducing the overhead of page migrations of shared pages.
-		 * This makes workloads with shared pages rely more on
-		 * "move task near its memory", and less on "move memory
-		 * towards its task", which is exactly what we want.
-		 */
-		if (numa_migrate_deferred(current, last_cpupid))
-			goto out;
 	}
 
 	if (curnid != polnid)
-- 
1.8.4.2


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 2/6] numa,sched: track from which nodes NUMA faults are triggered
  2014-01-17  6:17 [PATCH 0/6] pseudo-interleaving for automatic NUMA balancing riel
  2014-01-17  6:17 ` [PATCH 1/6] numa,sched,mm: remove p->numa_migrate_deferred riel
@ 2014-01-17  6:17 ` riel
  2014-01-17  6:17 ` [PATCH 3/6] numa,sched: build per numa_group active node mask from faults_from statistics riel
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 19+ messages in thread
From: riel @ 2014-01-17  6:17 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, mingo, peterz, chegu_vinod, mgorman, Rik van Riel,
	Rik van Riel

From: Rik van Riel <riel@surriel.com>

Track which nodes NUMA faults are triggered from, in other words
the CPUs on which the NUMA faults happened. This uses a similar
mechanism to what is used to track the memory involved in numa faults.

The next patches use this to build up a bitmap of which nodes a
workload is actively running on.

Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Chegu Vinod <chegu_vinod@hp.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Rik van Riel <riel@surriel.com>
---
 include/linux/sched.h | 10 ++++++++--
 kernel/sched/fair.c   | 30 +++++++++++++++++++++++-------
 2 files changed, 31 insertions(+), 9 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 97efba4..a9f7f05 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1492,6 +1492,14 @@ struct task_struct {
 	unsigned long *numa_faults_buffer;
 
 	/*
+	 * Track the nodes where faults are incurred. This is not very
+	 * interesting on a per-task basis, but it help with smarter
+	 * numa memory placement for groups of processes.
+	 */
+	unsigned long *numa_faults_from;
+	unsigned long *numa_faults_from_buffer;
+
+	/*
 	 * numa_faults_locality tracks if faults recorded during the last
 	 * scan window were remote/local. The task scan period is adapted
 	 * based on the locality of the faults with different weights
@@ -1594,8 +1602,6 @@ extern void task_numa_fault(int last_node, int node, int pages, int flags);
 extern pid_t task_numa_group_id(struct task_struct *p);
 extern void set_numabalancing_state(bool enabled);
 extern void task_numa_free(struct task_struct *p);
-
-extern unsigned int sysctl_numa_balancing_migrate_deferred;
 #else
 static inline void task_numa_fault(int last_node, int node, int pages,
 				   int flags)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 41e2176..1945ddc 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -886,6 +886,7 @@ struct numa_group {
 
 	struct rcu_head rcu;
 	unsigned long total_faults;
+	unsigned long *faults_from;
 	unsigned long faults[0];
 };
 
@@ -1372,10 +1373,11 @@ static void task_numa_placement(struct task_struct *p)
 		int priv, i;
 
 		for (priv = 0; priv < 2; priv++) {
-			long diff;
+			long diff, f_diff;
 
 			i = task_faults_idx(nid, priv);
 			diff = -p->numa_faults[i];
+			f_diff = -p->numa_faults_from[i];
 
 			/* Decay existing window, copy faults since last scan */
 			p->numa_faults[i] >>= 1;
@@ -1383,12 +1385,18 @@ static void task_numa_placement(struct task_struct *p)
 			fault_types[priv] += p->numa_faults_buffer[i];
 			p->numa_faults_buffer[i] = 0;
 
+			p->numa_faults_from[i] >>= 1;
+			p->numa_faults_from[i] += p->numa_faults_from_buffer[i];
+			p->numa_faults_from_buffer[i] = 0;
+
 			faults += p->numa_faults[i];
 			diff += p->numa_faults[i];
+			f_diff += p->numa_faults_from[i];
 			p->total_numa_faults += diff;
 			if (p->numa_group) {
 				/* safe because we can only change our own group */
 				p->numa_group->faults[i] += diff;
+				p->numa_group->faults_from[i] += f_diff;
 				p->numa_group->total_faults += diff;
 				group_faults += p->numa_group->faults[i];
 			}
@@ -1457,7 +1465,7 @@ static void task_numa_group(struct task_struct *p, int cpupid, int flags,
 
 	if (unlikely(!p->numa_group)) {
 		unsigned int size = sizeof(struct numa_group) +
-				    2*nr_node_ids*sizeof(unsigned long);
+				    4*nr_node_ids*sizeof(unsigned long);
 
 		grp = kzalloc(size, GFP_KERNEL | __GFP_NOWARN);
 		if (!grp)
@@ -1467,8 +1475,10 @@ static void task_numa_group(struct task_struct *p, int cpupid, int flags,
 		spin_lock_init(&grp->lock);
 		INIT_LIST_HEAD(&grp->task_list);
 		grp->gid = p->pid;
+		/* Second half of the array tracks where faults come from */
+		grp->faults_from = grp->faults + 2 * nr_node_ids;
 
-		for (i = 0; i < 2*nr_node_ids; i++)
+		for (i = 0; i < 4*nr_node_ids; i++)
 			grp->faults[i] = p->numa_faults[i];
 
 		grp->total_faults = p->total_numa_faults;
@@ -1526,7 +1536,7 @@ static void task_numa_group(struct task_struct *p, int cpupid, int flags,
 
 	double_lock(&my_grp->lock, &grp->lock);
 
-	for (i = 0; i < 2*nr_node_ids; i++) {
+	for (i = 0; i < 4*nr_node_ids; i++) {
 		my_grp->faults[i] -= p->numa_faults[i];
 		grp->faults[i] += p->numa_faults[i];
 	}
@@ -1558,7 +1568,7 @@ void task_numa_free(struct task_struct *p)
 
 	if (grp) {
 		spin_lock(&grp->lock);
-		for (i = 0; i < 2*nr_node_ids; i++)
+		for (i = 0; i < 4*nr_node_ids; i++)
 			grp->faults[i] -= p->numa_faults[i];
 		grp->total_faults -= p->total_numa_faults;
 
@@ -1571,6 +1581,8 @@ void task_numa_free(struct task_struct *p)
 
 	p->numa_faults = NULL;
 	p->numa_faults_buffer = NULL;
+	p->numa_faults_from = NULL;
+	p->numa_faults_from_buffer = NULL;
 	kfree(numa_faults);
 }
 
@@ -1581,6 +1593,7 @@ void task_numa_fault(int last_cpupid, int node, int pages, int flags)
 {
 	struct task_struct *p = current;
 	bool migrated = flags & TNF_MIGRATED;
+	int this_node = task_node(current);
 	int priv;
 
 	if (!numabalancing_enabled)
@@ -1596,7 +1609,7 @@ void task_numa_fault(int last_cpupid, int node, int pages, int flags)
 
 	/* Allocate buffer to track faults on a per-node basis */
 	if (unlikely(!p->numa_faults)) {
-		int size = sizeof(*p->numa_faults) * 2 * nr_node_ids;
+		int size = sizeof(*p->numa_faults) * 4 * nr_node_ids;
 
 		/* numa_faults and numa_faults_buffer share the allocation */
 		p->numa_faults = kzalloc(size * 2, GFP_KERNEL|__GFP_NOWARN);
@@ -1604,7 +1617,9 @@ void task_numa_fault(int last_cpupid, int node, int pages, int flags)
 			return;
 
 		BUG_ON(p->numa_faults_buffer);
-		p->numa_faults_buffer = p->numa_faults + (2 * nr_node_ids);
+		p->numa_faults_from = p->numa_faults + (2 * nr_node_ids);
+		p->numa_faults_buffer = p->numa_faults + (4 * nr_node_ids);
+		p->numa_faults_from_buffer = p->numa_faults + (6 * nr_node_ids);
 		p->total_numa_faults = 0;
 		memset(p->numa_faults_locality, 0, sizeof(p->numa_faults_locality));
 	}
@@ -1634,6 +1649,7 @@ void task_numa_fault(int last_cpupid, int node, int pages, int flags)
 		p->numa_pages_migrated += pages;
 
 	p->numa_faults_buffer[task_faults_idx(node, priv)] += pages;
+	p->numa_faults_from_buffer[task_faults_idx(this_node, priv)] += pages;
 	p->numa_faults_locality[!!(flags & TNF_FAULT_LOCAL)] += pages;
 }
 
-- 
1.8.4.2


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 3/6] numa,sched: build per numa_group active node mask from faults_from statistics
  2014-01-17  6:17 [PATCH 0/6] pseudo-interleaving for automatic NUMA balancing riel
  2014-01-17  6:17 ` [PATCH 1/6] numa,sched,mm: remove p->numa_migrate_deferred riel
  2014-01-17  6:17 ` [PATCH 2/6] numa,sched: track from which nodes NUMA faults are triggered riel
@ 2014-01-17  6:17 ` riel
  2014-01-17  6:17 ` [PATCH 4/6] numa,sched: tracepoints for NUMA balancing active nodemask changes riel
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 19+ messages in thread
From: riel @ 2014-01-17  6:17 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, mingo, peterz, chegu_vinod, mgorman, Rik van Riel,
	Rik van Riel

From: Rik van Riel <riel@surriel.com>

The faults_from statistics are used to maintain an active_nodes nodemask
per numa_group. This allows us to be smarter about when to do numa migrations.

Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Chegu Vinod <chegu_vinod@hp.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Rik van Riel <riel@surriel.com>
---
 kernel/sched/fair.c | 38 ++++++++++++++++++++++++++++++++++++++
 1 file changed, 38 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1945ddc..aa680e2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -885,6 +885,7 @@ struct numa_group {
 	struct list_head task_list;
 
 	struct rcu_head rcu;
+	nodemask_t active_nodes;
 	unsigned long total_faults;
 	unsigned long *faults_from;
 	unsigned long faults[0];
@@ -1275,6 +1276,38 @@ static void numa_migrate_preferred(struct task_struct *p)
 }
 
 /*
+ * Iterate over the nodes from which NUMA hinting faults were triggered, in
+ * other words where the CPUs that incurred NUMA hinting faults are. The
+ * bitmask is used to limit NUMA page migrations, and spread out memory
+ * between the actively used nodes. To prevent flip-flopping, and excessive
+ * page migrations, nodes are added when they cause over 40% of the maximum
+ * number of faults, but only removed when they drop below 20%.
+ */
+static void update_numa_active_node_mask(struct task_struct *p)
+{
+	unsigned long faults, max_faults = 0;
+	struct numa_group *numa_group = p->numa_group;
+	int nid;
+
+	for_each_online_node(nid) {
+		faults = numa_group->faults_from[task_faults_idx(nid, 0)] +
+			 numa_group->faults_from[task_faults_idx(nid, 1)];
+		if (faults > max_faults)
+			max_faults = faults;
+	}
+
+	for_each_online_node(nid) {
+		faults = numa_group->faults_from[task_faults_idx(nid, 0)] +
+			 numa_group->faults_from[task_faults_idx(nid, 1)];
+		if (!node_isset(nid, numa_group->active_nodes)) {
+			if (faults > max_faults * 4 / 10)
+				node_set(nid, numa_group->active_nodes);
+		} else if (faults < max_faults * 2 / 10)
+			node_clear(nid, numa_group->active_nodes);
+	}
+}
+
+/*
  * When adapting the scan rate, the period is divided into NUMA_PERIOD_SLOTS
  * increments. The more local the fault statistics are, the higher the scan
  * period will be for the next scan window. If local/remote ratio is below
@@ -1416,6 +1449,7 @@ static void task_numa_placement(struct task_struct *p)
 	update_task_scan_period(p, fault_types[0], fault_types[1]);
 
 	if (p->numa_group) {
+		update_numa_active_node_mask(p);
 		/*
 		 * If the preferred task and group nids are different,
 		 * iterate over the nodes again to find the best place.
@@ -1478,6 +1512,8 @@ static void task_numa_group(struct task_struct *p, int cpupid, int flags,
 		/* Second half of the array tracks where faults come from */
 		grp->faults_from = grp->faults + 2 * nr_node_ids;
 
+		node_set(task_node(current), grp->active_nodes);
+
 		for (i = 0; i < 4*nr_node_ids; i++)
 			grp->faults[i] = p->numa_faults[i];
 
@@ -1547,6 +1583,8 @@ static void task_numa_group(struct task_struct *p, int cpupid, int flags,
 	my_grp->nr_tasks--;
 	grp->nr_tasks++;
 
+	update_numa_active_node_mask(p);
+
 	spin_unlock(&my_grp->lock);
 	spin_unlock(&grp->lock);
 
-- 
1.8.4.2


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 4/6] numa,sched: tracepoints for NUMA balancing active nodemask changes
  2014-01-17  6:17 [PATCH 0/6] pseudo-interleaving for automatic NUMA balancing riel
                   ` (2 preceding siblings ...)
  2014-01-17  6:17 ` [PATCH 3/6] numa,sched: build per numa_group active node mask from faults_from statistics riel
@ 2014-01-17  6:17 ` riel
  2014-01-17  6:17 ` [PATCH 5/6] numa,sched,mm: use active_nodes nodemask to limit numa migrations riel
  2014-01-17  6:17 ` [PATCH 6/6] numa,sched: normalize faults_from stats and weigh by CPU use riel
  5 siblings, 0 replies; 19+ messages in thread
From: riel @ 2014-01-17  6:17 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, mingo, peterz, chegu_vinod, mgorman, Rik van Riel,
	Rik van Riel

From: Rik van Riel <riel@surriel.com>

Being able to see how the active nodemask changes over time, and why,
can be quite useful.

Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Chegu Vinod <chegu_vinod@hp.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Rik van Riel <riel@surriel.com>
---
 include/trace/events/sched.h | 34 ++++++++++++++++++++++++++++++++++
 kernel/sched/fair.c          |  8 ++++++--
 2 files changed, 40 insertions(+), 2 deletions(-)

diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index 67e1bbf..91726b6 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -530,6 +530,40 @@ TRACE_EVENT(sched_swap_numa,
 			__entry->dst_pid, __entry->dst_tgid, __entry->dst_ngid,
 			__entry->dst_cpu, __entry->dst_nid)
 );
+
+TRACE_EVENT(update_numa_active_nodes_mask,
+
+	TP_PROTO(int pid, int gid, int nid, int set, long faults, long max_faults),
+
+	TP_ARGS(pid, gid, nid, set, faults, max_faults),
+
+	TP_STRUCT__entry(
+		__field(	pid_t,		pid)
+		__field(	pid_t,		gid)
+		__field(	int,		nid)
+		__field(	int,		set)
+		__field(	long,		faults)
+		__field(	long,		max_faults);
+	),
+
+	TP_fast_assign(
+		__entry->pid = pid;
+		__entry->gid = gid;
+		__entry->nid = nid;
+		__entry->set = set;
+		__entry->faults = faults;
+		__entry->max_faults = max_faults;
+	),
+
+	TP_printk("pid=%d gid=%d nid=%d set=%d faults=%ld max_faults=%ld",
+		__entry->pid,
+		__entry->gid,
+		__entry->nid,
+		__entry->set,
+		__entry->faults,
+		__entry->max_faults)
+
+);
 #endif /* _TRACE_SCHED_H */
 
 /* This part must be outside protection */
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index aa680e2..3551009 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1300,10 +1300,14 @@ static void update_numa_active_node_mask(struct task_struct *p)
 		faults = numa_group->faults_from[task_faults_idx(nid, 0)] +
 			 numa_group->faults_from[task_faults_idx(nid, 1)];
 		if (!node_isset(nid, numa_group->active_nodes)) {
-			if (faults > max_faults * 4 / 10)
+			if (faults > max_faults * 4 / 10) {
+				trace_update_numa_active_nodes_mask(current->pid, numa_group->gid, nid, true, faults, max_faults);
 				node_set(nid, numa_group->active_nodes);
-		} else if (faults < max_faults * 2 / 10)
+			}
+		} else if (faults < max_faults * 2 / 10) {
+			trace_update_numa_active_nodes_mask(current->pid, numa_group->gid, nid, false, faults, max_faults);
 			node_clear(nid, numa_group->active_nodes);
+		}
 	}
 }
 
-- 
1.8.4.2


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 5/6] numa,sched,mm: use active_nodes nodemask to limit numa migrations
  2014-01-17  6:17 [PATCH 0/6] pseudo-interleaving for automatic NUMA balancing riel
                   ` (3 preceding siblings ...)
  2014-01-17  6:17 ` [PATCH 4/6] numa,sched: tracepoints for NUMA balancing active nodemask changes riel
@ 2014-01-17  6:17 ` riel
  2014-01-17  6:17 ` [PATCH 6/6] numa,sched: normalize faults_from stats and weigh by CPU use riel
  5 siblings, 0 replies; 19+ messages in thread
From: riel @ 2014-01-17  6:17 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, mingo, peterz, chegu_vinod, mgorman, Rik van Riel,
	Rik van Riel

From: Rik van Riel <riel@surriel.com>

Use the active_nodes nodemask to make smarter decisions on NUMA migrations.

In order to maximize performance of workloads that do not fit in one NUMA
node, we want to satisfy the following criteria:
1) keep private memory local to each thread
2) avoid excessive NUMA migration of pages
3) distribute shared memory across the active nodes, to
   maximize memory bandwidth available to the workload

This patch accomplishes that by implementing the following policy for
NUMA migrations:
1) always migrate on a private fault
2) never migrate to a node that is not in the set of active nodes
   for the numa_group
3) always migrate from a node outside of the set of active nodes,
   to a node that is in that set
4) within the set of active nodes in the numa_group, only migrate
   from a node with more NUMA page faults, to a node with fewer
   NUMA page faults, with a 25% margin to avoid ping-ponging

This results in most pages of a workload ending up on the actively
used nodes, with reduced ping-ponging of pages between those nodes.

Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Chegu Vinod <chegu_vinod@hp.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Rik van Riel <riel@surriel.com>
---
 include/linux/sched.h |  7 +++++++
 kernel/sched/fair.c   | 37 +++++++++++++++++++++++++++++++++++++
 mm/mempolicy.c        |  3 +++
 3 files changed, 47 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index a9f7f05..0af6c1a 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1602,6 +1602,8 @@ extern void task_numa_fault(int last_node, int node, int pages, int flags);
 extern pid_t task_numa_group_id(struct task_struct *p);
 extern void set_numabalancing_state(bool enabled);
 extern void task_numa_free(struct task_struct *p);
+extern bool should_numa_migrate(struct task_struct *p, int last_cpupid,
+				int src_nid, int dst_nid);
 #else
 static inline void task_numa_fault(int last_node, int node, int pages,
 				   int flags)
@@ -1617,6 +1619,11 @@ static inline void set_numabalancing_state(bool enabled)
 static inline void task_numa_free(struct task_struct *p)
 {
 }
+static inline bool should_numa_migrate(struct task_struct *p, int last_cpupid,
+				       int src_nid, int dst_nid)
+{
+	return true;
+}
 #endif
 
 static inline struct pid *task_pid(struct task_struct *task)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3551009..8e0a53a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -948,6 +948,43 @@ static inline unsigned long group_weight(struct task_struct *p, int nid)
 	return 1000 * group_faults(p, nid) / p->numa_group->total_faults;
 }
 
+bool should_numa_migrate(struct task_struct *p, int last_cpupid,
+			 int src_nid, int dst_nid)
+{
+	struct numa_group *ng = p->numa_group;
+
+	/* Always allow migrate on private faults */
+	if (cpupid_match_pid(p, last_cpupid))
+		return true;
+
+	/* A shared fault, but p->numa_group has not been set up yet. */
+	if (!ng)
+		return true;
+
+	/*
+	 * Do not migrate if the destination is not a node that
+	 * is actively used by this numa group.
+	 */
+	if (!node_isset(dst_nid, ng->active_nodes))
+		return false;
+
+	/*
+	 * Source is a node that is not actively used by this
+	 * numa group, while the destination is. Migrate.
+	 */
+	if (!node_isset(src_nid, ng->active_nodes))
+		return true;
+
+	/*
+	 * Both source and destination are nodes in active
+	 * use by this numa group. Maximize memory bandwidth
+	 * by migrating from more heavily used groups, to less
+	 * heavily used ones, spreading the load around.
+	 * Use a 1/4 hysteresis to avoid spurious page movement.
+	 */
+	return group_faults(p, dst_nid) < (group_faults(p, src_nid) * 3 / 4);
+}
+
 static unsigned long weighted_cpuload(const int cpu);
 static unsigned long source_load(int cpu, int type);
 static unsigned long target_load(int cpu, int type);
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 052abac..050962b 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2405,6 +2405,9 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long
 		if (!cpupid_pid_unset(last_cpupid) && cpupid_to_nid(last_cpupid) != thisnid) {
 			goto out;
 		}
+
+		if (!should_numa_migrate(current, last_cpupid, curnid, polnid))
+			goto out;
 	}
 
 	if (curnid != polnid)
-- 
1.8.4.2


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 6/6] numa,sched: normalize faults_from stats and weigh by CPU use
  2014-01-17  6:17 [PATCH 0/6] pseudo-interleaving for automatic NUMA balancing riel
                   ` (4 preceding siblings ...)
  2014-01-17  6:17 ` [PATCH 5/6] numa,sched,mm: use active_nodes nodemask to limit numa migrations riel
@ 2014-01-17  6:17 ` riel
  2014-01-17  9:54   ` Peter Zijlstra
  5 siblings, 1 reply; 19+ messages in thread
From: riel @ 2014-01-17  6:17 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, mingo, peterz, chegu_vinod, mgorman, Rik van Riel,
	Rik van Riel

From: Rik van Riel <riel@surriel.com>

The tracepoint has made it abundantly clear that the naive
implementation of the faults_from code has issues.

Specifically, the garbage collector in some workloads will
access orders of magnitudes more memory than the threads
that do all the active work. This resulted in the node with
the garbage collector being marked the only active node in
the group.

This issue is avoided if we weigh the statistics by CPU use
of each task in the numa group, instead of by how many faults
each thread has occurred.

To achieve this, we normalize the number of faults to the
fraction of faults that occurred on each node, and then
multiply that fraction by the fraction of CPU time the
task has used since the last time task_numa_placement was
invoked.

This way the nodes in the active node mask will be the ones
where the tasks from the numa group are most actively running,
and the influence of eg. the garbage collector and other
do-little threads is properly minimized.

Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Chegu Vinod <chegu_vinod@hp.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Rik van Riel <riel@surriel.com>
---
 include/linux/sched.h |  2 ++
 kernel/sched/core.c   |  2 ++
 kernel/sched/fair.c   | 49 +++++++++++++++++++++++++++++++++++++++++++++++--
 3 files changed, 51 insertions(+), 2 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 0af6c1a..52de567 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1471,6 +1471,8 @@ struct task_struct {
 	int numa_preferred_nid;
 	unsigned long numa_migrate_retry;
 	u64 node_stamp;			/* migration stamp  */
+	u64 last_task_numa_placement;
+	u64 last_sum_exec_runtime;
 	struct callback_head numa_work;
 
 	struct list_head numa_entry;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 7f45fd5..9a0908a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1758,6 +1758,8 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
 	p->numa_work.next = &p->numa_work;
 	p->numa_faults = NULL;
 	p->numa_faults_buffer = NULL;
+	p->last_task_numa_placement = 0;
+	p->last_sum_exec_runtime = 0;
 
 	INIT_LIST_HEAD(&p->numa_entry);
 	p->numa_group = NULL;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8e0a53a..1601c68 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1422,11 +1422,41 @@ static void update_task_scan_period(struct task_struct *p,
 	memset(p->numa_faults_locality, 0, sizeof(p->numa_faults_locality));
 }
 
+/*
+ * Get the fraction of time the task has been running since the last
+ * NUMA placement cycle. The scheduler keeps similar statistics, but
+ * decays those on a 32ms period, which is orders of magnitude off
+ * from the dozens-of-seconds NUMA balancing period. Use the scheduler
+ * stats only if the task is so new there are no NUMA statistics yet.
+ */
+static u64 numa_get_avg_runtime(struct task_struct *p, u64 *period)
+{
+	u64 runtime, delta, now;
+	/* Use the start of this time slice to avoid calculations. */
+	now = p->se.exec_start;
+	runtime = p->se.sum_exec_runtime;
+
+	if (p->last_task_numa_placement) {
+		delta = runtime - p->last_sum_exec_runtime;
+		*period = now - p->last_task_numa_placement;
+	} else {
+		delta = p->se.avg.runnable_avg_sum;
+		*period = p->se.avg.runnable_avg_period;
+	}
+
+	p->last_sum_exec_runtime = runtime;
+	p->last_task_numa_placement = now;
+
+	return delta;
+}
+
 static void task_numa_placement(struct task_struct *p)
 {
 	int seq, nid, max_nid = -1, max_group_nid = -1;
 	unsigned long max_faults = 0, max_group_faults = 0;
 	unsigned long fault_types[2] = { 0, 0 };
+	unsigned long total_faults;
+	u64 runtime, period;
 	spinlock_t *group_lock = NULL;
 
 	seq = ACCESS_ONCE(p->mm->numa_scan_seq);
@@ -1435,6 +1465,10 @@ static void task_numa_placement(struct task_struct *p)
 	p->numa_scan_seq = seq;
 	p->numa_scan_period_max = task_scan_max(p);
 
+	total_faults = p->numa_faults_locality[0] +
+		       p->numa_faults_locality[1] + 1;
+	runtime = numa_get_avg_runtime(p, &period);
+
 	/* If the task is part of a group prevent parallel updates to group stats */
 	if (p->numa_group) {
 		group_lock = &p->numa_group->lock;
@@ -1447,7 +1481,7 @@ static void task_numa_placement(struct task_struct *p)
 		int priv, i;
 
 		for (priv = 0; priv < 2; priv++) {
-			long diff, f_diff;
+			long diff, f_diff, f_weight;
 
 			i = task_faults_idx(nid, priv);
 			diff = -p->numa_faults[i];
@@ -1459,8 +1493,19 @@ static void task_numa_placement(struct task_struct *p)
 			fault_types[priv] += p->numa_faults_buffer[i];
 			p->numa_faults_buffer[i] = 0;
 
+			/*
+			 * Normalize the faults_from, so all tasks in a group
+			 * count according to CPU use, instead of by the raw
+			 * number of faults. This prevents the situation where
+			 * the garbage collector totally dominates the stats,
+			 * and the access patterns of the worker threads are
+			 * ignored.
+			 */
+			f_weight = (1024 * runtime *
+				   p->numa_faults_from_buffer[i]) /
+				   (total_faults * period);
 			p->numa_faults_from[i] >>= 1;
-			p->numa_faults_from[i] += p->numa_faults_from_buffer[i];
+			p->numa_faults_from[i] += f_weight;
 			p->numa_faults_from_buffer[i] = 0;
 
 			faults += p->numa_faults[i];
-- 
1.8.4.2


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH 6/6] numa,sched: normalize faults_from stats and weigh by CPU use
  2014-01-17  6:17 ` [PATCH 6/6] numa,sched: normalize faults_from stats and weigh by CPU use riel
@ 2014-01-17  9:54   ` Peter Zijlstra
  2014-01-17 14:51     ` Rik van Riel
  0 siblings, 1 reply; 19+ messages in thread
From: Peter Zijlstra @ 2014-01-17  9:54 UTC (permalink / raw)
  To: riel; +Cc: linux-kernel, linux-mm, mingo, chegu_vinod, mgorman, Rik van Riel

On Fri, Jan 17, 2014 at 01:17:36AM -0500, riel@redhat.com wrote:
> +			/*
> +			 * Normalize the faults_from, so all tasks in a group
> +			 * count according to CPU use, instead of by the raw
> +			 * number of faults. This prevents the situation where
> +			 * the garbage collector totally dominates the stats,
> +			 * and the access patterns of the worker threads are
> +			 * ignored.
> +			 */

Instead of focusing on the one example (GC) here, I would suggest
saying something along the lines of: Tasks with little runtime have
little over-all impact on throughput and thus their faults are less
important.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 6/6] numa,sched: normalize faults_from stats and weigh by CPU use
  2014-01-17  9:54   ` Peter Zijlstra
@ 2014-01-17 14:51     ` Rik van Riel
  2014-01-17 15:04       ` Peter Zijlstra
  0 siblings, 1 reply; 19+ messages in thread
From: Rik van Riel @ 2014-01-17 14:51 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, linux-mm, mingo, chegu_vinod, mgorman, Rik van Riel

On 01/17/2014 04:54 AM, Peter Zijlstra wrote:
> On Fri, Jan 17, 2014 at 01:17:36AM -0500, riel@redhat.com wrote:
>> +			/*
>> +			 * Normalize the faults_from, so all tasks in a group
>> +			 * count according to CPU use, instead of by the raw
>> +			 * number of faults. This prevents the situation where
>> +			 * the garbage collector totally dominates the stats,
>> +			 * and the access patterns of the worker threads are
>> +			 * ignored.
>> +			 */
> 
> Instead of focusing on the one example (GC) here, I would suggest
> saying something along the lines of: Tasks with little runtime have
> little over-all impact on throughput and thus their faults are less
> important.

Thanks for the review, I have updated the comment
accordingly.

Does anybody else have comments on this series, or should
I start begging for Acked-by: and Reviewed-by: lines? :)

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 6/6] numa,sched: normalize faults_from stats and weigh by CPU use
  2014-01-17 14:51     ` Rik van Riel
@ 2014-01-17 15:04       ` Peter Zijlstra
  2014-01-17 18:47         ` Rik van Riel
  0 siblings, 1 reply; 19+ messages in thread
From: Peter Zijlstra @ 2014-01-17 15:04 UTC (permalink / raw)
  To: Rik van Riel
  Cc: linux-kernel, linux-mm, mingo, chegu_vinod, mgorman, Rik van Riel

On Fri, Jan 17, 2014 at 09:51:44AM -0500, Rik van Riel wrote:
> On 01/17/2014 04:54 AM, Peter Zijlstra wrote:
> > On Fri, Jan 17, 2014 at 01:17:36AM -0500, riel@redhat.com wrote:
> >> +			/*
> >> +			 * Normalize the faults_from, so all tasks in a group
> >> +			 * count according to CPU use, instead of by the raw
> >> +			 * number of faults. This prevents the situation where
> >> +			 * the garbage collector totally dominates the stats,
> >> +			 * and the access patterns of the worker threads are
> >> +			 * ignored.
> >> +			 */
> > 
> > Instead of focusing on the one example (GC) here, I would suggest
> > saying something along the lines of: Tasks with little runtime have
> > little over-all impact on throughput and thus their faults are less
> > important.
> 
> Thanks for the review, I have updated the comment
> accordingly.
> 
> Does anybody else have comments on this series, or should
> I start begging for Acked-by: and Reviewed-by: lines? :)

I still need to get me head around part of the previous patches, and I
suppose Mel might have a look, give it a few days ;-)

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 6/6] numa,sched: normalize faults_from stats and weigh by CPU use
  2014-01-17 15:04       ` Peter Zijlstra
@ 2014-01-17 18:47         ` Rik van Riel
  0 siblings, 0 replies; 19+ messages in thread
From: Rik van Riel @ 2014-01-17 18:47 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, linux-mm, mingo, chegu_vinod, mgorman, Rik van Riel

On 01/17/2014 10:04 AM, Peter Zijlstra wrote:
> On Fri, Jan 17, 2014 at 09:51:44AM -0500, Rik van Riel wrote:
>> On 01/17/2014 04:54 AM, Peter Zijlstra wrote:
>>> On Fri, Jan 17, 2014 at 01:17:36AM -0500, riel@redhat.com wrote:
>>>> +			/*
>>>> +			 * Normalize the faults_from, so all tasks in a group
>>>> +			 * count according to CPU use, instead of by the raw
>>>> +			 * number of faults. This prevents the situation where
>>>> +			 * the garbage collector totally dominates the stats,
>>>> +			 * and the access patterns of the worker threads are
>>>> +			 * ignored.
>>>> +			 */
>>>
>>> Instead of focusing on the one example (GC) here, I would suggest
>>> saying something along the lines of: Tasks with little runtime have
>>> little over-all impact on throughput and thus their faults are less
>>> important.
>>
>> Thanks for the review, I have updated the comment
>> accordingly.
>>
>> Does anybody else have comments on this series, or should
>> I start begging for Acked-by: and Reviewed-by: lines? :)
> 
> I still need to get me head around part of the previous patches, and I
> suppose Mel might have a look, give it a few days ;-)

Well, you didn't spot the divide by zero that Vinod somehow
managed to trigger :)

I am also going to add another patch, that prevents the temporary
values from "p->numa_faults[i] >>= 1;" becoming visible, by using
local variables for those calculations.

Expect a v2 soonish.

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH 3/6] numa,sched: build per numa_group active node mask from faults_from statistics
  2014-01-20 19:21 [PATCH v3 0/6] pseudo-interleaving for automatic NUMA balancing riel
@ 2014-01-20 19:21   ` riel
  0 siblings, 0 replies; 19+ messages in thread
From: riel @ 2014-01-20 19:21 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-mm, peterz, mgorman, mingo, chegu_vinod

From: Rik van Riel <riel@redhat.com>

The faults_from statistics are used to maintain an active_nodes nodemask
per numa_group. This allows us to be smarter about when to do numa migrations.

Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Chegu Vinod <chegu_vinod@hp.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
---
 kernel/sched/fair.c | 41 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 41 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1945ddc..ea8b2ae 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -885,6 +885,7 @@ struct numa_group {
 	struct list_head task_list;
 
 	struct rcu_head rcu;
+	nodemask_t active_nodes;
 	unsigned long total_faults;
 	unsigned long *faults_from;
 	unsigned long faults[0];
@@ -1275,6 +1276,41 @@ static void numa_migrate_preferred(struct task_struct *p)
 }
 
 /*
+ * Find the nodes on which the workload is actively running. We do this by
+ * tracking the nodes from which NUMA hinting faults are triggered. This can
+ * be different from the set of nodes where the workload's memory is currently
+ * located.
+ *
+ * The bitmask is used to make smarter decisions on when to do NUMA page
+ * migrations, To prevent flip-flopping, and excessive page migrations, nodes
+ * are added when they cause over 6/16 of the maximum number of faults, but
+ * only removed when they drop below 3/16.
+ */
+static void update_numa_active_node_mask(struct task_struct *p)
+{
+	unsigned long faults, max_faults = 0;
+	struct numa_group *numa_group = p->numa_group;
+	int nid;
+
+	for_each_online_node(nid) {
+		faults = numa_group->faults_from[task_faults_idx(nid, 0)] +
+			 numa_group->faults_from[task_faults_idx(nid, 1)];
+		if (faults > max_faults)
+			max_faults = faults;
+	}
+
+	for_each_online_node(nid) {
+		faults = numa_group->faults_from[task_faults_idx(nid, 0)] +
+			 numa_group->faults_from[task_faults_idx(nid, 1)];
+		if (!node_isset(nid, numa_group->active_nodes)) {
+			if (faults > max_faults * 6 / 16)
+				node_set(nid, numa_group->active_nodes);
+		} else if (faults < max_faults * 3 / 16)
+			node_clear(nid, numa_group->active_nodes);
+	}
+}
+
+/*
  * When adapting the scan rate, the period is divided into NUMA_PERIOD_SLOTS
  * increments. The more local the fault statistics are, the higher the scan
  * period will be for the next scan window. If local/remote ratio is below
@@ -1416,6 +1452,7 @@ static void task_numa_placement(struct task_struct *p)
 	update_task_scan_period(p, fault_types[0], fault_types[1]);
 
 	if (p->numa_group) {
+		update_numa_active_node_mask(p);
 		/*
 		 * If the preferred task and group nids are different,
 		 * iterate over the nodes again to find the best place.
@@ -1478,6 +1515,8 @@ static void task_numa_group(struct task_struct *p, int cpupid, int flags,
 		/* Second half of the array tracks where faults come from */
 		grp->faults_from = grp->faults + 2 * nr_node_ids;
 
+		node_set(task_node(current), grp->active_nodes);
+
 		for (i = 0; i < 4*nr_node_ids; i++)
 			grp->faults[i] = p->numa_faults[i];
 
@@ -1547,6 +1586,8 @@ static void task_numa_group(struct task_struct *p, int cpupid, int flags,
 	my_grp->nr_tasks--;
 	grp->nr_tasks++;
 
+	update_numa_active_node_mask(p);
+
 	spin_unlock(&my_grp->lock);
 	spin_unlock(&grp->lock);
 
-- 
1.8.4.2

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 3/6] numa,sched: build per numa_group active node mask from faults_from statistics
@ 2014-01-20 19:21   ` riel
  0 siblings, 0 replies; 19+ messages in thread
From: riel @ 2014-01-20 19:21 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-mm, peterz, mgorman, mingo, chegu_vinod

From: Rik van Riel <riel@redhat.com>

The faults_from statistics are used to maintain an active_nodes nodemask
per numa_group. This allows us to be smarter about when to do numa migrations.

Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Chegu Vinod <chegu_vinod@hp.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
---
 kernel/sched/fair.c | 41 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 41 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1945ddc..ea8b2ae 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -885,6 +885,7 @@ struct numa_group {
 	struct list_head task_list;
 
 	struct rcu_head rcu;
+	nodemask_t active_nodes;
 	unsigned long total_faults;
 	unsigned long *faults_from;
 	unsigned long faults[0];
@@ -1275,6 +1276,41 @@ static void numa_migrate_preferred(struct task_struct *p)
 }
 
 /*
+ * Find the nodes on which the workload is actively running. We do this by
+ * tracking the nodes from which NUMA hinting faults are triggered. This can
+ * be different from the set of nodes where the workload's memory is currently
+ * located.
+ *
+ * The bitmask is used to make smarter decisions on when to do NUMA page
+ * migrations, To prevent flip-flopping, and excessive page migrations, nodes
+ * are added when they cause over 6/16 of the maximum number of faults, but
+ * only removed when they drop below 3/16.
+ */
+static void update_numa_active_node_mask(struct task_struct *p)
+{
+	unsigned long faults, max_faults = 0;
+	struct numa_group *numa_group = p->numa_group;
+	int nid;
+
+	for_each_online_node(nid) {
+		faults = numa_group->faults_from[task_faults_idx(nid, 0)] +
+			 numa_group->faults_from[task_faults_idx(nid, 1)];
+		if (faults > max_faults)
+			max_faults = faults;
+	}
+
+	for_each_online_node(nid) {
+		faults = numa_group->faults_from[task_faults_idx(nid, 0)] +
+			 numa_group->faults_from[task_faults_idx(nid, 1)];
+		if (!node_isset(nid, numa_group->active_nodes)) {
+			if (faults > max_faults * 6 / 16)
+				node_set(nid, numa_group->active_nodes);
+		} else if (faults < max_faults * 3 / 16)
+			node_clear(nid, numa_group->active_nodes);
+	}
+}
+
+/*
  * When adapting the scan rate, the period is divided into NUMA_PERIOD_SLOTS
  * increments. The more local the fault statistics are, the higher the scan
  * period will be for the next scan window. If local/remote ratio is below
@@ -1416,6 +1452,7 @@ static void task_numa_placement(struct task_struct *p)
 	update_task_scan_period(p, fault_types[0], fault_types[1]);
 
 	if (p->numa_group) {
+		update_numa_active_node_mask(p);
 		/*
 		 * If the preferred task and group nids are different,
 		 * iterate over the nodes again to find the best place.
@@ -1478,6 +1515,8 @@ static void task_numa_group(struct task_struct *p, int cpupid, int flags,
 		/* Second half of the array tracks where faults come from */
 		grp->faults_from = grp->faults + 2 * nr_node_ids;
 
+		node_set(task_node(current), grp->active_nodes);
+
 		for (i = 0; i < 4*nr_node_ids; i++)
 			grp->faults[i] = p->numa_faults[i];
 
@@ -1547,6 +1586,8 @@ static void task_numa_group(struct task_struct *p, int cpupid, int flags,
 	my_grp->nr_tasks--;
 	grp->nr_tasks++;
 
+	update_numa_active_node_mask(p);
+
 	spin_unlock(&my_grp->lock);
 	spin_unlock(&grp->lock);
 
-- 
1.8.4.2


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH 3/6] numa,sched: build per numa_group active node mask from faults_from statistics
  2014-01-20 19:21   ` riel
@ 2014-01-21 14:19     ` Mel Gorman
  -1 siblings, 0 replies; 19+ messages in thread
From: Mel Gorman @ 2014-01-21 14:19 UTC (permalink / raw)
  To: riel; +Cc: linux-kernel, linux-mm, peterz, mingo, chegu_vinod

On Mon, Jan 20, 2014 at 02:21:04PM -0500, riel@redhat.com wrote:
> From: Rik van Riel <riel@redhat.com>
> 
> The faults_from statistics are used to maintain an active_nodes nodemask
> per numa_group. This allows us to be smarter about when to do numa migrations.
> 
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Chegu Vinod <chegu_vinod@hp.com>
> Signed-off-by: Rik van Riel <riel@redhat.com>
> ---
>  kernel/sched/fair.c | 41 +++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 41 insertions(+)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 1945ddc..ea8b2ae 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -885,6 +885,7 @@ struct numa_group {
>  	struct list_head task_list;
>  
>  	struct rcu_head rcu;
> +	nodemask_t active_nodes;
>  	unsigned long total_faults;
>  	unsigned long *faults_from;
>  	unsigned long faults[0];

It's not a concern for now but in the land of unicorns and ponies we'll
relook at the size of some of these structures and see what can be
optimised.

Similar to my comment on faults_from I think we could potentially evaluate
the fitness of the automatic NUMA balancing feature by looking at the
weight of the active_nodes for a numa_group. If
bitmask_weight(active_nodes) == nr_online_nodes
for all numa_groups in the system then I think it would be an indication
that the algorithm has collapsed.

It's not a comment on the patch itself. We could could just do with more
metrics that help analyse this thing when debugging problems.

> @@ -1275,6 +1276,41 @@ static void numa_migrate_preferred(struct task_struct *p)
>  }
>  
>  /*
> + * Find the nodes on which the workload is actively running. We do this by

hmm, it's not the workload though, it's a single NUMA group and a workload
may consist of multiple NUMA groups. For example, in an ideal world and
a JVM-based workload the application threads and the GC threads would be
in different NUMA groups.

The signature is even more misleading because the signature implies that
the function is concerned with tasks. Pass in p->numa_group

> + * tracking the nodes from which NUMA hinting faults are triggered. This can
> + * be different from the set of nodes where the workload's memory is currently
> + * located.
> + *
> + * The bitmask is used to make smarter decisions on when to do NUMA page
> + * migrations, To prevent flip-flopping, and excessive page migrations, nodes
> + * are added when they cause over 6/16 of the maximum number of faults, but
> + * only removed when they drop below 3/16.
> + */

Looking at the values, I'm guessing you did it this way to use shifts
instead of divides. That's fine, but how did you arrive at those values?
Experimentally or just felt reasonable?

> +static void update_numa_active_node_mask(struct task_struct *p)
> +{
> +	unsigned long faults, max_faults = 0;
> +	struct numa_group *numa_group = p->numa_group;
> +	int nid;
> +
> +	for_each_online_node(nid) {
> +		faults = numa_group->faults_from[task_faults_idx(nid, 0)] +
> +			 numa_group->faults_from[task_faults_idx(nid, 1)];

task_faults() implements a helper for p->numa_faults equivalent of this.
Just as with the other renaming, it would not hurt to rename task_faults()
to something like task_faults_memory() and add a task_faults_cpu() for
this. The objective again is to be clear about whether we care about CPU
or memory locality information.

> +		if (faults > max_faults)
> +			max_faults = faults;
> +	}
> +
> +	for_each_online_node(nid) {
> +		faults = numa_group->faults_from[task_faults_idx(nid, 0)] +
> +			 numa_group->faults_from[task_faults_idx(nid, 1)];

group_faults would need similar adjustment.

> +		if (!node_isset(nid, numa_group->active_nodes)) {
> +			if (faults > max_faults * 6 / 16)
> +				node_set(nid, numa_group->active_nodes);
> +		} else if (faults < max_faults * 3 / 16)
> +			node_clear(nid, numa_group->active_nodes);
> +	}
> +}
> +

I think there is a subtle problem here

/*
 * Be mindful that this is subject to sampling error. As we only have
 * data on hinting faults active_nodes may miss a heavily referenced
 * node due to the references being to a small number of pages. If
 * there is a large linear scanner in the same numa group as a
 * task operating on a small amount of memory then the latter task
 * may be ignored.
 */

I have no suggestion on how to handle this because we're vulnerable to
sampling errors in a number of places but it does not hurt to be reminded
of that in a few places.

> +/*
>   * When adapting the scan rate, the period is divided into NUMA_PERIOD_SLOTS
>   * increments. The more local the fault statistics are, the higher the scan
>   * period will be for the next scan window. If local/remote ratio is below
> @@ -1416,6 +1452,7 @@ static void task_numa_placement(struct task_struct *p)
>  	update_task_scan_period(p, fault_types[0], fault_types[1]);
>  
>  	if (p->numa_group) {
> +		update_numa_active_node_mask(p);

We are updating that thing once per scan window, that's fine. There is
potentially a wee issue though. If all the tasks in the group are threads
then they share p->mm->numa_scan_seq and only one task does the update
per scan window. If they are different processes then we could be updating
more frequently than necessary.

Functionally it'll be fine but higher cost than necessary. I do not have a
better suggestion right now as superficially a numa_scan_seq per numa_group
would not be a good fit.

If we think of nothing better and the issue is real then we can at least
stick a comment there for future reference.

>  		/*
>  		 * If the preferred task and group nids are different,
>  		 * iterate over the nodes again to find the best place.
> @@ -1478,6 +1515,8 @@ static void task_numa_group(struct task_struct *p, int cpupid, int flags,
>  		/* Second half of the array tracks where faults come from */
>  		grp->faults_from = grp->faults + 2 * nr_node_ids;
>  
> +		node_set(task_node(current), grp->active_nodes);
> +
>  		for (i = 0; i < 4*nr_node_ids; i++)
>  			grp->faults[i] = p->numa_faults[i];
>  
> @@ -1547,6 +1586,8 @@ static void task_numa_group(struct task_struct *p, int cpupid, int flags,
>  	my_grp->nr_tasks--;
>  	grp->nr_tasks++;
>  
> +	update_numa_active_node_mask(p);
> +

This may be subtle enough to deserve a comment

/* Tasks have joined/left groups and the active_mask is no longer valid */

If we left a group, we update our new group. Is the old group now out of
date and in need of updating too? If so, then we should update both and
only update the old group if it still has tasks in it.

-- 
Mel Gorman
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 3/6] numa,sched: build per numa_group active node mask from faults_from statistics
@ 2014-01-21 14:19     ` Mel Gorman
  0 siblings, 0 replies; 19+ messages in thread
From: Mel Gorman @ 2014-01-21 14:19 UTC (permalink / raw)
  To: riel; +Cc: linux-kernel, linux-mm, peterz, mingo, chegu_vinod

On Mon, Jan 20, 2014 at 02:21:04PM -0500, riel@redhat.com wrote:
> From: Rik van Riel <riel@redhat.com>
> 
> The faults_from statistics are used to maintain an active_nodes nodemask
> per numa_group. This allows us to be smarter about when to do numa migrations.
> 
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Chegu Vinod <chegu_vinod@hp.com>
> Signed-off-by: Rik van Riel <riel@redhat.com>
> ---
>  kernel/sched/fair.c | 41 +++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 41 insertions(+)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 1945ddc..ea8b2ae 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -885,6 +885,7 @@ struct numa_group {
>  	struct list_head task_list;
>  
>  	struct rcu_head rcu;
> +	nodemask_t active_nodes;
>  	unsigned long total_faults;
>  	unsigned long *faults_from;
>  	unsigned long faults[0];

It's not a concern for now but in the land of unicorns and ponies we'll
relook at the size of some of these structures and see what can be
optimised.

Similar to my comment on faults_from I think we could potentially evaluate
the fitness of the automatic NUMA balancing feature by looking at the
weight of the active_nodes for a numa_group. If
bitmask_weight(active_nodes) == nr_online_nodes
for all numa_groups in the system then I think it would be an indication
that the algorithm has collapsed.

It's not a comment on the patch itself. We could could just do with more
metrics that help analyse this thing when debugging problems.

> @@ -1275,6 +1276,41 @@ static void numa_migrate_preferred(struct task_struct *p)
>  }
>  
>  /*
> + * Find the nodes on which the workload is actively running. We do this by

hmm, it's not the workload though, it's a single NUMA group and a workload
may consist of multiple NUMA groups. For example, in an ideal world and
a JVM-based workload the application threads and the GC threads would be
in different NUMA groups.

The signature is even more misleading because the signature implies that
the function is concerned with tasks. Pass in p->numa_group

> + * tracking the nodes from which NUMA hinting faults are triggered. This can
> + * be different from the set of nodes where the workload's memory is currently
> + * located.
> + *
> + * The bitmask is used to make smarter decisions on when to do NUMA page
> + * migrations, To prevent flip-flopping, and excessive page migrations, nodes
> + * are added when they cause over 6/16 of the maximum number of faults, but
> + * only removed when they drop below 3/16.
> + */

Looking at the values, I'm guessing you did it this way to use shifts
instead of divides. That's fine, but how did you arrive at those values?
Experimentally or just felt reasonable?

> +static void update_numa_active_node_mask(struct task_struct *p)
> +{
> +	unsigned long faults, max_faults = 0;
> +	struct numa_group *numa_group = p->numa_group;
> +	int nid;
> +
> +	for_each_online_node(nid) {
> +		faults = numa_group->faults_from[task_faults_idx(nid, 0)] +
> +			 numa_group->faults_from[task_faults_idx(nid, 1)];

task_faults() implements a helper for p->numa_faults equivalent of this.
Just as with the other renaming, it would not hurt to rename task_faults()
to something like task_faults_memory() and add a task_faults_cpu() for
this. The objective again is to be clear about whether we care about CPU
or memory locality information.

> +		if (faults > max_faults)
> +			max_faults = faults;
> +	}
> +
> +	for_each_online_node(nid) {
> +		faults = numa_group->faults_from[task_faults_idx(nid, 0)] +
> +			 numa_group->faults_from[task_faults_idx(nid, 1)];

group_faults would need similar adjustment.

> +		if (!node_isset(nid, numa_group->active_nodes)) {
> +			if (faults > max_faults * 6 / 16)
> +				node_set(nid, numa_group->active_nodes);
> +		} else if (faults < max_faults * 3 / 16)
> +			node_clear(nid, numa_group->active_nodes);
> +	}
> +}
> +

I think there is a subtle problem here

/*
 * Be mindful that this is subject to sampling error. As we only have
 * data on hinting faults active_nodes may miss a heavily referenced
 * node due to the references being to a small number of pages. If
 * there is a large linear scanner in the same numa group as a
 * task operating on a small amount of memory then the latter task
 * may be ignored.
 */

I have no suggestion on how to handle this because we're vulnerable to
sampling errors in a number of places but it does not hurt to be reminded
of that in a few places.

> +/*
>   * When adapting the scan rate, the period is divided into NUMA_PERIOD_SLOTS
>   * increments. The more local the fault statistics are, the higher the scan
>   * period will be for the next scan window. If local/remote ratio is below
> @@ -1416,6 +1452,7 @@ static void task_numa_placement(struct task_struct *p)
>  	update_task_scan_period(p, fault_types[0], fault_types[1]);
>  
>  	if (p->numa_group) {
> +		update_numa_active_node_mask(p);

We are updating that thing once per scan window, that's fine. There is
potentially a wee issue though. If all the tasks in the group are threads
then they share p->mm->numa_scan_seq and only one task does the update
per scan window. If they are different processes then we could be updating
more frequently than necessary.

Functionally it'll be fine but higher cost than necessary. I do not have a
better suggestion right now as superficially a numa_scan_seq per numa_group
would not be a good fit.

If we think of nothing better and the issue is real then we can at least
stick a comment there for future reference.

>  		/*
>  		 * If the preferred task and group nids are different,
>  		 * iterate over the nodes again to find the best place.
> @@ -1478,6 +1515,8 @@ static void task_numa_group(struct task_struct *p, int cpupid, int flags,
>  		/* Second half of the array tracks where faults come from */
>  		grp->faults_from = grp->faults + 2 * nr_node_ids;
>  
> +		node_set(task_node(current), grp->active_nodes);
> +
>  		for (i = 0; i < 4*nr_node_ids; i++)
>  			grp->faults[i] = p->numa_faults[i];
>  
> @@ -1547,6 +1586,8 @@ static void task_numa_group(struct task_struct *p, int cpupid, int flags,
>  	my_grp->nr_tasks--;
>  	grp->nr_tasks++;
>  
> +	update_numa_active_node_mask(p);
> +

This may be subtle enough to deserve a comment

/* Tasks have joined/left groups and the active_mask is no longer valid */

If we left a group, we update our new group. Is the old group now out of
date and in need of updating too? If so, then we should update both and
only update the old group if it still has tasks in it.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 3/6] numa,sched: build per numa_group active node mask from faults_from statistics
  2014-01-21 14:19     ` Mel Gorman
@ 2014-01-21 15:09       ` Rik van Riel
  -1 siblings, 0 replies; 19+ messages in thread
From: Rik van Riel @ 2014-01-21 15:09 UTC (permalink / raw)
  To: Mel Gorman; +Cc: linux-kernel, linux-mm, peterz, mingo, chegu_vinod

On 01/21/2014 09:19 AM, Mel Gorman wrote:
> On Mon, Jan 20, 2014 at 02:21:04PM -0500, riel@redhat.com wrote:

>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 1945ddc..ea8b2ae 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -885,6 +885,7 @@ struct numa_group {
>>  	struct list_head task_list;
>>  
>>  	struct rcu_head rcu;
>> +	nodemask_t active_nodes;
>>  	unsigned long total_faults;
>>  	unsigned long *faults_from;
>>  	unsigned long faults[0];
> 
> It's not a concern for now but in the land of unicorns and ponies we'll
> relook at the size of some of these structures and see what can be
> optimised.

Unsigned int should be enough for systems with less than 8TB
of memory per node :)

> Similar to my comment on faults_from I think we could potentially evaluate
> the fitness of the automatic NUMA balancing feature by looking at the
> weight of the active_nodes for a numa_group. If
> bitmask_weight(active_nodes) == nr_online_nodes
> for all numa_groups in the system then I think it would be an indication
> that the algorithm has collapsed.

If the system runs one very large workload, I would expect the
scheduler to spread that workload across all nodes.

In that situation, it is perfectly legitimate for all nodes
to end up being marked as active nodes, and for the system
to try distribute the workload's memory somewhat evenly
between them.

> It's not a comment on the patch itself. We could could just do with more
> metrics that help analyse this thing when debugging problems.
> 
>> @@ -1275,6 +1276,41 @@ static void numa_migrate_preferred(struct task_struct *p)
>>  }
>>  
>>  /*
>> + * Find the nodes on which the workload is actively running. We do this by
> 
> hmm, it's not the workload though, it's a single NUMA group and a workload
> may consist of multiple NUMA groups. For example, in an ideal world and
> a JVM-based workload the application threads and the GC threads would be
> in different NUMA groups.

Why should they be in a different numa group?

The rest of the series contains patches to make sure they
should be just fine together in the same group...

> The signature is even more misleading because the signature implies that
> the function is concerned with tasks. Pass in p->numa_group

Will do.

>> + * tracking the nodes from which NUMA hinting faults are triggered. This can
>> + * be different from the set of nodes where the workload's memory is currently
>> + * located.
>> + *
>> + * The bitmask is used to make smarter decisions on when to do NUMA page
>> + * migrations, To prevent flip-flopping, and excessive page migrations, nodes
>> + * are added when they cause over 6/16 of the maximum number of faults, but
>> + * only removed when they drop below 3/16.
>> + */
> 
> Looking at the values, I'm guessing you did it this way to use shifts
> instead of divides. That's fine, but how did you arrive at those values?
> Experimentally or just felt reasonable?

Experimentally I got to 20% and 40%.  Peter suggested I change it
to 3/16 and 6/16, which appear to give identical performance.

>> +static void update_numa_active_node_mask(struct task_struct *p)
>> +{
>> +	unsigned long faults, max_faults = 0;
>> +	struct numa_group *numa_group = p->numa_group;
>> +	int nid;
>> +
>> +	for_each_online_node(nid) {
>> +		faults = numa_group->faults_from[task_faults_idx(nid, 0)] +
>> +			 numa_group->faults_from[task_faults_idx(nid, 1)];
> 
> task_faults() implements a helper for p->numa_faults equivalent of this.
> Just as with the other renaming, it would not hurt to rename task_faults()
> to something like task_faults_memory() and add a task_faults_cpu() for
> this. The objective again is to be clear about whether we care about CPU
> or memory locality information.

Will do.

>> +		if (faults > max_faults)
>> +			max_faults = faults;
>> +	}
>> +
>> +	for_each_online_node(nid) {
>> +		faults = numa_group->faults_from[task_faults_idx(nid, 0)] +
>> +			 numa_group->faults_from[task_faults_idx(nid, 1)];
> 
> group_faults would need similar adjustment.
> 
>> +		if (!node_isset(nid, numa_group->active_nodes)) {
>> +			if (faults > max_faults * 6 / 16)
>> +				node_set(nid, numa_group->active_nodes);
>> +		} else if (faults < max_faults * 3 / 16)
>> +			node_clear(nid, numa_group->active_nodes);
>> +	}
>> +}
>> +
> 
> I think there is a subtle problem here

Can you be more specific about what problem you think the hysteresis
could be causing?

> /*
>  * Be mindful that this is subject to sampling error. As we only have
>  * data on hinting faults active_nodes may miss a heavily referenced
>  * node due to the references being to a small number of pages. If
>  * there is a large linear scanner in the same numa group as a
>  * task operating on a small amount of memory then the latter task
>  * may be ignored.
>  */
> 
> I have no suggestion on how to handle this

Since the numa_faults_cpu statistics are all about driving
memory-follows-cpu, there actually is a decent way to handle
it.  See patch 5 :)

>> +/*
>>   * When adapting the scan rate, the period is divided into NUMA_PERIOD_SLOTS
>>   * increments. The more local the fault statistics are, the higher the scan
>>   * period will be for the next scan window. If local/remote ratio is below
>> @@ -1416,6 +1452,7 @@ static void task_numa_placement(struct task_struct *p)
>>  	update_task_scan_period(p, fault_types[0], fault_types[1]);
>>  
>>  	if (p->numa_group) {
>> +		update_numa_active_node_mask(p);
> 
> We are updating that thing once per scan window, that's fine. There is
> potentially a wee issue though. If all the tasks in the group are threads
> then they share p->mm->numa_scan_seq and only one task does the update
> per scan window. If they are different processes then we could be updating
> more frequently than necessary.
> 
> Functionally it'll be fine but higher cost than necessary. I do not have a
> better suggestion right now as superficially a numa_scan_seq per numa_group
> would not be a good fit.

I suspect this cost will be small anyway, compared to the costs
incurred in both the earlier part of task_numa_placement, and
in the code where we may look for a better place to migrate the
task to.

This just iterates over memory we have already touched before
(likely to still be cached), and does some cheap comparisons.

>>  		/*
>>  		 * If the preferred task and group nids are different,
>>  		 * iterate over the nodes again to find the best place.
>> @@ -1478,6 +1515,8 @@ static void task_numa_group(struct task_struct *p, int cpupid, int flags,
>>  		/* Second half of the array tracks where faults come from */
>>  		grp->faults_from = grp->faults + 2 * nr_node_ids;
>>  
>> +		node_set(task_node(current), grp->active_nodes);
>> +
>>  		for (i = 0; i < 4*nr_node_ids; i++)
>>  			grp->faults[i] = p->numa_faults[i];
>>  
>> @@ -1547,6 +1586,8 @@ static void task_numa_group(struct task_struct *p, int cpupid, int flags,
>>  	my_grp->nr_tasks--;
>>  	grp->nr_tasks++;
>>  
>> +	update_numa_active_node_mask(p);
>> +
> 
> This may be subtle enough to deserve a comment
> 
> /* Tasks have joined/left groups and the active_mask is no longer valid */

I have added a comment.

> If we left a group, we update our new group. Is the old group now out of
> date and in need of updating too?

The entire old group will join the new group, and the old group
is freed.

-- 
All rights reversed

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 3/6] numa,sched: build per numa_group active node mask from faults_from statistics
@ 2014-01-21 15:09       ` Rik van Riel
  0 siblings, 0 replies; 19+ messages in thread
From: Rik van Riel @ 2014-01-21 15:09 UTC (permalink / raw)
  To: Mel Gorman; +Cc: linux-kernel, linux-mm, peterz, mingo, chegu_vinod

On 01/21/2014 09:19 AM, Mel Gorman wrote:
> On Mon, Jan 20, 2014 at 02:21:04PM -0500, riel@redhat.com wrote:

>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 1945ddc..ea8b2ae 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -885,6 +885,7 @@ struct numa_group {
>>  	struct list_head task_list;
>>  
>>  	struct rcu_head rcu;
>> +	nodemask_t active_nodes;
>>  	unsigned long total_faults;
>>  	unsigned long *faults_from;
>>  	unsigned long faults[0];
> 
> It's not a concern for now but in the land of unicorns and ponies we'll
> relook at the size of some of these structures and see what can be
> optimised.

Unsigned int should be enough for systems with less than 8TB
of memory per node :)

> Similar to my comment on faults_from I think we could potentially evaluate
> the fitness of the automatic NUMA balancing feature by looking at the
> weight of the active_nodes for a numa_group. If
> bitmask_weight(active_nodes) == nr_online_nodes
> for all numa_groups in the system then I think it would be an indication
> that the algorithm has collapsed.

If the system runs one very large workload, I would expect the
scheduler to spread that workload across all nodes.

In that situation, it is perfectly legitimate for all nodes
to end up being marked as active nodes, and for the system
to try distribute the workload's memory somewhat evenly
between them.

> It's not a comment on the patch itself. We could could just do with more
> metrics that help analyse this thing when debugging problems.
> 
>> @@ -1275,6 +1276,41 @@ static void numa_migrate_preferred(struct task_struct *p)
>>  }
>>  
>>  /*
>> + * Find the nodes on which the workload is actively running. We do this by
> 
> hmm, it's not the workload though, it's a single NUMA group and a workload
> may consist of multiple NUMA groups. For example, in an ideal world and
> a JVM-based workload the application threads and the GC threads would be
> in different NUMA groups.

Why should they be in a different numa group?

The rest of the series contains patches to make sure they
should be just fine together in the same group...

> The signature is even more misleading because the signature implies that
> the function is concerned with tasks. Pass in p->numa_group

Will do.

>> + * tracking the nodes from which NUMA hinting faults are triggered. This can
>> + * be different from the set of nodes where the workload's memory is currently
>> + * located.
>> + *
>> + * The bitmask is used to make smarter decisions on when to do NUMA page
>> + * migrations, To prevent flip-flopping, and excessive page migrations, nodes
>> + * are added when they cause over 6/16 of the maximum number of faults, but
>> + * only removed when they drop below 3/16.
>> + */
> 
> Looking at the values, I'm guessing you did it this way to use shifts
> instead of divides. That's fine, but how did you arrive at those values?
> Experimentally or just felt reasonable?

Experimentally I got to 20% and 40%.  Peter suggested I change it
to 3/16 and 6/16, which appear to give identical performance.

>> +static void update_numa_active_node_mask(struct task_struct *p)
>> +{
>> +	unsigned long faults, max_faults = 0;
>> +	struct numa_group *numa_group = p->numa_group;
>> +	int nid;
>> +
>> +	for_each_online_node(nid) {
>> +		faults = numa_group->faults_from[task_faults_idx(nid, 0)] +
>> +			 numa_group->faults_from[task_faults_idx(nid, 1)];
> 
> task_faults() implements a helper for p->numa_faults equivalent of this.
> Just as with the other renaming, it would not hurt to rename task_faults()
> to something like task_faults_memory() and add a task_faults_cpu() for
> this. The objective again is to be clear about whether we care about CPU
> or memory locality information.

Will do.

>> +		if (faults > max_faults)
>> +			max_faults = faults;
>> +	}
>> +
>> +	for_each_online_node(nid) {
>> +		faults = numa_group->faults_from[task_faults_idx(nid, 0)] +
>> +			 numa_group->faults_from[task_faults_idx(nid, 1)];
> 
> group_faults would need similar adjustment.
> 
>> +		if (!node_isset(nid, numa_group->active_nodes)) {
>> +			if (faults > max_faults * 6 / 16)
>> +				node_set(nid, numa_group->active_nodes);
>> +		} else if (faults < max_faults * 3 / 16)
>> +			node_clear(nid, numa_group->active_nodes);
>> +	}
>> +}
>> +
> 
> I think there is a subtle problem here

Can you be more specific about what problem you think the hysteresis
could be causing?

> /*
>  * Be mindful that this is subject to sampling error. As we only have
>  * data on hinting faults active_nodes may miss a heavily referenced
>  * node due to the references being to a small number of pages. If
>  * there is a large linear scanner in the same numa group as a
>  * task operating on a small amount of memory then the latter task
>  * may be ignored.
>  */
> 
> I have no suggestion on how to handle this

Since the numa_faults_cpu statistics are all about driving
memory-follows-cpu, there actually is a decent way to handle
it.  See patch 5 :)

>> +/*
>>   * When adapting the scan rate, the period is divided into NUMA_PERIOD_SLOTS
>>   * increments. The more local the fault statistics are, the higher the scan
>>   * period will be for the next scan window. If local/remote ratio is below
>> @@ -1416,6 +1452,7 @@ static void task_numa_placement(struct task_struct *p)
>>  	update_task_scan_period(p, fault_types[0], fault_types[1]);
>>  
>>  	if (p->numa_group) {
>> +		update_numa_active_node_mask(p);
> 
> We are updating that thing once per scan window, that's fine. There is
> potentially a wee issue though. If all the tasks in the group are threads
> then they share p->mm->numa_scan_seq and only one task does the update
> per scan window. If they are different processes then we could be updating
> more frequently than necessary.
> 
> Functionally it'll be fine but higher cost than necessary. I do not have a
> better suggestion right now as superficially a numa_scan_seq per numa_group
> would not be a good fit.

I suspect this cost will be small anyway, compared to the costs
incurred in both the earlier part of task_numa_placement, and
in the code where we may look for a better place to migrate the
task to.

This just iterates over memory we have already touched before
(likely to still be cached), and does some cheap comparisons.

>>  		/*
>>  		 * If the preferred task and group nids are different,
>>  		 * iterate over the nodes again to find the best place.
>> @@ -1478,6 +1515,8 @@ static void task_numa_group(struct task_struct *p, int cpupid, int flags,
>>  		/* Second half of the array tracks where faults come from */
>>  		grp->faults_from = grp->faults + 2 * nr_node_ids;
>>  
>> +		node_set(task_node(current), grp->active_nodes);
>> +
>>  		for (i = 0; i < 4*nr_node_ids; i++)
>>  			grp->faults[i] = p->numa_faults[i];
>>  
>> @@ -1547,6 +1586,8 @@ static void task_numa_group(struct task_struct *p, int cpupid, int flags,
>>  	my_grp->nr_tasks--;
>>  	grp->nr_tasks++;
>>  
>> +	update_numa_active_node_mask(p);
>> +
> 
> This may be subtle enough to deserve a comment
> 
> /* Tasks have joined/left groups and the active_mask is no longer valid */

I have added a comment.

> If we left a group, we update our new group. Is the old group now out of
> date and in need of updating too?

The entire old group will join the new group, and the old group
is freed.

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 3/6] numa,sched: build per numa_group active node mask from faults_from statistics
  2014-01-21 15:09       ` Rik van Riel
@ 2014-01-21 15:41         ` Mel Gorman
  -1 siblings, 0 replies; 19+ messages in thread
From: Mel Gorman @ 2014-01-21 15:41 UTC (permalink / raw)
  To: Rik van Riel; +Cc: linux-kernel, linux-mm, peterz, mingo, chegu_vinod

On Tue, Jan 21, 2014 at 10:09:14AM -0500, Rik van Riel wrote:
> On 01/21/2014 09:19 AM, Mel Gorman wrote:
> > On Mon, Jan 20, 2014 at 02:21:04PM -0500, riel@redhat.com wrote:
> 
> >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >> index 1945ddc..ea8b2ae 100644
> >> --- a/kernel/sched/fair.c
> >> +++ b/kernel/sched/fair.c
> >> @@ -885,6 +885,7 @@ struct numa_group {
> >>  	struct list_head task_list;
> >>  
> >>  	struct rcu_head rcu;
> >> +	nodemask_t active_nodes;
> >>  	unsigned long total_faults;
> >>  	unsigned long *faults_from;
> >>  	unsigned long faults[0];
> > 
> > It's not a concern for now but in the land of unicorns and ponies we'll
> > relook at the size of some of these structures and see what can be
> > optimised.
> 
> Unsigned int should be enough for systems with less than 8TB
> of memory per node :)
> 

Is it not bigger than that?

typedef struct { DECLARE_BITMAP(bits, MAX_NUMNODES); } nodemask_t;

so it depends on the value of NODES_SHIFT? Anyway, not worth getting
into a twist over.

> > Similar to my comment on faults_from I think we could potentially evaluate
> > the fitness of the automatic NUMA balancing feature by looking at the
> > weight of the active_nodes for a numa_group. If
> > bitmask_weight(active_nodes) == nr_online_nodes
> > for all numa_groups in the system then I think it would be an indication
> > that the algorithm has collapsed.
> 
> If the system runs one very large workload, I would expect the
> scheduler to spread that workload across all nodes.
> 
> In that situation, it is perfectly legitimate for all nodes
> to end up being marked as active nodes, and for the system
> to try distribute the workload's memory somewhat evenly
> between them.
> 

In the specific case where the workload is not partitioned and really
accessing all of memory then sure, it'll be spread throughout the
system. However, if we are looking at a case like multiple JVMs sized to
fit within nodes then the metric would hold.

> > It's not a comment on the patch itself. We could could just do with more
> > metrics that help analyse this thing when debugging problems.
> > 
> >> @@ -1275,6 +1276,41 @@ static void numa_migrate_preferred(struct task_struct *p)
> >>  }
> >>  
> >>  /*
> >> + * Find the nodes on which the workload is actively running. We do this by
> > 
> > hmm, it's not the workload though, it's a single NUMA group and a workload
> > may consist of multiple NUMA groups. For example, in an ideal world and
> > a JVM-based workload the application threads and the GC threads would be
> > in different NUMA groups.
> 
> Why should they be in a different numa group?
> 

It would be ideal that they are in different groups so the hinting faults
incurred by the garbage collector (linear scan of the address space)
does not affect scheduling and placement decisions based on the numa
groups fault statistics.

> The rest of the series contains patches to make sure they
> should be just fine together in the same group...
> 
> > The signature is even more misleading because the signature implies that
> > the function is concerned with tasks. Pass in p->numa_group
> 
> Will do.
> 
> >> + * tracking the nodes from which NUMA hinting faults are triggered. This can
> >> + * be different from the set of nodes where the workload's memory is currently
> >> + * located.
> >> + *
> >> + * The bitmask is used to make smarter decisions on when to do NUMA page
> >> + * migrations, To prevent flip-flopping, and excessive page migrations, nodes
> >> + * are added when they cause over 6/16 of the maximum number of faults, but
> >> + * only removed when they drop below 3/16.
> >> + */
> > 
> > Looking at the values, I'm guessing you did it this way to use shifts
> > instead of divides. That's fine, but how did you arrive at those values?
> > Experimentally or just felt reasonable?
> 
> Experimentally I got to 20% and 40%.  Peter suggested I change it
> to 3/16 and 6/16, which appear to give identical performance.
> 

Cool

> >> +static void update_numa_active_node_mask(struct task_struct *p)
> >> +{
> >> +	unsigned long faults, max_faults = 0;
> >> +	struct numa_group *numa_group = p->numa_group;
> >> +	int nid;
> >> +
> >> +	for_each_online_node(nid) {
> >> +		faults = numa_group->faults_from[task_faults_idx(nid, 0)] +
> >> +			 numa_group->faults_from[task_faults_idx(nid, 1)];
> > 
> > task_faults() implements a helper for p->numa_faults equivalent of this.
> > Just as with the other renaming, it would not hurt to rename task_faults()
> > to something like task_faults_memory() and add a task_faults_cpu() for
> > this. The objective again is to be clear about whether we care about CPU
> > or memory locality information.
> 
> Will do.
> 
> >> +		if (faults > max_faults)
> >> +			max_faults = faults;
> >> +	}
> >> +
> >> +	for_each_online_node(nid) {
> >> +		faults = numa_group->faults_from[task_faults_idx(nid, 0)] +
> >> +			 numa_group->faults_from[task_faults_idx(nid, 1)];
> > 
> > group_faults would need similar adjustment.
> > 
> >> +		if (!node_isset(nid, numa_group->active_nodes)) {
> >> +			if (faults > max_faults * 6 / 16)
> >> +				node_set(nid, numa_group->active_nodes);
> >> +		} else if (faults < max_faults * 3 / 16)
> >> +			node_clear(nid, numa_group->active_nodes);
> >> +	}
> >> +}
> >> +
> > 
> > I think there is a subtle problem here
> 
> Can you be more specific about what problem you think the hysteresis
> could be causing?
> 

Lets say

Thread A: Most important thread for performance, accesses small amounts
	of memory during each scan window. Lets say it's doing calculations
	over a large cache-aware structure of some description

Thread B: Big stupid linear scanner accessing all of memory for whatever
	reason.

Thread B will incur more NUMA hinting faults because it is accessing
idle memory that is unused by Thread A. The fault stats and placement
decisions are then skewed in favour of Thread B because Thread A did not
trap enough hinting faults.

It's a theoretical problem.

> > /*
> >  * Be mindful that this is subject to sampling error. As we only have
> >  * data on hinting faults active_nodes may miss a heavily referenced
> >  * node due to the references being to a small number of pages. If
> >  * there is a large linear scanner in the same numa group as a
> >  * task operating on a small amount of memory then the latter task
> >  * may be ignored.
> >  */
> > 
> > I have no suggestion on how to handle this
> 
> Since the numa_faults_cpu statistics are all about driving
> memory-follows-cpu, there actually is a decent way to handle
> it.  See patch 5 :)
> 
> >> +/*
> >>   * When adapting the scan rate, the period is divided into NUMA_PERIOD_SLOTS
> >>   * increments. The more local the fault statistics are, the higher the scan
> >>   * period will be for the next scan window. If local/remote ratio is below
> >> @@ -1416,6 +1452,7 @@ static void task_numa_placement(struct task_struct *p)
> >>  	update_task_scan_period(p, fault_types[0], fault_types[1]);
> >>  
> >>  	if (p->numa_group) {
> >> +		update_numa_active_node_mask(p);
> > 
> > We are updating that thing once per scan window, that's fine. There is
> > potentially a wee issue though. If all the tasks in the group are threads
> > then they share p->mm->numa_scan_seq and only one task does the update
> > per scan window. If they are different processes then we could be updating
> > more frequently than necessary.
> > 
> > Functionally it'll be fine but higher cost than necessary. I do not have a
> > better suggestion right now as superficially a numa_scan_seq per numa_group
> > would not be a good fit.
> 
> I suspect this cost will be small anyway, compared to the costs
> incurred in both the earlier part of task_numa_placement, and
> in the code where we may look for a better place to migrate the
> task to.
> 
> This just iterates over memory we have already touched before
> (likely to still be cached), and does some cheap comparisons.
> 

Fair enough. It'll show up in profiles if it's a problem anyway.

> >>  		/*
> >>  		 * If the preferred task and group nids are different,
> >>  		 * iterate over the nodes again to find the best place.
> >> @@ -1478,6 +1515,8 @@ static void task_numa_group(struct task_struct *p, int cpupid, int flags,
> >>  		/* Second half of the array tracks where faults come from */
> >>  		grp->faults_from = grp->faults + 2 * nr_node_ids;
> >>  
> >> +		node_set(task_node(current), grp->active_nodes);
> >> +
> >>  		for (i = 0; i < 4*nr_node_ids; i++)
> >>  			grp->faults[i] = p->numa_faults[i];
> >>  
> >> @@ -1547,6 +1586,8 @@ static void task_numa_group(struct task_struct *p, int cpupid, int flags,
> >>  	my_grp->nr_tasks--;
> >>  	grp->nr_tasks++;
> >>  
> >> +	update_numa_active_node_mask(p);
> >> +
> > 
> > This may be subtle enough to deserve a comment
> > 
> > /* Tasks have joined/left groups and the active_mask is no longer valid */
> 
> I have added a comment.
> 
> > If we left a group, we update our new group. Is the old group now out of
> > date and in need of updating too?
> 
> The entire old group will join the new group, and the old group
> is freed.
> 

We reference count the old group so that it only gets freed when the
last task leaves it. If the old group was guaranteed to be destroyed
there would be no need to do stuff like

list_move(&p->numa_entry, &grp->task_list);
my_grp->total_faults -= p->total_numa_faults;
my_grp->nr_tasks--;

All that reads as "a single task is moving group" and not the entire
old group joins the new group. I expected that the old group was only
guaranteed to be destroyed in the case where we had just allocated it
because p->numa_group was NULL when task_numa_group was called.

-- 
Mel Gorman
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 3/6] numa,sched: build per numa_group active node mask from faults_from statistics
@ 2014-01-21 15:41         ` Mel Gorman
  0 siblings, 0 replies; 19+ messages in thread
From: Mel Gorman @ 2014-01-21 15:41 UTC (permalink / raw)
  To: Rik van Riel; +Cc: linux-kernel, linux-mm, peterz, mingo, chegu_vinod

On Tue, Jan 21, 2014 at 10:09:14AM -0500, Rik van Riel wrote:
> On 01/21/2014 09:19 AM, Mel Gorman wrote:
> > On Mon, Jan 20, 2014 at 02:21:04PM -0500, riel@redhat.com wrote:
> 
> >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >> index 1945ddc..ea8b2ae 100644
> >> --- a/kernel/sched/fair.c
> >> +++ b/kernel/sched/fair.c
> >> @@ -885,6 +885,7 @@ struct numa_group {
> >>  	struct list_head task_list;
> >>  
> >>  	struct rcu_head rcu;
> >> +	nodemask_t active_nodes;
> >>  	unsigned long total_faults;
> >>  	unsigned long *faults_from;
> >>  	unsigned long faults[0];
> > 
> > It's not a concern for now but in the land of unicorns and ponies we'll
> > relook at the size of some of these structures and see what can be
> > optimised.
> 
> Unsigned int should be enough for systems with less than 8TB
> of memory per node :)
> 

Is it not bigger than that?

typedef struct { DECLARE_BITMAP(bits, MAX_NUMNODES); } nodemask_t;

so it depends on the value of NODES_SHIFT? Anyway, not worth getting
into a twist over.

> > Similar to my comment on faults_from I think we could potentially evaluate
> > the fitness of the automatic NUMA balancing feature by looking at the
> > weight of the active_nodes for a numa_group. If
> > bitmask_weight(active_nodes) == nr_online_nodes
> > for all numa_groups in the system then I think it would be an indication
> > that the algorithm has collapsed.
> 
> If the system runs one very large workload, I would expect the
> scheduler to spread that workload across all nodes.
> 
> In that situation, it is perfectly legitimate for all nodes
> to end up being marked as active nodes, and for the system
> to try distribute the workload's memory somewhat evenly
> between them.
> 

In the specific case where the workload is not partitioned and really
accessing all of memory then sure, it'll be spread throughout the
system. However, if we are looking at a case like multiple JVMs sized to
fit within nodes then the metric would hold.

> > It's not a comment on the patch itself. We could could just do with more
> > metrics that help analyse this thing when debugging problems.
> > 
> >> @@ -1275,6 +1276,41 @@ static void numa_migrate_preferred(struct task_struct *p)
> >>  }
> >>  
> >>  /*
> >> + * Find the nodes on which the workload is actively running. We do this by
> > 
> > hmm, it's not the workload though, it's a single NUMA group and a workload
> > may consist of multiple NUMA groups. For example, in an ideal world and
> > a JVM-based workload the application threads and the GC threads would be
> > in different NUMA groups.
> 
> Why should they be in a different numa group?
> 

It would be ideal that they are in different groups so the hinting faults
incurred by the garbage collector (linear scan of the address space)
does not affect scheduling and placement decisions based on the numa
groups fault statistics.

> The rest of the series contains patches to make sure they
> should be just fine together in the same group...
> 
> > The signature is even more misleading because the signature implies that
> > the function is concerned with tasks. Pass in p->numa_group
> 
> Will do.
> 
> >> + * tracking the nodes from which NUMA hinting faults are triggered. This can
> >> + * be different from the set of nodes where the workload's memory is currently
> >> + * located.
> >> + *
> >> + * The bitmask is used to make smarter decisions on when to do NUMA page
> >> + * migrations, To prevent flip-flopping, and excessive page migrations, nodes
> >> + * are added when they cause over 6/16 of the maximum number of faults, but
> >> + * only removed when they drop below 3/16.
> >> + */
> > 
> > Looking at the values, I'm guessing you did it this way to use shifts
> > instead of divides. That's fine, but how did you arrive at those values?
> > Experimentally or just felt reasonable?
> 
> Experimentally I got to 20% and 40%.  Peter suggested I change it
> to 3/16 and 6/16, which appear to give identical performance.
> 

Cool

> >> +static void update_numa_active_node_mask(struct task_struct *p)
> >> +{
> >> +	unsigned long faults, max_faults = 0;
> >> +	struct numa_group *numa_group = p->numa_group;
> >> +	int nid;
> >> +
> >> +	for_each_online_node(nid) {
> >> +		faults = numa_group->faults_from[task_faults_idx(nid, 0)] +
> >> +			 numa_group->faults_from[task_faults_idx(nid, 1)];
> > 
> > task_faults() implements a helper for p->numa_faults equivalent of this.
> > Just as with the other renaming, it would not hurt to rename task_faults()
> > to something like task_faults_memory() and add a task_faults_cpu() for
> > this. The objective again is to be clear about whether we care about CPU
> > or memory locality information.
> 
> Will do.
> 
> >> +		if (faults > max_faults)
> >> +			max_faults = faults;
> >> +	}
> >> +
> >> +	for_each_online_node(nid) {
> >> +		faults = numa_group->faults_from[task_faults_idx(nid, 0)] +
> >> +			 numa_group->faults_from[task_faults_idx(nid, 1)];
> > 
> > group_faults would need similar adjustment.
> > 
> >> +		if (!node_isset(nid, numa_group->active_nodes)) {
> >> +			if (faults > max_faults * 6 / 16)
> >> +				node_set(nid, numa_group->active_nodes);
> >> +		} else if (faults < max_faults * 3 / 16)
> >> +			node_clear(nid, numa_group->active_nodes);
> >> +	}
> >> +}
> >> +
> > 
> > I think there is a subtle problem here
> 
> Can you be more specific about what problem you think the hysteresis
> could be causing?
> 

Lets say

Thread A: Most important thread for performance, accesses small amounts
	of memory during each scan window. Lets say it's doing calculations
	over a large cache-aware structure of some description

Thread B: Big stupid linear scanner accessing all of memory for whatever
	reason.

Thread B will incur more NUMA hinting faults because it is accessing
idle memory that is unused by Thread A. The fault stats and placement
decisions are then skewed in favour of Thread B because Thread A did not
trap enough hinting faults.

It's a theoretical problem.

> > /*
> >  * Be mindful that this is subject to sampling error. As we only have
> >  * data on hinting faults active_nodes may miss a heavily referenced
> >  * node due to the references being to a small number of pages. If
> >  * there is a large linear scanner in the same numa group as a
> >  * task operating on a small amount of memory then the latter task
> >  * may be ignored.
> >  */
> > 
> > I have no suggestion on how to handle this
> 
> Since the numa_faults_cpu statistics are all about driving
> memory-follows-cpu, there actually is a decent way to handle
> it.  See patch 5 :)
> 
> >> +/*
> >>   * When adapting the scan rate, the period is divided into NUMA_PERIOD_SLOTS
> >>   * increments. The more local the fault statistics are, the higher the scan
> >>   * period will be for the next scan window. If local/remote ratio is below
> >> @@ -1416,6 +1452,7 @@ static void task_numa_placement(struct task_struct *p)
> >>  	update_task_scan_period(p, fault_types[0], fault_types[1]);
> >>  
> >>  	if (p->numa_group) {
> >> +		update_numa_active_node_mask(p);
> > 
> > We are updating that thing once per scan window, that's fine. There is
> > potentially a wee issue though. If all the tasks in the group are threads
> > then they share p->mm->numa_scan_seq and only one task does the update
> > per scan window. If they are different processes then we could be updating
> > more frequently than necessary.
> > 
> > Functionally it'll be fine but higher cost than necessary. I do not have a
> > better suggestion right now as superficially a numa_scan_seq per numa_group
> > would not be a good fit.
> 
> I suspect this cost will be small anyway, compared to the costs
> incurred in both the earlier part of task_numa_placement, and
> in the code where we may look for a better place to migrate the
> task to.
> 
> This just iterates over memory we have already touched before
> (likely to still be cached), and does some cheap comparisons.
> 

Fair enough. It'll show up in profiles if it's a problem anyway.

> >>  		/*
> >>  		 * If the preferred task and group nids are different,
> >>  		 * iterate over the nodes again to find the best place.
> >> @@ -1478,6 +1515,8 @@ static void task_numa_group(struct task_struct *p, int cpupid, int flags,
> >>  		/* Second half of the array tracks where faults come from */
> >>  		grp->faults_from = grp->faults + 2 * nr_node_ids;
> >>  
> >> +		node_set(task_node(current), grp->active_nodes);
> >> +
> >>  		for (i = 0; i < 4*nr_node_ids; i++)
> >>  			grp->faults[i] = p->numa_faults[i];
> >>  
> >> @@ -1547,6 +1586,8 @@ static void task_numa_group(struct task_struct *p, int cpupid, int flags,
> >>  	my_grp->nr_tasks--;
> >>  	grp->nr_tasks++;
> >>  
> >> +	update_numa_active_node_mask(p);
> >> +
> > 
> > This may be subtle enough to deserve a comment
> > 
> > /* Tasks have joined/left groups and the active_mask is no longer valid */
> 
> I have added a comment.
> 
> > If we left a group, we update our new group. Is the old group now out of
> > date and in need of updating too?
> 
> The entire old group will join the new group, and the old group
> is freed.
> 

We reference count the old group so that it only gets freed when the
last task leaves it. If the old group was guaranteed to be destroyed
there would be no need to do stuff like

list_move(&p->numa_entry, &grp->task_list);
my_grp->total_faults -= p->total_numa_faults;
my_grp->nr_tasks--;

All that reads as "a single task is moving group" and not the entire
old group joins the new group. I expected that the old group was only
guaranteed to be destroyed in the case where we had just allocated it
because p->numa_group was NULL when task_numa_group was called.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2014-01-21 15:41 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2014-01-17  6:17 [PATCH 0/6] pseudo-interleaving for automatic NUMA balancing riel
2014-01-17  6:17 ` [PATCH 1/6] numa,sched,mm: remove p->numa_migrate_deferred riel
2014-01-17  6:17 ` [PATCH 2/6] numa,sched: track from which nodes NUMA faults are triggered riel
2014-01-17  6:17 ` [PATCH 3/6] numa,sched: build per numa_group active node mask from faults_from statistics riel
2014-01-17  6:17 ` [PATCH 4/6] numa,sched: tracepoints for NUMA balancing active nodemask changes riel
2014-01-17  6:17 ` [PATCH 5/6] numa,sched,mm: use active_nodes nodemask to limit numa migrations riel
2014-01-17  6:17 ` [PATCH 6/6] numa,sched: normalize faults_from stats and weigh by CPU use riel
2014-01-17  9:54   ` Peter Zijlstra
2014-01-17 14:51     ` Rik van Riel
2014-01-17 15:04       ` Peter Zijlstra
2014-01-17 18:47         ` Rik van Riel
  -- strict thread matches above, loose matches on Subject: below --
2014-01-20 19:21 [PATCH v3 0/6] pseudo-interleaving for automatic NUMA balancing riel
2014-01-20 19:21 ` [PATCH 3/6] numa,sched: build per numa_group active node mask from faults_from statistics riel
2014-01-20 19:21   ` riel
2014-01-21 14:19   ` Mel Gorman
2014-01-21 14:19     ` Mel Gorman
2014-01-21 15:09     ` Rik van Riel
2014-01-21 15:09       ` Rik van Riel
2014-01-21 15:41       ` Mel Gorman
2014-01-21 15:41         ` Mel Gorman

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.