* [PATCH 1/6] numa,sched,mm: remove p->numa_migrate_deferred
2014-01-17 6:17 [PATCH 0/6] pseudo-interleaving for automatic NUMA balancing riel
@ 2014-01-17 6:17 ` riel
2014-01-17 6:17 ` [PATCH 2/6] numa,sched: track from which nodes NUMA faults are triggered riel
` (4 subsequent siblings)
5 siblings, 0 replies; 19+ messages in thread
From: riel @ 2014-01-17 6:17 UTC (permalink / raw)
To: linux-kernel
Cc: linux-mm, mingo, peterz, chegu_vinod, mgorman, Rik van Riel,
Rik van Riel
From: Rik van Riel <riel@surriel.com>
Excessive migration of pages can hurt the performance of workloads
that span multiple NUMA nodes. However, it turns out that the
p->numa_migrate_deferred knob is a really big hammer, which does
reduce migration rates, but does not actually help performance.
Now that the second stage of the automatic numa balancing code
has stabilized, it is time to replace the simplistic migration
deferral code with something smarter.
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Chegu Vinod <chegu_vinod@hp.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Rik van Riel <riel@surriel.com>
---
include/linux/sched.h | 1 -
kernel/sched/fair.c | 8 --------
kernel/sysctl.c | 7 -------
mm/mempolicy.c | 45 ---------------------------------------------
4 files changed, 61 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 68a0e84..97efba4 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1469,7 +1469,6 @@ struct task_struct {
unsigned int numa_scan_period;
unsigned int numa_scan_period_max;
int numa_preferred_nid;
- int numa_migrate_deferred;
unsigned long numa_migrate_retry;
u64 node_stamp; /* migration stamp */
struct callback_head numa_work;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 867b0a4..41e2176 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -819,14 +819,6 @@ unsigned int sysctl_numa_balancing_scan_size = 256;
/* Scan @scan_size MB every @scan_period after an initial @scan_delay in ms */
unsigned int sysctl_numa_balancing_scan_delay = 1000;
-/*
- * After skipping a page migration on a shared page, skip N more numa page
- * migrations unconditionally. This reduces the number of NUMA migrations
- * in shared memory workloads, and has the effect of pulling tasks towards
- * where their memory lives, over pulling the memory towards the task.
- */
-unsigned int sysctl_numa_balancing_migrate_deferred = 16;
-
static unsigned int task_nr_scan_windows(struct task_struct *p)
{
unsigned long rss = 0;
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 096db74..4d19492 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -384,13 +384,6 @@ static struct ctl_table kern_table[] = {
.proc_handler = proc_dointvec,
},
{
- .procname = "numa_balancing_migrate_deferred",
- .data = &sysctl_numa_balancing_migrate_deferred,
- .maxlen = sizeof(unsigned int),
- .mode = 0644,
- .proc_handler = proc_dointvec,
- },
- {
.procname = "numa_balancing",
.data = NULL, /* filled in by handler */
.maxlen = sizeof(unsigned int),
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 36cb46c..052abac 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2301,35 +2301,6 @@ static void sp_free(struct sp_node *n)
kmem_cache_free(sn_cache, n);
}
-#ifdef CONFIG_NUMA_BALANCING
-static bool numa_migrate_deferred(struct task_struct *p, int last_cpupid)
-{
- /* Never defer a private fault */
- if (cpupid_match_pid(p, last_cpupid))
- return false;
-
- if (p->numa_migrate_deferred) {
- p->numa_migrate_deferred--;
- return true;
- }
- return false;
-}
-
-static inline void defer_numa_migrate(struct task_struct *p)
-{
- p->numa_migrate_deferred = sysctl_numa_balancing_migrate_deferred;
-}
-#else
-static inline bool numa_migrate_deferred(struct task_struct *p, int last_cpupid)
-{
- return false;
-}
-
-static inline void defer_numa_migrate(struct task_struct *p)
-{
-}
-#endif /* CONFIG_NUMA_BALANCING */
-
/**
* mpol_misplaced - check whether current page node is valid in policy
*
@@ -2432,24 +2403,8 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long
*/
last_cpupid = page_cpupid_xchg_last(page, this_cpupid);
if (!cpupid_pid_unset(last_cpupid) && cpupid_to_nid(last_cpupid) != thisnid) {
-
- /* See sysctl_numa_balancing_migrate_deferred comment */
- if (!cpupid_match_pid(current, last_cpupid))
- defer_numa_migrate(current);
-
goto out;
}
-
- /*
- * The quadratic filter above reduces extraneous migration
- * of shared pages somewhat. This code reduces it even more,
- * reducing the overhead of page migrations of shared pages.
- * This makes workloads with shared pages rely more on
- * "move task near its memory", and less on "move memory
- * towards its task", which is exactly what we want.
- */
- if (numa_migrate_deferred(current, last_cpupid))
- goto out;
}
if (curnid != polnid)
--
1.8.4.2
^ permalink raw reply related [flat|nested] 19+ messages in thread* [PATCH 2/6] numa,sched: track from which nodes NUMA faults are triggered
2014-01-17 6:17 [PATCH 0/6] pseudo-interleaving for automatic NUMA balancing riel
2014-01-17 6:17 ` [PATCH 1/6] numa,sched,mm: remove p->numa_migrate_deferred riel
@ 2014-01-17 6:17 ` riel
2014-01-17 6:17 ` [PATCH 3/6] numa,sched: build per numa_group active node mask from faults_from statistics riel
` (3 subsequent siblings)
5 siblings, 0 replies; 19+ messages in thread
From: riel @ 2014-01-17 6:17 UTC (permalink / raw)
To: linux-kernel
Cc: linux-mm, mingo, peterz, chegu_vinod, mgorman, Rik van Riel,
Rik van Riel
From: Rik van Riel <riel@surriel.com>
Track which nodes NUMA faults are triggered from, in other words
the CPUs on which the NUMA faults happened. This uses a similar
mechanism to what is used to track the memory involved in numa faults.
The next patches use this to build up a bitmap of which nodes a
workload is actively running on.
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Chegu Vinod <chegu_vinod@hp.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Rik van Riel <riel@surriel.com>
---
include/linux/sched.h | 10 ++++++++--
kernel/sched/fair.c | 30 +++++++++++++++++++++++-------
2 files changed, 31 insertions(+), 9 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 97efba4..a9f7f05 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1492,6 +1492,14 @@ struct task_struct {
unsigned long *numa_faults_buffer;
/*
+ * Track the nodes where faults are incurred. This is not very
+ * interesting on a per-task basis, but it help with smarter
+ * numa memory placement for groups of processes.
+ */
+ unsigned long *numa_faults_from;
+ unsigned long *numa_faults_from_buffer;
+
+ /*
* numa_faults_locality tracks if faults recorded during the last
* scan window were remote/local. The task scan period is adapted
* based on the locality of the faults with different weights
@@ -1594,8 +1602,6 @@ extern void task_numa_fault(int last_node, int node, int pages, int flags);
extern pid_t task_numa_group_id(struct task_struct *p);
extern void set_numabalancing_state(bool enabled);
extern void task_numa_free(struct task_struct *p);
-
-extern unsigned int sysctl_numa_balancing_migrate_deferred;
#else
static inline void task_numa_fault(int last_node, int node, int pages,
int flags)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 41e2176..1945ddc 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -886,6 +886,7 @@ struct numa_group {
struct rcu_head rcu;
unsigned long total_faults;
+ unsigned long *faults_from;
unsigned long faults[0];
};
@@ -1372,10 +1373,11 @@ static void task_numa_placement(struct task_struct *p)
int priv, i;
for (priv = 0; priv < 2; priv++) {
- long diff;
+ long diff, f_diff;
i = task_faults_idx(nid, priv);
diff = -p->numa_faults[i];
+ f_diff = -p->numa_faults_from[i];
/* Decay existing window, copy faults since last scan */
p->numa_faults[i] >>= 1;
@@ -1383,12 +1385,18 @@ static void task_numa_placement(struct task_struct *p)
fault_types[priv] += p->numa_faults_buffer[i];
p->numa_faults_buffer[i] = 0;
+ p->numa_faults_from[i] >>= 1;
+ p->numa_faults_from[i] += p->numa_faults_from_buffer[i];
+ p->numa_faults_from_buffer[i] = 0;
+
faults += p->numa_faults[i];
diff += p->numa_faults[i];
+ f_diff += p->numa_faults_from[i];
p->total_numa_faults += diff;
if (p->numa_group) {
/* safe because we can only change our own group */
p->numa_group->faults[i] += diff;
+ p->numa_group->faults_from[i] += f_diff;
p->numa_group->total_faults += diff;
group_faults += p->numa_group->faults[i];
}
@@ -1457,7 +1465,7 @@ static void task_numa_group(struct task_struct *p, int cpupid, int flags,
if (unlikely(!p->numa_group)) {
unsigned int size = sizeof(struct numa_group) +
- 2*nr_node_ids*sizeof(unsigned long);
+ 4*nr_node_ids*sizeof(unsigned long);
grp = kzalloc(size, GFP_KERNEL | __GFP_NOWARN);
if (!grp)
@@ -1467,8 +1475,10 @@ static void task_numa_group(struct task_struct *p, int cpupid, int flags,
spin_lock_init(&grp->lock);
INIT_LIST_HEAD(&grp->task_list);
grp->gid = p->pid;
+ /* Second half of the array tracks where faults come from */
+ grp->faults_from = grp->faults + 2 * nr_node_ids;
- for (i = 0; i < 2*nr_node_ids; i++)
+ for (i = 0; i < 4*nr_node_ids; i++)
grp->faults[i] = p->numa_faults[i];
grp->total_faults = p->total_numa_faults;
@@ -1526,7 +1536,7 @@ static void task_numa_group(struct task_struct *p, int cpupid, int flags,
double_lock(&my_grp->lock, &grp->lock);
- for (i = 0; i < 2*nr_node_ids; i++) {
+ for (i = 0; i < 4*nr_node_ids; i++) {
my_grp->faults[i] -= p->numa_faults[i];
grp->faults[i] += p->numa_faults[i];
}
@@ -1558,7 +1568,7 @@ void task_numa_free(struct task_struct *p)
if (grp) {
spin_lock(&grp->lock);
- for (i = 0; i < 2*nr_node_ids; i++)
+ for (i = 0; i < 4*nr_node_ids; i++)
grp->faults[i] -= p->numa_faults[i];
grp->total_faults -= p->total_numa_faults;
@@ -1571,6 +1581,8 @@ void task_numa_free(struct task_struct *p)
p->numa_faults = NULL;
p->numa_faults_buffer = NULL;
+ p->numa_faults_from = NULL;
+ p->numa_faults_from_buffer = NULL;
kfree(numa_faults);
}
@@ -1581,6 +1593,7 @@ void task_numa_fault(int last_cpupid, int node, int pages, int flags)
{
struct task_struct *p = current;
bool migrated = flags & TNF_MIGRATED;
+ int this_node = task_node(current);
int priv;
if (!numabalancing_enabled)
@@ -1596,7 +1609,7 @@ void task_numa_fault(int last_cpupid, int node, int pages, int flags)
/* Allocate buffer to track faults on a per-node basis */
if (unlikely(!p->numa_faults)) {
- int size = sizeof(*p->numa_faults) * 2 * nr_node_ids;
+ int size = sizeof(*p->numa_faults) * 4 * nr_node_ids;
/* numa_faults and numa_faults_buffer share the allocation */
p->numa_faults = kzalloc(size * 2, GFP_KERNEL|__GFP_NOWARN);
@@ -1604,7 +1617,9 @@ void task_numa_fault(int last_cpupid, int node, int pages, int flags)
return;
BUG_ON(p->numa_faults_buffer);
- p->numa_faults_buffer = p->numa_faults + (2 * nr_node_ids);
+ p->numa_faults_from = p->numa_faults + (2 * nr_node_ids);
+ p->numa_faults_buffer = p->numa_faults + (4 * nr_node_ids);
+ p->numa_faults_from_buffer = p->numa_faults + (6 * nr_node_ids);
p->total_numa_faults = 0;
memset(p->numa_faults_locality, 0, sizeof(p->numa_faults_locality));
}
@@ -1634,6 +1649,7 @@ void task_numa_fault(int last_cpupid, int node, int pages, int flags)
p->numa_pages_migrated += pages;
p->numa_faults_buffer[task_faults_idx(node, priv)] += pages;
+ p->numa_faults_from_buffer[task_faults_idx(this_node, priv)] += pages;
p->numa_faults_locality[!!(flags & TNF_FAULT_LOCAL)] += pages;
}
--
1.8.4.2
^ permalink raw reply related [flat|nested] 19+ messages in thread* [PATCH 3/6] numa,sched: build per numa_group active node mask from faults_from statistics
2014-01-17 6:17 [PATCH 0/6] pseudo-interleaving for automatic NUMA balancing riel
2014-01-17 6:17 ` [PATCH 1/6] numa,sched,mm: remove p->numa_migrate_deferred riel
2014-01-17 6:17 ` [PATCH 2/6] numa,sched: track from which nodes NUMA faults are triggered riel
@ 2014-01-17 6:17 ` riel
2014-01-17 6:17 ` [PATCH 4/6] numa,sched: tracepoints for NUMA balancing active nodemask changes riel
` (2 subsequent siblings)
5 siblings, 0 replies; 19+ messages in thread
From: riel @ 2014-01-17 6:17 UTC (permalink / raw)
To: linux-kernel
Cc: linux-mm, mingo, peterz, chegu_vinod, mgorman, Rik van Riel,
Rik van Riel
From: Rik van Riel <riel@surriel.com>
The faults_from statistics are used to maintain an active_nodes nodemask
per numa_group. This allows us to be smarter about when to do numa migrations.
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Chegu Vinod <chegu_vinod@hp.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Rik van Riel <riel@surriel.com>
---
kernel/sched/fair.c | 38 ++++++++++++++++++++++++++++++++++++++
1 file changed, 38 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1945ddc..aa680e2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -885,6 +885,7 @@ struct numa_group {
struct list_head task_list;
struct rcu_head rcu;
+ nodemask_t active_nodes;
unsigned long total_faults;
unsigned long *faults_from;
unsigned long faults[0];
@@ -1275,6 +1276,38 @@ static void numa_migrate_preferred(struct task_struct *p)
}
/*
+ * Iterate over the nodes from which NUMA hinting faults were triggered, in
+ * other words where the CPUs that incurred NUMA hinting faults are. The
+ * bitmask is used to limit NUMA page migrations, and spread out memory
+ * between the actively used nodes. To prevent flip-flopping, and excessive
+ * page migrations, nodes are added when they cause over 40% of the maximum
+ * number of faults, but only removed when they drop below 20%.
+ */
+static void update_numa_active_node_mask(struct task_struct *p)
+{
+ unsigned long faults, max_faults = 0;
+ struct numa_group *numa_group = p->numa_group;
+ int nid;
+
+ for_each_online_node(nid) {
+ faults = numa_group->faults_from[task_faults_idx(nid, 0)] +
+ numa_group->faults_from[task_faults_idx(nid, 1)];
+ if (faults > max_faults)
+ max_faults = faults;
+ }
+
+ for_each_online_node(nid) {
+ faults = numa_group->faults_from[task_faults_idx(nid, 0)] +
+ numa_group->faults_from[task_faults_idx(nid, 1)];
+ if (!node_isset(nid, numa_group->active_nodes)) {
+ if (faults > max_faults * 4 / 10)
+ node_set(nid, numa_group->active_nodes);
+ } else if (faults < max_faults * 2 / 10)
+ node_clear(nid, numa_group->active_nodes);
+ }
+}
+
+/*
* When adapting the scan rate, the period is divided into NUMA_PERIOD_SLOTS
* increments. The more local the fault statistics are, the higher the scan
* period will be for the next scan window. If local/remote ratio is below
@@ -1416,6 +1449,7 @@ static void task_numa_placement(struct task_struct *p)
update_task_scan_period(p, fault_types[0], fault_types[1]);
if (p->numa_group) {
+ update_numa_active_node_mask(p);
/*
* If the preferred task and group nids are different,
* iterate over the nodes again to find the best place.
@@ -1478,6 +1512,8 @@ static void task_numa_group(struct task_struct *p, int cpupid, int flags,
/* Second half of the array tracks where faults come from */
grp->faults_from = grp->faults + 2 * nr_node_ids;
+ node_set(task_node(current), grp->active_nodes);
+
for (i = 0; i < 4*nr_node_ids; i++)
grp->faults[i] = p->numa_faults[i];
@@ -1547,6 +1583,8 @@ static void task_numa_group(struct task_struct *p, int cpupid, int flags,
my_grp->nr_tasks--;
grp->nr_tasks++;
+ update_numa_active_node_mask(p);
+
spin_unlock(&my_grp->lock);
spin_unlock(&grp->lock);
--
1.8.4.2
^ permalink raw reply related [flat|nested] 19+ messages in thread* [PATCH 4/6] numa,sched: tracepoints for NUMA balancing active nodemask changes
2014-01-17 6:17 [PATCH 0/6] pseudo-interleaving for automatic NUMA balancing riel
` (2 preceding siblings ...)
2014-01-17 6:17 ` [PATCH 3/6] numa,sched: build per numa_group active node mask from faults_from statistics riel
@ 2014-01-17 6:17 ` riel
2014-01-17 6:17 ` [PATCH 5/6] numa,sched,mm: use active_nodes nodemask to limit numa migrations riel
2014-01-17 6:17 ` [PATCH 6/6] numa,sched: normalize faults_from stats and weigh by CPU use riel
5 siblings, 0 replies; 19+ messages in thread
From: riel @ 2014-01-17 6:17 UTC (permalink / raw)
To: linux-kernel
Cc: linux-mm, mingo, peterz, chegu_vinod, mgorman, Rik van Riel,
Rik van Riel
From: Rik van Riel <riel@surriel.com>
Being able to see how the active nodemask changes over time, and why,
can be quite useful.
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Chegu Vinod <chegu_vinod@hp.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Rik van Riel <riel@surriel.com>
---
include/trace/events/sched.h | 34 ++++++++++++++++++++++++++++++++++
kernel/sched/fair.c | 8 ++++++--
2 files changed, 40 insertions(+), 2 deletions(-)
diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index 67e1bbf..91726b6 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -530,6 +530,40 @@ TRACE_EVENT(sched_swap_numa,
__entry->dst_pid, __entry->dst_tgid, __entry->dst_ngid,
__entry->dst_cpu, __entry->dst_nid)
);
+
+TRACE_EVENT(update_numa_active_nodes_mask,
+
+ TP_PROTO(int pid, int gid, int nid, int set, long faults, long max_faults),
+
+ TP_ARGS(pid, gid, nid, set, faults, max_faults),
+
+ TP_STRUCT__entry(
+ __field( pid_t, pid)
+ __field( pid_t, gid)
+ __field( int, nid)
+ __field( int, set)
+ __field( long, faults)
+ __field( long, max_faults);
+ ),
+
+ TP_fast_assign(
+ __entry->pid = pid;
+ __entry->gid = gid;
+ __entry->nid = nid;
+ __entry->set = set;
+ __entry->faults = faults;
+ __entry->max_faults = max_faults;
+ ),
+
+ TP_printk("pid=%d gid=%d nid=%d set=%d faults=%ld max_faults=%ld",
+ __entry->pid,
+ __entry->gid,
+ __entry->nid,
+ __entry->set,
+ __entry->faults,
+ __entry->max_faults)
+
+);
#endif /* _TRACE_SCHED_H */
/* This part must be outside protection */
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index aa680e2..3551009 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1300,10 +1300,14 @@ static void update_numa_active_node_mask(struct task_struct *p)
faults = numa_group->faults_from[task_faults_idx(nid, 0)] +
numa_group->faults_from[task_faults_idx(nid, 1)];
if (!node_isset(nid, numa_group->active_nodes)) {
- if (faults > max_faults * 4 / 10)
+ if (faults > max_faults * 4 / 10) {
+ trace_update_numa_active_nodes_mask(current->pid, numa_group->gid, nid, true, faults, max_faults);
node_set(nid, numa_group->active_nodes);
- } else if (faults < max_faults * 2 / 10)
+ }
+ } else if (faults < max_faults * 2 / 10) {
+ trace_update_numa_active_nodes_mask(current->pid, numa_group->gid, nid, false, faults, max_faults);
node_clear(nid, numa_group->active_nodes);
+ }
}
}
--
1.8.4.2
^ permalink raw reply related [flat|nested] 19+ messages in thread* [PATCH 5/6] numa,sched,mm: use active_nodes nodemask to limit numa migrations
2014-01-17 6:17 [PATCH 0/6] pseudo-interleaving for automatic NUMA balancing riel
` (3 preceding siblings ...)
2014-01-17 6:17 ` [PATCH 4/6] numa,sched: tracepoints for NUMA balancing active nodemask changes riel
@ 2014-01-17 6:17 ` riel
2014-01-17 6:17 ` [PATCH 6/6] numa,sched: normalize faults_from stats and weigh by CPU use riel
5 siblings, 0 replies; 19+ messages in thread
From: riel @ 2014-01-17 6:17 UTC (permalink / raw)
To: linux-kernel
Cc: linux-mm, mingo, peterz, chegu_vinod, mgorman, Rik van Riel,
Rik van Riel
From: Rik van Riel <riel@surriel.com>
Use the active_nodes nodemask to make smarter decisions on NUMA migrations.
In order to maximize performance of workloads that do not fit in one NUMA
node, we want to satisfy the following criteria:
1) keep private memory local to each thread
2) avoid excessive NUMA migration of pages
3) distribute shared memory across the active nodes, to
maximize memory bandwidth available to the workload
This patch accomplishes that by implementing the following policy for
NUMA migrations:
1) always migrate on a private fault
2) never migrate to a node that is not in the set of active nodes
for the numa_group
3) always migrate from a node outside of the set of active nodes,
to a node that is in that set
4) within the set of active nodes in the numa_group, only migrate
from a node with more NUMA page faults, to a node with fewer
NUMA page faults, with a 25% margin to avoid ping-ponging
This results in most pages of a workload ending up on the actively
used nodes, with reduced ping-ponging of pages between those nodes.
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Chegu Vinod <chegu_vinod@hp.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Rik van Riel <riel@surriel.com>
---
include/linux/sched.h | 7 +++++++
kernel/sched/fair.c | 37 +++++++++++++++++++++++++++++++++++++
mm/mempolicy.c | 3 +++
3 files changed, 47 insertions(+)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index a9f7f05..0af6c1a 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1602,6 +1602,8 @@ extern void task_numa_fault(int last_node, int node, int pages, int flags);
extern pid_t task_numa_group_id(struct task_struct *p);
extern void set_numabalancing_state(bool enabled);
extern void task_numa_free(struct task_struct *p);
+extern bool should_numa_migrate(struct task_struct *p, int last_cpupid,
+ int src_nid, int dst_nid);
#else
static inline void task_numa_fault(int last_node, int node, int pages,
int flags)
@@ -1617,6 +1619,11 @@ static inline void set_numabalancing_state(bool enabled)
static inline void task_numa_free(struct task_struct *p)
{
}
+static inline bool should_numa_migrate(struct task_struct *p, int last_cpupid,
+ int src_nid, int dst_nid)
+{
+ return true;
+}
#endif
static inline struct pid *task_pid(struct task_struct *task)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3551009..8e0a53a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -948,6 +948,43 @@ static inline unsigned long group_weight(struct task_struct *p, int nid)
return 1000 * group_faults(p, nid) / p->numa_group->total_faults;
}
+bool should_numa_migrate(struct task_struct *p, int last_cpupid,
+ int src_nid, int dst_nid)
+{
+ struct numa_group *ng = p->numa_group;
+
+ /* Always allow migrate on private faults */
+ if (cpupid_match_pid(p, last_cpupid))
+ return true;
+
+ /* A shared fault, but p->numa_group has not been set up yet. */
+ if (!ng)
+ return true;
+
+ /*
+ * Do not migrate if the destination is not a node that
+ * is actively used by this numa group.
+ */
+ if (!node_isset(dst_nid, ng->active_nodes))
+ return false;
+
+ /*
+ * Source is a node that is not actively used by this
+ * numa group, while the destination is. Migrate.
+ */
+ if (!node_isset(src_nid, ng->active_nodes))
+ return true;
+
+ /*
+ * Both source and destination are nodes in active
+ * use by this numa group. Maximize memory bandwidth
+ * by migrating from more heavily used groups, to less
+ * heavily used ones, spreading the load around.
+ * Use a 1/4 hysteresis to avoid spurious page movement.
+ */
+ return group_faults(p, dst_nid) < (group_faults(p, src_nid) * 3 / 4);
+}
+
static unsigned long weighted_cpuload(const int cpu);
static unsigned long source_load(int cpu, int type);
static unsigned long target_load(int cpu, int type);
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 052abac..050962b 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2405,6 +2405,9 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long
if (!cpupid_pid_unset(last_cpupid) && cpupid_to_nid(last_cpupid) != thisnid) {
goto out;
}
+
+ if (!should_numa_migrate(current, last_cpupid, curnid, polnid))
+ goto out;
}
if (curnid != polnid)
--
1.8.4.2
^ permalink raw reply related [flat|nested] 19+ messages in thread* [PATCH 6/6] numa,sched: normalize faults_from stats and weigh by CPU use
2014-01-17 6:17 [PATCH 0/6] pseudo-interleaving for automatic NUMA balancing riel
` (4 preceding siblings ...)
2014-01-17 6:17 ` [PATCH 5/6] numa,sched,mm: use active_nodes nodemask to limit numa migrations riel
@ 2014-01-17 6:17 ` riel
2014-01-17 9:54 ` Peter Zijlstra
5 siblings, 1 reply; 19+ messages in thread
From: riel @ 2014-01-17 6:17 UTC (permalink / raw)
To: linux-kernel
Cc: linux-mm, mingo, peterz, chegu_vinod, mgorman, Rik van Riel,
Rik van Riel
From: Rik van Riel <riel@surriel.com>
The tracepoint has made it abundantly clear that the naive
implementation of the faults_from code has issues.
Specifically, the garbage collector in some workloads will
access orders of magnitudes more memory than the threads
that do all the active work. This resulted in the node with
the garbage collector being marked the only active node in
the group.
This issue is avoided if we weigh the statistics by CPU use
of each task in the numa group, instead of by how many faults
each thread has occurred.
To achieve this, we normalize the number of faults to the
fraction of faults that occurred on each node, and then
multiply that fraction by the fraction of CPU time the
task has used since the last time task_numa_placement was
invoked.
This way the nodes in the active node mask will be the ones
where the tasks from the numa group are most actively running,
and the influence of eg. the garbage collector and other
do-little threads is properly minimized.
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Chegu Vinod <chegu_vinod@hp.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Rik van Riel <riel@surriel.com>
---
include/linux/sched.h | 2 ++
kernel/sched/core.c | 2 ++
kernel/sched/fair.c | 49 +++++++++++++++++++++++++++++++++++++++++++++++--
3 files changed, 51 insertions(+), 2 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 0af6c1a..52de567 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1471,6 +1471,8 @@ struct task_struct {
int numa_preferred_nid;
unsigned long numa_migrate_retry;
u64 node_stamp; /* migration stamp */
+ u64 last_task_numa_placement;
+ u64 last_sum_exec_runtime;
struct callback_head numa_work;
struct list_head numa_entry;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 7f45fd5..9a0908a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1758,6 +1758,8 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
p->numa_work.next = &p->numa_work;
p->numa_faults = NULL;
p->numa_faults_buffer = NULL;
+ p->last_task_numa_placement = 0;
+ p->last_sum_exec_runtime = 0;
INIT_LIST_HEAD(&p->numa_entry);
p->numa_group = NULL;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8e0a53a..1601c68 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1422,11 +1422,41 @@ static void update_task_scan_period(struct task_struct *p,
memset(p->numa_faults_locality, 0, sizeof(p->numa_faults_locality));
}
+/*
+ * Get the fraction of time the task has been running since the last
+ * NUMA placement cycle. The scheduler keeps similar statistics, but
+ * decays those on a 32ms period, which is orders of magnitude off
+ * from the dozens-of-seconds NUMA balancing period. Use the scheduler
+ * stats only if the task is so new there are no NUMA statistics yet.
+ */
+static u64 numa_get_avg_runtime(struct task_struct *p, u64 *period)
+{
+ u64 runtime, delta, now;
+ /* Use the start of this time slice to avoid calculations. */
+ now = p->se.exec_start;
+ runtime = p->se.sum_exec_runtime;
+
+ if (p->last_task_numa_placement) {
+ delta = runtime - p->last_sum_exec_runtime;
+ *period = now - p->last_task_numa_placement;
+ } else {
+ delta = p->se.avg.runnable_avg_sum;
+ *period = p->se.avg.runnable_avg_period;
+ }
+
+ p->last_sum_exec_runtime = runtime;
+ p->last_task_numa_placement = now;
+
+ return delta;
+}
+
static void task_numa_placement(struct task_struct *p)
{
int seq, nid, max_nid = -1, max_group_nid = -1;
unsigned long max_faults = 0, max_group_faults = 0;
unsigned long fault_types[2] = { 0, 0 };
+ unsigned long total_faults;
+ u64 runtime, period;
spinlock_t *group_lock = NULL;
seq = ACCESS_ONCE(p->mm->numa_scan_seq);
@@ -1435,6 +1465,10 @@ static void task_numa_placement(struct task_struct *p)
p->numa_scan_seq = seq;
p->numa_scan_period_max = task_scan_max(p);
+ total_faults = p->numa_faults_locality[0] +
+ p->numa_faults_locality[1] + 1;
+ runtime = numa_get_avg_runtime(p, &period);
+
/* If the task is part of a group prevent parallel updates to group stats */
if (p->numa_group) {
group_lock = &p->numa_group->lock;
@@ -1447,7 +1481,7 @@ static void task_numa_placement(struct task_struct *p)
int priv, i;
for (priv = 0; priv < 2; priv++) {
- long diff, f_diff;
+ long diff, f_diff, f_weight;
i = task_faults_idx(nid, priv);
diff = -p->numa_faults[i];
@@ -1459,8 +1493,19 @@ static void task_numa_placement(struct task_struct *p)
fault_types[priv] += p->numa_faults_buffer[i];
p->numa_faults_buffer[i] = 0;
+ /*
+ * Normalize the faults_from, so all tasks in a group
+ * count according to CPU use, instead of by the raw
+ * number of faults. This prevents the situation where
+ * the garbage collector totally dominates the stats,
+ * and the access patterns of the worker threads are
+ * ignored.
+ */
+ f_weight = (1024 * runtime *
+ p->numa_faults_from_buffer[i]) /
+ (total_faults * period);
p->numa_faults_from[i] >>= 1;
- p->numa_faults_from[i] += p->numa_faults_from_buffer[i];
+ p->numa_faults_from[i] += f_weight;
p->numa_faults_from_buffer[i] = 0;
faults += p->numa_faults[i];
--
1.8.4.2
^ permalink raw reply related [flat|nested] 19+ messages in thread* Re: [PATCH 6/6] numa,sched: normalize faults_from stats and weigh by CPU use
2014-01-17 6:17 ` [PATCH 6/6] numa,sched: normalize faults_from stats and weigh by CPU use riel
@ 2014-01-17 9:54 ` Peter Zijlstra
2014-01-17 14:51 ` Rik van Riel
0 siblings, 1 reply; 19+ messages in thread
From: Peter Zijlstra @ 2014-01-17 9:54 UTC (permalink / raw)
To: riel; +Cc: linux-kernel, linux-mm, mingo, chegu_vinod, mgorman, Rik van Riel
On Fri, Jan 17, 2014 at 01:17:36AM -0500, riel@redhat.com wrote:
> + /*
> + * Normalize the faults_from, so all tasks in a group
> + * count according to CPU use, instead of by the raw
> + * number of faults. This prevents the situation where
> + * the garbage collector totally dominates the stats,
> + * and the access patterns of the worker threads are
> + * ignored.
> + */
Instead of focusing on the one example (GC) here, I would suggest
saying something along the lines of: Tasks with little runtime have
little over-all impact on throughput and thus their faults are less
important.
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH 6/6] numa,sched: normalize faults_from stats and weigh by CPU use
2014-01-17 9:54 ` Peter Zijlstra
@ 2014-01-17 14:51 ` Rik van Riel
2014-01-17 15:04 ` Peter Zijlstra
0 siblings, 1 reply; 19+ messages in thread
From: Rik van Riel @ 2014-01-17 14:51 UTC (permalink / raw)
To: Peter Zijlstra
Cc: linux-kernel, linux-mm, mingo, chegu_vinod, mgorman, Rik van Riel
On 01/17/2014 04:54 AM, Peter Zijlstra wrote:
> On Fri, Jan 17, 2014 at 01:17:36AM -0500, riel@redhat.com wrote:
>> + /*
>> + * Normalize the faults_from, so all tasks in a group
>> + * count according to CPU use, instead of by the raw
>> + * number of faults. This prevents the situation where
>> + * the garbage collector totally dominates the stats,
>> + * and the access patterns of the worker threads are
>> + * ignored.
>> + */
>
> Instead of focusing on the one example (GC) here, I would suggest
> saying something along the lines of: Tasks with little runtime have
> little over-all impact on throughput and thus their faults are less
> important.
Thanks for the review, I have updated the comment
accordingly.
Does anybody else have comments on this series, or should
I start begging for Acked-by: and Reviewed-by: lines? :)
--
All rights reversed
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH 6/6] numa,sched: normalize faults_from stats and weigh by CPU use
2014-01-17 14:51 ` Rik van Riel
@ 2014-01-17 15:04 ` Peter Zijlstra
2014-01-17 18:47 ` Rik van Riel
0 siblings, 1 reply; 19+ messages in thread
From: Peter Zijlstra @ 2014-01-17 15:04 UTC (permalink / raw)
To: Rik van Riel
Cc: linux-kernel, linux-mm, mingo, chegu_vinod, mgorman, Rik van Riel
On Fri, Jan 17, 2014 at 09:51:44AM -0500, Rik van Riel wrote:
> On 01/17/2014 04:54 AM, Peter Zijlstra wrote:
> > On Fri, Jan 17, 2014 at 01:17:36AM -0500, riel@redhat.com wrote:
> >> + /*
> >> + * Normalize the faults_from, so all tasks in a group
> >> + * count according to CPU use, instead of by the raw
> >> + * number of faults. This prevents the situation where
> >> + * the garbage collector totally dominates the stats,
> >> + * and the access patterns of the worker threads are
> >> + * ignored.
> >> + */
> >
> > Instead of focusing on the one example (GC) here, I would suggest
> > saying something along the lines of: Tasks with little runtime have
> > little over-all impact on throughput and thus their faults are less
> > important.
>
> Thanks for the review, I have updated the comment
> accordingly.
>
> Does anybody else have comments on this series, or should
> I start begging for Acked-by: and Reviewed-by: lines? :)
I still need to get me head around part of the previous patches, and I
suppose Mel might have a look, give it a few days ;-)
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH 6/6] numa,sched: normalize faults_from stats and weigh by CPU use
2014-01-17 15:04 ` Peter Zijlstra
@ 2014-01-17 18:47 ` Rik van Riel
0 siblings, 0 replies; 19+ messages in thread
From: Rik van Riel @ 2014-01-17 18:47 UTC (permalink / raw)
To: Peter Zijlstra
Cc: linux-kernel, linux-mm, mingo, chegu_vinod, mgorman, Rik van Riel
On 01/17/2014 10:04 AM, Peter Zijlstra wrote:
> On Fri, Jan 17, 2014 at 09:51:44AM -0500, Rik van Riel wrote:
>> On 01/17/2014 04:54 AM, Peter Zijlstra wrote:
>>> On Fri, Jan 17, 2014 at 01:17:36AM -0500, riel@redhat.com wrote:
>>>> + /*
>>>> + * Normalize the faults_from, so all tasks in a group
>>>> + * count according to CPU use, instead of by the raw
>>>> + * number of faults. This prevents the situation where
>>>> + * the garbage collector totally dominates the stats,
>>>> + * and the access patterns of the worker threads are
>>>> + * ignored.
>>>> + */
>>>
>>> Instead of focusing on the one example (GC) here, I would suggest
>>> saying something along the lines of: Tasks with little runtime have
>>> little over-all impact on throughput and thus their faults are less
>>> important.
>>
>> Thanks for the review, I have updated the comment
>> accordingly.
>>
>> Does anybody else have comments on this series, or should
>> I start begging for Acked-by: and Reviewed-by: lines? :)
>
> I still need to get me head around part of the previous patches, and I
> suppose Mel might have a look, give it a few days ;-)
Well, you didn't spot the divide by zero that Vinod somehow
managed to trigger :)
I am also going to add another patch, that prevents the temporary
values from "p->numa_faults[i] >>= 1;" becoming visible, by using
local variables for those calculations.
Expect a v2 soonish.
--
All rights reversed
^ permalink raw reply [flat|nested] 19+ messages in thread
* [PATCH 2/6] numa,sched: track from which nodes NUMA faults are triggered
2014-01-20 19:21 [PATCH v3 0/6] pseudo-interleaving for automatic NUMA balancing riel
@ 2014-01-20 19:21 ` riel
0 siblings, 0 replies; 19+ messages in thread
From: riel @ 2014-01-20 19:21 UTC (permalink / raw)
To: linux-kernel; +Cc: linux-mm, peterz, mgorman, mingo, chegu_vinod
From: Rik van Riel <riel@redhat.com>
Track which nodes NUMA faults are triggered from, in other words
the CPUs on which the NUMA faults happened. This uses a similar
mechanism to what is used to track the memory involved in numa faults.
The next patches use this to build up a bitmap of which nodes a
workload is actively running on.
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Chegu Vinod <chegu_vinod@hp.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
---
include/linux/sched.h | 10 ++++++++--
kernel/sched/fair.c | 30 +++++++++++++++++++++++-------
2 files changed, 31 insertions(+), 9 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 97efba4..a9f7f05 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1492,6 +1492,14 @@ struct task_struct {
unsigned long *numa_faults_buffer;
/*
+ * Track the nodes where faults are incurred. This is not very
+ * interesting on a per-task basis, but it help with smarter
+ * numa memory placement for groups of processes.
+ */
+ unsigned long *numa_faults_from;
+ unsigned long *numa_faults_from_buffer;
+
+ /*
* numa_faults_locality tracks if faults recorded during the last
* scan window were remote/local. The task scan period is adapted
* based on the locality of the faults with different weights
@@ -1594,8 +1602,6 @@ extern void task_numa_fault(int last_node, int node, int pages, int flags);
extern pid_t task_numa_group_id(struct task_struct *p);
extern void set_numabalancing_state(bool enabled);
extern void task_numa_free(struct task_struct *p);
-
-extern unsigned int sysctl_numa_balancing_migrate_deferred;
#else
static inline void task_numa_fault(int last_node, int node, int pages,
int flags)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 41e2176..1945ddc 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -886,6 +886,7 @@ struct numa_group {
struct rcu_head rcu;
unsigned long total_faults;
+ unsigned long *faults_from;
unsigned long faults[0];
};
@@ -1372,10 +1373,11 @@ static void task_numa_placement(struct task_struct *p)
int priv, i;
for (priv = 0; priv < 2; priv++) {
- long diff;
+ long diff, f_diff;
i = task_faults_idx(nid, priv);
diff = -p->numa_faults[i];
+ f_diff = -p->numa_faults_from[i];
/* Decay existing window, copy faults since last scan */
p->numa_faults[i] >>= 1;
@@ -1383,12 +1385,18 @@ static void task_numa_placement(struct task_struct *p)
fault_types[priv] += p->numa_faults_buffer[i];
p->numa_faults_buffer[i] = 0;
+ p->numa_faults_from[i] >>= 1;
+ p->numa_faults_from[i] += p->numa_faults_from_buffer[i];
+ p->numa_faults_from_buffer[i] = 0;
+
faults += p->numa_faults[i];
diff += p->numa_faults[i];
+ f_diff += p->numa_faults_from[i];
p->total_numa_faults += diff;
if (p->numa_group) {
/* safe because we can only change our own group */
p->numa_group->faults[i] += diff;
+ p->numa_group->faults_from[i] += f_diff;
p->numa_group->total_faults += diff;
group_faults += p->numa_group->faults[i];
}
@@ -1457,7 +1465,7 @@ static void task_numa_group(struct task_struct *p, int cpupid, int flags,
if (unlikely(!p->numa_group)) {
unsigned int size = sizeof(struct numa_group) +
- 2*nr_node_ids*sizeof(unsigned long);
+ 4*nr_node_ids*sizeof(unsigned long);
grp = kzalloc(size, GFP_KERNEL | __GFP_NOWARN);
if (!grp)
@@ -1467,8 +1475,10 @@ static void task_numa_group(struct task_struct *p, int cpupid, int flags,
spin_lock_init(&grp->lock);
INIT_LIST_HEAD(&grp->task_list);
grp->gid = p->pid;
+ /* Second half of the array tracks where faults come from */
+ grp->faults_from = grp->faults + 2 * nr_node_ids;
- for (i = 0; i < 2*nr_node_ids; i++)
+ for (i = 0; i < 4*nr_node_ids; i++)
grp->faults[i] = p->numa_faults[i];
grp->total_faults = p->total_numa_faults;
@@ -1526,7 +1536,7 @@ static void task_numa_group(struct task_struct *p, int cpupid, int flags,
double_lock(&my_grp->lock, &grp->lock);
- for (i = 0; i < 2*nr_node_ids; i++) {
+ for (i = 0; i < 4*nr_node_ids; i++) {
my_grp->faults[i] -= p->numa_faults[i];
grp->faults[i] += p->numa_faults[i];
}
@@ -1558,7 +1568,7 @@ void task_numa_free(struct task_struct *p)
if (grp) {
spin_lock(&grp->lock);
- for (i = 0; i < 2*nr_node_ids; i++)
+ for (i = 0; i < 4*nr_node_ids; i++)
grp->faults[i] -= p->numa_faults[i];
grp->total_faults -= p->total_numa_faults;
@@ -1571,6 +1581,8 @@ void task_numa_free(struct task_struct *p)
p->numa_faults = NULL;
p->numa_faults_buffer = NULL;
+ p->numa_faults_from = NULL;
+ p->numa_faults_from_buffer = NULL;
kfree(numa_faults);
}
@@ -1581,6 +1593,7 @@ void task_numa_fault(int last_cpupid, int node, int pages, int flags)
{
struct task_struct *p = current;
bool migrated = flags & TNF_MIGRATED;
+ int this_node = task_node(current);
int priv;
if (!numabalancing_enabled)
@@ -1596,7 +1609,7 @@ void task_numa_fault(int last_cpupid, int node, int pages, int flags)
/* Allocate buffer to track faults on a per-node basis */
if (unlikely(!p->numa_faults)) {
- int size = sizeof(*p->numa_faults) * 2 * nr_node_ids;
+ int size = sizeof(*p->numa_faults) * 4 * nr_node_ids;
/* numa_faults and numa_faults_buffer share the allocation */
p->numa_faults = kzalloc(size * 2, GFP_KERNEL|__GFP_NOWARN);
@@ -1604,7 +1617,9 @@ void task_numa_fault(int last_cpupid, int node, int pages, int flags)
return;
BUG_ON(p->numa_faults_buffer);
- p->numa_faults_buffer = p->numa_faults + (2 * nr_node_ids);
+ p->numa_faults_from = p->numa_faults + (2 * nr_node_ids);
+ p->numa_faults_buffer = p->numa_faults + (4 * nr_node_ids);
+ p->numa_faults_from_buffer = p->numa_faults + (6 * nr_node_ids);
p->total_numa_faults = 0;
memset(p->numa_faults_locality, 0, sizeof(p->numa_faults_locality));
}
@@ -1634,6 +1649,7 @@ void task_numa_fault(int last_cpupid, int node, int pages, int flags)
p->numa_pages_migrated += pages;
p->numa_faults_buffer[task_faults_idx(node, priv)] += pages;
+ p->numa_faults_from_buffer[task_faults_idx(this_node, priv)] += pages;
p->numa_faults_locality[!!(flags & TNF_FAULT_LOCAL)] += pages;
}
--
1.8.4.2
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply related [flat|nested] 19+ messages in thread* [PATCH 2/6] numa,sched: track from which nodes NUMA faults are triggered
@ 2014-01-20 19:21 ` riel
0 siblings, 0 replies; 19+ messages in thread
From: riel @ 2014-01-20 19:21 UTC (permalink / raw)
To: linux-kernel; +Cc: linux-mm, peterz, mgorman, mingo, chegu_vinod
From: Rik van Riel <riel@redhat.com>
Track which nodes NUMA faults are triggered from, in other words
the CPUs on which the NUMA faults happened. This uses a similar
mechanism to what is used to track the memory involved in numa faults.
The next patches use this to build up a bitmap of which nodes a
workload is actively running on.
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Chegu Vinod <chegu_vinod@hp.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
---
include/linux/sched.h | 10 ++++++++--
kernel/sched/fair.c | 30 +++++++++++++++++++++++-------
2 files changed, 31 insertions(+), 9 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 97efba4..a9f7f05 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1492,6 +1492,14 @@ struct task_struct {
unsigned long *numa_faults_buffer;
/*
+ * Track the nodes where faults are incurred. This is not very
+ * interesting on a per-task basis, but it help with smarter
+ * numa memory placement for groups of processes.
+ */
+ unsigned long *numa_faults_from;
+ unsigned long *numa_faults_from_buffer;
+
+ /*
* numa_faults_locality tracks if faults recorded during the last
* scan window were remote/local. The task scan period is adapted
* based on the locality of the faults with different weights
@@ -1594,8 +1602,6 @@ extern void task_numa_fault(int last_node, int node, int pages, int flags);
extern pid_t task_numa_group_id(struct task_struct *p);
extern void set_numabalancing_state(bool enabled);
extern void task_numa_free(struct task_struct *p);
-
-extern unsigned int sysctl_numa_balancing_migrate_deferred;
#else
static inline void task_numa_fault(int last_node, int node, int pages,
int flags)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 41e2176..1945ddc 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -886,6 +886,7 @@ struct numa_group {
struct rcu_head rcu;
unsigned long total_faults;
+ unsigned long *faults_from;
unsigned long faults[0];
};
@@ -1372,10 +1373,11 @@ static void task_numa_placement(struct task_struct *p)
int priv, i;
for (priv = 0; priv < 2; priv++) {
- long diff;
+ long diff, f_diff;
i = task_faults_idx(nid, priv);
diff = -p->numa_faults[i];
+ f_diff = -p->numa_faults_from[i];
/* Decay existing window, copy faults since last scan */
p->numa_faults[i] >>= 1;
@@ -1383,12 +1385,18 @@ static void task_numa_placement(struct task_struct *p)
fault_types[priv] += p->numa_faults_buffer[i];
p->numa_faults_buffer[i] = 0;
+ p->numa_faults_from[i] >>= 1;
+ p->numa_faults_from[i] += p->numa_faults_from_buffer[i];
+ p->numa_faults_from_buffer[i] = 0;
+
faults += p->numa_faults[i];
diff += p->numa_faults[i];
+ f_diff += p->numa_faults_from[i];
p->total_numa_faults += diff;
if (p->numa_group) {
/* safe because we can only change our own group */
p->numa_group->faults[i] += diff;
+ p->numa_group->faults_from[i] += f_diff;
p->numa_group->total_faults += diff;
group_faults += p->numa_group->faults[i];
}
@@ -1457,7 +1465,7 @@ static void task_numa_group(struct task_struct *p, int cpupid, int flags,
if (unlikely(!p->numa_group)) {
unsigned int size = sizeof(struct numa_group) +
- 2*nr_node_ids*sizeof(unsigned long);
+ 4*nr_node_ids*sizeof(unsigned long);
grp = kzalloc(size, GFP_KERNEL | __GFP_NOWARN);
if (!grp)
@@ -1467,8 +1475,10 @@ static void task_numa_group(struct task_struct *p, int cpupid, int flags,
spin_lock_init(&grp->lock);
INIT_LIST_HEAD(&grp->task_list);
grp->gid = p->pid;
+ /* Second half of the array tracks where faults come from */
+ grp->faults_from = grp->faults + 2 * nr_node_ids;
- for (i = 0; i < 2*nr_node_ids; i++)
+ for (i = 0; i < 4*nr_node_ids; i++)
grp->faults[i] = p->numa_faults[i];
grp->total_faults = p->total_numa_faults;
@@ -1526,7 +1536,7 @@ static void task_numa_group(struct task_struct *p, int cpupid, int flags,
double_lock(&my_grp->lock, &grp->lock);
- for (i = 0; i < 2*nr_node_ids; i++) {
+ for (i = 0; i < 4*nr_node_ids; i++) {
my_grp->faults[i] -= p->numa_faults[i];
grp->faults[i] += p->numa_faults[i];
}
@@ -1558,7 +1568,7 @@ void task_numa_free(struct task_struct *p)
if (grp) {
spin_lock(&grp->lock);
- for (i = 0; i < 2*nr_node_ids; i++)
+ for (i = 0; i < 4*nr_node_ids; i++)
grp->faults[i] -= p->numa_faults[i];
grp->total_faults -= p->total_numa_faults;
@@ -1571,6 +1581,8 @@ void task_numa_free(struct task_struct *p)
p->numa_faults = NULL;
p->numa_faults_buffer = NULL;
+ p->numa_faults_from = NULL;
+ p->numa_faults_from_buffer = NULL;
kfree(numa_faults);
}
@@ -1581,6 +1593,7 @@ void task_numa_fault(int last_cpupid, int node, int pages, int flags)
{
struct task_struct *p = current;
bool migrated = flags & TNF_MIGRATED;
+ int this_node = task_node(current);
int priv;
if (!numabalancing_enabled)
@@ -1596,7 +1609,7 @@ void task_numa_fault(int last_cpupid, int node, int pages, int flags)
/* Allocate buffer to track faults on a per-node basis */
if (unlikely(!p->numa_faults)) {
- int size = sizeof(*p->numa_faults) * 2 * nr_node_ids;
+ int size = sizeof(*p->numa_faults) * 4 * nr_node_ids;
/* numa_faults and numa_faults_buffer share the allocation */
p->numa_faults = kzalloc(size * 2, GFP_KERNEL|__GFP_NOWARN);
@@ -1604,7 +1617,9 @@ void task_numa_fault(int last_cpupid, int node, int pages, int flags)
return;
BUG_ON(p->numa_faults_buffer);
- p->numa_faults_buffer = p->numa_faults + (2 * nr_node_ids);
+ p->numa_faults_from = p->numa_faults + (2 * nr_node_ids);
+ p->numa_faults_buffer = p->numa_faults + (4 * nr_node_ids);
+ p->numa_faults_from_buffer = p->numa_faults + (6 * nr_node_ids);
p->total_numa_faults = 0;
memset(p->numa_faults_locality, 0, sizeof(p->numa_faults_locality));
}
@@ -1634,6 +1649,7 @@ void task_numa_fault(int last_cpupid, int node, int pages, int flags)
p->numa_pages_migrated += pages;
p->numa_faults_buffer[task_faults_idx(node, priv)] += pages;
+ p->numa_faults_from_buffer[task_faults_idx(this_node, priv)] += pages;
p->numa_faults_locality[!!(flags & TNF_FAULT_LOCAL)] += pages;
}
--
1.8.4.2
^ permalink raw reply related [flat|nested] 19+ messages in thread* Re: [PATCH 2/6] numa,sched: track from which nodes NUMA faults are triggered
2014-01-20 19:21 ` riel
@ 2014-01-21 12:21 ` Mel Gorman
-1 siblings, 0 replies; 19+ messages in thread
From: Mel Gorman @ 2014-01-21 12:21 UTC (permalink / raw)
To: riel; +Cc: linux-kernel, linux-mm, peterz, mingo, chegu_vinod
On Mon, Jan 20, 2014 at 02:21:03PM -0500, riel@redhat.com wrote:
> From: Rik van Riel <riel@redhat.com>
>
> Track which nodes NUMA faults are triggered from, in other words
> the CPUs on which the NUMA faults happened. This uses a similar
> mechanism to what is used to track the memory involved in numa faults.
>
> The next patches use this to build up a bitmap of which nodes a
> workload is actively running on.
>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Chegu Vinod <chegu_vinod@hp.com>
> Signed-off-by: Rik van Riel <riel@redhat.com>
> ---
> include/linux/sched.h | 10 ++++++++--
> kernel/sched/fair.c | 30 +++++++++++++++++++++++-------
> 2 files changed, 31 insertions(+), 9 deletions(-)
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 97efba4..a9f7f05 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1492,6 +1492,14 @@ struct task_struct {
> unsigned long *numa_faults_buffer;
>
> /*
> + * Track the nodes where faults are incurred. This is not very
> + * interesting on a per-task basis, but it help with smarter
> + * numa memory placement for groups of processes.
> + */
> + unsigned long *numa_faults_from;
> + unsigned long *numa_faults_from_buffer;
> +
As an aside I wonder if we can derive any useful metric from this. One
potential santiy check would be the number of nodes that a task is incurring
faults on. It would be best if the highest number of faults were recorded
on the node the task is currently running on. After that we either want
to minimise the number of nodes trapping faults or interleave between
all available nodes to avoid applying too much memory pressure on any
one node. For interleaving to always be the best option we would have to
assume that all nodes are equal distance but that would be a reasonable
assumption to start with.
> + /*
> * numa_faults_locality tracks if faults recorded during the last
> * scan window were remote/local. The task scan period is adapted
> * based on the locality of the faults with different weights
> @@ -1594,8 +1602,6 @@ extern void task_numa_fault(int last_node, int node, int pages, int flags);
> extern pid_t task_numa_group_id(struct task_struct *p);
> extern void set_numabalancing_state(bool enabled);
> extern void task_numa_free(struct task_struct *p);
> -
> -extern unsigned int sysctl_numa_balancing_migrate_deferred;
> #else
> static inline void task_numa_fault(int last_node, int node, int pages,
> int flags)
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 41e2176..1945ddc 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -886,6 +886,7 @@ struct numa_group {
>
> struct rcu_head rcu;
> unsigned long total_faults;
> + unsigned long *faults_from;
> unsigned long faults[0];
> };
>
faults_from is not ambiguous but it does not tell us a lot of information
either. If I am reading this right then fundamentally this patch means we
are tracking two pieces of information
1. The node the data resided on at the time of the hinting fault (numa_faults)
2. The node the accessing task was residing on at the time of the fault (faults_from)
We should be able to have names that reflect that. How about
memory_faults_locality and cpu_faults_locality with a prepartion patch
doing a simple rename for numa_faults and this patch adding
cpu_faults_locality?
It will be tough to be consistent about this but the clearer we are about
making decisions based on task locationo vs data location the happier we
will be in the long run.
> @@ -1372,10 +1373,11 @@ static void task_numa_placement(struct task_struct *p)
> int priv, i;
>
> for (priv = 0; priv < 2; priv++) {
> - long diff;
> + long diff, f_diff;
>
> i = task_faults_idx(nid, priv);
> diff = -p->numa_faults[i];
> + f_diff = -p->numa_faults_from[i];
>
> /* Decay existing window, copy faults since last scan */
> p->numa_faults[i] >>= 1;
> @@ -1383,12 +1385,18 @@ static void task_numa_placement(struct task_struct *p)
> fault_types[priv] += p->numa_faults_buffer[i];
> p->numa_faults_buffer[i] = 0;
>
> + p->numa_faults_from[i] >>= 1;
> + p->numa_faults_from[i] += p->numa_faults_from_buffer[i];
> + p->numa_faults_from_buffer[i] = 0;
> +
> faults += p->numa_faults[i];
> diff += p->numa_faults[i];
> + f_diff += p->numa_faults_from[i];
> p->total_numa_faults += diff;
> if (p->numa_group) {
> /* safe because we can only change our own group */
> p->numa_group->faults[i] += diff;
> + p->numa_group->faults_from[i] += f_diff;
> p->numa_group->total_faults += diff;
> group_faults += p->numa_group->faults[i];
> }
> @@ -1457,7 +1465,7 @@ static void task_numa_group(struct task_struct *p, int cpupid, int flags,
>
> if (unlikely(!p->numa_group)) {
> unsigned int size = sizeof(struct numa_group) +
> - 2*nr_node_ids*sizeof(unsigned long);
> + 4*nr_node_ids*sizeof(unsigned long);
>
Should we convert that magic number to a define? NR_NUMA_HINT_FAULT_STATS?
> grp = kzalloc(size, GFP_KERNEL | __GFP_NOWARN);
> if (!grp)
> @@ -1467,8 +1475,10 @@ static void task_numa_group(struct task_struct *p, int cpupid, int flags,
> spin_lock_init(&grp->lock);
> INIT_LIST_HEAD(&grp->task_list);
> grp->gid = p->pid;
> + /* Second half of the array tracks where faults come from */
> + grp->faults_from = grp->faults + 2 * nr_node_ids;
>
We have accessors when we overload arrays like this such as task_faults_idx
for example. We should have similar accessors for this in case those
offsets very change.
> - for (i = 0; i < 2*nr_node_ids; i++)
> + for (i = 0; i < 4*nr_node_ids; i++)
> grp->faults[i] = p->numa_faults[i];
>
This is a little obscure now. Functionally it is copying both numa_faults and
numa_faults_from but a casual reading of that will get confused. Minimally
it needs a comment explaining what is being copied here. Also, why did we
not use memcpy?
> grp->total_faults = p->total_numa_faults;
> @@ -1526,7 +1536,7 @@ static void task_numa_group(struct task_struct *p, int cpupid, int flags,
>
> double_lock(&my_grp->lock, &grp->lock);
>
> - for (i = 0; i < 2*nr_node_ids; i++) {
> + for (i = 0; i < 4*nr_node_ids; i++) {
> my_grp->faults[i] -= p->numa_faults[i];
> grp->faults[i] += p->numa_faults[i];
> }
The same obscure trick is used throughout and I'm not sure how
maintainable that will be. Would it be better to be explicit about this?
/* NUMA hinting faults may be either shared or private faults */
#define NR_NUMA_HINT_FAULT_TYPES 2
/* Track shared and private faults
#define NR_NUMA_HINT_FAULT_STATS (NR_NUMA_HINT_FAULT_TYPES*2)
> @@ -1558,7 +1568,7 @@ void task_numa_free(struct task_struct *p)
>
> if (grp) {
> spin_lock(&grp->lock);
> - for (i = 0; i < 2*nr_node_ids; i++)
> + for (i = 0; i < 4*nr_node_ids; i++)
> grp->faults[i] -= p->numa_faults[i];
> grp->total_faults -= p->total_numa_faults;
>
> @@ -1571,6 +1581,8 @@ void task_numa_free(struct task_struct *p)
>
> p->numa_faults = NULL;
> p->numa_faults_buffer = NULL;
> + p->numa_faults_from = NULL;
> + p->numa_faults_from_buffer = NULL;
> kfree(numa_faults);
> }
>
> @@ -1581,6 +1593,7 @@ void task_numa_fault(int last_cpupid, int node, int pages, int flags)
> {
> struct task_struct *p = current;
> bool migrated = flags & TNF_MIGRATED;
> + int this_node = task_node(current);
> int priv;
>
> if (!numabalancing_enabled)
> @@ -1596,7 +1609,7 @@ void task_numa_fault(int last_cpupid, int node, int pages, int flags)
>
> /* Allocate buffer to track faults on a per-node basis */
> if (unlikely(!p->numa_faults)) {
> - int size = sizeof(*p->numa_faults) * 2 * nr_node_ids;
> + int size = sizeof(*p->numa_faults) * 4 * nr_node_ids;
>
> /* numa_faults and numa_faults_buffer share the allocation */
> p->numa_faults = kzalloc(size * 2, GFP_KERNEL|__GFP_NOWARN);
> @@ -1604,7 +1617,9 @@ void task_numa_fault(int last_cpupid, int node, int pages, int flags)
> return;
>
> BUG_ON(p->numa_faults_buffer);
> - p->numa_faults_buffer = p->numa_faults + (2 * nr_node_ids);
> + p->numa_faults_from = p->numa_faults + (2 * nr_node_ids);
> + p->numa_faults_buffer = p->numa_faults + (4 * nr_node_ids);
> + p->numa_faults_from_buffer = p->numa_faults + (6 * nr_node_ids);
> p->total_numa_faults = 0;
> memset(p->numa_faults_locality, 0, sizeof(p->numa_faults_locality));
> }
> @@ -1634,6 +1649,7 @@ void task_numa_fault(int last_cpupid, int node, int pages, int flags)
> p->numa_pages_migrated += pages;
>
> p->numa_faults_buffer[task_faults_idx(node, priv)] += pages;
> + p->numa_faults_from_buffer[task_faults_idx(this_node, priv)] += pages;
> p->numa_faults_locality[!!(flags & TNF_FAULT_LOCAL)] += pages;
this_node and node is similarly ambiguous in terms of name. Rename of
data_node and cpu_node would have been clearer.
--
Mel Gorman
SUSE Labs
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 19+ messages in thread* Re: [PATCH 2/6] numa,sched: track from which nodes NUMA faults are triggered
@ 2014-01-21 12:21 ` Mel Gorman
0 siblings, 0 replies; 19+ messages in thread
From: Mel Gorman @ 2014-01-21 12:21 UTC (permalink / raw)
To: riel; +Cc: linux-kernel, linux-mm, peterz, mingo, chegu_vinod
On Mon, Jan 20, 2014 at 02:21:03PM -0500, riel@redhat.com wrote:
> From: Rik van Riel <riel@redhat.com>
>
> Track which nodes NUMA faults are triggered from, in other words
> the CPUs on which the NUMA faults happened. This uses a similar
> mechanism to what is used to track the memory involved in numa faults.
>
> The next patches use this to build up a bitmap of which nodes a
> workload is actively running on.
>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Chegu Vinod <chegu_vinod@hp.com>
> Signed-off-by: Rik van Riel <riel@redhat.com>
> ---
> include/linux/sched.h | 10 ++++++++--
> kernel/sched/fair.c | 30 +++++++++++++++++++++++-------
> 2 files changed, 31 insertions(+), 9 deletions(-)
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 97efba4..a9f7f05 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1492,6 +1492,14 @@ struct task_struct {
> unsigned long *numa_faults_buffer;
>
> /*
> + * Track the nodes where faults are incurred. This is not very
> + * interesting on a per-task basis, but it help with smarter
> + * numa memory placement for groups of processes.
> + */
> + unsigned long *numa_faults_from;
> + unsigned long *numa_faults_from_buffer;
> +
As an aside I wonder if we can derive any useful metric from this. One
potential santiy check would be the number of nodes that a task is incurring
faults on. It would be best if the highest number of faults were recorded
on the node the task is currently running on. After that we either want
to minimise the number of nodes trapping faults or interleave between
all available nodes to avoid applying too much memory pressure on any
one node. For interleaving to always be the best option we would have to
assume that all nodes are equal distance but that would be a reasonable
assumption to start with.
> + /*
> * numa_faults_locality tracks if faults recorded during the last
> * scan window were remote/local. The task scan period is adapted
> * based on the locality of the faults with different weights
> @@ -1594,8 +1602,6 @@ extern void task_numa_fault(int last_node, int node, int pages, int flags);
> extern pid_t task_numa_group_id(struct task_struct *p);
> extern void set_numabalancing_state(bool enabled);
> extern void task_numa_free(struct task_struct *p);
> -
> -extern unsigned int sysctl_numa_balancing_migrate_deferred;
> #else
> static inline void task_numa_fault(int last_node, int node, int pages,
> int flags)
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 41e2176..1945ddc 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -886,6 +886,7 @@ struct numa_group {
>
> struct rcu_head rcu;
> unsigned long total_faults;
> + unsigned long *faults_from;
> unsigned long faults[0];
> };
>
faults_from is not ambiguous but it does not tell us a lot of information
either. If I am reading this right then fundamentally this patch means we
are tracking two pieces of information
1. The node the data resided on at the time of the hinting fault (numa_faults)
2. The node the accessing task was residing on at the time of the fault (faults_from)
We should be able to have names that reflect that. How about
memory_faults_locality and cpu_faults_locality with a prepartion patch
doing a simple rename for numa_faults and this patch adding
cpu_faults_locality?
It will be tough to be consistent about this but the clearer we are about
making decisions based on task locationo vs data location the happier we
will be in the long run.
> @@ -1372,10 +1373,11 @@ static void task_numa_placement(struct task_struct *p)
> int priv, i;
>
> for (priv = 0; priv < 2; priv++) {
> - long diff;
> + long diff, f_diff;
>
> i = task_faults_idx(nid, priv);
> diff = -p->numa_faults[i];
> + f_diff = -p->numa_faults_from[i];
>
> /* Decay existing window, copy faults since last scan */
> p->numa_faults[i] >>= 1;
> @@ -1383,12 +1385,18 @@ static void task_numa_placement(struct task_struct *p)
> fault_types[priv] += p->numa_faults_buffer[i];
> p->numa_faults_buffer[i] = 0;
>
> + p->numa_faults_from[i] >>= 1;
> + p->numa_faults_from[i] += p->numa_faults_from_buffer[i];
> + p->numa_faults_from_buffer[i] = 0;
> +
> faults += p->numa_faults[i];
> diff += p->numa_faults[i];
> + f_diff += p->numa_faults_from[i];
> p->total_numa_faults += diff;
> if (p->numa_group) {
> /* safe because we can only change our own group */
> p->numa_group->faults[i] += diff;
> + p->numa_group->faults_from[i] += f_diff;
> p->numa_group->total_faults += diff;
> group_faults += p->numa_group->faults[i];
> }
> @@ -1457,7 +1465,7 @@ static void task_numa_group(struct task_struct *p, int cpupid, int flags,
>
> if (unlikely(!p->numa_group)) {
> unsigned int size = sizeof(struct numa_group) +
> - 2*nr_node_ids*sizeof(unsigned long);
> + 4*nr_node_ids*sizeof(unsigned long);
>
Should we convert that magic number to a define? NR_NUMA_HINT_FAULT_STATS?
> grp = kzalloc(size, GFP_KERNEL | __GFP_NOWARN);
> if (!grp)
> @@ -1467,8 +1475,10 @@ static void task_numa_group(struct task_struct *p, int cpupid, int flags,
> spin_lock_init(&grp->lock);
> INIT_LIST_HEAD(&grp->task_list);
> grp->gid = p->pid;
> + /* Second half of the array tracks where faults come from */
> + grp->faults_from = grp->faults + 2 * nr_node_ids;
>
We have accessors when we overload arrays like this such as task_faults_idx
for example. We should have similar accessors for this in case those
offsets very change.
> - for (i = 0; i < 2*nr_node_ids; i++)
> + for (i = 0; i < 4*nr_node_ids; i++)
> grp->faults[i] = p->numa_faults[i];
>
This is a little obscure now. Functionally it is copying both numa_faults and
numa_faults_from but a casual reading of that will get confused. Minimally
it needs a comment explaining what is being copied here. Also, why did we
not use memcpy?
> grp->total_faults = p->total_numa_faults;
> @@ -1526,7 +1536,7 @@ static void task_numa_group(struct task_struct *p, int cpupid, int flags,
>
> double_lock(&my_grp->lock, &grp->lock);
>
> - for (i = 0; i < 2*nr_node_ids; i++) {
> + for (i = 0; i < 4*nr_node_ids; i++) {
> my_grp->faults[i] -= p->numa_faults[i];
> grp->faults[i] += p->numa_faults[i];
> }
The same obscure trick is used throughout and I'm not sure how
maintainable that will be. Would it be better to be explicit about this?
/* NUMA hinting faults may be either shared or private faults */
#define NR_NUMA_HINT_FAULT_TYPES 2
/* Track shared and private faults
#define NR_NUMA_HINT_FAULT_STATS (NR_NUMA_HINT_FAULT_TYPES*2)
> @@ -1558,7 +1568,7 @@ void task_numa_free(struct task_struct *p)
>
> if (grp) {
> spin_lock(&grp->lock);
> - for (i = 0; i < 2*nr_node_ids; i++)
> + for (i = 0; i < 4*nr_node_ids; i++)
> grp->faults[i] -= p->numa_faults[i];
> grp->total_faults -= p->total_numa_faults;
>
> @@ -1571,6 +1581,8 @@ void task_numa_free(struct task_struct *p)
>
> p->numa_faults = NULL;
> p->numa_faults_buffer = NULL;
> + p->numa_faults_from = NULL;
> + p->numa_faults_from_buffer = NULL;
> kfree(numa_faults);
> }
>
> @@ -1581,6 +1593,7 @@ void task_numa_fault(int last_cpupid, int node, int pages, int flags)
> {
> struct task_struct *p = current;
> bool migrated = flags & TNF_MIGRATED;
> + int this_node = task_node(current);
> int priv;
>
> if (!numabalancing_enabled)
> @@ -1596,7 +1609,7 @@ void task_numa_fault(int last_cpupid, int node, int pages, int flags)
>
> /* Allocate buffer to track faults on a per-node basis */
> if (unlikely(!p->numa_faults)) {
> - int size = sizeof(*p->numa_faults) * 2 * nr_node_ids;
> + int size = sizeof(*p->numa_faults) * 4 * nr_node_ids;
>
> /* numa_faults and numa_faults_buffer share the allocation */
> p->numa_faults = kzalloc(size * 2, GFP_KERNEL|__GFP_NOWARN);
> @@ -1604,7 +1617,9 @@ void task_numa_fault(int last_cpupid, int node, int pages, int flags)
> return;
>
> BUG_ON(p->numa_faults_buffer);
> - p->numa_faults_buffer = p->numa_faults + (2 * nr_node_ids);
> + p->numa_faults_from = p->numa_faults + (2 * nr_node_ids);
> + p->numa_faults_buffer = p->numa_faults + (4 * nr_node_ids);
> + p->numa_faults_from_buffer = p->numa_faults + (6 * nr_node_ids);
> p->total_numa_faults = 0;
> memset(p->numa_faults_locality, 0, sizeof(p->numa_faults_locality));
> }
> @@ -1634,6 +1649,7 @@ void task_numa_fault(int last_cpupid, int node, int pages, int flags)
> p->numa_pages_migrated += pages;
>
> p->numa_faults_buffer[task_faults_idx(node, priv)] += pages;
> + p->numa_faults_from_buffer[task_faults_idx(this_node, priv)] += pages;
> p->numa_faults_locality[!!(flags & TNF_FAULT_LOCAL)] += pages;
this_node and node is similarly ambiguous in terms of name. Rename of
data_node and cpu_node would have been clearer.
--
Mel Gorman
SUSE Labs
^ permalink raw reply [flat|nested] 19+ messages in thread* Re: [PATCH 2/6] numa,sched: track from which nodes NUMA faults are triggered
2014-01-21 12:21 ` Mel Gorman
@ 2014-01-21 22:26 ` Rik van Riel
-1 siblings, 0 replies; 19+ messages in thread
From: Rik van Riel @ 2014-01-21 22:26 UTC (permalink / raw)
To: Mel Gorman; +Cc: linux-kernel, linux-mm, peterz, mingo, chegu_vinod
On 01/21/2014 07:21 AM, Mel Gorman wrote:
> On Mon, Jan 20, 2014 at 02:21:03PM -0500, riel@redhat.com wrote:
>> +++ b/include/linux/sched.h
>> @@ -1492,6 +1492,14 @@ struct task_struct {
>> unsigned long *numa_faults_buffer;
>>
>> /*
>> + * Track the nodes where faults are incurred. This is not very
>> + * interesting on a per-task basis, but it help with smarter
>> + * numa memory placement for groups of processes.
>> + */
>> + unsigned long *numa_faults_from;
>> + unsigned long *numa_faults_from_buffer;
>> +
>
> As an aside I wonder if we can derive any useful metric from this
It may provide for a better way to tune the numa scan interval
than the current code, since the "local vs remote" ratio is not
going to provide us much useful info when dealing with a workload
that is spread across multiple numa nodes.
>> grp->total_faults = p->total_numa_faults;
>> @@ -1526,7 +1536,7 @@ static void task_numa_group(struct task_struct *p, int cpupid, int flags,
>>
>> double_lock(&my_grp->lock, &grp->lock);
>>
>> - for (i = 0; i < 2*nr_node_ids; i++) {
>> + for (i = 0; i < 4*nr_node_ids; i++) {
>> my_grp->faults[i] -= p->numa_faults[i];
>> grp->faults[i] += p->numa_faults[i];
>> }
>
> The same obscure trick is used throughout and I'm not sure how
> maintainable that will be. Would it be better to be explicit about this?
I have made a cleanup patch for this, using the defines you
suggested.
>> @@ -1634,6 +1649,7 @@ void task_numa_fault(int last_cpupid, int node, int pages, int flags)
>> p->numa_pages_migrated += pages;
>>
>> p->numa_faults_buffer[task_faults_idx(node, priv)] += pages;
>> + p->numa_faults_from_buffer[task_faults_idx(this_node, priv)] += pages;
>> p->numa_faults_locality[!!(flags & TNF_FAULT_LOCAL)] += pages;
>
> this_node and node is similarly ambiguous in terms of name. Rename of
> data_node and cpu_node would have been clearer.
I added a patch in the next version of the series.
Don't want to make the series too large, though :)
--
All rights reversed
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 19+ messages in thread* Re: [PATCH 2/6] numa,sched: track from which nodes NUMA faults are triggered
@ 2014-01-21 22:26 ` Rik van Riel
0 siblings, 0 replies; 19+ messages in thread
From: Rik van Riel @ 2014-01-21 22:26 UTC (permalink / raw)
To: Mel Gorman; +Cc: linux-kernel, linux-mm, peterz, mingo, chegu_vinod
On 01/21/2014 07:21 AM, Mel Gorman wrote:
> On Mon, Jan 20, 2014 at 02:21:03PM -0500, riel@redhat.com wrote:
>> +++ b/include/linux/sched.h
>> @@ -1492,6 +1492,14 @@ struct task_struct {
>> unsigned long *numa_faults_buffer;
>>
>> /*
>> + * Track the nodes where faults are incurred. This is not very
>> + * interesting on a per-task basis, but it help with smarter
>> + * numa memory placement for groups of processes.
>> + */
>> + unsigned long *numa_faults_from;
>> + unsigned long *numa_faults_from_buffer;
>> +
>
> As an aside I wonder if we can derive any useful metric from this
It may provide for a better way to tune the numa scan interval
than the current code, since the "local vs remote" ratio is not
going to provide us much useful info when dealing with a workload
that is spread across multiple numa nodes.
>> grp->total_faults = p->total_numa_faults;
>> @@ -1526,7 +1536,7 @@ static void task_numa_group(struct task_struct *p, int cpupid, int flags,
>>
>> double_lock(&my_grp->lock, &grp->lock);
>>
>> - for (i = 0; i < 2*nr_node_ids; i++) {
>> + for (i = 0; i < 4*nr_node_ids; i++) {
>> my_grp->faults[i] -= p->numa_faults[i];
>> grp->faults[i] += p->numa_faults[i];
>> }
>
> The same obscure trick is used throughout and I'm not sure how
> maintainable that will be. Would it be better to be explicit about this?
I have made a cleanup patch for this, using the defines you
suggested.
>> @@ -1634,6 +1649,7 @@ void task_numa_fault(int last_cpupid, int node, int pages, int flags)
>> p->numa_pages_migrated += pages;
>>
>> p->numa_faults_buffer[task_faults_idx(node, priv)] += pages;
>> + p->numa_faults_from_buffer[task_faults_idx(this_node, priv)] += pages;
>> p->numa_faults_locality[!!(flags & TNF_FAULT_LOCAL)] += pages;
>
> this_node and node is similarly ambiguous in terms of name. Rename of
> data_node and cpu_node would have been clearer.
I added a patch in the next version of the series.
Don't want to make the series too large, though :)
--
All rights reversed
^ permalink raw reply [flat|nested] 19+ messages in thread* Re: [PATCH 2/6] numa,sched: track from which nodes NUMA faults are triggered
2014-01-21 22:26 ` Rik van Riel
@ 2014-01-24 14:14 ` Mel Gorman
-1 siblings, 0 replies; 19+ messages in thread
From: Mel Gorman @ 2014-01-24 14:14 UTC (permalink / raw)
To: Rik van Riel; +Cc: linux-kernel, linux-mm, peterz, mingo, chegu_vinod
On Tue, Jan 21, 2014 at 05:26:39PM -0500, Rik van Riel wrote:
> On 01/21/2014 07:21 AM, Mel Gorman wrote:
> > On Mon, Jan 20, 2014 at 02:21:03PM -0500, riel@redhat.com wrote:
>
> >> +++ b/include/linux/sched.h
> >> @@ -1492,6 +1492,14 @@ struct task_struct {
> >> unsigned long *numa_faults_buffer;
> >>
> >> /*
> >> + * Track the nodes where faults are incurred. This is not very
> >> + * interesting on a per-task basis, but it help with smarter
> >> + * numa memory placement for groups of processes.
> >> + */
> >> + unsigned long *numa_faults_from;
> >> + unsigned long *numa_faults_from_buffer;
> >> +
> >
> > As an aside I wonder if we can derive any useful metric from this
>
> It may provide for a better way to tune the numa scan interval
> than the current code, since the "local vs remote" ratio is not
> going to provide us much useful info when dealing with a workload
> that is spread across multiple numa nodes.
>
Agreed. Local vs Remote handles the easier cases, particularly where the
workload has been configured to have aspects of it fit within NUMA nodes
(e.g. multiple JVMs, multiple virtual machines etc) but it's nowhere near
as useful for large single-image workloads spanning the full machine
I think in this New World Order that for single instance workloads we
would instead take into account the balance of all remote nodes. So if
all remote nodes are roughly even in terms of usage and we've decided to
interleave then slow the scan rate if the remote active nodes are evenly used
It's not something for this series just yet but I have observed a higher
system CPU usage as a result of this series. It's still far lower than
the overhead we had in the past but this is one potential idea that would
allow us to reduce the system overhead again in the future.
> >> grp->total_faults = p->total_numa_faults;
> >> @@ -1526,7 +1536,7 @@ static void task_numa_group(struct task_struct *p, int cpupid, int flags,
> >>
> >> double_lock(&my_grp->lock, &grp->lock);
> >>
> >> - for (i = 0; i < 2*nr_node_ids; i++) {
> >> + for (i = 0; i < 4*nr_node_ids; i++) {
> >> my_grp->faults[i] -= p->numa_faults[i];
> >> grp->faults[i] += p->numa_faults[i];
> >> }
> >
> > The same obscure trick is used throughout and I'm not sure how
> > maintainable that will be. Would it be better to be explicit about this?
>
> I have made a cleanup patch for this, using the defines you
> suggested.
>
> >> @@ -1634,6 +1649,7 @@ void task_numa_fault(int last_cpupid, int node, int pages, int flags)
> >> p->numa_pages_migrated += pages;
> >>
> >> p->numa_faults_buffer[task_faults_idx(node, priv)] += pages;
> >> + p->numa_faults_from_buffer[task_faults_idx(this_node, priv)] += pages;
> >> p->numa_faults_locality[!!(flags & TNF_FAULT_LOCAL)] += pages;
> >
> > this_node and node is similarly ambiguous in terms of name. Rename of
> > data_node and cpu_node would have been clearer.
>
> I added a patch in the next version of the series.
>
> Don't want to make the series too large, though :)
>
Understood, it's a bit of a mouthful already.
--
Mel Gorman
SUSE Labs
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 19+ messages in thread* Re: [PATCH 2/6] numa,sched: track from which nodes NUMA faults are triggered
@ 2014-01-24 14:14 ` Mel Gorman
0 siblings, 0 replies; 19+ messages in thread
From: Mel Gorman @ 2014-01-24 14:14 UTC (permalink / raw)
To: Rik van Riel; +Cc: linux-kernel, linux-mm, peterz, mingo, chegu_vinod
On Tue, Jan 21, 2014 at 05:26:39PM -0500, Rik van Riel wrote:
> On 01/21/2014 07:21 AM, Mel Gorman wrote:
> > On Mon, Jan 20, 2014 at 02:21:03PM -0500, riel@redhat.com wrote:
>
> >> +++ b/include/linux/sched.h
> >> @@ -1492,6 +1492,14 @@ struct task_struct {
> >> unsigned long *numa_faults_buffer;
> >>
> >> /*
> >> + * Track the nodes where faults are incurred. This is not very
> >> + * interesting on a per-task basis, but it help with smarter
> >> + * numa memory placement for groups of processes.
> >> + */
> >> + unsigned long *numa_faults_from;
> >> + unsigned long *numa_faults_from_buffer;
> >> +
> >
> > As an aside I wonder if we can derive any useful metric from this
>
> It may provide for a better way to tune the numa scan interval
> than the current code, since the "local vs remote" ratio is not
> going to provide us much useful info when dealing with a workload
> that is spread across multiple numa nodes.
>
Agreed. Local vs Remote handles the easier cases, particularly where the
workload has been configured to have aspects of it fit within NUMA nodes
(e.g. multiple JVMs, multiple virtual machines etc) but it's nowhere near
as useful for large single-image workloads spanning the full machine
I think in this New World Order that for single instance workloads we
would instead take into account the balance of all remote nodes. So if
all remote nodes are roughly even in terms of usage and we've decided to
interleave then slow the scan rate if the remote active nodes are evenly used
It's not something for this series just yet but I have observed a higher
system CPU usage as a result of this series. It's still far lower than
the overhead we had in the past but this is one potential idea that would
allow us to reduce the system overhead again in the future.
> >> grp->total_faults = p->total_numa_faults;
> >> @@ -1526,7 +1536,7 @@ static void task_numa_group(struct task_struct *p, int cpupid, int flags,
> >>
> >> double_lock(&my_grp->lock, &grp->lock);
> >>
> >> - for (i = 0; i < 2*nr_node_ids; i++) {
> >> + for (i = 0; i < 4*nr_node_ids; i++) {
> >> my_grp->faults[i] -= p->numa_faults[i];
> >> grp->faults[i] += p->numa_faults[i];
> >> }
> >
> > The same obscure trick is used throughout and I'm not sure how
> > maintainable that will be. Would it be better to be explicit about this?
>
> I have made a cleanup patch for this, using the defines you
> suggested.
>
> >> @@ -1634,6 +1649,7 @@ void task_numa_fault(int last_cpupid, int node, int pages, int flags)
> >> p->numa_pages_migrated += pages;
> >>
> >> p->numa_faults_buffer[task_faults_idx(node, priv)] += pages;
> >> + p->numa_faults_from_buffer[task_faults_idx(this_node, priv)] += pages;
> >> p->numa_faults_locality[!!(flags & TNF_FAULT_LOCAL)] += pages;
> >
> > this_node and node is similarly ambiguous in terms of name. Rename of
> > data_node and cpu_node would have been clearer.
>
> I added a patch in the next version of the series.
>
> Don't want to make the series too large, though :)
>
Understood, it's a bit of a mouthful already.
--
Mel Gorman
SUSE Labs
^ permalink raw reply [flat|nested] 19+ messages in thread