From: Peter Zijlstra <peterz@infradead.org>
To: Mel Gorman <mgorman@suse.de>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>,
Ingo Molnar <mingo@kernel.org>,
Andrea Arcangeli <aarcange@redhat.com>,
Johannes Weiner <hannes@cmpxchg.org>,
Linux-MM <linux-mm@kvack.org>,
LKML <linux-kernel@vger.kernel.org>
Subject: [PATCH] sched, numa: Improve scanner
Date: Thu, 25 Jul 2013 12:41:30 +0200
Message-ID: <20130725104130.GP27075@twins.programming.kicks-ass.net>
In-Reply-To: <1373901620-2021-1-git-send-email-mgorman@suse.de>
Subject: sched, numa: Improve scanner
From: Peter Zijlstra <peterz@infradead.org>
Date: Tue Jul 23 17:02:38 CEST 2013
With a trace_printk("working\n"); right after the cmpxchg in
task_numa_work() we can see that, in a 4-thread process, it's always the
same task winning the race and doing the protection change.
This is a problem since the task doing the protection change has a
penalty for taking faults -- it is busy when marking the PTEs. If it's
always the same task, the ->numa_faults[] get severely skewed.
Avoid this by delaying the task doing the protection change such that
it is unlikely to win the privilege again.
Before:
root@interlagos:~# grep "thread 0/.*working" /debug/tracing/trace | tail -15
thread 0/0-3232 [022] .... 212.787402: task_numa_work: working
thread 0/0-3232 [022] .... 212.888473: task_numa_work: working
thread 0/0-3232 [022] .... 212.989538: task_numa_work: working
thread 0/0-3232 [022] .... 213.090602: task_numa_work: working
thread 0/0-3232 [022] .... 213.191667: task_numa_work: working
thread 0/0-3232 [022] .... 213.292734: task_numa_work: working
thread 0/0-3232 [022] .... 213.393804: task_numa_work: working
thread 0/0-3232 [022] .... 213.494869: task_numa_work: working
thread 0/0-3232 [022] .... 213.596937: task_numa_work: working
thread 0/0-3232 [022] .... 213.699000: task_numa_work: working
thread 0/0-3232 [022] .... 213.801067: task_numa_work: working
thread 0/0-3232 [022] .... 213.903155: task_numa_work: working
thread 0/0-3232 [022] .... 214.005201: task_numa_work: working
thread 0/0-3232 [022] .... 214.107266: task_numa_work: working
thread 0/0-3232 [022] .... 214.209342: task_numa_work: working
After:
root@interlagos:~# grep "thread 0/.*working" /debug/tracing/trace | tail -15
thread 0/0-3253 [005] .... 136.865051: task_numa_work: working
thread 0/2-3255 [026] .... 136.965134: task_numa_work: working
thread 0/3-3256 [024] .... 137.065217: task_numa_work: working
thread 0/3-3256 [024] .... 137.165302: task_numa_work: working
thread 0/3-3256 [024] .... 137.265382: task_numa_work: working
thread 0/0-3253 [004] .... 137.366465: task_numa_work: working
thread 0/2-3255 [026] .... 137.466549: task_numa_work: working
thread 0/0-3253 [004] .... 137.566629: task_numa_work: working
thread 0/0-3253 [004] .... 137.666711: task_numa_work: working
thread 0/1-3254 [028] .... 137.766799: task_numa_work: working
thread 0/0-3253 [004] .... 137.866876: task_numa_work: working
thread 0/2-3255 [026] .... 137.966960: task_numa_work: working
thread 0/1-3254 [028] .... 138.067041: task_numa_work: working
thread 0/2-3255 [026] .... 138.167123: task_numa_work: working
thread 0/3-3256 [024] .... 138.267207: task_numa_work: working
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
---
kernel/sched/fair.c | 8 +++++++-
1 file changed, 7 insertions(+), 1 deletion(-)
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1316,6 +1316,12 @@ void task_numa_work(struct callback_head
 		return;
 
 	/*
+	 * Delay this task enough that another task of this mm will likely win
+	 * the next time around.
+	 */
+	p->node_stamp += 2 * TICK_NSEC;
+
+	/*
 	 * Do not set pte_numa if the current running node is rate-limited.
 	 * This loses statistics on the fault but if we are unwilling to
 	 * migrate to this node, it is less likely we can do useful work
@@ -1405,7 +1411,7 @@ void task_tick_numa(struct rq *rq, struc
 	if (now - curr->node_stamp > period) {
 		if (!curr->node_stamp)
 			curr->numa_scan_period = task_scan_min(curr);
-		curr->node_stamp = now;
+		curr->node_stamp += period;
 
 		if (!time_before(jiffies, curr->mm->numa_next_scan)) {
 			init_task_work(work, task_numa_work); /* TODO: move this into sched_fork() */
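
For readers who want to see why the two hunks change who wins, here is a
toy user-space model of the race. It is only a sketch and not kernel code:
the names sim_task, simulate(), distinct_winners(), TICK, PERIOD and
SCAN_GAP are made up for illustration, time is in abstract units, and the
assumption that thread i ticks at a fixed offset i within each tick
interval is a simplification chosen to make the result deterministic.

```c
#include <assert.h>
#include <string.h>

#define NR_THREADS 4
#define TICK       10L   /* abstract time units between ticks of one thread */
#define PERIOD     100L  /* stand-in for the per-task scan period */
#define SCAN_GAP   100L  /* how far a winner pushes the shared next-scan time */

struct sim_task {
	long node_stamp;  /* plays the role of p->node_stamp */
	long wins;        /* times this thread did the "protection change" */
};

/*
 * Run the toy model until end_time. patched == 0 models the old
 * behaviour (node_stamp = now, no extra delay); patched != 0 models
 * the two changes in the patch.
 */
static void simulate(int patched, long end_time, struct sim_task *t)
{
	long next_scan = 0;  /* stand-in for mm->numa_next_scan */

	assert(t != NULL);
	memset(t, 0, NR_THREADS * sizeof(*t));

	for (long now = 0; now < end_time; now++) {
		/* Assumption: thread i ticks at times i, i + TICK, ... */
		int i = (int)(now % TICK);

		if (i >= NR_THREADS)
			continue;
		if (now - t[i].node_stamp <= PERIOD)
			continue;  /* this thread's period has not expired */

		if (patched)
			t[i].node_stamp += PERIOD;  /* second hunk */
		else
			t[i].node_stamp = now;

		if (now < next_scan)
			continue;  /* another thread already won this round */

		/* This thread "wins the cmpxchg" and does the scan. */
		next_scan = now + SCAN_GAP;
		t[i].wins++;

		if (patched)
			t[i].node_stamp += 2 * TICK;  /* first hunk */
	}
}

/* Count how many threads won at least one round. */
static int distinct_winners(const struct sim_task *t)
{
	int n = 0;

	for (int i = 0; i < NR_THREADS; i++)
		if (t[i].wins > 0)
			n++;
	return n;
}
```

Calling simulate(0, 2000, t) on an array of NR_THREADS tasks leaves a
single thread with all the wins in this model, while simulate(1, 2000, t)
spreads the wins over all four threads. That matches the shape of the
Before/After traces above, but the model says nothing about real tick
timing or scheduling.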