* [PATCH 0/3] sched,numa: reduce page migrations with pseudo-interleaving
@ 2014-04-11 17:00 riel
2014-04-11 17:00 ` [PATCH 1/3] sched,numa: count pages on active node as local riel
` (2 more replies)
0 siblings, 3 replies; 19+ messages in thread
From: riel @ 2014-04-11 17:00 UTC (permalink / raw)
To: linux-kernel; +Cc: mingo, peterz, chegu_vinod, mgorman
The pseudo-interleaving code deals fairly well with the placement
of tasks that are part of workloads that span multiple NUMA nodes,
but a number of corner cases remain that can result in
higher-than-desired overhead.
This patch series reduces the overhead slightly, mostly visible
through a lower number of page migrations, while leaving the
throughput of the workload essentially unchanged.
On smaller NUMA systems, these patches should have little or no
effect. On a 4 node test system, I did see a reduction in the
number of page migrations running SPECjbb2005; autonuma-benchmark
appears to be unaffected.
NUMA page migrations on an 8 node system running SPECjbb2005,
courtesy of Vinod:
                      vanilla    with patch
8 - 1 socket wide:    9138324    8918971
4 - 2 socket wide:    8239914    7315148
2 - 4 socket wide:    5732744    3849624
1 - 8 socket wide:    3348475    2660347
^ permalink raw reply [flat|nested] 19+ messages in thread* [PATCH 1/3] sched,numa: count pages on active node as local 2014-04-11 17:00 [PATCH 0/3] sched,numa: reduce page migrations with pseudo-interleaving riel @ 2014-04-11 17:00 ` riel 2014-04-11 17:34 ` Joe Perches ` (2 more replies) 2014-04-11 17:00 ` [PATCH 2/3] sched,numa: retry placement more frequently when misplaced riel 2014-04-11 17:00 ` [PATCH 3/3] sched,numa: do not set preferred_node on migration to a second choice node riel 2 siblings, 3 replies; 19+ messages in thread From: riel @ 2014-04-11 17:00 UTC (permalink / raw) To: linux-kernel; +Cc: mingo, peterz, chegu_vinod, mgorman From: Rik van Riel <riel@redhat.com> The NUMA code is smart enough to distribute the memory of workloads that span multiple NUMA nodes across those NUMA nodes. However, it still has a pretty high scan rate for such workloads, because any memory that is left on a node other than the node of the CPU that faulted on the memory is counted as non-local, which causes the scan rate to go up. Counting the memory on any node where the task's numa group is actively running as local, allows the scan rate to slow down once the application is settled in. This should reduce the overhead of the automatic NUMA placement code, when a workload spans multiple NUMA nodes. 
Signed-off-by: Rik van Riel <riel@redhat.com> Tested-by: Vinod Chegu <chegu_vinod@hp.com> --- kernel/sched/fair.c | 14 +++++++++++++- 1 file changed, 13 insertions(+), 1 deletion(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 7e9bd0b..12de50a 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -1737,6 +1737,7 @@ void task_numa_fault(int last_cpupid, int mem_node, int pages, int flags) struct task_struct *p = current; bool migrated = flags & TNF_MIGRATED; int cpu_node = task_node(current); + int local = !!(flags & TNF_FAULT_LOCAL); int priv; if (!numabalancing_enabled) @@ -1785,6 +1786,17 @@ void task_numa_fault(int last_cpupid, int mem_node, int pages, int flags) task_numa_group(p, last_cpupid, flags, &priv); } + /* + * If a workload spans multiple NUMA nodes, a shared fault that + * occurs wholly within the set of nodes that the workload is + * actively using should be counted as local. This allows the + * scan rate to slow down when a workload has settled down. + */ + if (!priv && !local && p->numa_group && + node_isset(cpu_node, p->numa_group->active_nodes) && + node_isset(mem_node, p->numa_group->active_nodes)) + local = 1; + task_numa_placement(p); /* @@ -1799,7 +1811,7 @@ void task_numa_fault(int last_cpupid, int mem_node, int pages, int flags) p->numa_faults_buffer_memory[task_faults_idx(mem_node, priv)] += pages; p->numa_faults_buffer_cpu[task_faults_idx(cpu_node, priv)] += pages; - p->numa_faults_locality[!!(flags & TNF_FAULT_LOCAL)] += pages; + p->numa_faults_locality[local] += pages; } static void reset_ptenuma_scan(struct task_struct *p) -- 1.8.5.3 ^ permalink raw reply related [flat|nested] 19+ messages in thread
* Re: [PATCH 1/3] sched,numa: count pages on active node as local 2014-04-11 17:00 ` [PATCH 1/3] sched,numa: count pages on active node as local riel @ 2014-04-11 17:34 ` Joe Perches 2014-04-11 17:41 ` Rik van Riel 2014-04-25 9:04 ` Mel Gorman 2014-05-08 10:42 ` [tip:sched/core] sched/numa: Count " tip-bot for Rik van Riel 2 siblings, 1 reply; 19+ messages in thread From: Joe Perches @ 2014-04-11 17:34 UTC (permalink / raw) To: riel; +Cc: linux-kernel, mingo, peterz, chegu_vinod, mgorman On Fri, 2014-04-11 at 13:00 -0400, riel@redhat.com wrote: > This should reduce the overhead of the automatic NUMA placement > code, when a workload spans multiple NUMA nodes. trivial style note: > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c [] > @@ -1737,6 +1737,7 @@ void task_numa_fault(int last_cpupid, int mem_node, int pages, int flags) > struct task_struct *p = current; > bool migrated = flags & TNF_MIGRATED; > int cpu_node = task_node(current); > + int local = !!(flags & TNF_FAULT_LOCAL); Perhaps local would look nicer as bool and be better placed next to migrated. bool migrated = flags & TNF_MIGRATED; bool local = flags & TNF_FAULT_LOCAL; int cpu_node = task_node(current); > @@ -1785,6 +1786,17 @@ void task_numa_fault(int last_cpupid, int mem_node, int pages, int flags) > + if (!priv && !local && p->numa_group && > + node_isset(cpu_node, p->numa_group->active_nodes) && > + node_isset(mem_node, p->numa_group->active_nodes)) > + local = 1; local = true; ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH 1/3] sched,numa: count pages on active node as local 2014-04-11 17:34 ` Joe Perches @ 2014-04-11 17:41 ` Rik van Riel 2014-04-11 18:01 ` Joe Perches 0 siblings, 1 reply; 19+ messages in thread From: Rik van Riel @ 2014-04-11 17:41 UTC (permalink / raw) To: Joe Perches; +Cc: linux-kernel, mingo, peterz, chegu_vinod, mgorman On 04/11/2014 01:34 PM, Joe Perches wrote: > On Fri, 2014-04-11 at 13:00 -0400, riel@redhat.com wrote: >> This should reduce the overhead of the automatic NUMA placement >> code, when a workload spans multiple NUMA nodes. > > trivial style note: > >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > [] >> @@ -1737,6 +1737,7 @@ void task_numa_fault(int last_cpupid, int mem_node, int pages, int flags) >> struct task_struct *p = current; >> bool migrated = flags & TNF_MIGRATED; >> int cpu_node = task_node(current); >> + int local = !!(flags & TNF_FAULT_LOCAL); > > Perhaps local would look nicer as bool > and be better placed next to migrated. The problem is, at the end of the function, local is used as an array index... p->numa_faults_locality[local] += pages; } I'm not sure I really want to use a bool as an array index :) -- All rights reversed ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH 1/3] sched,numa: count pages on active node as local 2014-04-11 17:41 ` Rik van Riel @ 2014-04-11 18:01 ` Joe Perches 0 siblings, 0 replies; 19+ messages in thread From: Joe Perches @ 2014-04-11 18:01 UTC (permalink / raw) To: Rik van Riel; +Cc: linux-kernel, mingo, peterz, chegu_vinod, mgorman On Fri, 2014-04-11 at 13:41 -0400, Rik van Riel wrote: > On 04/11/2014 01:34 PM, Joe Perches wrote: > > On Fri, 2014-04-11 at 13:00 -0400, riel@redhat.com wrote: > >> This should reduce the overhead of the automatic NUMA placement > >> code, when a workload spans multiple NUMA nodes. [] > > Perhaps local would look nicer as bool > > and be better placed next to migrated. > > The problem is, at the end of the function, local is used as an > array index... Oh, thanks. > p->numa_faults_locality[local] += pages; > > I'm not sure I really want to use a bool as an array index :) Nor I particularly but using bool as an array index is already done in at least kernel/workqueue.c and kernel/trace/blktrace.c Doesn't make it right of course. ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH 1/3] sched,numa: count pages on active node as local 2014-04-11 17:00 ` [PATCH 1/3] sched,numa: count pages on active node as local riel 2014-04-11 17:34 ` Joe Perches @ 2014-04-25 9:04 ` Mel Gorman 2014-05-08 10:42 ` [tip:sched/core] sched/numa: Count " tip-bot for Rik van Riel 2 siblings, 0 replies; 19+ messages in thread From: Mel Gorman @ 2014-04-25 9:04 UTC (permalink / raw) To: riel; +Cc: linux-kernel, mingo, peterz, chegu_vinod On Fri, Apr 11, 2014 at 01:00:27PM -0400, riel@redhat.com wrote: > From: Rik van Riel <riel@redhat.com> > > The NUMA code is smart enough to distribute the memory of workloads > that span multiple NUMA nodes across those NUMA nodes. > > However, it still has a pretty high scan rate for such workloads, > because any memory that is left on a node other than the node of > the CPU that faulted on the memory is counted as non-local, which > causes the scan rate to go up. > > Counting the memory on any node where the task's numa group is > actively running as local, allows the scan rate to slow down > once the application is settled in. > > This should reduce the overhead of the automatic NUMA placement > code, when a workload spans multiple NUMA nodes. > > Signed-off-by: Rik van Riel <riel@redhat.com> > Tested-by: Vinod Chegu <chegu_vinod@hp.com> Acked-by: Mel Gorman <mgorman@suse.de> -- Mel Gorman SUSE Labs ^ permalink raw reply [flat|nested] 19+ messages in thread
* [tip:sched/core] sched/numa: Count pages on active node as local 2014-04-11 17:00 ` [PATCH 1/3] sched,numa: count pages on active node as local riel 2014-04-11 17:34 ` Joe Perches 2014-04-25 9:04 ` Mel Gorman @ 2014-05-08 10:42 ` tip-bot for Rik van Riel 2 siblings, 0 replies; 19+ messages in thread From: tip-bot for Rik van Riel @ 2014-05-08 10:42 UTC (permalink / raw) To: linux-tip-commits Cc: linux-kernel, hpa, mingo, torvalds, peterz, riel, chegu_vinod, mgorman, tglx Commit-ID: 792568ec6a31ca560ca4d528782cbc6cd2cea8b0 Gitweb: http://git.kernel.org/tip/792568ec6a31ca560ca4d528782cbc6cd2cea8b0 Author: Rik van Riel <riel@redhat.com> AuthorDate: Fri, 11 Apr 2014 13:00:27 -0400 Committer: Ingo Molnar <mingo@kernel.org> CommitDate: Wed, 7 May 2014 13:33:45 +0200 sched/numa: Count pages on active node as local The NUMA code is smart enough to distribute the memory of workloads that span multiple NUMA nodes across those NUMA nodes. However, it still has a pretty high scan rate for such workloads, because any memory that is left on a node other than the node of the CPU that faulted on the memory is counted as non-local, which causes the scan rate to go up. Counting the memory on any node where the task's numa group is actively running as local, allows the scan rate to slow down once the application is settled in. This should reduce the overhead of the automatic NUMA placement code, when a workload spans multiple NUMA nodes. 
Signed-off-by: Rik van Riel <riel@redhat.com> Tested-by: Vinod Chegu <chegu_vinod@hp.com> Acked-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/1397235629-16328-2-git-send-email-riel@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org> --- kernel/sched/fair.c | 14 +++++++++++++- 1 file changed, 13 insertions(+), 1 deletion(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 5d859ec..f6457b6 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -1738,6 +1738,7 @@ void task_numa_fault(int last_cpupid, int mem_node, int pages, int flags) struct task_struct *p = current; bool migrated = flags & TNF_MIGRATED; int cpu_node = task_node(current); + int local = !!(flags & TNF_FAULT_LOCAL); int priv; if (!numabalancing_enabled) @@ -1786,6 +1787,17 @@ void task_numa_fault(int last_cpupid, int mem_node, int pages, int flags) task_numa_group(p, last_cpupid, flags, &priv); } + /* + * If a workload spans multiple NUMA nodes, a shared fault that + * occurs wholly within the set of nodes that the workload is + * actively using should be counted as local. This allows the + * scan rate to slow down when a workload has settled down. + */ + if (!priv && !local && p->numa_group && + node_isset(cpu_node, p->numa_group->active_nodes) && + node_isset(mem_node, p->numa_group->active_nodes)) + local = 1; + task_numa_placement(p); /* @@ -1800,7 +1812,7 @@ void task_numa_fault(int last_cpupid, int mem_node, int pages, int flags) p->numa_faults_buffer_memory[task_faults_idx(mem_node, priv)] += pages; p->numa_faults_buffer_cpu[task_faults_idx(cpu_node, priv)] += pages; - p->numa_faults_locality[!!(flags & TNF_FAULT_LOCAL)] += pages; + p->numa_faults_locality[local] += pages; } static void reset_ptenuma_scan(struct task_struct *p) ^ permalink raw reply related [flat|nested] 19+ messages in thread
* [PATCH 2/3] sched,numa: retry placement more frequently when misplaced 2014-04-11 17:00 [PATCH 0/3] sched,numa: reduce page migrations with pseudo-interleaving riel 2014-04-11 17:00 ` [PATCH 1/3] sched,numa: count pages on active node as local riel @ 2014-04-11 17:00 ` riel 2014-04-11 17:46 ` Joe Perches ` (2 more replies) 2014-04-11 17:00 ` [PATCH 3/3] sched,numa: do not set preferred_node on migration to a second choice node riel 2 siblings, 3 replies; 19+ messages in thread From: riel @ 2014-04-11 17:00 UTC (permalink / raw) To: linux-kernel; +Cc: mingo, peterz, chegu_vinod, mgorman From: Rik van Riel <riel@redhat.com> When tasks have not converged on their preferred nodes yet, we want to retry fairly often, to make sure we do not migrate a task's memory to an undesirable location, only to have to move it again later. This patch reduces the interval at which migration is retried, when the task's numa_scan_period is small. Signed-off-by: Rik van Riel <riel@redhat.com> Tested-by: Vinod Chegu <chegu_vinod@hp.com> --- kernel/sched/fair.c | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 12de50a..babd316 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -1326,12 +1326,15 @@ static int task_numa_migrate(struct task_struct *p) /* Attempt to migrate a task to a CPU on the preferred node. 
*/ static void numa_migrate_preferred(struct task_struct *p) { + unsigned long interval = HZ; + /* This task has no NUMA fault statistics yet */ if (unlikely(p->numa_preferred_nid == -1 || !p->numa_faults_memory)) return; /* Periodically retry migrating the task to the preferred node */ - p->numa_migrate_retry = jiffies + HZ; + interval = min(interval, msecs_to_jiffies(p->numa_scan_period) / 16); + p->numa_migrate_retry = jiffies + interval; /* Success if task is already running on preferred CPU */ if (task_node(p) == p->numa_preferred_nid) -- 1.8.5.3 ^ permalink raw reply related [flat|nested] 19+ messages in thread
* Re: [PATCH 2/3] sched,numa: retry placement more frequently when misplaced 2014-04-11 17:00 ` [PATCH 2/3] sched,numa: retry placement more frequently when misplaced riel @ 2014-04-11 17:46 ` Joe Perches 2014-04-11 18:03 ` Rik van Riel 2014-04-25 9:05 ` Mel Gorman 2014-05-08 10:42 ` [tip:sched/core] sched/numa: Retry " tip-bot for Rik van Riel 2 siblings, 1 reply; 19+ messages in thread From: Joe Perches @ 2014-04-11 17:46 UTC (permalink / raw) To: riel; +Cc: linux-kernel, mingo, peterz, chegu_vinod, mgorman On Fri, 2014-04-11 at 13:00 -0400, riel@redhat.com wrote: > This patch reduces the interval at which migration is retried, > when the task's numa_scan_period is small. More style trivia and a question. > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c [] > @@ -1326,12 +1326,15 @@ static int task_numa_migrate(struct task_struct *p) > /* Attempt to migrate a task to a CPU on the preferred node. */ > static void numa_migrate_preferred(struct task_struct *p) > { > + unsigned long interval = HZ; Perhaps it'd be better without the unnecessary initialization. > /* This task has no NUMA fault statistics yet */ > if (unlikely(p->numa_preferred_nid == -1 || !p->numa_faults_memory)) > return; > > /* Periodically retry migrating the task to the preferred node */ > - p->numa_migrate_retry = jiffies + HZ; > + interval = min(interval, msecs_to_jiffies(p->numa_scan_period) / 16); and use interval = min_t(unsigned long, HZ, msecs_to_jiffies(p->numa_scan_period) / 16); btw; why 16? Can msecs_to_jiffies(p->numa_scan_period) ever be < 16? ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH 2/3] sched,numa: retry placement more frequently when misplaced 2014-04-11 17:46 ` Joe Perches @ 2014-04-11 18:03 ` Rik van Riel 2014-04-14 8:19 ` Ingo Molnar 0 siblings, 1 reply; 19+ messages in thread From: Rik van Riel @ 2014-04-11 18:03 UTC (permalink / raw) To: Joe Perches; +Cc: linux-kernel, mingo, peterz, chegu_vinod, mgorman On 04/11/2014 01:46 PM, Joe Perches wrote: > On Fri, 2014-04-11 at 13:00 -0400, riel@redhat.com wrote: >> This patch reduces the interval at which migration is retried, >> when the task's numa_scan_period is small. > > More style trivia and a question. > >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > [] >> @@ -1326,12 +1326,15 @@ static int task_numa_migrate(struct task_struct *p) >> /* Attempt to migrate a task to a CPU on the preferred node. */ >> static void numa_migrate_preferred(struct task_struct *p) >> { >> + unsigned long interval = HZ; > > Perhaps it'd be better without the unnecessary initialization. > >> /* This task has no NUMA fault statistics yet */ >> if (unlikely(p->numa_preferred_nid == -1 || !p->numa_faults_memory)) >> return; >> >> /* Periodically retry migrating the task to the preferred node */ >> - p->numa_migrate_retry = jiffies + HZ; >> + interval = min(interval, msecs_to_jiffies(p->numa_scan_period) / 16); > > and use > > interval = min_t(unsigned long, HZ, > msecs_to_jiffies(p->numa_scan_period) / 16); That's what I had before, but spilling things over across multiple lines like that didn't exactly help readability. > btw; why 16? > > Can msecs_to_jiffies(p->numa_scan_period) ever be < 16? I picked 16 because there is a cost tradeoff between unmapping and faulting (and potentially migrating) a task's memory, which is very expensive, and searching for a better NUMA node to run on, which is potentially slightly expensive. 
This way we may run on the wrong NUMA node for around 6% of the time between unmapping all of the task's memory (and faulting it back in with NUMA hinting faults), before retrying migration of the task to a better node. I suppose it is possible for a sysadmin to set the minimum numa scan period to under 16 milliseconds, but if your system is trying to unmap all of a task's memory every 16 milliseconds, and fault it back in, task placement is likely to be the least of your problems :) -- All rights reversed ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH 2/3] sched,numa: retry placement more frequently when misplaced 2014-04-11 18:03 ` Rik van Riel @ 2014-04-14 8:19 ` Ingo Molnar 0 siblings, 0 replies; 19+ messages in thread From: Ingo Molnar @ 2014-04-14 8:19 UTC (permalink / raw) To: Rik van Riel Cc: Joe Perches, linux-kernel, peterz, chegu_vinod, mgorman, the arch/x86 maintainers * Rik van Riel <riel@redhat.com> wrote: > On 04/11/2014 01:46 PM, Joe Perches wrote: > > On Fri, 2014-04-11 at 13:00 -0400, riel@redhat.com wrote: > >> This patch reduces the interval at which migration is retried, > >> when the task's numa_scan_period is small. > > > > More style trivia and a question. [...] > > interval = min_t(unsigned long, HZ, > > msecs_to_jiffies(p->numa_scan_period) / 16); > > That's what I had before, but spilling things over across multiple > lines like that didn't exactly help readability. Joe, the low quality 'trivial reviews' you are doing are becoming a burden on people, so could you please either exclude arch/x86/ from your "waste other people's time" email filter; or improve the quality (and relevance) of your replies, so that you consider not just trivialities but the bigger picture as well? Most people who enter kernel development start with trivial patches and within a few months of learning the kernel ropes rise up to more serious, more useful topics. This process is constructive: it gives the Linux kernel a constant influx of useful cleanup patches, while it also gives newbies a smooth, easy path of entry into kernel hacking. Thus we maintainers are friendly and inclusive to newbie-originated cleanups. But the problem is that you remained stuck at the cleanup level for about a decade, and you don't seem to be willing to improve ... The thing is, as of today there's hundreds of thousands of 'cleanups' flagged by checkpatch.pl on the latest kernel tree. 
That body of "easy changes" is more than enough to keep newbies busy, but had all those lines of code been commented on for trivialities by cleanup-only people like you - forcing another 1 million often bogus commits into the development process - it would have bogged Linux down to a standstill and would have prevented _real_ review from occurring. Every time you comment on a patch, considering only trivialities, you risk crowding out some real reviewer who might mistakenly believe when seeing a reply to the email thread that the patch in question got a thorough review... So what you are doing is starting to become net destructive:

 - it is crowding out real review
 - the 'solutions' you suggest often result in worse code
 - you burden contributors and maintainers with make-work 'cleanups' created by a professional cleanup bureaucrat.

Stop doing these low-substance reviews! Do some real reviews (which can very much include trivial comments as well), or try to improve your review abilities by writing real kernel code. Thanks, Ingo ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH 2/3] sched,numa: retry placement more frequently when misplaced 2014-04-11 17:00 ` [PATCH 2/3] sched,numa: retry placement more frequently when misplaced riel 2014-04-11 17:46 ` Joe Perches @ 2014-04-25 9:05 ` Mel Gorman 2014-05-08 10:42 ` [tip:sched/core] sched/numa: Retry " tip-bot for Rik van Riel 2 siblings, 0 replies; 19+ messages in thread From: Mel Gorman @ 2014-04-25 9:05 UTC (permalink / raw) To: riel; +Cc: linux-kernel, mingo, peterz, chegu_vinod On Fri, Apr 11, 2014 at 01:00:28PM -0400, riel@redhat.com wrote: > From: Rik van Riel <riel@redhat.com> > > When tasks have not converged on their preferred nodes yet, we want > to retry fairly often, to make sure we do not migrate a task's memory > to an undesirable location, only to have to move it again later. > > This patch reduces the interval at which migration is retried, > when the task's numa_scan_period is small. > > Signed-off-by: Rik van Riel <riel@redhat.com> > Tested-by: Vinod Chegu <chegu_vinod@hp.com> Acked-by: Mel Gorman <mgorman@suse.de> -- Mel Gorman SUSE Labs ^ permalink raw reply [flat|nested] 19+ messages in thread
* [tip:sched/core] sched/numa: Retry placement more frequently when misplaced 2014-04-11 17:00 ` [PATCH 2/3] sched,numa: retry placement more frequently when misplaced riel 2014-04-11 17:46 ` Joe Perches 2014-04-25 9:05 ` Mel Gorman @ 2014-05-08 10:42 ` tip-bot for Rik van Riel 2 siblings, 0 replies; 19+ messages in thread From: tip-bot for Rik van Riel @ 2014-05-08 10:42 UTC (permalink / raw) To: linux-tip-commits Cc: linux-kernel, hpa, mingo, torvalds, peterz, riel, chegu_vinod, mgorman, tglx Commit-ID: 5085e2a328849bdee6650b32d52c87c3788ab01c Gitweb: http://git.kernel.org/tip/5085e2a328849bdee6650b32d52c87c3788ab01c Author: Rik van Riel <riel@redhat.com> AuthorDate: Fri, 11 Apr 2014 13:00:28 -0400 Committer: Ingo Molnar <mingo@kernel.org> CommitDate: Wed, 7 May 2014 13:33:46 +0200 sched/numa: Retry placement more frequently when misplaced When tasks have not converged on their preferred nodes yet, we want to retry fairly often, to make sure we do not migrate a task's memory to an undesirable location, only to have to move it again later. This patch reduces the interval at which migration is retried, when the task's numa_scan_period is small. Signed-off-by: Rik van Riel <riel@redhat.com> Tested-by: Vinod Chegu <chegu_vinod@hp.com> Acked-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/1397235629-16328-3-git-send-email-riel@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org> --- kernel/sched/fair.c | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index f6457b6..ecea8d9 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -1326,12 +1326,15 @@ static int task_numa_migrate(struct task_struct *p) /* Attempt to migrate a task to a CPU on the preferred node. 
*/ static void numa_migrate_preferred(struct task_struct *p) { + unsigned long interval = HZ; + /* This task has no NUMA fault statistics yet */ if (unlikely(p->numa_preferred_nid == -1 || !p->numa_faults_memory)) return; /* Periodically retry migrating the task to the preferred node */ - p->numa_migrate_retry = jiffies + HZ; + interval = min(interval, msecs_to_jiffies(p->numa_scan_period) / 16); + p->numa_migrate_retry = jiffies + interval; /* Success if task is already running on preferred CPU */ if (task_node(p) == p->numa_preferred_nid) ^ permalink raw reply related [flat|nested] 19+ messages in thread
* [PATCH 3/3] sched,numa: do not set preferred_node on migration to a second choice node 2014-04-11 17:00 [PATCH 0/3] sched,numa: reduce page migrations with pseudo-interleaving riel 2014-04-11 17:00 ` [PATCH 1/3] sched,numa: count pages on active node as local riel 2014-04-11 17:00 ` [PATCH 2/3] sched,numa: retry placement more frequently when misplaced riel @ 2014-04-11 17:00 ` riel 2014-04-14 12:56 ` Peter Zijlstra ` (2 more replies) 2 siblings, 3 replies; 19+ messages in thread From: riel @ 2014-04-11 17:00 UTC (permalink / raw) To: linux-kernel; +Cc: mingo, peterz, chegu_vinod, mgorman From: Rik van Riel <riel@redhat.com> Setting the numa_preferred_node for a task in task_numa_migrate does nothing on a 2-node system. Either we migrate to the node that already was our preferred node, or we stay where we were. On a 4-node system, it can slightly decrease overhead, by not calling the NUMA code as much. Since every node tends to be directly connected to every other node, running on the wrong node for a while does not do much damage. However, on an 8 node system, there are far more bad nodes than there are good ones, and pretending that a second choice is actually the preferred node can greatly delay, or even prevent, a workload from converging. The only time we can safely pretend that a second choice node is the preferred node is when the task is part of a workload that spans multiple NUMA nodes. 
Signed-off-by: Rik van Riel <riel@redhat.com> Tested-by: Vinod Chegu <chegu_vinod@hp.com> --- kernel/sched/fair.c | 11 ++++++++++- 1 file changed, 10 insertions(+), 1 deletion(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index babd316..302facf 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -1301,7 +1301,16 @@ static int task_numa_migrate(struct task_struct *p) if (env.best_cpu == -1) return -EAGAIN; - sched_setnuma(p, env.dst_nid); + /* + * If the task is part of a workload that spans multiple NUMA nodes, + * and is migrating into one of the workload's active nodes, remember + * this node as the task's preferred numa node, so the workload can + * settle down. + * A task that migrated to a second choice node will be better off + * trying for a better one later. Do not set the preferred node here. + */ + if (p->numa_group && node_isset(env.dst_nid, p->numa_group->active_nodes)) + sched_setnuma(p, env.dst_nid); /* * Reset the scan period if the task is being rescheduled on an -- 1.8.5.3 ^ permalink raw reply related [flat|nested] 19+ messages in thread
* Re: [PATCH 3/3] sched,numa: do not set preferred_node on migration to a second choice node 2014-04-11 17:00 ` [PATCH 3/3] sched,numa: do not set preferred_node on migration to a second choice node riel @ 2014-04-14 12:56 ` Peter Zijlstra 2014-04-15 14:35 ` Rik van Riel 2014-04-25 9:09 ` Mel Gorman 2014-05-08 10:43 ` [tip:sched/core] sched/numa: Do " tip-bot for Rik van Riel 2 siblings, 1 reply; 19+ messages in thread From: Peter Zijlstra @ 2014-04-14 12:56 UTC (permalink / raw) To: riel; +Cc: linux-kernel, mingo, chegu_vinod, mgorman On Fri, Apr 11, 2014 at 01:00:29PM -0400, riel@redhat.com wrote: > From: Rik van Riel <riel@redhat.com> > > Setting the numa_preferred_node for a task in task_numa_migrate > does nothing on a 2-node system. Either we migrate to the node > that already was our preferred node, or we stay where we were. > > On a 4-node system, it can slightly decrease overhead, by not > calling the NUMA code as much. Since every node tends to be > directly connected to every other node, running on the wrong > node for a while does not do much damage. > > However, on an 8 node system, there are far more bad nodes > than there are good ones, and pretending that a second choice > is actually the preferred node can greatly delay, or even > prevent, a workload from converging. > > The only time we can safely pretend that a second choice > node is the preferred node is when the task is part of a > workload that spans multiple NUMA nodes. 
> > Signed-off-by: Rik van Riel <riel@redhat.com> > Tested-by: Vinod Chegu <chegu_vinod@hp.com> > --- > kernel/sched/fair.c | 11 ++++++++++- > 1 file changed, 10 insertions(+), 1 deletion(-) > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > index babd316..302facf 100644 > --- a/kernel/sched/fair.c > +++ b/kernel/sched/fair.c > @@ -1301,7 +1301,16 @@ static int task_numa_migrate(struct task_struct *p) > if (env.best_cpu == -1) > return -EAGAIN; > > - sched_setnuma(p, env.dst_nid); > + /* > + * If the task is part of a workload that spans multiple NUMA nodes, > + * and is migrating into one of the workload's active nodes, remember I read 'into' as: !node_isset(env.src_nid, ...) && node_isset(env.dst_nid, ...) The code doesn't seem to do this. > + * this node as the task's preferred numa node, so the workload can > + * settle down. > + * A task that migrated to a second choice node will be better off > + * trying for a better one later. Do not set the preferred node here. > + */ > + if (p->numa_group && node_isset(env.dst_nid, p->numa_group->active_nodes)) > + sched_setnuma(p, env.dst_nid); OK, so I was totally confused on this one. What I missed was that we set the primary choice over in task_numa_placement(). I'm not really happy with the changelog; but I'm also struggling to identify what exactly is missing. Or rather, the thing makes me confused, and not feel like it actually explains it proper. That said; I tend to more or less agree with the actual change, but.. ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH 3/3] sched,numa: do not set preferred_node on migration to a second choice node 2014-04-14 12:56 ` Peter Zijlstra @ 2014-04-15 14:35 ` Rik van Riel 2014-04-15 16:51 ` Peter Zijlstra 0 siblings, 1 reply; 19+ messages in thread From: Rik van Riel @ 2014-04-15 14:35 UTC (permalink / raw) To: Peter Zijlstra; +Cc: linux-kernel, mingo, chegu_vinod, mgorman On 04/14/2014 08:56 AM, Peter Zijlstra wrote: > On Fri, Apr 11, 2014 at 01:00:29PM -0400, riel@redhat.com wrote: >> From: Rik van Riel <riel@redhat.com> >> >> Setting the numa_preferred_node for a task in task_numa_migrate >> does nothing on a 2-node system. Either we migrate to the node >> that already was our preferred node, or we stay where we were. >> >> On a 4-node system, it can slightly decrease overhead, by not >> calling the NUMA code as much. Since every node tends to be >> directly connected to every other node, running on the wrong >> node for a while does not do much damage. >> >> However, on an 8 node system, there are far more bad nodes >> than there are good ones, and pretending that a second choice >> is actually the preferred node can greatly delay, or even >> prevent, a workload from converging. >> >> The only time we can safely pretend that a second choice >> node is the preferred node is when the task is part of a >> workload that spans multiple NUMA nodes. 
>>
>> Signed-off-by: Rik van Riel <riel@redhat.com>
>> Tested-by: Vinod Chegu <chegu_vinod@hp.com>
>> ---
>>  kernel/sched/fair.c | 11 ++++++++++-
>>  1 file changed, 10 insertions(+), 1 deletion(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index babd316..302facf 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -1301,7 +1301,16 @@ static int task_numa_migrate(struct task_struct *p)
>>  	if (env.best_cpu == -1)
>>  		return -EAGAIN;
>>
>> -	sched_setnuma(p, env.dst_nid);
>> +	/*
>> +	 * If the task is part of a workload that spans multiple NUMA nodes,
>> +	 * and is migrating into one of the workload's active nodes, remember
>
> I read 'into' as:
>
>   !node_isset(env.src_nid, ...) && node_isset(env.dst_nid, ...)
>
> The code doesn't seem to do this.

s/into/to/ makes the comment and the code match again :)

>> +	 * this node as the task's preferred numa node, so the workload can
>> +	 * settle down.
>> +	 * A task that migrated to a second choice node will be better off
>> +	 * trying for a better one later. Do not set the preferred node here.
>> +	 */
>> +	if (p->numa_group && node_isset(env.dst_nid, p->numa_group->active_nodes))
>> +		sched_setnuma(p, env.dst_nid);
>
> OK, so I was totally confused on this one.
>
> What I missed was that we set the primary choice over in
> task_numa_placement().
>
> I'm not really happy with the changelog; but I'm also struggling to
> identify what exactly is missing. Or rather, the thing makes me
> confused, and not feel like it actually explains it proper.
>
> That said; I tend to more or less agree with the actual change, but..

I have looked at the comment and the changelog some
more, and it is not clear to me what you are missing,
or what I could be explaining better...

^ permalink raw reply	[flat|nested] 19+ messages in thread
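[Editor's note: the distinction Peter raises between migrating "into" and migrating "to" an active node can be spelled out as two standalone predicates. This is an illustrative sketch only: the `unsigned long` standing in for the kernel's `nodemask_t`, the `node_isset` macro, and both function names are simplified stand-ins, not the kernel's actual API.]

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in for nodemask_t: one bit per NUMA node. */
typedef unsigned long nodemask_t;
#define node_isset(node, mask) (((mask) >> (node)) & 1UL)

/* 'into', as Peter read the comment: the task crosses the boundary
 * of the active set, i.e. src is outside it and dst is inside it. */
static bool migrating_into(nodemask_t active, int src_nid, int dst_nid)
{
	return !node_isset(src_nid, active) && node_isset(dst_nid, active);
}

/* 'to', which is what the patch actually tests: only the destination
 * matters; the source node is ignored entirely. */
static bool migrating_to(nodemask_t active, int src_nid, int dst_nid)
{
	(void)src_nid;
	return node_isset(dst_nid, active);
}
```

A migration between two nodes that are both already active satisfies "to" but not "into", which is why `s/into/to/` reconciles the comment with the code.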
* Re: [PATCH 3/3] sched,numa: do not set preferred_node on migration to a second choice node
  2014-04-15 14:35     ` Rik van Riel
@ 2014-04-15 16:51       ` Peter Zijlstra
  0 siblings, 0 replies; 19+ messages in thread
From: Peter Zijlstra @ 2014-04-15 16:51 UTC (permalink / raw)
To: Rik van Riel; +Cc: linux-kernel, mingo, chegu_vinod, mgorman

On Tue, Apr 15, 2014 at 10:35:29AM -0400, Rik van Riel wrote:
> I have looked at the comment and the changelog some
> more, and it is not clear to me what you are missing,
> or what I could be explaining better...

Yeah, can't blame you. I couldn't explain myself either, other than
having this feeling there's something there.

In any case, I can't blame you for that, and I did queue the patches.

^ permalink raw reply	[flat|nested] 19+ messages in thread
* Re: [PATCH 3/3] sched,numa: do not set preferred_node on migration to a second choice node
  2014-04-11 17:00 ` [PATCH 3/3] sched,numa: do not set preferred_node on migration to a second choice node riel
  2014-04-14 12:56   ` Peter Zijlstra
@ 2014-04-25  9:09   ` Mel Gorman
  2014-05-08 10:43   ` [tip:sched/core] sched/numa: Do " tip-bot for Rik van Riel
  2 siblings, 0 replies; 19+ messages in thread
From: Mel Gorman @ 2014-04-25  9:09 UTC (permalink / raw)
To: riel; +Cc: linux-kernel, mingo, peterz, chegu_vinod

On Fri, Apr 11, 2014 at 01:00:29PM -0400, riel@redhat.com wrote:
> From: Rik van Riel <riel@redhat.com>
>
> Setting the numa_preferred_node for a task in task_numa_migrate
> does nothing on a 2-node system. Either we migrate to the node
> that already was our preferred node, or we stay where we were.
>
> On a 4-node system, it can slightly decrease overhead, by not
> calling the NUMA code as much. Since every node tends to be
> directly connected to every other node, running on the wrong
> node for a while does not do much damage.
>

Guess what size of machine I do the vast bulk of testing of automatic
NUMA balancing on!

> However, on an 8 node system, there are far more bad nodes
> than there are good ones, and pretending that a second choice
> is actually the preferred node can greatly delay, or even
> prevent, a workload from converging.
>
> The only time we can safely pretend that a second choice
> node is the preferred node is when the task is part of a
> workload that spans multiple NUMA nodes.
>
> Signed-off-by: Rik van Riel <riel@redhat.com>
> Tested-by: Vinod Chegu <chegu_vinod@hp.com>

Acked-by: Mel Gorman <mgorman@suse.de>

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 19+ messages in thread
* [tip:sched/core] sched/numa: Do not set preferred_node on migration to a second choice node
  2014-04-11 17:00 ` [PATCH 3/3] sched,numa: do not set preferred_node on migration to a second choice node riel
  2014-04-14 12:56   ` Peter Zijlstra
  2014-04-25  9:09   ` Mel Gorman
@ 2014-05-08 10:43   ` tip-bot for Rik van Riel
  2 siblings, 0 replies; 19+ messages in thread
From: tip-bot for Rik van Riel @ 2014-05-08 10:43 UTC (permalink / raw)
To: linux-tip-commits
Cc: linux-kernel, hpa, mingo, torvalds, peterz, riel, chegu_vinod,
	mgorman, tglx

Commit-ID:  68d1b02a58f5d9f584c1fb2923ed60ec68cbbd9b
Gitweb:     http://git.kernel.org/tip/68d1b02a58f5d9f584c1fb2923ed60ec68cbbd9b
Author:     Rik van Riel <riel@redhat.com>
AuthorDate: Fri, 11 Apr 2014 13:00:29 -0400
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 7 May 2014 13:33:47 +0200

sched/numa: Do not set preferred_node on migration to a second choice node

Setting the numa_preferred_node for a task in task_numa_migrate
does nothing on a 2-node system. Either we migrate to the node
that already was our preferred node, or we stay where we were.

On a 4-node system, it can slightly decrease overhead, by not
calling the NUMA code as much. Since every node tends to be
directly connected to every other node, running on the wrong
node for a while does not do much damage.

However, on an 8 node system, there are far more bad nodes
than there are good ones, and pretending that a second choice
is actually the preferred node can greatly delay, or even
prevent, a workload from converging.

The only time we can safely pretend that a second choice
node is the preferred node is when the task is part of a
workload that spans multiple NUMA nodes.
Signed-off-by: Rik van Riel <riel@redhat.com>
Tested-by: Vinod Chegu <chegu_vinod@hp.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1397235629-16328-4-git-send-email-riel@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/fair.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ecea8d9..051903f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1301,7 +1301,16 @@ static int task_numa_migrate(struct task_struct *p)
 	if (env.best_cpu == -1)
 		return -EAGAIN;

-	sched_setnuma(p, env.dst_nid);
+	/*
+	 * If the task is part of a workload that spans multiple NUMA nodes,
+	 * and is migrating into one of the workload's active nodes, remember
+	 * this node as the task's preferred numa node, so the workload can
+	 * settle down.
+	 * A task that migrated to a second choice node will be better off
+	 * trying for a better one later. Do not set the preferred node here.
+	 */
+	if (p->numa_group && node_isset(env.dst_nid, p->numa_group->active_nodes))
+		sched_setnuma(p, env.dst_nid);

 	/*
 	 * Reset the scan period if the task is being rescheduled on an

^ permalink raw reply related	[flat|nested] 19+ messages in thread
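[Editor's note: for readers outside the kernel tree, the gist of the merged check can be sketched in plain userspace C. This is a simplified stand-in, not kernel code: the `unsigned long` nodemask, `node_isset` macro, and the `struct task` / `maybe_set_preferred` names below replace `nodemask_t`, `struct task_struct`, and `sched_setnuma` for illustration only.]

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for nodemask_t: one bit per NUMA node. */
typedef unsigned long nodemask_t;
#define node_isset(node, mask) (((mask) >> (node)) & 1UL)

struct numa_group {
	nodemask_t active_nodes;	/* nodes the group actively runs on */
};

struct task {
	struct numa_group *numa_group;	/* NULL if not in a shared group */
	int preferred_nid;		/* -1 if no preferred node set */
};

/*
 * Mirror of the patched logic in task_numa_migrate(): only remember
 * dst_nid as the preferred node when the task belongs to a multi-node
 * group AND dst_nid is one of that group's active nodes.  A migration
 * to any other (second choice) node leaves preferred_nid untouched,
 * so the task keeps trying for a better placement later.
 */
static void maybe_set_preferred(struct task *p, int dst_nid)
{
	if (p->numa_group && node_isset(dst_nid, p->numa_group->active_nodes))
		p->preferred_nid = dst_nid;
}
```

On a 2-node system this check is a no-op in practice, and on larger systems it prevents a second-choice node from being treated as the final destination, which is the convergence problem the changelog describes for 8-node machines.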
end of thread, other threads:[~2014-05-08 10:44 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-04-11 17:00 [PATCH 0/3] sched,numa: reduce page migrations with pseudo-interleaving riel
2014-04-11 17:00 ` [PATCH 1/3] sched,numa: count pages on active node as local riel
2014-04-11 17:34   ` Joe Perches
2014-04-11 17:41     ` Rik van Riel
2014-04-11 18:01       ` Joe Perches
2014-04-25  9:04   ` Mel Gorman
2014-05-08 10:42   ` [tip:sched/core] sched/numa: Count " tip-bot for Rik van Riel
2014-04-11 17:00 ` [PATCH 2/3] sched,numa: retry placement more frequently when misplaced riel
2014-04-11 17:46   ` Joe Perches
2014-04-11 18:03     ` Rik van Riel
2014-04-14  8:19       ` Ingo Molnar
2014-04-25  9:05   ` Mel Gorman
2014-05-08 10:42   ` [tip:sched/core] sched/numa: Retry " tip-bot for Rik van Riel
2014-04-11 17:00 ` [PATCH 3/3] sched,numa: do not set preferred_node on migration to a second choice node riel
2014-04-14 12:56   ` Peter Zijlstra
2014-04-15 14:35     ` Rik van Riel
2014-04-15 16:51       ` Peter Zijlstra
2014-04-25  9:09   ` Mel Gorman
2014-05-08 10:43   ` [tip:sched/core] sched/numa: Do " tip-bot for Rik van Riel