* [PATCH 0/2] numa,sched: resolve conflict between load balancing and NUMA balancing
@ 2015-05-27 19:04 riel
2015-05-27 19:04 ` [PATCH 1/2] revert 095bebf61a46 ("sched/numa: Do not move past the balance point if unbalanced") riel
` (2 more replies)
0 siblings, 3 replies; 11+ messages in thread
From: riel @ 2015-05-27 19:04 UTC (permalink / raw)
To: linux-kernel; +Cc: mgorman, jhladky, peterz, mingo, dedekind1
A previous attempt to resolve a major conflict between load balancing and
NUMA balancing, changeset 095bebf61a46 ("sched/numa: Do not move past the
balance point if unbalanced"), introduced its own problems.
Revert that changeset, and introduce a new fix, which actually seems to
resolve the issues observed in Jirka's tests.
A test where the main thread creates a large memory area, and spawns
a worker thread to iterate over the memory (placed on another node
by select_task_rq_fair), after which the main thread goes to sleep
and waits for the worker thread to loop over all the memory now sees
the worker thread migrated to where the memory is, instead of having
all the memory migrated over like before.
Jirka has run a number of performance tests on several systems:
single instance SpecJBB 2005 performance is 7-15% higher on a 4 node
system, with higher gains on systems with more cores per socket.
Multi-instance SpecJBB 2005 (one per node), linpack, and stream see
little or no changes with the revert of 095bebf61a46 and this patch.
^ permalink raw reply [flat|nested] 11+ messages in thread* [PATCH 1/2] revert 095bebf61a46 ("sched/numa: Do not move past the balance point if unbalanced") 2015-05-27 19:04 [PATCH 0/2] numa,sched: resolve conflict between load balancing and NUMA balancing riel @ 2015-05-27 19:04 ` riel 2015-06-07 17:47 ` [tip:sched/core] Revert " tip-bot for Rik van Riel 2015-05-27 19:04 ` [PATCH 2/2] numa,sched: only consider less busy nodes as numa balancing destination riel 2015-05-29 7:44 ` [PATCH 0/2] numa,sched: resolve conflict between load balancing and NUMA balancing Artem Bityutskiy 2 siblings, 1 reply; 11+ messages in thread From: riel @ 2015-05-27 19:04 UTC (permalink / raw) To: linux-kernel; +Cc: mgorman, jhladky, peterz, mingo, dedekind1 From: Rik van Riel <riel@redhat.com> Commit 095bebf61a46 ("sched/numa: Do not move past the balance point if unbalanced") broke convergence of workloads with just one runnable thread, by making it impossible for the one runnable thread on the system to move from one NUMA node to another. Instead, the thread would remain where it was, and pull all the memory across to its location, which is much slower than just migrating the thread to where the memory is. The next patch has a better fix for the issue that 095bebf61a46 tried to address. Reported-by: Jirka Hladky <jhladky@redhat.com> Signed-off-by: Rik van Riel <riel@redhat.com> --- kernel/sched/fair.c | 41 +++++++++++++++-------------------------- 1 file changed, 15 insertions(+), 26 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index ffeaa4105e48..c47bf0dffb34 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -1198,11 +1198,9 @@ static void task_numa_assign(struct task_numa_env *env, static bool load_too_imbalanced(long src_load, long dst_load, struct task_numa_env *env) { + long imb, old_imb; + long orig_src_load, orig_dst_load; long src_capacity, dst_capacity; - long orig_src_load; - long load_a, load_b; - long moved_load; - long imb; /* * The load is corrected for the CPU capacity available on each node. @@ -1215,39 +1213,30 @@ static bool load_too_imbalanced(long src_load, long dst_load, dst_capacity = env->dst_stats.compute_capacity; /* We care about the slope of the imbalance, not the direction. */ - load_a = dst_load; - load_b = src_load; - if (load_a < load_b) - swap(load_a, load_b); + if (dst_load < src_load) + swap(dst_load, src_load); /* Is the difference below the threshold? */ - imb = load_a * src_capacity * 100 - - load_b * dst_capacity * env->imbalance_pct; + imb = dst_load * src_capacity * 100 - + src_load * dst_capacity * env->imbalance_pct; if (imb <= 0) return false; /* * The imbalance is above the allowed threshold. - * Allow a move that brings us closer to a balanced situation, - * without moving things past the point of balance. + * Compare it with the old imbalance. */ orig_src_load = env->src_stats.load; + orig_dst_load = env->dst_stats.load; - /* - * In a task swap, there will be one load moving from src to dst, - * and another moving back. This is the net sum of both moves. - * A simple task move will always have a positive value. - * Allow the move if it brings the system closer to a balanced - * situation, without crossing over the balance point. - */ - moved_load = orig_src_load - src_load; + if (orig_dst_load < orig_src_load) + swap(orig_dst_load, orig_src_load); - if (moved_load > 0) - /* Moving src -> dst. Did we overshoot balance? */ - return src_load * dst_capacity < dst_load * src_capacity; - else - /* Moving dst -> src. Did we overshoot balance? */ - return dst_load * src_capacity < src_load * dst_capacity; + old_imb = orig_dst_load * src_capacity * 100 - + orig_src_load * dst_capacity * env->imbalance_pct; + + /* Would this change make things worse? */ + return (imb > old_imb); } /* -- 2.1.0 ^ permalink raw reply related [flat|nested] 11+ messages in thread
* [tip:sched/core] Revert 095bebf61a46 ("sched/numa: Do not move past the balance point if unbalanced") 2015-05-27 19:04 ` [PATCH 1/2] revert 095bebf61a46 ("sched/numa: Do not move past the balance point if unbalanced") riel @ 2015-06-07 17:47 ` tip-bot for Rik van Riel 0 siblings, 0 replies; 11+ messages in thread From: tip-bot for Rik van Riel @ 2015-06-07 17:47 UTC (permalink / raw) To: linux-tip-commits Cc: peterz, tglx, torvalds, mingo, hpa, akpm, jhladky, riel, linux-kernel Commit-ID: e4991b240c622f0441c21f4869e13209abc08c5e Gitweb: http://git.kernel.org/tip/e4991b240c622f0441c21f4869e13209abc08c5e Author: Rik van Riel <riel@redhat.com> AuthorDate: Wed, 27 May 2015 15:04:27 -0400 Committer: Ingo Molnar <mingo@kernel.org> CommitDate: Sun, 7 Jun 2015 15:57:44 +0200 Revert 095bebf61a46 ("sched/numa: Do not move past the balance point if unbalanced") Commit 095bebf61a46 ("sched/numa: Do not move past the balance point if unbalanced") broke convergence of workloads with just one runnable thread, by making it impossible for the one runnable thread on the system to move from one NUMA node to another. Instead, the thread would remain where it was, and pull all the memory across to its location, which is much slower than just migrating the thread to where the memory is. The next patch has a better fix for the issue that 095bebf61a46 tried to address. Reported-by: Jirka Hladky <jhladky@redhat.com> Signed-off-by: Rik van Riel <riel@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: dedekind1@gmail.com Cc: mgorman@suse.de Link: http://lkml.kernel.org/r/1432753468-7785-2-git-send-email-riel@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org> --- kernel/sched/fair.c | 41 +++++++++++++++-------------------------- 1 file changed, 15 insertions(+), 26 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 84ada05..723d69e 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -1198,11 +1198,9 @@ static void task_numa_assign(struct task_numa_env *env, static bool load_too_imbalanced(long src_load, long dst_load, struct task_numa_env *env) { + long imb, old_imb; + long orig_src_load, orig_dst_load; long src_capacity, dst_capacity; - long orig_src_load; - long load_a, load_b; - long moved_load; - long imb; /* * The load is corrected for the CPU capacity available on each node. @@ -1215,39 +1213,30 @@ static bool load_too_imbalanced(long src_load, long dst_load, dst_capacity = env->dst_stats.compute_capacity; /* We care about the slope of the imbalance, not the direction. */ - load_a = dst_load; - load_b = src_load; - if (load_a < load_b) - swap(load_a, load_b); + if (dst_load < src_load) + swap(dst_load, src_load); /* Is the difference below the threshold? */ - imb = load_a * src_capacity * 100 - - load_b * dst_capacity * env->imbalance_pct; + imb = dst_load * src_capacity * 100 - + src_load * dst_capacity * env->imbalance_pct; if (imb <= 0) return false; /* * The imbalance is above the allowed threshold. - * Allow a move that brings us closer to a balanced situation, - * without moving things past the point of balance. + * Compare it with the old imbalance. */ orig_src_load = env->src_stats.load; + orig_dst_load = env->dst_stats.load; - /* - * In a task swap, there will be one load moving from src to dst, - * and another moving back. This is the net sum of both moves. - * A simple task move will always have a positive value. - * Allow the move if it brings the system closer to a balanced - * situation, without crossing over the balance point. - */ - moved_load = orig_src_load - src_load; + if (orig_dst_load < orig_src_load) + swap(orig_dst_load, orig_src_load); - if (moved_load > 0) - /* Moving src -> dst. Did we overshoot balance? */ - return src_load * dst_capacity < dst_load * src_capacity; - else - /* Moving dst -> src. Did we overshoot balance? */ - return dst_load * src_capacity < src_load * dst_capacity; + old_imb = orig_dst_load * src_capacity * 100 - + orig_src_load * dst_capacity * env->imbalance_pct; + + /* Would this change make things worse? */ + return (imb > old_imb); } /* ^ permalink raw reply related [flat|nested] 11+ messages in thread
* [PATCH 2/2] numa,sched: only consider less busy nodes as numa balancing destination 2015-05-27 19:04 [PATCH 0/2] numa,sched: resolve conflict between load balancing and NUMA balancing riel 2015-05-27 19:04 ` [PATCH 1/2] revert 095bebf61a46 ("sched/numa: Do not move past the balance point if unbalanced") riel @ 2015-05-27 19:04 ` riel 2015-05-28 8:29 ` Mel Gorman 2015-05-28 11:07 ` Srikar Dronamraju 2015-05-29 7:44 ` [PATCH 0/2] numa,sched: resolve conflict between load balancing and NUMA balancing Artem Bityutskiy 2 siblings, 2 replies; 11+ messages in thread From: riel @ 2015-05-27 19:04 UTC (permalink / raw) To: linux-kernel; +Cc: mgorman, jhladky, peterz, mingo, dedekind1 From: Rik van Riel <riel@redhat.com> Changeset a43455a1 ("sched/numa: Ensure task_numa_migrate() checks the preferred node") fixes an issue where workloads would never converge on a fully loaded (or overloaded) system. However, it introduces a regression on less than fully loaded systems, where workloads converge on a few NUMA nodes, instead of properly staying spread out across the whole system. This leads to a reduction in available memory bandwidth, and usable CPU cache, with predictable performance problems. The root cause appears to be an interaction between the load balancer and NUMA balancing, where the short term load represented by the load balancer differs from the long term load the NUMA balancing code would like to base its decisions on. Simply reverting a43455a1 would re-introduce the non-convergence of workloads on fully loaded systems, so that is not a good option. As an aside, the check done before a43455a1 only applied to a task's preferred node, not to other candidate nodes in the system, so the converge-on-too-few-nodes problem still happens, just to a lesser degree. Instead, try to compensate for the impedance mismatch between the load balancer and NUMA balancing by only ever considering a lesser loaded node as a destination for NUMA balancing, regardless of whether the task is trying to move to the preferred node, or to another node. This patch also addresses the issue that a system with a single runnable thread would never migrate that thread to near its memory, introduced by 095bebf61a46 ("sched/numa: Do not move past the balance point if unbalanced"). A test where the main thread creates a large memory area, and spawns a worker thread to iterate over the memory (placed on another node by select_task_rq_fair), after which the main thread goes to sleep and waits for the worker thread to loop over all the memory now sees the worker thread migrated to where the memory is, instead of having all the memory migrated over like before. Jirka has run a number of performance tests on several systems: single instance SpecJBB 2005 performance is 7-15% higher on a 4 node system, with higher gains on systems with more cores per socket. Multi-instance SpecJBB 2005 (one per node), linpack, and stream see little or no changes with the revert of 095bebf61a46 and this patch. Signed-off-by: Rik van Riel <riel@redhat.com> Reported-by: Artem Bityutski <dedekind1@gmail.com> Reported-by: Jirka Hladky <jhladky@redhat.com> Tested-by: Jirka Hladky <jhladky@redhat.com> --- kernel/sched/fair.c | 30 ++++++++++++++++++++++++++++-- 1 file changed, 28 insertions(+), 2 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index c47bf0dffb34..f655f2ad155d 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -1398,6 +1398,30 @@ static void task_numa_find_cpu(struct task_numa_env *env, } } +/* Only move tasks to a NUMA node less busy than the current node. */ +static bool numa_has_capacity(struct task_numa_env *env) +{ + struct numa_stats *src = &env->src_stats; + struct numa_stats *dst = &env->dst_stats; + + if (src->has_free_capacity && !dst->has_free_capacity) + return false; + + /* + * Only consider a task move if the source has a higher destination + * than the destination, corrected for CPU capacity on each node. + * + * src->load dst->load + * --------------------- vs --------------------- + * src->compute_capacity dst->compute_capacity + */ + if (src->load * dst->compute_capacity > + dst->load * src->compute_capacity) + return true; + + return false; +} + static int task_numa_migrate(struct task_struct *p) { struct task_numa_env env = { @@ -1452,7 +1476,8 @@ static int task_numa_migrate(struct task_struct *p) update_numa_stats(&env.dst_stats, env.dst_nid); /* Try to find a spot on the preferred nid. */ - task_numa_find_cpu(&env, taskimp, groupimp); + if (numa_has_capacity(&env)) + task_numa_find_cpu(&env, taskimp, groupimp); /* * Look at other nodes in these cases: @@ -1483,7 +1508,8 @@ static int task_numa_migrate(struct task_struct *p) env.dist = dist; env.dst_nid = nid; update_numa_stats(&env.dst_stats, env.dst_nid); - task_numa_find_cpu(&env, taskimp, groupimp); + if (numa_has_capacity(&env)) + task_numa_find_cpu(&env, taskimp, groupimp); } } -- 2.1.0 ^ permalink raw reply related [flat|nested] 11+ messages in thread
* Re: [PATCH 2/2] numa,sched: only consider less busy nodes as numa balancing destination 2015-05-27 19:04 ` [PATCH 2/2] numa,sched: only consider less busy nodes as numa balancing destination riel @ 2015-05-28 8:29 ` Mel Gorman 2015-05-28 11:07 ` Srikar Dronamraju 1 sibling, 0 replies; 11+ messages in thread From: Mel Gorman @ 2015-05-28 8:29 UTC (permalink / raw) To: riel; +Cc: linux-kernel, jhladky, peterz, mingo, dedekind1 On Wed, May 27, 2015 at 03:04:28PM -0400, riel@redhat.com wrote: > From: Rik van Riel <riel@redhat.com> > > Changeset a43455a1 ("sched/numa: Ensure task_numa_migrate() checks the > preferred node") fixes an issue where workloads would never converge > on a fully loaded (or overloaded) system. > > However, it introduces a regression on less than fully loaded systems, > where workloads converge on a few NUMA nodes, instead of properly staying > spread out across the whole system. This leads to a reduction in available > memory bandwidth, and usable CPU cache, with predictable performance problems. > > The root cause appears to be an interaction between the load balancer and > NUMA balancing, where the short term load represented by the load balancer > differs from the long term load the NUMA balancing code would like to base > its decisions on. > > Simply reverting a43455a1 would re-introduce the non-convergence of > workloads on fully loaded systems, so that is not a good option. As > an aside, the check done before a43455a1 only applied to a task's > preferred node, not to other candidate nodes in the system, so the > converge-on-too-few-nodes problem still happens, just to a lesser > degree. > > Instead, try to compensate for the impedance mismatch between the > load balancer and NUMA balancing by only ever considering a lesser > loaded node as a destination for NUMA balancing, regardless of > whether the task is trying to move to the preferred node, or to another > node. > > This patch also addresses the issue that a system with a single runnable > thread would never migrate that thread to near its memory, introduced by > 095bebf61a46 ("sched/numa: Do not move past the balance point if unbalanced"). > > A test where the main thread creates a large memory area, and spawns > a worker thread to iterate over the memory (placed on another node > by select_task_rq_fair), after which the main thread goes to sleep > and waits for the worker thread to loop over all the memory now sees > the worker thread migrated to where the memory is, instead of having > all the memory migrated over like before. > > Jirka has run a number of performance tests on several systems: > single instance SpecJBB 2005 performance is 7-15% higher on a 4 node > system, with higher gains on systems with more cores per socket. > Multi-instance SpecJBB 2005 (one per node), linpack, and stream see > little or no changes with the revert of 095bebf61a46 and this patch. > > Signed-off-by: Rik van Riel <riel@redhat.com> > Reported-by: Artem Bityutski <dedekind1@gmail.com> > Reported-by: Jirka Hladky <jhladky@redhat.com> > Tested-by: Jirka Hladky <jhladky@redhat.com> Acked-by: Mel Gorman <mgorman@suse.de> -- Mel Gorman SUSE Labs ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH 2/2] numa,sched: only consider less busy nodes as numa balancing destination 2015-05-27 19:04 ` [PATCH 2/2] numa,sched: only consider less busy nodes as numa balancing destination riel 2015-05-28 8:29 ` Mel Gorman @ 2015-05-28 11:07 ` Srikar Dronamraju 2015-05-28 13:33 ` Rik van Riel 1 sibling, 1 reply; 11+ messages in thread From: Srikar Dronamraju @ 2015-05-28 11:07 UTC (permalink / raw) To: riel; +Cc: linux-kernel, mgorman, jhladky, peterz, mingo, dedekind1 * riel@redhat.com <riel@redhat.com> [2015-05-27 15:04:28]: > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > index c47bf0dffb34..f655f2ad155d 100644 > --- a/kernel/sched/fair.c > +++ b/kernel/sched/fair.c > @@ -1398,6 +1398,30 @@ static void task_numa_find_cpu(struct task_numa_env *env, > } > } > > +/* Only move tasks to a NUMA node less busy than the current node. */ > +static bool numa_has_capacity(struct task_numa_env *env) > +{ > + struct numa_stats *src = &env->src_stats; > + struct numa_stats *dst = &env->dst_stats; > + > + if (src->has_free_capacity && !dst->has_free_capacity) > + return false; > + > + /* > + * Only consider a task move if the source has a higher destination > + * than the destination, corrected for CPU capacity on each node. In the above comment, did you mean source has higher load than the destination? > + * > + * src->load dst->load > + * --------------------- vs --------------------- > + * src->compute_capacity dst->compute_capacity > + */ > + if (src->load * dst->compute_capacity > > + dst->load * src->compute_capacity) > + return true; > + > + return false; > +} > + > -- Thanks and Regards Srikar Dronamraju ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH 2/2] numa,sched: only consider less busy nodes as numa balancing destination 2015-05-28 11:07 ` Srikar Dronamraju @ 2015-05-28 13:33 ` Rik van Riel 2015-05-28 13:37 ` Peter Zijlstra 0 siblings, 1 reply; 11+ messages in thread From: Rik van Riel @ 2015-05-28 13:33 UTC (permalink / raw) To: Srikar Dronamraju Cc: linux-kernel, mgorman, jhladky, peterz, mingo, dedekind1 On 05/28/2015 07:07 AM, Srikar Dronamraju wrote: > * riel@redhat.com <riel@redhat.com> [2015-05-27 15:04:28]: > >> >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c >> index c47bf0dffb34..f655f2ad155d 100644 >> --- a/kernel/sched/fair.c >> +++ b/kernel/sched/fair.c >> @@ -1398,6 +1398,30 @@ static void task_numa_find_cpu(struct task_numa_env *env, >> } >> } >> >> +/* Only move tasks to a NUMA node less busy than the current node. */ >> +static bool numa_has_capacity(struct task_numa_env *env) >> +{ >> + struct numa_stats *src = &env->src_stats; >> + struct numa_stats *dst = &env->dst_stats; >> + >> + if (src->has_free_capacity && !dst->has_free_capacity) >> + return false; >> + >> + /* >> + * Only consider a task move if the source has a higher destination >> + * than the destination, corrected for CPU capacity on each node. > > In the above comment, did you mean source has higher load than the > destination? Uh yes, indeed. Thanks for spotting that. Peter, do you want me to send a v2 with the typo fixed, or is it easier for you to fix up the typo before committing? -- All rights reversed ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH 2/2] numa,sched: only consider less busy nodes as numa balancing destination 2015-05-28 13:33 ` Rik van Riel @ 2015-05-28 13:37 ` Peter Zijlstra 2015-05-28 13:52 ` [PATCH v2 " Rik van Riel 0 siblings, 1 reply; 11+ messages in thread From: Peter Zijlstra @ 2015-05-28 13:37 UTC (permalink / raw) To: Rik van Riel Cc: Srikar Dronamraju, linux-kernel, mgorman, jhladky, mingo, dedekind1 On Thu, May 28, 2015 at 09:33:25AM -0400, Rik van Riel wrote: > Peter, do you want me to send a v2 with the typo fixed, or > is it easier for you to fix up the typo before committing? Please send a new one, and also fix the Changelog to only include 12 char SHA1 references :-) ^ permalink raw reply [flat|nested] 11+ messages in thread
* [PATCH v2 2/2] numa,sched: only consider less busy nodes as numa balancing destination 2015-05-28 13:37 ` Peter Zijlstra @ 2015-05-28 13:52 ` Rik van Riel 2015-06-07 17:47 ` [tip:sched/core] sched/numa: Only consider less busy nodes as numa balancing destinations tip-bot for Rik van Riel 0 siblings, 1 reply; 11+ messages in thread From: Rik van Riel @ 2015-05-28 13:52 UTC (permalink / raw) To: Peter Zijlstra Cc: Srikar Dronamraju, linux-kernel, mgorman, jhladky, mingo, dedekind1 Here you are :) ---8<--- Subject: numa,sched: only consider less busy nodes as numa balancing destination Changeset a43455a1d572 ("sched/numa: Ensure task_numa_migrate() checks the preferred node") fixes an issue where workloads would never converge on a fully loaded (or overloaded) system. However, it introduces a regression on less than fully loaded systems, where workloads converge on a few NUMA nodes, instead of properly staying spread out across the whole system. This leads to a reduction in available memory bandwidth, and usable CPU cache, with predictable performance problems. The root cause appears to be an interaction between the load balancer and NUMA balancing, where the short term load represented by the load balancer differs from the long term load the NUMA balancing code would like to base its decisions on. Simply reverting a43455a1d572 would re-introduce the non-convergence of workloads on fully loaded systems, so that is not a good option. As an aside, the check done before a43455a1d572 only applied to a task's preferred node, not to other candidate nodes in the system, so the converge-on-too-few-nodes problem still happens, just to a lesser degree. Instead, try to compensate for the impedance mismatch between the load balancer and NUMA balancing by only ever considering a lesser loaded node as a destination for NUMA balancing, regardless of whether the task is trying to move to the preferred node, or to another node. This patch also addresses the issue that a system with a single runnable thread would never migrate that thread to near its memory, introduced by 095bebf61a46 ("sched/numa: Do not move past the balance point if unbalanced"). A test where the main thread creates a large memory area, and spawns a worker thread to iterate over the memory (placed on another node by select_task_rq_fair), after which the main thread goes to sleep and waits for the worker thread to loop over all the memory now sees the worker thread migrated to where the memory is, instead of having all the memory migrated over like before. Jirka has run a number of performance tests on several systems: single instance SpecJBB 2005 performance is 7-15% higher on a 4 node system, with higher gains on systems with more cores per socket. Multi-instance SpecJBB 2005 (one per node), linpack, and stream see little or no changes with the revert of 095bebf61a46 and this patch. Signed-off-by: Rik van Riel <riel@redhat.com> Reported-by: Artem Bityutski <dedekind1@gmail.com> Reported-by: Jirka Hladky <jhladky@redhat.com> Tested-by: Jirka Hladky <jhladky@redhat.com> Tested-by: Artem Bityutskiy <dedekind1@gmail.com> Acked-by: Mel Gorman <mgorman@suse.de> --- kernel/sched/fair.c | 30 ++++++++++++++++++++++++++++-- 1 file changed, 28 insertions(+), 2 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index c47bf0dffb34..424d6ec1f61d 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -1398,6 +1398,30 @@ static void task_numa_find_cpu(struct task_numa_env *env, } } +/* Only move tasks to a NUMA node less busy than the current node. */ +static bool numa_has_capacity(struct task_numa_env *env) +{ + struct numa_stats *src = &env->src_stats; + struct numa_stats *dst = &env->dst_stats; + + if (src->has_free_capacity && !dst->has_free_capacity) + return false; + + /* + * Only consider a task move if the source has a higher load + * than the destination, corrected for CPU capacity on each node. + * + * src->load dst->load + * --------------------- vs --------------------- + * src->compute_capacity dst->compute_capacity + */ + if (src->load * dst->compute_capacity > + dst->load * src->compute_capacity) + return true; + + return false; +} + static int task_numa_migrate(struct task_struct *p) { struct task_numa_env env = { @@ -1452,7 +1476,8 @@ static int task_numa_migrate(struct task_struct *p) update_numa_stats(&env.dst_stats, env.dst_nid); /* Try to find a spot on the preferred nid. */ - task_numa_find_cpu(&env, taskimp, groupimp); + if (numa_has_capacity(&env)) + task_numa_find_cpu(&env, taskimp, groupimp); /* * Look at other nodes in these cases: @@ -1483,7 +1508,8 @@ static int task_numa_migrate(struct task_struct *p) env.dist = dist; env.dst_nid = nid; update_numa_stats(&env.dst_stats, env.dst_nid); - task_numa_find_cpu(&env, taskimp, groupimp); + if (numa_has_capacity(&env)) + task_numa_find_cpu(&env, taskimp, groupimp); } } ^ permalink raw reply related [flat|nested] 11+ messages in thread
* [tip:sched/core] sched/numa: Only consider less busy nodes as numa balancing destinations 2015-05-28 13:52 ` [PATCH v2 " Rik van Riel @ 2015-06-07 17:47 ` tip-bot for Rik van Riel 0 siblings, 0 replies; 11+ messages in thread From: tip-bot for Rik van Riel @ 2015-06-07 17:47 UTC (permalink / raw) To: linux-tip-commits Cc: peterz, mingo, riel, jhladky, torvalds, tglx, srikar, dedekind1, linux-kernel, mgorman, akpm, hpa Commit-ID: 6f9aad0bc37286c0441b57f0ba8cffee50715426 Gitweb: http://git.kernel.org/tip/6f9aad0bc37286c0441b57f0ba8cffee50715426 Author: Rik van Riel <riel@redhat.com> AuthorDate: Thu, 28 May 2015 09:52:49 -0400 Committer: Ingo Molnar <mingo@kernel.org> CommitDate: Sun, 7 Jun 2015 15:57:45 +0200 sched/numa: Only consider less busy nodes as numa balancing destinations Changeset a43455a1d572 ("sched/numa: Ensure task_numa_migrate() checks the preferred node") fixes an issue where workloads would never converge on a fully loaded (or overloaded) system. However, it introduces a regression on less than fully loaded systems, where workloads converge on a few NUMA nodes, instead of properly staying spread out across the whole system. This leads to a reduction in available memory bandwidth, and usable CPU cache, with predictable performance problems. The root cause appears to be an interaction between the load balancer and NUMA balancing, where the short term load represented by the load balancer differs from the long term load the NUMA balancing code would like to base its decisions on. Simply reverting a43455a1d572 would re-introduce the non-convergence of workloads on fully loaded systems, so that is not a good option. As an aside, the check done before a43455a1d572 only applied to a task's preferred node, not to other candidate nodes in the system, so the converge-on-too-few-nodes problem still happens, just to a lesser degree. Instead, try to compensate for the impedance mismatch between the load balancer and NUMA balancing by only ever considering a lesser loaded node as a destination for NUMA balancing, regardless of whether the task is trying to move to the preferred node, or to another node. This patch also addresses the issue that a system with a single runnable thread would never migrate that thread to near its memory, introduced by 095bebf61a46 ("sched/numa: Do not move past the balance point if unbalanced"). A test where the main thread creates a large memory area, and spawns a worker thread to iterate over the memory (placed on another node by select_task_rq_fair), after which the main thread goes to sleep and waits for the worker thread to loop over all the memory now sees the worker thread migrated to where the memory is, instead of having all the memory migrated over like before. Jirka has run a number of performance tests on several systems: single instance SpecJBB 2005 performance is 7-15% higher on a 4 node system, with higher gains on systems with more cores per socket. Multi-instance SpecJBB 2005 (one per node), linpack, and stream see little or no changes with the revert of 095bebf61a46 and this patch. Reported-by: Artem Bityutski <dedekind1@gmail.com> Reported-by: Jirka Hladky <jhladky@redhat.com> Tested-by: Jirka Hladky <jhladky@redhat.com> Tested-by: Artem Bityutskiy <dedekind1@gmail.com> Signed-off-by: Rik van Riel <riel@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Mel Gorman <mgorman@suse.de> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20150528095249.3083ade0@annuminas.surriel.com Signed-off-by: Ingo Molnar <mingo@kernel.org> --- kernel/sched/fair.c | 30 ++++++++++++++++++++++++++++-- 1 file changed, 28 insertions(+), 2 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 723d69e..4b6e5f6 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -1398,6 +1398,30 @@ static void task_numa_find_cpu(struct task_numa_env *env, } } +/* Only move tasks to a NUMA node less busy than the current node. */ +static bool numa_has_capacity(struct task_numa_env *env) +{ + struct numa_stats *src = &env->src_stats; + struct numa_stats *dst = &env->dst_stats; + + if (src->has_free_capacity && !dst->has_free_capacity) + return false; + + /* + * Only consider a task move if the source has a higher load + * than the destination, corrected for CPU capacity on each node. + * + * src->load dst->load + * --------------------- vs --------------------- + * src->compute_capacity dst->compute_capacity + */ + if (src->load * dst->compute_capacity > + dst->load * src->compute_capacity) + return true; + + return false; +} + static int task_numa_migrate(struct task_struct *p) { struct task_numa_env env = { @@ -1452,7 +1476,8 @@ static int task_numa_migrate(struct task_struct *p) update_numa_stats(&env.dst_stats, env.dst_nid); /* Try to find a spot on the preferred nid. */ - task_numa_find_cpu(&env, taskimp, groupimp); + if (numa_has_capacity(&env)) + task_numa_find_cpu(&env, taskimp, groupimp); /* * Look at other nodes in these cases: @@ -1483,7 +1508,8 @@ static int task_numa_migrate(struct task_struct *p) env.dist = dist; env.dst_nid = nid; update_numa_stats(&env.dst_stats, env.dst_nid); - task_numa_find_cpu(&env, taskimp, groupimp); + if (numa_has_capacity(&env)) + task_numa_find_cpu(&env, taskimp, groupimp); } } ^ permalink raw reply related [flat|nested] 11+ messages in thread
* Re: [PATCH 0/2] numa,sched: resolve conflict between load balancing and NUMA balancing 2015-05-27 19:04 [PATCH 0/2] numa,sched: resolve conflict between load balancing and NUMA balancing riel 2015-05-27 19:04 ` [PATCH 1/2] revert 095bebf61a46 ("sched/numa: Do not move past the balance point if unbalanced") riel 2015-05-27 19:04 ` [PATCH 2/2] numa,sched: only consider less busy nodes as numa balancing destination riel @ 2015-05-29 7:44 ` Artem Bityutskiy 2 siblings, 0 replies; 11+ messages in thread From: Artem Bityutskiy @ 2015-05-29 7:44 UTC (permalink / raw) To: riel; +Cc: linux-kernel, mgorman, jhladky, peterz, mingo On Wed, 2015-05-27 at 15:04 -0400, riel@redhat.com wrote: > A previous attempt to resolve a major conflict between load balancing and > NUMA balancing, changeset 095bebf61a46 ("sched/numa: Do not move past the > balance point if unbalanced"), introduced its own problems. > > Revert that changeset, and introduce a new fix, which actually seems to > resolve the issues observed in Jirka's tests. > > A test where the main thread creates a large memory area, and spawns > a worker thread to iterate over the memory (placed on another node > by select_task_rq_fair), after which the main thread goes to sleep > and waits for the worker thread to loop over all the memory now sees > the worker thread migrated to where the memory is, instead of having > all the memory migrated over like before. > > Jirka has run a number of performance tests on several systems: > single instance SpecJBB 2005 performance is 7-15% higher on a 4 node > system, with higher gains on systems with more cores per socket. > Multi-instance SpecJBB 2005 (one per node), linpack, and stream see > little or no changes with the revert of 095bebf61a46 and this patch. [Re-sending since it didn't hit the mailing list first time, due to HTML] Tested-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com> Hi Rik, I've executed our eCommerce Web workload benchmark. Last time I did not revert 095bebf61a46, now I tested this patch-set. Let me summarize everything. Here is the average web server response time in millisecs for various kernels. 1. v4.1-rc1 - 1450 2. v4.1-rc1 + a43455a1d572daf7b730fe12eb747d1e17411365 reverted - 300 3. v4.1-rc1 + NUMA disabled - 480 4. v4.1-rc5 + this patch-set - 1260 So as you see, for our workload reverting a43455a1d572daf7b730fe12eb747d1e17411365 results in Web server being most responsive (reminder - this is about a 2-socket Haswell-EP server). Just disabling NUMA also gives a big improvement, but not as good as reverting the "offending" (from our workload's POW) patch. This patch-set does result in a noticeable improvement too. Artem. ^ permalink raw reply [flat|nested] 11+ messages in thread
end of thread, other threads:[~2015-06-07 17:48 UTC | newest]
Thread overview: 11+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2015-05-27 19:04 [PATCH 0/2] numa,sched: resolve conflict between load balancing and NUMA balancing riel
2015-05-27 19:04 ` [PATCH 1/2] revert 095bebf61a46 ("sched/numa: Do not move past the balance point if unbalanced") riel
2015-06-07 17:47 ` [tip:sched/core] Revert " tip-bot for Rik van Riel
2015-05-27 19:04 ` [PATCH 2/2] numa,sched: only consider less busy nodes as numa balancing destination riel
2015-05-28 8:29 ` Mel Gorman
2015-05-28 11:07 ` Srikar Dronamraju
2015-05-28 13:33 ` Rik van Riel
2015-05-28 13:37 ` Peter Zijlstra
2015-05-28 13:52 ` [PATCH v2 " Rik van Riel
2015-06-07 17:47 ` [tip:sched/core] sched/numa: Only consider less busy nodes as numa balancing destinations tip-bot for Rik van Riel
2015-05-29 7:44 ` [PATCH 0/2] numa,sched: resolve conflict between load balancing and NUMA balancing Artem Bityutskiy
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox