* [PATCH 0/3][RFC] Improve load balancing when tasks have large weight differential
@ 2010-09-28 0:29 Nikhil Rao
From: Nikhil Rao @ 2010-09-28 0:29 UTC
To: Ingo Molnar, Peter Zijlstra, Mike Galbraith
Cc: Venkatesh Pallipadi, linux-kernel, Nikhil Rao
Hi all,
I have attached a series of patches that improve load balancing when there is a
large weight differential between tasks. These patches are based off the
feedback Peter Zijlstra gave in an earlier post (see http://thread.gmane.org/gmane.linux.kernel/1015966).
They can be applied to v2.6.36-rc5 or -tip without conflicts.
Tested with the following setup.
- Test machine is a 16 cpu box (quad-socket, quad-core).
- Baseline is v2.6.36-rc5 kernel
We spawn 16 SCHED_IDLE soaker threads and one SCHED_NORMAL task. On the
baseline kernel, the machine has ~18% idle time. With these patches applied on
top of baseline, idle time drops to 0%.
v2.6.36-rc5
04:58:46 PM CPU %user %nice %sys %iowait %irq %soft %steal %idle intr/s
04:58:47 PM all 81.47 0.00 0.25 0.00 0.00 0.00 0.00 18.28 13796.00
04:58:48 PM all 81.20 0.00 0.25 0.00 0.00 0.00 0.00 18.55 13816.00
04:58:49 PM all 80.93 0.19 0.25 0.00 0.00 0.06 0.00 18.57 13965.00
04:58:50 PM all 81.40 0.00 0.25 0.00 0.00 0.00 0.00 18.35 13837.37
04:58:51 PM all 81.19 0.00 0.31 0.00 0.00 0.00 0.00 18.50 13592.08
04:58:52 PM all 81.25 0.00 0.25 0.00 0.00 0.00 0.00 18.50 13721.00
04:58:53 PM all 81.19 0.00 0.25 0.00 0.00 0.00 0.00 18.56 13764.00
04:58:54 PM all 81.25 0.00 0.25 0.00 0.00 0.00 0.00 18.50 13841.41
04:58:55 PM all 80.30 0.00 1.19 0.00 0.00 0.00 0.00 18.51 14989.11
04:58:56 PM all 80.77 0.00 0.50 0.00 0.00 0.00 0.00 18.73 13964.65
Average: all 81.09 0.02 0.37 0.00 0.00 0.01 0.00 18.51 13929.53
v2.6.36-rc5 + patches
05:00:06 PM CPU %user %nice %sys %iowait %irq %soft %steal %idle intr/s
05:00:07 PM all 99.94 0.00 0.06 0.00 0.00 0.00 0.00 0.00 16364.00
05:00:08 PM all 99.81 0.06 0.12 0.00 0.00 0.00 0.00 0.00 16348.00
05:00:09 PM all 99.94 0.00 0.06 0.00 0.00 0.00 0.00 0.00 16330.00
05:00:10 PM all 99.94 0.00 0.06 0.00 0.00 0.00 0.00 0.00 16317.00
05:00:11 PM all 99.88 0.06 0.06 0.00 0.00 0.00 0.00 0.00 16327.00
05:00:12 PM all 99.94 0.00 0.06 0.00 0.00 0.00 0.00 0.00 16323.00
05:00:13 PM all 99.88 0.00 0.12 0.00 0.00 0.00 0.00 0.00 16323.00
05:00:14 PM all 99.94 0.00 0.06 0.00 0.00 0.00 0.00 0.00 16321.00
05:00:15 PM all 99.63 0.06 0.25 0.00 0.00 0.06 0.00 0.00 16354.00
05:00:16 PM all 99.62 0.00 0.38 0.00 0.00 0.00 0.00 0.00 19059.60
Average: all 99.85 0.02 0.13 0.00 0.00 0.01 0.00 0.00 16604.20
Comments, feedback welcome.
-Thanks,
Nikhil
Nikhil Rao (3):
sched: set group_imb only a task can be pulled from the busiest cpu
sched: drop group_capacity to 1 only if remote group has no running
tasks
sched: do not consider SCHED_IDLE tasks to be cache hot
kernel/sched.c | 3 +++
kernel/sched_fair.c | 12 ++++++++----
2 files changed, 11 insertions(+), 4 deletions(-)
* [PATCH 1/3] sched: set group_imb only a task can be pulled from the busiest cpu
From: Nikhil Rao @ 2010-09-28 0:29 UTC
To: Ingo Molnar, Peter Zijlstra, Mike Galbraith
Cc: Venkatesh Pallipadi, linux-kernel, Nikhil Rao

When cycling through sched groups to determine the busiest group, set
group_imb only if the busiest cpu has more than 1 runnable task. This patch
fixes the case where two cpus in a group have one runnable task each, but there
is a large weight differential between these two tasks. The load balancer is
unable to migrate any task from this group, so the group should not be
considered imbalanced.

Signed-off-by: Nikhil Rao <ncrao@google.com>
---
 kernel/sched_fair.c |   10 +++++++---
 1 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index a171138..de8a6a0 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -2378,7 +2378,7 @@ static inline void update_sg_lb_stats(struct sched_domain *sd,
 			int local_group, const struct cpumask *cpus,
 			int *balance, struct sg_lb_stats *sgs)
 {
-	unsigned long load, max_cpu_load, min_cpu_load;
+	unsigned long load, max_cpu_load, min_cpu_load, max_nr_running;
 	int i;
 	unsigned int balance_cpu = -1, first_idle_cpu = 0;
 	unsigned long avg_load_per_task = 0;
@@ -2389,6 +2389,7 @@ static inline void update_sg_lb_stats(struct sched_domain *sd,
 	/* Tally up the load of all CPUs in the group */
 	max_cpu_load = 0;
 	min_cpu_load = ~0UL;
+	max_nr_running = 0;
 
 	for_each_cpu_and(i, sched_group_cpus(group), cpus) {
 		struct rq *rq = cpu_rq(i);
@@ -2406,8 +2407,10 @@ static inline void update_sg_lb_stats(struct sched_domain *sd,
 			load = target_load(i, load_idx);
 		} else {
 			load = source_load(i, load_idx);
-			if (load > max_cpu_load)
+			if (load > max_cpu_load) {
 				max_cpu_load = load;
+				max_nr_running = rq->nr_running;
+			}
 			if (min_cpu_load > load)
 				min_cpu_load = load;
 		}
@@ -2447,7 +2450,8 @@ static inline void update_sg_lb_stats(struct sched_domain *sd,
 	if (sgs->sum_nr_running)
 		avg_load_per_task = sgs->sum_weighted_load / sgs->sum_nr_running;
 
-	if ((max_cpu_load - min_cpu_load) > 2*avg_load_per_task)
+	if ((max_cpu_load - min_cpu_load) > 2*avg_load_per_task &&
+			max_nr_running > 1)
 		sgs->group_imb = 1;
 
 	sgs->group_capacity =
-- 
1.7.1
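To make the effect of the extra `max_nr_running > 1` test in [PATCH 1/3] concrete, here is a minimal user-space sketch of the group_imb check. It is not the kernel code: `struct cpu_stat`, `group_imb()` and the `require_pullable` flag are hypothetical names for illustration, and the per-cpu load is passed in directly instead of coming from source_load().

```c
#include <stddef.h>

/* Hypothetical stand-in for the per-cpu values the kernel loop reads. */
struct cpu_stat {
	unsigned long load;       /* models source_load(i, load_idx) */
	unsigned long nr_running; /* models rq->nr_running */
};

/*
 * Returns 1 if the group would be flagged as imbalanced.
 * require_pullable = 0 models the check before the patch,
 * require_pullable = 1 adds the max_nr_running > 1 condition.
 */
int group_imb(const struct cpu_stat *cpus, size_t n, int require_pullable)
{
	unsigned long max_cpu_load = 0, min_cpu_load = ~0UL;
	unsigned long max_nr_running = 0;
	unsigned long sum_weighted_load = 0, sum_nr_running = 0;
	unsigned long avg_load_per_task = 0;
	size_t i;

	if (n == 0)
		return 0;

	for (i = 0; i < n; i++) {
		if (cpus[i].load > max_cpu_load) {
			max_cpu_load = cpus[i].load;
			/* Remember how many tasks sit on the busiest cpu. */
			max_nr_running = cpus[i].nr_running;
		}
		if (min_cpu_load > cpus[i].load)
			min_cpu_load = cpus[i].load;
		sum_weighted_load += cpus[i].load;
		sum_nr_running += cpus[i].nr_running;
	}

	if (sum_nr_running)
		avg_load_per_task = sum_weighted_load / sum_nr_running;

	if ((max_cpu_load - min_cpu_load) <= 2 * avg_load_per_task)
		return 0;

	/* With the patch, a lone task on the busiest cpu is not an imbalance. */
	return require_pullable ? (max_nr_running > 1) : 1;
}
```

As an illustration (the numbers are assumptions, not taken from the thread): for a four-cpu group with loads {1024, 3, 3, 3} and one task per cpu, the unpatched check flags an imbalance even though no cpu has a second task to give away, while the patched check does not.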
* [PATCH 2/3] sched: drop group_capacity to 1 only if remote group has no running tasks
From: Nikhil Rao @ 2010-09-28 0:29 UTC
To: Ingo Molnar, Peter Zijlstra, Mike Galbraith
Cc: Venkatesh Pallipadi, linux-kernel, Nikhil Rao

When SD_PREFER_SIBLING is set on a sched domain, drop group_capacity to 1
only if the remote sched group has no running tasks. This addresses the case
where you have two tasks on one socket and the other socket is idle, in which
case you drop the capacity to 1. If the remote group has >=1 running task, then
there is no difference from a cache-sharing perspective.

Signed-off-by: Nikhil Rao <ncrao@google.com>
---
 kernel/sched_fair.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index de8a6a0..33a7985 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -2548,7 +2548,7 @@ static inline void update_sd_lb_stats(struct sched_domain *sd, int this_cpu,
 		 * first, lower the sg capacity to one so that we'll try
 		 * and move all the excess tasks away.
 		 */
-		if (prefer_sibling)
+		if (prefer_sibling && !sgs.sum_nr_running)
 			sgs.group_capacity = min(sgs.group_capacity, 1UL);
 
 		if (local_group) {
-- 
1.7.1
* Re: [PATCH 2/3] sched: drop group_capacity to 1 only if remote group has no running tasks
From: Suresh Siddha @ 2010-09-28 23:04 UTC
To: Nikhil Rao
Cc: Ingo Molnar, Peter Zijlstra, Mike Galbraith, Venkatesh Pallipadi, linux-kernel@vger.kernel.org

On Mon, 2010-09-27 at 17:29 -0700, Nikhil Rao wrote:
> When SD_PREFER_SIBLING is set on a sched domain, drop group_capacity to 1
> only if the remote sched group has no running tasks. This addresses the case
> where you have two tasks on one socket and the other socket is idle, in which
> case you drop the capacity to 1. If the remote group has >=1 running task, then
> there is no difference from a cache-sharing perspective.
>
> Signed-off-by: Nikhil Rao <ncrao@google.com>
> ---
>  kernel/sched_fair.c |    2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
>
> diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
> index de8a6a0..33a7985 100644
> --- a/kernel/sched_fair.c
> +++ b/kernel/sched_fair.c
> @@ -2548,7 +2548,7 @@ static inline void update_sd_lb_stats(struct sched_domain *sd, int this_cpu,
>  		 * first, lower the sg capacity to one so that we'll try
>  		 * and move all the excess tasks away.
>  		 */
> -		if (prefer_sibling)
> +		if (prefer_sibling && !sgs.sum_nr_running)
>  			sgs.group_capacity = min(sgs.group_capacity, 1UL);
>
>  		if (local_group) {

Nikhil, Doesn't this break the case of:

two sockets with dual-core and HT. Four tasks currently scheduled as:
three on socket-0 (two threads on core-0 running two tasks and 1 thread
on core-1 running one task). One on socket-1 (one thread on core-0
running a task, with other core-1 idle)

We would like to move the task from core-0 socket-0 to core-1 socket-1,
while we are load balancing at the socket level (it might be smp or numa
level depending on system).

thanks,
suresh
* Re: [PATCH 2/3] sched: drop group_capacity to 1 only if remote group has no running tasks
From: Nikhil Rao @ 2010-10-11 21:20 UTC
To: Suresh Siddha
Cc: Ingo Molnar, Peter Zijlstra, Mike Galbraith, Venkatesh Pallipadi, linux-kernel@vger.kernel.org

Hi Suresh,

Sorry for the delayed reply.

On Tue, Sep 28, 2010 at 4:04 PM, Suresh Siddha <suresh.b.siddha@intel.com> wrote:
> On Mon, 2010-09-27 at 17:29 -0700, Nikhil Rao wrote:
>> When SD_PREFER_SIBLING is set on a sched domain, drop group_capacity to 1
>> only if the remote sched group has no running tasks. This addresses the case
>> where you have two tasks on one socket and the other socket is idle, in which
>> case you drop the capacity to 1. If the remote group has >=1 running task, then
>> there is no difference from a cache-sharing perspective.
>>
>> Signed-off-by: Nikhil Rao <ncrao@google.com>
>> ---
>>  kernel/sched_fair.c |    2 +-
>>  1 files changed, 1 insertions(+), 1 deletions(-)
>>
>> diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
>> index de8a6a0..33a7985 100644
>> --- a/kernel/sched_fair.c
>> +++ b/kernel/sched_fair.c
>> @@ -2548,7 +2548,7 @@ static inline void update_sd_lb_stats(struct sched_domain *sd, int this_cpu,
>>  		 * first, lower the sg capacity to one so that we'll try
>>  		 * and move all the excess tasks away.
>>  		 */
>> -		if (prefer_sibling)
>> +		if (prefer_sibling && !sgs.sum_nr_running)
>>  			sgs.group_capacity = min(sgs.group_capacity, 1UL);
>>
>>  		if (local_group) {
>
> Nikhil, Doesn't this break the case of:
>
> two sockets with dual-core and HT. Four tasks currently scheduled as:
> three on socket-0 (two threads on core-0 running two tasks and 1 thread
> on core-1 running one task). One on socket-1 (one thread on core-0
> running a task, with other core-1 idle)
>
> We would like to move the task from core-0 socket-0 to core-1 socket-1,
> while we are load balancing at the socket level (it might be smp or numa
> level depending on system).
>
> thanks,
> suresh
>

Thanks for raising this issue. Yes, when you have a quad-core, dual-socket
machine, the additional check will prevent group_capacity from dropping down
to 1. In this situation, we want to decrease group_capacity if the local group
has extra capacity (i.e. this_nr_running < this_group_weight) [credit goes to
Venki for this insight]. This also works when you have a niced task, which is
what this patch was trying to fix. I have attached a modified version of the
patch below. Does this look OK?

-Thanks,
Nikhil

---
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index de8a6a0..e0f697a 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -2030,6 +2030,7 @@ struct sd_lb_stats {
 	unsigned long this_load;
 	unsigned long this_load_per_task;
 	unsigned long this_nr_running;
+	unsigned long this_group_capacity;
 
 	/* Statistics of the busiest group */
 	unsigned long max_load;
@@ -2548,7 +2549,8 @@ static inline void update_sd_lb_stats(struct sched_domain *sd, int this_cpu,
 		 * first, lower the sg capacity to one so that we'll try
 		 * and move all the excess tasks away.
 		 */
-		if (prefer_sibling)
+		if (prefer_sibling && !local_group &&
+		    sds->this_nr_running < sds->this_group_capacity)
 			sgs.group_capacity = min(sgs.group_capacity, 1UL);
 
 		if (local_group) {
@@ -2556,6 +2558,7 @@ static inline void update_sd_lb_stats(struct sched_domain *sd, int this_cpu,
 			sds->this = sg;
 			sds->this_nr_running = sgs.sum_nr_running;
 			sds->this_load_per_task = sgs.sum_weighted_load;
+			sds->this_group_capacity = sgs.group_capacity;
 		} else if (update_sd_pick_busiest(sd, sds, sg, &sgs, this_cpu)) {
 			sds->max_load = sgs.avg_load;
 			sds->busiest = sg;
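The condition in the revised hunk can be read in isolation as a small predicate. The sketch below is a user-space restatement only: `struct local_stats` and `sibling_capacity()` are hypothetical names, and it assumes the local group's statistics are already filled in when a remote group is examined, which the hunk itself does not show.

```c
/* Hypothetical stand-in for the two sd_lb_stats fields the hunk reads. */
struct local_stats {
	unsigned long this_nr_running;     /* tasks running in the local group */
	unsigned long this_group_capacity; /* capacity of the local group */
};

/*
 * Returns the capacity a sched group would be given: clamped to 1 only
 * for a remote group, and only while the local group still has room.
 */
unsigned long sibling_capacity(int prefer_sibling, int local_group,
			       const struct local_stats *sds,
			       unsigned long group_capacity)
{
	if (prefer_sibling && !local_group &&
	    sds->this_nr_running < sds->this_group_capacity)
		return group_capacity < 1UL ? group_capacity : 1UL;

	return group_capacity;
}
```

With a local group running 1 task at capacity 2, a remote group of capacity 2 is treated as capacity 1, so any second task there counts as excess. Once the local group is full (2 of 2), the remote group keeps its capacity of 2.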
* [PATCH 3/3] sched: do not consider SCHED_IDLE tasks to be cache hot
From: Nikhil Rao @ 2010-09-28 0:29 UTC
To: Ingo Molnar, Peter Zijlstra, Mike Galbraith
Cc: Venkatesh Pallipadi, linux-kernel, Nikhil Rao

This patch adds a check in task_hot to return 0 (not cache hot) if the task
has SCHED_IDLE policy. SCHED_IDLE tasks have very low weight, and when run
with regular weight tasks, are typically scheduled many milliseconds apart.
There is no benefit from considering SCHED_IDLE tasks cache hot for load
balancing.

Signed-off-by: Nikhil Rao <ncrao@google.com>
---
 kernel/sched.c |    3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index ed09d4f..874efde 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -2003,6 +2003,9 @@ task_hot(struct task_struct *p, u64 now, struct sched_domain *sd)
 	if (p->sched_class != &fair_sched_class)
 		return 0;
 
+	if (p->policy == SCHED_IDLE)
+		return 0;
+
 	/*
 	 * Buddy candidates are cache hot:
 	 */
-- 
1.7.1
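A rough user-space model of where the new early return sits in task_hot may help. This is a sketch under assumptions: `struct task_model`, `enum policy_model` and `task_hot_model()` are hypothetical, the buddy check is omitted, and the final time comparison is how I recall the rest of the function, not something shown in the hunk above.

```c
#include <stdint.h>

/* Hypothetical stand-ins; the kernel uses p->policy and SCHED_IDLE. */
enum policy_model { POLICY_NORMAL, POLICY_IDLE };

struct task_model {
	enum policy_model policy;
	int is_fair;         /* models p->sched_class == &fair_sched_class */
	uint64_t exec_start; /* last time the task started running, in ns */
};

/*
 * Returns 1 if the task is treated as cache hot, 0 otherwise.
 * migration_cost_ns is the "recently ran" threshold (caller-supplied here).
 */
int task_hot_model(const struct task_model *p, uint64_t now,
		   int64_t migration_cost_ns)
{
	if (!p->is_fair)
		return 0;

	/* The check added by the patch: SCHED_IDLE tasks are never hot. */
	if (p->policy == POLICY_IDLE)
		return 0;

	/* Otherwise a task that ran very recently is considered hot. */
	return (int64_t)(now - p->exec_start) < migration_cost_ns;
}
```

With an illustrative threshold of 500000 ns, a normal task that ran 100000 ns ago is reported hot, while a SCHED_IDLE task with the same timestamps is not, so the balancer is free to migrate it.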
* Re: [PATCH 0/3][RFC] Improve load balancing when tasks have large weight differential
From: Mike Galbraith @ 2010-09-28 13:57 UTC
To: Nikhil Rao
Cc: Ingo Molnar, Peter Zijlstra, Venkatesh Pallipadi, linux-kernel

On Mon, 2010-09-27 at 17:29 -0700, Nikhil Rao wrote:
> Hi all,
>
> I have attached a series of patches that improve load balancing when there is a
> large weight differential between tasks. These patches are based off the
> feedback Peter Zijlstra gave in an earlier post (see http://thread.gmane.org/gmane.linux.kernel/1015966).
> They can be applied to v2.6.36-rc5 or -tip without conflicts.
>
> Tested with the following setup.
> - Test machine is a 16 cpu box (quad-socket, quad-core).
> - Baseline is v2.6.36-rc5 kernel
>
> We spawn 16 SCHED_IDLE soaker threads and one SCHED_NORMAL task. On the
> baseline kernel, the machine has ~18% idle time. With these patches applied on
> top of baseline, idle time drops to 0%.

Hm. I can get it stuck with one core idle on my little quad.

top - 15:53:22 up 11 min, 17 users,  load average: 5.05, 4.40, 2.51
Tasks: 270 total,   7 running, 263 sleeping,   0 stopped,   0 zombie
Cpu(s): 75.3%us,  0.0%sy,  0.0%ni, 24.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  P COMMAND
 7455 root       5 -15  7996  340  256 R  100  0.0   0:59.93 1 pert
 7421 root      20   0  7996  340  256 R   50  0.0   4:20.01 3 pert
 7422 root      20   0  7996  340  256 R   50  0.0   3:45.81 2 pert
 7423 root      20   0  7996  340  256 R   50  0.0   4:09.45 2 pert
 7424 root      20   0  7996  344  256 R   50  0.0   4:12.75 3 pert
* Re: [PATCH 0/3][RFC] Improve load balancing when tasks have large weight differential
From: Nikhil Rao @ 2010-09-28 21:15 UTC
To: Mike Galbraith
Cc: Ingo Molnar, Peter Zijlstra, Venkatesh Pallipadi, linux-kernel

On Tue, Sep 28, 2010 at 6:57 AM, Mike Galbraith <efault@gmx.de> wrote:
> On Mon, 2010-09-27 at 17:29 -0700, Nikhil Rao wrote:
>> Hi all,
>>
>> I have attached a series of patches that improve load balancing when there is a
>> large weight differential between tasks. These patches are based off the
>> feedback Peter Zijlstra gave in an earlier post (see http://thread.gmane.org/gmane.linux.kernel/1015966).
>> They can be applied to v2.6.36-rc5 or -tip without conflicts.
>>
>> Tested with the following setup.
>> - Test machine is a 16 cpu box (quad-socket, quad-core).
>> - Baseline is v2.6.36-rc5 kernel
>>
>> We spawn 16 SCHED_IDLE soaker threads and one SCHED_NORMAL task. On the
>> baseline kernel, the machine has ~18% idle time. With these patches applied on
>> top of baseline, idle time drops to 0%.
>
> Hm. I can get it stuck with one core idle on my little quad.
>
> top - 15:53:22 up 11 min, 17 users,  load average: 5.05, 4.40, 2.51
> Tasks: 270 total,   7 running, 263 sleeping,   0 stopped,   0 zombie
> Cpu(s): 75.3%us,  0.0%sy,  0.0%ni, 24.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
>
>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  P COMMAND
>  7455 root       5 -15  7996  340  256 R  100  0.0   0:59.93 1 pert
>  7421 root      20   0  7996  340  256 R   50  0.0   4:20.01 3 pert
>  7422 root      20   0  7996  340  256 R   50  0.0   3:45.81 2 pert
>  7423 root      20   0  7996  340  256 R   50  0.0   4:09.45 2 pert
>  7424 root      20   0  7996  344  256 R   50  0.0   4:12.75 3 pert
>
>

Mike,

Thanks for running this. I've not been able to reproduce what you are
seeing on the few test machines that I have (different combinations of
MC, CPU and NODE domains). Can you please give me more info about
your setup?

-Thanks,
Nikhil
* Re: [PATCH 0/3][RFC] Improve load balancing when tasks have large weight differential
From: Mike Galbraith @ 2010-09-29 1:45 UTC
To: Nikhil Rao
Cc: Ingo Molnar, Peter Zijlstra, Venkatesh Pallipadi, linux-kernel

On Tue, 2010-09-28 at 14:15 -0700, Nikhil Rao wrote:

> Thanks for running this. I've not been able to reproduce what you are
> seeing on the few test machines that I have (different combinations of
> MC, CPU and NODE domains). Can you please give me more info about
> your setup?

It's a plain-jane Q6600 box, so has only MC and CPU domains.

It doesn't necessarily _instantly_ "stick", can take a couple tries, or
a little time.

	-Mike
* Re: [PATCH 0/3][RFC] Improve load balancing when tasks have large weight differential
From: Nikhil Rao @ 2010-09-29 19:32 UTC
To: Mike Galbraith
Cc: Ingo Molnar, Peter Zijlstra, Venkatesh Pallipadi, linux-kernel

On Tue, Sep 28, 2010 at 6:45 PM, Mike Galbraith <efault@gmx.de> wrote:
> On Tue, 2010-09-28 at 14:15 -0700, Nikhil Rao wrote:
>
>> Thanks for running this. I've not been able to reproduce what you are
>> seeing on the few test machines that I have (different combinations of
>> MC, CPU and NODE domains). Can you please give me more info about
>> your setup?
>
> It's a plain-jane Q6600 box, so has only MC and CPU domains.
>
> It doesn't necessarily _instantly_ "stick", can take a couple tries, or
> a little time.

The closest I have is a quad-core dual-socket machine (MC, CPU
domains). And I'm having trouble reproducing it on that machine as
well :-( I ran 5 soaker threads (one of them niced to -15) for a few
hours and didn't see the problem. Can you please give me some trace
data & schedstats to work with?

Looking at the patch/code, I suspect active migration on the CPU
scheduling domain pushes the nice 0 task (running on the same socket
as the nice -15 task) to the other socket. This leaves you with an
idle core on the nice -15 socket, and with soaker threads there is no
way to come back to a 100% utilized state. One possible explanation is
the group capacity for a sched group in the CPU sched domain is
rounded to 1 (instead of 2). I have a patch below that throws a hammer
at the problem and uses group weight instead of group capacity (this
is experimental, will refine it if it works). Can you please see if
that solves the problem?

diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 6d934e8..3fdd669 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -2057,6 +2057,7 @@ struct sg_lb_stats {
 	unsigned long sum_nr_running; /* Nr tasks running in the group */
 	unsigned long sum_weighted_load; /* Weighted load of group's tasks */
 	unsigned long group_capacity;
+	unsigned long group_weight;
 	int group_imb; /* Is there an imbalance in the group ? */
 };
 
@@ -2458,6 +2459,8 @@ static inline void update_sg_lb_stats(struct sched_domain *sd,
 		DIV_ROUND_CLOSEST(group->cpu_power, SCHED_LOAD_SCALE);
 	if (!sgs->group_capacity)
 		sgs->group_capacity = fix_small_capacity(sd, group);
+
+	sgs->group_weight = cpumask_weight(sched_group_cpus(group));
 }
 
 /**
@@ -2480,6 +2483,9 @@ static bool update_sd_pick_busiest(struct sched_domain *sd,
 	if (sgs->avg_load <= sds->max_load)
 		return false;
 
+	if (sgs->sum_nr_running <= sgs->group_weight)
+		return false;
+
 	if (sgs->sum_nr_running > sgs->group_capacity)
 		return true;
 
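The "rounded to 1 (instead of 2)" suspicion comes down to the arithmetic of DIV_ROUND_CLOSEST(group->cpu_power, SCHED_LOAD_SCALE). A minimal sketch of that rounding follows, assuming SCHED_LOAD_SCALE is 1024 and using made-up cpu_power values; `div_round_closest()` and `group_capacity_from_power()` are illustrative helpers, not kernel functions, and nothing in the thread establishes what cpu_power Mike's box actually reports.

```c
#define SCHED_LOAD_SCALE_MODEL 1024UL /* assumed value of SCHED_LOAD_SCALE */

/* Unsigned round-to-nearest division, in the style of DIV_ROUND_CLOSEST. */
unsigned long div_round_closest(unsigned long x, unsigned long divisor)
{
	return (x + divisor / 2) / divisor;
}

/* Capacity (in whole cpus) that a group with this total cpu_power would get. */
unsigned long group_capacity_from_power(unsigned long cpu_power)
{
	return div_round_closest(cpu_power, SCHED_LOAD_SCALE_MODEL);
}
```

Under these assumptions a two-cpu group at full power (2048) gets capacity 2, but if its power is scaled down below 1536 the result rounds to 1, at which point a group running two tasks on two cpus already looks over capacity. Group weight, the plain cpu count used in the experimental patch, does not shrink that way.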
* Re: [PATCH 0/3][RFC] Improve load balancing when tasks have large weight differential
From: Mike Galbraith @ 2010-10-04 3:08 UTC
To: Nikhil Rao
Cc: Ingo Molnar, Peter Zijlstra, Venkatesh Pallipadi, linux-kernel

Sorry for the late reply. (fired up your patchlet bright and early so
it didn't rot in my inbox any longer;)

On Wed, 2010-09-29 at 12:32 -0700, Nikhil Rao wrote:
> On Tue, Sep 28, 2010 at 6:45 PM, Mike Galbraith <efault@gmx.de> wrote:
> > On Tue, 2010-09-28 at 14:15 -0700, Nikhil Rao wrote:
> >
> >> Thanks for running this. I've not been able to reproduce what you are
> >> seeing on the few test machines that I have (different combinations of
> >> MC, CPU and NODE domains). Can you please give me more info about
> >> your setup?
> >
> > It's a plain-jane Q6600 box, so has only MC and CPU domains.
> >
> > It doesn't necessarily _instantly_ "stick", can take a couple tries, or
> > a little time.
>
> The closest I have is a quad-core dual-socket machine (MC, CPU
> domains). And I'm having trouble reproducing it on that machine as
> well :-( I ran 5 soaker threads (one of them niced to -15) for a few
> hours and didn't see the problem. Can you please give me some trace
> data & schedstats to work with?

Booting with isolcpus or offlining the excess should help.

> Looking at the patch/code, I suspect active migration on the CPU
> scheduling domain pushes the nice 0 task (running on the same socket
> as the nice -15 task) to the other socket. This leaves you with an
> idle core on the nice -15 socket, and with soaker threads there is no
> way to come back to a 100% utilized state. One possible explanation is
> the group capacity for a sched group in the CPU sched domain is
> rounded to 1 (instead of 2). I have a patch below that throws a hammer
> at the problem and uses group weight instead of group capacity (this
> is experimental, will refine it if it works). Can you please see if
> that solves the problem?

Nope, didn't help. I'll poke at it, but am squabbling elsewhere atm.

	-Mike
* Re: [PATCH 0/3][RFC] Improve load balancing when tasks have large weight differential
From: Nikhil Rao @ 2010-10-06 8:23 UTC
To: Mike Galbraith
Cc: Ingo Molnar, Peter Zijlstra, Venkatesh Pallipadi, linux-kernel

On Sun, Oct 3, 2010 at 8:08 PM, Mike Galbraith <efault@gmx.de> wrote:
> On Wed, 2010-09-29 at 12:32 -0700, Nikhil Rao wrote:
>> The closest I have is a quad-core dual-socket machine (MC, CPU
>> domains). And I'm having trouble reproducing it on that machine as
>> well :-( I ran 5 soaker threads (one of them niced to -15) for a few
>> hours and didn't see the problem. Can you please give me some trace
>> data & schedstats to work with?
>
> Booting with isolcpus or offlining the excess should help.
>

Sorry for the late reply. Booting with isolcpus did the trick, thanks.

... and now to dig into why this is happening.

-Thanks,
Nikhil

>> Looking at the patch/code, I suspect active migration on the CPU
>> scheduling domain pushes the nice 0 task (running on the same socket
>> as the nice -15 task) to the other socket. This leaves you with an
>> idle core on the nice -15 socket, and with soaker threads there is no
>> way to come back to a 100% utilized state. One possible explanation is
>> the group capacity for a sched group in the CPU sched domain is
>> rounded to 1 (instead of 2). I have a patch below that throws a hammer
>> at the problem and uses group weight instead of group capacity (this
>> is experimental, will refine it if it works). Can you please see if
>> that solves the problem?
>
> Nope, didn't help. I'll poke at it, but am squabbling elsewhere atm.
>
> -Mike
>
>
* Re: [PATCH 0/3][RFC] Improve load balancing when tasks have large weight differential
From: Mike Galbraith @ 2010-10-08 7:22 UTC
To: Nikhil Rao
Cc: Ingo Molnar, Peter Zijlstra, Venkatesh Pallipadi, linux-kernel

On Wed, 2010-10-06 at 01:23 -0700, Nikhil Rao wrote:
> On Sun, Oct 3, 2010 at 8:08 PM, Mike Galbraith <efault@gmx.de> wrote:
> > On Wed, 2010-09-29 at 12:32 -0700, Nikhil Rao wrote:
> >> The closest I have is a quad-core dual-socket machine (MC, CPU
> >> domains). And I'm having trouble reproducing it on that machine as
> >> well :-( I ran 5 soaker threads (one of them niced to -15) for a few
> >> hours and didn't see the problem. Can you please give me some trace
> >> data & schedstats to work with?
> >
> > Booting with isolcpus or offlining the excess should help.
> >
>
> Sorry for the late reply. Booting with isolcpus did the trick, thanks.
>
> ... and now to dig into why this is happening.

I was poking it (again) yesterday, and it's kind of annoying. I can't
call this behavior black/white broken. It's freeing up a cache for a
very high priority task, which is kinda nice, but SMP nice is costing
25% of my box's processor power in this case too. Hrmph.

	-Mike
* Re: [PATCH 0/3][RFC] Improve load balancing when tasks have large weight differential
From: Nikhil Rao @ 2010-10-08 20:34 UTC
To: Mike Galbraith
Cc: Ingo Molnar, Peter Zijlstra, Venkatesh Pallipadi, linux-kernel

On Fri, Oct 8, 2010 at 12:22 AM, Mike Galbraith <efault@gmx.de> wrote:
> On Wed, 2010-10-06 at 01:23 -0700, Nikhil Rao wrote:
>> On Sun, Oct 3, 2010 at 8:08 PM, Mike Galbraith <efault@gmx.de> wrote:
>> > On Wed, 2010-09-29 at 12:32 -0700, Nikhil Rao wrote:
>> >> The closest I have is a quad-core dual-socket machine (MC, CPU
>> >> domains). And I'm having trouble reproducing it on that machine as
>> >> well :-( I ran 5 soaker threads (one of them niced to -15) for a few
>> >> hours and didn't see the problem. Can you please give me some trace
>> >> data & schedstats to work with?
>> >
>> > Booting with isolcpus or offlining the excess should help.
>> >
>>
>> Sorry for the late reply. Booting with isolcpus did the trick, thanks.
>>
>> ... and now to dig into why this is happening.
>
> I was poking it (again) yesterday, and it's kind of annoying. I can't
> call this behavior black/white broken. It's freeing up a cache for a
> very high priority task, which is kinda nice, but SMP nice is costing
> 25% of my box's processor power in this case too. Hrmph.
>

I agree that freeing up the cache for the high priority task is a nice
side-effect of weight-based balancing. However, with a sufficient number
of low weight tasks on the system, or with a small nudge to affinity
masks, the niced task will end up sharing cache with low weight tasks.
In that sense, I think this is a tad bit more black than white :-) It
would be nice to make the load balancer more cache aware, but that's
for a different RFC. :-)

Further, once a sched group reaches a certain "bad state", where the
niced task is the only task in a sched group with more than 1 cpu, it
does not recover from that state easily. This leads to the sub-optimal
utilization situation that we have been chasing down. In this
situation, even though the sched group has capacity, it does not pull
tasks because sds.this_load >> sds.max_load, and f_b_g() returns NULL.

A sched group reaches this state because either (i) a niced task is
pulled into an empty sched group, or (ii) all other tasks in the sched
group are pulled away from the group. The patches in this patchset try
to prevent the latter, i.e. prevent low weight tasks from being pulled
away from the sched group. However, there are still many ways to end
up in the bad state. From empirical evidence, it seems more likely to
happen on a machine with fewer cpus. I have verified that with the
appropriate test setup, this also happens on the quad-socket, quad-core
machines (i.e. set affinity of the normal tasks to socket-0 and the
niced task to socket-1, and then reset affinities).

I have attached a patch that tackles the problem in a different way.
Instead of preventing the sched group from entering the bad state, it
shortcuts the checks in f_b_g if the group has extra capacity, where
extra capacity is defined as group_capacity > nr_running. The patch
exposes a sched feature called PREFER_UTILIZATION (disabled by
default). When this is enabled, f_b_g shortcuts the checks if the
local group has capacity. This actually works quite well. I tested
this on a quad-core dual-socket (with isolcpus) and waited for the
machine to enter the bad state. On flipping the sched feature,
utilization immediately shoots up to 100% (of non-isolated cores). I
have some data below.

This is very experimental and has not been tested beyond this case and
some basic load balance tests. If you see a better way to do this
please let me know.
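The PREFER_UTILIZATION shortcut described above can be sketched as a stand-alone decision function. This is a simplified model, not the kernel's find_busiest_group(): `struct sd_stats_model`, `has_capacity()` and `should_balance()` are hypothetical names, and only the checks named in this message are kept (other checks in the real function are left out).

```c
/* Hypothetical, trimmed-down stand-in for the sd_lb_stats fields used here. */
struct sd_stats_model {
	int have_busiest;                 /* models sds.busiest != NULL */
	unsigned long busiest_nr_running;
	unsigned long this_load;
	unsigned long max_load;
	int this_has_capacity;
	int busiest_has_capacity;
};

/* "Extra capacity" as defined in the message: group_capacity > nr_running. */
int has_capacity(unsigned long group_capacity, unsigned long sum_nr_running)
{
	return group_capacity > sum_nr_running;
}

/* Returns 1 if balancing should go ahead, 0 if the domain counts as balanced. */
int should_balance(const struct sd_stats_model *sds,
		   unsigned int imbalance_pct, int prefer_utilization)
{
	if (!sds->have_busiest || sds->busiest_nr_running == 0)
		return 0;

	/* Shortcut: a local group with room pulls from one that has none. */
	if (prefer_utilization &&
	    sds->this_has_capacity && !sds->busiest_has_capacity)
		return 1;

	if (sds->this_load >= sds->max_load)
		return 0;

	if (100 * sds->max_load <= imbalance_pct * sds->this_load)
		return 0;

	return 1;
}
```

For the bad state described above, take a local group that holds only the niced task (1 running, capacity 2) with a load far above a busiest group running 4 tasks at capacity 2; the loads and the imbalance_pct of 125 are illustrative numbers. Without the feature the weight comparison declares the domain balanced, and with it the local group is allowed to pull.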

w/ PREFER_UTILIZATION disabled

Cpu(s): 34.3% us,  0.2% sy,  0.0% ni, 65.1% id,  0.4% wa,  0.0% hi,  0.0% si
Mem:  16463308k total,   996368k used, 15466940k free,    12304k buffers
Swap:        0k total,        0k used,        0k free,   756244k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 7651 root       5 -15  5876   84    0 R   98  0.0  37:35.97 lat
 7652 root      20   0  5876   84    0 R   49  0.0  19:49.02 lat
 7654 root      20   0  5876   84    0 R   49  0.0  20:48.93 lat
 7655 root      20   0  5876   84    0 R   49  0.0  19:25.74 lat
 7653 root      20   0  5876   84    0 R   47  0.0  20:02.16 lat

w/ PREFER_UTILIZATION enabled

Cpu(s): 52.3% us,  0.0% sy,  0.0% ni, 47.6% id,  0.0% wa,  0.0% hi,  0.0% si
Mem:  16463308k total,  1002852k used, 15460456k free,    12304k buffers
Swap:        0k total,        0k used,        0k free,   756312k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 7651 root       5 -15  5876   84    0 R  100  0.0  38:12.37 lat
 7655 root      20   0  5876   84    0 R   99  0.0  19:49.99 lat
 7652 root      20   0  5876   84    0 R   80  0.0  20:09.80 lat
 7653 root      20   0  5876   84    0 R   60  0.0  20:22.13 lat
 7654 root      20   0  5876   84    0 R   58  0.0  21:07.88 lat

---
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 6d934e8..04e5553 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -2030,12 +2030,14 @@ struct sd_lb_stats {
 	unsigned long this_load;
 	unsigned long this_load_per_task;
 	unsigned long this_nr_running;
+	unsigned long this_has_capacity;

 	/* Statistics of the busiest group */
 	unsigned long max_load;
 	unsigned long busiest_load_per_task;
 	unsigned long busiest_nr_running;
 	unsigned long busiest_group_capacity;
+	unsigned long busiest_has_capacity;

 	int group_imb; /* Is there imbalance in this sd */
 #if defined(CONFIG_SCHED_MC) || defined(CONFIG_SCHED_SMT)
@@ -2058,6 +2060,7 @@ struct sg_lb_stats {
 	unsigned long sum_weighted_load; /* Weighted load of group's tasks */
 	unsigned long group_capacity;
 	int group_imb; /* Is there an imbalance in the group ? */
+	int group_has_capacity; /* Is there extra capacity in the group? */
 };

 /**
@@ -2458,6 +2461,9 @@ static inline void update_sg_lb_stats(struct sched_domain *sd,
 		DIV_ROUND_CLOSEST(group->cpu_power, SCHED_LOAD_SCALE);
 	if (!sgs->group_capacity)
 		sgs->group_capacity = fix_small_capacity(sd, group);
+
+	if (sgs->group_capacity > sgs->sum_nr_running)
+		sgs->group_has_capacity = 1;
 }

 /**
@@ -2556,12 +2562,14 @@ static inline void update_sd_lb_stats(struct sched_domain *sd, int this_cpu,
 			sds->this = sg;
 			sds->this_nr_running = sgs.sum_nr_running;
 			sds->this_load_per_task = sgs.sum_weighted_load;
+			sds->this_has_capacity = sgs.group_has_capacity;
 		} else if (update_sd_pick_busiest(sd, sds, sg, &sgs, this_cpu)) {
 			sds->max_load = sgs.avg_load;
 			sds->busiest = sg;
 			sds->busiest_nr_running = sgs.sum_nr_running;
 			sds->busiest_group_capacity = sgs.group_capacity;
 			sds->busiest_load_per_task = sgs.sum_weighted_load;
+			sds->busiest_has_capacity = sgs.group_has_capacity;
 			sds->group_imb = sgs.group_imb;
 		}

@@ -2820,6 +2828,10 @@ find_busiest_group(struct sched_domain *sd, int this_cpu,
 	if (!sds.busiest || sds.busiest_nr_running == 0)
 		goto out_balanced;

+	if (sched_feat(PREFER_UTILIZATION) &&
+	    sds.this_has_capacity && !sds.busiest_has_capacity)
+		goto force_balance;
+
 	if (sds.this_load >= sds.max_load)
 		goto out_balanced;

@@ -2831,6 +2843,7 @@ find_busiest_group(struct sched_domain *sd, int this_cpu,
 	if (100 * sds.max_load <= sd->imbalance_pct * sds.this_load)
 		goto out_balanced;

+force_balance:
 	/* Looks like there is an imbalance. Compute it */
 	calculate_imbalance(&sds, this_cpu, imbalance);
 	return sds.busiest;
diff --git a/kernel/sched_features.h b/kernel/sched_features.h
index 83c66e8..9b93862 100644
--- a/kernel/sched_features.h
+++ b/kernel/sched_features.h
@@ -61,3 +61,9 @@ SCHED_FEAT(ASYM_EFF_LOAD, 1)
  * release the lock. Decreases scheduling overhead.
  */
 SCHED_FEAT(OWNER_SPIN, 1)
+
+/*
+ * Prefer utilization over fairness when balancing tasks with large weight
+ * differential.
+ */
+SCHED_FEAT(PREFER_UTILIZATION, 0)

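The "bad state" and the effect of the shortcut can be checked with plain
arithmetic. What follows is a minimal userspace sketch in C, not kernel
code: struct group_stats, avg_load(), group_capacity(), has_capacity()
and should_pull() are hypothetical names chosen for illustration, and the
later checks in find_busiest_group() (domain average load, imbalance_pct)
are left out. It assumes a two-cpu local group that holds only the
nice -15 task (weight 29154 in the kernel's prio_to_weight[] table) and a
two-cpu remote group running four nice-0 tasks (weight 1024 each).

```c
#include <assert.h>
#include <stdbool.h>

#define SCHED_LOAD_SCALE 1024UL  /* cpu_power of one full cpu */
#define NICE_0_WEIGHT    1024UL  /* load weight of a nice-0 task */
#define NICE_M15_WEIGHT  29154UL /* prio_to_weight[] entry for nice -15 */

/* Hypothetical, trimmed-down stand-in for struct sg_lb_stats. */
struct group_stats {
	unsigned long cpu_power;     /* SCHED_LOAD_SCALE per full cpu */
	unsigned long weighted_load; /* sum of task weights in the group */
	unsigned long nr_running;    /* runnable tasks in the group */
};

/* Group load normalised by cpu power, like avg_load in the patch context. */
static inline unsigned long avg_load(const struct group_stats *g)
{
	assert(g->cpu_power > 0);
	return g->weighted_load * SCHED_LOAD_SCALE / g->cpu_power;
}

/* Rounded cpu count, mirroring DIV_ROUND_CLOSEST(cpu_power, SCHED_LOAD_SCALE). */
static inline unsigned long group_capacity(const struct group_stats *g)
{
	return (g->cpu_power + SCHED_LOAD_SCALE / 2) / SCHED_LOAD_SCALE;
}

/* "Extra capacity" as defined in the mail: group_capacity > nr_running. */
static inline bool has_capacity(const struct group_stats *g)
{
	return group_capacity(g) > g->nr_running;
}

/*
 * Simplified decision: should the local group pull from the busiest one?
 * Only the first load comparison of find_busiest_group() is modelled.
 */
static inline bool should_pull(const struct group_stats *local,
			       const struct group_stats *busiest,
			       bool prefer_utilization)
{
	if (busiest->nr_running == 0)
		return false;

	/* The shortcut added by the patch ("goto force_balance"). */
	if (prefer_utilization && has_capacity(local) && !has_capacity(busiest))
		return true;

	/* Weight-based check: a heavier local group never pulls. */
	return avg_load(local) < avg_load(busiest);
}
```

With local = { 2048, NICE_M15_WEIGHT, 1 } and busiest = { 2048,
4 * NICE_0_WEIGHT, 4 }, avg_load() gives 14577 for the local group and
2048 for the remote one. should_pull() is therefore false without the
shortcut, even though one local cpu sits idle, and true once
prefer_utilization is set.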
* Re: [PATCH 0/3][RFC] Improve load balancing when tasks have large weight differential
2010-10-08 20:34 ` Nikhil Rao
@ 2010-10-10 10:15 ` Mike Galbraith
0 siblings, 0 replies; 15+ messages in thread
From: Mike Galbraith @ 2010-10-10 10:15 UTC (permalink / raw)
To: Nikhil Rao; +Cc: Ingo Molnar, Peter Zijlstra, Venkatesh Pallipadi, linux-kernel

On Fri, 2010-10-08 at 13:34 -0700, Nikhil Rao wrote:

> I have attached a patch that tackles the problem in a different way.
> Instead of preventing the sched group from entering the bad state, it
> shortcuts the checks in fbg if the group has extra capacity, where
> extra capacity is defined as group_capacity > nr_running. The patch
> exposes a sched feature called PREFER_UTILIZATION (disabled by
> default). When this is enabled, f_b_g shortcuts the checks if the
> local group has capacity. This actually works quite well.

Yeah, it does seem to work well. I don't like the sched feature much
though, a domain flag seems more appropriate.

I bent your patch up a bit to correct utilization woes during NEWIDLE
balancing instead.. still seems to work fine.

---
 kernel/sched_fair.c |   30 +++++++++++++++++++++++++++---
 1 file changed, 27 insertions(+), 3 deletions(-)

Index: linux-2.6.36.git/kernel/sched_fair.c
===================================================================
--- linux-2.6.36.git.orig/kernel/sched_fair.c
+++ linux-2.6.36.git/kernel/sched_fair.c
@@ -1764,6 +1764,10 @@ static void pull_task(struct rq *src_rq,
 	set_task_cpu(p, this_cpu);
 	activate_task(this_rq, p, 0);
 	check_preempt_curr(this_rq, p, 0);
+
+	/* re-arm NEWIDLE balancing when moving tasks */
+	src_rq->avg_idle = this_rq->avg_idle = 2*sysctl_sched_migration_cost;
+	this_rq->idle_stamp = 0;
 }

 /*
@@ -2030,12 +2034,14 @@ struct sd_lb_stats {
 	unsigned long this_load;
 	unsigned long this_load_per_task;
 	unsigned long this_nr_running;
+	unsigned long this_has_capacity;

 	/* Statistics of the busiest group */
 	unsigned long max_load;
 	unsigned long busiest_load_per_task;
 	unsigned long busiest_nr_running;
 	unsigned long busiest_group_capacity;
+	unsigned long busiest_has_capacity;

 	int group_imb; /* Is there imbalance in this sd */
 #if defined(CONFIG_SCHED_MC) || defined(CONFIG_SCHED_SMT)
@@ -2058,6 +2064,7 @@ struct sg_lb_stats {
 	unsigned long sum_weighted_load; /* Weighted load of group's tasks */
 	unsigned long group_capacity;
 	int group_imb; /* Is there an imbalance in the group ? */
+	int group_has_capacity; /* Is there extra capacity in the group? */
 };

 /**
@@ -2454,6 +2461,9 @@ static inline void update_sg_lb_stats(st
 		DIV_ROUND_CLOSEST(group->cpu_power, SCHED_LOAD_SCALE);
 	if (!sgs->group_capacity)
 		sgs->group_capacity = fix_small_capacity(sd, group);
+
+	if (sgs->group_capacity > sgs->sum_nr_running)
+		sgs->group_has_capacity = 1;
 }

 /**
@@ -2552,12 +2562,14 @@ static inline void update_sd_lb_stats(st
 			sds->this = sg;
 			sds->this_nr_running = sgs.sum_nr_running;
 			sds->this_load_per_task = sgs.sum_weighted_load;
+			sds->this_has_capacity = sgs.group_has_capacity;
 		} else if (update_sd_pick_busiest(sd, sds, sg, &sgs, this_cpu)) {
 			sds->max_load = sgs.avg_load;
 			sds->busiest = sg;
 			sds->busiest_nr_running = sgs.sum_nr_running;
 			sds->busiest_group_capacity = sgs.group_capacity;
 			sds->busiest_load_per_task = sgs.sum_weighted_load;
+			sds->busiest_has_capacity = sgs.group_has_capacity;
 			sds->group_imb = sgs.group_imb;
 		}

@@ -2754,6 +2766,15 @@ static inline void calculate_imbalance(s
 		return fix_small_imbalance(sds, this_cpu, imbalance);

 }
+
+bool check_utilization(struct sd_lb_stats *sds)
+{
+	if (!sds->this_has_capacity || sds->busiest_has_capacity)
+		return false;
+
+	return true;
+}
+
 /******* find_busiest_group() helpers end here *********************/

 /**
@@ -2816,6 +2837,10 @@ find_busiest_group(struct sched_domain *
 	if (!sds.busiest || sds.busiest_nr_running == 0)
 		goto out_balanced;

+	/* SD_BALANCE_NEWIDLE trumps SMP nice when underutilized */
+	if (idle == CPU_NEWLY_IDLE && check_utilization(&sds))
+		goto force_balance;
+
 	if (sds.this_load >= sds.max_load)
 		goto out_balanced;

@@ -2827,6 +2852,7 @@ find_busiest_group(struct sched_domain *
 	if (100 * sds.max_load <= sd->imbalance_pct * sds.this_load)
 		goto out_balanced;

+force_balance:
 	/* Looks like there is an imbalance. Compute it */
 	calculate_imbalance(&sds, this_cpu, imbalance);
 	return sds.busiest;
@@ -3153,10 +3179,8 @@ static void idle_balance(int this_cpu, s
 		interval = msecs_to_jiffies(sd->balance_interval);
 		if (time_after(next_balance, sd->last_balance + interval))
 			next_balance = sd->last_balance + interval;
-		if (pulled_task) {
-			this_rq->idle_stamp = 0;
+		if (pulled_task)
 			break;
-		}
 	}

 	raw_spin_lock(&this_rq->lock);

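The practical difference from the PREFER_UTILIZATION version is the
gating: the capacity shortcut is taken only when the balancing cpu has
just gone idle, while periodic balancing keeps the weight-based check.
The following is a minimal userspace sketch of that gating in C, not
kernel code. struct sd_stats and wants_balance() are hypothetical names,
the enum lists only the three idle types relevant here, and only the
first load comparison of find_busiest_group() is modelled.

```c
#include <stdbool.h>

/* The three balancing contexts, named as in the kernel's enum cpu_idle_type. */
enum cpu_idle_type { CPU_IDLE, CPU_NOT_IDLE, CPU_NEWLY_IDLE };

/* Hypothetical, trimmed-down stand-in for struct sd_lb_stats. */
struct sd_stats {
	unsigned long this_load;   /* normalised load of the local group */
	unsigned long max_load;    /* normalised load of the busiest group */
	bool this_has_capacity;    /* local group has an unused cpu */
	bool busiest_has_capacity; /* busiest group has an unused cpu */
};

/* Same condition as check_utilization() in the patch above. */
static inline bool check_utilization(const struct sd_stats *sds)
{
	return sds->this_has_capacity && !sds->busiest_has_capacity;
}

/* Simplified: does this balancing pass decide there is work to pull? */
static inline bool wants_balance(const struct sd_stats *sds,
				 enum cpu_idle_type idle)
{
	/* Only a newly idle cpu overrides the weight-based comparison. */
	if (idle == CPU_NEWLY_IDLE && check_utilization(sds))
		return true;

	return sds->this_load < sds->max_load;
}
```

For a domain with this_load = 14577, max_load = 2048, spare capacity
locally and none in the busiest group, wants_balance() is true for
CPU_NEWLY_IDLE and false for CPU_IDLE and CPU_NOT_IDLE. The load figures
are illustrative values, not measurements from this thread.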