* [PATCH 0/4][RFC v3] Improve load balancing when tasks have large weight differential -v3
From: Nikhil Rao @ 2010-10-15 20:12 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra, Mike Galbraith, Suresh Siddha,
Venkatesh Pallipadi
Cc: linux-kernel, Satoru Takeuchi, Nikhil Rao
Hi all,
Please find attached a series of patches that improve load balancing
when there is a large weight differential between tasks (such as when nicing a
task or when using SCHED_IDLE). These patches are based on feedback from
Peter Zijlstra and Mike Galbraith on earlier posts.
Previous versions:
-v0: http://thread.gmane.org/gmane.linux.kernel/1015966
Large weight differential leads to inefficient load balancing
-v1: http://thread.gmane.org/gmane.linux.kernel/1041721
Improve load balancing when tasks have large weight differential
-v2: http://thread.gmane.org/gmane.linux.kernel/1048073
Improve load balancing when tasks have large weight differential -v2
Changes from -v2:
- Swap patches 3 and 4, which allows us to reuse sds->this_has_capacity to
check if the local group has extra capacity.
- Drop this_group_capacity from sd_lb_stats
- Update comments and changelog descriptions to describe the patches better.
- Add an unlikely() hint to the SCHED_IDLE policy check in task_hot() based on
feedback from Satoru Takeuchi.
These patches can be applied to v2.6.36-rc7 or -tip without conflicts. Below
are some tests that highlight the improvements with this patchset.
1. 16 SCHED_IDLE soakers, 1 SCHED_NORMAL task on a 16-cpu machine.
Tested on a quad-core, quad-socket machine. Steps to reproduce:
- spawn 16 SCHED_IDLE tasks
- spawn one nice 0 task
- system utilization immediately drops to 80% on v2.6.36-rc7
v2.6.36-rc7
10:38:46 AM CPU %user %nice %sys %iowait %irq %soft %steal %idle intr/s
10:38:47 AM all 80.69 0.00 0.50 0.00 0.00 0.00 0.00 18.82 14008.00
10:38:48 AM all 85.09 0.06 0.50 0.00 0.00 0.00 0.00 14.35 14690.00
10:38:49 AM all 86.83 0.06 0.44 0.00 0.00 0.00 0.00 12.67 14314.85
10:38:50 AM all 79.89 0.00 0.37 0.00 0.00 0.00 0.00 19.74 14035.35
10:38:51 AM all 87.94 0.06 0.44 0.00 0.00 0.00 0.00 11.56 14991.00
10:38:52 AM all 83.27 0.06 0.37 0.00 0.00 0.00 0.00 16.29 14319.00
10:38:53 AM all 94.37 0.13 0.50 0.00 0.00 0.00 0.00 5.00 15930.00
10:38:54 AM all 87.06 0.06 0.62 0.00 0.00 0.06 0.00 12.19 14946.00
10:38:55 AM all 88.68 0.06 0.38 0.00 0.00 0.00 0.00 10.88 14767.00
10:38:56 AM all 80.16 0.00 1.06 0.00 0.00 0.00 0.00 18.78 13892.08
Average: all 85.38 0.05 0.52 0.00 0.00 0.01 0.00 14.05 14588.91
v2.6.36-rc7 + patchset:
12:58:29 PM CPU %user %nice %sys %iowait %irq %soft %steal %idle intr/s
12:58:30 PM all 99.81 0.00 0.19 0.00 0.00 0.00 0.00 0.00 16384.00
12:58:31 PM all 99.75 0.00 0.25 0.00 0.00 0.00 0.00 0.00 16428.00
12:58:32 PM all 99.81 0.00 0.19 0.00 0.00 0.00 0.00 0.00 16345.00
12:58:33 PM all 99.75 0.00 0.25 0.00 0.00 0.00 0.00 0.00 16383.00
12:58:34 PM all 99.75 0.00 0.19 0.00 0.00 0.06 0.00 0.00 16333.00
12:58:35 PM all 99.81 0.00 0.19 0.00 0.00 0.00 0.00 0.00 16359.00
12:58:36 PM all 99.75 0.00 0.25 0.00 0.00 0.00 0.00 0.00 16523.23
12:58:37 PM all 99.75 0.00 0.25 0.00 0.00 0.00 0.00 0.00 16352.00
12:58:38 PM all 98.75 0.00 1.25 0.00 0.00 0.00 0.00 0.00 17128.00
12:58:39 PM all 99.31 0.06 0.62 0.00 0.00 0.00 0.00 0.00 16757.00
Average: all 99.63 0.01 0.36 0.00 0.00 0.01 0.00 0.00 16499.20
2. Sub-optimal utilization in the presence of a niced task.
Tested on a dual-socket/quad-core machine with two cores on each socket disabled.
Steps to reproduce:
- spawn 4 nice 0 soakers and one nice -15 soaker
- force all tasks onto one cpu by setting affinities
- reset affinity masks
v2.6.36-rc7:
Cpu(s): 34.3% us, 0.2% sy, 0.0% ni, 65.1% id, 0.4% wa, 0.0% hi, 0.0% si
Mem: 16463308k total, 996368k used, 15466940k free, 12304k buffers
Swap: 0k total, 0k used, 0k free, 756244k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
7651 root 5 -15 5876 84 0 R 98 0.0 37:35.97 soaker
7652 root 20 0 5876 84 0 R 49 0.0 19:49.02 soaker
7654 root 20 0 5876 84 0 R 49 0.0 20:48.93 soaker
7655 root 20 0 5876 84 0 R 49 0.0 19:25.74 soaker
7653 root 20 0 5876 84 0 R 47 0.0 20:02.16 soaker
v2.6.36-rc7 + patchset:
Cpu(s): 52.5% us, 0.3% sy, 0.0% ni, 47.1% id, 0.0% wa, 0.0% hi, 0.0% si
Mem: 16463912k total, 1012832k used, 15451080k free, 9988k buffers
Swap: 0k total, 0k used, 0k free, 762896k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
2749 root 5 -15 5876 88 0 R 100 0.0 9:05.58 soaker
2750 root 20 0 5876 88 0 R 99 0.0 6:19.94 soaker
2751 root 20 0 5876 88 0 R 70 0.0 7:51.42 soaker
2753 root 20 0 5876 88 0 R 67 0.0 6:09.91 soaker
2752 root 20 0 5876 88 0 R 55 0.0 6:33.41 soaker
Comments, feedback welcome.
-Thanks,
Nikhil
Nikhil Rao (4):
sched: do not consider SCHED_IDLE tasks to be cache hot
sched: set group_imb only if a task can be pulled from the busiest cpu
sched: force balancing on newidle balance if local group has capacity
sched: drop group_capacity to 1 only if local group has extra
capacity
kernel/sched.c | 3 +++
kernel/sched_fair.c | 47 +++++++++++++++++++++++++++++++++++++++--------
2 files changed, 42 insertions(+), 8 deletions(-)
* [PATCH 1/4] sched: do not consider SCHED_IDLE tasks to be cache hot
From: Nikhil Rao @ 2010-10-15 20:12 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra, Mike Galbraith, Suresh Siddha,
Venkatesh Pallipadi
Cc: linux-kernel, Satoru Takeuchi, Nikhil Rao
This patch adds a check in task_hot to return if the task has SCHED_IDLE
policy. SCHED_IDLE tasks have very low weight, and when run with regular
workloads, are typically scheduled many milliseconds apart. There is no
need to consider these tasks hot for load balancing.
Signed-off-by: Nikhil Rao <ncrao@google.com>
---
kernel/sched.c | 3 +++
1 files changed, 3 insertions(+), 0 deletions(-)
diff --git a/kernel/sched.c b/kernel/sched.c
index dc85ceb..9e01848 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -2003,6 +2003,9 @@ task_hot(struct task_struct *p, u64 now, struct sched_domain *sd)
if (p->sched_class != &fair_sched_class)
return 0;
+ if (unlikely(p->policy == SCHED_IDLE))
+ return 0;
+
/*
* Buddy candidates are cache hot:
*/
--
1.7.1
* [PATCH 2/4] sched: set group_imb only if a task can be pulled from the busiest cpu
From: Nikhil Rao @ 2010-10-15 20:12 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra, Mike Galbraith, Suresh Siddha,
Venkatesh Pallipadi
Cc: linux-kernel, Satoru Takeuchi, Nikhil Rao
When cycling through sched groups to determine the busiest group, set
group_imb only if the busiest cpu has more than one runnable task. This patch
fixes the case where two cpus in a group have one runnable task each, but there
is a large weight differential between these two tasks. Since the load
balancer cannot migrate any task from such a group, it should not consider
this group to be imbalanced.
Signed-off-by: Nikhil Rao <ncrao@google.com>
---
kernel/sched_fair.c | 10 +++++++---
1 files changed, 7 insertions(+), 3 deletions(-)
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index db3f674..0dd1021 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -2378,7 +2378,7 @@ static inline void update_sg_lb_stats(struct sched_domain *sd,
int local_group, const struct cpumask *cpus,
int *balance, struct sg_lb_stats *sgs)
{
- unsigned long load, max_cpu_load, min_cpu_load;
+ unsigned long load, max_cpu_load, min_cpu_load, max_nr_running;
int i;
unsigned int balance_cpu = -1, first_idle_cpu = 0;
unsigned long avg_load_per_task = 0;
@@ -2389,6 +2389,7 @@ static inline void update_sg_lb_stats(struct sched_domain *sd,
/* Tally up the load of all CPUs in the group */
max_cpu_load = 0;
min_cpu_load = ~0UL;
+ max_nr_running = 0;
for_each_cpu_and(i, sched_group_cpus(group), cpus) {
struct rq *rq = cpu_rq(i);
@@ -2406,8 +2407,10 @@ static inline void update_sg_lb_stats(struct sched_domain *sd,
load = target_load(i, load_idx);
} else {
load = source_load(i, load_idx);
- if (load > max_cpu_load)
+ if (load > max_cpu_load) {
max_cpu_load = load;
+ max_nr_running = rq->nr_running;
+ }
if (min_cpu_load > load)
min_cpu_load = load;
}
@@ -2447,7 +2450,8 @@ static inline void update_sg_lb_stats(struct sched_domain *sd,
if (sgs->sum_nr_running)
avg_load_per_task = sgs->sum_weighted_load / sgs->sum_nr_running;
- if ((max_cpu_load - min_cpu_load) > 2*avg_load_per_task)
+ if ((max_cpu_load - min_cpu_load) > 2*avg_load_per_task &&
+ max_nr_running > 1)
sgs->group_imb = 1;
sgs->group_capacity =
--
1.7.1
* [PATCH 3/4] sched: force balancing on newidle balance if local group has capacity
From: Nikhil Rao @ 2010-10-15 20:12 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra, Mike Galbraith, Suresh Siddha,
Venkatesh Pallipadi
Cc: linux-kernel, Satoru Takeuchi, Nikhil Rao
This patch forces a load balance on a newly idle cpu when the local group has
extra capacity and the busiest group does not have any. It improves system
utilization when balancing tasks with a large weight differential.
Under certain situations, such as a niced down task (i.e. nice = -15) in the
presence of nr_cpus NICE0 tasks, the niced task lands on a sched group and
kicks away other tasks because of its large weight. This leads to sub-optimal
utilization of the machine. Even though the sched group has capacity, it does
not pull tasks because sds.this_load >> sds.max_load, and f_b_g() returns NULL.
With this patch, if the local group has extra capacity, we shortcut the checks
in f_b_g() and try to pull a task over. A sched group has extra capacity if the
group capacity is greater than the number of running tasks in that group.
Thanks to Mike Galbraith for discussions leading to this patch and for the
insight to reuse SD_NEWIDLE_BALANCE.
Signed-off-by: Nikhil Rao <ncrao@google.com>
---
kernel/sched_fair.c | 28 +++++++++++++++++++++++++---
1 files changed, 25 insertions(+), 3 deletions(-)
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 0dd1021..7d8ed95 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -1764,6 +1764,10 @@ static void pull_task(struct rq *src_rq, struct task_struct *p,
set_task_cpu(p, this_cpu);
activate_task(this_rq, p, 0);
check_preempt_curr(this_rq, p, 0);
+
+ /* re-arm NEWIDLE balancing when moving tasks */
+ src_rq->avg_idle = this_rq->avg_idle = 2*sysctl_sched_migration_cost;
+ this_rq->idle_stamp = 0;
}
/*
@@ -2030,12 +2034,14 @@ struct sd_lb_stats {
unsigned long this_load;
unsigned long this_load_per_task;
unsigned long this_nr_running;
+ unsigned long this_has_capacity;
/* Statistics of the busiest group */
unsigned long max_load;
unsigned long busiest_load_per_task;
unsigned long busiest_nr_running;
unsigned long busiest_group_capacity;
+ unsigned long busiest_has_capacity;
int group_imb; /* Is there imbalance in this sd */
#if defined(CONFIG_SCHED_MC) || defined(CONFIG_SCHED_SMT)
@@ -2058,6 +2064,7 @@ struct sg_lb_stats {
unsigned long sum_weighted_load; /* Weighted load of group's tasks */
unsigned long group_capacity;
int group_imb; /* Is there an imbalance in the group ? */
+ int group_has_capacity; /* Is there extra capacity in the group? */
};
/**
@@ -2458,6 +2465,9 @@ static inline void update_sg_lb_stats(struct sched_domain *sd,
DIV_ROUND_CLOSEST(group->cpu_power, SCHED_LOAD_SCALE);
if (!sgs->group_capacity)
sgs->group_capacity = fix_small_capacity(sd, group);
+
+ if (sgs->group_capacity > sgs->sum_nr_running)
+ sgs->group_has_capacity = 1;
}
/**
@@ -2556,12 +2566,14 @@ static inline void update_sd_lb_stats(struct sched_domain *sd, int this_cpu,
sds->this = sg;
sds->this_nr_running = sgs.sum_nr_running;
sds->this_load_per_task = sgs.sum_weighted_load;
+ sds->this_has_capacity = sgs.group_has_capacity;
} else if (update_sd_pick_busiest(sd, sds, sg, &sgs, this_cpu)) {
sds->max_load = sgs.avg_load;
sds->busiest = sg;
sds->busiest_nr_running = sgs.sum_nr_running;
sds->busiest_group_capacity = sgs.group_capacity;
sds->busiest_load_per_task = sgs.sum_weighted_load;
+ sds->busiest_has_capacity = sgs.group_has_capacity;
sds->group_imb = sgs.group_imb;
}
@@ -2758,6 +2770,7 @@ static inline void calculate_imbalance(struct sd_lb_stats *sds, int this_cpu,
return fix_small_imbalance(sds, this_cpu, imbalance);
}
+
/******* find_busiest_group() helpers end here *********************/
/**
@@ -2809,6 +2822,11 @@ find_busiest_group(struct sched_domain *sd, int this_cpu,
* 4) This group is more busy than the avg busieness at this
* sched_domain.
* 5) The imbalance is within the specified limit.
+ *
+ * Note: when doing newidle balance, if the local group has excess
+ * capacity (i.e. nr_running < group_capacity) and the busiest group
+ * does not have any capacity, we force a load balance to pull tasks
+ * to the local group. In this case, we skip past checks 3, 4 and 5.
*/
if (!(*balance))
goto ret;
@@ -2820,6 +2838,11 @@ find_busiest_group(struct sched_domain *sd, int this_cpu,
if (!sds.busiest || sds.busiest_nr_running == 0)
goto out_balanced;
+ /* SD_BALANCE_NEWIDLE trumps SMP nice when underutilized */
+ if (idle == CPU_NEWLY_IDLE && sds.this_has_capacity &&
+ !sds.busiest_has_capacity)
+ goto force_balance;
+
if (sds.this_load >= sds.max_load)
goto out_balanced;
@@ -2831,6 +2854,7 @@ find_busiest_group(struct sched_domain *sd, int this_cpu,
if (100 * sds.max_load <= sd->imbalance_pct * sds.this_load)
goto out_balanced;
+force_balance:
/* Looks like there is an imbalance. Compute it */
calculate_imbalance(&sds, this_cpu, imbalance);
return sds.busiest;
@@ -3157,10 +3181,8 @@ static void idle_balance(int this_cpu, struct rq *this_rq)
interval = msecs_to_jiffies(sd->balance_interval);
if (time_after(next_balance, sd->last_balance + interval))
next_balance = sd->last_balance + interval;
- if (pulled_task) {
- this_rq->idle_stamp = 0;
+ if (pulled_task)
break;
- }
}
raw_spin_lock(&this_rq->lock);
--
1.7.1
* [PATCH 4/4] sched: drop group_capacity to 1 only if local group has extra capacity
From: Nikhil Rao @ 2010-10-15 20:12 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra, Mike Galbraith, Suresh Siddha,
Venkatesh Pallipadi
Cc: linux-kernel, Satoru Takeuchi, Nikhil Rao
When SD_PREFER_SIBLING is set on a sched domain, drop group_capacity to 1
only if the local group has extra capacity. The extra check prevents the case
where we always pull from the heaviest group when it is already under-utilized
(possible when a large weight task outweighs the other tasks on the system).
For example, consider a 16-cpu quad-core quad-socket machine with MC and NUMA
scheduling domains. Let's say we spawn 15 nice0 tasks and one nice-15 task,
and each task is running on one core. In this case, we observe the following
events when balancing at the NUMA domain:
- find_busiest_group() will always pick the sched group containing the niced
task to be the busiest group.
- find_busiest_queue() will then always pick one of the cpus running the
nice0 task (never picks the cpu with the nice -15 task since
weighted_cpuload > imbalance).
- The load balancer fails to migrate the task since it is the running task
and increments sd->nr_balance_failed.
- It repeats the above steps a few more times until sd->nr_balance_failed > 5,
at which point it kicks off the active load balancer, wakes up the migration
thread and kicks the nice 0 task off the cpu.
The load balancer doesn't stop until we kick out all nice 0 tasks from
the sched group, leaving you with 3 idle cpus and one cpu running the
nice -15 task.
When balancing at the NUMA domain, we drop sgs.group_capacity to 1 if the child
domain (in this case MC) has SD_PREFER_SIBLING set. Subsequent load checks are
not relevant because the niced task has a very large weight.
In this patch, we add an extra condition to the "if (prefer_sibling)" check in
update_sd_lb_stats(). We drop the capacity of a group only if the local group
has extra capacity, i.e. nr_running < group_capacity. This patch preserves the
original intent of the prefer_sibling check (to spread tasks across the system
in low utilization scenarios) and fixes the case above.
It helps in the following ways:
- In low utilization cases (where nr_tasks << nr_cpus), we still drop
group_capacity down to 1 if we prefer siblings.
- On very busy systems (where nr_tasks >> nr_cpus), sgs.nr_running will most
likely be > sgs.group_capacity.
- When balancing large weight tasks, if the local group does not have extra
capacity, we do not pick the group with the niced task as the busiest group.
This prevents failed balances, active migration and the under-utilization
described above.
Signed-off-by: Nikhil Rao <ncrao@google.com>
---
kernel/sched_fair.c | 9 +++++++--
1 files changed, 7 insertions(+), 2 deletions(-)
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 7d8ed95..098283a 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -2556,9 +2556,14 @@ static inline void update_sd_lb_stats(struct sched_domain *sd, int this_cpu,
/*
* In case the child domain prefers tasks go to siblings
* first, lower the sg capacity to one so that we'll try
- * and move all the excess tasks away.
+ * and move all the excess tasks away. We lower the capacity
+ * of a group only if the local group has the capacity to fit
+ * these excess tasks, i.e. nr_running < group_capacity. The
+ * extra check prevents the case where you always pull from the
+ * heaviest group when it is already under-utilized (possible
+ * with a large weight task outweighs the tasks on the system).
*/
- if (prefer_sibling)
+ if (prefer_sibling && !local_group && sds->this_has_capacity)
sgs.group_capacity = min(sgs.group_capacity, 1UL);
if (local_group) {
--
1.7.1
* Re: [PATCH 0/4][RFC v3] Improve load balancing when tasks have large weight differential -v3
From: Peter Zijlstra @ 2010-10-15 20:44 UTC (permalink / raw)
To: Nikhil Rao
Cc: Ingo Molnar, Mike Galbraith, Suresh Siddha, Venkatesh Pallipadi,
linux-kernel, Satoru Takeuchi
On Fri, 2010-10-15 at 13:12 -0700, Nikhil Rao wrote:
>
> Please find attached a a series of patches that improve load balancing
> when there is a large weight differential between tasks (such as when nicing a
> task or when using SCHED_IDLE). These patches are based off feedback given by
> Peter Zijlstra and Mike Galbraith in earlier posts.
Thanks, I've queued them up, we'll see what happens :-)
* [tip:sched/core] sched: Do not consider SCHED_IDLE tasks to be cache hot
From: tip-bot for Nikhil Rao @ 2010-10-18 19:23 UTC (permalink / raw)
To: linux-tip-commits
Cc: linux-kernel, hpa, mingo, a.p.zijlstra, ncrao, tglx, mingo
Commit-ID: ef8002f6848236de5adc613063ebeabddea8a6fb
Gitweb: http://git.kernel.org/tip/ef8002f6848236de5adc613063ebeabddea8a6fb
Author: Nikhil Rao <ncrao@google.com>
AuthorDate: Wed, 13 Oct 2010 12:09:35 -0700
Committer: Ingo Molnar <mingo@elte.hu>
CommitDate: Mon, 18 Oct 2010 20:52:15 +0200
sched: Do not consider SCHED_IDLE tasks to be cache hot
This patch adds a check in task_hot to return if the task has SCHED_IDLE
policy. SCHED_IDLE tasks have very low weight, and when run with regular
workloads, are typically scheduled many milliseconds apart. There is no
need to consider these tasks hot for load balancing.
Signed-off-by: Nikhil Rao <ncrao@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1287173550-30365-2-git-send-email-ncrao@google.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
kernel/sched.c | 3 +++
1 files changed, 3 insertions(+), 0 deletions(-)
diff --git a/kernel/sched.c b/kernel/sched.c
index 728081a..771b518 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -2025,6 +2025,9 @@ task_hot(struct task_struct *p, u64 now, struct sched_domain *sd)
if (p->sched_class != &fair_sched_class)
return 0;
+ if (unlikely(p->policy == SCHED_IDLE))
+ return 0;
+
/*
* Buddy candidates are cache hot:
*/
* [tip:sched/core] sched: Force balancing on newidle balance if local group has capacity
From: tip-bot for Nikhil Rao @ 2010-10-18 19:23 UTC (permalink / raw)
To: linux-tip-commits
Cc: linux-kernel, hpa, mingo, a.p.zijlstra, ncrao, tglx, mingo
Commit-ID: fab476228ba37907ad75216d0fd9732ada9c119e
Gitweb: http://git.kernel.org/tip/fab476228ba37907ad75216d0fd9732ada9c119e
Author: Nikhil Rao <ncrao@google.com>
AuthorDate: Fri, 15 Oct 2010 13:12:29 -0700
Committer: Ingo Molnar <mingo@elte.hu>
CommitDate: Mon, 18 Oct 2010 20:52:18 +0200
sched: Force balancing on newidle balance if local group has capacity
This patch forces a load balance on a newly idle cpu when the local group has
extra capacity and the busiest group does not have any. It improves system
utilization when balancing tasks with a large weight differential.
Under certain situations, such as a niced down task (i.e. nice = -15) in the
presence of nr_cpus NICE0 tasks, the niced task lands on a sched group and
kicks away other tasks because of its large weight. This leads to sub-optimal
utilization of the machine. Even though the sched group has capacity, it does
not pull tasks because sds.this_load >> sds.max_load, and f_b_g() returns NULL.
With this patch, if the local group has extra capacity, we shortcut the checks
in f_b_g() and try to pull a task over. A sched group has extra capacity if the
group capacity is greater than the number of running tasks in that group.
Thanks to Mike Galbraith for discussions leading to this patch and for the
insight to reuse SD_NEWIDLE_BALANCE.
Signed-off-by: Nikhil Rao <ncrao@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1287173550-30365-4-git-send-email-ncrao@google.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
kernel/sched_fair.c | 28 +++++++++++++++++++++++++---
1 files changed, 25 insertions(+), 3 deletions(-)
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 3656480..032b548 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -1764,6 +1764,10 @@ static void pull_task(struct rq *src_rq, struct task_struct *p,
set_task_cpu(p, this_cpu);
activate_task(this_rq, p, 0);
check_preempt_curr(this_rq, p, 0);
+
+ /* re-arm NEWIDLE balancing when moving tasks */
+ src_rq->avg_idle = this_rq->avg_idle = 2*sysctl_sched_migration_cost;
+ this_rq->idle_stamp = 0;
}
/*
@@ -2030,12 +2034,14 @@ struct sd_lb_stats {
unsigned long this_load;
unsigned long this_load_per_task;
unsigned long this_nr_running;
+ unsigned long this_has_capacity;
/* Statistics of the busiest group */
unsigned long max_load;
unsigned long busiest_load_per_task;
unsigned long busiest_nr_running;
unsigned long busiest_group_capacity;
+ unsigned long busiest_has_capacity;
int group_imb; /* Is there imbalance in this sd */
#if defined(CONFIG_SCHED_MC) || defined(CONFIG_SCHED_SMT)
@@ -2058,6 +2064,7 @@ struct sg_lb_stats {
unsigned long sum_weighted_load; /* Weighted load of group's tasks */
unsigned long group_capacity;
int group_imb; /* Is there an imbalance in the group ? */
+ int group_has_capacity; /* Is there extra capacity in the group? */
};
/**
@@ -2456,6 +2463,9 @@ static inline void update_sg_lb_stats(struct sched_domain *sd,
sgs->group_capacity = DIV_ROUND_CLOSEST(group->cpu_power, SCHED_LOAD_SCALE);
if (!sgs->group_capacity)
sgs->group_capacity = fix_small_capacity(sd, group);
+
+ if (sgs->group_capacity > sgs->sum_nr_running)
+ sgs->group_has_capacity = 1;
}
/**
@@ -2554,12 +2564,14 @@ static inline void update_sd_lb_stats(struct sched_domain *sd, int this_cpu,
sds->this = sg;
sds->this_nr_running = sgs.sum_nr_running;
sds->this_load_per_task = sgs.sum_weighted_load;
+ sds->this_has_capacity = sgs.group_has_capacity;
} else if (update_sd_pick_busiest(sd, sds, sg, &sgs, this_cpu)) {
sds->max_load = sgs.avg_load;
sds->busiest = sg;
sds->busiest_nr_running = sgs.sum_nr_running;
sds->busiest_group_capacity = sgs.group_capacity;
sds->busiest_load_per_task = sgs.sum_weighted_load;
+ sds->busiest_has_capacity = sgs.group_has_capacity;
sds->group_imb = sgs.group_imb;
}
@@ -2756,6 +2768,7 @@ static inline void calculate_imbalance(struct sd_lb_stats *sds, int this_cpu,
return fix_small_imbalance(sds, this_cpu, imbalance);
}
+
/******* find_busiest_group() helpers end here *********************/
/**
@@ -2807,6 +2820,11 @@ find_busiest_group(struct sched_domain *sd, int this_cpu,
* 4) This group is more busy than the avg busieness at this
* sched_domain.
* 5) The imbalance is within the specified limit.
+ *
+ * Note: when doing newidle balance, if the local group has excess
+ * capacity (i.e. nr_running < group_capacity) and the busiest group
+ * does not have any capacity, we force a load balance to pull tasks
+ * to the local group. In this case, we skip past checks 3, 4 and 5.
*/
if (!(*balance))
goto ret;
@@ -2818,6 +2836,11 @@ find_busiest_group(struct sched_domain *sd, int this_cpu,
if (!sds.busiest || sds.busiest_nr_running == 0)
goto out_balanced;
+ /* SD_BALANCE_NEWIDLE trumps SMP nice when underutilized */
+ if (idle == CPU_NEWLY_IDLE && sds.this_has_capacity &&
+ !sds.busiest_has_capacity)
+ goto force_balance;
+
if (sds.this_load >= sds.max_load)
goto out_balanced;
@@ -2829,6 +2852,7 @@ find_busiest_group(struct sched_domain *sd, int this_cpu,
if (100 * sds.max_load <= sd->imbalance_pct * sds.this_load)
goto out_balanced;
+force_balance:
/* Looks like there is an imbalance. Compute it */
calculate_imbalance(&sds, this_cpu, imbalance);
return sds.busiest;
@@ -3162,10 +3186,8 @@ static void idle_balance(int this_cpu, struct rq *this_rq)
interval = msecs_to_jiffies(sd->balance_interval);
if (time_after(next_balance, sd->last_balance + interval))
next_balance = sd->last_balance + interval;
- if (pulled_task) {
- this_rq->idle_stamp = 0;
+ if (pulled_task)
break;
- }
}
raw_spin_lock(&this_rq->lock);
* [tip:sched/core] sched: Drop group_capacity to 1 only if local group has extra capacity
From: tip-bot for Nikhil Rao @ 2010-10-18 19:24 UTC (permalink / raw)
To: linux-tip-commits
Cc: linux-kernel, hpa, mingo, a.p.zijlstra, ncrao, tglx, mingo
Commit-ID: 75dd321d79d495a0ee579e6249ebc38ddbb2667f
Gitweb: http://git.kernel.org/tip/75dd321d79d495a0ee579e6249ebc38ddbb2667f
Author: Nikhil Rao <ncrao@google.com>
AuthorDate: Fri, 15 Oct 2010 13:12:30 -0700
Committer: Ingo Molnar <mingo@elte.hu>
CommitDate: Mon, 18 Oct 2010 20:52:19 +0200
sched: Drop group_capacity to 1 only if local group has extra capacity
When SD_PREFER_SIBLING is set on a sched domain, drop group_capacity to 1
only if the local group has extra capacity. The extra check prevents the case
where we always pull from the heaviest group when it is already under-utilized
(possible when a large weight task outweighs the other tasks on the system).
For example, consider a 16-cpu quad-core quad-socket machine with MC and NUMA
scheduling domains. Let's say we spawn 15 nice0 tasks and one nice-15 task,
and each task is running on one core. In this case, we observe the following
events when balancing at the NUMA domain:
- find_busiest_group() will always pick the sched group containing the niced
task to be the busiest group.
- find_busiest_queue() will then always pick one of the cpus running the
nice0 task (never picks the cpu with the nice -15 task since
weighted_cpuload > imbalance).
- The load balancer fails to migrate the task since it is the running task
and increments sd->nr_balance_failed.
- It repeats the above steps a few more times until sd->nr_balance_failed > 5,
at which point it kicks off the active load balancer, wakes up the migration
thread and kicks the nice 0 task off the cpu.
The load balancer doesn't stop until we kick out all nice 0 tasks from
the sched group, leaving you with 3 idle cpus and one cpu running the
nice -15 task.
When balancing at the NUMA domain, we drop sgs.group_capacity to 1 if the child
domain (in this case MC) has SD_PREFER_SIBLING set. Subsequent load checks are
not relevant because the niced task has a very large weight.
In this patch, we add an extra condition to the "if (prefer_sibling)" check in
update_sd_lb_stats(). We drop the capacity of a group only if the local group
has extra capacity, i.e. nr_running < group_capacity. This patch preserves the
original intent of the prefer_sibling check (to spread tasks across the system
in low utilization scenarios) and fixes the case above.
It helps in the following ways:
- In low utilization cases (where nr_tasks << nr_cpus), we still drop
group_capacity down to 1 if we prefer siblings.
- On very busy systems (where nr_tasks >> nr_cpus), sgs.nr_running will most
likely be > sgs.group_capacity.
- When balancing large weight tasks, if the local group does not have extra
capacity, we do not pick the group with the niced task as the busiest group.
This prevents failed balances, active migration and the under-utilization
described above.
Signed-off-by: Nikhil Rao <ncrao@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1287173550-30365-5-git-send-email-ncrao@google.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
kernel/sched_fair.c | 9 +++++++--
1 files changed, 7 insertions(+), 2 deletions(-)
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 032b548..f1c615f 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -2554,9 +2554,14 @@ static inline void update_sd_lb_stats(struct sched_domain *sd, int this_cpu,
/*
* In case the child domain prefers tasks go to siblings
* first, lower the sg capacity to one so that we'll try
- * and move all the excess tasks away.
+ * and move all the excess tasks away. We lower the capacity
+ * of a group only if the local group has the capacity to fit
+ * these excess tasks, i.e. nr_running < group_capacity. The
+ * extra check prevents the case where you always pull from the
+ * heaviest group when it is already under-utilized (possible
+ * with a large weight task outweighs the tasks on the system).
*/
- if (prefer_sibling)
+ if (prefer_sibling && !local_group && sds->this_has_capacity)
sgs.group_capacity = min(sgs.group_capacity, 1UL);
if (local_group) {