From: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
To: "Chris Friesen" <cfriesen@nortel.com>
Cc: linux-kernel@vger.kernel.org, mingo@elte.hu,
a.p.zijlstra@chello.nl, pj@sgi.com,
Balbir Singh <balbir@in.ibm.com>,
aneesh.kumar@linux.vnet.ibm.com, dhaval@linux.vnet.ibm.com
Subject: Re: fair group scheduler not so fair?
Date: Thu, 29 May 2008 22:16:07 +0530
Message-ID: <20080529164607.GC12836@linux.vnet.ibm.com>
In-Reply-To: <483DA5E7.5050600@nortel.com>
On Wed, May 28, 2008 at 12:35:19PM -0600, Chris Friesen wrote:
> Looking much better, but still some fairness issues with more complex
> setups.
>
> pid 2477 in A, others in B
> 2477 99.5%
> 2478 49.9%
> 2479 49.9%
>
> move 2478 to A
> 2479 99.9%
> 2477 49.9%
> 2478 49.9%
>
> So far so good. I then created C, and moved 2478 to it. A 3-second "top"
> gave almost a 15% error from the desired behaviour for one group:
>
> 2479 76.2%
> 2477 72.2%
> 2478 51.0%
>
>
> A 10-sec average was better, but we still see errors of 6%:
> 2478 72.8%
> 2477 64.0%
> 2479 63.2%
I found a couple of issues:

1. A minor bug in load_balance_fair() in the calculation of moved_load:

	moved_load /= busiest_cfs_rq->load.weight + 1;

   In place of busiest_cfs_rq->load.weight, the load *before* moving
   tasks needs to be used. Fixed in the updated patch below.

2. We walk task groups sequentially in load_balance_fair() without
   necessarily looking for the best group. As a result,
   load_balance_fair() can pick a non-optimal group/tasks to pull.
   I have a hack below (strict = 1/0) to mitigate this problem, but we
   need a better algorithm to pick the best group from which to pull
   tasks.

3. sd->imbalance_pct (default = 125) specifies how much imbalance we
   tolerate. The lower the value, the better the fairness. To check
   this, I changed the default to 105, which gives me better results.
With the updated patch and imbalance_pct = 105, here's what my 60-sec
average looks like:
4353 root 20 0 1384 232 176 R 67.0 0.0 2:47.23 1 hoga
4354 root 20 0 1384 228 176 R 66.5 0.0 2:44.65 1 hogb
4355 root 20 0 1384 228 176 R 66.3 0.0 2:28.18 0 hogb
Error is < 1%
> I then set up a scenario with 3 tasks in A, 2 in B, and 1 in C. A
> 10-second "top" gave errors of up to 6.5%:
> 2500 60.1%
> 2491 37.5%
> 2492 37.4%
> 2489 25.0%
> 2488 19.9%
> 2490 19.9%
>
> a re-test gave errors of up to 8.1%:
>
> 2534 74.8%
> 2533 30.1%
> 2532 30.0%
> 2529 25.0%
> 2530 20.0%
> 2531 20.0%
>
> Another retest gave perfect results initially:
>
> 2559 66.5%
> 2560 33.4%
> 2561 33.3%
> 2564 22.3%
> 2562 22.2%
> 2563 22.1%
>
> but moving 2564 from group A to C and then back to A disturbed the perfect
> division of time and resulted in almost the same utilization pattern as
> above:
>
> 2559 74.9%
> 2560 30.0%
> 2561 29.6%
> 2564 25.3%
> 2562 20.0%
> 2563 20.0%
Again with the updated patch and changed imbalance_pct, here's what I see:
4458 root 20 0 1384 232 176 R 66.3 0.0 2:11.04 0 hogc
4457 root 20 0 1384 232 176 R 33.7 0.0 1:06.19 0 hogb
4456 root 20 0 1384 232 176 R 33.4 0.0 1:06.59 0 hogb
4455 root 20 0 1384 228 176 R 22.5 0.0 0:44.09 0 hoga
4453 root 20 0 1384 232 176 R 22.3 0.0 0:44.10 1 hoga
4454 root 20 0 1384 228 176 R 22.2 0.0 0:43.94 1 hoga
(Error < 1%)
In summary, can you do this before running your tests:
1. Apply updated patch below on top of 2.6.26-rc3 + Peter's patches
(http://programming.kicks-ass.net/kernel-patches/sched-smp-group-fixes/)
2. Set up the test environment as below:
# mkdir /cgroup
# mount -t cgroup -ocpu none /cgroup
# cd /cgroup
# #Move all tasks to 'sys' group and give it low shares
# mkdir sys
# cd sys
# for i in `cat ../tasks`; do
>     echo $i > tasks
> done
# echo 100 > cpu.shares
# cd /proc/sys/kernel/sched_domain
# for i in `find . -name imbalance_pct`; do echo 105 > $i; done
---
init/Kconfig | 2 +-
kernel/sched.c | 12 +++++++++---
kernel/sched_debug.c | 3 ++-
kernel/sched_fair.c | 26 +++++++++++++++++---------
4 files changed, 29 insertions(+), 14 deletions(-)
Index: current/init/Kconfig
===================================================================
--- current.orig/init/Kconfig
+++ current/init/Kconfig
@@ -349,7 +349,7 @@ config RT_GROUP_SCHED
See Documentation/sched-rt-group.txt for more information.
choice
- depends on GROUP_SCHED
+ depends on GROUP_SCHED && (FAIR_GROUP_SCHED || RT_GROUP_SCHED)
prompt "Basis for grouping tasks"
default USER_SCHED
Index: current/kernel/sched.c
===================================================================
--- current.orig/kernel/sched.c
+++ current/kernel/sched.c
@@ -1398,7 +1398,7 @@ static unsigned long
balance_tasks(struct rq *this_rq, int this_cpu, struct rq *busiest,
unsigned long max_load_move, struct sched_domain *sd,
enum cpu_idle_type idle, int *all_pinned,
- int *this_best_prio, struct rq_iterator *iterator);
+ int *this_best_prio, struct rq_iterator *iterator, int strict);
static int
iter_move_one_task(struct rq *this_rq, int this_cpu, struct rq *busiest,
@@ -1534,6 +1534,9 @@ tg_shares_up(struct task_group *tg, int
unsigned long shares = 0;
int i;
+ if (!tg->parent)
+ return;
+
for_each_cpu_mask(i, sd->span) {
rq_weight += tg->cfs_rq[i]->load.weight;
shares += tg->cfs_rq[i]->shares;
@@ -2896,7 +2899,7 @@ static unsigned long
balance_tasks(struct rq *this_rq, int this_cpu, struct rq *busiest,
unsigned long max_load_move, struct sched_domain *sd,
enum cpu_idle_type idle, int *all_pinned,
- int *this_best_prio, struct rq_iterator *iterator)
+ int *this_best_prio, struct rq_iterator *iterator, int strict)
{
int loops = 0, pulled = 0, pinned = 0, skip_for_load;
struct task_struct *p;
@@ -2919,7 +2922,10 @@ next:
* skip a task if it will be the highest priority task (i.e. smallest
* prio value) on its new queue regardless of its load weight
*/
- skip_for_load = (p->se.load.weight >> 1) > rem_load_move +
+ if (strict)
+ skip_for_load = p->se.load.weight >= rem_load_move;
+ else
+ skip_for_load = (p->se.load.weight >> 1) > rem_load_move +
SCHED_LOAD_SCALE_FUZZ;
if ((skip_for_load && p->prio >= *this_best_prio) ||
!can_migrate_task(p, busiest, this_cpu, sd, idle, &pinned)) {
Index: current/kernel/sched_debug.c
===================================================================
--- current.orig/kernel/sched_debug.c
+++ current/kernel/sched_debug.c
@@ -119,7 +119,7 @@ void print_cfs_rq(struct seq_file *m, in
struct sched_entity *last;
unsigned long flags;
-#if !defined(CONFIG_CGROUP_SCHED) || !defined(CONFIG_USER_SCHED)
+#ifndef CONFIG_CGROUP_SCHED
SEQ_printf(m, "\ncfs_rq[%d]:\n", cpu);
#else
char path[128] = "";
@@ -170,6 +170,7 @@ void print_cfs_rq(struct seq_file *m, in
#ifdef CONFIG_FAIR_GROUP_SCHED
#ifdef CONFIG_SMP
SEQ_printf(m, " .%-30s: %lu\n", "shares", cfs_rq->shares);
+ SEQ_printf(m, " .%-30s: %lu\n", "h_load", cfs_rq->h_load);
#endif
#endif
}
Index: current/kernel/sched_fair.c
===================================================================
--- current.orig/kernel/sched_fair.c
+++ current/kernel/sched_fair.c
@@ -1386,9 +1386,6 @@ __load_balance_iterator(struct cfs_rq *c
next = next->next;
} while (next != &cfs_rq->tasks && !entity_is_task(se));
- if (next == &cfs_rq->tasks)
- return NULL;
-
cfs_rq->balance_iterator = next;
if (entity_is_task(se))
@@ -1415,7 +1412,7 @@ static unsigned long
__load_balance_fair(struct rq *this_rq, int this_cpu, struct rq *busiest,
unsigned long max_load_move, struct sched_domain *sd,
enum cpu_idle_type idle, int *all_pinned, int *this_best_prio,
- struct cfs_rq *cfs_rq)
+ struct cfs_rq *cfs_rq, int strict)
{
struct rq_iterator cfs_rq_iterator;
@@ -1425,10 +1422,11 @@ __load_balance_fair(struct rq *this_rq,
return balance_tasks(this_rq, this_cpu, busiest,
max_load_move, sd, idle, all_pinned,
- this_best_prio, &cfs_rq_iterator);
+ this_best_prio, &cfs_rq_iterator, strict);
}
#ifdef CONFIG_FAIR_GROUP_SCHED
+
static unsigned long
load_balance_fair(struct rq *this_rq, int this_cpu, struct rq *busiest,
unsigned long max_load_move,
@@ -1438,13 +1436,17 @@ load_balance_fair(struct rq *this_rq, in
long rem_load_move = max_load_move;
int busiest_cpu = cpu_of(busiest);
struct task_group *tg;
+ int strict = 1;
update_h_load(cpu_of(busiest));
rcu_read_lock();
+
+retry:
list_for_each_entry(tg, &task_groups, list) {
struct cfs_rq *busiest_cfs_rq = tg->cfs_rq[busiest_cpu];
long rem_load, moved_load;
+ unsigned long busiest_cfs_rq_load;
/*
* empty group
@@ -1452,25 +1454,31 @@ load_balance_fair(struct rq *this_rq, in
if (!busiest_cfs_rq->task_weight)
continue;
- rem_load = rem_load_move * busiest_cfs_rq->load.weight;
+ busiest_cfs_rq_load = busiest_cfs_rq->load.weight;
+ rem_load = rem_load_move * busiest_cfs_rq_load;
rem_load /= busiest_cfs_rq->h_load + 1;
moved_load = __load_balance_fair(this_rq, this_cpu, busiest,
rem_load, sd, idle, all_pinned, this_best_prio,
- tg->cfs_rq[busiest_cpu]);
+ tg->cfs_rq[busiest_cpu], strict);
if (!moved_load)
continue;
moved_load *= busiest_cfs_rq->h_load;
- moved_load /= busiest_cfs_rq->load.weight + 1;
+ moved_load /= busiest_cfs_rq_load + 1;
rem_load_move -= moved_load;
if (rem_load_move < 0)
break;
}
+
+ if (rem_load_move && strict--)
+ goto retry;
+
rcu_read_unlock();
+
return max_load_move - rem_load_move;
}
#else
@@ -1482,7 +1490,7 @@ load_balance_fair(struct rq *this_rq, in
{
return __load_balance_fair(this_rq, this_cpu, busiest,
max_load_move, sd, idle, all_pinned,
- this_best_prio, &busiest->cfs);
+ this_best_prio, &busiest->cfs, 0);
}
#endif
--
Regards,
vatsa