From: Ingo Molnar <mingo@elte.hu>
To: Andrew Morton <akpm@linux-foundation.org>,
Yinghai Lu <yinghai@kernel.org>
Cc: svaidy@linux.vnet.ibm.com, linux-kernel@vger.kernel.org,
suresh.b.siddha@intel.com, venkatesh.pallipadi@intel.com,
a.p.zijlstra@chello.nl, dipankar@in.ibm.com,
balbir@linux.vnet.ibm.com, vatsa@linux.vnet.ibm.com,
ego@in.ibm.com, andi@firstfloor.org, davecb@sun.com,
tconnors@astro.swin.edu.au, maxk@qualcomm.com,
gregory.haskins@gmail.com, pavel@suse.cz, rusty@rustcorp.com.au
Subject: Re: [PATCH v7 4/8] sched: nominate preferred wakeup cpu
Date: Fri, 19 Dec 2008 23:31:30 +0100
Message-ID: <20081219223130.GA13172@elte.hu>
In-Reply-To: <20081219222709.GN2351@elte.hu>
* Ingo Molnar <mingo@elte.hu> wrote:
> > [ 74.013893] <EOI> <0>Code: 8b 47 08 4a 8b 0c 00 48 85 c9 0f 84 c3 00 00 00 49 8b 47 10 4a 8b 04 00 48 8b 80 a8 00 00 00 48 85 c0 74 13 48 0f af 45 c8 31 d2 <48> f7 75 c0 49 89 c6 48 89 c6 eb 16 48 63 55 d0 48 8b 45 c8 45
> > [ 74.013893] RIP [<ffffffff802375d8>] tg_shares_up+0x113/0x1f1
> > [ 74.013893] RSP <ffff88025e057d88>
> > [ 74.020022] ---[ end trace 2fc4046e394f2312 ]---
> > [ 74.020188] Kernel panic - not syncing: Fatal exception in interrupt
> >
> > config: http://userweb.kernel.org/~akpm/config-akpm2.txt
> >
> > I'll try hacking some div-by-zero avoidance into update_group_shares_cpu().
>
> hm, weird - Yinghai noticed this crash in -tip and we have that patch
> zapped from all -*-next branches already.
I guess you applied Ken's patch from email and had it in -mm? We now have a
new version of that patch from Ken, based on Yinghai's bug report - only
lightly tested so far.
Ingo
----------->
From d71f5a7c8bf9cd7c74159a53e522e363f2eddaf5 Mon Sep 17 00:00:00 2001
From: Ken Chen <kenchen@google.com>
Date: Fri, 19 Dec 2008 10:11:50 -0800
Subject: [PATCH] sched: fix uneven per-cpu task_group share distribution
Impact: fix group scheduling behavior
While testing CFS scheduler on linux-2.6-tip tree:
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
We found that a task which is pinned to a CPU could be starved relative to
its allocated fair share.
The per-cpu sched_entity load share calculation in tg_shares_up() /
update_group_shares_cpu() distributes the task_group's total shares among
all CPUs in a given sched domain. This dilutes the task_group's per-cpu
share, because shares are also handed to CPUs that carry no load for that
group. The trapped share is unconsumable, and it leads to fair-share
starvation on the runnable CPU.
Peter was right that the low level function still needs to distinguish
between a boost given to CPUs that have no load and the actual tg shares
that should be distributed among the CPUs on which the tg is running.
This patch adds that boost. We think the scheduler should boost at most one
full set of tg shares, spread across all empty CPUs that have no load for
the specific task_group, in order to bound the maximum temporary boost a
given task_group can receive.
Signed-off-by: Ken Chen <kenchen@google.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
kernel/sched.c | 49 ++++++++++++++++++++++++++++++-------------------
1 files changed, 30 insertions(+), 19 deletions(-)
diff --git a/kernel/sched.c b/kernel/sched.c
index ae5ca3f..7d07c97 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -1475,24 +1475,34 @@ static void __set_se_shares(struct sched_entity *se, unsigned long shares);
* Calculate and set the cpu's group shares.
*/
static void
-update_group_shares_cpu(struct task_group *tg, int cpu,
+update_group_shares_cpu(struct task_group *tg, int cpu, unsigned long boost,
unsigned long sd_shares, unsigned long sd_rq_weight)
{
- unsigned long shares;
+ unsigned long shares, raw_shares;
unsigned long rq_weight;
if (!tg->se[cpu])
return;
rq_weight = tg->cfs_rq[cpu]->rq_weight;
-
- /*
- * \Sum shares * rq_weight
- * shares = -----------------------
- * \Sum rq_weight
- *
- */
- shares = (sd_shares * rq_weight) / sd_rq_weight;
+ if (rq_weight && sd_rq_weight) {
+ /*
+ * \Sum shares * rq_weight
+ * shares = -----------------------
+ * \Sum rq_weight
+ *
+ */
+ raw_shares = (sd_shares * rq_weight) / sd_rq_weight;
+ shares = raw_shares;
+ } else {
+ /*
+ * If there are currently no tasks on the cpu pretend there
+ * is one of average load so that when a new task gets to
+ * run here it will not get delayed by group starvation.
+ */
+ raw_shares = 0;
+ shares = boost;
+ }
shares = clamp_t(unsigned long, shares, MIN_SHARES, MAX_SHARES);
if (abs(shares - tg->se[cpu]->load.weight) >
@@ -1501,7 +1511,7 @@ update_group_shares_cpu(struct task_group *tg, int cpu,
unsigned long flags;
spin_lock_irqsave(&rq->lock, flags);
- tg->cfs_rq[cpu]->shares = shares;
+ tg->cfs_rq[cpu]->shares = raw_shares;
__set_se_shares(tg->se[cpu], shares);
spin_unlock_irqrestore(&rq->lock, flags);
@@ -1517,18 +1527,14 @@ static int tg_shares_up(struct task_group *tg, void *data)
{
unsigned long weight, rq_weight = 0;
unsigned long shares = 0;
+ unsigned long boost;
struct sched_domain *sd = data;
- int i;
+ int i, no_load_count = 0;
for_each_cpu(i, sched_domain_span(sd)) {
- /*
- * If there are currently no tasks on the cpu pretend there
- * is one of average load so that when a new task gets to
- * run here it will not get delayed by group starvation.
- */
weight = tg->cfs_rq[i]->load.weight;
if (!weight)
- weight = NICE_0_LOAD;
+ no_load_count++;
tg->cfs_rq[i]->rq_weight = weight;
rq_weight += weight;
@@ -1541,8 +1547,13 @@ static int tg_shares_up(struct task_group *tg, void *data)
if (!sd->parent || !(sd->parent->flags & SD_LOAD_BALANCE))
shares = tg->shares;
+ if (no_load_count)
+ boost = shares / no_load_count;
+ else
+ boost = shares / cpumask_weight(sched_domain_span(sd));
+
for_each_cpu(i, sched_domain_span(sd))
- update_group_shares_cpu(tg, i, shares, rq_weight);
+ update_group_shares_cpu(tg, i, boost, shares, rq_weight);
return 0;
}
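To make the arithmetic above concrete: a minimal userspace sketch (not kernel code; `model_shares_up()` is a hypothetical stand-in for the patched tg_shares_up()/update_group_shares_cpu() pair, and the MIN_SHARES/MAX_SHARES values are assumed from sched.c of that era):

```c
#include <assert.h>

#define MIN_SHARES 2UL
#define MAX_SHARES (1UL << 18)

static unsigned long clamp_ul(unsigned long v, unsigned long lo,
			      unsigned long hi)
{
	if (v < lo)
		return lo;
	if (v > hi)
		return hi;
	return v;
}

/*
 * Model of the patched logic: busy CPUs split tg_shares in proportion
 * to their rq weight (shares = tg_shares * rq_weight / sum_rq_weight),
 * while CPUs with no load for this group get a bounded boost of
 * tg_shares / no_load_count instead of diluting the busy CPUs' shares.
 */
static void model_shares_up(const unsigned long *weight, int ncpus,
			    unsigned long tg_shares, unsigned long *out)
{
	unsigned long rq_weight = 0, boost;
	int i, no_load_count = 0;

	for (i = 0; i < ncpus; i++) {
		if (!weight[i])
			no_load_count++;
		rq_weight += weight[i];
	}

	boost = no_load_count ? tg_shares / no_load_count
			      : tg_shares / ncpus;

	for (i = 0; i < ncpus; i++) {
		unsigned long shares;

		if (weight[i] && rq_weight)
			shares = tg_shares * weight[i] / rq_weight;
		else
			shares = boost;
		out[i] = clamp_ul(shares, MIN_SHARES, MAX_SHARES);
	}
}
```

With a task pinned to one CPU of four (weights 1024, 0, 0, 0) and tg_shares = 1024, the busy CPU keeps the full 1024 instead of being diluted to 256, and each of the three idle CPUs gets a boost of 1024/3 = 341, so the total boost is bounded by one set of tg shares.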