From: Preeti U Murthy <preeti@linux.vnet.ibm.com>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Alex Shi <alex.shi@intel.com>, Paul Turner <pjt@google.com>,
Ingo Molnar <mingo@redhat.com>,
Peter Zijlstra <peterz@infradead.org>,
Thomas Gleixner <tglx@linutronix.de>,
Andrew Morton <akpm@linux-foundation.org>,
Arjan van de Ven <arjan@linux.intel.com>,
Borislav Petkov <bp@alien8.de>,
namhyung@kernel.org, Mike Galbraith <efault@gmx.de>,
Vincent Guittot <vincent.guittot@linaro.org>,
Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
Linux Kernel Mailing List <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH v3 09/22] sched: compute runnable load avg in cpu_load and cpu_avg_load_per_task
Date: Mon, 07 Jan 2013 12:30:17 +0530
Message-ID: <50EA7281.9040804@linux.vnet.ibm.com>
In-Reply-To: <CA+55aFxbS-u1Zfwg2oKOmHq6FBF+dRRhrdssqc9PpC_gXoV3+g@mail.gmail.com>
Hi everyone,
On 01/07/2013 12:01 AM, Linus Torvalds wrote:
> On Sat, Jan 5, 2013 at 11:54 PM, Alex Shi <alex.shi@intel.com> wrote:
>>
>> I just looked into the aim9 benchmark: in this case it forks 2000 tasks,
>> and after all tasks are ready, aim9 gives a signal, then all tasks burst
>> awake and run until all are finished.
>> Since each task finishes very quickly, an imbalanced, empty cpu
>> may go to sleep until regular balancing gives it some new tasks. That
>> causes the performance drop, by causing more entries into idle.
>
> Sounds like for AIM (and possibly for other really bursty loads), we
> might want to do some load-balancing at wakeup time by *just* looking
> at the number of running tasks, rather than at the load average. Hmm?
During wakeups, the load average is not even queried, is it? wake_affine() is called
to decide to which cpu's affinity (prev/waking) the task should go, but after that
select_idle_sibling() simply sees if there is an idle cpu to offload the task to.
It looks like only periodic load balancing can correct this scenario as of now,
as pointed out below.
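To make that concrete, here is a toy model of the wakeup path as I described
it (hypothetical helper names and signatures, much simplified from the real
select_task_rq_fair(); note that the load average is never consulted):

#include <stdbool.h>

/* Stand-in for wake_affine(): pick between prev and waking cpu. */
static bool toy_wake_affine(int prev_cpu, int waking_cpu)
{
	return prev_cpu != waking_cpu;
}

/* Stand-in for select_idle_sibling(): hunt for any idle cpu. */
static int toy_select_idle_sibling(int target, const bool *cpu_idle,
				   int nr_cpus)
{
	int cpu;

	for (cpu = 0; cpu < nr_cpus; cpu++)
		if (cpu_idle[cpu])
			return cpu;	/* offload to the first idle cpu */
	return target;			/* no idle cpu: stay put */
}

static int toy_select_task_rq_fair(int prev_cpu, int waking_cpu,
				   const bool *cpu_idle, int nr_cpus)
{
	int target = toy_wake_affine(prev_cpu, waking_cpu) ?
			waking_cpu : prev_cpu;

	return toy_select_idle_sibling(target, cpu_idle, nr_cpus);
}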
>
> The load average is fundamentally always going to run behind a bit,
> and while you want to use it for long-term balancing, short-term you
> might want to do just a "if we have a huge amount of runnable
> processes, do a load balancing *now*". Where "huge amount" should
> probably be relative to the long-term load balancing (ie comparing the
> number of runnable processes on this CPU right *now* with the load
> average over the last second or so would show a clear spike, and a
> reason for quick action).
>
> Linus
>
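If I read the suggestion right, a minimal sketch of such a spike check could
look like the below (hypothetical names and threshold, not actual kernel code):

/* Kick an immediate balance when the number of tasks runnable right
 * now greatly exceeds what the long-term load average (over the last
 * second or so) predicts for this cpu. */

#define SPIKE_FACTOR	4	/* "huge amount": hypothetical threshold */

static int burst_detected(unsigned int nr_running_now,
			  unsigned int avg_nr_running_1s)
{
	return nr_running_now > SPIKE_FACTOR * (avg_nr_running_1s + 1);
}

/* A caller on the wakeup path would then do something like:
 *
 *	if (burst_detected(this_nr_running, this_avg_nr_running))
 *		rebalance_domains_now();
 */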
Earlier I had posted a patch to address this:
https://lkml.org/lkml/2012/10/25/156
In it, update_sd_pick_busiest() checks whether a sched group is queueing up too
many running tasks that need to be offloaded.
--------------START_PATCH-------------------------------------------------
The scenario which led to this patch is shown below:
Consider Task1 and Task2 to be long running tasks
and Tasks 3,4,5,6 to be short running tasks.

                Task3
                Task4
   Task1        Task5
   Task2        Task6
   ------       ------
 SCHED_GRP1   SCHED_GRP2
The normal load calculator would qualify SCHED_GRP2 as
the candidate for sd->busiest due to the following loads
that it calculates:
SCHED_GRP1: 2048
SCHED_GRP2: 4096
PJT's load calculator, however, would qualify SCHED_GRP1 as the candidate
for sd->busiest due to the following loads that it calculates:
SCHED_GRP1: 3200
SCHED_GRP2: 1156
This patch aims to strike a balance between the load of the
group and the number of tasks running on the group to decide the
busiest group in the sched_domain.
This means we will need to use PJT's metric, but with an
additional constraint.
Signed-off-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
---
kernel/sched/fair.c | 25 ++++++++++++++++++++++---
1 file changed, 22 insertions(+), 3 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e02dad4..aafa3c1 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -165,7 +165,8 @@ void sched_init_granularity(void)
#else
# define WMULT_CONST (1UL << 32)
#endif
-
+#define NR_THRESHOLD 2
+#define LOAD_THRESHOLD 1
#define WMULT_SHIFT 32
/*
@@ -4169,6 +4170,7 @@ struct sd_lb_stats {
/* Statistics of the busiest group */
unsigned int busiest_idle_cpus;
unsigned long max_load;
+ u64 max_sg_load; /* Equivalent of max_load but calculated using PJT's metric */
unsigned long busiest_load_per_task;
unsigned long busiest_nr_running;
unsigned long busiest_group_capacity;
@@ -4628,8 +4630,24 @@ static bool update_sd_pick_busiest(struct lb_env *env,
struct sched_group *sg,
struct sg_lb_stats *sgs)
{
- if (sgs->avg_load <= sds->max_load)
- return false;
+ /* Use PJT's metric to qualify a sched_group as busy.
+ *
+ * But a low-load sched group may be queueing up many tasks,
+ * so before dismissing a sched group with a lesser load, ensure
+ * that its number of running tasks is checked whenever its load
+ * is not too far below the max load seen so far.
+ *
+ * As of now, since LOAD_THRESHOLD is 1, this check is a nop,
+ * but we could vary LOAD_THRESHOLD suitably to bring it into effect.
+ */
+ if (sgs->avg_cfs_runnable_load <= sds->max_sg_load) {
+ if (sgs->avg_cfs_runnable_load > LOAD_THRESHOLD * sds->max_sg_load) {
+ if (sgs->sum_nr_running <= (NR_THRESHOLD + sds->busiest_nr_running))
+ return false;
+ } else {
+ return false;
+ }
+ }
if (sgs->sum_nr_running > sgs->group_capacity)
return true;
@@ -4708,6 +4726,7 @@ static inline void update_sd_lb_stats(struct lb_env *env,
sds->this_idle_cpus = sgs.idle_cpus;
} else if (update_sd_pick_busiest(env, sds, sg, &sgs)) {
sds->max_load = sgs.avg_load;
+ sds->max_sg_load = sgs.avg_cfs_runnable_load;
sds->busiest = sg;
sds->busiest_nr_running = sgs.sum_nr_running;
sds->busiest_idle_cpus = sgs.idle_cpus;
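To see how the new check behaves with the scenario above: suppose SCHED_GRP1
was recorded first, so sds->max_sg_load = 3200 and sds->busiest_nr_running = 2,
and SCHED_GRP2 is then evaluated with avg_cfs_runnable_load = 1156 and
sum_nr_running = 4. Since 1156 <= 3200 and 1156 > 1 * 3200 does not hold
(LOAD_THRESHOLD is 1), update_sd_pick_busiest() returns false and SCHED_GRP1
remains the busiest group. If LOAD_THRESHOLD were lowered so that the cutoff
fell below 1156, the nr_running check would apply instead: 4 <= NR_THRESHOLD + 2
still returns false, but a group queueing 5 or more tasks would fall through
and could be picked as busiest.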
Regards
Preeti U Murthy