Re: [PATCH v3 09/22] sched: compute runnable load avg in cpu_load and cpu_avg_load_per_task

From: Alex Shi <alex.shi@intel.com>
To: Mike Galbraith <efault@gmx.de>
Cc: Paul Turner <pjt@google.com>, Ingo Molnar <mingo@redhat.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Thomas Gleixner <tglx@linutronix.de>,
	Andrew Morton <akpm@linux-foundation.org>,
	Arjan van de Ven <arjan@linux.intel.com>,
	Borislav Petkov <bp@alien8.de>,
	namhyung@kernel.org, Vincent Guittot <vincent.guittot@linaro.org>,
	Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
	preeti@linux.vnet.ibm.com,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH v3 09/22] sched: compute runnable load avg in cpu_load and cpu_avg_load_per_task
Date: Tue, 22 Jan 2013 15:50:13 +0800
Message-ID: <50FE44B5.6020004@intel.com>
In-Reply-To: <1358837740.5782.209.camel@marge.simpson.net>

On 01/22/2013 02:55 PM, Mike Galbraith wrote:
> On Tue, 2013-01-22 at 11:20 +0800, Alex Shi wrote: 
>>>>>>
>>>>>> I just looked into the aim9 benchmark. In this case it forks 2000 tasks;
>>>>>> once all tasks are ready, aim9 sends a signal, and all tasks wake up in a
>>>>>> burst and run until they finish.
>>>>>> Since each task finishes very quickly, an imbalanced, empty cpu may go
>>>>>> to sleep until a regular balancing gives it some new tasks. That causes
>>>>>> the performance drop, along with more idle entries.
>>>>>
>>>>> Sounds like for AIM (and possibly for other really bursty loads), we
>>>>> might want to do some load-balancing at wakeup time by *just* looking
>>>>> at the number of running tasks, rather than at the load average. Hmm?
>>>>>
>>>>> The load average is fundamentally always going to run behind a bit,
>>>>> and while you want to use it for long-term balancing, short-term you
>>>>> might want to do just a "if we have a huge amount of runnable
>>>>> processes, do a load balancing *now*". Where "huge amount" should
>>>>> probably be relative to the long-term load balancing (ie comparing the
>>>>> number of runnable processes on this CPU right *now* with the load
>>>>> average over the last second or so would show a clear spike, and a
>>>>> reason for quick action).
>>>>>
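
A minimal sketch of the spike check suggested above, comparing the number of
runnable tasks right now with a recent average. The function name, the
`spike_ratio` threshold and the `min_tasks` floor are all hypothetical, chosen
only for illustration; nothing here is taken from the patch set.

```python
def looks_like_burst(nr_running_now, avg_runnable, spike_ratio=2.0, min_tasks=2):
    """Return True when the instantaneous task count clearly exceeds the recent average.

    nr_running_now: runnable tasks on this CPU right now.
    avg_runnable:   average number of runnable tasks over roughly the last second.
    spike_ratio and min_tasks are illustrative knobs, not kernel tunables.
    """
    if nr_running_now < min_tasks:
        return False  # one task (or none) is never a burst worth balancing
    return nr_running_now > spike_ratio * avg_runnable


# Steady state: 3 tasks now against an average of 2.5 is not a spike.
print(looks_like_burst(3, 2.5))    # False
# aim9-like wakeup: 40 tasks appear on a CPU that averaged about 1.
print(looks_like_burst(40, 1.0))   # True
```

This only looks at one CPU, which is the simplest reading of the suggestion.
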
>>>>
>>>> Sorry for the late response!
>>>>
>>>> I just wrote a patch following your suggestion, but it gives no clear
>>>> improvement for this case. I also tried changing the burst checking
>>>> interval, which did not clearly help either.
>>>>
>>>> If I give up runnable load entirely in periodic balancing, the performance
>>>> recovers 60% of the loss.
>>>>
>>>> I will try to optimize wake-up balancing over the weekend.
>>>>
>>>
>>> (btw, the runnable avg needs 345 ms to accumulate to 100%, and 32 ms to
>>> reach 50%)
>>>
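
These two figures can be checked with a little arithmetic. The sketch below
assumes the runnable average is a geometric series over periods of about 1 ms,
with a decay factor y chosen so that y^32 = 1/2; the function names are made up
for illustration.

```python
HALF_LIFE_PERIODS = 32  # assumption: a contribution halves every 32 periods (~32 ms)


def ramp_fraction(periods):
    """Fraction of the maximum runnable sum reached after `periods` fully busy periods.

    Summing 1 + y + y^2 + ... over n periods gives (1 - y^n) / (1 - y), so the
    fraction of the limit 1 / (1 - y) is 1 - y^n, with y = 2 ** (-1 / 32).
    """
    return 1.0 - 2.0 ** (-periods / HALF_LIFE_PERIODS)


def periods_to_reach(fraction):
    """Smallest whole number of busy periods needed to reach `fraction` of the maximum."""
    n = 0
    while ramp_fraction(n) < fraction:
        n += 1
    return n


print(ramp_fraction(32))            # 0.5
print(periods_to_reach(0.5))        # 32
print(ramp_fraction(345) > 0.999)   # True
```

In exact arithmetic the series only approaches 100%; after 345 periods it is
above 99.9%, which is effectively saturated.
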
>>> I have tried some tuning in both wake-up balancing and regular
>>> balancing. Yes, when using instant load weight (without the runnable avg
>>> engaged) in both wake-up and regular balancing, the performance recovered.
>>>
>>> But with per_cpu nr_running tracking, it's hard to find an elegant way to
>>> detect the burst, whether in wake-up or in regular balancing.
>>> In wake-up, all cpus in the sd_llc domain are candidates, so just
>>> checking this_cpu is not enough.
>>> In regular balancing, this_cpu is the migration destination cpu, so
>>> checking for a burst on that cpu is not useful. Instead, we need to check
>>> the increase in task number across the whole domain.
>>>
>>> So, I guess there are 2 solutions for this issue.
>>> 1, for quick wake-up, we need to use instant load (same as current
>>> balancing) to do the balancing; and for regular balancing, we can record
>>> both instant load and runnable load data for the whole domain, then decide
>>> which one to use according to the increase in task number in the domain,
>>> once the whole domain has been tracked.
>>>
>>> 2, we can keep the current instant load balancing as the performance
>>> balance policy, and use runnable load balancing in the power-friendly
>>> policy, since none of us has found a performance benefit with runnable
>>> load balancing on the benchmarks hackbench/kbuild/aim9/tbench/specjbb etc.
>>> I prefer the 2nd.
>>
>> 3, On the other hand, the aim9 testing scenario is rare in
>> real life (prepare thousands of tasks and then wake them all up at the same
>> time), and the runnable load avg includes useful running history info.
>> A 5~7% performance drop on aim9 alone is not unacceptable.
>> (kbuild/hackbench/tbench/specjbb show no clear performance change)
>>
>> So we can accept this drop and leave a reminder in the code. Any comments?
> 
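
To make options 1 and 2 above concrete, here is a rough sketch. The policy
names, function names and the `growth_threshold` value are hypothetical and not
taken from the patches; "instant load" and "runnable load" stand for the two
metrics discussed in this thread.

```python
def load_for_policy(policy, instant_load, runnable_load):
    """Option 2: the chosen policy decides which load metric balancing uses."""
    if policy == "performance":
        return instant_load      # react to what is on the runqueue right now
    if policy == "powersaving":
        return runnable_load     # smoothed by the running history
    raise ValueError("unknown policy: %r" % (policy,))


def load_for_domain(prev_nr_tasks, cur_nr_tasks, instant_load, runnable_load,
                    growth_threshold=2.0):
    """Option 1: track both metrics for the whole domain, then choose one.

    prev_nr_tasks / cur_nr_tasks are per-CPU task counts for every CPU in the
    domain at the previous and the current balance. If the domain total grew
    by more than growth_threshold times, treat it as a burst and use instant load.
    """
    before = sum(prev_nr_tasks)
    now = sum(cur_nr_tasks)
    burst = now > growth_threshold * max(before, 1)
    return instant_load if burst else runnable_load


print(load_for_policy("performance", 2048, 900))                # 2048
print(load_for_domain([1, 0, 1, 0], [6, 5, 7, 6], 2048, 900))   # 2048
print(load_for_domain([2, 2, 2, 2], [2, 3, 2, 2], 2048, 900))   # 900
```
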
> Hm.  Burst of thousands of tasks may be rare and perhaps even silly, but
> what about few task bursts?   History is useless for bursts, they live
> or die now: modest gaggle of worker threads (NR_CPUS) for say video
> encoding job wake in parallel, each is handed a chunk of data to chew up
> in parallel.  Double scheduler latency of one worker (stack workers
> because individuals don't historically fill a cpu), you double latency
> for the entire job every time.
> 
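
A toy model of this example, assuming each worker gets an equal chunk, workers
sharing a CPU run one after another, and the job is done only when the slowest
CPU finishes. The names and numbers are illustrative only.

```python
from collections import Counter


def job_latency(placement, chunk_time):
    """Time until the whole job completes.

    placement:  one CPU id per worker, i.e. where each worker was woken.
    chunk_time: time one worker needs for its chunk when it has a CPU to itself.
    """
    workers_per_cpu = Counter(placement)
    # The most crowded CPU finishes last and sets the latency of the whole job.
    return max(workers_per_cpu.values()) * chunk_time


# Four workers spread over four CPUs: the job takes one chunk time.
print(job_latency([0, 1, 2, 3], 10.0))   # 10.0
# Two workers stacked on CPU 0 while CPU 1 idles: the job takes twice as long.
print(job_latency([0, 0, 2, 3], 10.0))   # 20.0
```
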
> I think 2 is mandatory, keep both, and user picks his poison.
> 
> If you want max burst performance, you care about the here and now
> reality the burst is waking into.  If you're running a google freight
> train farm otoh, you may want some hysteresis so trains don't over-rev
> the electric meter on every microscopic spike.  Both policies make
> sense, but you can't have both performance profiles with either metric,
> so choosing one seems doomed to failure.
> 

Thanks for your suggestions and example, Mike!
I just can't understand your last words here, sorry. What is your
detailed concern about 'both performance profiles with either
metric'? Could you give your preferred solution?

> Case in point: tick skew.  It was removed because synchronized ticking
> saves power.. and then promptly returned under user control because the
> power saving gain also inflicted serious latency pain.
> 
> -Mike
> 


-- 
Thanks Alex
