Re: [PATCH v3 09/22] sched: compute runnable load avg in cpu_load and cpu_avg_load_per_task

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Alex Shi <alex.shi@intel.com>
To: Mike Galbraith <efault@gmx.de>
Cc: Paul Turner <pjt@google.com>, Ingo Molnar <mingo@redhat.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Thomas Gleixner <tglx@linutronix.de>,
	Andrew Morton <akpm@linux-foundation.org>,
	Arjan van de Ven <arjan@linux.intel.com>,
	Borislav Petkov <bp@alien8.de>,
	namhyung@kernel.org, Vincent Guittot <vincent.guittot@linaro.org>,
	Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
	preeti@linux.vnet.ibm.com,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH v3 09/22] sched: compute runnable load avg in cpu_load and cpu_avg_load_per_task
Date: Tue, 22 Jan 2013 15:50:13 +0800	[thread overview]
Message-ID: <50FE44B5.6020004@intel.com> (raw)
In-Reply-To: <1358837740.5782.209.camel@marge.simpson.net>

On 01/22/2013 02:55 PM, Mike Galbraith wrote:
> On Tue, 2013-01-22 at 11:20 +0800, Alex Shi wrote: 
>>>>>>
>>>>>> I just looked into the aim9 benchmark, in this case it forks 2000 tasks,
>>>>>> after all tasks ready, aim9 give a signal than all tasks burst waking up
>>>>>> and run until all finished.
>>>>>> Since each of tasks are finished very quickly, a imbalanced empty cpu
>>>>>> may goes to sleep till a regular balancing give it some new tasks. That
>>>>>> causes the performance dropping. cause more idle entering.
>>>>>
>>>>> Sounds like for AIM (and possibly for other really bursty loads), we
>>>>> might want to do some load-balancing at wakeup time by *just* looking
>>>>> at the number of running tasks, rather than at the load average. Hmm?
>>>>>
>>>>> The load average is fundamentally always going to run behind a bit,
>>>>> and while you want to use it for long-term balancing, a short-term you
>>>>> might want to do just a "if we have a huge amount of runnable
>>>>> processes, do a load balancing *now*". Where "huge amount" should
>>>>> probably be relative to the long-term load balancing (ie comparing the
>>>>> number of runnable processes on this CPU right *now* with the load
>>>>> average over the last second or so would show a clear spike, and a
>>>>> reason for quick action).
>>>>>
>>>>
>>>> Sorry for response late!
>>>>
>>>> Just written a patch following your suggestion, but no clear improvement for this case.
>>>> I also tried change the burst checking interval, also no clear help.
>>>>
>>>> If I totally give up runnable load in periodic balancing, the performance can recover 60%
>>>> of lose.
>>>>
>>>> I will try to optimize wake up balancing in weekend.
>>>>
>>>
>>> (btw, the time for runnable avg to accumulate to 100%, needs 345ms; to
>>> 50% needs 32 ms)
>>>
>>> I have tried some tuning in both wake up balancing and regular
>>> balancing. Yes, when using instant load weight (without runnable avg
>>> engage), both in waking up, and regular balance, the performance recovered.
>>>
>>> But with per_cpu nr_running tracking, it's hard to find a elegant way to
>>> detect the burst whenever in waking up or in regular balance.
>>> In waking up, the whole sd_llc domain cpus are candidates, so just
>>> checking this_cpu is not enough.
>>> In regular balance, this_cpu is the migration destination cpu, checking
>>> if the burst on the cpu is not useful. Instead, we need to check whole
>>> domains' increased task number.
>>>
>>> So, guess 2 solutions for this issue.
>>> 1, for quick waking up, we need use instant load(same as current
>>> balancing) to do balance; and for regular balance, we can record both
>>> instant load and runnable load data for whole domain, then decide which
>>> one to use according to task number increasing in the domain after
>>> tracking done the whole domain.
>>>
>>> 2, we can keep current instant load balancing as performance balance
>>> policy, and using runnable load balancing in power friend policy.
>>> Since, none of us find performance benefit with runnable load balancing
>>> on benchmark hackbench/kbuild/aim9/tbench/specjbb etc.
>>> I prefer the 2nd.
>>
>> 3, On the other hand, Considering the aim9 testing scenario is rare in
>> real life(prepare thousands tasks and then wake up them at the same
>> time). And the runnable load avg includes useful running history info.
>> Only aim9 5~7% performance dropping is not unacceptable.
>> (kbuild/hackbench/tbench/specjbb have no clear performance change)
>>
>> So we can let this drop be with a reminder in code. Any comments?
> 
> Hm.  Burst of thousands of tasks may be rare and perhaps even silly, but
> what about few task bursts?   History is useless for bursts, they live
> or die now: modest gaggle of worker threads (NR_CPUS) for say video
> encoding job wake in parallel, each is handed a chunk of data to chew up
> in parallel.  Double scheduler latency of one worker (stack workers
> because individuals don't historically fill a cpu), you double latency
> for the entire job every time.
> 
> I think 2 is mandatory, keep both, and user picks his poison.
> 
> If you want max burst performance, you care about the here and now
> reality the burst is waking into.  If you're running a google freight
> train farm otoh, you may want some hysteresis so trains don't over-rev
> the electric meter on every microscopic spike.  Both policies make
> sense, but you can't have both performance profiles with either metric,
> so choosing one seems doomed to failure.
> 

Thanks for your suggestions and example, Mike!
I just can't understand the your last words here, Sorry. what the
detailed concern of you on 'both performance profiles with either
metric'? Could you like to give your preferred solutions?

> Case in point: tick skew.  It was removed because synchronized ticking
> saves power.. and then promptly returned under user control because the
> power saving gain also inflicted serious latency pain.
> 
> -Mike
> 


-- 
Thanks Alex

next prev parent reply	other threads:[~2013-01-22  7:49 UTC|newest]

Thread overview: 91+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2013-01-05  8:37 [PATCH V3 0/22] sched: simplified fork, enable load average into LB and power awareness scheduling Alex Shi
2013-01-05  8:37 ` [PATCH v3 01/22] sched: set SD_PREFER_SIBLING on MC domain to reduce a domain level Alex Shi
2013-01-05  8:37 ` [PATCH v3 02/22] sched: select_task_rq_fair clean up Alex Shi
2013-01-11  4:57   ` Preeti U Murthy
2013-01-05  8:37 ` [PATCH v3 03/22] sched: fix find_idlest_group mess logical Alex Shi
2013-01-11  4:59   ` Preeti U Murthy
2013-01-05  8:37 ` [PATCH v3 04/22] sched: don't need go to smaller sched domain Alex Shi
2013-01-09 17:38   ` Morten Rasmussen
2013-01-10  3:16     ` Mike Galbraith
2013-01-11  5:02   ` Preeti U Murthy
2013-01-05  8:37 ` [PATCH v3 05/22] sched: remove domain iterations in fork/exec/wake Alex Shi
2013-01-09 18:21   ` Morten Rasmussen
2013-01-11  2:46     ` Alex Shi
2013-01-11 10:07       ` Morten Rasmussen
2013-01-11 14:50         ` Alex Shi
2013-01-14  8:55         ` li guang
2013-01-14  9:18           ` Alex Shi
2013-01-11  4:56     ` Preeti U Murthy
2013-01-11  8:01       ` li guang
2013-01-11 14:56         ` Alex Shi
2013-01-14  9:03           ` li guang
2013-01-15  2:34             ` Alex Shi
2013-01-16  1:54               ` li guang
2013-01-11 10:54       ` Morten Rasmussen
2013-01-16  5:43       ` Alex Shi
2013-01-16  7:41         ` Alex Shi
2013-01-05  8:37 ` [PATCH v3 06/22] sched: load tracking bug fix Alex Shi
2013-01-05  8:37 ` [PATCH v3 07/22] sched: set initial load avg of new forked task Alex Shi
2013-01-11  5:10   ` Preeti U Murthy
2013-01-11  5:44     ` Alex Shi
2013-01-05  8:37 ` [PATCH v3 08/22] sched: update cpu load after task_tick Alex Shi
2013-01-05  8:37 ` [PATCH v3 09/22] sched: compute runnable load avg in cpu_load and cpu_avg_load_per_task Alex Shi
2013-01-05  8:56   ` Alex Shi
2013-01-06  7:54     ` Alex Shi
2013-01-06 18:31       ` Linus Torvalds
2013-01-07  7:00         ` Preeti U Murthy
2013-01-08 14:27         ` Alex Shi
2013-01-11  6:31         ` Alex Shi
2013-01-21 14:47           ` Alex Shi
2013-01-22  3:20             ` Alex Shi
2013-01-22  6:55               ` Mike Galbraith
2013-01-22  7:50                 ` Alex Shi [this message]
2013-01-22  9:52                   ` Mike Galbraith
2013-01-23  0:36                     ` Alex Shi
2013-01-23  1:47                       ` Mike Galbraith
2013-01-23  2:01                         ` Alex Shi
2013-01-05  8:37 ` [PATCH v3 10/22] sched: consider runnable load average in move_tasks Alex Shi
2013-01-05  8:37 ` [PATCH v3 11/22] sched: consider runnable load average in effective_load Alex Shi
2013-01-10 11:28   ` Morten Rasmussen
2013-01-11  3:26     ` Alex Shi
2013-01-14 12:01       ` Morten Rasmussen
2013-01-16  5:30         ` Alex Shi
2013-01-05  8:37 ` [PATCH v3 12/22] Revert "sched: Introduce temporary FAIR_GROUP_SCHED dependency for load-tracking" Alex Shi
2013-01-05  8:37 ` [PATCH v3 13/22] sched: add sched_policy in kernel Alex Shi
2013-01-05  8:37 ` [PATCH v3 14/22] sched: add sched_policy and it's sysfs interface Alex Shi
2013-01-14  6:53   ` Namhyung Kim
2013-01-14  8:11     ` Alex Shi
2013-01-05  8:37 ` [PATCH v3 15/22] sched: log the cpu utilization at rq Alex Shi
2013-01-10 11:40   ` Morten Rasmussen
2013-01-11  3:30     ` Alex Shi
2013-01-14 13:59       ` Morten Rasmussen
2013-01-16  5:53         ` Alex Shi
2013-01-05  8:37 ` [PATCH v3 16/22] sched: add power aware scheduling in fork/exec/wake Alex Shi
2013-01-10 15:01   ` Morten Rasmussen
2013-01-11  7:08     ` Alex Shi
2013-01-14 16:09       ` Morten Rasmussen
2013-01-16  6:02         ` Alex Shi
2013-01-16 14:27           ` Morten Rasmussen
2013-01-17  5:47             ` Namhyung Kim
2013-01-18 13:41               ` Alex Shi
2013-01-14  7:03   ` Namhyung Kim
2013-01-14  8:30     ` Alex Shi
2013-01-05  8:37 ` [PATCH v3 17/22] sched: packing small tasks in wake/exec balancing Alex Shi
2013-01-10 17:17   ` Morten Rasmussen
2013-01-11  3:47     ` Alex Shi
2013-01-14  7:13       ` Namhyung Kim
2013-01-16  6:11         ` Alex Shi
2013-01-16 12:52           ` Namhyung Kim
2013-01-14 17:00       ` Morten Rasmussen
2013-01-16  7:32         ` Alex Shi
2013-01-16 15:08           ` Morten Rasmussen
2013-01-18 14:06             ` Alex Shi
2013-01-05  8:37 ` [PATCH v3 18/22] sched: add power/performance balance allowed flag Alex Shi
2013-01-05  8:37 ` [PATCH v3 19/22] sched: pull all tasks from source group Alex Shi
2013-01-05  8:37 ` [PATCH v3 20/22] sched: don't care if the local group has capacity Alex Shi
2013-01-05  8:37 ` [PATCH v3 21/22] sched: power aware load balance, Alex Shi
2013-01-05  8:37 ` [PATCH v3 22/22] sched: lazy powersaving balance Alex Shi
2013-01-14  8:39   ` Namhyung Kim
2013-01-14  8:45     ` Alex Shi
2013-01-09 17:16 ` [PATCH V3 0/22] sched: simplified fork, enable load average into LB and power awareness scheduling Morten Rasmussen
2013-01-10  3:49   ` Alex Shi

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=50FE44B5.6020004@intel.com \
    --to=alex.shi@intel.com \
    --cc=akpm@linux-foundation.org \
    --cc=arjan@linux.intel.com \
    --cc=bp@alien8.de \
    --cc=efault@gmx.de \
    --cc=gregkh@linuxfoundation.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mingo@redhat.com \
    --cc=namhyung@kernel.org \
    --cc=peterz@infradead.org \
    --cc=pjt@google.com \
    --cc=preeti@linux.vnet.ibm.com \
    --cc=tglx@linutronix.de \
    --cc=torvalds@linux-foundation.org \
    --cc=vincent.guittot@linaro.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.