public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
From: Charles Wang <muming.wq@gmail.com>
To: Peter Zijlstra <peterz@infradead.org>
Cc: linux-kernel@vger.kernel.org, "Ingo Molnar" <mingo@redhat.com>,
	"Charles Wang" <muming.wq@taobao.com>, "Tao Ma" <tm@tao.ma>,
	含黛 <handai.szj@taobao.com>
Subject: Re: [PATCH] sched: Folding nohz load accounting more accurate
Date: Tue, 12 Jun 2012 17:34:10 +0800	[thread overview]
Message-ID: <4FD70D12.5030404@gmail.com> (raw)
In-Reply-To: <1339429374.30462.54.camel@twins>

OK! Configure the Thunderbird as Documentation/email-clients.txt saying.
Sorry for last email contained with html. No html now.

On Monday, June 11, 2012 11:42 PM, Peter Zijlstra wrote:

> On Sat, 2012-06-09 at 18:54 +0800, Charles Wang wrote:
>> After patch 453494c3d4 (sched: Fix nohz load accounting -- again!), we can fold
>> the idle into calc_load_tasks_idle between the last cpu load calculating and
>> calc_global_load calling. However problem still exits between the first cpu 
>> load calculating and the last cpu load calculating. Every time when we do load 
>> calculating, calc_load_tasks_idle will be added into calc_load_tasks, even if
>> the idle load is caused by calculated cpus. This problem is also described in
>> the following link:
>>
>> https://lkml.org/lkml/2012/5/24/419
> 
> Don't put links in like this, if its useful describe it. As it stands I
> think your paragraph and that link are trying to say the same.
> 
> Also, since you found Tao's email, why didn't you CC him?
> 

My mistake! And this problem is first reported by Sha Zhengju
<handai.szj@taobao.com>, so i also add her in.

> I find it hard to understand either description.
> 
> Are you in fact saying:
> 
> 
> cpu0		cpu1		cpu2
> 
> calc_load_account_active()
> 
> 		calc_load_account_active()
> 		-> idle
> 	,----	   calc_load_account_idle()
> 	|
> 	|			calc_load_account_active()
> 	`--------->		  calc_load_fold_idle()
> 
> 
> That when a cpu goes idle during the per-cpu active folding window
> it might happen that a cpu goes idle after its fold, but before a next
> cpu's fold, so that its calc_load_tasks_idle contribution gets folded
> back in. As per the above picture.
> 
> Now _why_ is this a problem? 
> 

consider following case:

5HZ+1
| cpu0_load	cpu1	cpu2	cpu3	calc_load_tasks
|    1		 1	 1	 1	
|  -->calc_load				    1
|    1		 1	 1	 1	
|       	     -->calc_load	    2
|    0		 0	 1	 0
|		     -->calc_load	    2+1-3=1
|    1		 1	 0	 1	
|			     -->calc_load   1-1=0
V
5HZ+11     -->calc_global_load              0



actually the load should be around 3, but shows nearly 0.

1 tick is much long for some workloads.

> All this code tries to do is:
> 
>   nr_active = 0;
>   for_each_possible_cpu(cpu)
> 	nr_active += cpu_rq(cpu)->nr_running + cpu_rq(cpu)
> 
> Without actually iterating all cpus because on large systems that's
> prohibitively expensive.
> 
> So instead we do the calc_load_fold_active() thing on each individual
> cpu:
> 
>   nr_active = this_rq->nr_running + this_rq->nr_uninterruptible;
>   delta = nr_active - this_rq->calc_load_active;
>   this_rq->calc_load_active = nr_active;
> 
>   atomic_long_add(delta, &calc_load_tasks);
> 
> The first does a straight sum, the second does a sum of deltas, both
> should give the same number.
> 
> 
>   \Sum_i x_i(t) = \Sum_i x_i(t) - x_i(t_0) | x_i(t_0) := 0
>                 = \Sum_i { \Sum_j=1 x_i(t_j) - x_i(t_j-1) }
> 
> The whole NO_HZ thing complicates this a little further in order to not
> have to wake an idle cpu. I'll try and do the above sum with the nohz
> thing in.. maybe something obvious falls out.
> 
> Now all this is sampling.. we take a nr_active sample once every
> LOAD_FREQ (5s). Anything that happens inside this doesn't exist.
> 
> So how is the above scenario different from the cpu going idle right
> before we take the sample?
> 
>> This bug can be found in our work load. The average running processes number 
>> is about 15, but the load only shows about 4.
> 
> Right,. still not really sure what the bug is though.
> 
>> The patch provides a solution, by taking calculated load cpus' idle away from
>>  real effective idle. 
> 
> -ENOPARSE
> 
>> First adds a cpumask to record those cpus that alread 
>> calculated their load, and then adds a calc_unmask_cpu_load_idle to record 
>> thoses not marked cpus' go-idle load. Calc_unmask_cpu_load_idle takes place 
>> of calc_load_tasks_idle to be added into calc_load_tasks every 5HZ
> 
> 5 seconds
> 
>>  when cpu 
>> calculate its load. Go-idle load on those cpus which load alread has been calculated 
>> will only be added into calc_load_tasks_idle, no in calc_unmask_cpu_load_idle.
> 
> I don't like the solution, keeping that mask is expensive. Also, I still
> don't understand what the problem is.
> 

It's a problem. And I plan to use per-cpu var to take place of cpumask.

> A few nits on the patch below..
> 
>> ---
>>  include/linux/sched.h     |    1 +
>>  kernel/sched/core.c       |   83 ++++++++++++++++++++++++++++++++++++++++++++-
>>  kernel/time/timekeeping.c |    1 +
>>  3 files changed, 84 insertions(+), 1 deletion(-)
>>
>> diff --git a/include/linux/sched.h b/include/linux/sched.h
>> index 6029d8c..a2b8df2 100644
>> --- a/include/linux/sched.h
>> +++ b/include/linux/sched.h
>> @@ -145,6 +145,7 @@ extern unsigned long this_cpu_load(void);
>>  
>>
>>  extern void calc_global_load(unsigned long ticks);
>> +extern void prepare_idle_mask(unsigned long ticks);
>>  extern void update_cpu_load_nohz(void);
>>  
>>  extern unsigned long get_parent_ip(unsigned long addr);
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index c46958e..bdfe3c2 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -2164,6 +2164,7 @@ unsigned long this_cpu_load(void)
>>  /* Variables and functions for calc_load */
>>  static atomic_long_t calc_load_tasks;
>>  static unsigned long calc_load_update;
>> +static unsigned long idle_mask_update;
>>  unsigned long avenrun[3];
>>  EXPORT_SYMBOL(avenrun);
>>  
>> @@ -2199,13 +2200,38 @@ calc_load(unsigned long load, unsigned long exp, unsigned long active)
>>   */
>>  static atomic_long_t calc_load_tasks_idle;
>>  
>> +/* 
>> + * Those cpus whose load alread has been calculated in this LOAD_FREQ 
>> + * period will be masked.
>> + */
>> +struct cpumask  cpu_load_update_mask;
> 
> This should be a cpumask_var_t and allocated somewhere..
> 
>> +
>> +/* 
>> + * Fold unmask cpus' idle load
>> + */
>> +static atomic_long_t calc_unmask_cpu_load_idle;
>> +
>> +
>>  void calc_load_account_idle(struct rq *this_rq)
>>  {
>>  	long delta;
>> +	int cpu = smp_processor_id();
>> +
>>  
>>  	delta = calc_load_fold_active(this_rq);
>>  	if (delta)
>> +	{
> 
> wrong bracket placement
> 
>>  		atomic_long_add(delta, &calc_load_tasks_idle);
>> +		/*
>> +		 * calc_unmask_cpu_load_idle is only used between the first cpu load accounting
>> +		 * and the last cpu load accounting in every LOAD_FREQ period, and records idle load on
>> +		 * those unmask cpus.
>> +		 */
>> +		if (!cpumask_empty(&cpu_load_update_mask) && !cpumask_test_cpu(cpu, &cpu_load_update_mask))
>> +		{
>> +			atomic_long_add(delta, &calc_unmask_cpu_load_idle);
> 
> lines are too long, brackets are placed wrong and brackets are
> superfluous.
> 
> 
>> +		}
>> +       }
>>  }
>>  
> 
> and many more similar nits.
> 

Thanks for pointing these out!

>> diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
>> index 6e46cac..afbc06a 100644
>> --- a/kernel/time/timekeeping.c
>> +++ b/kernel/time/timekeeping.c
>> @@ -1222,6 +1222,7 @@ void do_timer(unsigned long ticks)
>>  	jiffies_64 += ticks;
>>  	update_wall_time();
>>  	calc_global_load(ticks);
>> +	prepare_idle_mask(ticks);
> 
> Why not add this to calc_global_load() ? 
> 

If some CPUs go into nohz and get up after calc_global_load,
then calc_load_account_active will be called after calc_global_load.
This will cause identifying of first calc_load_account_active cpu being
much more complicated.

>>  }
>>  
>>  /**
> 



  parent reply	other threads:[~2012-06-12  9:34 UTC|newest]

Thread overview: 38+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2012-06-09 10:54 [PATCH] sched: Folding nohz load accounting more accurate Charles Wang
2012-06-11 15:42 ` Peter Zijlstra
     [not found]   ` <4FD6BFC4.1060302@gmail.com>
2012-06-12  8:54     ` Peter Zijlstra
2012-06-12  9:34   ` Charles Wang [this message]
2012-06-12  9:56     ` Peter Zijlstra
2012-06-13  5:55       ` Doug Smythies
2012-06-13  7:56         ` Charles Wang
2012-06-14  4:41           ` Doug Smythies
2012-06-14 15:42             ` Charles Wang
2012-06-16  6:42               ` Doug Smythies
2012-06-13  8:16         ` Peter Zijlstra
2012-06-13 15:33           ` Doug Smythies
2012-06-13 21:57             ` Peter Zijlstra
2012-06-14  3:13               ` Doug Smythies
2012-06-18 10:13                 ` Peter Zijlstra
2012-07-20 19:24         ` sched: care and feeding of load-avg code (Re: [PATCH] sched: Folding nohz load accounting more accurate) Jonathan Nieder
2012-06-15 14:27       ` [PATCH] sched: Folding nohz load accounting more accurate Charles Wang
2012-06-15 17:39         ` Peter Zijlstra
2012-06-16 14:53           ` Doug Smythies
2012-06-18  6:41             ` Doug Smythies
2012-06-18 14:41               ` Charles Wang
2012-06-18 10:06           ` Charles Wang
2012-06-18 16:03         ` Peter Zijlstra
2012-06-19  6:08           ` Yong Zhang
2012-06-19  9:18             ` Peter Zijlstra
2012-06-19 15:50               ` Doug Smythies
2012-06-20  9:45                 ` Peter Zijlstra
2012-06-21  4:12                   ` Doug Smythies
2012-06-21  6:35                     ` Charles Wang
2012-06-21  8:48                     ` Peter Zijlstra
2012-06-22 14:03                     ` Peter Zijlstra
2012-06-24 21:45                       ` Doug Smythies
2012-07-03 16:01                         ` Doug Smythies
2012-06-25  2:15                       ` Charles Wang
2012-07-06  6:19                       ` [tip:sched/core] sched/nohz: Rewrite and fix load-avg computation -- again tip-bot for Peter Zijlstra
2012-06-19  6:19           ` [PATCH] sched: Folding nohz load accounting more accurate Doug Smythies
2012-06-19  6:24           ` Charles Wang
2012-06-19  9:57             ` Peter Zijlstra

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=4FD70D12.5030404@gmail.com \
    --to=muming.wq@gmail.com \
    --cc=handai.szj@taobao.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mingo@redhat.com \
    --cc=muming.wq@taobao.com \
    --cc=peterz@infradead.org \
    --cc=tm@tao.ma \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox