Re: [PATCH] sched: Folding nohz load accounting more accurate

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Charles Wang <muming.wq@gmail.com>
To: Peter Zijlstra <peterz@infradead.org>
Cc: linux-kernel@vger.kernel.org, "Ingo Molnar" <mingo@redhat.com>,
	"Charles Wang" <muming.wq@taobao.com>, "Tao Ma" <tm@tao.ma>,
	含黛 <handai.szj@taobao.com>
Subject: Re: [PATCH] sched: Folding nohz load accounting more accurate
Date: Tue, 12 Jun 2012 17:34:10 +0800	[thread overview]
Message-ID: <4FD70D12.5030404@gmail.com> (raw)
In-Reply-To: <1339429374.30462.54.camel@twins>

OK! Configure the Thunderbird as Documentation/email-clients.txt saying.
Sorry for last email contained with html. No html now.

On Monday, June 11, 2012 11:42 PM, Peter Zijlstra wrote:

> On Sat, 2012-06-09 at 18:54 +0800, Charles Wang wrote:
>> After patch 453494c3d4 (sched: Fix nohz load accounting -- again!), we can fold
>> the idle into calc_load_tasks_idle between the last cpu load calculating and
>> calc_global_load calling. However problem still exits between the first cpu 
>> load calculating and the last cpu load calculating. Every time when we do load 
>> calculating, calc_load_tasks_idle will be added into calc_load_tasks, even if
>> the idle load is caused by calculated cpus. This problem is also described in
>> the following link:
>>
>> https://lkml.org/lkml/2012/5/24/419
> 
> Don't put links in like this, if its useful describe it. As it stands I
> think your paragraph and that link are trying to say the same.
> 
> Also, since you found Tao's email, why didn't you CC him?
> 

My mistake! And this problem is first reported by Sha Zhengju
<handai.szj@taobao.com>, so i also add her in.

> I find it hard to understand either description.
> 
> Are you in fact saying:
> 
> 
> cpu0		cpu1		cpu2
> 
> calc_load_account_active()
> 
> 		calc_load_account_active()
> 		-> idle
> 	,----	   calc_load_account_idle()
> 	|
> 	|			calc_load_account_active()
> 	`--------->		  calc_load_fold_idle()
> 
> 
> That when a cpu goes idle during the per-cpu active folding window
> it might happen that a cpu goes idle after its fold, but before a next
> cpu's fold, so that its calc_load_tasks_idle contribution gets folded
> back in. As per the above picture.
> 
> Now _why_ is this a problem? 
> 

consider following case:

5HZ+1
| cpu0_load	cpu1	cpu2	cpu3	calc_load_tasks
|    1		 1	 1	 1	
|  -->calc_load				    1
|    1		 1	 1	 1	
|       	     -->calc_load	    2
|    0		 0	 1	 0
|		     -->calc_load	    2+1-3=1
|    1		 1	 0	 1	
|			     -->calc_load   1-1=0
V
5HZ+11     -->calc_global_load              0



actually the load should be around 3, but shows nearly 0.

1 tick is much long for some workloads.

> All this code tries to do is:
> 
>   nr_active = 0;
>   for_each_possible_cpu(cpu)
> 	nr_active += cpu_rq(cpu)->nr_running + cpu_rq(cpu)
> 
> Without actually iterating all cpus because on large systems that's
> prohibitively expensive.
> 
> So instead we do the calc_load_fold_active() thing on each individual
> cpu:
> 
>   nr_active = this_rq->nr_running + this_rq->nr_uninterruptible;
>   delta = nr_active - this_rq->calc_load_active;
>   this_rq->calc_load_active = nr_active;
> 
>   atomic_long_add(delta, &calc_load_tasks);
> 
> The first does a straight sum, the second does a sum of deltas, both
> should give the same number.
> 
> 
>   \Sum_i x_i(t) = \Sum_i x_i(t) - x_i(t_0) | x_i(t_0) := 0
>                 = \Sum_i { \Sum_j=1 x_i(t_j) - x_i(t_j-1) }
> 
> The whole NO_HZ thing complicates this a little further in order to not
> have to wake an idle cpu. I'll try and do the above sum with the nohz
> thing in.. maybe something obvious falls out.
> 
> Now all this is sampling.. we take a nr_active sample once every
> LOAD_FREQ (5s). Anything that happens inside this doesn't exist.
> 
> So how is the above scenario different from the cpu going idle right
> before we take the sample?
> 
>> This bug can be found in our work load. The average running processes number 
>> is about 15, but the load only shows about 4.
> 
> Right,. still not really sure what the bug is though.
> 
>> The patch provides a solution, by taking calculated load cpus' idle away from
>>  real effective idle. 
> 
> -ENOPARSE
> 
>> First adds a cpumask to record those cpus that alread 
>> calculated their load, and then adds a calc_unmask_cpu_load_idle to record 
>> thoses not marked cpus' go-idle load. Calc_unmask_cpu_load_idle takes place 
>> of calc_load_tasks_idle to be added into calc_load_tasks every 5HZ
> 
> 5 seconds
> 
>>  when cpu 
>> calculate its load. Go-idle load on those cpus which load alread has been calculated 
>> will only be added into calc_load_tasks_idle, no in calc_unmask_cpu_load_idle.
> 
> I don't like the solution, keeping that mask is expensive. Also, I still
> don't understand what the problem is.
> 

It's a problem. And I plan to use per-cpu var to take place of cpumask.

> A few nits on the patch below..
> 
>> ---
>>  include/linux/sched.h     |    1 +
>>  kernel/sched/core.c       |   83 ++++++++++++++++++++++++++++++++++++++++++++-
>>  kernel/time/timekeeping.c |    1 +
>>  3 files changed, 84 insertions(+), 1 deletion(-)
>>
>> diff --git a/include/linux/sched.h b/include/linux/sched.h
>> index 6029d8c..a2b8df2 100644
>> --- a/include/linux/sched.h
>> +++ b/include/linux/sched.h
>> @@ -145,6 +145,7 @@ extern unsigned long this_cpu_load(void);
>>  
>>
>>  extern void calc_global_load(unsigned long ticks);
>> +extern void prepare_idle_mask(unsigned long ticks);
>>  extern void update_cpu_load_nohz(void);
>>  
>>  extern unsigned long get_parent_ip(unsigned long addr);
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index c46958e..bdfe3c2 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -2164,6 +2164,7 @@ unsigned long this_cpu_load(void)
>>  /* Variables and functions for calc_load */
>>  static atomic_long_t calc_load_tasks;
>>  static unsigned long calc_load_update;
>> +static unsigned long idle_mask_update;
>>  unsigned long avenrun[3];
>>  EXPORT_SYMBOL(avenrun);
>>  
>> @@ -2199,13 +2200,38 @@ calc_load(unsigned long load, unsigned long exp, unsigned long active)
>>   */
>>  static atomic_long_t calc_load_tasks_idle;
>>  
>> +/* 
>> + * Those cpus whose load alread has been calculated in this LOAD_FREQ 
>> + * period will be masked.
>> + */
>> +struct cpumask  cpu_load_update_mask;
> 
> This should be a cpumask_var_t and allocated somewhere..
> 
>> +
>> +/* 
>> + * Fold unmask cpus' idle load
>> + */
>> +static atomic_long_t calc_unmask_cpu_load_idle;
>> +
>> +
>>  void calc_load_account_idle(struct rq *this_rq)
>>  {
>>  	long delta;
>> +	int cpu = smp_processor_id();
>> +
>>  
>>  	delta = calc_load_fold_active(this_rq);
>>  	if (delta)
>> +	{
> 
> wrong bracket placement
> 
>>  		atomic_long_add(delta, &calc_load_tasks_idle);
>> +		/*
>> +		 * calc_unmask_cpu_load_idle is only used between the first cpu load accounting
>> +		 * and the last cpu load accounting in every LOAD_FREQ period, and records idle load on
>> +		 * those unmask cpus.
>> +		 */
>> +		if (!cpumask_empty(&cpu_load_update_mask) && !cpumask_test_cpu(cpu, &cpu_load_update_mask))
>> +		{
>> +			atomic_long_add(delta, &calc_unmask_cpu_load_idle);
> 
> lines are too long, brackets are placed wrong and brackets are
> superfluous.
> 
> 
>> +		}
>> +       }
>>  }
>>  
> 
> and many more similar nits.
> 

Thanks for pointing these out!

>> diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
>> index 6e46cac..afbc06a 100644
>> --- a/kernel/time/timekeeping.c
>> +++ b/kernel/time/timekeeping.c
>> @@ -1222,6 +1222,7 @@ void do_timer(unsigned long ticks)
>>  	jiffies_64 += ticks;
>>  	update_wall_time();
>>  	calc_global_load(ticks);
>> +	prepare_idle_mask(ticks);
> 
> Why not add this to calc_global_load() ? 
> 

If some CPUs go into nohz and get up after calc_global_load,
then calc_load_account_active will be called after calc_global_load.
This will cause identifying of first calc_load_account_active cpu being
much more complicated.

>>  }
>>  
>>  /**
>

next prev parent reply	other threads:[~2012-06-12  9:34 UTC|newest]

Thread overview: 38+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2012-06-09 10:54 [PATCH] sched: Folding nohz load accounting more accurate Charles Wang
2012-06-11 15:42 ` Peter Zijlstra
     [not found]   ` <4FD6BFC4.1060302@gmail.com>
2012-06-12  8:54     ` Peter Zijlstra
2012-06-12  9:34   ` Charles Wang [this message]
2012-06-12  9:56     ` Peter Zijlstra
2012-06-13  5:55       ` Doug Smythies
2012-06-13  7:56         ` Charles Wang
2012-06-14  4:41           ` Doug Smythies
2012-06-14 15:42             ` Charles Wang
2012-06-16  6:42               ` Doug Smythies
2012-06-13  8:16         ` Peter Zijlstra
2012-06-13 15:33           ` Doug Smythies
2012-06-13 21:57             ` Peter Zijlstra
2012-06-14  3:13               ` Doug Smythies
2012-06-18 10:13                 ` Peter Zijlstra
2012-07-20 19:24         ` sched: care and feeding of load-avg code (Re: [PATCH] sched: Folding nohz load accounting more accurate) Jonathan Nieder
2012-06-15 14:27       ` [PATCH] sched: Folding nohz load accounting more accurate Charles Wang
2012-06-15 17:39         ` Peter Zijlstra
2012-06-16 14:53           ` Doug Smythies
2012-06-18  6:41             ` Doug Smythies
2012-06-18 14:41               ` Charles Wang
2012-06-18 10:06           ` Charles Wang
2012-06-18 16:03         ` Peter Zijlstra
2012-06-19  6:08           ` Yong Zhang
2012-06-19  9:18             ` Peter Zijlstra
2012-06-19 15:50               ` Doug Smythies
2012-06-20  9:45                 ` Peter Zijlstra
2012-06-21  4:12                   ` Doug Smythies
2012-06-21  6:35                     ` Charles Wang
2012-06-21  8:48                     ` Peter Zijlstra
2012-06-22 14:03                     ` Peter Zijlstra
2012-06-24 21:45                       ` Doug Smythies
2012-07-03 16:01                         ` Doug Smythies
2012-06-25  2:15                       ` Charles Wang
2012-07-06  6:19                       ` [tip:sched/core] sched/nohz: Rewrite and fix load-avg computation -- again tip-bot for Peter Zijlstra
2012-06-19  6:19           ` [PATCH] sched: Folding nohz load accounting more accurate Doug Smythies
2012-06-19  6:24           ` Charles Wang
2012-06-19  9:57             ` Peter Zijlstra

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=4FD70D12.5030404@gmail.com \
    --to=muming.wq@gmail.com \
    --cc=handai.szj@taobao.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mingo@redhat.com \
    --cc=muming.wq@taobao.com \
    --cc=peterz@infradead.org \
    --cc=tm@tao.ma \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.