[QUERY] Confusing usage of rq->nr_running in load balancing

linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [QUERY] Confusing usage of rq->nr_running in load balancing
@ 2014-09-03 12:21 Preeti U Murthy
  2014-09-03 15:30 ` Peter Zijlstra
  2014-09-03 16:58 ` Vincent Guittot
  0 siblings, 2 replies; 7+ messages in thread
From: Preeti U Murthy @ 2014-09-03 12:21 UTC (permalink / raw)
  To: Vincent Guittot, peterz@infradead.org, Ingo Molnar, Rik van Riel,
	Morten Rasmussen
  Cc: LKML, Mike Galbraith, Nicolas Pitre, daniel.lezcano@linaro.org,
	Dietmar Eggemann, Kamalesh Babulal, Srikar Dronamraju

Hi,

There are places in kernel/sched/fair.c in the load balancing part where
rq->nr_running is used as against cfs_rq->nr_running. At least I could
not make out why the former was used in the following scenarios.
It looks to me that it can very well lead to incorrect load balancing.
Also I did not pay attention to the numa balancing part of the code
while skimming through this file to catch this scenario. There are a
couple of places there too which need to be scrutinized.

1. load_balance(): The check (busiest->nr_running > 1)
The load balancing would be futile if there are tasks of other
scheduling classes, wouldn't it?

2. active_load_balance_cpu_stop(): A similar check and a similar
consequence as 1 here.

3. nohz_kick_needed() : We check for more than one task on the runqueue
and hence trigger load balancing even if there are rt-tasks.

4. cpu_avg_load_per_task(): This stands out among the rest as an
incorrect usage of rq->nr_running in place of cfs_rq->nr_running. We
divide the load associated with the cfs_rq by the number of tasks on the
rq. This will make the cfs_rq load look smaller.

5. task_hot() : I am not too sure about the consequences of using
rq->nr_running here.

6. update_sg_lb_stats(): sgs->sum_nr_running is the sum of
rq->nr_running and propogates thus throughout the load balancing code path.

7. sg_capacity_factor(): Returns the capacity factor measured against
the cpu capacity available to fair tasks. We then compare this with the
rq->nr_running in update_sg_lb_stats(), update_sd_pick_busiest() and
calculate_imbalance()

8. find_busiest_queue(): This anomaly shows up when we filter against
rq->nr_running == 1 and imbalance cannot be taken care of by the
existing task on this rq.

Did I miss something or is it true that the usage of rq->nr_running in
the above places is incorrect?

Thanks

Regards
Preeti U Murthy

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [QUERY] Confusing usage of rq->nr_running in load balancing
  2014-09-03 12:21 [QUERY] Confusing usage of rq->nr_running in load balancing Preeti U Murthy
@ 2014-09-03 15:30 ` Peter Zijlstra
  2014-09-03 16:58 ` Vincent Guittot
  1 sibling, 0 replies; 7+ messages in thread
From: Peter Zijlstra @ 2014-09-03 15:30 UTC (permalink / raw)
  To: Preeti U Murthy
  Cc: Vincent Guittot, Ingo Molnar, Rik van Riel, Morten Rasmussen,
	LKML, Mike Galbraith, Nicolas Pitre, daniel.lezcano@linaro.org,
	Dietmar Eggemann, Kamalesh Babulal, Srikar Dronamraju

On Wed, Sep 03, 2014 at 05:51:45PM +0530, Preeti U Murthy wrote:
> Hi,
> 
> There are places in kernel/sched/fair.c in the load balancing part where
> rq->nr_running is used as against cfs_rq->nr_running. At least I could

> Did I miss something or is it true that the usage of rq->nr_running in
> the above places is incorrect?

Yeah, smells fishy. Probably hysterical accidents that haven't been
cleaned up yet. I do remember a few patches cleaning some of this up
recently. Clearly there's more to do.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [QUERY] Confusing usage of rq->nr_running in load balancing
  2014-09-03 12:21 [QUERY] Confusing usage of rq->nr_running in load balancing Preeti U Murthy
  2014-09-03 15:30 ` Peter Zijlstra
@ 2014-09-03 16:58 ` Vincent Guittot
  2014-09-05 12:19   ` Preeti U Murthy
  2014-09-15  4:16   ` Preeti U Murthy
  1 sibling, 2 replies; 7+ messages in thread
From: Vincent Guittot @ 2014-09-03 16:58 UTC (permalink / raw)
  To: Preeti U Murthy
  Cc: peterz@infradead.org, Ingo Molnar, Rik van Riel, Morten Rasmussen,
	LKML, Mike Galbraith, Nicolas Pitre, daniel.lezcano@linaro.org,
	Dietmar Eggemann, Kamalesh Babulal, Srikar Dronamraju

On 3 September 2014 14:21, Preeti U Murthy <preeti@linux.vnet.ibm.com> wrote:
> Hi,

Hi Preeti,

>
> There are places in kernel/sched/fair.c in the load balancing part where
> rq->nr_running is used as against cfs_rq->nr_running. At least I could
> not make out why the former was used in the following scenarios.
> It looks to me that it can very well lead to incorrect load balancing.
> Also I did not pay attention to the numa balancing part of the code
> while skimming through this file to catch this scenario. There are a
> couple of places there too which need to be scrutinized.
>
> 1. load_balance(): The check (busiest->nr_running > 1)
> The load balancing would be futile if there are tasks of other
> scheduling classes, wouldn't it?

agree with you

>
> 2. active_load_balance_cpu_stop(): A similar check and a similar
> consequence as 1 here.

agree with you

>
> 3. nohz_kick_needed() : We check for more than one task on the runqueue
> and hence trigger load balancing even if there are rt-tasks.

I can see one potentiel reason why rq->nr_running is interesting that
is the group capacity might have changed because of non cfs tasks
since last load balance. So we need to monitor the change of the
groups' capacity to ensure that the average load of each group is
still in the same level

>
> 4. cpu_avg_load_per_task(): This stands out among the rest as an
> incorrect usage of rq->nr_running in place of cfs_rq->nr_running. We
> divide the load associated with the cfs_rq by the number of tasks on the
> rq. This will make the cfs_rq load look smaller.

This one is solved in the consolidation of cpu_capacity patchset

>
> 5. task_hot() : I am not too sure about the consequences of using
> rq->nr_running here.

IIUC, the goal is to check if other tasks are running on the dst_rq so
rq->nr_running is the good one

>
> 6. update_sg_lb_stats(): sgs->sum_nr_running is the sum of
> rq->nr_running and propogates thus throughout the load balancing code path.

This one is solved in the consolidation of cpu_capacity patchset
>
> 7. sg_capacity_factor(): Returns the capacity factor measured against
> the cpu capacity available to fair tasks. We then compare this with the
> rq->nr_running in update_sg_lb_stats(), update_sd_pick_busiest() and
> calculate_imbalance()

This one is removed with the consolidation of cpu_capacity patchset

>
> 8. find_busiest_queue(): This anomaly shows up when we filter against
> rq->nr_running == 1 and imbalance cannot be taken care of by the
> existing task on this rq.

agree with you even if the test with wl should prevent wrong decision
as a wl will be null if no cfs task are present

Regards
Vincent
>
> Did I miss something or is it true that the usage of rq->nr_running in
> the above places is incorrect?
>
> Thanks
>
> Regards
> Preeti U Murthy
>

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [QUERY] Confusing usage of rq->nr_running in load balancing
  2014-09-03 16:58 ` Vincent Guittot
@ 2014-09-05 12:19   ` Preeti U Murthy
  2014-09-05 12:27     ` Vincent Guittot
  2014-09-15  4:16   ` Preeti U Murthy
  1 sibling, 1 reply; 7+ messages in thread
From: Preeti U Murthy @ 2014-09-05 12:19 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: peterz@infradead.org, Ingo Molnar, Rik van Riel, Morten Rasmussen,
	LKML, Mike Galbraith, Nicolas Pitre, daniel.lezcano@linaro.org,
	Dietmar Eggemann, Kamalesh Babulal, Srikar Dronamraju

Hi Vincent,

On 09/03/2014 10:28 PM, Vincent Guittot wrote:
> On 3 September 2014 14:21, Preeti U Murthy <preeti@linux.vnet.ibm.com> wrote:
>> Hi,
> 
> Hi Preeti,
> 
>>
>> There are places in kernel/sched/fair.c in the load balancing part where
>> rq->nr_running is used as against cfs_rq->nr_running. At least I could
>> not make out why the former was used in the following scenarios.
>> It looks to me that it can very well lead to incorrect load balancing.
>> Also I did not pay attention to the numa balancing part of the code
>> while skimming through this file to catch this scenario. There are a
>> couple of places there too which need to be scrutinized.
>>
>> 1. load_balance(): The check (busiest->nr_running > 1)
>> The load balancing would be futile if there are tasks of other
>> scheduling classes, wouldn't it?
> 
> agree with you
> 
>>
>> 2. active_load_balance_cpu_stop(): A similar check and a similar
>> consequence as 1 here.
> 
> agree with you
> 
>>
>> 3. nohz_kick_needed() : We check for more than one task on the runqueue
>> and hence trigger load balancing even if there are rt-tasks.
> 
> I can see one potentiel reason why rq->nr_running is interesting that
> is the group capacity might have changed because of non cfs tasks
> since last load balance. So we need to monitor the change of the
> groups' capacity to ensure that the average load of each group is
> still in the same level
> 
>>
>> 4. cpu_avg_load_per_task(): This stands out among the rest as an
>> incorrect usage of rq->nr_running in place of cfs_rq->nr_running. We
>> divide the load associated with the cfs_rq by the number of tasks on the
>> rq. This will make the cfs_rq load look smaller.
> 
> This one is solved in the consolidation of cpu_capacity patchset

Sorry, but I don't see where in your patchset you have addressed this
issue. Can you please point out the patch?

Regards
Preeti U Murthy


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [QUERY] Confusing usage of rq->nr_running in load balancing
  2014-09-05 12:19   ` Preeti U Murthy
@ 2014-09-05 12:27     ` Vincent Guittot
  2014-09-10  8:21       ` Preeti U Murthy
  0 siblings, 1 reply; 7+ messages in thread
From: Vincent Guittot @ 2014-09-05 12:27 UTC (permalink / raw)
  To: Preeti U Murthy
  Cc: peterz@infradead.org, Ingo Molnar, Rik van Riel, Morten Rasmussen,
	LKML, Mike Galbraith, Nicolas Pitre, daniel.lezcano@linaro.org,
	Dietmar Eggemann, Kamalesh Babulal, Srikar Dronamraju

On 5 September 2014 14:19, Preeti U Murthy <preeti@linux.vnet.ibm.com> wrote:
> Hi Vincent,
>
> On 09/03/2014 10:28 PM, Vincent Guittot wrote:
>> On 3 September 2014 14:21, Preeti U Murthy <preeti@linux.vnet.ibm.com> wrote:
>>> Hi,
>>
>> Hi Preeti,
>>
>>>
>>> There are places in kernel/sched/fair.c in the load balancing part where
>>> rq->nr_running is used as against cfs_rq->nr_running. At least I could
>>> not make out why the former was used in the following scenarios.
>>> It looks to me that it can very well lead to incorrect load balancing.
>>> Also I did not pay attention to the numa balancing part of the code
>>> while skimming through this file to catch this scenario. There are a
>>> couple of places there too which need to be scrutinized.
>>>
>>> 1. load_balance(): The check (busiest->nr_running > 1)
>>> The load balancing would be futile if there are tasks of other
>>> scheduling classes, wouldn't it?
>>
>> agree with you
>>
>>>
>>> 2. active_load_balance_cpu_stop(): A similar check and a similar
>>> consequence as 1 here.
>>
>> agree with you
>>
>>>
>>> 3. nohz_kick_needed() : We check for more than one task on the runqueue
>>> and hence trigger load balancing even if there are rt-tasks.
>>
>> I can see one potentiel reason why rq->nr_running is interesting that
>> is the group capacity might have changed because of non cfs tasks
>> since last load balance. So we need to monitor the change of the
>> groups' capacity to ensure that the average load of each group is
>> still in the same level
>>
>>>
>>> 4. cpu_avg_load_per_task(): This stands out among the rest as an
>>> incorrect usage of rq->nr_running in place of cfs_rq->nr_running. We
>>> divide the load associated with the cfs_rq by the number of tasks on the
>>> rq. This will make the cfs_rq load look smaller.
>>
>> This one is solved in the consolidation of cpu_capacity patchset
>
> Sorry, but I don't see where in your patchset you have addressed this
> issue. Can you please point out the patch?

In [PATCH v5 03/12] sched: fix avg_load computation:

static unsigned long cpu_avg_load_per_task(int cpu)
 {
        struct rq *rq = cpu_rq(cpu);
-       unsigned long nr_running = ACCESS_ONCE(rq->nr_running);
+       unsigned long nr_running = ACCESS_ONCE(rq->cfs.h_nr_running);
        unsigned long load_avg = rq->cfs.runnable_load_avg;

Are you referring to another problem than the one above ?

Regards
Vincent

>
> Regards
> Preeti U Murthy
>

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [QUERY] Confusing usage of rq->nr_running in load balancing
  2014-09-05 12:27     ` Vincent Guittot
@ 2014-09-10  8:21       ` Preeti U Murthy
  0 siblings, 0 replies; 7+ messages in thread
From: Preeti U Murthy @ 2014-09-10  8:21 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: peterz@infradead.org, Ingo Molnar, Rik van Riel, Morten Rasmussen,
	LKML, Mike Galbraith, Nicolas Pitre, daniel.lezcano@linaro.org,
	Dietmar Eggemann, Kamalesh Babulal, Srikar Dronamraju

On 09/05/2014 05:57 PM, Vincent Guittot wrote:
> On 5 September 2014 14:19, Preeti U Murthy <preeti@linux.vnet.ibm.com> wrote:
>> Hi Vincent,
>>
>> On 09/03/2014 10:28 PM, Vincent Guittot wrote:
>>> On 3 September 2014 14:21, Preeti U Murthy <preeti@linux.vnet.ibm.com> wrote:
>>>> Hi,
>>>
>>> Hi Preeti,
>>>
>>>>
>>>> There are places in kernel/sched/fair.c in the load balancing part where
>>>> rq->nr_running is used as against cfs_rq->nr_running. At least I could
>>>> not make out why the former was used in the following scenarios.
>>>> It looks to me that it can very well lead to incorrect load balancing.
>>>> Also I did not pay attention to the numa balancing part of the code
>>>> while skimming through this file to catch this scenario. There are a
>>>> couple of places there too which need to be scrutinized.
>>>>
>>>> 1. load_balance(): The check (busiest->nr_running > 1)
>>>> The load balancing would be futile if there are tasks of other
>>>> scheduling classes, wouldn't it?
>>>
>>> agree with you
>>>
>>>>
>>>> 2. active_load_balance_cpu_stop(): A similar check and a similar
>>>> consequence as 1 here.
>>>
>>> agree with you
>>>
>>>>
>>>> 3. nohz_kick_needed() : We check for more than one task on the runqueue
>>>> and hence trigger load balancing even if there are rt-tasks.
>>>
>>> I can see one potentiel reason why rq->nr_running is interesting that
>>> is the group capacity might have changed because of non cfs tasks
>>> since last load balance. So we need to monitor the change of the
>>> groups' capacity to ensure that the average load of each group is
>>> still in the same level
>>>
>>>>
>>>> 4. cpu_avg_load_per_task(): This stands out among the rest as an
>>>> incorrect usage of rq->nr_running in place of cfs_rq->nr_running. We
>>>> divide the load associated with the cfs_rq by the number of tasks on the
>>>> rq. This will make the cfs_rq load look smaller.
>>>
>>> This one is solved in the consolidation of cpu_capacity patchset
>>
>> Sorry, but I don't see where in your patchset you have addressed this
>> issue. Can you please point out the patch?
> 
> In [PATCH v5 03/12] sched: fix avg_load computation:
> 
> static unsigned long cpu_avg_load_per_task(int cpu)
>  {
>         struct rq *rq = cpu_rq(cpu);
> -       unsigned long nr_running = ACCESS_ONCE(rq->nr_running);
> +       unsigned long nr_running = ACCESS_ONCE(rq->cfs.h_nr_running);
>         unsigned long load_avg = rq->cfs.runnable_load_avg;
> 
> Are you referring to another problem than the one above ?

No Vincent, this is one. Thanks for pointing it out.

Regards
Preeti U Murthy
> 
> Regards
> Vincent
> 
>>
>> Regards
>> Preeti U Murthy
>>
> 


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [QUERY] Confusing usage of rq->nr_running in load balancing
  2014-09-03 16:58 ` Vincent Guittot
  2014-09-05 12:19   ` Preeti U Murthy
@ 2014-09-15  4:16   ` Preeti U Murthy
  1 sibling, 0 replies; 7+ messages in thread
From: Preeti U Murthy @ 2014-09-15  4:16 UTC (permalink / raw)
  To: Vincent Guittot, peterz@infradead.org
  Cc: Ingo Molnar, Rik van Riel, Morten Rasmussen, LKML, Mike Galbraith,
	Nicolas Pitre, daniel.lezcano@linaro.org, Dietmar Eggemann,
	Kamalesh Babulal, Srikar Dronamraju

Hi Peter, Vincent,

On 09/03/2014 10:28 PM, Vincent Guittot wrote:
> On 3 September 2014 14:21, Preeti U Murthy <preeti@linux.vnet.ibm.com> wrote:
>> Hi,
> 
> Hi Preeti,
> 
>>
>> There are places in kernel/sched/fair.c in the load balancing part where
>> rq->nr_running is used as against cfs_rq->nr_running. At least I could
>> not make out why the former was used in the following scenarios.
>> It looks to me that it can very well lead to incorrect load balancing.
>> Also I did not pay attention to the numa balancing part of the code
>> while skimming through this file to catch this scenario. There are a
>> couple of places there too which need to be scrutinized.
>>
>> 1. load_balance(): The check (busiest->nr_running > 1)
>> The load balancing would be futile if there are tasks of other
>> scheduling classes, wouldn't it?
> 
> agree with you
> 
>>
>> 2. active_load_balance_cpu_stop(): A similar check and a similar
>> consequence as 1 here.
> 
> agree with you
> 
>>
>> 3. nohz_kick_needed() : We check for more than one task on the runqueue
>> and hence trigger load balancing even if there are rt-tasks.
> 
> I can see one potentiel reason why rq->nr_running is interesting that
> is the group capacity might have changed because of non cfs tasks
> since last load balance. So we need to monitor the change of the
> groups' capacity to ensure that the average load of each group is
> still in the same level

I tried a patch which changes nr_running to cfs.h_nr_running in the
above three scenarios and found that the performance of the workload
*drops significantly*. The workload that I ran was ebizzy with a few
threads running at rt priority and few running at normal priority and
running in parallel. This was tried on a 16 core SMT-8 machine. The drop
in the performance was around 18% with the patch across different number
of threads.

I figured that it was because if we consider only cfs.h_nr_running in
the above cases, we reduce load balancing attempts even when the
capacity of the cpus to run fair tasks is significantly reduced. For
example if the cpu is running two rt tasks and one fair task, we skip
load balancing altogether with the patch. Besides this, we may end up
doing active load balancing too often.

So I think we are good with nr_running although that may mean
unnecessary load balancing attempts when only rt tasks are running on
the cpus. But evaluating on nr_running in the above three scenarios
when there is a mix of rt and fair tasks is better so as to see if the
cpus have enough capacity to handle the one fair task that they can
possibly be running (If there are more fair tasks, we load balance anyway).

As for the usage of nr_running in find_busiest_queue(), we are good
there as Vincent pointed out as below.

"
>
> 8. find_busiest_queue(): This anomaly shows up when we filter against
> rq->nr_running == 1 and imbalance cannot be taken care of by the
> existing task on this rq.

agree with you even if the test with wl should prevent wrong decision
as a wl will be null if no cfs task are present

"

So the only changes we require around this is the change of nr_running
to cfs.h_nr_running in update_sg_lb_stats() and cpu_avg_load_per_task()
which is being done by Vincent already in the consolidation of
cpu_capacity patches and I did not see regressions there during my tests.

Thanks

Regards
Preeti U Murthy

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2014-09-15  4:16 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2014-09-03 12:21 [QUERY] Confusing usage of rq->nr_running in load balancing Preeti U Murthy
2014-09-03 15:30 ` Peter Zijlstra
2014-09-03 16:58 ` Vincent Guittot
2014-09-05 12:19   ` Preeti U Murthy
2014-09-05 12:27     ` Vincent Guittot
2014-09-10  8:21       ` Preeti U Murthy
2014-09-15  4:16   ` Preeti U Murthy

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).