Re: [PATCH v3 04/10] sched/fair: rework load_balance

From: Dietmar Eggemann <dietmar.eggemann@arm.com>
To: Vincent Guittot <vincent.guittot@linaro.org>
Cc: linux-kernel <linux-kernel@vger.kernel.org>,
	Ingo Molnar <mingo@redhat.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Phil Auld <pauld@redhat.com>,
	Valentin Schneider <valentin.schneider@arm.com>,
	Srikar Dronamraju <srikar@linux.vnet.ibm.com>,
	Quentin Perret <quentin.perret@arm.com>,
	Morten Rasmussen <Morten.Rasmussen@arm.com>,
	Hillf Danton <hdanton@sina.com>
Subject: Re: [PATCH v3 04/10] sched/fair: rework load_balance
Date: Tue, 1 Oct 2019 18:52:58 +0200
Message-ID: <3dca46c5-c395-e2b3-a7e8-e9208ba741c8@arm.com>
In-Reply-To: <CAKfTPtDUFMFnD+RZndx0+8A+V9HV9Hv0TN+p=mAge0VsqS6xmA@mail.gmail.com>

On 01/10/2019 10:14, Vincent Guittot wrote:
> On Mon, 30 Sep 2019 at 18:24, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>>
>> Hi Vincent,
>>
>> On 19/09/2019 09:33, Vincent Guittot wrote:

[...]

>>> @@ -7347,7 +7362,7 @@ static int detach_tasks(struct lb_env *env)
>>>   {
>>>         struct list_head *tasks = &env->src_rq->cfs_tasks;
>>>         struct task_struct *p;
>>> -     unsigned long load;
>>> +     unsigned long util, load;
>>
>> Minor: Order by length or reduce scope to while loop ?
> 
> I don't get your point here

Nothing dramatic here! Just

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d0c3aa1dc290..a08f342ead89 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7333,8 +7333,8 @@ static const unsigned int sched_nr_migrate_break = 32;
 static int detach_tasks(struct lb_env *env)
 {
        struct list_head *tasks = &env->src_rq->cfs_tasks;
-       struct task_struct *p;
        unsigned long load, util;
+       struct task_struct *p;
        int detached = 0;

        lockdep_assert_held(&env->src_rq->lock);

or

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d0c3aa1dc290..4d1864d43ed7 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7334,7 +7334,6 @@ static int detach_tasks(struct lb_env *env)
 {
        struct list_head *tasks = &env->src_rq->cfs_tasks;
        struct task_struct *p;
-       unsigned long load, util;
        int detached = 0;

        lockdep_assert_held(&env->src_rq->lock);
@@ -7343,6 +7342,8 @@ static int detach_tasks(struct lb_env *env)
                return 0;

        while (!list_empty(tasks)) {
+               unsigned long load, util;
+
                /*

[...]

>>> @@ -8042,14 +8104,24 @@ static inline void update_sg_lb_stats(struct lb_env *env,
>>>                 }
>>>         }
>>>
>>> -     /* Adjust by relative CPU capacity of the group */
>>> +     /* Check if dst cpu is idle and preferred to this group */
>>
>> s/preferred to/preferred by ? or the preferred CPU of this group ?
> 
> dst cpu doesn't belong to this group. We compare asym_prefer_cpu of
> this group vs dst_cpu which belongs to another group

Ah, in the sense of 'preferred over'. Got it now!

[...]

>>> +     if (busiest->group_type == group_imbalanced) {
>>> +             /*
>>> +              * In the group_imb case we cannot rely on group-wide averages
>>> +              * to ensure CPU-load equilibrium, try to move any task to fix
>>> +              * the imbalance. The next load balance will take care of
>>> +              * balancing back the system.
>>
>> balancing back ?
> 
> In case of imbalance, we don't try to balance the system but only try
> to get rid of the pinned tasks problem. The system will still be
> unbalanced after the migration and the next load balance will take
> care of balancing the system

OK.

[...]

>>>         /*
>>> -      * Avg load of busiest sg can be less and avg load of local sg can
>>> -      * be greater than avg load across all sgs of sd because avg load
>>> -      * factors in sg capacity and sgs with smaller group_type are
>>> -      * skipped when updating the busiest sg:
>>> +      * Try to use spare capacity of local group without overloading it or
>>> +      * emptying busiest
>>>          */
>>> -     if (busiest->group_type != group_misfit_task &&
>>> -         (busiest->avg_load <= sds->avg_load ||
>>> -          local->avg_load >= sds->avg_load)) {
>>> -             env->imbalance = 0;
>>> +     if (local->group_type == group_has_spare) {
>>> +             if (busiest->group_type > group_fully_busy) {
>>
>> So this could be 'busiest->group_type == group_overloaded' here to match
>> the comment below? Since you handle group_misfit_task,
>> group_asym_packing, group_imbalanced above and return.
> 
> This is just to be more robust in case some new states are added later

OK, although I doubt that additional states can be added easily without
carefully auditing the entire lb code ;-)
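
To make the point concrete, here is a minimal sketch of why the two
checks behave the same today. The enum ordering below is an assumption
(only 'fully_busy < overload' is stated in this thread), and
handled_early(), check_as_posted(), check_as_suggested() and
checks_agree() are hypothetical helpers, not code from the patch:

```c
#include <stdbool.h>

/*
 * Assumed ordering of the group types. Only "fully_busy < overloaded"
 * is stated in this thread; the position of the three middle entries
 * is an assumption made for this sketch.
 */
enum group_type {
	group_has_spare = 0,
	group_fully_busy,
	group_misfit_task,
	group_asym_packing,
	group_imbalanced,
	group_overloaded,
};

/* Types that calculate_imbalance() handles first and returns on. */
static bool handled_early(enum group_type t)
{
	return t == group_misfit_task ||
	       t == group_asym_packing ||
	       t == group_imbalanced;
}

/* The check as posted in the patch. */
static bool check_as_posted(enum group_type busiest)
{
	return busiest > group_fully_busy;
}

/* The stricter alternative raised in review. */
static bool check_as_suggested(enum group_type busiest)
{
	return busiest == group_overloaded;
}

/* True if both checks agree for every type that can still reach them. */
static bool checks_agree(void)
{
	for (int t = group_has_spare; t <= group_overloaded; t++) {
		enum group_type gt = (enum group_type)t;

		if (handled_early(gt))
			continue;
		if (check_as_posted(gt) != check_as_suggested(gt))
			return false;
	}
	return true;
}
```

The two checks only differ for the types that already returned earlier,
so the '>' form matters only if a new state is inserted above
group_fully_busy later on.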

[...]

>>> +             if (busiest->group_weight == 1 || sds->prefer_sibling) {
>>> +                     /*
>>> +                      * When prefer sibling, evenly spread running tasks on
>>> +                      * groups.
>>> +                      */
>>> +                     env->balance_type = migrate_task;
>>> +                     env->imbalance = (busiest->sum_h_nr_running - local->sum_h_nr_running) >> 1;
>>> +                     return;
>>> +             }
>>> +
>>> +             /*
>>> +              * If there is no overload, we just want to even the number of
>>> +              * idle cpus.
>>> +              */
>>> +             env->balance_type = migrate_task;
>>> +             env->imbalance = max_t(long, 0, (local->idle_cpus - busiest->idle_cpus) >> 1);
>>
>> Why do we need a max_t(long, 0, ...) here and not for the 'if
>> (busiest->group_weight == 1 || sds->prefer_sibling)' case?
> 
> For env->imbalance = (busiest->sum_h_nr_running - local->sum_h_nr_running) >> 1;
> 
> either we have sds->prefer_sibling && busiest->sum_nr_running >
> local->sum_nr_running + 1

I see, this corresponds to

/* Try to move all excess tasks to child's sibling domain */
       if (sds.prefer_sibling && local->group_type == group_has_spare &&
           busiest->sum_h_nr_running > local->sum_h_nr_running + 1)
               goto force_balance;

in find_busiest_group, I assume.
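
A small arithmetic sketch of that precondition, assuming plain signed
counters for readability. sibling_imbalance() and idle_cpus_imbalance()
are hypothetical helpers that only mirror the two expressions from the
patch:

```c
/*
 * Hypothetical helper for the prefer_sibling branch. The caller is
 * assumed to guarantee busiest_nr > local_nr + 1, so the difference is
 * at least 2 and the result is at least 1: no clamp is needed.
 */
static long sibling_imbalance(long busiest_nr, long local_nr)
{
	return (busiest_nr - local_nr) >> 1;
}

/*
 * Hypothetical helper for the "even out idle CPUs" branch. Without a
 * guarantee on the operands the difference may be negative, so the
 * result is clamped at 0, standing in for max_t(long, 0, ...).
 */
static long idle_cpus_imbalance(long local_idle, long busiest_idle)
{
	if (local_idle <= busiest_idle)
		return 0;
	return (local_idle - busiest_idle) >> 1;
}
```

With busiest at 5 running tasks and local at 1, the first helper asks
for 2 tasks to be moved; with local at 1 idle CPU and busiest at 3, the
second one asks for none.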

I haven't been able to recreate this yet on my arm64 platform, since
there is no prefer_sibling and, when local and busiest both have
group_type == group_has_spare, they bail out in

         if (busiest->group_type != group_overloaded &&
              (env->idle == CPU_NOT_IDLE ||
               local->idle_cpus <= (busiest->idle_cpus + 1)))
                 goto out_balanced;
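
The bail-out condition above can be exercised in isolation with a
sketch like the following. The two enums are reduced local stand-ins
for the kernel types and bails_out_balanced() is a hypothetical wrapper
around the quoted expression:

```c
#include <stdbool.h>

/* Reduced stand-ins for the kernel enums, for this sketch only. */
enum group_type { group_has_spare, group_overloaded };
enum cpu_idle_type { CPU_IDLE, CPU_NOT_IDLE, CPU_NEWLY_IDLE };

/* Hypothetical wrapper: true when the quoted check jumps to out_balanced. */
static bool bails_out_balanced(enum group_type busiest_type,
			       enum cpu_idle_type idle,
			       unsigned int local_idle_cpus,
			       unsigned int busiest_idle_cpus)
{
	return busiest_type != group_overloaded &&
	       (idle == CPU_NOT_IDLE ||
		local_idle_cpus <= busiest_idle_cpus + 1);
}
```

So with a non-overloaded busiest group, balancing only continues when
the dst CPU is idle and local has at least two more idle CPUs than
busiest.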


[...]

>>> -     if (busiest->group_type == group_overloaded &&
>>> -         local->group_type   == group_overloaded) {
>>> -             load_above_capacity = busiest->sum_h_nr_running * SCHED_CAPACITY_SCALE;
>>> -             if (load_above_capacity > busiest->group_capacity) {
>>> -                     load_above_capacity -= busiest->group_capacity;
>>> -                     load_above_capacity *= scale_load_down(NICE_0_LOAD);
>>> -                     load_above_capacity /= busiest->group_capacity;
>>> -             } else
>>> -                     load_above_capacity = ~0UL;
>>> +     if (local->group_type < group_overloaded) {
>>> +             /*
>>> +              * Local will become overloaded so the avg_load metrics are
>>> +              * finally needed.
>>> +              */
>>
>> How does this relate to the decision_matrix[local, busiest] (dm[])? E.g.
>> dm[overload, overload] == avg_load or dm[fully_busy, overload] == force.
>> It would be nice to be able to match all allowed fields of dm to code sections.
> 
> decision_matrix describes how it decides between balanced or unbalanced.
> In case of dm[overload, overload], we use the avg_load to decide if it
> is balanced or not

OK, that's why you calculate sgs->avg_load in update_sg_lb_stats() only
for 'sgs->group_type == group_overloaded'.

> In case of dm[fully_busy, overload], the groups are unbalanced because
> fully_busy < overload and we force the balance. Then
> calculate_imbalance() uses the avg_load to decide how much will be
> moved

And in this case, since 'local->group_type < group_overloaded' in
calculate_imbalance(), 'local->avg_load' and 'sds->avg_load' have to be
calculated there before they are used in env->imbalance = min(...).

OK, got it now.

> dm[overload, overload]=force means that we force the balance and we
> will compute later the imbalance. avg_load may be used to calculate
> the imbalance
> dm[overload, overload]=avg_load means that we compare the avg_load to
> decide whether we need to balance load between groups
> dm[overload, overload]=nr_idle means that we compare the number of
> idle cpus to decide whether we need to balance.  In fact this is no
> more true with patch 7 because we also take into account the number of
> nr_h_running when weight =1

This becomes clearer now ... slowly.
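
To check my understanding, a partial sketch of the dm[local, busiest]
lookup. It encodes only the two entries named in this thread; every
other combination is reported as not covered rather than guessed. The
reduced enum, enum lb_decision and dm_lookup() are hypothetical names
for illustration:

```c
/* Reduced stand-in for the group types, for this sketch only. */
enum group_type { group_has_spare, group_fully_busy, group_overloaded };

/* The three kinds of decision described above, plus a "not covered" marker. */
enum lb_decision {
	DM_NOT_COVERED,	/* entry not spelled out in this thread */
	DM_FORCE,	/* force the balance, compute the imbalance later */
	DM_AVG_LOAD,	/* compare avg_load to decide whether to balance */
	DM_NR_IDLE,	/* compare the number of idle CPUs */
};

/* Hypothetical lookup, indexed as dm[local, busiest]. */
static enum lb_decision dm_lookup(enum group_type local,
				  enum group_type busiest)
{
	if (local == group_overloaded && busiest == group_overloaded)
		return DM_AVG_LOAD;
	if (local == group_fully_busy && busiest == group_overloaded)
		return DM_FORCE;
	return DM_NOT_COVERED;
}
```

Even in the DM_FORCE case, calculate_imbalance() may still use avg_load
to decide how much to move, so the matrix only answers "balance or
not", not "how much".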

[...]
