Re: [PATCH v3 04/10] sched/fair: rework load_balance

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Dietmar Eggemann <dietmar.eggemann@arm.com>
To: Vincent Guittot <vincent.guittot@linaro.org>
Cc: linux-kernel <linux-kernel@vger.kernel.org>,
	Ingo Molnar <mingo@redhat.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Phil Auld <pauld@redhat.com>,
	Valentin Schneider <valentin.schneider@arm.com>,
	Srikar Dronamraju <srikar@linux.vnet.ibm.com>,
	Quentin Perret <quentin.perret@arm.com>,
	Morten Rasmussen <Morten.Rasmussen@arm.com>,
	Hillf Danton <hdanton@sina.com>
Subject: Re: [PATCH v3 04/10] sched/fair: rework load_balance
Date: Tue, 1 Oct 2019 18:52:58 +0200	[thread overview]
Message-ID: <3dca46c5-c395-e2b3-a7e8-e9208ba741c8@arm.com> (raw)
In-Reply-To: <CAKfTPtDUFMFnD+RZndx0+8A+V9HV9Hv0TN+p=mAge0VsqS6xmA@mail.gmail.com>

On 01/10/2019 10:14, Vincent Guittot wrote:
> On Mon, 30 Sep 2019 at 18:24, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>>
>> Hi Vincent,
>>
>> On 19/09/2019 09:33, Vincent Guittot wrote:

[...]

>>> @@ -7347,7 +7362,7 @@ static int detach_tasks(struct lb_env *env)
>>>   {
>>>         struct list_head *tasks = &env->src_rq->cfs_tasks;
>>>         struct task_struct *p;
>>> -     unsigned long load;
>>> +     unsigned long util, load;
>>
>> Minor: Order by length or reduce scope to while loop ?
> 
> I don't get your point here

Nothing dramatic here! Just

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d0c3aa1dc290..a08f342ead89 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7333,8 +7333,8 @@ static const unsigned int sched_nr_migrate_break = 32;
 static int detach_tasks(struct lb_env *env)
 {
        struct list_head *tasks = &env->src_rq->cfs_tasks;
-       struct task_struct *p;
        unsigned long load, util;
+       struct task_struct *p;
        int detached = 0;

        lockdep_assert_held(&env->src_rq->lock);

or

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d0c3aa1dc290..4d1864d43ed7 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7334,7 +7334,6 @@ static int detach_tasks(struct lb_env *env)
 {
        struct list_head *tasks = &env->src_rq->cfs_tasks;
        struct task_struct *p;
-       unsigned long load, util;
        int detached = 0;

        lockdep_assert_held(&env->src_rq->lock);
@@ -7343,6 +7342,8 @@ static int detach_tasks(struct lb_env *env)
                return 0;

        while (!list_empty(tasks)) {
+               unsigned long load, util;
+
                /*

[...]

>>> @@ -8042,14 +8104,24 @@ static inline void update_sg_lb_stats(struct lb_env *env,
>>>                 }
>>>         }
>>>
>>> -     /* Adjust by relative CPU capacity of the group */
>>> +     /* Check if dst cpu is idle and preferred to this group */
>>
>> s/preferred to/preferred by ? or the preferred CPU of this group ?
> 
> dst cpu doesn't belong to this group. We compare asym_prefer_cpu of
> this group vs dst_cpu which belongs to another group

Ah, in the sense of 'preferred over'. Got it now!

[...]

>>> +     if (busiest->group_type == group_imbalanced) {
>>> +             /*
>>> +              * In the group_imb case we cannot rely on group-wide averages
>>> +              * to ensure CPU-load equilibrium, try to move any task to fix
>>> +              * the imbalance. The next load balance will take care of
>>> +              * balancing back the system.
>>
>> balancing back ?
> 
> In case of imbalance, we don't try to balance the system but only try
> to get rid of the pinned tasks problem. The system will still be
> unbalanced after the migration and the next load balance will take
> care of balancing the system

OK.

[...]

>>>         /*
>>> -      * Avg load of busiest sg can be less and avg load of local sg can
>>> -      * be greater than avg load across all sgs of sd because avg load
>>> -      * factors in sg capacity and sgs with smaller group_type are
>>> -      * skipped when updating the busiest sg:
>>> +      * Try to use spare capacity of local group without overloading it or
>>> +      * emptying busiest
>>>          */
>>> -     if (busiest->group_type != group_misfit_task &&
>>> -         (busiest->avg_load <= sds->avg_load ||
>>> -          local->avg_load >= sds->avg_load)) {
>>> -             env->imbalance = 0;
>>> +     if (local->group_type == group_has_spare) {
>>> +             if (busiest->group_type > group_fully_busy) {
>>
>> So this could be 'busiest->group_type == group_overloaded' here to match
>> the comment below? Since you handle group_misfit_task,
>> group_asym_packing, group_imbalanced above and return.
> 
> This is just to be more robust in case some new states are added later

OK, although I doubt that additional states can be added easily w/o
carefully auditing the entire lb code ;-)

[...]

>>> +             if (busiest->group_weight == 1 || sds->prefer_sibling) {
>>> +                     /*
>>> +                      * When prefer sibling, evenly spread running tasks on
>>> +                      * groups.
>>> +                      */
>>> +                     env->balance_type = migrate_task;
>>> +                     env->imbalance = (busiest->sum_h_nr_running - local->sum_h_nr_running) >> 1;
>>> +                     return;
>>> +             }
>>> +
>>> +             /*
>>> +              * If there is no overload, we just want to even the number of
>>> +              * idle cpus.
>>> +              */
>>> +             env->balance_type = migrate_task;
>>> +             env->imbalance = max_t(long, 0, (local->idle_cpus - busiest->idle_cpus) >> 1);
>>
>> Why do we need a max_t(long, 0, ...) here and not for the 'if
>> (busiest->group_weight == 1 || sds->prefer_sibling)' case?
> 
> For env->imbalance = (busiest->sum_h_nr_running - local->sum_h_nr_running) >> 1;
> 
> either we have sds->prefer_sibling && busiest->sum_nr_running >
> local->sum_nr_running + 1

I see, this corresponds to

/* Try to move all excess tasks to child's sibling domain */
       if (sds.prefer_sibling && local->group_type == group_has_spare &&
           busiest->sum_h_nr_running > local->sum_h_nr_running + 1)
               goto force_balance;

in find_busiest_group, I assume.

Haven't been able to recreate this yet on my arm64 platform since there
is no prefer_sibling and in case local and busiest have
group_type=group_has_spare they bailout in

         if (busiest->group_type != group_overloaded &&
              (env->idle == CPU_NOT_IDLE ||
               local->idle_cpus <= (busiest->idle_cpus + 1)))
                 goto out_balanced;


[...]

>>> -     if (busiest->group_type == group_overloaded &&
>>> -         local->group_type   == group_overloaded) {
>>> -             load_above_capacity = busiest->sum_h_nr_running * SCHED_CAPACITY_SCALE;
>>> -             if (load_above_capacity > busiest->group_capacity) {
>>> -                     load_above_capacity -= busiest->group_capacity;
>>> -                     load_above_capacity *= scale_load_down(NICE_0_LOAD);
>>> -                     load_above_capacity /= busiest->group_capacity;
>>> -             } else
>>> -                     load_above_capacity = ~0UL;
>>> +     if (local->group_type < group_overloaded) {
>>> +             /*
>>> +              * Local will become overloaded so the avg_load metrics are
>>> +              * finally needed.
>>> +              */
>>
>> How does this relate to the decision_matrix[local, busiest] (dm[])? E.g.
>> dm[overload, overload] == avg_load or dm[fully_busy, overload] == force.
>> It would be nice to be able to match all allowed fields of dm to code sections.
> 
> decision_matrix describes how it decides between balanced or unbalanced.
> In case of dm[overload, overload], we use the avg_load to decide if it
> is balanced or not

OK, that's why you calculate sgs->avg_load in update_sg_lb_stats() only
for 'sgs->group_type == group_overloaded'.

> In case of dm[fully_busy, overload], the groups are unbalanced because
> fully_busy < overload and we force the balance. Then
> calculate_imbalance() uses the avg_load to decide how much will be
> moved

And in this case 'local->group_type < group_overloaded' in
calculate_imbalance(), 'local->avg_load' and 'sds->avg_load' have to be
calculated before using them in env->imbalance = min(...).

OK, got it now.

> dm[overload, overload]=force means that we force the balance and we
> will compute later the imbalance. avg_load may be used to calculate
> the imbalance
> dm[overload, overload]=avg_load means that we compare the avg_load to
> decide whether we need to balance load between groups
> dm[overload, overload]=nr_idle means that we compare the number of
> idle cpus to decide whether we need to balance.  In fact this is no
> more true with patch 7 because we also take into account the number of
> nr_h_running when weight =1

This becomes clearer now ... slowly.

[...]

next prev parent reply	other threads:[~2019-10-01 16:53 UTC|newest]

Thread overview: 57+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-09-19  7:33 [PATCH v3 0/8] sched/fair: rework the CFS load balance Vincent Guittot
2019-09-19  7:33 ` [PATCH v3 01/10] sched/fair: clean up asym packing Vincent Guittot
2019-09-27 23:57   ` Rik van Riel
2019-09-19  7:33 ` [PATCH v3 02/10] sched/fair: rename sum_nr_running to sum_h_nr_running Vincent Guittot
2019-09-27 23:59   ` Rik van Riel
2019-10-01 17:11   ` Valentin Schneider
2019-09-19  7:33 ` [PATCH v3 03/10] sched/fair: remove meaningless imbalance calculation Vincent Guittot
2019-09-28  0:05   ` Rik van Riel
2019-10-01 17:12   ` Valentin Schneider
2019-10-02  6:28     ` Vincent Guittot
2019-09-19  7:33 ` [PATCH v3 04/10] sched/fair: rework load_balance Vincent Guittot
2019-09-30  1:12   ` Rik van Riel
2019-09-30  7:44     ` Vincent Guittot
2019-09-30 16:24   ` Dietmar Eggemann
2019-10-01  8:14     ` Vincent Guittot
2019-10-01 16:52       ` Dietmar Eggemann [this message]
2019-10-02  6:44         ` Vincent Guittot
2019-10-02  9:21           ` Dietmar Eggemann
2019-10-08 13:02             ` Peter Zijlstra
2019-10-02  8:23         ` Vincent Guittot
2019-10-02  9:24           ` Dietmar Eggemann
2019-10-01  8:15   ` Dietmar Eggemann
2019-10-01  9:14     ` Vincent Guittot
2019-10-01 16:57       ` Dietmar Eggemann
2019-10-01 17:47   ` Valentin Schneider
2019-10-02  8:30     ` Vincent Guittot
2019-10-02 10:47       ` Valentin Schneider
2019-10-08 14:16         ` Peter Zijlstra
2019-10-08 14:34           ` Valentin Schneider
2019-10-08 15:30             ` Vincent Guittot
2019-10-08 15:48               ` Valentin Schneider
2019-10-08 17:39               ` Peter Zijlstra
2019-10-08 18:45                 ` Vincent Guittot
2019-10-08 16:33             ` Peter Zijlstra
2019-10-08 16:39               ` Valentin Schneider
2019-10-08 17:36                 ` Valentin Schneider
2019-10-08 17:55   ` Peter Zijlstra
2019-10-08 18:47     ` Vincent Guittot
2019-10-16  7:21   ` Parth Shah
2019-10-16 11:56     ` Vincent Guittot
2019-10-18  5:34       ` Parth Shah
2019-09-19  7:33 ` [PATCH v3 05/10] sched/fair: use rq->nr_running when balancing load Vincent Guittot
2019-09-19  7:33 ` [PATCH v3 06/10] sched/fair: use load instead of runnable load in load_balance Vincent Guittot
2019-09-19  7:33 ` [PATCH v3 07/10] sched/fair: evenly spread tasks when not overloaded Vincent Guittot
2019-09-19  7:33 ` [PATCH v3 08/10] sched/fair: use utilization to select misfit task Vincent Guittot
2019-10-01 17:12   ` Valentin Schneider
2019-09-19  7:33 ` [PATCH v3 09/10] sched/fair: use load instead of runnable load in wakeup path Vincent Guittot
2019-10-07 15:14   ` Rik van Riel
2019-10-07 15:27     ` Vincent Guittot
2019-10-07 18:06       ` Rik van Riel
2019-09-19  7:33 ` [PATCH v3 10/10] sched/fair: optimize find_idlest_group Vincent Guittot
2019-10-08 14:32 ` [PATCH v3 0/8] sched/fair: rework the CFS load balance Phil Auld
2019-10-08 15:53   ` Vincent Guittot
2019-10-09 19:33     ` Phil Auld
2019-10-10  8:20       ` Vincent Guittot
2019-10-16  7:21 ` Parth Shah
2019-10-16 11:51   ` Vincent Guittot

find likely ancestor, descendant, or conflicting patches for this message:
( dfblob:d0c3aa1dc29 dfblob:a08f342ead8 dfblob:d0c3aa1dc29
dfblob:4d1864d43ed )
 OR (
bs:"Re: [PATCH v3 04/10] sched/fair: rework load_balance" )
	(help)

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=3dca46c5-c395-e2b3-a7e8-e9208ba741c8@arm.com \
    --to=dietmar.eggemann@arm.com \
    --cc=Morten.Rasmussen@arm.com \
    --cc=hdanton@sina.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mingo@redhat.com \
    --cc=pauld@redhat.com \
    --cc=peterz@infradead.org \
    --cc=quentin.perret@arm.com \
    --cc=srikar@linux.vnet.ibm.com \
    --cc=valentin.schneider@arm.com \
    --cc=vincent.guittot@linaro.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.