Re: [PATCH] sched/fair: handle case of task_h_load() returning 0

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Valentin Schneider <valentin.schneider@arm.com>
To: Vincent Guittot <vincent.guittot@linaro.org>
Cc: mingo@redhat.com, peterz@infradead.org, juri.lelli@redhat.com,
	dietmar.eggemann@arm.com, rostedt@goodmis.org,
	bsegall@google.com, mgorman@suse.de,
	linux-kernel@vger.kernel.org
Subject: Re: [PATCH] sched/fair: handle case of task_h_load() returning 0
Date: Thu, 09 Jul 2020 14:06:35 +0100	[thread overview]
Message-ID: <jhjsge0ltwk.mognet@arm.com> (raw)
In-Reply-To: <20200702144258.19326-1-vincent.guittot@linaro.org>


On 02/07/20 15:42, Vincent Guittot wrote:
> task_h_load() can return 0 in some situations like running stress-ng
> mmapfork, which forks thousands of threads, in a sched group on a 224 cores
> system. The load balance doesn't handle this correctly because
> env->imbalance never decreases and it will stop pulling tasks only after
> reaching loop_max, which can be equal to the number of running tasks of
> the cfs. Make sure that imbalance will be decreased by at least 1.
>
> misfit task is the other feature that doesn't handle correctly such
> situation although it's probably more difficult to face the problem
> because of the smaller number of CPUs and running tasks on heterogenous
> system.
>
> We can't simply ensure that task_h_load() returns at least one because it
> would imply to handle underrun in other places.
>
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>

I dug some more into this; if I got my math right, this can be reproduced
with a single task group below the root. Forked tasks get max load, so this
can be tried out with either tons of forks or tons of CPU hogs.

We need

  p->se.avg.load_avg * cfs_rq->h_load
  -----------------------------------  < 1
    cfs_rq_load_avg(cfs_rq) + 1

Assuming homogeneous system with tasks spread out all over (no other tasks
interfering), that should boil down to

  1024 * (tg.shares / nr_cpus)
  ---------------------------  < 1
  1024 * (nr_tasks_on_cpu)

IOW

  tg.shares / nr_cpus < nr_tasks_on_cpu

If we get tasks nicely spread out, a simple condition to hit this should be
to have more tasks than shares.

I can hit task_h_load=0 with the following on my Juno (pinned to one CPU to
make things simpler; big.LITTLE doesn't yield equal weights between CPUs):

  cgcreate -g cpu:tg0

  echo 128 > /sys/fs/cgroup/cpu/tg0/cpu.shares

  for ((i=0; i<130; i++)); do
      # busy loop of your choice
      taskset -c 0 ./loop.sh &
      echo $! > /sys/fs/cgroup/cpu/tg0/tasks
  done

> ---
>  kernel/sched/fair.c | 18 +++++++++++++++++-
>  1 file changed, 17 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 6fab1d17c575..62747c24aa9e 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4049,7 +4049,13 @@ static inline void update_misfit_status(struct task_struct *p, struct rq *rq)
>               return;
>       }
>
> -	rq->misfit_task_load = task_h_load(p);
> +	/*
> +	 * Make sure that misfit_task_load will not be null even if
> +	 * task_h_load() returns 0. misfit_task_load is only used to select
> +	 * rq with highest load so adding 1 will not modify the result
> +	 * of the comparison.
> +	 */
> +	rq->misfit_task_load = task_h_load(p) + 1;

For here and below; wouldn't it be a tad cleaner to just do

        foo = max(task_h_load(p), 1);

Otherwise, I think I've properly convinced myself we do want to have
that in one form or another. So either way:

Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>

>  }
>
>  #else /* CONFIG_SMP */
> @@ -7664,6 +7670,16 @@ static int detach_tasks(struct lb_env *env)
>                           env->sd->nr_balance_failed <= env->sd->cache_nice_tries)
>                               goto next;
>
> +			/*
> +			 * Depending of the number of CPUs and tasks and the
> +			 * cgroup hierarchy, task_h_load() can return a null
> +			 * value. Make sure that env->imbalance decreases
> +			 * otherwise detach_tasks() will stop only after
> +			 * detaching up to loop_max tasks.
> +			 */
> +			if (!load)
> +				load = 1;
> +
>                       env->imbalance -= load;
>                       break;

next prev parent reply	other threads:[~2020-07-09 13:06 UTC|newest]

Thread overview: 11+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-07-02 14:42 [PATCH] sched/fair: handle case of task_h_load() returning 0 Vincent Guittot
2020-07-02 16:11 ` Valentin Schneider
2020-07-02 16:28   ` Vincent Guittot
2020-07-07 13:30     ` Vincent Guittot
2020-07-08 10:34       ` Valentin Schneider
2020-07-08  9:44 ` Dietmar Eggemann
2020-07-08  9:47   ` Vincent Guittot
2020-07-09 13:34     ` Dietmar Eggemann
2020-07-09 13:52       ` Vincent Guittot
2020-07-09 13:06 ` Valentin Schneider [this message]
2020-07-09 13:51   ` Vincent Guittot

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=jhjsge0ltwk.mognet@arm.com \
    --to=valentin.schneider@arm.com \
    --cc=bsegall@google.com \
    --cc=dietmar.eggemann@arm.com \
    --cc=juri.lelli@redhat.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mgorman@suse.de \
    --cc=mingo@redhat.com \
    --cc=peterz@infradead.org \
    --cc=rostedt@goodmis.org \
    --cc=vincent.guittot@linaro.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.