Re: [PATCH v9 2/4] sched: Rewrite runnable load and utilization average tracking

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Dietmar Eggemann <dietmar.eggemann@arm.com>
To: Yuyang Du <yuyang.du@intel.com>,
	"mingo@kernel.org" <mingo@kernel.org>,
	"peterz@infradead.org" <peterz@infradead.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>
Cc: "pjt@google.com" <pjt@google.com>,
	"bsegall@google.com" <bsegall@google.com>,
	Morten Rasmussen <Morten.Rasmussen@arm.com>,
	"vincent.guittot@linaro.org" <vincent.guittot@linaro.org>,
	"len.brown@intel.com" <len.brown@intel.com>,
	"rafael.j.wysocki@intel.com" <rafael.j.wysocki@intel.com>,
	"fengguang.wu@intel.com" <fengguang.wu@intel.com>,
	"boqun.feng@gmail.com" <boqun.feng@gmail.com>,
	"srikar@linux.vnet.ibm.com" <srikar@linux.vnet.ibm.com>
Subject: Re: [PATCH v9 2/4] sched: Rewrite runnable load and utilization average tracking
Date: Mon, 13 Jul 2015 18:08:39 +0100	[thread overview]
Message-ID: <55A3F097.8040101@arm.com> (raw)
In-Reply-To: <1435018085-7004-3-git-send-email-yuyang.du@intel.com>

Hi Yuyang,

I did some testing of your new pelt implementation.

TC 1: one nice-0 60% task affine to cpu1 in root tg and 2 nice-0 20%
periodic tasks affine to cpu1 in a task group with id=3 (one hierarchy).

TC 2: 10 nice-0 5% tasks affine to cpu1 in a task group with id=3 (one
hierarchy).

and compared the results (the se (tasks and tg representation for cpu1),
cfs_rq and tg related pelt signals) with the current pelt implementation.

The signals are very similar (taken the differences due to
separated/missing blocked load/util in the current pelt and the slightly
different behaviour in transitional phases (e.g. task enqueue/dequeue)
into consideration.

I haven't done any performance related tests yet.

-- Dietmar

On 23/06/15 01:08, Yuyang Du wrote:
> The idea of runnable load average (let runnable time contribute to weight)
> was proposed by Paul Turner, and it is still followed by this rewrite. This
> rewrite aims to solve the following issues:

[...]

> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index af0eeba..8b4bc4f 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1183,29 +1183,23 @@ struct load_weight {
>         u32 inv_weight;
>  };
> 
> +/*
> + * The load_avg/util_avg represents an infinite geometric series:
> + * 1) load_avg describes the amount of time that a sched_entity
> + * is runnable on a rq. It is based on both load_sum and the
> + * weight of the task.
> + * 2) util_avg describes the amount of time that a sched_entity
> + * is running on a CPU. It is based on util_sum and is scaled
> + * in the range [0..SCHED_LOAD_SCALE].

sa->load_[avg/sum] and sa->util_[avg/sum] are also used for the
aggregated load/util values on the cfs_rq's.

> + * The 64 bit load_sum can:
> + * 1) for cfs_rq, afford 4353082796 (=2^64/47742/88761) entities with
> + * the highest weight (=88761) always runnable, we should not overflow
> + * 2) for entity, support any load.weight always runnable
> + */
>  struct sched_avg {
> -       u64 last_runnable_update;
> -       s64 decay_count;
> -       /*
> -        * utilization_avg_contrib describes the amount of time that a
> -        * sched_entity is running on a CPU. It is based on running_avg_sum
> -        * and is scaled in the range [0..SCHED_LOAD_SCALE].
> -        * load_avg_contrib described the amount of time that a sched_entity
> -        * is runnable on a rq. It is based on both runnable_avg_sum and the
> -        * weight of the task.
> -        */
> -       unsigned long load_avg_contrib, utilization_avg_contrib;
> -       /*
> -        * These sums represent an infinite geometric series and so are bound
> -        * above by 1024/(1-y).  Thus we only need a u32 to store them for all
> -        * choices of y < 1-2^(-32)*1024.
> -        * running_avg_sum reflects the time that the sched_entity is
> -        * effectively running on the CPU.
> -        * runnable_avg_sum represents the amount of time a sched_entity is on
> -        * a runqueue which includes the running time that is monitored by
> -        * running_avg_sum.
> -        */
> -       u32 runnable_avg_sum, avg_period, running_avg_sum;
> +       u64 last_update_time, load_sum;
> +       u32 util_sum, period_contrib;
> +       unsigned long load_avg, util_avg;
>  };

[...]

>  /*
> - * Aggregate cfs_rq runnable averages into an equivalent task_group
> - * representation for computing load contributions.
> + * Updating tg's load_avg is necessary before update_cfs_share (which is done)
> + * and effective_load (which is not done because it is too costly).
>   */
> -static inline void __update_tg_runnable_avg(struct sched_avg *sa,
> -                                                 struct cfs_rq *cfs_rq)
> +static inline void update_tg_load_avg(struct cfs_rq *cfs_rq, int force)
>  {

This function is always called with force=0 ? I remember that there was
some discussion about this in your v5 (error bounds of '/ 64') but since
it is not used ...

> -       struct task_group *tg = cfs_rq->tg;
> -       long contrib;
> -
> -       /* The fraction of a cpu used by this cfs_rq */
> -       contrib = div_u64((u64)sa->runnable_avg_sum << NICE_0_SHIFT,
> -                         sa->avg_period + 1);
> -       contrib -= cfs_rq->tg_runnable_contrib;
> +       long delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;
> 
> -       if (abs(contrib) > cfs_rq->tg_runnable_contrib / 64) {
> -               atomic_add(contrib, &tg->runnable_avg);
> -               cfs_rq->tg_runnable_contrib += contrib;
> -       }
> -}

[...]

> -static inline u64 cfs_rq_clock_task(struct cfs_rq *cfs_rq);
> -
> -/* Update a sched_entity's runnable average */
> -static inline void update_entity_load_avg(struct sched_entity *se,
> -                                         int update_cfs_rq)
> +/* Update task and its cfs_rq load average */
> +static inline void update_load_avg(struct sched_entity *se, int update_tg)
>  {
>         struct cfs_rq *cfs_rq = cfs_rq_of(se);
> -       long contrib_delta, utilization_delta;
>         int cpu = cpu_of(rq_of(cfs_rq));
> -       u64 now;
> +       u64 now = cfs_rq_clock_task(cfs_rq);
> 
>         /*
> -        * For a group entity we need to use their owned cfs_rq_clock_task() in
> -        * case they are the parent of a throttled hierarchy.
> +        * Track task load average for carrying it to new CPU after migrated, and
> +        * track group sched_entity load average for task_h_load calc in migration
>          */
> -       if (entity_is_task(se))
> -               now = cfs_rq_clock_task(cfs_rq);
> -       else
> -               now = cfs_rq_clock_task(group_cfs_rq(se));

Why don't you make this distinction while getting 'now' between se's
representing tasks or task groups anymore?

> +       __update_load_avg(now, cpu, &se->avg,
> +               se->on_rq * scale_load_down(se->load.weight), cfs_rq->curr == se);
> 
> -       if (!__update_entity_runnable_avg(now, cpu, &se->avg, se->on_rq,
> -                                       cfs_rq->curr == se))
> -               return;
> -
> -       contrib_delta = __update_entity_load_avg_contrib(se);
> -       utilization_delta = __update_entity_utilization_avg_contrib(se);
> -
> -       if (!update_cfs_rq)
> -               return;
> -
> -       if (se->on_rq) {
> -               cfs_rq->runnable_load_avg += contrib_delta;
> -               cfs_rq->utilization_load_avg += utilization_delta;
> -       } else {
> -               subtract_blocked_load_contrib(cfs_rq, -contrib_delta);
> -       }
> +       if (update_cfs_rq_load_avg(now, cfs_rq) && update_tg)
> +               update_tg_load_avg(cfs_rq, 0);
>  }

[...]

> -
>  static void update_blocked_averages(int cpu)

The name of this function now becomes misleading since you don't update
blocked averages any more. Existing pelt calls
__update_blocked_averages_cpu() -> update_cfs_rq_blocked_load() ->
subtract_blocked_load_contrib() for all tg tree.

Whereas you update cfs_rq.avg->[load/util]_[avg/sum] and conditionally
tg->load_avg and cfs_rq->tg_load_avg_contrib.

[...]

next prev parent reply	other threads:[~2015-07-13 17:08 UTC|newest]

Thread overview: 11+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2015-06-23  0:08 [PATCH v9 0/4] sched: Rewrite runnable load and utilization average tracking Yuyang Du
2015-06-23  0:08 ` [PATCH v9 1/4] sched: Remove rq's runnable avg Yuyang Du
2015-06-23 16:49   ` Dietmar Eggemann
2015-06-23  0:08 ` [PATCH v9 2/4] sched: Rewrite runnable load and utilization average tracking Yuyang Du
2015-06-23 10:30   ` Wanpeng Li
2015-06-24  7:11   ` [PATCH] sched: update blocked load of idle cpus Vincent Guittot
2015-06-23 23:41     ` Yuyang Du
2015-07-13 17:08   ` Dietmar Eggemann [this message]
2015-07-13 23:59     ` [PATCH v9 2/4] sched: Rewrite runnable load and utilization average tracking Yuyang Du
2015-06-23  0:08 ` [PATCH v9 3/4] sched: Init cfs_rq's sched_entity load average Yuyang Du
2015-06-23  0:08 ` [PATCH v9 4/4] sched: Remove task and group entity load when they are dead Yuyang Du

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=55A3F097.8040101@arm.com \
    --to=dietmar.eggemann@arm.com \
    --cc=Morten.Rasmussen@arm.com \
    --cc=boqun.feng@gmail.com \
    --cc=bsegall@google.com \
    --cc=fengguang.wu@intel.com \
    --cc=len.brown@intel.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mingo@kernel.org \
    --cc=peterz@infradead.org \
    --cc=pjt@google.com \
    --cc=rafael.j.wysocki@intel.com \
    --cc=srikar@linux.vnet.ibm.com \
    --cc=vincent.guittot@linaro.org \
    --cc=yuyang.du@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.