public inbox for linux-kernel@vger.kernel.org
From: Yuyang Du <yuyang.du@intel.com>
To: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: "mingo@kernel.org" <mingo@kernel.org>,
	"peterz@infradead.org" <peterz@infradead.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"pjt@google.com" <pjt@google.com>,
	"bsegall@google.com" <bsegall@google.com>,
	Morten Rasmussen <Morten.Rasmussen@arm.com>,
	"vincent.guittot@linaro.org" <vincent.guittot@linaro.org>,
	"len.brown@intel.com" <len.brown@intel.com>,
	"rafael.j.wysocki@intel.com" <rafael.j.wysocki@intel.com>,
	"fengguang.wu@intel.com" <fengguang.wu@intel.com>,
	"boqun.feng@gmail.com" <boqun.feng@gmail.com>,
	"srikar@linux.vnet.ibm.com" <srikar@linux.vnet.ibm.com>
Subject: Re: [PATCH v9 2/4] sched: Rewrite runnable load and utilization average tracking
Date: Tue, 14 Jul 2015 07:59:23 +0800	[thread overview]
Message-ID: <20150713235923.GJ5197@intel.com> (raw)
In-Reply-To: <55A3F097.8040101@arm.com>

Hi Dietmar,

On Mon, Jul 13, 2015 at 06:08:39PM +0100, Dietmar Eggemann wrote:
> Hi Yuyang,
> 
> I did some testing of your new pelt implementation.
> 
> TC 1: one nice-0 60% task affine to cpu1 in root tg and 2 nice-0 20%
> periodic tasks affine to cpu1 in a task group with id=3 (one hierarchy).
> 
> TC 2: 10 nice-0 5% tasks affine to cpu1 in a task group with id=3 (one
> hierarchy).
> 
> and compared the results (the se (tasks and tg representation for cpu1),
> cfs_rq and tg related pelt signals) with the current pelt implementation.
> 
> The signals are very similar, taking into consideration the differences
> due to the separated/missing blocked load/util in the current pelt and
> the slightly different behaviour in transitional phases (e.g. task
> enqueue/dequeue).
> 

Thanks for the testing and the comments.

> I haven't done any performance related tests yet.
>
> > +/*
> > + * The load_avg/util_avg represents an infinite geometric series:
> > + * 1) load_avg describes the amount of time that a sched_entity
> > + * is runnable on a rq. It is based on both load_sum and the
> > + * weight of the task.
> > + * 2) util_avg describes the amount of time that a sched_entity
> > + * is running on a CPU. It is based on util_sum and is scaled
> > + * in the range [0..SCHED_LOAD_SCALE].
> 
> sa->load_[avg/sum] and sa->util_[avg/sum] are also used for the
> aggregated load/util values on the cfs_rq's.
 
Yes.

> >  /*
> > - * Aggregate cfs_rq runnable averages into an equivalent task_group
> > - * representation for computing load contributions.
> > + * Updating tg's load_avg is necessary before update_cfs_share (which is done)
> > + * and effective_load (which is not done because it is too costly).
> >   */
> > -static inline void __update_tg_runnable_avg(struct sched_avg *sa,
> > -                                                 struct cfs_rq *cfs_rq)
> > +static inline void update_tg_load_avg(struct cfs_rq *cfs_rq, int force)
> >  {
> 
> This function is always called with force=0 ? I remember that there was
> some discussion about this in your v5 (error bounds of '/ 64') but since
> it is not used ...
 
IIRC, Peter said the cacheline is hot, but we also need to bound the error.
What I did is bound the error as much as the current code does, and keep the
option (the force parameter here, if someone wants to use it someday) to
flush the error unconditionally.

> > -       struct task_group *tg = cfs_rq->tg;
> > -       long contrib;
> > -
> > -       /* The fraction of a cpu used by this cfs_rq */
> > -       contrib = div_u64((u64)sa->runnable_avg_sum << NICE_0_SHIFT,
> > -                         sa->avg_period + 1);
> > -       contrib -= cfs_rq->tg_runnable_contrib;
> > +       long delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;
> > 
> > -       if (abs(contrib) > cfs_rq->tg_runnable_contrib / 64) {
> > -               atomic_add(contrib, &tg->runnable_avg);
> > -               cfs_rq->tg_runnable_contrib += contrib;
> > -       }
> > -}
> 
> [...]
> 
> > -static inline u64 cfs_rq_clock_task(struct cfs_rq *cfs_rq);
> > -
> > -/* Update a sched_entity's runnable average */
> > -static inline void update_entity_load_avg(struct sched_entity *se,
> > -                                         int update_cfs_rq)
> > +/* Update task and its cfs_rq load average */
> > +static inline void update_load_avg(struct sched_entity *se, int update_tg)
> >  {
> >         struct cfs_rq *cfs_rq = cfs_rq_of(se);
> > -       long contrib_delta, utilization_delta;
> >         int cpu = cpu_of(rq_of(cfs_rq));
> > -       u64 now;
> > +       u64 now = cfs_rq_clock_task(cfs_rq);
> > 
> >         /*
> > -        * For a group entity we need to use their owned cfs_rq_clock_task() in
> > -        * case they are the parent of a throttled hierarchy.
> > +        * Track task load average for carrying it to new CPU after migrated, and
> > +        * track group sched_entity load average for task_h_load calc in migration
> >          */
> > -       if (entity_is_task(se))
> > -               now = cfs_rq_clock_task(cfs_rq);
> > -       else
> > -               now = cfs_rq_clock_task(group_cfs_rq(se));
> 
> Why don't you make this distinction while getting 'now' between se's
> representing tasks or task groups anymore?
 
My memory is a little blurry, but I think (1) we no longer distinguish
between task and tg SEs, and (2) as each cfs_rq tracks its load avg on its
own, a throttled cfs_rq will use the stopped clock when it updates its load.

But thinking about it now, it seems neither approach computes the right load
average for a throttled hierarchy. And it is not clear to me how to
correlate group quota with averaged load.

> > +       __update_load_avg(now, cpu, &se->avg,
> > +               se->on_rq * scale_load_down(se->load.weight), cfs_rq->curr == se);
> > 
> > -       if (!__update_entity_runnable_avg(now, cpu, &se->avg, se->on_rq,
> > -                                       cfs_rq->curr == se))
> > -               return;
> > -
> > -       contrib_delta = __update_entity_load_avg_contrib(se);
> > -       utilization_delta = __update_entity_utilization_avg_contrib(se);
> > -
> > -       if (!update_cfs_rq)
> > -               return;
> > -
> > -       if (se->on_rq) {
> > -               cfs_rq->runnable_load_avg += contrib_delta;
> > -               cfs_rq->utilization_load_avg += utilization_delta;
> > -       } else {
> > -               subtract_blocked_load_contrib(cfs_rq, -contrib_delta);
> > -       }
> > +       if (update_cfs_rq_load_avg(now, cfs_rq) && update_tg)
> > +               update_tg_load_avg(cfs_rq, 0);
> >  }
> 
> [...]
> 
> > -
> >  static void update_blocked_averages(int cpu)
> 
> The name of this function now becomes misleading since you don't update
> blocked averages any more. Existing pelt calls
> __update_blocked_averages_cpu() -> update_cfs_rq_blocked_load() ->
> subtract_blocked_load_contrib() for all tg tree.
> 
> Whereas you update cfs_rq.avg->[load/util]_[avg/sum] and conditionally
> tg->load_avg and cfs_rq->tg_load_avg_contrib.

I actually gave some thought to the naming. We no longer have explicit
blocked load, but this function still largely exists for the "silent"
blocked load, getting the cfs_rq and tg updated on its account.

Thanks,
Yuyang


Thread overview: 11+ messages
2015-06-23  0:08 [PATCH v9 0/4] sched: Rewrite runnable load and utilization average tracking Yuyang Du
2015-06-23  0:08 ` [PATCH v9 1/4] sched: Remove rq's runnable avg Yuyang Du
2015-06-23 16:49   ` Dietmar Eggemann
2015-06-23  0:08 ` [PATCH v9 2/4] sched: Rewrite runnable load and utilization average tracking Yuyang Du
2015-06-23 10:30   ` Wanpeng Li
2015-06-24  7:11   ` [PATCH] sched: update blocked load of idle cpus Vincent Guittot
2015-06-23 23:41     ` Yuyang Du
2015-07-13 17:08   ` [PATCH v9 2/4] sched: Rewrite runnable load and utilization average tracking Dietmar Eggemann
2015-07-13 23:59     ` Yuyang Du [this message]
2015-06-23  0:08 ` [PATCH v9 3/4] sched: Init cfs_rq's sched_entity load average Yuyang Du
2015-06-23  0:08 ` [PATCH v9 4/4] sched: Remove task and group entity load when they are dead Yuyang Du
