public inbox for linux-kernel@vger.kernel.org
From: Yuyang Du <yuyang.du@intel.com>
To: bsegall@google.com
Cc: mingo@redhat.com, peterz@infradead.org,
	linux-kernel@vger.kernel.org, pjt@google.com,
	arjan.van.de.ven@intel.com, len.brown@intel.com,
	rafael.j.wysocki@intel.com, alan.cox@intel.com,
	mark.gross@intel.com, fengguang.wu@intel.com
Subject: Re: [PATCH 2/2 v2] sched: Rewrite per entity runnable load average tracking
Date: Tue, 15 Jul 2014 09:51:05 +0800	[thread overview]
Message-ID: <20140715015105.GA2532@intel.com> (raw)
In-Reply-To: <xm26vbqzwx9a.fsf@sword-of-the-dawn.mtv.corp.google.com>

Thanks, Ben.

On Mon, Jul 14, 2014 at 12:33:53PM -0700, bsegall@google.com wrote:

> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -282,9 +282,6 @@ static inline struct cfs_rq *group_cfs_rq(struct sched_entity *grp)
> >  	return grp->my_q;
> >  }
> >  
> > -static void update_cfs_rq_blocked_load(struct cfs_rq *cfs_rq,
> > -				       int force_update);
> > -
> >  static inline void list_add_leaf_cfs_rq(struct cfs_rq *cfs_rq)
> >  {
> >  	if (!cfs_rq->on_list) {
> > @@ -304,8 +301,6 @@ static inline void list_add_leaf_cfs_rq(struct cfs_rq *cfs_rq)
> >  		}
> >  
> >  		cfs_rq->on_list = 1;
> > -		/* We should have no load, but we need to update last_decay. */
> > -		update_cfs_rq_blocked_load(cfs_rq, 0);
> 
> AFAICT this call was nonsense before your change, yes (it gets called by
> enqueue_entity_load_avg)?
> 
Yes, I think so.

> > @@ -667,18 +662,17 @@ static u64 sched_vslice(struct cfs_rq *cfs_rq, struct sched_entity *se)
> >  #ifdef CONFIG_SMP
> >  static unsigned long task_h_load(struct task_struct *p);
> >  
> > -static inline void __update_task_entity_contrib(struct sched_entity *se);
> > -
> >  /* Give new task start runnable values to heavy its load in infant time */
> >  void init_task_runnable_average(struct task_struct *p)
> >  {
> >  	u32 slice;
> > +	struct sched_avg *sa = &p->se.avg;
> >  
> > -	p->se.avg.decay_count = 0;
> > +	sa->last_update_time = 0;
> > +	sa->period_contrib = 0;
> 
> sa->period_contrib = slice;

period_contrib should be strictly < 1024, and I suppose sched_slice does not guarantee that.
So here I will give it 1023 to make the initial load heavy.
 
> > +static __always_inline u64 decay_load64(u64 val, u64 n)
> > +{
> > +	if (likely(val <= UINT_MAX))
> > +		val = decay_load(val, n);
> > +	else {
> > +		/*
> > +		 * LOAD_AVG_MAX can last ~500ms (=log_2(LOAD_AVG_MAX)*32ms).
> > +		 * Since we have so big runnable load_avg, it is impossible
> > +		 * load_avg has not been updated for such a long time. So
> > +		 * LOAD_AVG_MAX is enough here.
> > +		 */
> 
> I mean, LOAD_AVG_MAX is irrelevant - the constant could just as well be
> 1<<20, or whatever, yes? In fact, if you're going to then turn it into a
> fraction of 1<<10, just do (with whatever temporaries you find most tasteful):
> 
> val *= (u32) decay_load(1 << 10, n);
> val >>= 10;
> 

LOAD_AVG_MAX is selected on purpose. Any val arriving here is by definition very big,
so decay_load may not decay it to 0 even when the period n is not small. If we used 1<<10
here, n=10*32 would decay the fraction to 0, while a val larger than 1<<32 could still survive.

But if even LOAD_AVG_MAX decays to 0, then in the current code no runnable_avg_sum
would survive either, since LOAD_AVG_MAX is its upper bound.

> > +/*
> > + * Strictly, this se should use its parent cfs_rq's clock_task, but
> > + * here we use its own cfs_rq's for performance matter. But on load_avg
> > + * update, what we really care about is "the difference between two regular
> > + * clock reads", not absolute time, so the variation should be neglectable.
> > + */
> 
> Yes, but the difference between two clock reads can differ vastly
> depending on which clock you read - if cfs_rq was throttled, but
> parent_cfs_rq was not, reading cfs_rq's clock will give you no time
> passing. That said I think that's probably what you want for cfs_rq's
> load_avg, but is wrong for the group se, which probably needs to use its
> parent's.

Yes, then I think I may have to fall back to tracking the group se's load_avg separately.

> > +/* Update task and its cfs_rq load average */
> > +static inline void update_load_avg(struct sched_entity *se, int update_tg)
> >  {
> > +	struct cfs_rq *cfs_rq = cfs_rq_of(se);
> > +	u64 now = cfs_rq_clock_task(cfs_rq);
> > +
> >  	/*
> > +	 * Track task load average for carrying it to new CPU after migrated
> >  	 */
> > +	if (entity_is_task(se))
> > +		__update_load_avg(now, &se->avg, se->on_rq * se->load.weight);
> >  
> > +	update_cfs_rq_load_avg(now, cfs_rq);
> >  
> > +	if (update_tg)
> > +		update_tg_load_avg(cfs_rq);
> >  }
> 
> I personally find this very confusing, in that update_load_avg is doing
> more to se->cfs_rq, and in fact on anything other than a task, it isn't
> touching the se at all (instead, it touches _se->parent_ even).
 
What is confusing? The naming?

As for the overflow problem, maybe I can just fall back to doing load_avg / 47742
on every update; then everything would naturally be in the same range as the
current code.


Thread overview: 8+ messages
2014-07-13 23:15 [PATCH 0/2 v2] sched: Rewrite per entity runnable load average tracking Yuyang Du
2014-07-13 23:15 ` [PATCH 1/2 v2] sched: Remove update_rq_runnable_avg Yuyang Du
2014-07-13 23:15 ` [PATCH 2/2 v2] sched: Rewrite per entity runnable load average tracking Yuyang Du
2014-07-14 19:33   ` bsegall
2014-07-15  1:51     ` Yuyang Du [this message]
2014-07-15 17:27       ` bsegall
2014-07-15 23:45         ` Yuyang Du
2014-07-16  8:27         ` Mike Galbraith
