Re: [PATCH v3 05/12] sched/fair: Optimize __update_sched_avg()

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Yuyang Du <yuyang.du@intel.com>
To: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: peterz@infradead.org, mingo@kernel.org,
	linux-kernel@vger.kernel.org, bsegall@google.com, pjt@google.com,
	vincent.guittot@linaro.org, dietmar.eggemann@arm.com,
	juri.lelli@arm.com
Subject: Re: [PATCH v3 05/12] sched/fair: Optimize __update_sched_avg()
Date: Tue, 10 May 2016 10:27:44 +0800	[thread overview]
Message-ID: <20160510022744.GT16093@intel.com> (raw)
In-Reply-To: <20160510084601.GB22310@e105550-lin.cambridge.arm.com>

On Tue, May 10, 2016 at 09:46:03AM +0100, Morten Rasmussen wrote:
> On Wed, May 04, 2016 at 04:02:46AM +0800, Yuyang Du wrote:
> > __update_sched_avg() has these steps:
> > 
> >   1. add the remainder of the last incomplete period
> >   2. decay old sum
> >   3. accumulate new sum in full periods since last_update_time
> >   4. add the current incomplete period
> >   5. update averages
> > 
> > Previously, we separately computed steps 1, 3, and 4, leading to
> > each one of them ugly in codes and costly in overhead.
> > 
> > For example:
> > 
> >              c1          c3           c4
> >              ^           ^            ^
> >              |           |            |
> >            |<->|<----------------->|<--->|
> >    ... |---x---|------| ... |------|-----x (now)
> > 
> > c1, c3, and c4 are the accumulated (meanwhile decayed) contributions
> > in timing aspect of steps 1, 3, and 4 respectively.
> > 
> > Then the accumulated contribution to load_sum, for example, is:
> > 
> > contrib = c1 * weight * freq_scaled;
> > contrib += c3 * weight * freq_scaled;
> > contrib += c4 * weight * freq_scaled;
> > 
> > Obviously, we can optimize the above as:
> > 
> > contrib = c1 + c3 + c4;
> > contrib *= weight * freq_scaled;
> 
> This isn't obvious to me. After spending quite a bit of time figuring
> what your code actually does, there is more to it than what you describe
> here. As your label names suggest, you don't consider what happens in
> step 2 where contrib is decayed. When and how the individual bits are
> decayed is essential information.
> 
> Your patch partially moves step 2 (on your list above) before step 1.
> So it becomes:
> 
> a. decay old sum
> b. compute the contribution up to the first 1ms boundary (c1)
> c. decay c1 to get c1'
> d. accumulate the full periods (c3) adding them to c1'
> e. add remaining time up until now (c4) to contrib (c3+c1').
> f. scale by weight and arch_scale_{freq,cpu}_capacity() functions and
>    add to sum.
> 
> The basic difference in the computation is that you consolidate the
> scaling into one operation instead of three at the cost of decaying
> twice instead of once. The net result is saving a few multiplications.
> 
> I would recommend that this is made very clear. Deriving the math from
> the code is daunting task for most people.
 
Agreed.

> An alternative optimization strategy would be to leave c1 as it is and
> thereby avoid to decay twice, and only add up c3 and c4 before scaling
> reducing the number of scaling operations from three to two.

I did struggle a lot about this, balancing code simplicity and overhead.
So I favored simplicity, and believe (in my opinion) it is still an
overhead win.
 
> > 
> > After we combine them together, we will have much cleaner codes
> > and less CPU cycles.
> 
> Could you share any numbers backing up that claim?
> 
> I did a couple of runs on arm64 (Juno):
> 
> perf bench sched messaging -g 50 -l 2500' (10 runs):
> 
> tip:	65.545996712 seconds time elapsed	( +-  0.22% )
> patch:	65.209931026 seconds time elapsed	( +-  0.23% )
> 
> Taking the error margins into account the difference there is still an
> improvement, but it is about as small as it can be without getting lost
> in the noise. Is the picture any better on x86?
>
> Whether the code is cleaner is a subjective opinion. The diffstat below
> doesn't really show any benefit, but I think you have slightly more
> comments so I would not judge based on that.
> 
> When it comes to code structure, the new __update_sched_avg() is a lot
> simpler than __update_load_avg(), but that is only due to the fact that
> most of the content has been move to accumulate_sum() and
> __accumulate_sum(). Where we before had all code in a single function
> with fitted on screen in one go, you know have to consider three
> functions and how they work together to figure out what is really going
> on.
> 
> I haven't found any bugs in the code, but IMHO I don't really see the
> point in rewriting the code completely. The result isn't significantly
> simpler than what we have and generates code churn affecting everyone
> working with this code. I think we can improve the existing code more by
> just factoring out the capacity/weight scaling, which would be much less
> intrusive.
> 
> Maybe I'm missing something, but I don't see a strong argument for this
> refactoring.

Do you have a criteria on how much to improve in perf and code merits a
patch?

To me, you want it significant or anything else?
 
> > Signed-off-by: Yuyang Du <yuyang.du@intel.com>
> > ---
> >  kernel/sched/fair.c |  189 ++++++++++++++++++++++++++-------------------------
> >  1 file changed, 95 insertions(+), 94 deletions(-)
> > 
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index a060ef2..495e5f0 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -668,7 +668,7 @@ static unsigned long task_h_load(struct task_struct *p);
> >   */
> >  #define SCHED_AVG_HALFLIFE 32	/* number of periods as a half-life */
> >  #define SCHED_AVG_MAX 47742	/* maximum possible sched avg */
> > -#define SCHED_AVG_MAX_N 345	/* number of full periods to produce SCHED_AVG_MAX */
> > +#define SCHED_AVG_MAX_N 347	/* number of full periods to produce SCHED_AVG_MAX */
> 
> Why +2? I doesn't seem related to anything in this patch or explained
> anywhere.
 
I really should have had a separate patch to explain this. Sorry.

I provided a new program to compute SCHED_AVG_MAX_N or LOAD_AVG_MAX_N, this is
the only difference from Paul's numbers, which definitely deserves a thorough
discussion. I will do it in next version.

> >  
> >  /* Give new sched_entity start runnable values to heavy its load in infant time */
> >  void init_entity_runnable_average(struct sched_entity *se)
> > @@ -2606,7 +2606,7 @@ static const u32 __accumulated_sum_N[] = {
> >  
> >  /*
> >   * Precomputed \Sum y^k { 1<=k<=n, where n%32=0). Values are rolled down to
> > - * lower integers.
> > + * lower integers. Since n < SCHED_AVG_MAX_N, n/SCHED_AVG_HALFLIFE < 11
> >   */
> >  static const u32 __accumulated_sum_N32[] = {
> >  	    0, 23371, 35056, 40899, 43820, 45281,
> > @@ -2616,8 +2616,11 @@ static const u32 __accumulated_sum_N32[] = {
> >  /*
> >   * val * y^n, where y^m ~= 0.5
> >   *
> > - * n is the number of periods past; a period is ~1ms
> > + * n is the number of periods past. A period is ~1ms, so a 32bit
> > + * integer can hold approximately a maximum of 49 (=2^32/1000/3600/24) days.
> > + *
> >   * m is called half-life in exponential decay; here it is SCHED_AVG_HALFLIFE=32.
> > + *
> >   */
> >  static __always_inline u64 __decay_sum(u64 val, u32 n)
> >  {
> 
> The above two hunks seem to belong to some of the previous patches in
> this patch set?

Duh, yes...
 
> > @@ -2649,20 +2652,30 @@ static __always_inline u64 __decay_sum(u64 val, u32 n)
> >   * We can compute this efficiently by combining:
> >   * y^32 = 1/2 with precomputed \Sum 1024*y^n   (where n < 32)
> >   */
> > -static __always_inline u32 __accumulate_sum(u32 n)
> > +static __always_inline u32
> > +__accumulate_sum(u32 periods, u32 period_contrib, u32 remainder)
> >  {
> > -	u32 contrib = 0;
> > +	u32 contrib;
> > +
> > +	if (!periods)
> > +		return remainder - period_contrib;
> >  
> > -	if (likely(n <= SCHED_AVG_HALFLIFE))
> > -		return __accumulated_sum_N[n];
> > -	else if (unlikely(n >= SCHED_AVG_MAX_N))
> > +	if (unlikely(periods >= SCHED_AVG_MAX_N))
> >  		return SCHED_AVG_MAX;
> >  
> > -	/* Since n < SCHED_AVG_MAX_N, n/SCHED_AVG_HALFLIFE < 11 */
> > -	contrib = __accumulated_sum_N32[n/SCHED_AVG_HALFLIFE];
> > -	n %= SCHED_AVG_HALFLIFE;
> > -	contrib = __decay_sum(contrib, n);
> > -	return contrib + __accumulated_sum_N[n];
> > +	remainder += __decay_sum((u64)(1024 - period_contrib), periods);
> > +
> > +	periods -= 1;
> > +	if (likely(periods <= SCHED_AVG_HALFLIFE))
> > +		contrib = __accumulated_sum_N[periods];
> > +	else {
> > +		contrib = __accumulated_sum_N32[periods/SCHED_AVG_HALFLIFE];
> > +		periods %= SCHED_AVG_HALFLIFE;
> > +		contrib = __decay_sum(contrib, periods);
> > +		contrib += __accumulated_sum_N[periods];
> > +	}
> > +
> > +	return contrib + remainder;
> >  }
> >  
> >  #if (SCHED_LOAD_SHIFT - SCHED_LOAD_RESOLUTION) != 10 || SCHED_CAPACITY_SHIFT != 10
> > @@ -2671,6 +2684,55 @@ static __always_inline u32 __accumulate_sum(u32 n)
> >  
> >  #define cap_scale(v, s) ((v)*(s) >> SCHED_CAPACITY_SHIFT)
> >  
> > +static __always_inline u32 accumulate_sum(u64 delta, struct sched_avg *sa,
> > +	struct cfs_rq *cfs_rq, int cpu, unsigned long weight, int running)
> 
> The unit and meaning of 'delta' confused me a lot when reviewing this
> patch. Here it is the time delta since last update in [us] (duration of
> c1+c3+c4).
> 
> > +{
> > +	u32 contrib, periods;
> > +	unsigned long scale_freq, scale_cpu;
> > +
> > +	scale_freq = arch_scale_freq_capacity(NULL, cpu);
> > +	scale_cpu = arch_scale_cpu_capacity(NULL, cpu);
> > +
> > +	delta += sa->period_contrib;
> 
> Here it is extended to include time previous 1ms boundary.
> 
> > +	periods = delta >> 10; /* A period is 1024us (~1ms) */
> > +
> > +	/*
> > +	 * Accumulating *_sum has two steps.
> > +	 *
> > +	 * Step 1: decay old *_sum if we crossed period boundaries.
> > +	 */
> > +	if (periods) {
> > +		sa->load_sum = __decay_sum(sa->load_sum, periods);
> > +		if (cfs_rq) {
> > +			cfs_rq->runnable_load_sum =
> > +				__decay_sum(cfs_rq->runnable_load_sum, periods);
> > +		}
> > +		sa->util_sum = __decay_sum((u64)(sa->util_sum), periods);
> > +	}
> > +
> > +	/*
> > +	 * Step 2: accumulate new *_sum since last_update_time. This at most has
> > +	 * three parts (at least one part): (1) remainder of incomplete last
> > +	 * period, (2) full periods since (1), and (3) incomplete current period.
> > +	 *
> > +	 * Fortunately, we can (and should) do all these three at once.
> > +	 */
> > +	delta %= 1024;
> 
> Now 'delta' is any leftover contribution to the current 1ms period
> (duration of c4).
> 
> > +	contrib = __accumulate_sum(periods, sa->period_contrib, delta);
> > +	sa->period_contrib = delta;
> > +
> > +	contrib = cap_scale(contrib, scale_freq);
> > +	if (weight) {
> > +		sa->load_sum += weight * contrib;
> > +		if (cfs_rq)
> > +			cfs_rq->runnable_load_sum += weight * contrib;
> > +	}
> > +	if (running)
> > +		sa->util_sum += contrib * scale_cpu;
> > +
> > +	return periods;
> > +}
> > +
> >  /*
> >   * We can represent the historical contribution to sched average as the
> >   * coefficients of a geometric series.  To do this we divide the history
> > @@ -2701,12 +2763,9 @@ static __always_inline u32 __accumulate_sum(u32 n)
> >   */
> >  static __always_inline int
> >  __update_sched_avg(u64 now, int cpu, struct sched_avg *sa,
> > -		  unsigned long weight, int running, struct cfs_rq *cfs_rq)
> > +		   unsigned long weight, int running, struct cfs_rq *cfs_rq)
> >  {
> > -	u64 delta, scaled_delta;
> > -	u32 contrib, periods;
> > -	unsigned int delta_w, scaled_delta_w, decayed = 0;
> > -	unsigned long scale_freq, scale_cpu;
> > +	u64 delta;
> >  
> >  	delta = now - sa->last_update_time;
> 
> 'delta' is true delta, but in [ns] here.
> 
> I would suggest clearly specifying what each parameter is and its unit
> for each of the helper functions to help people a bit.

Agreed, and I reused the variable.

next prev parent reply	other threads:[~2016-05-10 10:09 UTC|newest]

Thread overview: 21+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-05-03 20:02 [PATCH v3 00/12] sched/fair: Optimize and clean up sched averages Yuyang Du
2016-05-03 20:02 ` [PATCH v3 01/12] sched/fair: Optimize sum computation with a lookup table Yuyang Du
2016-05-03 20:02 ` [PATCH v3 02/12] sched/fair: Rename variable names for sched averages Yuyang Du
2016-05-03 20:02 ` [PATCH v3 03/12] sched/fair: Change the variable to hold the number of periods to 32bit Yuyang Du
2016-05-05 11:13   ` Morten Rasmussen
2016-05-05 11:23     ` Vincent Guittot
2016-05-05 18:19     ` Yuyang Du
2016-05-10  9:10       ` Morten Rasmussen
2016-05-10  2:05         ` Yuyang Du
2016-05-03 20:02 ` [PATCH v3 04/12] sched/fair: Add __always_inline compiler attribute to __accumulate_sum() Yuyang Du
2016-05-03 20:02 ` [PATCH v3 05/12] sched/fair: Optimize __update_sched_avg() Yuyang Du
2016-05-10  8:46   ` Morten Rasmussen
2016-05-10  2:27     ` Yuyang Du [this message]
2016-05-03 20:02 ` [PATCH v3 06/12] documentation: Add scheduler/sched-avg.txt Yuyang Du
2016-05-03 20:02 ` [PATCH v3 07/12] sched/fair: Generalize the load/util averages resolution definition Yuyang Du
2016-05-03 20:02 ` [PATCH v3 08/12] sched/fair: Remove SCHED_LOAD_SHIFT and SCHED_LOAD_SCALE Yuyang Du
2016-05-05  7:46   ` [PATCH v4] " Ingo Molnar
2016-05-03 20:02 ` [PATCH v3 09/12] sched/fair: Add introduction to the sched average metrics Yuyang Du
2016-05-03 20:02 ` [PATCH v3 10/12] sched/fair: Remove scale_load_down() for load_avg Yuyang Du
2016-05-03 20:02 ` [PATCH v3 11/12] sched/fair: Rename scale_load() and scale_load_down() Yuyang Du
2016-05-03 20:02 ` [PATCH v3 12/12] sched/fair: Enable increased scale for kernel load Yuyang Du

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20160510022744.GT16093@intel.com \
    --to=yuyang.du@intel.com \
    --cc=bsegall@google.com \
    --cc=dietmar.eggemann@arm.com \
    --cc=juri.lelli@arm.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mingo@kernel.org \
    --cc=morten.rasmussen@arm.com \
    --cc=peterz@infradead.org \
    --cc=pjt@google.com \
    --cc=vincent.guittot@linaro.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.