From: Yuyang Du <yuyang.du@intel.com>
To: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: peterz@infradead.org, mingo@kernel.org,
	linux-kernel@vger.kernel.org, bsegall@google.com, pjt@google.com,
	vincent.guittot@linaro.org, dietmar.eggemann@arm.com,
	juri.lelli@arm.com
Subject: Re: [PATCH v3 05/12] sched/fair: Optimize __update_sched_avg()
Date: Tue, 10 May 2016 10:27:44 +0800	[thread overview]
Message-ID: <20160510022744.GT16093@intel.com> (raw)
In-Reply-To: <20160510084601.GB22310@e105550-lin.cambridge.arm.com>

On Tue, May 10, 2016 at 09:46:03AM +0100, Morten Rasmussen wrote:
> On Wed, May 04, 2016 at 04:02:46AM +0800, Yuyang Du wrote:
> > __update_sched_avg() has these steps:
> > 
> >   1. add the remainder of the last incomplete period
> >   2. decay old sum
> >   3. accumulate new sum in full periods since last_update_time
> >   4. add the current incomplete period
> >   5. update averages
> > 
> > Previously, we computed steps 1, 3, and 4 separately, which made
> > each of them ugly in code and costly in overhead.
> > 
> > For example:
> > 
> >              c1          c3           c4
> >              ^           ^            ^
> >              |           |            |
> >            |<->|<----------------->|<--->|
> >    ... |---x---|------| ... |------|-----x (now)
> > 
> > c1, c3, and c4 are the accumulated (and meanwhile decayed) time
> > contributions of steps 1, 3, and 4 respectively.
> > 
> > Then the accumulated contribution to load_sum, for example, is:
> > 
> > contrib = c1 * weight * freq_scaled;
> > contrib += c3 * weight * freq_scaled;
> > contrib += c4 * weight * freq_scaled;
> > 
> > Obviously, we can optimize the above as:
> > 
> > contrib = c1 + c3 + c4;
> > contrib *= weight * freq_scaled;
> 
> This isn't obvious to me. After spending quite a bit of time figuring
> out what your code actually does, there is more to it than what you describe
> here. As your label names suggest, you don't consider what happens in
> step 2 where contrib is decayed. When and how the individual bits are
> decayed is essential information.
> 
> Your patch partially moves step 2 (on your list above) before step 1.
> So it becomes:
> 
> a. decay old sum
> b. compute the contribution up to the first 1ms boundary (c1)
> c. decay c1 to get c1'
> d. accumulate the full periods (c3) adding them to c1'
> e. add remaining time up until now (c4) to contrib (c3+c1').
> f. scale by weight and arch_scale_{freq,cpu}_capacity() functions and
>    add to sum.
> 
> The basic difference in the computation is that you consolidate the
> scaling into one operation instead of three at the cost of decaying
> twice instead of once. The net result is saving a few multiplications.
> 
> I would recommend that this is made very clear. Deriving the math from
> the code is a daunting task for most people.
 
Agreed.
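
To spell out the difference in the same terms (decay() and scale()
below are only shorthand for __decay_sum() and the weight/freq scaling,
not real helpers), the two flows are roughly:

	/* old __update_load_avg(): one decay, three scalings */
	sum  = decay(sum + scale(c1), periods);
	sum += scale(c3);
	sum += scale(c4);

	/* new accumulate_sum(): two decays, one scaling */
	sum  = decay(sum, periods);
	sum += scale(decay(c1, periods) + c3 + c4);

I will put a comment along these lines into the next version so the
math does not have to be derived from the code.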

> An alternative optimization strategy would be to leave c1 as it is,
> thereby avoiding decaying twice, and only add up c3 and c4 before
> scaling, reducing the number of scaling operations from three to two.

I did struggle a lot with this, balancing code simplicity against
overhead. In the end I favored simplicity, and I believe it is still a
win on overhead.
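
In the same shorthand, the alternative would be roughly:

	/* alternative: one decay, two scalings */
	sum  = decay(sum + scale(c1), periods);
	sum += scale(c3 + c4);

It saves the second decay and one scaling relative to mainline, but it
keeps c1 on its own code path, and that special-casing is the part of
the complexity I was trying to get rid of.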
 
> > 
> > After we combine them together, we will have much cleaner codes
> > and less CPU cycles.
> 
> Could you share any numbers backing up that claim?
> 
> I did a couple of runs on arm64 (Juno):
> 
> 'perf bench sched messaging -g 50 -l 2500' (10 runs):
> 
> tip:	65.545996712 seconds time elapsed	( +-  0.22% )
> patch:	65.209931026 seconds time elapsed	( +-  0.23% )
> 
> Taking the error margins into account, the difference there is still an
> improvement, but it is about as small as it can be without getting lost
> in the noise. Is the picture any better on x86?
>
> Whether the code is cleaner is a subjective opinion. The diffstat below
> doesn't really show any benefit, but I think you have slightly more
> comments so I would not judge based on that.
> 
> When it comes to code structure, the new __update_sched_avg() is a lot
> simpler than __update_load_avg(), but that is only because most of the
> content has been moved to accumulate_sum() and __accumulate_sum().
> Where we used to have all the code in a single function that fitted on
> screen in one go, you now have to consider three functions and how they
> work together to figure out what is really going on.
> 
> I haven't found any bugs in the code, but IMHO I don't really see the
> point in rewriting the code completely. The result isn't significantly
> simpler than what we have and generates code churn affecting everyone
> working with this code. I think we can improve the existing code more by
> just factoring out the capacity/weight scaling, which would be much less
> intrusive.
> 
> Maybe I'm missing something, but I don't see a strong argument for this
> refactoring.

Do you have a criterion for how much improvement in performance and
code a patch must deliver?

In other words, do you want the improvement to be significant, or are
you looking for something else?
 
> > Signed-off-by: Yuyang Du <yuyang.du@intel.com>
> > ---
> >  kernel/sched/fair.c |  189 ++++++++++++++++++++++++++-------------------------
> >  1 file changed, 95 insertions(+), 94 deletions(-)
> > 
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index a060ef2..495e5f0 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -668,7 +668,7 @@ static unsigned long task_h_load(struct task_struct *p);
> >   */
> >  #define SCHED_AVG_HALFLIFE 32	/* number of periods as a half-life */
> >  #define SCHED_AVG_MAX 47742	/* maximum possible sched avg */
> > -#define SCHED_AVG_MAX_N 345	/* number of full periods to produce SCHED_AVG_MAX */
> > +#define SCHED_AVG_MAX_N 347	/* number of full periods to produce SCHED_AVG_MAX */
> 
> Why +2? It doesn't seem related to anything in this patch or explained
> anywhere.
 
I really should have had a separate patch to explain this. Sorry.

I provided a new program to compute SCHED_AVG_MAX_N (or LOAD_AVG_MAX_N);
this is the only difference from Paul's numbers, and it definitely
deserves a thorough discussion. I will do that in the next version.
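
To give an idea, the following is a minimal sketch of such a program
(not the exact one I used; the resulting MAX_N depends on exactly how
the fixed-point rounding and the current-period off-by-one are handled,
which is precisely what needs discussing):

#include <math.h>
#include <stdio.h>

#define HALFLIFE 32
#define SHIFT    32

int main(void)
{
	double y = pow(0.5, 1.0 / HALFLIFE);	/* y^32 == 0.5 */
	/* fixed-point multiplier for y, like the kernel's inverse tables */
	unsigned long long y_inv = (double)(1ULL << SHIFT) * y;
	unsigned long long max = 1024, last = 0;
	int n = 0;

	/*
	 * Iterate max = max*y + 1024 with fixed-point decay (rounding
	 * down, as the kernel does) until the value stops changing.
	 */
	while (max != last) {
		last = max;
		max = ((max * y_inv) >> SHIFT) + 1024;
		n++;
	}

	printf("max   = %llu\n", max);	/* ~ SCHED_AVG_MAX */
	printf("max_n = %d\n", n);	/* ~ SCHED_AVG_MAX_N, modulo off-by-one */
	return 0;
}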

> >  
> >  /* Give new sched_entity start runnable values to heavy its load in infant time */
> >  void init_entity_runnable_average(struct sched_entity *se)
> > @@ -2606,7 +2606,7 @@ static const u32 __accumulated_sum_N[] = {
> >  
> >  /*
> >   * Precomputed \Sum y^k { 1<=k<=n, where n%32=0). Values are rolled down to
> > - * lower integers.
> > + * lower integers. Since n < SCHED_AVG_MAX_N, n/SCHED_AVG_HALFLIFE < 11
> >   */
> >  static const u32 __accumulated_sum_N32[] = {
> >  	    0, 23371, 35056, 40899, 43820, 45281,
> > @@ -2616,8 +2616,11 @@ static const u32 __accumulated_sum_N32[] = {
> >  /*
> >   * val * y^n, where y^m ~= 0.5
> >   *
> > - * n is the number of periods past; a period is ~1ms
> > + * n is the number of periods past. A period is ~1ms, so a 32bit
> > + * integer can hold approximately a maximum of 49 (=2^32/1000/3600/24) days.
> > + *
> >   * m is called half-life in exponential decay; here it is SCHED_AVG_HALFLIFE=32.
> > + *
> >   */
> >  static __always_inline u64 __decay_sum(u64 val, u32 n)
> >  {
> 
> The above two hunks seem to belong to some of the previous patches in
> this patch set?

Duh, yes...
 
> > @@ -2649,20 +2652,30 @@ static __always_inline u64 __decay_sum(u64 val, u32 n)
> >   * We can compute this efficiently by combining:
> >   * y^32 = 1/2 with precomputed \Sum 1024*y^n   (where n < 32)
> >   */
> > -static __always_inline u32 __accumulate_sum(u32 n)
> > +static __always_inline u32
> > +__accumulate_sum(u32 periods, u32 period_contrib, u32 remainder)
> >  {
> > -	u32 contrib = 0;
> > +	u32 contrib;
> > +
> > +	if (!periods)
> > +		return remainder - period_contrib;
> >  
> > -	if (likely(n <= SCHED_AVG_HALFLIFE))
> > -		return __accumulated_sum_N[n];
> > -	else if (unlikely(n >= SCHED_AVG_MAX_N))
> > +	if (unlikely(periods >= SCHED_AVG_MAX_N))
> >  		return SCHED_AVG_MAX;
> >  
> > -	/* Since n < SCHED_AVG_MAX_N, n/SCHED_AVG_HALFLIFE < 11 */
> > -	contrib = __accumulated_sum_N32[n/SCHED_AVG_HALFLIFE];
> > -	n %= SCHED_AVG_HALFLIFE;
> > -	contrib = __decay_sum(contrib, n);
> > -	return contrib + __accumulated_sum_N[n];
> > +	remainder += __decay_sum((u64)(1024 - period_contrib), periods);
> > +
> > +	periods -= 1;
> > +	if (likely(periods <= SCHED_AVG_HALFLIFE))
> > +		contrib = __accumulated_sum_N[periods];
> > +	else {
> > +		contrib = __accumulated_sum_N32[periods/SCHED_AVG_HALFLIFE];
> > +		periods %= SCHED_AVG_HALFLIFE;
> > +		contrib = __decay_sum(contrib, periods);
> > +		contrib += __accumulated_sum_N[periods];
> > +	}
> > +
> > +	return contrib + remainder;
> >  }
> >  
> >  #if (SCHED_LOAD_SHIFT - SCHED_LOAD_RESOLUTION) != 10 || SCHED_CAPACITY_SHIFT != 10
> > @@ -2671,6 +2684,55 @@ static __always_inline u32 __accumulate_sum(u32 n)
> >  
> >  #define cap_scale(v, s) ((v)*(s) >> SCHED_CAPACITY_SHIFT)
> >  
> > +static __always_inline u32 accumulate_sum(u64 delta, struct sched_avg *sa,
> > +	struct cfs_rq *cfs_rq, int cpu, unsigned long weight, int running)
> 
> The unit and meaning of 'delta' confused me a lot when reviewing this
> patch. Here it is the time delta since last update in [us] (duration of
> c1+c3+c4).
> 
> > +{
> > +	u32 contrib, periods;
> > +	unsigned long scale_freq, scale_cpu;
> > +
> > +	scale_freq = arch_scale_freq_capacity(NULL, cpu);
> > +	scale_cpu = arch_scale_cpu_capacity(NULL, cpu);
> > +
> > +	delta += sa->period_contrib;
> 
> Here it is extended back to the 1024us boundary preceding the last
> update, i.e. period_contrib is now included.
> 
> > +	periods = delta >> 10; /* A period is 1024us (~1ms) */
> > +
> > +	/*
> > +	 * Accumulating *_sum has two steps.
> > +	 *
> > +	 * Step 1: decay old *_sum if we crossed period boundaries.
> > +	 */
> > +	if (periods) {
> > +		sa->load_sum = __decay_sum(sa->load_sum, periods);
> > +		if (cfs_rq) {
> > +			cfs_rq->runnable_load_sum =
> > +				__decay_sum(cfs_rq->runnable_load_sum, periods);
> > +		}
> > +		sa->util_sum = __decay_sum((u64)(sa->util_sum), periods);
> > +	}
> > +
> > +	/*
> > +	 * Step 2: accumulate new *_sum since last_update_time. This at most has
> > +	 * three parts (at least one part): (1) remainder of incomplete last
> > +	 * period, (2) full periods since (1), and (3) incomplete current period.
> > +	 *
> > +	 * Fortunately, we can (and should) do all these three at once.
> > +	 */
> > +	delta %= 1024;
> 
> Now 'delta' is any leftover contribution to the current 1ms period
> (duration of c4).
> 
> > +	contrib = __accumulate_sum(periods, sa->period_contrib, delta);
> > +	sa->period_contrib = delta;
> > +
> > +	contrib = cap_scale(contrib, scale_freq);
> > +	if (weight) {
> > +		sa->load_sum += weight * contrib;
> > +		if (cfs_rq)
> > +			cfs_rq->runnable_load_sum += weight * contrib;
> > +	}
> > +	if (running)
> > +		sa->util_sum += contrib * scale_cpu;
> > +
> > +	return periods;
> > +}
> > +
> >  /*
> >   * We can represent the historical contribution to sched average as the
> >   * coefficients of a geometric series.  To do this we divide the history
> > @@ -2701,12 +2763,9 @@ static __always_inline u32 __accumulate_sum(u32 n)
> >   */
> >  static __always_inline int
> >  __update_sched_avg(u64 now, int cpu, struct sched_avg *sa,
> > -		  unsigned long weight, int running, struct cfs_rq *cfs_rq)
> > +		   unsigned long weight, int running, struct cfs_rq *cfs_rq)
> >  {
> > -	u64 delta, scaled_delta;
> > -	u32 contrib, periods;
> > -	unsigned int delta_w, scaled_delta_w, decayed = 0;
> > -	unsigned long scale_freq, scale_cpu;
> > +	u64 delta;
> >  
> >  	delta = now - sa->last_update_time;
> 
> 'delta' is the true delta, but in [ns] here.
> 
> I would suggest clearly specifying what each parameter is and its unit
> for each of the helper functions to help people a bit.

Agreed, and I reused the variable.
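
For example, something along these lines for accumulate_sum() (just a
sketch of the wording):

/*
 * accumulate_sum() - accumulate the new *_sum contributions
 * @delta:   time since last_update_time, already converted from [ns]
 *           to [us] (>>10) by __update_sched_avg()
 * @sa:      the sched_avg to accumulate into
 * @cfs_rq:  if not NULL, its runnable_load_sum is updated as well
 * @cpu:     the cpu whose freq/capacity scaling factors are used
 * @weight:  load weight to accumulate; 0 means no load contribution
 * @running: non-zero if the entity was running, so util_sum accrues
 *
 * Returns the number of complete 1024us periods crossed since the
 * last update.
 */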
