[PATCH v7 6/7] sched: replace capacity_factor by usage

From: peterz@infradead.org (Peter Zijlstra)
To: linux-arm-kernel@lists.infradead.org
Subject: [PATCH v7 6/7] sched: replace capacity_factor by usage
Date: Thu, 9 Oct 2014 16:58:16 +0200
Message-ID: <20141009145816.GS4750@worktop.programming.kicks-ass.net>
In-Reply-To: <1412684017-16595-7-git-send-email-vincent.guittot@linaro.org>

On Tue, Oct 07, 2014 at 02:13:36PM +0200, Vincent Guittot wrote:
> @@ -6214,17 +6178,21 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
>  
>  		/*
>  		 * In case the child domain prefers tasks go to siblings
> -		 * first, lower the sg capacity factor to one so that we'll try
> +		 * first, lower the sg capacity to one so that we'll try
>  		 * and move all the excess tasks away. We lower the capacity
>  		 * of a group only if the local group has the capacity to fit
> -		 * these excess tasks, i.e. nr_running < group_capacity_factor. The
> +		 * these excess tasks, i.e. group_capacity > 0. The
>  		 * extra check prevents the case where you always pull from the
>  		 * heaviest group when it is already under-utilized (possible
>  		 * with a large weight task outweighs the tasks on the system).
>  		 */
>  		if (prefer_sibling && sds->local &&
> -		    sds->local_stat.group_has_free_capacity)
> -			sgs->group_capacity_factor = min(sgs->group_capacity_factor, 1U);
> +		    group_has_capacity(env, &sds->local_stat)) {
> +			if (sgs->sum_nr_running > 1)
> +				sgs->group_no_capacity = 1;
> +			sgs->group_capacity = min(sgs->group_capacity,
> +						SCHED_CAPACITY_SCALE);
> +		}
>  
>  		if (update_sd_pick_busiest(env, sds, sg, sgs)) {
>  			sds->busiest = sg;

So this is your PREFER_SIBLING implementation, why is this a good one?

That is, the current PREFER_SIBLING works because we account against
nr_running, and setting the capacity factor to 1 makes 2 tasks too many,
so we end up moving stuff away.

But if I understand things right, we're now measuring tasks in
'utilization' against group_capacity, so setting group_capacity to
SCHED_CAPACITY_SCALE means we can end up with many tasks on the one cpu
before we move over to another group, right?
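
To make that concern concrete, here is a small standalone sketch (not
kernel code) that compares the two overload checks. The struct, the
helper names and the per-task utilization figures are assumptions made
up for illustration; only SCHED_CAPACITY_SCALE comes from the patch. It
also ignores the group_no_capacity flag the patch sets, so it only
illustrates the reading above.

```c
#include <stdbool.h>

#define SCHED_CAPACITY_SCALE 1024U	/* full capacity of one cpu */

/* Hypothetical, simplified per-group stats; not the kernel's sg_lb_stats. */
struct toy_stats {
	unsigned int nr_running;	/* runnable tasks in the group */
	unsigned int usage;		/* summed utilization, 1024 == one full cpu */
};

/* Old scheme: capacity factor capped to 1, so a second task is excess. */
bool overloaded_by_count(const struct toy_stats *s)
{
	return s->nr_running > 1U;
}

/* New scheme: capacity capped to SCHED_CAPACITY_SCALE, compared with usage. */
bool overloaded_by_usage(const struct toy_stats *s)
{
	return s->usage > SCHED_CAPACITY_SCALE;
}
```

With four tasks at roughly 20% utilization each (nr_running = 4,
usage = 800), overloaded_by_count() is true while overloaded_by_usage()
is false, which is the "many tasks on the one cpu" case.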

So I think that for 'idle' systems we want to do the
nr_running/work-conserving thing -- get as many cpus running
'something' and avoid queueing like the plague.

Then when there's some queueing, we want to go do the utilization thing,
basically minimize queueing by leveling utilization.

Once all cpus are fully utilized, we switch to fair/load based balancing
and try and get equal load on cpus.

Does that make sense?

If so, how about adding a group_type and splitting group_other into say
group_idle and group_util:

enum group_type {
	group_idle = 0,
	group_util,
	group_imbalanced,
	group_overloaded,
};

we change group_classify() into something like:

	if (sgs->group_usage > sgs->group_capacity)
		return group_overloaded;

	if (sg_imbalanced(group))
		return group_imbalanced;

	if (sgs->nr_running < sgs->weight)
		return group_idle;

	return group_util;
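
The classification above can be tried outside the kernel with a toy
stats struct. This is only a sketch: struct toy_sg_stats and
toy_group_classify() are hypothetical names, and the imbalanced flag
stands in for whatever sg_imbalanced(group) would return.

```c
#include <stdbool.h>

enum group_type {
	group_idle = 0,
	group_util,
	group_imbalanced,
	group_overloaded,
};

/* Hypothetical stand-in for the relevant fields of the group stats. */
struct toy_sg_stats {
	unsigned int nr_running;	/* runnable tasks in the group */
	unsigned int weight;		/* number of cpus in the group */
	unsigned long group_usage;	/* summed utilization */
	unsigned long group_capacity;	/* summed capacity */
	bool imbalanced;		/* stands in for sg_imbalanced(group) */
};

/* Same order of checks as the sketch above: most severe state first. */
enum group_type toy_group_classify(const struct toy_sg_stats *sgs)
{
	if (sgs->group_usage > sgs->group_capacity)
		return group_overloaded;

	if (sgs->imbalanced)
		return group_imbalanced;

	/* Fewer tasks than cpus: at least one cpu has nothing to run. */
	if (sgs->nr_running < sgs->weight)
		return group_idle;

	return group_util;
}
```

For a 4-cpu group with capacity 4096, one task gives group_idle, four
tasks with usage 2000 give group_util, and usage 5000 gives
group_overloaded.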

And then have update_sd_pick_busiest() something like:

	if (sgs->group_type > busiest->group_type)
		return true;

	if (sgs->group_type < busiest->group_type)
		return false;

	switch (sgs->group_type) {
	case group_idle:
		if (sgs->nr_running < busiest->nr_running)
			return false;
		break;

	case group_util:
		if (sgs->group_usage < busiest->group_usage)
			return false;
		break;

	default:
		if (sgs->avg_load < busiest->avg_load)
			return false;
		break;
	}

	....
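
As a standalone illustration of the comparison above, something like the
following could be compiled and poked at in userspace. The struct and
function names are hypothetical, and the final "return true" is an
assumption about what follows the elided part, not a claim about the
real function.

```c
#include <stdbool.h>

enum group_type {
	group_idle = 0,
	group_util,
	group_imbalanced,
	group_overloaded,
};

/* Hypothetical stand-in for the fields the comparison needs. */
struct toy_pick_stats {
	enum group_type group_type;
	unsigned int nr_running;
	unsigned long group_usage;
	unsigned long avg_load;
};

/* Return true if sgs should replace busiest as the busiest group. */
bool toy_pick_busiest(const struct toy_pick_stats *sgs,
		      const struct toy_pick_stats *busiest)
{
	/* A more severe group type always wins, a less severe one loses. */
	if (sgs->group_type > busiest->group_type)
		return true;

	if (sgs->group_type < busiest->group_type)
		return false;

	/* Same type: compare on the metric that matters for that type. */
	switch (sgs->group_type) {
	case group_idle:
		if (sgs->nr_running < busiest->nr_running)
			return false;
		break;

	case group_util:
		if (sgs->group_usage < busiest->group_usage)
			return false;
		break;

	default:
		if (sgs->avg_load < busiest->avg_load)
			return false;
		break;
	}

	/* Assumption: the elided checks are skipped here. */
	return true;
}
```

A group_util group then always beats a group_idle one, two group_idle
groups are ordered by nr_running, and two overloaded groups fall back to
avg_load.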

And then some calculate_imbalance() magic to complete it.

If we have that, we can play tricks with the exact busiest condition in
update_sd_pick_busiest() to implement PREFER_SIBLING or so.

Makes sense?
