From: Preeti U Murthy <preeti@linux.vnet.ibm.com>
To: Peter Zijlstra <peterz@infradead.org>
Cc: Michael Neuling <mikey@neuling.org>,
Mike Galbraith <bitbucket@online.de>,
linuxppc-dev@lists.ozlabs.org, linux-kernel@vger.kernel.org,
Anton Blanchard <anton@samba.org>, Paul Turner <pjt@google.com>,
Ingo Molnar <mingo@kernel.org>
Subject: Re: [PATCH 1/3] sched: Fix nohz_kick_needed to consider the nr_busy of the parent domain's group
Date: Wed, 23 Oct 2013 09:30:44 +0530 [thread overview]
Message-ID: <526749EC.9030005@linux.vnet.ibm.com> (raw)
In-Reply-To: <20131022221138.GJ2490@laptop.programming.kicks-ass.net>
Hi Peter,
On 10/23/2013 03:41 AM, Peter Zijlstra wrote:
> On Mon, Oct 21, 2013 at 05:14:42PM +0530, Vaidyanathan Srinivasan wrote:
>> kernel/sched/fair.c | 19 +++++++++++++------
>> 1 file changed, 13 insertions(+), 6 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 7c70201..12f0eab 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -5807,12 +5807,19 @@ static inline int nohz_kick_needed(struct rq *rq, int cpu)
>>
>> rcu_read_lock();
>> for_each_domain(cpu, sd) {
>> + struct sched_domain *sd_parent = sd->parent;
>> + struct sched_group *sg;
>> + struct sched_group_power *sgp;
>> + int nr_busy;
>> +
>> + if (sd_parent) {
>> + sg = sd_parent->groups;
>> + sgp = sg->sgp;
>> + nr_busy = atomic_read(&sgp->nr_busy_cpus);
>> +
>> + if (sd->flags & SD_SHARE_PKG_RESOURCES && nr_busy > 1)
>> + goto need_kick_unlock;
>> + }
>>
>> if (sd->flags & SD_ASYM_PACKING && nr_busy != sg->group_weight
>> && (cpumask_first_and(nohz.idle_cpus_mask,
>>
>
> Almost I'd say; what happens on !sd_parent && SD_ASYM_PACKING ?
You are right, sorry about this. The idea was to correct the nr_busy
computation before the patch that would remove its usage in the second
patch. But that would mean the condition nr_busy != sg->group_weight
would be invalid with this patch. The second patch needs to go first to
avoid this confusion.
>
> Also, this made me look at the nr_busy stuff again, and somehow that
> entire thing makes me a little sad.
>
> Can't we do something like the below and cut that nr_busy sd iteration
> short?
We can surely cut the nr_busy sd iteration but not like what is done
with this patch. You stop the nr_busy computation at the sched domain
that has the flag SD_SHARE_PKG_RESOURCES set. But nohz_kick_needed()
would want to know the nr_busy for one level above this.
Consider a core. Assume it is the highest domain with this flag set.
The nr_busy of its groups, which are logical threads are set to 1/0
each. But nohz_kick_needed() would like to know the sum of the nr_busy
parameter of all the groups, i.e. the threads in a core before it
decides if it can kick nohz_idle balancing. The information about the
individual group's nr_busy is of no relevance here.
Thats why the above patch tries to get the
sd->parent->groups->sgp->nr_busy_cpus. This will translate rightly to
the core's busy cpus in this example. But the below patch stops before
updating this parameter at the sd->parent level, where sd is the highest
level sched domain with the SD_SHARE_PKG_RESOURCES flag set.
But we can get around all this confusion if we can move the nr_busy
parameter to be included in the sched_domain structure rather than the
sched_groups_power structure. Anyway the only place where nr_busy is
used, that is at nohz_kick_needed(), is done to know the total number of
busy cpus at a sched domain level which has the SD_SHARE_PKG_RESOURCES
set and not at a sched group level.
So why not move nr_busy to struct sched_domain and having the below
patch which just updates this parameter for the sched domain, sd_busy ?
This will avoid iterating through all the levels of sched domains and
should resolve the scalability issue. We also don't need to get to
sd->parent to get the nr_busy parameter for the sake of nohz_kick_needed().
What do you think?
Regards
Preeti U Murthy
>
> This nohz stuff really needs to be re-thought and made more scalable --
> its a royal pain :/
>
>
> kernel/sched/core.c | 4 ++++
> kernel/sched/fair.c | 21 +++++++++++++++------
> kernel/sched/sched.h | 5 ++---
> 3 files changed, 21 insertions(+), 9 deletions(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index c06b8d3..89db8dc 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -5271,6 +5271,7 @@ DEFINE_PER_CPU(struct sched_domain *, sd_llc);
> DEFINE_PER_CPU(int, sd_llc_size);
> DEFINE_PER_CPU(int, sd_llc_id);
> DEFINE_PER_CPU(struct sched_domain *, sd_numa);
> +DEFINE_PER_CPU(struct sched_domain *, sd_busy);
>
> static void update_top_cache_domain(int cpu)
> {
> @@ -5290,6 +5291,9 @@ static void update_top_cache_domain(int cpu)
>
> sd = lowest_flag_domain(cpu, SD_NUMA);
> rcu_assign_pointer(per_cpu(sd_numa, cpu), sd);
> +
> + sd = highest_flag_domain(cpu, SD_SHARE_PKG_RESOURCES | SD_ASYM_PACKING);
> + rcu_assign_pointer(per_cpu(sd_busy, cpu), sd);
> }
>
> /*
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 813dd61..3d5141e 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6512,19 +6512,23 @@ static inline void nohz_balance_exit_idle(int cpu)
> }
> }
>
> -static inline void set_cpu_sd_state_busy(void)
> +static inline void set_cpu_sd_state_busy(int cpu)
> {
> struct sched_domain *sd;
> + struct rq *rq = cpu_rq(cpu);
>
> rcu_read_lock();
> - sd = rcu_dereference_check_sched_domain(this_rq()->sd);
> + sd = rcu_dereference_check_sched_domain(rq->sd);
>
> if (!sd || !sd->nohz_idle)
> goto unlock;
> sd->nohz_idle = 0;
>
> - for (; sd; sd = sd->parent)
> + for (; sd; sd = sd->parent) {
> atomic_inc(&sd->groups->sgp->nr_busy_cpus);
> + if (sd == per_cpu(sd_busy, cpu))
> + break;
> + }
> unlock:
> rcu_read_unlock();
> }
> @@ -6532,16 +6536,21 @@ static inline void set_cpu_sd_state_busy(void)
> void set_cpu_sd_state_idle(void)
> {
> struct sched_domain *sd;
> + int cpu = smp_processor_id();
> + struct rq *rq = cpu_rq(cpu);
>
> rcu_read_lock();
> - sd = rcu_dereference_check_sched_domain(this_rq()->sd);
> + sd = rcu_dereference_check_sched_domain(rq->sd);
>
> if (!sd || sd->nohz_idle)
> goto unlock;
> sd->nohz_idle = 1;
>
> - for (; sd; sd = sd->parent)
> + for (; sd; sd = sd->parent) {
> atomic_dec(&sd->groups->sgp->nr_busy_cpus);
> + if (sd == per_cpu(sd_busy, cpu))
> + break;
> + }
> unlock:
> rcu_read_unlock();
> }
> @@ -6756,7 +6765,7 @@ static inline int nohz_kick_needed(struct rq *rq, int cpu)
> * We may be recently in ticked or tickless idle mode. At the first
> * busy tick after returning from idle, we will update the busy stats.
> */
> - set_cpu_sd_state_busy();
> + set_cpu_sd_state_busy(cpu);
> nohz_balance_exit_idle(cpu);
>
> /*
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index ffc7087..80c5fd2 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -599,9 +599,8 @@ static inline struct sched_domain *highest_flag_domain(int cpu, int flag)
> struct sched_domain *sd, *hsd = NULL;
>
> for_each_domain(cpu, sd) {
> - if (!(sd->flags & flag))
> - break;
> - hsd = sd;
> + if (sd->flags & flag)
> + hsd = sd;
> }
>
> return hsd;
>
WARNING: multiple messages have this Message-ID (diff)
From: Preeti U Murthy <preeti@linux.vnet.ibm.com>
To: Peter Zijlstra <peterz@infradead.org>
Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>,
Mike Galbraith <bitbucket@online.de>,
Paul Turner <pjt@google.com>, Ingo Molnar <mingo@kernel.org>,
Michael Neuling <mikey@neuling.org>,
Benjamin Herrenschmidt <benh@kernel.crashing.org>,
linux-kernel@vger.kernel.org, Anton Blanchard <anton@samba.org>,
linuxppc-dev@lists.ozlabs.org
Subject: Re: [PATCH 1/3] sched: Fix nohz_kick_needed to consider the nr_busy of the parent domain's group
Date: Wed, 23 Oct 2013 09:30:44 +0530 [thread overview]
Message-ID: <526749EC.9030005@linux.vnet.ibm.com> (raw)
In-Reply-To: <20131022221138.GJ2490@laptop.programming.kicks-ass.net>
Hi Peter,
On 10/23/2013 03:41 AM, Peter Zijlstra wrote:
> On Mon, Oct 21, 2013 at 05:14:42PM +0530, Vaidyanathan Srinivasan wrote:
>> kernel/sched/fair.c | 19 +++++++++++++------
>> 1 file changed, 13 insertions(+), 6 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 7c70201..12f0eab 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -5807,12 +5807,19 @@ static inline int nohz_kick_needed(struct rq *rq, int cpu)
>>
>> rcu_read_lock();
>> for_each_domain(cpu, sd) {
>> + struct sched_domain *sd_parent = sd->parent;
>> + struct sched_group *sg;
>> + struct sched_group_power *sgp;
>> + int nr_busy;
>> +
>> + if (sd_parent) {
>> + sg = sd_parent->groups;
>> + sgp = sg->sgp;
>> + nr_busy = atomic_read(&sgp->nr_busy_cpus);
>> +
>> + if (sd->flags & SD_SHARE_PKG_RESOURCES && nr_busy > 1)
>> + goto need_kick_unlock;
>> + }
>>
>> if (sd->flags & SD_ASYM_PACKING && nr_busy != sg->group_weight
>> && (cpumask_first_and(nohz.idle_cpus_mask,
>>
>
> Almost I'd say; what happens on !sd_parent && SD_ASYM_PACKING ?
You are right, sorry about this. The idea was to correct the nr_busy
computation before the patch that would remove its usage in the second
patch. But that would mean the condition nr_busy != sg->group_weight
would be invalid with this patch. The second patch needs to go first to
avoid this confusion.
>
> Also, this made me look at the nr_busy stuff again, and somehow that
> entire thing makes me a little sad.
>
> Can't we do something like the below and cut that nr_busy sd iteration
> short?
We can surely cut the nr_busy sd iteration but not like what is done
with this patch. You stop the nr_busy computation at the sched domain
that has the flag SD_SHARE_PKG_RESOURCES set. But nohz_kick_needed()
would want to know the nr_busy for one level above this.
Consider a core. Assume it is the highest domain with this flag set.
The nr_busy of its groups, which are logical threads are set to 1/0
each. But nohz_kick_needed() would like to know the sum of the nr_busy
parameter of all the groups, i.e. the threads in a core before it
decides if it can kick nohz_idle balancing. The information about the
individual group's nr_busy is of no relevance here.
Thats why the above patch tries to get the
sd->parent->groups->sgp->nr_busy_cpus. This will translate rightly to
the core's busy cpus in this example. But the below patch stops before
updating this parameter at the sd->parent level, where sd is the highest
level sched domain with the SD_SHARE_PKG_RESOURCES flag set.
But we can get around all this confusion if we can move the nr_busy
parameter to be included in the sched_domain structure rather than the
sched_groups_power structure. Anyway the only place where nr_busy is
used, that is at nohz_kick_needed(), is done to know the total number of
busy cpus at a sched domain level which has the SD_SHARE_PKG_RESOURCES
set and not at a sched group level.
So why not move nr_busy to struct sched_domain and having the below
patch which just updates this parameter for the sched domain, sd_busy ?
This will avoid iterating through all the levels of sched domains and
should resolve the scalability issue. We also don't need to get to
sd->parent to get the nr_busy parameter for the sake of nohz_kick_needed().
What do you think?
Regards
Preeti U Murthy
>
> This nohz stuff really needs to be re-thought and made more scalable --
> its a royal pain :/
>
>
> kernel/sched/core.c | 4 ++++
> kernel/sched/fair.c | 21 +++++++++++++++------
> kernel/sched/sched.h | 5 ++---
> 3 files changed, 21 insertions(+), 9 deletions(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index c06b8d3..89db8dc 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -5271,6 +5271,7 @@ DEFINE_PER_CPU(struct sched_domain *, sd_llc);
> DEFINE_PER_CPU(int, sd_llc_size);
> DEFINE_PER_CPU(int, sd_llc_id);
> DEFINE_PER_CPU(struct sched_domain *, sd_numa);
> +DEFINE_PER_CPU(struct sched_domain *, sd_busy);
>
> static void update_top_cache_domain(int cpu)
> {
> @@ -5290,6 +5291,9 @@ static void update_top_cache_domain(int cpu)
>
> sd = lowest_flag_domain(cpu, SD_NUMA);
> rcu_assign_pointer(per_cpu(sd_numa, cpu), sd);
> +
> + sd = highest_flag_domain(cpu, SD_SHARE_PKG_RESOURCES | SD_ASYM_PACKING);
> + rcu_assign_pointer(per_cpu(sd_busy, cpu), sd);
> }
>
> /*
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 813dd61..3d5141e 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6512,19 +6512,23 @@ static inline void nohz_balance_exit_idle(int cpu)
> }
> }
>
> -static inline void set_cpu_sd_state_busy(void)
> +static inline void set_cpu_sd_state_busy(int cpu)
> {
> struct sched_domain *sd;
> + struct rq *rq = cpu_rq(cpu);
>
> rcu_read_lock();
> - sd = rcu_dereference_check_sched_domain(this_rq()->sd);
> + sd = rcu_dereference_check_sched_domain(rq->sd);
>
> if (!sd || !sd->nohz_idle)
> goto unlock;
> sd->nohz_idle = 0;
>
> - for (; sd; sd = sd->parent)
> + for (; sd; sd = sd->parent) {
> atomic_inc(&sd->groups->sgp->nr_busy_cpus);
> + if (sd == per_cpu(sd_busy, cpu))
> + break;
> + }
> unlock:
> rcu_read_unlock();
> }
> @@ -6532,16 +6536,21 @@ static inline void set_cpu_sd_state_busy(void)
> void set_cpu_sd_state_idle(void)
> {
> struct sched_domain *sd;
> + int cpu = smp_processor_id();
> + struct rq *rq = cpu_rq(cpu);
>
> rcu_read_lock();
> - sd = rcu_dereference_check_sched_domain(this_rq()->sd);
> + sd = rcu_dereference_check_sched_domain(rq->sd);
>
> if (!sd || sd->nohz_idle)
> goto unlock;
> sd->nohz_idle = 1;
>
> - for (; sd; sd = sd->parent)
> + for (; sd; sd = sd->parent) {
> atomic_dec(&sd->groups->sgp->nr_busy_cpus);
> + if (sd == per_cpu(sd_busy, cpu))
> + break;
> + }
> unlock:
> rcu_read_unlock();
> }
> @@ -6756,7 +6765,7 @@ static inline int nohz_kick_needed(struct rq *rq, int cpu)
> * We may be recently in ticked or tickless idle mode. At the first
> * busy tick after returning from idle, we will update the busy stats.
> */
> - set_cpu_sd_state_busy();
> + set_cpu_sd_state_busy(cpu);
> nohz_balance_exit_idle(cpu);
>
> /*
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index ffc7087..80c5fd2 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -599,9 +599,8 @@ static inline struct sched_domain *highest_flag_domain(int cpu, int flag)
> struct sched_domain *sd, *hsd = NULL;
>
> for_each_domain(cpu, sd) {
> - if (!(sd->flags & flag))
> - break;
> - hsd = sd;
> + if (sd->flags & flag)
> + hsd = sd;
> }
>
> return hsd;
>
next prev parent reply other threads:[~2013-10-23 4:03 UTC|newest]
Thread overview: 43+ messages / expand[flat|nested] mbox.gz Atom feed top
2013-10-21 11:44 [PATCH 0/3] sched: Fixes for task placement in SMT threads Vaidyanathan Srinivasan
2013-10-21 11:44 ` Vaidyanathan Srinivasan
2013-10-21 11:44 ` [PATCH 1/3] sched: Fix nohz_kick_needed to consider the nr_busy of the parent domain's group Vaidyanathan Srinivasan
2013-10-21 11:44 ` Vaidyanathan Srinivasan
2013-10-22 14:35 ` Kamalesh Babulal
2013-10-22 14:35 ` Kamalesh Babulal
2013-10-22 16:40 ` Preeti U Murthy
2013-10-22 16:40 ` Preeti U Murthy
2013-10-22 22:11 ` Peter Zijlstra
2013-10-22 22:11 ` Peter Zijlstra
2013-10-23 4:00 ` Preeti U Murthy [this message]
2013-10-23 4:00 ` Preeti U Murthy
2013-10-23 4:21 ` Preeti U Murthy
2013-10-23 9:50 ` Preeti U Murthy
2013-10-23 9:50 ` Preeti U Murthy
2013-10-23 15:28 ` Vincent Guittot
2013-10-23 15:28 ` Vincent Guittot
2013-10-24 8:07 ` Preeti U Murthy
2013-10-24 8:07 ` Preeti U Murthy
2013-10-28 13:50 ` Peter Zijlstra
2013-10-28 13:50 ` Peter Zijlstra
2013-10-29 3:30 ` Preeti U Murthy
2013-10-29 3:30 ` Preeti U Murthy
2013-10-29 13:26 ` Peter Zijlstra
2013-10-29 13:26 ` Peter Zijlstra
2013-10-21 11:44 ` [PATCH 2/3] sched: Fix asymmetric scheduling for POWER7 Vaidyanathan Srinivasan
2013-10-21 11:44 ` Vaidyanathan Srinivasan
2013-10-21 22:55 ` Michael Neuling
2013-10-21 22:55 ` Michael Neuling
2013-10-22 22:18 ` Peter Zijlstra
2013-10-22 22:18 ` Peter Zijlstra
2013-10-21 11:45 ` [PATCH 3/3] sched: Aggressive balance in domains whose groups share package resources Vaidyanathan Srinivasan
2013-10-21 11:45 ` Vaidyanathan Srinivasan
2013-10-22 22:23 ` Peter Zijlstra
2013-10-22 22:23 ` Peter Zijlstra
2013-10-24 4:04 ` Preeti U Murthy
2013-10-24 4:04 ` Preeti U Murthy
2013-10-25 13:19 ` Preeti U Murthy
2013-10-25 13:19 ` Preeti U Murthy
2013-10-28 15:53 ` Peter Zijlstra
2013-10-28 15:53 ` Peter Zijlstra
2013-10-29 5:35 ` Preeti U Murthy
2013-10-29 5:35 ` Preeti U Murthy
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=526749EC.9030005@linux.vnet.ibm.com \
--to=preeti@linux.vnet.ibm.com \
--cc=anton@samba.org \
--cc=bitbucket@online.de \
--cc=linux-kernel@vger.kernel.org \
--cc=linuxppc-dev@lists.ozlabs.org \
--cc=mikey@neuling.org \
--cc=mingo@kernel.org \
--cc=peterz@infradead.org \
--cc=pjt@google.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.