[CHANGE 1/2] sched/isolation: Make use of more than one housekeeping cpu

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

* [CHANGE 1/2] sched/isolation: Make use of more than one housekeeping cpu
@ 2025-02-11 14:01 Phil Auld
  2025-02-11 14:14 ` [PATCH] " Phil Auld
  2025-02-13  4:44 ` [CHANGE 1/2] " Madadi Vineeth Reddy
  0 siblings, 2 replies; 13+ messages in thread
From: Phil Auld @ 2025-02-11 14:01 UTC (permalink / raw)
  To: linux-kernel; +Cc: Peter Zijlstra, Juri Lelli, Frederic Weisbecker, Waiman Long

The exising code uses housekeeping_any_cpu() to select a cpu for
a given housekeeping task. However, this often ends up calling
cpumask_any_and() which is defined as cpumask_first_and() which has
the effect of alyways using the first cpu among those available.

The same applies when multiple NUMA nodes are involved. In that
case the first cpu in the local node is chosen which does provide
a bit of spreading but with multiple HK cpus per node the same
issues arise.

Spread the HK work out by having housekeeping_any_cpu() and
sched_numa_find_closest() use cpumask_any_and_distribute()
instead of cpumask_any_and().

Signed-off-by: Phil Auld <pauld@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Frederic Weisbecker <frederic@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: linux-kernel@vger.kernel.org
---
 kernel/sched/isolation.c | 2 +-
 kernel/sched/topology.c  | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
index 81bc8b329ef1..93b038d48900 100644
--- a/kernel/sched/isolation.c
+++ b/kernel/sched/isolation.c
@@ -40,7 +40,7 @@ int housekeeping_any_cpu(enum hk_type type)
 			if (cpu < nr_cpu_ids)
 				return cpu;
 
-			cpu = cpumask_any_and(housekeeping.cpumasks[type], cpu_online_mask);
+			cpu = cpumask_any_and_distribute(housekeeping.cpumasks[type], cpu_online_mask);
 			if (likely(cpu < nr_cpu_ids))
 				return cpu;
 			/*
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index c49aea8c1025..94133f843485 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -2101,7 +2101,7 @@ int sched_numa_find_closest(const struct cpumask *cpus, int cpu)
 	for (i = 0; i < sched_domains_numa_levels; i++) {
 		if (!masks[i][j])
 			break;
-		cpu = cpumask_any_and(cpus, masks[i][j]);
+		cpu = cpumask_any_and_distribute(cpus, masks[i][j]);
 		if (cpu < nr_cpu_ids) {
 			found = cpu;
 			break;
-- 
2.47.1


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH] sched/isolation: Make use of more than one housekeeping cpu
  2025-02-11 14:01 [CHANGE 1/2] sched/isolation: Make use of more than one housekeeping cpu Phil Auld
@ 2025-02-11 14:14 ` Phil Auld
  2025-02-11 14:24   ` Waiman Long
  2025-02-13  4:44 ` [CHANGE 1/2] " Madadi Vineeth Reddy
  1 sibling, 1 reply; 13+ messages in thread
From: Phil Auld @ 2025-02-11 14:14 UTC (permalink / raw)
  To: linux-kernel; +Cc: Peter Zijlstra, Juri Lelli, Frederic Weisbecker, Waiman Long


Oops, sorry... meant "[PATCH]"

On Tue, Feb 11, 2025 at 02:01:04PM +0000 Phil Auld wrote:
> The exising code uses housekeeping_any_cpu() to select a cpu for
> a given housekeeping task. However, this often ends up calling
> cpumask_any_and() which is defined as cpumask_first_and() which has
> the effect of alyways using the first cpu among those available.
> 
> The same applies when multiple NUMA nodes are involved. In that
> case the first cpu in the local node is chosen which does provide
> a bit of spreading but with multiple HK cpus per node the same
> issues arise.
> 
> Spread the HK work out by having housekeeping_any_cpu() and
> sched_numa_find_closest() use cpumask_any_and_distribute()
> instead of cpumask_any_and().
> 
> Signed-off-by: Phil Auld <pauld@redhat.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Juri Lelli <juri.lelli@redhat.com>
> Cc: Frederic Weisbecker <frederic@kernel.org>
> Cc: Waiman Long <longman@redhat.com>
> Cc: linux-kernel@vger.kernel.org
> ---
>  kernel/sched/isolation.c | 2 +-
>  kernel/sched/topology.c  | 2 +-
>  2 files changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
> index 81bc8b329ef1..93b038d48900 100644
> --- a/kernel/sched/isolation.c
> +++ b/kernel/sched/isolation.c
> @@ -40,7 +40,7 @@ int housekeeping_any_cpu(enum hk_type type)
>  			if (cpu < nr_cpu_ids)
>  				return cpu;
>  
> -			cpu = cpumask_any_and(housekeeping.cpumasks[type], cpu_online_mask);
> +			cpu = cpumask_any_and_distribute(housekeeping.cpumasks[type], cpu_online_mask);
>  			if (likely(cpu < nr_cpu_ids))
>  				return cpu;
>  			/*
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index c49aea8c1025..94133f843485 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -2101,7 +2101,7 @@ int sched_numa_find_closest(const struct cpumask *cpus, int cpu)
>  	for (i = 0; i < sched_domains_numa_levels; i++) {
>  		if (!masks[i][j])
>  			break;
> -		cpu = cpumask_any_and(cpus, masks[i][j]);
> +		cpu = cpumask_any_and_distribute(cpus, masks[i][j]);
>  		if (cpu < nr_cpu_ids) {
>  			found = cpu;
>  			break;
> -- 
> 2.47.1
> 
> 

-- 


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] sched/isolation: Make use of more than one housekeeping cpu
  2025-02-11 14:14 ` [PATCH] " Phil Auld
@ 2025-02-11 14:24   ` Waiman Long
  0 siblings, 0 replies; 13+ messages in thread
From: Waiman Long @ 2025-02-11 14:24 UTC (permalink / raw)
  To: Phil Auld, linux-kernel; +Cc: Peter Zijlstra, Juri Lelli, Frederic Weisbecker

On 2/11/25 9:14 AM, Phil Auld wrote:
> Oops, sorry... meant "[PATCH]"
>
> On Tue, Feb 11, 2025 at 02:01:04PM +0000 Phil Auld wrote:
>> The exising code uses housekeeping_any_cpu() to select a cpu for
>> a given housekeeping task. However, this often ends up calling
>> cpumask_any_and() which is defined as cpumask_first_and() which has
>> the effect of alyways using the first cpu among those available.
>>
>> The same applies when multiple NUMA nodes are involved. In that
>> case the first cpu in the local node is chosen which does provide
>> a bit of spreading but with multiple HK cpus per node the same
>> issues arise.
>>
>> Spread the HK work out by having housekeeping_any_cpu() and
>> sched_numa_find_closest() use cpumask_any_and_distribute()
>> instead of cpumask_any_and().
>>
>> Signed-off-by: Phil Auld <pauld@redhat.com>
>> Cc: Peter Zijlstra <peterz@infradead.org>
>> Cc: Juri Lelli <juri.lelli@redhat.com>
>> Cc: Frederic Weisbecker <frederic@kernel.org>
>> Cc: Waiman Long <longman@redhat.com>
>> Cc: linux-kernel@vger.kernel.org
>> ---
>>   kernel/sched/isolation.c | 2 +-
>>   kernel/sched/topology.c  | 2 +-
>>   2 files changed, 2 insertions(+), 2 deletions(-)
>>
>> diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
>> index 81bc8b329ef1..93b038d48900 100644
>> --- a/kernel/sched/isolation.c
>> +++ b/kernel/sched/isolation.c
>> @@ -40,7 +40,7 @@ int housekeeping_any_cpu(enum hk_type type)
>>   			if (cpu < nr_cpu_ids)
>>   				return cpu;
>>   
>> -			cpu = cpumask_any_and(housekeeping.cpumasks[type], cpu_online_mask);
>> +			cpu = cpumask_any_and_distribute(housekeeping.cpumasks[type], cpu_online_mask);
>>   			if (likely(cpu < nr_cpu_ids))
>>   				return cpu;
>>   			/*
>> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
>> index c49aea8c1025..94133f843485 100644
>> --- a/kernel/sched/topology.c
>> +++ b/kernel/sched/topology.c
>> @@ -2101,7 +2101,7 @@ int sched_numa_find_closest(const struct cpumask *cpus, int cpu)
>>   	for (i = 0; i < sched_domains_numa_levels; i++) {
>>   		if (!masks[i][j])
>>   			break;
>> -		cpu = cpumask_any_and(cpus, masks[i][j]);
>> +		cpu = cpumask_any_and_distribute(cpus, masks[i][j]);
>>   		if (cpu < nr_cpu_ids) {
>>   			found = cpu;
>>   			break;
>> -- 
>> 2.47.1
>>
>>
LGTM

Reviewed-by: Waiman Long <longman@redhat.com>


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [CHANGE 1/2] sched/isolation: Make use of more than one housekeeping cpu
  2025-02-11 14:01 [CHANGE 1/2] sched/isolation: Make use of more than one housekeeping cpu Phil Auld
  2025-02-11 14:14 ` [PATCH] " Phil Auld
@ 2025-02-13  4:44 ` Madadi Vineeth Reddy
  2025-02-13 14:26   ` Phil Auld
  1 sibling, 1 reply; 13+ messages in thread
From: Madadi Vineeth Reddy @ 2025-02-13  4:44 UTC (permalink / raw)
  To: Phil Auld
  Cc: Peter Zijlstra, Juri Lelli, Frederic Weisbecker, Waiman Long,
	linux-kernel, Madadi Vineeth Reddy

Hi Phil Auld,

On 11/02/25 19:31, Phil Auld wrote:
> The exising code uses housekeeping_any_cpu() to select a cpu for
> a given housekeeping task. However, this often ends up calling
> cpumask_any_and() which is defined as cpumask_first_and() which has
> the effect of alyways using the first cpu among those available.
> 
> The same applies when multiple NUMA nodes are involved. In that
> case the first cpu in the local node is chosen which does provide
> a bit of spreading but with multiple HK cpus per node the same
> issues arise.
> 
> Spread the HK work out by having housekeeping_any_cpu() and
> sched_numa_find_closest() use cpumask_any_and_distribute()
> instead of cpumask_any_and().
> 

Got the overall intent of the patch for better load distribution on
housekeeping tasks. However, one potential drawback could be that by
spreading HK work across multiple CPUs might reduce the time that
some cores can spend in deeper idle states which can be beneficial for
power-sensitive systems.

Thoughts?

Thanks,
Madadi Vineeth Reddy

> Signed-off-by: Phil Auld <pauld@redhat.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Juri Lelli <juri.lelli@redhat.com>
> Cc: Frederic Weisbecker <frederic@kernel.org>
> Cc: Waiman Long <longman@redhat.com>
> Cc: linux-kernel@vger.kernel.org
> ---
>  kernel/sched/isolation.c | 2 +-
>  kernel/sched/topology.c  | 2 +-
>  2 files changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
> index 81bc8b329ef1..93b038d48900 100644
> --- a/kernel/sched/isolation.c
> +++ b/kernel/sched/isolation.c
> @@ -40,7 +40,7 @@ int housekeeping_any_cpu(enum hk_type type)
>  			if (cpu < nr_cpu_ids)
>  				return cpu;
>  
> -			cpu = cpumask_any_and(housekeeping.cpumasks[type], cpu_online_mask);
> +			cpu = cpumask_any_and_distribute(housekeeping.cpumasks[type], cpu_online_mask);
>  			if (likely(cpu < nr_cpu_ids))
>  				return cpu;
>  			/*
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index c49aea8c1025..94133f843485 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -2101,7 +2101,7 @@ int sched_numa_find_closest(const struct cpumask *cpus, int cpu)
>  	for (i = 0; i < sched_domains_numa_levels; i++) {
>  		if (!masks[i][j])
>  			break;
> -		cpu = cpumask_any_and(cpus, masks[i][j]);
> +		cpu = cpumask_any_and_distribute(cpus, masks[i][j]);
>  		if (cpu < nr_cpu_ids) {
>  			found = cpu;
>  			break;


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [CHANGE 1/2] sched/isolation: Make use of more than one housekeeping cpu
  2025-02-13  4:44 ` [CHANGE 1/2] " Madadi Vineeth Reddy
@ 2025-02-13 14:26   ` Phil Auld
  2025-02-14  5:38     ` Vishal Chourasia
  0 siblings, 1 reply; 13+ messages in thread
From: Phil Auld @ 2025-02-13 14:26 UTC (permalink / raw)
  To: Madadi Vineeth Reddy
  Cc: Peter Zijlstra, Juri Lelli, Frederic Weisbecker, Waiman Long,
	linux-kernel

On Thu, Feb 13, 2025 at 10:14:04AM +0530 Madadi Vineeth Reddy wrote:
> Hi Phil Auld,
> 
> On 11/02/25 19:31, Phil Auld wrote:
> > The exising code uses housekeeping_any_cpu() to select a cpu for
> > a given housekeeping task. However, this often ends up calling
> > cpumask_any_and() which is defined as cpumask_first_and() which has
> > the effect of alyways using the first cpu among those available.
> > 
> > The same applies when multiple NUMA nodes are involved. In that
> > case the first cpu in the local node is chosen which does provide
> > a bit of spreading but with multiple HK cpus per node the same
> > issues arise.
> > 
> > Spread the HK work out by having housekeeping_any_cpu() and
> > sched_numa_find_closest() use cpumask_any_and_distribute()
> > instead of cpumask_any_and().
> > 
> 
> Got the overall intent of the patch for better load distribution on
> housekeeping tasks. However, one potential drawback could be that by
> spreading HK work across multiple CPUs might reduce the time that
> some cores can spend in deeper idle states which can be beneficial for
> power-sensitive systems.
> 
> Thoughts?

NOHZ_full setups are not generally used in power sensitive systems I think.
They aren't in our use cases at least. 

In cases with many cpus a single housekeeping cpu can not keep up. Having
other HK cpus in deep idle states while the one in use is overloaded is
not a win. 

If your single HK cpu can keep up then only configure that one HK cpu.
The others will go idle and stay there.  And since they are nohz_full
might get to stay idle even longer.

I do have a patch that has this controlled by a sched feature if that
is of interest. Then it could be disabled if you don't want it. 

Cheers,
Phil

> 
> Thanks,
> Madadi Vineeth Reddy
> 
> > Signed-off-by: Phil Auld <pauld@redhat.com>
> > Cc: Peter Zijlstra <peterz@infradead.org>
> > Cc: Juri Lelli <juri.lelli@redhat.com>
> > Cc: Frederic Weisbecker <frederic@kernel.org>
> > Cc: Waiman Long <longman@redhat.com>
> > Cc: linux-kernel@vger.kernel.org
> > ---
> >  kernel/sched/isolation.c | 2 +-
> >  kernel/sched/topology.c  | 2 +-
> >  2 files changed, 2 insertions(+), 2 deletions(-)
> > 
> > diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
> > index 81bc8b329ef1..93b038d48900 100644
> > --- a/kernel/sched/isolation.c
> > +++ b/kernel/sched/isolation.c
> > @@ -40,7 +40,7 @@ int housekeeping_any_cpu(enum hk_type type)
> >  			if (cpu < nr_cpu_ids)
> >  				return cpu;
> >  
> > -			cpu = cpumask_any_and(housekeeping.cpumasks[type], cpu_online_mask);
> > +			cpu = cpumask_any_and_distribute(housekeeping.cpumasks[type], cpu_online_mask);
> >  			if (likely(cpu < nr_cpu_ids))
> >  				return cpu;
> >  			/*
> > diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> > index c49aea8c1025..94133f843485 100644
> > --- a/kernel/sched/topology.c
> > +++ b/kernel/sched/topology.c
> > @@ -2101,7 +2101,7 @@ int sched_numa_find_closest(const struct cpumask *cpus, int cpu)
> >  	for (i = 0; i < sched_domains_numa_levels; i++) {
> >  		if (!masks[i][j])
> >  			break;
> > -		cpu = cpumask_any_and(cpus, masks[i][j]);
> > +		cpu = cpumask_any_and_distribute(cpus, masks[i][j]);
> >  		if (cpu < nr_cpu_ids) {
> >  			found = cpu;
> >  			break;
> 

-- 


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [CHANGE 1/2] sched/isolation: Make use of more than one housekeeping cpu
  2025-02-13 14:26   ` Phil Auld
@ 2025-02-14  5:38     ` Vishal Chourasia
  2025-02-18 15:00       ` Phil Auld
  0 siblings, 1 reply; 13+ messages in thread
From: Vishal Chourasia @ 2025-02-14  5:38 UTC (permalink / raw)
  To: Phil Auld
  Cc: Madadi Vineeth Reddy, Peter Zijlstra, Juri Lelli,
	Frederic Weisbecker, Waiman Long, linux-kernel

Hi Phil, Vineeth

On Thu, Feb 13, 2025 at 09:26:53AM -0500, Phil Auld wrote:
> On Thu, Feb 13, 2025 at 10:14:04AM +0530 Madadi Vineeth Reddy wrote:
> > Hi Phil Auld,
> > 
> > On 11/02/25 19:31, Phil Auld wrote:
> > > The exising code uses housekeeping_any_cpu() to select a cpu for
> > > a given housekeeping task. However, this often ends up calling
> > > cpumask_any_and() which is defined as cpumask_first_and() which has
> > > the effect of alyways using the first cpu among those available.
> > > 
> > > The same applies when multiple NUMA nodes are involved. In that
> > > case the first cpu in the local node is chosen which does provide
> > > a bit of spreading but with multiple HK cpus per node the same
> > > issues arise.
> > > 
> > > Spread the HK work out by having housekeeping_any_cpu() and
> > > sched_numa_find_closest() use cpumask_any_and_distribute()
> > > instead of cpumask_any_and().
> > > 
> > 
> > Got the overall intent of the patch for better load distribution on
> > housekeeping tasks. However, one potential drawback could be that by
> > spreading HK work across multiple CPUs might reduce the time that
> > some cores can spend in deeper idle states which can be beneficial for
> > power-sensitive systems.
> > 
> > Thoughts?
> 
> NOHZ_full setups are not generally used in power sensitive systems I think.
> They aren't in our use cases at least. 
> 
> In cases with many cpus a single housekeeping cpu can not keep up. Having
> other HK cpus in deep idle states while the one in use is overloaded is
> not a win. 

To me, an overloaded CPU sounds like where more than one tasks are ready
to run, and a HK CPU is one receiving periodic scheduling clock
ticks, so HP CPU is bound to comes out of any power-saving state it is in.
> 
> If your single HK cpu can keep up then only configure that one HK cpu.
> The others will go idle and stay there.  And since they are nohz_full
> might get to stay idle even longer.
While it is good to distribute the load across each HK CPU in the HK 
cpumask (queuing jobs on different CPUs each time), this can cause
jitter in virtualized environments. Unnecessaryily evicting other
tenants, when it's better to overload a VP than to wake up other VPs of a
tenant.

> 
> I do have a patch that has this controlled by a sched feature if that
> is of interest. Then it could be disabled if you don't want it.

Vishal
> 
> Cheers,
> Phil
> 
> > 
> > Thanks,
> > Madadi Vineeth Reddy
> > 
> > > Signed-off-by: Phil Auld <pauld@redhat.com>
> > > Cc: Peter Zijlstra <peterz@infradead.org>
> > > Cc: Juri Lelli <juri.lelli@redhat.com>
> > > Cc: Frederic Weisbecker <frederic@kernel.org>
> > > Cc: Waiman Long <longman@redhat.com>
> > > Cc: linux-kernel@vger.kernel.org
> > > ---
> > >  kernel/sched/isolation.c | 2 +-
> > >  kernel/sched/topology.c  | 2 +-
> > >  2 files changed, 2 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
> > > index 81bc8b329ef1..93b038d48900 100644
> > > --- a/kernel/sched/isolation.c
> > > +++ b/kernel/sched/isolation.c
> > > @@ -40,7 +40,7 @@ int housekeeping_any_cpu(enum hk_type type)
> > >  			if (cpu < nr_cpu_ids)
> > >  				return cpu;
> > >  
> > > -			cpu = cpumask_any_and(housekeeping.cpumasks[type], cpu_online_mask);
> > > +			cpu = cpumask_any_and_distribute(housekeeping.cpumasks[type], cpu_online_mask);
> > >  			if (likely(cpu < nr_cpu_ids))
> > >  				return cpu;
> > >  			/*
> > > diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> > > index c49aea8c1025..94133f843485 100644
> > > --- a/kernel/sched/topology.c
> > > +++ b/kernel/sched/topology.c
> > > @@ -2101,7 +2101,7 @@ int sched_numa_find_closest(const struct cpumask *cpus, int cpu)
> > >  	for (i = 0; i < sched_domains_numa_levels; i++) {
> > >  		if (!masks[i][j])
> > >  			break;
> > > -		cpu = cpumask_any_and(cpus, masks[i][j]);
> > > +		cpu = cpumask_any_and_distribute(cpus, masks[i][j]);
> > >  		if (cpu < nr_cpu_ids) {
> > >  			found = cpu;
> > >  			break;
> > 
> 
> -- 
> 

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [CHANGE 1/2] sched/isolation: Make use of more than one housekeeping cpu
  2025-02-14  5:38     ` Vishal Chourasia
@ 2025-02-18 15:00       ` Phil Auld
  2025-02-18 15:23         ` Waiman Long
  2025-02-20  9:20         ` Vishal Chourasia
  0 siblings, 2 replies; 13+ messages in thread
From: Phil Auld @ 2025-02-18 15:00 UTC (permalink / raw)
  To: Vishal Chourasia
  Cc: Madadi Vineeth Reddy, Peter Zijlstra, Juri Lelli,
	Frederic Weisbecker, Waiman Long, linux-kernel

Hi Vishal.

On Fri, Feb 14, 2025 at 11:08:19AM +0530 Vishal Chourasia wrote:
> Hi Phil, Vineeth
> 
> On Thu, Feb 13, 2025 at 09:26:53AM -0500, Phil Auld wrote:
> > On Thu, Feb 13, 2025 at 10:14:04AM +0530 Madadi Vineeth Reddy wrote:
> > > Hi Phil Auld,
> > > 
> > > On 11/02/25 19:31, Phil Auld wrote:
> > > > The exising code uses housekeeping_any_cpu() to select a cpu for
> > > > a given housekeeping task. However, this often ends up calling
> > > > cpumask_any_and() which is defined as cpumask_first_and() which has
> > > > the effect of alyways using the first cpu among those available.
> > > > 
> > > > The same applies when multiple NUMA nodes are involved. In that
> > > > case the first cpu in the local node is chosen which does provide
> > > > a bit of spreading but with multiple HK cpus per node the same
> > > > issues arise.
> > > > 
> > > > Spread the HK work out by having housekeeping_any_cpu() and
> > > > sched_numa_find_closest() use cpumask_any_and_distribute()
> > > > instead of cpumask_any_and().
> > > > 
> > > 
> > > Got the overall intent of the patch for better load distribution on
> > > housekeeping tasks. However, one potential drawback could be that by
> > > spreading HK work across multiple CPUs might reduce the time that
> > > some cores can spend in deeper idle states which can be beneficial for
> > > power-sensitive systems.
> > > 
> > > Thoughts?
> > 
> > NOHZ_full setups are not generally used in power sensitive systems I think.
> > They aren't in our use cases at least. 
> > 
> > In cases with many cpus a single housekeeping cpu can not keep up. Having
> > other HK cpus in deep idle states while the one in use is overloaded is
> > not a win. 
> 
> To me, an overloaded CPU sounds like where more than one tasks are ready
> to run, and a HK CPU is one receiving periodic scheduling clock
> ticks, so HP CPU is bound to comes out of any power-saving state it is in.

If the overload is caused by HK and interrupts there is nothing in the
system to help. Tasks, sure, can get load balanced.

And as you say, the HK cpus will have generally ticks happening anyway.

> > 
> > If your single HK cpu can keep up then only configure that one HK cpu.
> > The others will go idle and stay there.  And since they are nohz_full
> > might get to stay idle even longer.
> While it is good to distribute the load across each HK CPU in the HK 
> cpumask (queuing jobs on different CPUs each time), this can cause
> jitter in virtualized environments. Unnecessaryily evicting other
> tenants, when it's better to overload a VP than to wake up other VPs of a
> tenant.
> 

Sorry I'm not sure I understand your setup. Are your running virtual
tenants on the HK cpus?  nohz_full in the guests? Maybe you only need
on HK then it won't matter.

My concern is that currently there is no point in having more than
one HK cpu (per node in a NUMA case). The code as currently implemented
is just not doing what it needs to.

We have numerous cases where a single HK cpu just cannot keep up and
the remote_tick warning fires. It also can lead to the other things
(orchastration sw, HA keepalives etc) on the HK cpus getting starved
which leads to other issues.  In these cases we recommend increasing
the number of HK cpus.  But... that only helps the userspace tasks
somewhat. It does not help the actual housekeeping part.

It seems clear to me that the intent of the cpumask_any_and() calls
is to pick _any_ cpu in the hk mask. Not just the first, otherwise
it would just use cpumask_first_and().

I'm open to alternate suggestions of how to fix this.


Cheers,
Phil

> > 
> > I do have a patch that has this controlled by a sched feature if that
> > is of interest. Then it could be disabled if you don't want it.
> 
> Vishal
> > 
> > Cheers,
> > Phil
> > 
> > > 
> > > Thanks,
> > > Madadi Vineeth Reddy
> > > 
> > > > Signed-off-by: Phil Auld <pauld@redhat.com>
> > > > Cc: Peter Zijlstra <peterz@infradead.org>
> > > > Cc: Juri Lelli <juri.lelli@redhat.com>
> > > > Cc: Frederic Weisbecker <frederic@kernel.org>
> > > > Cc: Waiman Long <longman@redhat.com>
> > > > Cc: linux-kernel@vger.kernel.org
> > > > ---
> > > >  kernel/sched/isolation.c | 2 +-
> > > >  kernel/sched/topology.c  | 2 +-
> > > >  2 files changed, 2 insertions(+), 2 deletions(-)
> > > > 
> > > > diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
> > > > index 81bc8b329ef1..93b038d48900 100644
> > > > --- a/kernel/sched/isolation.c
> > > > +++ b/kernel/sched/isolation.c
> > > > @@ -40,7 +40,7 @@ int housekeeping_any_cpu(enum hk_type type)
> > > >  			if (cpu < nr_cpu_ids)
> > > >  				return cpu;
> > > >  
> > > > -			cpu = cpumask_any_and(housekeeping.cpumasks[type], cpu_online_mask);
> > > > +			cpu = cpumask_any_and_distribute(housekeeping.cpumasks[type], cpu_online_mask);
> > > >  			if (likely(cpu < nr_cpu_ids))
> > > >  				return cpu;
> > > >  			/*
> > > > diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> > > > index c49aea8c1025..94133f843485 100644
> > > > --- a/kernel/sched/topology.c
> > > > +++ b/kernel/sched/topology.c
> > > > @@ -2101,7 +2101,7 @@ int sched_numa_find_closest(const struct cpumask *cpus, int cpu)
> > > >  	for (i = 0; i < sched_domains_numa_levels; i++) {
> > > >  		if (!masks[i][j])
> > > >  			break;
> > > > -		cpu = cpumask_any_and(cpus, masks[i][j]);
> > > > +		cpu = cpumask_any_and_distribute(cpus, masks[i][j]);
> > > >  		if (cpu < nr_cpu_ids) {
> > > >  			found = cpu;
> > > >  			break;
> > > 
> > 
> > -- 
> > 
> 

-- 


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [CHANGE 1/2] sched/isolation: Make use of more than one housekeeping cpu
  2025-02-18 15:00       ` Phil Auld
@ 2025-02-18 15:23         ` Waiman Long
  2025-02-18 15:30           ` Phil Auld
  2025-02-20  9:20         ` Vishal Chourasia
  1 sibling, 1 reply; 13+ messages in thread
From: Waiman Long @ 2025-02-18 15:23 UTC (permalink / raw)
  To: Phil Auld, Vishal Chourasia
  Cc: Madadi Vineeth Reddy, Peter Zijlstra, Juri Lelli,
	Frederic Weisbecker, linux-kernel


On 2/18/25 10:00 AM, Phil Auld wrote:
> Hi Vishal.
>
> On Fri, Feb 14, 2025 at 11:08:19AM +0530 Vishal Chourasia wrote:
>> Hi Phil, Vineeth
>>
>> On Thu, Feb 13, 2025 at 09:26:53AM -0500, Phil Auld wrote:
>>> On Thu, Feb 13, 2025 at 10:14:04AM +0530 Madadi Vineeth Reddy wrote:
>>>> Hi Phil Auld,
>>>>
>>>> On 11/02/25 19:31, Phil Auld wrote:
>>>>> The exising code uses housekeeping_any_cpu() to select a cpu for
>>>>> a given housekeeping task. However, this often ends up calling
>>>>> cpumask_any_and() which is defined as cpumask_first_and() which has
>>>>> the effect of alyways using the first cpu among those available.
>>>>>
>>>>> The same applies when multiple NUMA nodes are involved. In that
>>>>> case the first cpu in the local node is chosen which does provide
>>>>> a bit of spreading but with multiple HK cpus per node the same
>>>>> issues arise.
>>>>>
>>>>> Spread the HK work out by having housekeeping_any_cpu() and
>>>>> sched_numa_find_closest() use cpumask_any_and_distribute()
>>>>> instead of cpumask_any_and().
>>>>>
>>>> Got the overall intent of the patch for better load distribution on
>>>> housekeeping tasks. However, one potential drawback could be that by
>>>> spreading HK work across multiple CPUs might reduce the time that
>>>> some cores can spend in deeper idle states which can be beneficial for
>>>> power-sensitive systems.
>>>>
>>>> Thoughts?
>>> NOHZ_full setups are not generally used in power sensitive systems I think.
>>> They aren't in our use cases at least.
>>>
>>> In cases with many cpus a single housekeeping cpu can not keep up. Having
>>> other HK cpus in deep idle states while the one in use is overloaded is
>>> not a win.
>> To me, an overloaded CPU sounds like where more than one tasks are ready
>> to run, and a HK CPU is one receiving periodic scheduling clock
>> ticks, so HP CPU is bound to comes out of any power-saving state it is in.
> If the overload is caused by HK and interrupts there is nothing in the
> system to help. Tasks, sure, can get load balanced.
>
> And as you say, the HK cpus will have generally ticks happening anyway.
>
>>> If your single HK cpu can keep up then only configure that one HK cpu.
>>> The others will go idle and stay there.  And since they are nohz_full
>>> might get to stay idle even longer.
>> While it is good to distribute the load across each HK CPU in the HK
>> cpumask (queuing jobs on different CPUs each time), this can cause
>> jitter in virtualized environments. Unnecessaryily evicting other
>> tenants, when it's better to overload a VP than to wake up other VPs of a
>> tenant.
>>
> Sorry I'm not sure I understand your setup. Are your running virtual
> tenants on the HK cpus?  nohz_full in the guests? Maybe you only need
> on HK then it won't matter.
>
> My concern is that currently there is no point in having more than
> one HK cpu (per node in a NUMA case). The code as currently implemented
> is just not doing what it needs to.
>
> We have numerous cases where a single HK cpu just cannot keep up and
> the remote_tick warning fires. It also can lead to the other things
> (orchastration sw, HA keepalives etc) on the HK cpus getting starved
> which leads to other issues.  In these cases we recommend increasing
> the number of HK cpus.  But... that only helps the userspace tasks
> somewhat. It does not help the actual housekeeping part.

That is the part that should go into the commit log as well as it is the 
rationale behind your patch.

Cheers,
Longman


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [CHANGE 1/2] sched/isolation: Make use of more than one housekeeping cpu
  2025-02-18 15:23         ` Waiman Long
@ 2025-02-18 15:30           ` Phil Auld
  2025-02-18 15:33             ` Waiman Long
  0 siblings, 1 reply; 13+ messages in thread
From: Phil Auld @ 2025-02-18 15:30 UTC (permalink / raw)
  To: Waiman Long
  Cc: Vishal Chourasia, Madadi Vineeth Reddy, Peter Zijlstra,
	Juri Lelli, Frederic Weisbecker, linux-kernel

On Tue, Feb 18, 2025 at 10:23:50AM -0500 Waiman Long wrote:
> 
> On 2/18/25 10:00 AM, Phil Auld wrote:
> > Hi Vishal.
> > 
> > On Fri, Feb 14, 2025 at 11:08:19AM +0530 Vishal Chourasia wrote:
> > > Hi Phil, Vineeth
> > > 
> > > On Thu, Feb 13, 2025 at 09:26:53AM -0500, Phil Auld wrote:
> > > > On Thu, Feb 13, 2025 at 10:14:04AM +0530 Madadi Vineeth Reddy wrote:
> > > > > Hi Phil Auld,
> > > > > 
> > > > > On 11/02/25 19:31, Phil Auld wrote:
> > > > > > The exising code uses housekeeping_any_cpu() to select a cpu for
> > > > > > a given housekeeping task. However, this often ends up calling
> > > > > > cpumask_any_and() which is defined as cpumask_first_and() which has
> > > > > > the effect of alyways using the first cpu among those available.
> > > > > > 
> > > > > > The same applies when multiple NUMA nodes are involved. In that
> > > > > > case the first cpu in the local node is chosen which does provide
> > > > > > a bit of spreading but with multiple HK cpus per node the same
> > > > > > issues arise.
> > > > > > 
> > > > > > Spread the HK work out by having housekeeping_any_cpu() and
> > > > > > sched_numa_find_closest() use cpumask_any_and_distribute()
> > > > > > instead of cpumask_any_and().
> > > > > > 
> > > > > Got the overall intent of the patch for better load distribution on
> > > > > housekeeping tasks. However, one potential drawback could be that by
> > > > > spreading HK work across multiple CPUs might reduce the time that
> > > > > some cores can spend in deeper idle states which can be beneficial for
> > > > > power-sensitive systems.
> > > > > 
> > > > > Thoughts?
> > > > NOHZ_full setups are not generally used in power sensitive systems I think.
> > > > They aren't in our use cases at least.
> > > > 
> > > > In cases with many cpus a single housekeeping cpu can not keep up. Having
> > > > other HK cpus in deep idle states while the one in use is overloaded is
> > > > not a win.
> > > To me, an overloaded CPU sounds like where more than one tasks are ready
> > > to run, and a HK CPU is one receiving periodic scheduling clock
> > > ticks, so HP CPU is bound to comes out of any power-saving state it is in.
> > If the overload is caused by HK and interrupts there is nothing in the
> > system to help. Tasks, sure, can get load balanced.
> > 
> > And as you say, the HK cpus will have generally ticks happening anyway.
> > 
> > > > If your single HK cpu can keep up then only configure that one HK cpu.
> > > > The others will go idle and stay there.  And since they are nohz_full
> > > > might get to stay idle even longer.
> > > While it is good to distribute the load across each HK CPU in the HK
> > > cpumask (queuing jobs on different CPUs each time), this can cause
> > > jitter in virtualized environments. Unnecessaryily evicting other
> > > tenants, when it's better to overload a VP than to wake up other VPs of a
> > > tenant.
> > > 
> > Sorry I'm not sure I understand your setup. Are your running virtual
> > tenants on the HK cpus?  nohz_full in the guests? Maybe you only need
> > on HK then it won't matter.
> > 
> > My concern is that currently there is no point in having more than
> > one HK cpu (per node in a NUMA case). The code as currently implemented
> > is just not doing what it needs to.
> > 
> > We have numerous cases where a single HK cpu just cannot keep up and
> > the remote_tick warning fires. It also can lead to the other things
> > (orchastration sw, HA keepalives etc) on the HK cpus getting starved
> > which leads to other issues.  In these cases we recommend increasing
> > the number of HK cpus.  But... that only helps the userspace tasks
> > somewhat. It does not help the actual housekeeping part.
> 
> That is the part that should go into the commit log as well as it is the
> rationale behind your patch.
>

Sure, I can add that piece and resend.


Cheers,
Phil


> Cheers,
> Longman
> 

-- 


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [CHANGE 1/2] sched/isolation: Make use of more than one housekeeping cpu
  2025-02-18 15:30           ` Phil Auld
@ 2025-02-18 15:33             ` Waiman Long
  0 siblings, 0 replies; 13+ messages in thread
From: Waiman Long @ 2025-02-18 15:33 UTC (permalink / raw)
  To: Phil Auld, Waiman Long
  Cc: Vishal Chourasia, Madadi Vineeth Reddy, Peter Zijlstra,
	Juri Lelli, Frederic Weisbecker, linux-kernel

On 2/18/25 10:30 AM, Phil Auld wrote:
> On Tue, Feb 18, 2025 at 10:23:50AM -0500 Waiman Long wrote:
>> On 2/18/25 10:00 AM, Phil Auld wrote:
>>> Hi Vishal.
>>>
>>> On Fri, Feb 14, 2025 at 11:08:19AM +0530 Vishal Chourasia wrote:
>>>> Hi Phil, Vineeth
>>>>
>>>> On Thu, Feb 13, 2025 at 09:26:53AM -0500, Phil Auld wrote:
>>>>> On Thu, Feb 13, 2025 at 10:14:04AM +0530 Madadi Vineeth Reddy wrote:
>>>>>> Hi Phil Auld,
>>>>>>
>>>>>> On 11/02/25 19:31, Phil Auld wrote:
>>>>>>> The exising code uses housekeeping_any_cpu() to select a cpu for
>>>>>>> a given housekeeping task. However, this often ends up calling
>>>>>>> cpumask_any_and() which is defined as cpumask_first_and() which has
>>>>>>> the effect of alyways using the first cpu among those available.
>>>>>>>
>>>>>>> The same applies when multiple NUMA nodes are involved. In that
>>>>>>> case the first cpu in the local node is chosen which does provide
>>>>>>> a bit of spreading but with multiple HK cpus per node the same
>>>>>>> issues arise.
>>>>>>>
>>>>>>> Spread the HK work out by having housekeeping_any_cpu() and
>>>>>>> sched_numa_find_closest() use cpumask_any_and_distribute()
>>>>>>> instead of cpumask_any_and().
>>>>>>>
>>>>>> Got the overall intent of the patch for better load distribution on
>>>>>> housekeeping tasks. However, one potential drawback could be that by
>>>>>> spreading HK work across multiple CPUs might reduce the time that
>>>>>> some cores can spend in deeper idle states which can be beneficial for
>>>>>> power-sensitive systems.
>>>>>>
>>>>>> Thoughts?
>>>>> NOHZ_full setups are not generally used in power sensitive systems I think.
>>>>> They aren't in our use cases at least.
>>>>>
>>>>> In cases with many cpus a single housekeeping cpu can not keep up. Having
>>>>> other HK cpus in deep idle states while the one in use is overloaded is
>>>>> not a win.
>>>> To me, an overloaded CPU sounds like where more than one tasks are ready
>>>> to run, and a HK CPU is one receiving periodic scheduling clock
>>>> ticks, so HP CPU is bound to comes out of any power-saving state it is in.
>>> If the overload is caused by HK and interrupts there is nothing in the
>>> system to help. Tasks, sure, can get load balanced.
>>>
>>> And as you say, the HK cpus will have generally ticks happening anyway.
>>>
>>>>> If your single HK cpu can keep up then only configure that one HK cpu.
>>>>> The others will go idle and stay there.  And since they are nohz_full
>>>>> might get to stay idle even longer.
>>>> While it is good to distribute the load across each HK CPU in the HK
>>>> cpumask (queuing jobs on different CPUs each time), this can cause
>>>> jitter in virtualized environments. Unnecessaryily evicting other
>>>> tenants, when it's better to overload a VP than to wake up other VPs of a
>>>> tenant.
>>>>
>>> Sorry I'm not sure I understand your setup. Are your running virtual
>>> tenants on the HK cpus?  nohz_full in the guests? Maybe you only need
>>> on HK then it won't matter.
>>>
>>> My concern is that currently there is no point in having more than
>>> one HK cpu (per node in a NUMA case). The code as currently implemented
>>> is just not doing what it needs to.
>>>
>>> We have numerous cases where a single HK cpu just cannot keep up and
>>> the remote_tick warning fires. It also can lead to the other things
>>> (orchastration sw, HA keepalives etc) on the HK cpus getting starved
>>> which leads to other issues.  In these cases we recommend increasing
>>> the number of HK cpus.  But... that only helps the userspace tasks
>>> somewhat. It does not help the actual housekeeping part.
>> That is the part that should go into the commit log as well as it is the
>> rationale behind your patch.
>>
> Sure, I can add that piece and resend.

While at it, you can also add some text to address the other concerns 
that reviewers have so far.

Cheers,
Longman

>
>
> Cheers,
> Phil
>
>
>> Cheers,
>> Longman
>>


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [CHANGE 1/2] sched/isolation: Make use of more than one housekeeping cpu
  2025-02-18 15:00       ` Phil Auld
  2025-02-18 15:23         ` Waiman Long
@ 2025-02-20  9:20         ` Vishal Chourasia
  2025-02-20 15:52           ` Phil Auld
  1 sibling, 1 reply; 13+ messages in thread
From: Vishal Chourasia @ 2025-02-20  9:20 UTC (permalink / raw)
  To: Phil Auld
  Cc: Madadi Vineeth Reddy, Peter Zijlstra, Juri Lelli,
	Frederic Weisbecker, Waiman Long, linux-kernel

On Tue, Feb 18, 2025 at 10:00:59AM -0500, Phil Auld wrote:
> Hi Vishal.
> 
> On Fri, Feb 14, 2025 at 11:08:19AM +0530 Vishal Chourasia wrote:
> > Hi Phil, Vineeth
> > 
> > On Thu, Feb 13, 2025 at 09:26:53AM -0500, Phil Auld wrote:
> > > On Thu, Feb 13, 2025 at 10:14:04AM +0530 Madadi Vineeth Reddy wrote:
> > > > Hi Phil Auld,
> > > > 
> > > > On 11/02/25 19:31, Phil Auld wrote:
> > > > > The exising code uses housekeeping_any_cpu() to select a cpu for
> > > > > a given housekeeping task. However, this often ends up calling
> > > > > cpumask_any_and() which is defined as cpumask_first_and() which has
> > > > > the effect of alyways using the first cpu among those available.
> > > > > 
> > > > > The same applies when multiple NUMA nodes are involved. In that
> > > > > case the first cpu in the local node is chosen which does provide
> > > > > a bit of spreading but with multiple HK cpus per node the same
> > > > > issues arise.
> > > > > 
> > > > > Spread the HK work out by having housekeeping_any_cpu() and
> > > > > sched_numa_find_closest() use cpumask_any_and_distribute()
> > > > > instead of cpumask_any_and().
> > > > > 
> > > > 
> > > > Got the overall intent of the patch for better load distribution on
> > > > housekeeping tasks. However, one potential drawback could be that by
> > > > spreading HK work across multiple CPUs might reduce the time that
> > > > some cores can spend in deeper idle states which can be beneficial for
> > > > power-sensitive systems.
> > > > 
> > > > Thoughts?
> > > 
> > > NOHZ_full setups are not generally used in power sensitive systems I think.
> > > They aren't in our use cases at least. 
> > > 
> > > In cases with many cpus a single housekeeping cpu can not keep up. Having
> > > other HK cpus in deep idle states while the one in use is overloaded is
> > > not a win. 
> > 
> > To me, an overloaded CPU sounds like where more than one tasks are ready
> > to run, and a HK CPU is one receiving periodic scheduling clock
> > ticks, so HP CPU is bound to comes out of any power-saving state it is in.
> 
> If the overload is caused by HK and interrupts there is nothing in the
> system to help. Tasks, sure, can get load balanced.
> 
> And as you say, the HK cpus will have generally ticks happening anyway.
> 
> > > 
> > > If your single HK cpu can keep up then only configure that one HK cpu.
> > > The others will go idle and stay there.  And since they are nohz_full
> > > might get to stay idle even longer.
> > While it is good to distribute the load across each HK CPU in the HK 
> > cpumask (queuing jobs on different CPUs each time), this can cause
> > jitter in virtualized environments. Unnecessaryily evicting other
> > tenants, when it's better to overload a VP than to wake up other VPs of a
> > tenant.
> > 
> 
> Sorry I'm not sure I understand your setup. Are your running virtual
> tenants on the HK cpus?  nohz_full in the guests? Maybe you only need
> on HK then it won't matter.
> 
Firstly, I am unaware if nohz_full is being used in virtualized environment.
Please correct me it is. I am not saying it can't or shouldn't be used, it's just
I don't know if anybody is using it.

nohz_full in guests would mean that tick is disabled inside the guest but
the host might still be getting ticks. So, I am unsure, whether it is a good
idea to have nohz_full in virtualized environment.

Nevertheless, the idea of nohz_full is to reduce to the kernel interference
for CPUs marked as nohz_full. And, it does help with guest interference.

I would like to mention, In SPLPAR environment, scheduling work on
different HK CPU each time can caused VM preemption in a multi-tenant
setup in cases where CPUs in HK cpumask spans across VPs, its better to
consolidate them within few VPs.

VP is virtual core/processor.

> My concern is that currently there is no point in having more than
> one HK cpu (per node in a NUMA case). The code as currently implemented
> is just not doing what it needs to.
> 
> We have numerous cases where a single HK cpu just cannot keep up and
> the remote_tick warning fires. It also can lead to the other things
> (orchastration sw, HA keepalives etc) on the HK cpus getting starved
> which leads to other issues.  In these cases we recommend increasing
> the number of HK cpus.  But... that only helps the userspace tasks
> somewhat. It does not help the actual housekeeping part.
> 
> It seems clear to me that the intent of the cpumask_any_and() calls
> is to pick _any_ cpu in the hk mask. Not just the first, otherwise
> it would just use cpumask_first_and().
> 
> I'm open to alternate suggestions of how to fix this.
Your approach looks good to me.

I wanted to mention the case of overcommitted multi-tenant setup
where this will cause noisy neighbour sort of situation, but this can be
avoided by carefully selecting HK CPUs.

Thanks,
vishalc
> 
> 
> Cheers,
> Phil
> 
> > > 
> > > I do have a patch that has this controlled by a sched feature if that
> > > is of interest. Then it could be disabled if you don't want it.
> > 
> > Vishal
> > > 
> > > Cheers,
> > > Phil
> > > 
> > > > 
> > > > Thanks,
> > > > Madadi Vineeth Reddy
> > > > 
> > > > > Signed-off-by: Phil Auld <pauld@redhat.com>
> > > > > Cc: Peter Zijlstra <peterz@infradead.org>
> > > > > Cc: Juri Lelli <juri.lelli@redhat.com>
> > > > > Cc: Frederic Weisbecker <frederic@kernel.org>
> > > > > Cc: Waiman Long <longman@redhat.com>
> > > > > Cc: linux-kernel@vger.kernel.org
> > > > > ---
> > > > >  kernel/sched/isolation.c | 2 +-
> > > > >  kernel/sched/topology.c  | 2 +-
> > > > >  2 files changed, 2 insertions(+), 2 deletions(-)
> > > > > 
> > > > > diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
> > > > > index 81bc8b329ef1..93b038d48900 100644
> > > > > --- a/kernel/sched/isolation.c
> > > > > +++ b/kernel/sched/isolation.c
> > > > > @@ -40,7 +40,7 @@ int housekeeping_any_cpu(enum hk_type type)
> > > > >  			if (cpu < nr_cpu_ids)
> > > > >  				return cpu;
> > > > >  
> > > > > -			cpu = cpumask_any_and(housekeeping.cpumasks[type], cpu_online_mask);
> > > > > +			cpu = cpumask_any_and_distribute(housekeeping.cpumasks[type], cpu_online_mask);
> > > > >  			if (likely(cpu < nr_cpu_ids))
> > > > >  				return cpu;
> > > > >  			/*
> > > > > diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> > > > > index c49aea8c1025..94133f843485 100644
> > > > > --- a/kernel/sched/topology.c
> > > > > +++ b/kernel/sched/topology.c
> > > > > @@ -2101,7 +2101,7 @@ int sched_numa_find_closest(const struct cpumask *cpus, int cpu)
> > > > >  	for (i = 0; i < sched_domains_numa_levels; i++) {
> > > > >  		if (!masks[i][j])
> > > > >  			break;
> > > > > -		cpu = cpumask_any_and(cpus, masks[i][j]);
> > > > > +		cpu = cpumask_any_and_distribute(cpus, masks[i][j]);
> > > > >  		if (cpu < nr_cpu_ids) {
> > > > >  			found = cpu;
> > > > >  			break;
> > > > 
> > > 
> > > -- 
> > > 
> > 
> 
> -- 
> 

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [CHANGE 1/2] sched/isolation: Make use of more than one housekeeping cpu
  2025-02-20  9:20         ` Vishal Chourasia
@ 2025-02-20 15:52           ` Phil Auld
  2025-02-21  4:42             ` Vishal Chourasia
  0 siblings, 1 reply; 13+ messages in thread
From: Phil Auld @ 2025-02-20 15:52 UTC (permalink / raw)
  To: Vishal Chourasia
  Cc: Madadi Vineeth Reddy, Peter Zijlstra, Juri Lelli,
	Frederic Weisbecker, Waiman Long, linux-kernel


Hi Vishal,

On Thu, Feb 20, 2025 at 02:50:40PM +0530 Vishal Chourasia wrote:
> On Tue, Feb 18, 2025 at 10:00:59AM -0500, Phil Auld wrote:
> > Hi Vishal.
> > 
> > On Fri, Feb 14, 2025 at 11:08:19AM +0530 Vishal Chourasia wrote:
> > > Hi Phil, Vineeth
> > > 
> > > On Thu, Feb 13, 2025 at 09:26:53AM -0500, Phil Auld wrote:
> > > > On Thu, Feb 13, 2025 at 10:14:04AM +0530 Madadi Vineeth Reddy wrote:
> > > > > Hi Phil Auld,
> > > > > 
> > > > > On 11/02/25 19:31, Phil Auld wrote:
> > > > > > The exising code uses housekeeping_any_cpu() to select a cpu for
> > > > > > a given housekeeping task. However, this often ends up calling
> > > > > > cpumask_any_and() which is defined as cpumask_first_and() which has
> > > > > > the effect of alyways using the first cpu among those available.
> > > > > > 
> > > > > > The same applies when multiple NUMA nodes are involved. In that
> > > > > > case the first cpu in the local node is chosen which does provide
> > > > > > a bit of spreading but with multiple HK cpus per node the same
> > > > > > issues arise.
> > > > > > 
> > > > > > Spread the HK work out by having housekeeping_any_cpu() and
> > > > > > sched_numa_find_closest() use cpumask_any_and_distribute()
> > > > > > instead of cpumask_any_and().
> > > > > > 
> > > > > 
> > > > > Got the overall intent of the patch for better load distribution on
> > > > > housekeeping tasks. However, one potential drawback could be that by
> > > > > spreading HK work across multiple CPUs might reduce the time that
> > > > > some cores can spend in deeper idle states which can be beneficial for
> > > > > power-sensitive systems.
> > > > > 
> > > > > Thoughts?
> > > > 
> > > > NOHZ_full setups are not generally used in power sensitive systems I think.
> > > > They aren't in our use cases at least. 
> > > > 
> > > > In cases with many cpus a single housekeeping cpu can not keep up. Having
> > > > other HK cpus in deep idle states while the one in use is overloaded is
> > > > not a win. 
> > > 
> > > To me, an overloaded CPU sounds like where more than one tasks are ready
> > > to run, and a HK CPU is one receiving periodic scheduling clock
> > > ticks, so HP CPU is bound to comes out of any power-saving state it is in.
> > 
> > If the overload is caused by HK and interrupts there is nothing in the
> > system to help. Tasks, sure, can get load balanced.
> > 
> > And as you say, the HK cpus will have generally ticks happening anyway.
> > 
> > > > 
> > > > If your single HK cpu can keep up then only configure that one HK cpu.
> > > > The others will go idle and stay there.  And since they are nohz_full
> > > > might get to stay idle even longer.
> > > While it is good to distribute the load across each HK CPU in the HK 
> > > cpumask (queuing jobs on different CPUs each time), this can cause
> > > jitter in virtualized environments. Unnecessaryily evicting other
> > > tenants, when it's better to overload a VP than to wake up other VPs of a
> > > tenant.
> > > 
> > 
> > Sorry I'm not sure I understand your setup. Are your running virtual
> > tenants on the HK cpus?  nohz_full in the guests? Maybe you only need
> > on HK then it won't matter.
> > 
> Firstly, I am unaware if nohz_full is being used in virtualized environment.
> Please correct me it is. I am not saying it can't or shouldn't be used, it's just
> I don't know if anybody is using it.
>

I've seen some people trying it in various ways and to varying degrees of
success if I recall correctly.


> nohz_full in guests would mean that tick is disabled inside the guest but
> the host might still be getting ticks. So, I am unsure, whether it is a good
> idea to have nohz_full in virtualized environment.
> 
> Nevertheless, the idea of nohz_full is to reduce to the kernel interference
> for CPUs marked as nohz_full. And, it does help with guest interference.
> 
> I would like to mention, In SPLPAR environment, scheduling work on
> different HK CPU each time can caused VM preemption in a multi-tenant
> setup in cases where CPUs in HK cpumask spans across VPs, its better to
> consolidate them within few VPs.
> 
> VP is virtual core/processor.

I have a hard time reconciling you saying you are not using virtualization
and then talking about VMs and VPs ;)

Sometimes I forget how interesting PPC can be...

> 
> > My concern is that currently there is no point in having more than
> > one HK cpu (per node in a NUMA case). The code as currently implemented
> > is just not doing what it needs to.
> > 
> > We have numerous cases where a single HK cpu just cannot keep up and
> > the remote_tick warning fires. It also can lead to the other things
> > (orchastration sw, HA keepalives etc) on the HK cpus getting starved
> > which leads to other issues.  In these cases we recommend increasing
> > the number of HK cpus.  But... that only helps the userspace tasks
> > somewhat. It does not help the actual housekeeping part.
> > 
> > It seems clear to me that the intent of the cpumask_any_and() calls
> > is to pick _any_ cpu in the hk mask. Not just the first, otherwise
> > it would just use cpumask_first_and().
> > 
> > I'm open to alternate suggestions of how to fix this.
> Your approach looks good to me.
> 
> I wanted to mention the case of overcommitted multi-tenant setup
> where this will cause noisy neighbour sort of situation, but this can be
> avoided by carefully selecting HK CPUs.

Yes, it can be configured away, hopefully.

Still, this could easily be a sched feat if that make sense to people.


Thanks for the review!


Cheers,
Phil



> 
> Thanks,
> vishalc
> > 
> > 
> > Cheers,
> > Phil
> > 
> > > > 
> > > > I do have a patch that has this controlled by a sched feature if that
> > > > is of interest. Then it could be disabled if you don't want it.
> > > 
> > > Vishal
> > > > 
> > > > Cheers,
> > > > Phil
> > > > 
> > > > > 
> > > > > Thanks,
> > > > > Madadi Vineeth Reddy
> > > > > 
> > > > > > Signed-off-by: Phil Auld <pauld@redhat.com>
> > > > > > Cc: Peter Zijlstra <peterz@infradead.org>
> > > > > > Cc: Juri Lelli <juri.lelli@redhat.com>
> > > > > > Cc: Frederic Weisbecker <frederic@kernel.org>
> > > > > > Cc: Waiman Long <longman@redhat.com>
> > > > > > Cc: linux-kernel@vger.kernel.org
> > > > > > ---
> > > > > >  kernel/sched/isolation.c | 2 +-
> > > > > >  kernel/sched/topology.c  | 2 +-
> > > > > >  2 files changed, 2 insertions(+), 2 deletions(-)
> > > > > > 
> > > > > > diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
> > > > > > index 81bc8b329ef1..93b038d48900 100644
> > > > > > --- a/kernel/sched/isolation.c
> > > > > > +++ b/kernel/sched/isolation.c
> > > > > > @@ -40,7 +40,7 @@ int housekeeping_any_cpu(enum hk_type type)
> > > > > >  			if (cpu < nr_cpu_ids)
> > > > > >  				return cpu;
> > > > > >  
> > > > > > -			cpu = cpumask_any_and(housekeeping.cpumasks[type], cpu_online_mask);
> > > > > > +			cpu = cpumask_any_and_distribute(housekeeping.cpumasks[type], cpu_online_mask);
> > > > > >  			if (likely(cpu < nr_cpu_ids))
> > > > > >  				return cpu;
> > > > > >  			/*
> > > > > > diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> > > > > > index c49aea8c1025..94133f843485 100644
> > > > > > --- a/kernel/sched/topology.c
> > > > > > +++ b/kernel/sched/topology.c
> > > > > > @@ -2101,7 +2101,7 @@ int sched_numa_find_closest(const struct cpumask *cpus, int cpu)
> > > > > >  	for (i = 0; i < sched_domains_numa_levels; i++) {
> > > > > >  		if (!masks[i][j])
> > > > > >  			break;
> > > > > > -		cpu = cpumask_any_and(cpus, masks[i][j]);
> > > > > > +		cpu = cpumask_any_and_distribute(cpus, masks[i][j]);
> > > > > >  		if (cpu < nr_cpu_ids) {
> > > > > >  			found = cpu;
> > > > > >  			break;
> > > > > 
> > > > 
> > > > -- 
> > > > 
> > > 
> > 
> > -- 
> > 
> 

-- 


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [CHANGE 1/2] sched/isolation: Make use of more than one housekeeping cpu
  2025-02-20 15:52           ` Phil Auld
@ 2025-02-21  4:42             ` Vishal Chourasia
  0 siblings, 0 replies; 13+ messages in thread
From: Vishal Chourasia @ 2025-02-21  4:42 UTC (permalink / raw)
  To: Phil Auld
  Cc: Madadi Vineeth Reddy, Peter Zijlstra, Juri Lelli,
	Frederic Weisbecker, Waiman Long, linux-kernel

On Thu, Feb 20, 2025 at 10:52:29AM -0500, Phil Auld wrote:
> 
> Hi Vishal,
> 
> On Thu, Feb 20, 2025 at 02:50:40PM +0530 Vishal Chourasia wrote:
> > On Tue, Feb 18, 2025 at 10:00:59AM -0500, Phil Auld wrote:
> > > Hi Vishal.
> > > 
> > > On Fri, Feb 14, 2025 at 11:08:19AM +0530 Vishal Chourasia wrote:
> > > > Hi Phil, Vineeth
> > > > 
> > > > On Thu, Feb 13, 2025 at 09:26:53AM -0500, Phil Auld wrote:
> > > > > On Thu, Feb 13, 2025 at 10:14:04AM +0530 Madadi Vineeth Reddy wrote:
> > > > > > Hi Phil Auld,
> > > > > > 
> > > > > > On 11/02/25 19:31, Phil Auld wrote:
> > > > > > > The exising code uses housekeeping_any_cpu() to select a cpu for
> > > > > > > a given housekeeping task. However, this often ends up calling
> > > > > > > cpumask_any_and() which is defined as cpumask_first_and() which has
> > > > > > > the effect of alyways using the first cpu among those available.
> > > > > > > 
> > > > > > > The same applies when multiple NUMA nodes are involved. In that
> > > > > > > case the first cpu in the local node is chosen which does provide
> > > > > > > a bit of spreading but with multiple HK cpus per node the same
> > > > > > > issues arise.
> > > > > > > 
> > > > > > > Spread the HK work out by having housekeeping_any_cpu() and
> > > > > > > sched_numa_find_closest() use cpumask_any_and_distribute()
> > > > > > > instead of cpumask_any_and().
> > > > > > > 
> > > > > > 
> > > > > > Got the overall intent of the patch for better load distribution on
> > > > > > housekeeping tasks. However, one potential drawback could be that by
> > > > > > spreading HK work across multiple CPUs might reduce the time that
> > > > > > some cores can spend in deeper idle states which can be beneficial for
> > > > > > power-sensitive systems.
> > > > > > 
> > > > > > Thoughts?
> > > > > 
> > > > > NOHZ_full setups are not generally used in power sensitive systems I think.
> > > > > They aren't in our use cases at least. 
> > > > > 
> > > > > In cases with many cpus a single housekeeping cpu can not keep up. Having
> > > > > other HK cpus in deep idle states while the one in use is overloaded is
> > > > > not a win. 
> > > > 
> > > > To me, an overloaded CPU sounds like where more than one tasks are ready
> > > > to run, and a HK CPU is one receiving periodic scheduling clock
> > > > ticks, so HP CPU is bound to comes out of any power-saving state it is in.
> > > 
> > > If the overload is caused by HK and interrupts there is nothing in the
> > > system to help. Tasks, sure, can get load balanced.
> > > 
> > > And as you say, the HK cpus will have generally ticks happening anyway.
> > > 
> > > > > 
> > > > > If your single HK cpu can keep up then only configure that one HK cpu.
> > > > > The others will go idle and stay there.  And since they are nohz_full
> > > > > might get to stay idle even longer.
> > > > While it is good to distribute the load across each HK CPU in the HK 
> > > > cpumask (queuing jobs on different CPUs each time), this can cause
> > > > jitter in virtualized environments. Unnecessaryily evicting other
> > > > tenants, when it's better to overload a VP than to wake up other VPs of a
> > > > tenant.
> > > > 
> > > 
> > > Sorry I'm not sure I understand your setup. Are your running virtual
> > > tenants on the HK cpus?  nohz_full in the guests? Maybe you only need
> > > on HK then it won't matter.
> > > 
> > Firstly, I am unaware if nohz_full is being used in virtualized environment.
> > Please correct me it is. I am not saying it can't or shouldn't be used, it's just
> > I don't know if anybody is using it.
> >
> 
> I've seen some people trying it in various ways and to varying degrees of
> success if I recall correctly.
> 
> 
> > nohz_full in guests would mean that tick is disabled inside the guest but
> > the host might still be getting ticks. So, I am unsure, whether it is a good
> > idea to have nohz_full in virtualized environment.
> > 
> > Nevertheless, the idea of nohz_full is to reduce to the kernel interference
> > for CPUs marked as nohz_full. And, it does help with guest interference.
> > 
> > I would like to mention, In SPLPAR environment, scheduling work on
> > different HK CPU each time can caused VM preemption in a multi-tenant
> > setup in cases where CPUs in HK cpumask spans across VPs, its better to
> > consolidate them within few VPs.
> > 
> > VP is virtual core/processor.
> 
> I have a hard time reconciling you saying you are not using virtualization
> and then talking about VMs and VPs ;)
I was thinking of a possibility in virtualized environments, this isn't
specific to PPC 

Thanks
vishalc :)
> 
> Sometimes I forget how interesting PPC can be...
> 
> > 
> > > My concern is that currently there is no point in having more than
> > > one HK cpu (per node in a NUMA case). The code as currently implemented
> > > is just not doing what it needs to.
> > > 
> > > We have numerous cases where a single HK cpu just cannot keep up and
> > > the remote_tick warning fires. It also can lead to the other things
> > > (orchastration sw, HA keepalives etc) on the HK cpus getting starved
> > > which leads to other issues.  In these cases we recommend increasing
> > > the number of HK cpus.  But... that only helps the userspace tasks
> > > somewhat. It does not help the actual housekeeping part.
> > > 
> > > It seems clear to me that the intent of the cpumask_any_and() calls
> > > is to pick _any_ cpu in the hk mask. Not just the first, otherwise
> > > it would just use cpumask_first_and().
> > > 
> > > I'm open to alternate suggestions of how to fix this.
> > Your approach looks good to me.
> > 
> > I wanted to mention the case of overcommitted multi-tenant setup
> > where this will cause noisy neighbour sort of situation, but this can be
> > avoided by carefully selecting HK CPUs.
> 
> Yes, it can be configured away, hopefully.
> 
> Still, this could easily be a sched feat if that make sense to people.
> 
> 
> Thanks for the review!
> 
> 
> Cheers,
> Phil
> 
> 
> 
> > 
> > Thanks,
> > vishalc
> > > 
> > > 
> > > Cheers,
> > > Phil
> > > 
> > > > > 
> > > > > I do have a patch that has this controlled by a sched feature if that
> > > > > is of interest. Then it could be disabled if you don't want it.
> > > > 
> > > > Vishal
> > > > > 
> > > > > Cheers,
> > > > > Phil
> > > > > 
> > > > > > 
> > > > > > Thanks,
> > > > > > Madadi Vineeth Reddy
> > > > > > 
> > > > > > > Signed-off-by: Phil Auld <pauld@redhat.com>
> > > > > > > Cc: Peter Zijlstra <peterz@infradead.org>
> > > > > > > Cc: Juri Lelli <juri.lelli@redhat.com>
> > > > > > > Cc: Frederic Weisbecker <frederic@kernel.org>
> > > > > > > Cc: Waiman Long <longman@redhat.com>
> > > > > > > Cc: linux-kernel@vger.kernel.org
> > > > > > > ---
> > > > > > >  kernel/sched/isolation.c | 2 +-
> > > > > > >  kernel/sched/topology.c  | 2 +-
> > > > > > >  2 files changed, 2 insertions(+), 2 deletions(-)
> > > > > > > 
> > > > > > > diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
> > > > > > > index 81bc8b329ef1..93b038d48900 100644
> > > > > > > --- a/kernel/sched/isolation.c
> > > > > > > +++ b/kernel/sched/isolation.c
> > > > > > > @@ -40,7 +40,7 @@ int housekeeping_any_cpu(enum hk_type type)
> > > > > > >  			if (cpu < nr_cpu_ids)
> > > > > > >  				return cpu;
> > > > > > >  
> > > > > > > -			cpu = cpumask_any_and(housekeeping.cpumasks[type], cpu_online_mask);
> > > > > > > +			cpu = cpumask_any_and_distribute(housekeeping.cpumasks[type], cpu_online_mask);
> > > > > > >  			if (likely(cpu < nr_cpu_ids))
> > > > > > >  				return cpu;
> > > > > > >  			/*
> > > > > > > diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> > > > > > > index c49aea8c1025..94133f843485 100644
> > > > > > > --- a/kernel/sched/topology.c
> > > > > > > +++ b/kernel/sched/topology.c
> > > > > > > @@ -2101,7 +2101,7 @@ int sched_numa_find_closest(const struct cpumask *cpus, int cpu)
> > > > > > >  	for (i = 0; i < sched_domains_numa_levels; i++) {
> > > > > > >  		if (!masks[i][j])
> > > > > > >  			break;
> > > > > > > -		cpu = cpumask_any_and(cpus, masks[i][j]);
> > > > > > > +		cpu = cpumask_any_and_distribute(cpus, masks[i][j]);
> > > > > > >  		if (cpu < nr_cpu_ids) {
> > > > > > >  			found = cpu;
> > > > > > >  			break;
> > > > > > 
> > > > > 
> > > > > -- 
> > > > > 
> > > > 
> > > 
> > > -- 
> > > 
> > 
> 
> -- 
> 

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2025-02-21  4:42 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-02-11 14:01 [CHANGE 1/2] sched/isolation: Make use of more than one housekeeping cpu Phil Auld
2025-02-11 14:14 ` [PATCH] " Phil Auld
2025-02-11 14:24   ` Waiman Long
2025-02-13  4:44 ` [CHANGE 1/2] " Madadi Vineeth Reddy
2025-02-13 14:26   ` Phil Auld
2025-02-14  5:38     ` Vishal Chourasia
2025-02-18 15:00       ` Phil Auld
2025-02-18 15:23         ` Waiman Long
2025-02-18 15:30           ` Phil Auld
2025-02-18 15:33             ` Waiman Long
2025-02-20  9:20         ` Vishal Chourasia
2025-02-20 15:52           ` Phil Auld
2025-02-21  4:42             ` Vishal Chourasia

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox