All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v2 0/3] Take the scheduling domain into account in numa balancin
@ 2025-01-03  6:59 Chuyi Zhou
  2025-01-03  6:59 ` [PATCH v2 1/3] sched/fair: Remove unused task_numa_migrate return value Chuyi Zhou
                   ` (3 more replies)
  0 siblings, 4 replies; 14+ messages in thread
From: Chuyi Zhou @ 2025-01-03  6:59 UTC (permalink / raw)
  To: mingo, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
	rostedt, bsegall, mgorman, vschneid, longman, riel
  Cc: chengming.zhou, kprateek.nayak, linux-kernel, Chuyi Zhou

This patchset tries to adjust the logic of handling isolate cpus in numa balancing.

patch#1: Clean up for task_numa_migrate().

patch#2: Skips the isolate cpus when gathering numa status and finding
idle cpus in update_numa_stats().

patch#3: Ensure that we do not select an isolated CPU in
task_numa_find_cpu(), even if it is present in the task's CPU mask.

v1 -> v2:
 - collect Prateek's reviewed-by and add the history of
   task_numa_migrate() to commit log.

Chuyi Zhou (3):
  sched/fair: Remove unused task_numa_migrate return value
  sched/fair: Ignore isolated cpus in update_numa_stat
  sched/fair: Ensure select housekeeping cpus in task_numa_find_cpu

 kernel/sched/fair.c | 26 +++++++++++++++++---------
 1 file changed, 17 insertions(+), 9 deletions(-)

-- 
2.20.1


^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH v2 1/3] sched/fair: Remove unused task_numa_migrate return value
  2025-01-03  6:59 [PATCH v2 0/3] Take the scheduling domain into account in numa balancin Chuyi Zhou
@ 2025-01-03  6:59 ` Chuyi Zhou
  2025-01-05 18:28   ` Madadi Vineeth Reddy
  2025-01-03  6:59 ` [PATCH v2 2/3] sched/fair: Ignore isolated cpus in update_numa_stat Chuyi Zhou
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 14+ messages in thread
From: Chuyi Zhou @ 2025-01-03  6:59 UTC (permalink / raw)
  To: mingo, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
	rostedt, bsegall, mgorman, vschneid, longman, riel
  Cc: chengming.zhou, kprateek.nayak, linux-kernel, Chuyi Zhou

Initial NUMA Balancing implementation used the return value of
task_numa_migrate() to retry NUMA Balancing in commit 6b9a7460b6ba
("sched/numa: Retry migration of tasks to CPU on a preferred node") however
in the same series[1], Mel also included an optimization from Rik which
retried NUMA Balancing periodically irrespective the return value from
task_numa_migrate() in commit 2739d3eef3a9 ("sched/numa: Retry
task_numa_migrate() periodically").

The return value of task_numa_migrate now is unused, remove it.

[1] https://lore.kernel.org/all/1381141781-10992-34-git-send-email-mgorman@suse.de/

Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com>
---
 kernel/sched/fair.c | 9 ++++-----
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d5127d9beaea..f544012b9320 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2486,7 +2486,7 @@ static void task_numa_find_cpu(struct task_numa_env *env,
 	}
 }
 
-static int task_numa_migrate(struct task_struct *p)
+static void task_numa_migrate(struct task_struct *p)
 {
 	struct task_numa_env env = {
 		.p = p,
@@ -2531,7 +2531,7 @@ static int task_numa_migrate(struct task_struct *p)
 	 */
 	if (unlikely(!sd)) {
 		sched_setnuma(p, task_node(p));
-		return -EINVAL;
+		return;
 	}
 
 	env.dst_nid = p->numa_preferred_nid;
@@ -2600,7 +2600,7 @@ static int task_numa_migrate(struct task_struct *p)
 	/* No better CPU than the current one was found. */
 	if (env.best_cpu == -1) {
 		trace_sched_stick_numa(p, env.src_cpu, NULL, -1);
-		return -EAGAIN;
+		return;
 	}
 
 	best_rq = cpu_rq(env.best_cpu);
@@ -2609,7 +2609,7 @@ static int task_numa_migrate(struct task_struct *p)
 		WRITE_ONCE(best_rq->numa_migrate_on, 0);
 		if (ret != 0)
 			trace_sched_stick_numa(p, env.src_cpu, NULL, env.best_cpu);
-		return ret;
+		return;
 	}
 
 	ret = migrate_swap(p, env.best_task, env.best_cpu, env.src_cpu);
@@ -2618,7 +2618,6 @@ static int task_numa_migrate(struct task_struct *p)
 	if (ret != 0)
 		trace_sched_stick_numa(p, env.src_cpu, env.best_task, env.best_cpu);
 	put_task_struct(env.best_task);
-	return ret;
 }
 
 /* Attempt to migrate a task to a CPU on the preferred node. */
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH v2 2/3] sched/fair: Ignore isolated cpus in update_numa_stat
  2025-01-03  6:59 [PATCH v2 0/3] Take the scheduling domain into account in numa balancin Chuyi Zhou
  2025-01-03  6:59 ` [PATCH v2 1/3] sched/fair: Remove unused task_numa_migrate return value Chuyi Zhou
@ 2025-01-03  6:59 ` Chuyi Zhou
  2025-01-05 18:52   ` Madadi Vineeth Reddy
  2025-01-07 18:39   ` Waiman Long
  2025-01-03  6:59 ` [PATCH v2 3/3] sched/fair: Ensure select housekeeping cpus in task_numa_find_cpu Chuyi Zhou
  2025-01-03 11:36 ` [PATCH v2 0/3] Take the scheduling domain into account in numa balancin Peter Zijlstra
  3 siblings, 2 replies; 14+ messages in thread
From: Chuyi Zhou @ 2025-01-03  6:59 UTC (permalink / raw)
  To: mingo, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
	rostedt, bsegall, mgorman, vschneid, longman, riel
  Cc: chengming.zhou, kprateek.nayak, linux-kernel, Chuyi Zhou

Now update_numa_stats() iterates each cpu in a node to gather load
information for the node and attempts to find the idle cpu as a candidate
best_cpu within the node.

In update_numa_stats() we should take into account the scheduling domain.
This is because the "isolcpus" kernel command line option and cpuset iso-
late partitions can remove CPUs from load balance. Similar to task wakeup
and periodic load balancing, we should not involve isolated CPUs in NUMA
balancing. When gathering load information for nodes, we need to ignore the
load of isolated CPUs. This change also avoids selecting an isolated CPU
as the idle_cpu.

Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com>
---
 kernel/sched/fair.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f544012b9320..a0139659fe7a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2125,6 +2125,11 @@ static void update_numa_stats(struct task_numa_env *env,
 	for_each_cpu(cpu, cpumask_of_node(nid)) {
 		struct rq *rq = cpu_rq(cpu);
 
+		/* skip isolated cpus' load */
+		if (!rcu_dereference(rq->sd))
+			continue;
+
+		ns->weight++;
 		ns->load += cpu_load(rq);
 		ns->runnable += cpu_runnable(rq);
 		ns->util += cpu_util_cfs(cpu);
@@ -2144,8 +2149,6 @@ static void update_numa_stats(struct task_numa_env *env,
 	}
 	rcu_read_unlock();
 
-	ns->weight = cpumask_weight(cpumask_of_node(nid));
-
 	ns->node_type = numa_classify(env->imbalance_pct, ns);
 
 	if (idle_core >= 0)
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH v2 3/3] sched/fair: Ensure select housekeeping cpus in task_numa_find_cpu
  2025-01-03  6:59 [PATCH v2 0/3] Take the scheduling domain into account in numa balancin Chuyi Zhou
  2025-01-03  6:59 ` [PATCH v2 1/3] sched/fair: Remove unused task_numa_migrate return value Chuyi Zhou
  2025-01-03  6:59 ` [PATCH v2 2/3] sched/fair: Ignore isolated cpus in update_numa_stat Chuyi Zhou
@ 2025-01-03  6:59 ` Chuyi Zhou
  2025-01-03 11:36 ` [PATCH v2 0/3] Take the scheduling domain into account in numa balancin Peter Zijlstra
  3 siblings, 0 replies; 14+ messages in thread
From: Chuyi Zhou @ 2025-01-03  6:59 UTC (permalink / raw)
  To: mingo, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
	rostedt, bsegall, mgorman, vschneid, longman, riel
  Cc: chengming.zhou, kprateek.nayak, linux-kernel, Chuyi Zhou

Now in task_numa_find_cpu(), we only skip CPUs that are not in the task's
cpumask, which could result in migrating the task to an isolated domain if
the task's cpumask includes isolated CPUs. This is because cpuset
configured partitions are always reflected in each member task's cpumask.
However, for isolcpus= kernel command line option, the isolated CPUs are
simply omitted from sched_domains without further restrictions on tasks'
cpumasks.

This change replaces the set of CPUs allowed to migrate the task from
p->cpus_ptr by the intersection of p->cpus_ptr and
housekeeping_cpumask(HK_TYPE_DOMAIN).

Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com>
---
 kernel/sched/fair.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a0139659fe7a..05782b563609 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2081,6 +2081,12 @@ numa_type numa_classify(unsigned int imbalance_pct,
 	return node_fully_busy;
 }
 
+static inline bool numa_migrate_test_cpu(struct task_struct *p, int cpu)
+{
+	return cpumask_test_cpu(cpu, p->cpus_ptr) &&
+			housekeeping_cpu(cpu, HK_TYPE_DOMAIN);
+}
+
 #ifdef CONFIG_SCHED_SMT
 /* Forward declarations of select_idle_sibling helpers */
 static inline bool test_idle_cores(int cpu);
@@ -2168,7 +2174,7 @@ static void task_numa_assign(struct task_numa_env *env,
 		/* Find alternative idle CPU. */
 		for_each_cpu_wrap(cpu, cpumask_of_node(env->dst_nid), start + 1) {
 			if (cpu == env->best_cpu || !idle_cpu(cpu) ||
-			    !cpumask_test_cpu(cpu, env->p->cpus_ptr)) {
+			    !numa_migrate_test_cpu(env->p, cpu)) {
 				continue;
 			}
 
@@ -2480,7 +2486,7 @@ static void task_numa_find_cpu(struct task_numa_env *env,
 
 	for_each_cpu(cpu, cpumask_of_node(env->dst_nid)) {
 		/* Skip this CPU if the source task cannot migrate */
-		if (!cpumask_test_cpu(cpu, env->p->cpus_ptr))
+		if (!numa_migrate_test_cpu(env->p, cpu))
 			continue;
 
 		env->dst_cpu = cpu;
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH v2 0/3] Take the scheduling domain into account in numa balancin
  2025-01-03  6:59 [PATCH v2 0/3] Take the scheduling domain into account in numa balancin Chuyi Zhou
                   ` (2 preceding siblings ...)
  2025-01-03  6:59 ` [PATCH v2 3/3] sched/fair: Ensure select housekeeping cpus in task_numa_find_cpu Chuyi Zhou
@ 2025-01-03 11:36 ` Peter Zijlstra
  2025-01-03 12:40   ` Chuyi Zhou
  3 siblings, 1 reply; 14+ messages in thread
From: Peter Zijlstra @ 2025-01-03 11:36 UTC (permalink / raw)
  To: Chuyi Zhou
  Cc: mingo, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt,
	bsegall, mgorman, vschneid, longman, riel, chengming.zhou,
	kprateek.nayak, linux-kernel

On Fri, Jan 03, 2025 at 02:59:27PM +0800, Chuyi Zhou wrote:
> This patchset tries to adjust the logic of handling isolate cpus in numa balancing.
> 
> patch#1: Clean up for task_numa_migrate().
> 
> patch#2: Skips the isolate cpus when gathering numa status and finding
> idle cpus in update_numa_stats().
> 
> patch#3: Ensure that we do not select an isolated CPU in
> task_numa_find_cpu(), even if it is present in the task's CPU mask.

Your $subject and actual patches do not patch.

Your subject suggests you're taking the scheduling domains into account
for numa balancing, your actual patches are bunch of special case hacks
that totally ignore the actual sched domains :-(

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v2 0/3] Take the scheduling domain into account in numa balancin
  2025-01-03 11:36 ` [PATCH v2 0/3] Take the scheduling domain into account in numa balancin Peter Zijlstra
@ 2025-01-03 12:40   ` Chuyi Zhou
  2025-01-07 10:48     ` Peter Zijlstra
  0 siblings, 1 reply; 14+ messages in thread
From: Chuyi Zhou @ 2025-01-03 12:40 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt,
	bsegall, mgorman, vschneid, longman, riel, chengming.zhou,
	kprateek.nayak, linux-kernel

Hello Peter,

在 2025/1/3 19:36, Peter Zijlstra 写道:
> Your $subject and actual patches do not patch.
> 
> Your subject suggests you're taking the scheduling domains into account
> for numa balancing, your actual patches are bunch of special case hacks
> that totally ignore the actual sched domains

The subject is indeed inappropriate, but the issues mentioned in this 
patchset still exist, right?

- We should not consider isolated CPUs in numa_stats.
- We should not select isolated CPUs as candidate CPUs.

The current patch handle the above cases with hacks because I thought 
this kind of change is minimal to fix this issue. Perhaps you have a 
better solution for this issue? If so, please let me know, and I am more 
than willing to continue addressing this problem.

Thanks.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v2 1/3] sched/fair: Remove unused task_numa_migrate return value
  2025-01-03  6:59 ` [PATCH v2 1/3] sched/fair: Remove unused task_numa_migrate return value Chuyi Zhou
@ 2025-01-05 18:28   ` Madadi Vineeth Reddy
  0 siblings, 0 replies; 14+ messages in thread
From: Madadi Vineeth Reddy @ 2025-01-05 18:28 UTC (permalink / raw)
  To: Chuyi Zhou
  Cc: mingo, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
	rostedt, bsegall, mgorman, vschneid, longman, riel,
	chengming.zhou, kprateek.nayak, linux-kernel,
	Madadi Vineeth Reddy

Hi Chuyi Zhou,

On 03/01/25 12:29, Chuyi Zhou wrote:
> Initial NUMA Balancing implementation used the return value of
> task_numa_migrate() to retry NUMA Balancing in commit 6b9a7460b6ba
> ("sched/numa: Retry migration of tasks to CPU on a preferred node") however
> in the same series[1], Mel also included an optimization from Rik which
> retried NUMA Balancing periodically irrespective the return value from
> task_numa_migrate() in commit 2739d3eef3a9 ("sched/numa: Retry
> task_numa_migrate() periodically").

Maybe you can also add: 
Fixes: 2739d3eef3a9 ("sched/numa: Retry task_numa_migrate() periodically")
While it's not strictly required, it might be useful to include for context.
> 
> The return value of task_numa_migrate now is unused, remove it.

Yes, it's unused now.
Reviewed-by: Madadi Vineeth Reddy <vineethr@linux.ibm.com>

Thanks,
Madadi Vineeth Reddy

> 
> [1] https://lore.kernel.org/all/1381141781-10992-34-git-send-email-mgorman@suse.de/
> 
> Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
> Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com>
> ---
>  kernel/sched/fair.c | 9 ++++-----
>  1 file changed, 4 insertions(+), 5 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index d5127d9beaea..f544012b9320 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2486,7 +2486,7 @@ static void task_numa_find_cpu(struct task_numa_env *env,
>  	}
>  }
>  
> -static int task_numa_migrate(struct task_struct *p)
> +static void task_numa_migrate(struct task_struct *p)
>  {
>  	struct task_numa_env env = {
>  		.p = p,
> @@ -2531,7 +2531,7 @@ static int task_numa_migrate(struct task_struct *p)
>  	 */
>  	if (unlikely(!sd)) {
>  		sched_setnuma(p, task_node(p));
> -		return -EINVAL;
> +		return;
>  	}
>  
>  	env.dst_nid = p->numa_preferred_nid;
> @@ -2600,7 +2600,7 @@ static int task_numa_migrate(struct task_struct *p)
>  	/* No better CPU than the current one was found. */
>  	if (env.best_cpu == -1) {
>  		trace_sched_stick_numa(p, env.src_cpu, NULL, -1);
> -		return -EAGAIN;
> +		return;
>  	}
>  
>  	best_rq = cpu_rq(env.best_cpu);
> @@ -2609,7 +2609,7 @@ static int task_numa_migrate(struct task_struct *p)
>  		WRITE_ONCE(best_rq->numa_migrate_on, 0);
>  		if (ret != 0)
>  			trace_sched_stick_numa(p, env.src_cpu, NULL, env.best_cpu);
> -		return ret;
> +		return;
>  	}
>  
>  	ret = migrate_swap(p, env.best_task, env.best_cpu, env.src_cpu);
> @@ -2618,7 +2618,6 @@ static int task_numa_migrate(struct task_struct *p)
>  	if (ret != 0)
>  		trace_sched_stick_numa(p, env.src_cpu, env.best_task, env.best_cpu);
>  	put_task_struct(env.best_task);
> -	return ret;
>  }
>  
>  /* Attempt to migrate a task to a CPU on the preferred node. */


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v2 2/3] sched/fair: Ignore isolated cpus in update_numa_stat
  2025-01-03  6:59 ` [PATCH v2 2/3] sched/fair: Ignore isolated cpus in update_numa_stat Chuyi Zhou
@ 2025-01-05 18:52   ` Madadi Vineeth Reddy
  2025-01-07 13:16     ` Chuyi Zhou
  2025-01-07 18:39   ` Waiman Long
  1 sibling, 1 reply; 14+ messages in thread
From: Madadi Vineeth Reddy @ 2025-01-05 18:52 UTC (permalink / raw)
  To: Chuyi Zhou
  Cc: mingo, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
	rostedt, bsegall, mgorman, vschneid, longman, riel,
	chengming.zhou, kprateek.nayak, linux-kernel,
	Madadi Vineeth Reddy

On 03/01/25 12:29, Chuyi Zhou wrote:
> Now update_numa_stats() iterates each cpu in a node to gather load
> information for the node and attempts to find the idle cpu as a candidate
> best_cpu within the node.
> 
> In update_numa_stats() we should take into account the scheduling domain.
> This is because the "isolcpus" kernel command line option and cpuset iso-
> late partitions can remove CPUs from load balance. Similar to task wakeup
> and periodic load balancing, we should not involve isolated CPUs in NUMA
> balancing. When gathering load information for nodes, we need to ignore the
> load of isolated CPUs. This change also avoids selecting an isolated CPU
> as the idle_cpu.

If possible, would you be able to share any performance metrics or benchmarks
that demonstrate the impact of this patch on NUMA balancing or CPU migration
efficiency?

Thanks,
Madadi Vineeth Reddy

> 
> Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com>
> ---
>  kernel/sched/fair.c | 7 +++++--
>  1 file changed, 5 insertions(+), 2 deletions(-)


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v2 0/3] Take the scheduling domain into account in numa balancin
  2025-01-03 12:40   ` Chuyi Zhou
@ 2025-01-07 10:48     ` Peter Zijlstra
  2025-01-07 13:03       ` Chuyi Zhou
  0 siblings, 1 reply; 14+ messages in thread
From: Peter Zijlstra @ 2025-01-07 10:48 UTC (permalink / raw)
  To: Chuyi Zhou
  Cc: mingo, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt,
	bsegall, mgorman, vschneid, longman, riel, chengming.zhou,
	kprateek.nayak, linux-kernel

On Fri, Jan 03, 2025 at 08:40:18PM +0800, Chuyi Zhou wrote:
> Hello Peter,
> 
> 在 2025/1/3 19:36, Peter Zijlstra 写道:
> > Your $subject and actual patches do not patch.
> > 
> > Your subject suggests you're taking the scheduling domains into account
> > for numa balancing, your actual patches are bunch of special case hacks
> > that totally ignore the actual sched domains
> 
> The subject is indeed inappropriate, but the issues mentioned in this
> patchset still exist, right?
> 
> - We should not consider isolated CPUs in numa_stats.
> - We should not select isolated CPUs as candidate CPUs.
> 
> The current patch handle the above cases with hacks because I thought this
> kind of change is minimal to fix this issue. Perhaps you have a better
> solution for this issue? If so, please let me know, and I am more than
> willing to continue addressing this problem.

I would much rather see you do what the subject claims :-)

At the very least you should mask the whole thing against rq->rd->span
if there is a rq->rd at all ofcourse. This is not just the iteration in
update_numa_stats() but also the for_each_node_state() iteration.

I'm not sure there's anything saner than:

	cpumask_intersects(rq->rd->span, cpumask_of_node(nid))


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v2 0/3] Take the scheduling domain into account in numa balancin
  2025-01-07 10:48     ` Peter Zijlstra
@ 2025-01-07 13:03       ` Chuyi Zhou
  0 siblings, 0 replies; 14+ messages in thread
From: Chuyi Zhou @ 2025-01-07 13:03 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt,
	bsegall, mgorman, vschneid, longman, riel, chengming.zhou,
	kprateek.nayak, linux-kernel



在 2025/1/7 18:48, Peter Zijlstra 写道:
> On Fri, Jan 03, 2025 at 08:40:18PM +0800, Chuyi Zhou wrote:
>> Hello Peter,
>>
>> 在 2025/1/3 19:36, Peter Zijlstra 写道:
>>> Your $subject and actual patches do not patch.
>>>
>>> Your subject suggests you're taking the scheduling domains into account
>>> for numa balancing, your actual patches are bunch of special case hacks
>>> that totally ignore the actual sched domains
>>
>> The subject is indeed inappropriate, but the issues mentioned in this
>> patchset still exist, right?
>>
>> - We should not consider isolated CPUs in numa_stats.
>> - We should not select isolated CPUs as candidate CPUs.
>>
>> The current patch handle the above cases with hacks because I thought this
>> kind of change is minimal to fix this issue. Perhaps you have a better
>> solution for this issue? If so, please let me know, and I am more than
>> willing to continue addressing this problem.
> 
> I would much rather see you do what the subject claims :-)
> 
> At the very least you should mask the whole thing against rq->rd->span
> if there is a rq->rd at all ofcourse. This is not just the iteration in
> update_numa_stats() but also the for_each_node_state() iteration.
> 
> I'm not sure there's anything saner than:
> 
> 	cpumask_intersects(rq->rd->span, cpumask_of_node(nid))
> 

Thank you for your suggestion. It will be implemented in the next version.

Thansk.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v2 2/3] sched/fair: Ignore isolated cpus in update_numa_stat
  2025-01-05 18:52   ` Madadi Vineeth Reddy
@ 2025-01-07 13:16     ` Chuyi Zhou
  2025-01-07 18:30       ` Madadi Vineeth Reddy
  0 siblings, 1 reply; 14+ messages in thread
From: Chuyi Zhou @ 2025-01-07 13:16 UTC (permalink / raw)
  To: Madadi Vineeth Reddy
  Cc: mingo, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
	rostedt, bsegall, mgorman, vschneid, longman, riel,
	chengming.zhou, kprateek.nayak, linux-kernel

Hello Madadi Vineeth Reddy,

在 2025/1/6 02:52, Madadi Vineeth Reddy 写道:
> On 03/01/25 12:29, Chuyi Zhou wrote:
>> Now update_numa_stats() iterates each cpu in a node to gather load
>> information for the node and attempts to find the idle cpu as a candidate
>> best_cpu within the node.
>>
>> In update_numa_stats() we should take into account the scheduling domain.
>> This is because the "isolcpus" kernel command line option and cpuset iso-
>> late partitions can remove CPUs from load balance. Similar to task wakeup
>> and periodic load balancing, we should not involve isolated CPUs in NUMA
>> balancing. When gathering load information for nodes, we need to ignore the
>> load of isolated CPUs. This change also avoids selecting an isolated CPU
>> as the idle_cpu.
> 
> If possible, would you be able to share any performance metrics or benchmarks
> that demonstrate the impact of this patch on NUMA balancing or CPU migration
> efficiency?
> 
> Thanks,
> Madadi Vineeth Reddy
> 

This change will not alter the default behavior of NUMA balancing unless 
we enables both NUMA balancing and isolated CPUs features. Therefore, 
under normal default conditions, there should be no performance 
regression. On the other hand, involving isolated CPUs in NUMA balancing 
or load balancing is inappropriate, and this is the issue that this 
patch aims to fix.

It might be worth setting up a test scenario for cases where NUMA 
balancing and isolated CPUs are both enabled. Perhaps this testing will 
be done later on.

Thanks.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v2 2/3] sched/fair: Ignore isolated cpus in update_numa_stat
  2025-01-07 13:16     ` Chuyi Zhou
@ 2025-01-07 18:30       ` Madadi Vineeth Reddy
  0 siblings, 0 replies; 14+ messages in thread
From: Madadi Vineeth Reddy @ 2025-01-07 18:30 UTC (permalink / raw)
  To: Chuyi Zhou
  Cc: mingo, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
	rostedt, bsegall, mgorman, vschneid, longman, riel,
	chengming.zhou, kprateek.nayak, linux-kernel,
	Madadi Vineeth Reddy

On 07/01/25 18:46, Chuyi Zhou wrote:
> Hello Madadi Vineeth Reddy,
> 
> 在 2025/1/6 02:52, Madadi Vineeth Reddy 写道:
>> On 03/01/25 12:29, Chuyi Zhou wrote:
>>> Now update_numa_stats() iterates each cpu in a node to gather load
>>> information for the node and attempts to find the idle cpu as a candidate
>>> best_cpu within the node.
>>>
>>> In update_numa_stats() we should take into account the scheduling domain.
>>> This is because the "isolcpus" kernel command line option and cpuset iso-
>>> late partitions can remove CPUs from load balance. Similar to task wakeup
>>> and periodic load balancing, we should not involve isolated CPUs in NUMA
>>> balancing. When gathering load information for nodes, we need to ignore the
>>> load of isolated CPUs. This change also avoids selecting an isolated CPU
>>> as the idle_cpu.
>>
>> If possible, would you be able to share any performance metrics or benchmarks
>> that demonstrate the impact of this patch on NUMA balancing or CPU migration
>> efficiency?
>>
>> Thanks,
>> Madadi Vineeth Reddy
>>
> 
> This change will not alter the default behavior of NUMA balancing unless we enables both NUMA balancing and isolated CPUs features. Therefore, under normal default conditions, there should be no performance regression. On the other hand, involving isolated CPUs in NUMA balancing or load balancing is inappropriate, and this is the issue that this patch aims to fix.
> 
> It might be worth setting up a test scenario for cases where NUMA balancing and isolated CPUs are both enabled. Perhaps this testing will be done later on.

Right, by benchmark, I meant the scenario where both NUMA balancing and isolated
CPUs are involved. It would make for an interesting test case.

Thanks,
Madadi Vineeth Reddy

> 
> Thanks.


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v2 2/3] sched/fair: Ignore isolated cpus in update_numa_stat
  2025-01-03  6:59 ` [PATCH v2 2/3] sched/fair: Ignore isolated cpus in update_numa_stat Chuyi Zhou
  2025-01-05 18:52   ` Madadi Vineeth Reddy
@ 2025-01-07 18:39   ` Waiman Long
  2025-01-08 10:47     ` Chuyi Zhou
  1 sibling, 1 reply; 14+ messages in thread
From: Waiman Long @ 2025-01-07 18:39 UTC (permalink / raw)
  To: Chuyi Zhou, mingo, peterz, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, riel
  Cc: chengming.zhou, kprateek.nayak, linux-kernel


On 1/3/25 1:59 AM, Chuyi Zhou wrote:
> Now update_numa_stats() iterates each cpu in a node to gather load
> information for the node and attempts to find the idle cpu as a candidate
> best_cpu within the node.
>
> In update_numa_stats() we should take into account the scheduling domain.
> This is because the "isolcpus" kernel command line option and cpuset iso-
> late partitions can remove CPUs from load balance. Similar to task wakeup
> and periodic load balancing, we should not involve isolated CPUs in NUMA
> balancing. When gathering load information for nodes, we need to ignore the
> load of isolated CPUs. This change also avoids selecting an isolated CPU
> as the idle_cpu.
>
> Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com>
> ---
>   kernel/sched/fair.c | 7 +++++--
>   1 file changed, 5 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index f544012b9320..a0139659fe7a 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2125,6 +2125,11 @@ static void update_numa_stats(struct task_numa_env *env,
>   	for_each_cpu(cpu, cpumask_of_node(nid)) {
>   		struct rq *rq = cpu_rq(cpu);
>   
> +		/* skip isolated cpus' load */
> +		if (!rcu_dereference(rq->sd))
> +			continue;
> +
> +		ns->weight++;
>   		ns->load += cpu_load(rq);
>   		ns->runnable += cpu_runnable(rq);
>   		ns->util += cpu_util_cfs(cpu);
> @@ -2144,8 +2149,6 @@ static void update_numa_stats(struct task_numa_env *env,
>   	}
>   	rcu_read_unlock();
>   
> -	ns->weight = cpumask_weight(cpumask_of_node(nid));
> -
>   	ns->node_type = numa_classify(env->imbalance_pct, ns);
>   
>   	if (idle_core >= 0)

You should initalize ns->weight to 0 first before iteration to prevent 
pre-existing ns->weight value from corrupting the result.

Cheers,
Longman


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v2 2/3] sched/fair: Ignore isolated cpus in update_numa_stat
  2025-01-07 18:39   ` Waiman Long
@ 2025-01-08 10:47     ` Chuyi Zhou
  0 siblings, 0 replies; 14+ messages in thread
From: Chuyi Zhou @ 2025-01-08 10:47 UTC (permalink / raw)
  To: Waiman Long, mingo, peterz, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, riel
  Cc: chengming.zhou, kprateek.nayak, linux-kernel

Hello Waiman,

在 2025/1/8 02:39, Waiman Long 写道:
> 
> On 1/3/25 1:59 AM, Chuyi Zhou wrote:
>> Now update_numa_stats() iterates each cpu in a node to gather load
>> information for the node and attempts to find the idle cpu as a candidate
>> best_cpu within the node.
>>
>> In update_numa_stats() we should take into account the scheduling domain.
>> This is because the "isolcpus" kernel command line option and cpuset iso-
>> late partitions can remove CPUs from load balance. Similar to task wakeup
>> and periodic load balancing, we should not involve isolated CPUs in NUMA
>> balancing. When gathering load information for nodes, we need to 
>> ignore the
>> load of isolated CPUs. This change also avoids selecting an isolated CPU
>> as the idle_cpu.
>>
>> Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com>
>> ---
>>   kernel/sched/fair.c | 7 +++++--
>>   1 file changed, 5 insertions(+), 2 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index f544012b9320..a0139659fe7a 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -2125,6 +2125,11 @@ static void update_numa_stats(struct 
>> task_numa_env *env,
>>       for_each_cpu(cpu, cpumask_of_node(nid)) {
>>           struct rq *rq = cpu_rq(cpu);
>> +        /* skip isolated cpus' load */
>> +        if (!rcu_dereference(rq->sd))
>> +            continue;
>> +
>> +        ns->weight++;
>>           ns->load += cpu_load(rq);
>>           ns->runnable += cpu_runnable(rq);
>>           ns->util += cpu_util_cfs(cpu);
>> @@ -2144,8 +2149,6 @@ static void update_numa_stats(struct 
>> task_numa_env *env,
>>       }
>>       rcu_read_unlock();
>> -    ns->weight = cpumask_weight(cpumask_of_node(nid));
>> -
>>       ns->node_type = numa_classify(env->imbalance_pct, ns);
>>       if (idle_core >= 0)
> 
> You should initalize ns->weight to 0 first before iteration to prevent 
> pre-existing ns->weight value from corrupting the result.
> 
> Cheers,
> Longman
> 

Thanks for your review.

We have already memset ns to 0 before the start of update_numa_stats(), 
so I think it should be okay here.


^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2025-01-08 10:47 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-01-03  6:59 [PATCH v2 0/3] Take the scheduling domain into account in numa balancin Chuyi Zhou
2025-01-03  6:59 ` [PATCH v2 1/3] sched/fair: Remove unused task_numa_migrate return value Chuyi Zhou
2025-01-05 18:28   ` Madadi Vineeth Reddy
2025-01-03  6:59 ` [PATCH v2 2/3] sched/fair: Ignore isolated cpus in update_numa_stat Chuyi Zhou
2025-01-05 18:52   ` Madadi Vineeth Reddy
2025-01-07 13:16     ` Chuyi Zhou
2025-01-07 18:30       ` Madadi Vineeth Reddy
2025-01-07 18:39   ` Waiman Long
2025-01-08 10:47     ` Chuyi Zhou
2025-01-03  6:59 ` [PATCH v2 3/3] sched/fair: Ensure select housekeeping cpus in task_numa_find_cpu Chuyi Zhou
2025-01-03 11:36 ` [PATCH v2 0/3] Take the scheduling domain into account in numa balancin Peter Zijlstra
2025-01-03 12:40   ` Chuyi Zhou
2025-01-07 10:48     ` Peter Zijlstra
2025-01-07 13:03       ` Chuyi Zhou

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.