public inbox for linux-kernel@vger.kernel.org
* [PATCH] sched/numa: Correct NUMA imbalance calculation
@ 2024-05-24  3:54 Zhang Qiao
  2024-06-03  7:27 ` kernel test robot
  0 siblings, 1 reply; 2+ messages in thread
From: Zhang Qiao @ 2024-05-24  3:54 UTC
  To: mingo, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
	rostedt, bsegall, mgorman, bristot, vschneid
  Cc: linux-kernel, zhangqiao22

When performing load balancing, a NUMA imbalance is allowed
if the number of busy CPUs is below the maximum threshold, so
that a pair of communicating tasks stays on the current node
when the source domain is lightly loaded. In many cases, this
prevents communicating tasks from being pulled apart.
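
For reference, the allowance described above can be sketched in
user space roughly as follows. This is only an illustration, not
the kernel code: the name sketch_adjust_numa_imbalance() and the
min_allowed parameter are made up here, and the first check
(keep the imbalance once dst_running exceeds imb_numa_nr) is
assumed from the rest of the function rather than visible in the
hunk further down.

```c
/*
 * User-space sketch of the "allow a small NUMA imbalance" rule.
 * Hypothetical helper; min_allowed is an illustrative parameter
 * standing in for the threshold used by the second check.
 */
long sketch_adjust_numa_imbalance(int imbalance, int dst_running,
				  int imb_numa_nr, int min_allowed)
{
	/* Assumed: above the busy threshold, keep the imbalance as is. */
	if (dst_running > imb_numa_nr)
		return imbalance;

	/* A small imbalance is tolerated so a task pair can stay local. */
	if (imbalance <= min_allowed)
		return 0;

	return imbalance;
}
```

With min_allowed = 2 this should correspond to the
NUMA_IMBALANCE_MIN check; the diff below uses imb_numa_nr in that
position instead, so an imbalance of 3 with imb_numa_nr = 4 would
be kept before the change and ignored after it.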

But when I ran the lmbench bw_pipe testcase, I found that the
behavior did not match this expectation: the communicating
tasks were migrated to two different NUMA nodes.

There may be two reasons for this issue:
1. calculate_imbalance() uses local->sum_nr_running, which may
not be accurate: the communicating tasks run on the busiest
group, so it should use busiest->sum_nr_running.

2. In calculate_imbalance(), idle CPUs are used to calculate the
imbalance, but group_weight may differ between the local and
busiest groups (my server has 4 NUMA nodes and the kernel builds
a 3-level NUMA sched_domain, so some sched_groups have different
weights). In this case, even if both groups are mostly idle, the
calculated imbalance is very large. The difference in busy CPUs
between the groups might be a more appropriate imbalance value.
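
To make the second point concrete, here is a small user-space
sketch comparing the two calculations. It is not the kernel code:
struct sketch_group and the sketch_* helpers are hypothetical
stand-ins, signed arithmetic is used for simplicity, and the group
sizes in the note after the block are made-up numbers, not
measurements from my server.

```c
/* Hypothetical, simplified stand-in for per-group load balance stats. */
struct sketch_group {
	unsigned int group_weight;	/* number of CPUs in the group */
	unsigned int idle_cpus;		/* number of idle CPUs in the group */
};

/* Clamp negative values to zero, like max_t(long, 0, v). */
static long sketch_max0(long v)
{
	return v > 0 ? v : 0;
}

/* Pre-patch idea: imbalance is the difference in idle CPUs. */
long sketch_imbalance_idle(const struct sketch_group *local,
			   const struct sketch_group *busiest)
{
	return sketch_max0((long)local->idle_cpus - (long)busiest->idle_cpus);
}

/* Patched idea: imbalance is the difference in busy CPUs. */
long sketch_imbalance_busy(const struct sketch_group *local,
			   const struct sketch_group *busiest)
{
	long busiest_busy = (long)busiest->group_weight - (long)busiest->idle_cpus;
	long local_busy = (long)local->group_weight - (long)local->idle_cpus;

	return sketch_max0(busiest_busy - local_busy);
}
```

For an assumed local group of 96 CPUs with 95 idle and a busiest
group of 32 CPUs with 30 idle, the idle-based value is 65 while the
busy-based value is 1, even though both groups are almost idle.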

For lmbench bw_pipe (bw_pipe -P 1):
v6.6:               1776.7533 MB/sec
v6.6 + this patch:  4323 MB/sec

Signed-off-by: Zhang Qiao <zhangqiao22@huawei.com>
---
 kernel/sched/fair.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 03be0d1330a6..c6170cde9c14 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1323,7 +1323,6 @@ static inline bool is_core_idle(int cpu)
 }
 
 #ifdef CONFIG_NUMA
-#define NUMA_IMBALANCE_MIN 2
 
 static inline long
 adjust_numa_imbalance(int imbalance, int dst_running, int imb_numa_nr)
@@ -1342,7 +1341,7 @@ adjust_numa_imbalance(int imbalance, int dst_running, int imb_numa_nr)
 	 * Allow a small imbalance based on a simple pair of communicating
 	 * tasks that remain local when the destination is lightly loaded.
 	 */
-	if (imbalance <= NUMA_IMBALANCE_MIN)
+	if (imbalance <= imb_numa_nr)
 		return 0;
 
 	return imbalance;
@@ -10727,14 +10726,15 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 			 */
 			env->migration_type = migrate_task;
 			env->imbalance = max_t(long, 0,
-					       (local->idle_cpus - busiest->idle_cpus));
+					(busiest->group_weight - busiest->idle_cpus) -
+					 (local->group_weight - local->idle_cpus));
 		}
 
 #ifdef CONFIG_NUMA
 		/* Consider allowing a small imbalance between NUMA groups */
 		if (env->sd->flags & SD_NUMA) {
 			env->imbalance = adjust_numa_imbalance(env->imbalance,
-							       local->sum_nr_running + 1,
+							       busiest->sum_nr_running,
 							       env->sd->imb_numa_nr);
 		}
 #endif
-- 
2.18.0.huawei.25

