linux-kernel.vger.kernel.org archive mirror
* [PATCH] sched/fair: allow imbalance between LLCs under NUMA
@ 2025-05-28  7:09 Jianyong Wu
  2025-05-29  6:39 ` K Prateek Nayak
  0 siblings, 1 reply; 11+ messages in thread
From: Jianyong Wu @ 2025-05-28  7:09 UTC (permalink / raw)
  To: mingo, peterz, juri.lelli, vincent.guittot
  Cc: dietmar.eggemann, rostedt, bsegall, mgorman, vschneid,
	linux-kernel, wujianyong, jianyong.wu

The efficiency gains from co-locating communicating tasks within the same
LLC are well-established. However, in multi-LLC NUMA systems, the load
balancer unintentionally sabotages this optimization.

Observe this pattern: On a NUMA node with 4 LLCs, the iperf3 client first
wakes the server within its initial LLC (e.g., LLC_0). The load balancer
subsequently migrates the client to a different LLC (e.g., LLC_1). When
the client next wakes the server, it now targets the server’s placement
to LLC_1 (the client’s new location). The server then migrates to LLC_1,
but the load balancer may reallocate the client to another
LLC (e.g., LLC_2) later. This cycle repeats, causing both tasks to
perpetually chase each other across all four LLCs — a sustained
cross-LLC ping-pong within the NUMA node.

Our solution: Permit controlled load imbalance between LLCs on the same
NUMA node, prioritizing communication affinity over strict balance.

Impact: In a virtual machine with one socket, multiple NUMA nodes (each
with 4 LLCs), unpatched systems suffered 3,000+ LLC migrations in 200
seconds as tasks cycled through all four LLCs. With the patch, migrations
stabilize at ≤10 instances, largely suppressing the NUMA-local LLC
thrashing.

Signed-off-by: Jianyong Wu <wujianyong@hygon.cn>
---
 kernel/sched/fair.c | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0fb9bf995a47..749210e6316b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -11203,6 +11203,22 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 		}
 #endif
 
+		/* Allow imbalance between LLCs within a single NUMA node */
+		if (env->sd->child && env->sd->child->flags & SD_SHARE_LLC && env->sd->parent
+				&& env->sd->parent->flags & SD_NUMA) {
+			int child_weight = env->sd->child->span_weight;
+			int llc_nr = env->sd->span_weight / child_weight;
+			int imb_nr, min;
+
+			if (llc_nr > 1) {
+				/* Let the imbalance not be greater than half of child_weight */
+				min = child_weight >= 4 ? 2 : 1;
+				imb_nr = max_t(int, min, child_weight >> 2);
+				if (imb_nr >= env->imbalance)
+					env->imbalance = 0;
+			}
+		}
+
 		/* Number of tasks to move to restore balance */
 		env->imbalance >>= 1;
 
-- 
2.43.0



^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH] sched/fair: allow imbalance between LLCs under NUMA
  2025-05-28  7:09 [PATCH] sched/fair: allow imbalance between LLCs under NUMA Jianyong Wu
@ 2025-05-29  6:39 ` K Prateek Nayak
  2025-05-29 10:32   ` Jianyong Wu
  0 siblings, 1 reply; 11+ messages in thread
From: K Prateek Nayak @ 2025-05-29  6:39 UTC (permalink / raw)
  To: Jianyong Wu, mingo, peterz, juri.lelli, vincent.guittot
  Cc: dietmar.eggemann, rostedt, bsegall, mgorman, vschneid,
	linux-kernel, jianyong.wu

On 5/28/2025 12:39 PM, Jianyong Wu wrote:
> The efficiency gains from co-locating communicating tasks within the same
> LLC are well-established. However, in multi-LLC NUMA systems, the load
> balancer unintentionally sabotages this optimization.
> 
> Observe this pattern: On a NUMA node with 4 LLCs, the iperf3 client first
> wakes the server within its initial LLC (e.g., LLC_0). The load balancer
> subsequently migrates the client to a different LLC (e.g., LLC_1). When
> the client next wakes the server, it now targets the server’s placement
> to LLC_1 (the client’s new location). The server then migrates to LLC_1,
> but the load balancer may reallocate the client to another
> LLC (e.g., LLC_2) later. This cycle repeats, causing both tasks to
> perpetually chase each other across all four LLCs — a sustained
> cross-LLC ping-pong within the NUMA node.

Migration only happens if the CPU is overloaded right? I've only seen
this happen when a noise like kworker comes in. What exactly is
causing these migrations in your case and is it actually that bad
for iperf?

> 
> Our solution: Permit controlled load imbalance between LLCs on the same
> NUMA node, prioritizing communication affinity over strict balance.
> 
> Impact: In a virtual machine with one socket, multiple NUMA nodes (each
> with 4 LLCs), unpatched systems suffered 3,000+ LLC migrations in 200
> seconds as tasks cycled through all four LLCs. With the patch, migrations
> stabilize at ≤10 instances, largely suppressing the NUMA-local LLC
> thrashing.

Is there any improvement in iperf numbers with these changes?

> 
> Signed-off-by: Jianyong Wu <wujianyong@hygon.cn>
> ---
>   kernel/sched/fair.c | 16 ++++++++++++++++
>   1 file changed, 16 insertions(+)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 0fb9bf995a47..749210e6316b 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -11203,6 +11203,22 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
>   		}
>   #endif
>   
> +		/* Allow imbalance between LLCs within a single NUMA node */
> +		if (env->sd->child && env->sd->child->flags & SD_SHARE_LLC && env->sd->parent
> +				&& env->sd->parent->flags & SD_NUMA) {

This does not imply multiple LLCs in the package. SD_SHARE_LLC is
SDF_SHARED_CHILD and will be set from the SMT domain onwards. This
condition will be true on Intel with SNC enabled despite there not being
multiple LLCs, and llc_nr will be the number of cores there.

Perhaps multiple LLCs can be detected using:

     !((sd->child->flags ^ sd->flags) & SD_SHARE_LLC)

> +			int child_weight = env->sd->child->span_weight;
> +			int llc_nr = env->sd->span_weight / child_weight;
> +			int imb_nr, min;
> +
> +			if (llc_nr > 1) {
> +				/* Let the imbalance not be greater than half of child_weight */
> +				min = child_weight >= 4 ? 2 : 1;
> +				imb_nr = max_t(int, min, child_weight >> 2);

Isn't this just max_t(int, child_weight >> 2, 1)?

> +				if (imb_nr >= env->imbalance)
> +					env->imbalance = 0;

At this point, we are trying to even out the number of idle CPUs on the
destination and the busiest LLC. sched_balance_find_src_rq() will return
NULL if it doesn't find an overloaded rq. Is waiting behind a task
more beneficial than migrating to an idler LLC?

> +			}
> +		}
> +
>   		/* Number of tasks to move to restore balance */
>   		env->imbalance >>= 1;
>   

-- 
Thanks and Regards,
Prateek


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] sched/fair: allow imbalance between LLCs under NUMA
  2025-05-29  6:39 ` K Prateek Nayak
@ 2025-05-29 10:32   ` Jianyong Wu
  2025-05-30  6:09     ` K Prateek Nayak
  0 siblings, 1 reply; 11+ messages in thread
From: Jianyong Wu @ 2025-05-29 10:32 UTC (permalink / raw)
  To: K Prateek Nayak, Jianyong Wu, mingo, peterz, juri.lelli,
	vincent.guittot
  Cc: dietmar.eggemann, rostedt, bsegall, mgorman, vschneid,
	linux-kernel

Hello K Prateek Nayak, thanks for the reply.

On 5/29/2025 2:39 PM, K Prateek Nayak wrote:
> On 5/28/2025 12:39 PM, Jianyong Wu wrote:
>> The efficiency gains from co-locating communicating tasks within the same
>> LLC are well-established. However, in multi-LLC NUMA systems, the load
>> balancer unintentionally sabotages this optimization.
>>
>> Observe this pattern: On a NUMA node with 4 LLCs, the iperf3 client first
>> wakes the server within its initial LLC (e.g., LLC_0). The load balancer
>> subsequently migrates the client to a different LLC (e.g., LLC_1). When
>> the client next wakes the server, it now targets the server’s placement
>> to LLC_1 (the client’s new location). The server then migrates to LLC_1,
>> but the load balancer may reallocate the client to another
>> LLC (e.g., LLC_2) later. This cycle repeats, causing both tasks to
>> perpetually chase each other across all four LLCs — a sustained
>> cross-LLC ping-pong within the NUMA node.
> 
> Migration only happens if the CPU is overloaded right?

This will happen even when the 2 tasks are located in a cpuset of 16
cpus that share an LLC. I don't think it's overloaded in this case.

> I've only seen
> this happen when a noise like kworker comes in. What exactly is
> causing these migrations in your case and is it actually that bad
> for iperf?

I think it's the nohz idle balance that pulls these two iperf tasks
apart. But the root cause is that the load balancer doesn't permit even
a slight imbalance between LLCs.

Exactly. It's easy to reproduce on multi-LLC NUMA systems such as some
AMD servers.

> 
>>
>> Our solution: Permit controlled load imbalance between LLCs on the same
>> NUMA node, prioritizing communication affinity over strict balance.
>>
>> Impact: In a virtual machine with one socket, multiple NUMA nodes (each
>> with 4 LLCs), unpatched systems suffered 3,000+ LLC migrations in 200
>> seconds as tasks cycled through all four LLCs. With the patch, migrations
>> stabilize at ≤10 instances, largely suppressing the NUMA-local LLC
>> thrashing.
> 
> Is there any improvement in iperf numbers with these changes?
> 
I observe a bit of improvement with this patch in my test.

>>
>> Signed-off-by: Jianyong Wu <wujianyong@hygon.cn>
>> ---
>>   kernel/sched/fair.c | 16 ++++++++++++++++
>>   1 file changed, 16 insertions(+)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 0fb9bf995a47..749210e6316b 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -11203,6 +11203,22 @@ static inline void calculate_imbalance(struct 
>> lb_env *env, struct sd_lb_stats *s
>>           }
>>   #endif
>> +        /* Allow imbalance between LLCs within a single NUMA node */
>> +        if (env->sd->child && env->sd->child->flags & SD_SHARE_LLC && 
>> env->sd->parent
>> +                && env->sd->parent->flags & SD_NUMA) {
> 
> This does not imply multiple LLC in package. SD_SHARE_LLC is
> SDF_SHARED_CHILD and will be set from SMT domain onwards. This condition
> will be true on Intel with SNC enabled despite not having multiple LLC
> and llc_nr will be number of cores there.
> 
> Perhaps multiple LLCs can be detected using:
> 
>      !((sd->child->flags ^ sd->flags) & SD_SHARE_LLC)

Great! Thanks!
>
>> +            int child_weight = env->sd->child->span_weight;
>> +            int llc_nr = env->sd->span_weight / child_weight;
>> +            int imb_nr, min;
>> +
>> +            if (llc_nr > 1) {
>> +                /* Let the imbalance not be greater than half of 
>> child_weight */
>> +                min = child_weight >= 4 ? 2 : 1;
>> +                imb_nr = max_t(int, min, child_weight >> 2);
> 
> Isn't this just max_t(int, child_weight >> 2, 1)?

I expect imb_nr to be 2 when child_weight is 4, since the smallest LLC
I have seen in multi-LLC NUMA systems has 4 CPUs. However, this may
leave those LLCs a bit overloaded; I'm not sure whether it's a good
idea.
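
For reference, with imb_nr = max_t(int, min, cw >> 2) and
min = (cw >= 4 ? 2 : 1) as in the patch, the two expressions only
differ when child_weight is in the 4..7 range:

     child_weight | patch imb_nr | max_t(cw >> 2, 1)
                2 |            1 |                 1
                4 |            2 |                 1
                8 |            2 |                 2
               16 |            4 |                 4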

> 
>> +                if (imb_nr >= env->imbalance)
>> +                    env->imbalance = 0;
> 
> At this point, we are trying to even out the number of idle CPUs on the
> destination and the busiest LLC. sched_balance_find_src_rq() will return
> NULL if it doesn't find an overloaded rq. Is waiting behind a task
> more beneficial than migrating to an idler LLC?
> 
It seems a small imbalance shouldn't hurt much or leave tasks waiting
to be scheduled, because we cap the imbalance at half, and in most
cases a quarter, of the LLC weight. Tolerating that imbalance reduces
the frequency of task migration and load balancing, which seems better
than enforcing strictly even balance.
We already do something similar between NUMA nodes, so it may be
reasonable to allow the same between LLCs.
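
For context, the NUMA-side allowance I mean is the imb_numa_nr /
adjust_numa_imbalance() logic in kernel/sched/fair.c, which looks
roughly like this (paraphrased from memory; the exact form differs
across kernel versions):

    #define NUMA_IMBALANCE_MIN	2

    static inline long adjust_numa_imbalance(int imbalance, int dst_running,
					     int imb_numa_nr)
    {
	    /* Destination node already busy enough: keep the imbalance */
	    if (dst_running > imb_numa_nr)
		    return imbalance;

	    /*
	     * Tolerate a small imbalance so a pair of communicating tasks
	     * can stay on the same node while it is lightly loaded.
	     */
	    if (imbalance <= NUMA_IMBALANCE_MIN)
		    return 0;

	    return imbalance;
    }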

Thanks
Jianyong Wu

>> +            }
>> +        }
>> +
>>           /* Number of tasks to move to restore balance */
>>           env->imbalance >>= 1;
> 






^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] sched/fair: allow imbalance between LLCs under NUMA
  2025-05-29 10:32   ` Jianyong Wu
@ 2025-05-30  6:09     ` K Prateek Nayak
  2025-05-30  7:36       ` Jianyong Wu
  2025-06-16  2:22       ` Jianyong Wu
  0 siblings, 2 replies; 11+ messages in thread
From: K Prateek Nayak @ 2025-05-30  6:09 UTC (permalink / raw)
  To: Jianyong Wu, Jianyong Wu, mingo, peterz, juri.lelli,
	vincent.guittot
  Cc: dietmar.eggemann, rostedt, bsegall, mgorman, vschneid,
	linux-kernel

Hello Jianyong,

On 5/29/2025 4:02 PM, Jianyong Wu wrote:
> 
> This will happen even when 2 task are located in a cpuset of 16 cpus that shares an LLC. I don't think that it's overloaded for this case.

But if they are located on 2 different CPUs, sched_balance_find_src_rq()
should not return any CPU right? Probably just a timing thing with some
system noise that causes the CPU running the server / client to be
temporarily overloaded.

> 
>   I've only seen
>> this happen when a noise like kworker comes in. What exactly is
>> causing these migrations in your case and is it actually that bad
>> for iperf?
> 
> I think it's the nohz idle balance that pulls these 2 iperf apart. But the root cause is that load balance doesn't permit even a slight imbalance among LLCs.
> 
> Exactly. It's easy to reproduce in those multi-LLCs NUMA system like some AMD servers.
> 
>>
>>>
>>> Our solution: Permit controlled load imbalance between LLCs on the same
>>> NUMA node, prioritizing communication affinity over strict balance.
>>>
>>> Impact: In a virtual machine with one socket, multiple NUMA nodes (each
>>> with 4 LLCs), unpatched systems suffered 3,000+ LLC migrations in 200
>>> seconds as tasks cycled through all four LLCs. With the patch, migrations
>>> stabilize at ≤10 instances, largely suppressing the NUMA-local LLC
>>> thrashing.
>>
>> Is there any improvement in iperf numbers with these changes?
>>
> I observe a bit of improvement with this patch in my test.

I'll also give this series a spin on my end to see if it helps.

> 
>>>
>>> Signed-off-by: Jianyong Wu <wujianyong@hygon.cn>
>>> ---
>>>   kernel/sched/fair.c | 16 ++++++++++++++++
>>>   1 file changed, 16 insertions(+)
>>>
>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> index 0fb9bf995a47..749210e6316b 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -11203,6 +11203,22 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
>>>           }
>>>   #endif
>>> +        /* Allow imbalance between LLCs within a single NUMA node */
>>> +        if (env->sd->child && env->sd->child->flags & SD_SHARE_LLC && env->sd->parent
>>> +                && env->sd->parent->flags & SD_NUMA) {
>>
>> This does not imply multiple LLC in package. SD_SHARE_LLC is
>> SDF_SHARED_CHILD and will be set from SMT domain onwards. This condition
>> will be true on Intel with SNC enabled despite not having multiple LLC
>> and llc_nr will be number of cores there.
>>
>> Perhaps multiple LLCs can be detected using:
>>
>>      !((sd->child->flags ^ sd->flags) & SD_SHARE_LLC)

This should have been just

     (sd->child->flags ^ sd->flags) & SD_SHARE_LLC

to find the LLC boundary. Not sure why I prefixed that "!". You also
have to ensure sd itself is not a NUMA domain, which is possible on
EPYC platforms with the L3-as-NUMA option and on Intel with SNC.
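
Something like the following untested sketch should cover both points
(leaving aside whether the SD_NUMA parent check from the original patch
is still wanted):

    /* Only at the domain just above the LLC, and not itself a NUMA domain */
    if (env->sd->child &&
        ((env->sd->child->flags ^ env->sd->flags) & SD_SHARE_LLC) &&
        !(env->sd->flags & SD_NUMA)) {
	    /* ... allow the controlled LLC imbalance here ... */
    }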

> 
> Great! Thanks!>
>>> +            int child_weight = env->sd->child->span_weight;
>>> +            int llc_nr = env->sd->span_weight / child_weight;
>>> +            int imb_nr, min;
>>> +
>>> +            if (llc_nr > 1) {
>>> +                /* Let the imbalance not be greater than half of child_weight */
>>> +                min = child_weight >= 4 ? 2 : 1;
>>> +                imb_nr = max_t(int, min, child_weight >> 2);
>>
>> Isn't this just max_t(int, child_weight >> 2, 1)?
> 
> I expect that imb_nr can be 2 when child_weight is 4, as I observe that the cpu number of LLC starts from 4 in the multi-LLCs NUMA system.
> However, this may cause the LLCs a bit overload. I'm not sure if it's a good idea.

My bad. I interpreted ">> 2" as "/ 2" here. Couple of brain stopped
working moments.

-- 
Thanks and Regards,
Prateek


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] sched/fair: allow imbalance between LLCs under NUMA
  2025-05-30  6:09     ` K Prateek Nayak
@ 2025-05-30  7:36       ` Jianyong Wu
  2025-06-16  2:22       ` Jianyong Wu
  1 sibling, 0 replies; 11+ messages in thread
From: Jianyong Wu @ 2025-05-30  7:36 UTC (permalink / raw)
  To: K Prateek Nayak, Jianyong Wu, mingo, peterz, juri.lelli,
	vincent.guittot
  Cc: dietmar.eggemann, rostedt, bsegall, mgorman, vschneid,
	linux-kernel

Hi Prateek,

On 5/30/2025 2:09 PM, K Prateek Nayak wrote:
> Hello Jianyong,
> 
> On 5/29/2025 4:02 PM, Jianyong Wu wrote:
>>
>> This will happen even when 2 task are located in a cpuset of 16 cpus 
>> that shares an LLC. I don't think that it's overloaded for this case.
> 
> But if they are located on 2 different CPUs, sched_balance_find_src_rq()
> should not return any CPU right? Probably just a timing thing with some
> system noise that causes the CPU running the server / client to be
> temporarily overloaded.
> 

I think it will.
Suppose this scenario: there are 2 LLCs, each with 4 cpus, under a
single NUMA node. At times one LLC has 4 tasks, each running on a
separate cpu, while the other LLC has no task running. Should the load
balancer take action to balance the workload? Absolutely yes.

The tricky point is that the balance attempts fail during the first few
tries. In the meantime, sd->nr_balance_failed increments until it
exceeds the sd->cache_nice_tries + 2 threshold. At that point, active
balancing is triggered and the migration thread eventually steps in to
move the task. That's exactly what I've observed.
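
For reference, the escalation described above corresponds roughly to
this fallback in need_active_balance() (paraphrased; the exact code
differs across kernel versions):

    static int need_active_balance(struct lb_env *env)
    {
	    struct sched_domain *sd = env->sd;

	    /* ... other conditions elided ... */

	    /* repeated failed attempts eventually force an active balance */
	    return unlikely(sd->nr_balance_failed > sd->cache_nice_tries + 2);
    }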

>>
>>   I've only seen
>>> this happen when a noise like kworker comes in. What exactly is
>>> causing these migrations in your case and is it actually that bad
>>> for iperf?
>>
>> I think it's the nohz idle balance that pulls these 2 iperf apart. But 
>> the root cause is that load balance doesn't permit even a slight 
>> imbalance among LLCs.
>>
>> Exactly. It's easy to reproduce in those multi-LLCs NUMA system like 
>> some AMD servers.
>>
>>>
>>>>
>>>> Our solution: Permit controlled load imbalance between LLCs on the same
>>>> NUMA node, prioritizing communication affinity over strict balance.
>>>>
>>>> Impact: In a virtual machine with one socket, multiple NUMA nodes (each
>>>> with 4 LLCs), unpatched systems suffered 3,000+ LLC migrations in 200
>>>> seconds as tasks cycled through all four LLCs. With the patch, 
>>>> migrations
>>>> stabilize at ≤10 instances, largely suppressing the NUMA-local LLC
>>>> thrashing.
>>>
>>> Is there any improvement in iperf numbers with these changes?
>>>
>> I observe a bit of improvement with this patch in my test.
> 
> I'll also give this series a spin on my end to see if it helps.
> 
Great! Let me know how it goes on your end.

>>
>>>>
>>>> Signed-off-by: Jianyong Wu <wujianyong@hygon.cn>
>>>> ---
>>>>   kernel/sched/fair.c | 16 ++++++++++++++++
>>>>   1 file changed, 16 insertions(+)
>>>>
>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>>> index 0fb9bf995a47..749210e6316b 100644
>>>> --- a/kernel/sched/fair.c
>>>> +++ b/kernel/sched/fair.c
>>>> @@ -11203,6 +11203,22 @@ static inline void 
>>>> calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
>>>>           }
>>>>   #endif
>>>> +        /* Allow imbalance between LLCs within a single NUMA node */
>>>> +        if (env->sd->child && env->sd->child->flags & SD_SHARE_LLC 
>>>> && env->sd->parent
>>>> +                && env->sd->parent->flags & SD_NUMA) {
>>>
>>> This does not imply multiple LLC in package. SD_SHARE_LLC is
>>> SDF_SHARED_CHILD and will be set from SMT domain onwards. This condition
>>> will be true on Intel with SNC enabled despite not having multiple LLC
>>> and llc_nr will be number of cores there.
>>>
>>> Perhaps multiple LLCs can be detected using:
>>>
>>>      !((sd->child->flags ^ sd->flags) & SD_SHARE_LLC)
> 
> This should have been just
> 
>      (sd->child->flags ^ sd->flags) & SD_SHARE_LLC
> 
> to find the LLC boundary. Not sure why I prefixed that "!". You also
> have to ensure sd itself is not a NUMA domain which is possible with L3
> as NUMA option EPYC platforms and Intel with SNC.
>
Thanks again, I made a mistake too.

Thanks
Jianyong
>>
>> Great! Thanks!>
>>>> +            int child_weight = env->sd->child->span_weight;
>>>> +            int llc_nr = env->sd->span_weight / child_weight;
>>>> +            int imb_nr, min;
>>>> +
>>>> +            if (llc_nr > 1) {
>>>> +                /* Let the imbalance not be greater than half of 
>>>> child_weight */
>>>> +                min = child_weight >= 4 ? 2 : 1;
>>>> +                imb_nr = max_t(int, min, child_weight >> 2);
>>>
>>> Isn't this just max_t(int, child_weight >> 2, 1)?
>>
>> I expect that imb_nr can be 2 when child_weight is 4, as I observe 
>> that the cpu number of LLC starts from 4 in the multi-LLCs NUMA system.
>> However, this may cause the LLCs a bit overload. I'm not sure if it's 
>> a good idea.
> 
> My bad. I interpreted ">> 2" as "/ 2" here. Couple of brain stopped
> working moments.
> 



^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] sched/fair: allow imbalance between LLCs under NUMA
  2025-05-30  6:09     ` K Prateek Nayak
  2025-05-30  7:36       ` Jianyong Wu
@ 2025-06-16  2:22       ` Jianyong Wu
  2025-06-17  4:06         ` K Prateek Nayak
  2025-06-18  6:37         ` K Prateek Nayak
  1 sibling, 2 replies; 11+ messages in thread
From: Jianyong Wu @ 2025-06-16  2:22 UTC (permalink / raw)
  To: K Prateek Nayak, Jianyong Wu, mingo, peterz, juri.lelli,
	vincent.guittot
  Cc: dietmar.eggemann, rostedt, bsegall, mgorman, vschneid,
	linux-kernel

Hi Prateek,

On 5/30/2025 2:09 PM, K Prateek Nayak wrote:
> Hello Jianyong,
> 
> On 5/29/2025 4:02 PM, Jianyong Wu wrote:
>>
>> This will happen even when 2 task are located in a cpuset of 16 cpus 
>> that shares an LLC. I don't think that it's overloaded for this case.
> 
> But if they are located on 2 different CPUs, sched_balance_find_src_rq()
> should not return any CPU right? Probably just a timing thing with some
> system noise that causes the CPU running the server / client to be
> temporarily overloaded.
> 
>>
>>   I've only seen
>>> this happen when a noise like kworker comes in. What exactly is
>>> causing these migrations in your case and is it actually that bad
>>> for iperf?
>>
>> I think it's the nohz idle balance that pulls these 2 iperf apart. But 
>> the root cause is that load balance doesn't permit even a slight 
>> imbalance among LLCs.
>>
>> Exactly. It's easy to reproduce in those multi-LLCs NUMA system like 
>> some AMD servers.
>>
>>>
>>>>
>>>> Our solution: Permit controlled load imbalance between LLCs on the same
>>>> NUMA node, prioritizing communication affinity over strict balance.
>>>>
>>>> Impact: In a virtual machine with one socket, multiple NUMA nodes (each
>>>> with 4 LLCs), unpatched systems suffered 3,000+ LLC migrations in 200
>>>> seconds as tasks cycled through all four LLCs. With the patch, 
>>>> migrations
>>>> stabilize at ≤10 instances, largely suppressing the NUMA-local LLC
>>>> thrashing.
>>>
>>> Is there any improvement in iperf numbers with these changes?
>>>
>> I observe a bit of improvement with this patch in my test.
> 
> I'll also give this series a spin on my end to see if it helps.

Would you mind letting me know if you've had a chance to try it out, or 
if there's any update on the progress?

Thanks
Jianyong
>
>>
>>>>
>>>> Signed-off-by: Jianyong Wu <wujianyong@hygon.cn>
>>>> ---
>>>>   kernel/sched/fair.c | 16 ++++++++++++++++
>>>>   1 file changed, 16 insertions(+)
>>>>
>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>>> index 0fb9bf995a47..749210e6316b 100644
>>>> --- a/kernel/sched/fair.c
>>>> +++ b/kernel/sched/fair.c
>>>> @@ -11203,6 +11203,22 @@ static inline void 
>>>> calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
>>>>           }
>>>>   #endif
>>>> +        /* Allow imbalance between LLCs within a single NUMA node */
>>>> +        if (env->sd->child && env->sd->child->flags & SD_SHARE_LLC 
>>>> && env->sd->parent
>>>> +                && env->sd->parent->flags & SD_NUMA) {
>>>
>>> This does not imply multiple LLC in package. SD_SHARE_LLC is
>>> SDF_SHARED_CHILD and will be set from SMT domain onwards. This condition
>>> will be true on Intel with SNC enabled despite not having multiple LLC
>>> and llc_nr will be number of cores there.
>>>
>>> Perhaps multiple LLCs can be detected using:
>>>
>>>      !((sd->child->flags ^ sd->flags) & SD_SHARE_LLC)
> 
> This should have been just
> 
>      (sd->child->flags ^ sd->flags) & SD_SHARE_LLC
> 
> to find the LLC boundary. Not sure why I prefixed that "!". You also
> have to ensure sd itself is not a NUMA domain which is possible with L3
> as NUMA option EPYC platforms and Intel with SNC.
> 
>>
>> Great! Thanks!>
>>>> +            int child_weight = env->sd->child->span_weight;
>>>> +            int llc_nr = env->sd->span_weight / child_weight;
>>>> +            int imb_nr, min;
>>>> +
>>>> +            if (llc_nr > 1) {
>>>> +                /* Let the imbalance not be greater than half of 
>>>> child_weight */
>>>> +                min = child_weight >= 4 ? 2 : 1;
>>>> +                imb_nr = max_t(int, min, child_weight >> 2);
>>>
>>> Isn't this just max_t(int, child_weight >> 2, 1)?
>>
>> I expect that imb_nr can be 2 when child_weight is 4, as I observe 
>> that the cpu number of LLC starts from 4 in the multi-LLCs NUMA system.
>> However, this may cause the LLCs a bit overload. I'm not sure if it's 
>> a good idea.
> 
> My bad. I interpreted ">> 2" as "/ 2" here. Couple of brain stopped
> working moments.
> 



^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] sched/fair: allow imbalance between LLCs under NUMA
  2025-06-16  2:22       ` Jianyong Wu
@ 2025-06-17  4:06         ` K Prateek Nayak
  2025-06-18  6:37         ` K Prateek Nayak
  1 sibling, 0 replies; 11+ messages in thread
From: K Prateek Nayak @ 2025-06-17  4:06 UTC (permalink / raw)
  To: Jianyong Wu, Jianyong Wu, mingo, peterz, juri.lelli,
	vincent.guittot
  Cc: dietmar.eggemann, rostedt, bsegall, mgorman, vschneid,
	linux-kernel

Hello Jianyong,

On 6/16/2025 7:52 AM, Jianyong Wu wrote:
>> I'll also give this series a spin on my end to see if it helps.
> 
> Would you mind letting me know if you've had a chance to try it out, or if there's any update on the progress?

I queued this up last night but didn't realize tip:sched/core had moved
since my last run, so my baseline numbers might be inaccurate for
comparison. Please give me one more day and I'll get back to you by
tomorrow after rerunning the baseline.

P.S. I saw a crash towards the end of my test run (might be unrelated).
I'll check on this too and get back to you.

-- 
Thanks and Regards,
Prateek


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] sched/fair: allow imbalance between LLCs under NUMA
  2025-06-16  2:22       ` Jianyong Wu
  2025-06-17  4:06         ` K Prateek Nayak
@ 2025-06-18  6:37         ` K Prateek Nayak
  2025-06-19  6:08           ` Jianyong Wu
  1 sibling, 1 reply; 11+ messages in thread
From: K Prateek Nayak @ 2025-06-18  6:37 UTC (permalink / raw)
  To: Jianyong Wu, Jianyong Wu, mingo, peterz, juri.lelli,
	vincent.guittot
  Cc: dietmar.eggemann, rostedt, bsegall, mgorman, vschneid,
	linux-kernel

Hello Jianyong,

On 6/16/2025 7:52 AM, Jianyong Wu wrote:
> Would you mind letting me know if you've had a chance to try it out, or if there's any update on the progress?

Here are my results from a dual socket 3rd Generation EPYC
system.

tl;dr I don't see any improvement, and there are a few regressions too,
but a few of those data points also have a lot of variance.

o Machine details

- 3rd Generation EPYC System
- 2 sockets each with 64C/128T
- NPS1 (Each socket is a NUMA node)
- C2 Disabled (POLL and C1(MWAIT) remained enabled)

o Kernel details

tip:	   tip:sched/core at commit 914873bc7df9 ("Merge tag
            'x86-build-2025-05-25' of
            git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip")

allow_imb: tip + this series as is

o Benchmark results

     ==================================================================
     Test          : hackbench
     Units         : Normalized time in seconds
     Interpretation: Lower is better
     Statistic     : AMean
     ==================================================================
     Case:           tip[pct imp](CV)     allow_imb[pct imp](CV)
      1-groups     1.00 [ -0.00](13.74)     1.03 [ -3.20]( 9.18)
      2-groups     1.00 [ -0.00]( 9.58)     1.06 [ -6.46]( 7.63)
      4-groups     1.00 [ -0.00]( 2.10)     1.01 [ -1.30]( 1.90)
      8-groups     1.00 [ -0.00]( 1.51)     0.99 [  1.42]( 0.91)
     16-groups     1.00 [ -0.00]( 1.10)     0.99 [  1.09]( 1.13)


     ==================================================================
     Test          : tbench
     Units         : Normalized throughput
     Interpretation: Higher is better
     Statistic     : AMean
     ==================================================================
     Clients:           tip[pct imp](CV)     allow_imb[pct imp](CV)
         1     1.00 [  0.00]( 0.82)     1.01 [  1.11]( 0.27)
         2     1.00 [  0.00]( 1.13)     1.00 [ -0.05]( 0.62)
         4     1.00 [  0.00]( 1.12)     1.02 [  2.36]( 0.19)
         8     1.00 [  0.00]( 0.93)     1.01 [  1.02]( 0.86)
        16     1.00 [  0.00]( 0.38)     1.01 [  0.71]( 1.71)
        32     1.00 [  0.00]( 0.66)     1.01 [  1.31]( 1.88)
        64     1.00 [  0.00]( 1.18)     0.98 [ -1.60]( 2.90)
       128     1.00 [  0.00]( 1.12)     1.02 [  1.60]( 0.42)
       256     1.00 [  0.00]( 0.42)     1.00 [  0.40]( 0.80)
       512     1.00 [  0.00]( 0.14)     1.01 [  0.97]( 0.25)
      1024     1.00 [  0.00]( 0.26)     1.01 [  1.29]( 0.19)


     ==================================================================
     Test          : stream-10
     Units         : Normalized Bandwidth, MB/s
     Interpretation: Higher is better
     Statistic     : HMean
     ==================================================================
     Test:           tip[pct imp](CV)     allow_imb[pct imp](CV)
      Copy     1.00 [  0.00]( 8.37)     1.01 [  1.00]( 5.71)
     Scale     1.00 [  0.00]( 2.85)     0.98 [ -1.94]( 5.23)
       Add     1.00 [  0.00]( 3.39)     0.99 [ -1.39]( 4.77)
     Triad     1.00 [  0.00]( 6.39)     1.05 [  5.15]( 5.62)


     ==================================================================
     Test          : stream-100
     Units         : Normalized Bandwidth, MB/s
     Interpretation: Higher is better
     Statistic     : HMean
     ==================================================================
     Test:           tip[pct imp](CV)     allow_imb[pct imp](CV)
      Copy     1.00 [  0.00]( 3.91)     1.01 [  1.28]( 2.01)
     Scale     1.00 [  0.00]( 4.34)     0.99 [ -0.65]( 3.74)
       Add     1.00 [  0.00]( 4.14)     1.01 [  0.54]( 1.63)
     Triad     1.00 [  0.00]( 1.00)     0.98 [ -2.28]( 4.89)


     ==================================================================
     Test          : netperf
     Units         : Normalized Throughput
     Interpretation: Higher is better
     Statistic     : AMean
     ==================================================================
     Clients:           tip[pct imp](CV)     allow_imb[pct imp](CV)
      1-clients     1.00 [  0.00]( 0.41)     1.01 [  1.17]( 0.39)
      2-clients     1.00 [  0.00]( 0.58)     1.01 [  1.00]( 0.40)
      4-clients     1.00 [  0.00]( 0.35)     1.01 [  0.73]( 0.50)
      8-clients     1.00 [  0.00]( 0.48)     1.00 [  0.42]( 0.67)
     16-clients     1.00 [  0.00]( 0.66)     1.01 [  0.84]( 0.57)
     32-clients     1.00 [  0.00]( 1.15)     1.01 [  0.82]( 0.96)
     64-clients     1.00 [  0.00]( 1.38)     1.00 [ -0.24]( 3.09)
     128-clients    1.00 [  0.00]( 0.87)     1.00 [ -0.16]( 1.02)
     256-clients    1.00 [  0.00]( 5.36)     1.01 [  0.66]( 4.55)
     512-clients    1.00 [  0.00](54.39)     0.98 [ -1.59](57.35)


     ==================================================================
     Test          : schbench
     Units         : Normalized 99th percentile latency in us
     Interpretation: Lower is better
     Statistic     : Median
     ==================================================================
     #workers:           tip[pct imp](CV)     allow_imb[pct imp](CV)
       1     1.00 [ -0.00]( 8.54)     1.04 [ -4.35]( 3.69)
       2     1.00 [ -0.00]( 1.15)     0.96 [  4.00]( 0.00)
       4     1.00 [ -0.00](13.46)     1.02 [ -2.08]( 2.04)
       8     1.00 [ -0.00]( 7.14)     0.82 [ 17.54]( 9.30)
      16     1.00 [ -0.00]( 3.49)     1.05 [ -5.08]( 7.83)
      32     1.00 [ -0.00]( 1.06)     1.01 [ -1.06]( 5.88)
      64     1.00 [ -0.00]( 5.48)     1.05 [ -4.65]( 2.71)
     128     1.00 [ -0.00](10.45)     1.09 [ -9.11](14.18)
     256     1.00 [ -0.00](31.14)     1.05 [ -5.15]( 9.79)
     512     1.00 [ -0.00]( 1.52)     0.96 [  4.30]( 0.26)


     ==================================================================
     Test          : new-schbench-requests-per-second
     Units         : Normalized Requests per second
     Interpretation: Higher is better
     Statistic     : Median
     ==================================================================
     #workers:           tip[pct imp](CV)     allow_imb[pct imp](CV)
       1     1.00 [  0.00]( 1.07)     1.00 [  0.29]( 0.61)
       2     1.00 [  0.00]( 0.00)     1.00 [  0.00]( 0.26)
       4     1.00 [  0.00]( 0.00)     1.00 [ -0.29]( 0.00)
       8     1.00 [  0.00]( 0.15)     1.00 [  0.29]( 0.15)
      16     1.00 [  0.00]( 0.00)     1.00 [  0.00]( 0.00)
      32     1.00 [  0.00]( 3.41)     0.97 [ -2.86]( 2.91)
      64     1.00 [  0.00]( 1.05)     0.97 [ -3.17]( 7.39)
     128     1.00 [  0.00]( 0.00)     1.00 [ -0.38]( 0.39)
     256     1.00 [  0.00]( 0.72)     1.01 [  0.61]( 0.96)
     512     1.00 [  0.00]( 0.57)     1.01 [  0.72]( 0.21)


     ==================================================================
     Test          : new-schbench-wakeup-latency
     Units         : Normalized 99th percentile latency in us
     Interpretation: Lower is better
     Statistic     : Median
     ==================================================================
     #workers:           tip[pct imp](CV)     allow_imb[pct imp](CV)
       1     1.00 [ -0.00]( 9.11)     0.69 [ 31.25]( 8.13)
       2     1.00 [ -0.00]( 0.00)     0.93 [  7.14]( 8.37)
       4     1.00 [ -0.00]( 3.78)     1.07 [ -7.14](14.79)
       8     1.00 [ -0.00]( 0.00)     1.08 [ -8.33]( 7.56)
      16     1.00 [ -0.00]( 7.56)     1.08 [ -7.69](34.36)
      32     1.00 [ -0.00](15.11)     1.00 [ -0.00](12.99)
      64     1.00 [ -0.00]( 9.63)     0.80 [ 20.00](11.17)
     128     1.00 [ -0.00]( 4.86)     0.98 [  2.01](13.01)
     256     1.00 [ -0.00]( 2.34)     1.01 [ -1.00]( 3.51)
     512     1.00 [ -0.00]( 0.40)     1.00 [  0.38]( 0.20)


     ==================================================================
     Test          : new-schbench-request-latency
     Units         : Normalized 99th percentile latency in us
     Interpretation: Lower is better
     Statistic     : Median
     ==================================================================
     #workers:           tip[pct imp](CV)     allow_imb[pct imp](CV)
       1     1.00 [ -0.00]( 2.73)     0.98 [  2.08]( 3.51)
       2     1.00 [ -0.00]( 0.87)     0.99 [  0.54]( 3.29)
       4     1.00 [ -0.00]( 1.21)     1.06 [ -5.92]( 0.82)
       8     1.00 [ -0.00]( 0.27)     1.03 [ -3.15]( 1.86)
      16     1.00 [ -0.00]( 4.04)     1.00 [ -0.27]( 2.27)
      32     1.00 [ -0.00]( 7.35)     1.30 [-30.45](20.57)
      64     1.00 [ -0.00]( 3.54)     1.01 [ -0.67]( 0.82)
     128     1.00 [ -0.00]( 0.37)     1.00 [  0.21]( 0.18)
     256     1.00 [ -0.00]( 9.57)     0.99 [  1.43]( 7.69)
     512     1.00 [ -0.00]( 1.82)     1.02 [ -2.10]( 0.89)


     ==================================================================
     Test          : Various longer running benchmarks
     Units         : %diff in throughput reported
     Interpretation: Higher is better
     Statistic     : Median
     ==================================================================
     Benchmarks:                  %diff
     ycsb-cassandra               0.07%
     ycsb-mongodb                -0.66%

     deathstarbench-1x            0.36%
     deathstarbench-2x            2.39%
     deathstarbench-3x           -0.09%
     deathstarbench-6x            1.53%

     hammerdb+mysql 16VU         -0.27%
     hammerdb+mysql 64VU         -0.32%

---

I cannot make a hard case for this optimization. You can perhaps
share your iperf numbers if you are seeing significant
improvements there.

-- 
Thanks and Regards,
Prateek


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] sched/fair: allow imbalance between LLCs under NUMA
  2025-06-18  6:37         ` K Prateek Nayak
@ 2025-06-19  6:08           ` Jianyong Wu
  2025-06-19  6:30             ` K Prateek Nayak
  0 siblings, 1 reply; 11+ messages in thread
From: Jianyong Wu @ 2025-06-19  6:08 UTC (permalink / raw)
  To: K Prateek Nayak, Jianyong Wu, mingo, peterz, juri.lelli,
	vincent.guittot
  Cc: dietmar.eggemann, rostedt, bsegall, mgorman, vschneid,
	linux-kernel

Hi Prateek,

Thank you for taking the time to test this patch.

This patch aims to reduce pointless task migrations, such as those seen
in the iperf tests; it wasn't written with raw performance as the main
goal. In my iperf tests there was no significant performance
improvement observed (though the number of task migrations decreased
substantially). Even when I bound the iperf tasks to the same LLC, the
performance metrics didn't improve much. So this change is unlikely to
enhance iperf performance notably, which suggests that task migration
has only a minimal effect on iperf.

IMO, we should allow at least two tasks per LLC so that communicating
tasks can stay together. In theory this could yield better performance,
even though I haven't found a workload that clearly demonstrates it yet.

If this change has a bad effect on performance, is there any suggestion
to mitigate the iperf migration issue? Or just leave it there?

Any suggestions would be greatly appreciated.

Thanks
Jianyong

On 6/18/2025 2:37 PM, K Prateek Nayak wrote:
> Hello Jianyong,
> 
> On 6/16/2025 7:52 AM, Jianyong Wu wrote:
>> Would you mind letting me know if you've had a chance to try it out, 
>> or if there's any update on the progress?
> 
> Here are my results from a dual socket 3rd Generation EPYC
> system.
> 
> tl;dr I don't see any improvement and a few regressions too
> but few of those data points also have a lot of variance.
> 
> o Machine details
> 
> - 3rd Generation EPYC System
> - 2 sockets each with 64C/128T
> - NPS1 (Each socket is a NUMA node)
> - C2 Disabled (POLL and C1(MWAIT) remained enabled)
> 
> o Kernel details
> 
> tip:       tip:sched/core at commit 914873bc7df9 ("Merge tag
>             'x86-build-2025-05-25' of
>             git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip")
> 
> allow_imb: tip + this series as is
> 
> o Benchmark results
> 
>      ==================================================================
>      Test          : hackbench
>      Units         : Normalized time in seconds
>      Interpretation: Lower is better
>      Statistic     : AMean
>      ==================================================================
>      Case:           tip[pct imp](CV)     allow_imb[pct imp](CV)
>       1-groups     1.00 [ -0.00](13.74)     1.03 [ -3.20]( 9.18)
>       2-groups     1.00 [ -0.00]( 9.58)     1.06 [ -6.46]( 7.63)
>       4-groups     1.00 [ -0.00]( 2.10)     1.01 [ -1.30]( 1.90)
>       8-groups     1.00 [ -0.00]( 1.51)     0.99 [  1.42]( 0.91)
>      16-groups     1.00 [ -0.00]( 1.10)     0.99 [  1.09]( 1.13)
> 
> 
>      ==================================================================
>      Test          : tbench
>      Units         : Normalized throughput
>      Interpretation: Higher is better
>      Statistic     : AMean
>      ==================================================================
>      Clients:           tip[pct imp](CV)     allow_imb[pct imp](CV)
>          1     1.00 [  0.00]( 0.82)     1.01 [  1.11]( 0.27)
>          2     1.00 [  0.00]( 1.13)     1.00 [ -0.05]( 0.62)
>          4     1.00 [  0.00]( 1.12)     1.02 [  2.36]( 0.19)
>          8     1.00 [  0.00]( 0.93)     1.01 [  1.02]( 0.86)
>         16     1.00 [  0.00]( 0.38)     1.01 [  0.71]( 1.71)
>         32     1.00 [  0.00]( 0.66)     1.01 [  1.31]( 1.88)
>         64     1.00 [  0.00]( 1.18)     0.98 [ -1.60]( 2.90)
>        128     1.00 [  0.00]( 1.12)     1.02 [  1.60]( 0.42)
>        256     1.00 [  0.00]( 0.42)     1.00 [  0.40]( 0.80)
>        512     1.00 [  0.00]( 0.14)     1.01 [  0.97]( 0.25)
>       1024     1.00 [  0.00]( 0.26)     1.01 [  1.29]( 0.19)
> 
> 
>      ==================================================================
>      Test          : stream-10
>      Units         : Normalized Bandwidth, MB/s
>      Interpretation: Higher is better
>      Statistic     : HMean
>      ==================================================================
>      Test:           tip[pct imp](CV)     allow_imb[pct imp](CV)
>       Copy     1.00 [  0.00]( 8.37)     1.01 [  1.00]( 5.71)
>      Scale     1.00 [  0.00]( 2.85)     0.98 [ -1.94]( 5.23)
>        Add     1.00 [  0.00]( 3.39)     0.99 [ -1.39]( 4.77)
>      Triad     1.00 [  0.00]( 6.39)     1.05 [  5.15]( 5.62)
> 
> 
>      ==================================================================
>      Test          : stream-100
>      Units         : Normalized Bandwidth, MB/s
>      Interpretation: Higher is better
>      Statistic     : HMean
>      ==================================================================
>      Test:           tip[pct imp](CV)     allow_imb[pct imp](CV)
>       Copy     1.00 [  0.00]( 3.91)     1.01 [  1.28]( 2.01)
>      Scale     1.00 [  0.00]( 4.34)     0.99 [ -0.65]( 3.74)
>        Add     1.00 [  0.00]( 4.14)     1.01 [  0.54]( 1.63)
>      Triad     1.00 [  0.00]( 1.00)     0.98 [ -2.28]( 4.89)
> 
> 
>      ==================================================================
>      Test          : netperf
>      Units         : Normalized Throughput
>      Interpretation: Higher is better
>      Statistic     : AMean
>      ==================================================================
>      Clients:           tip[pct imp](CV)     allow_imb[pct imp](CV)
>       1-clients     1.00 [  0.00]( 0.41)     1.01 [  1.17]( 0.39)
>       2-clients     1.00 [  0.00]( 0.58)     1.01 [  1.00]( 0.40)
>       4-clients     1.00 [  0.00]( 0.35)     1.01 [  0.73]( 0.50)
>       8-clients     1.00 [  0.00]( 0.48)     1.00 [  0.42]( 0.67)
>      16-clients     1.00 [  0.00]( 0.66)     1.01 [  0.84]( 0.57)
>      32-clients     1.00 [  0.00]( 1.15)     1.01 [  0.82]( 0.96)
>      64-clients     1.00 [  0.00]( 1.38)     1.00 [ -0.24]( 3.09)
>      128-clients    1.00 [  0.00]( 0.87)     1.00 [ -0.16]( 1.02)
>      256-clients    1.00 [  0.00]( 5.36)     1.01 [  0.66]( 4.55)
>      512-clients    1.00 [  0.00](54.39)     0.98 [ -1.59](57.35)
> 
> 
>      ==================================================================
>      Test          : schbench
>      Units         : Normalized 99th percentile latency in us
>      Interpretation: Lower is better
>      Statistic     : Median
>      ==================================================================
>      #workers:           tip[pct imp](CV)     allow_imb[pct imp](CV)
>        1     1.00 [ -0.00]( 8.54)     1.04 [ -4.35]( 3.69)
>        2     1.00 [ -0.00]( 1.15)     0.96 [  4.00]( 0.00)
>        4     1.00 [ -0.00](13.46)     1.02 [ -2.08]( 2.04)
>        8     1.00 [ -0.00]( 7.14)     0.82 [ 17.54]( 9.30)
>       16     1.00 [ -0.00]( 3.49)     1.05 [ -5.08]( 7.83)
>       32     1.00 [ -0.00]( 1.06)     1.01 [ -1.06]( 5.88)
>       64     1.00 [ -0.00]( 5.48)     1.05 [ -4.65]( 2.71)
>      128     1.00 [ -0.00](10.45)     1.09 [ -9.11](14.18)
>      256     1.00 [ -0.00](31.14)     1.05 [ -5.15]( 9.79)
>      512     1.00 [ -0.00]( 1.52)     0.96 [  4.30]( 0.26)
> 
> 
>      ==================================================================
>      Test          : new-schbench-requests-per-second
>      Units         : Normalized Requests per second
>      Interpretation: Higher is better
>      Statistic     : Median
>      ==================================================================
>      #workers:           tip[pct imp](CV)     allow_imb[pct imp](CV)
>        1     1.00 [  0.00]( 1.07)     1.00 [  0.29]( 0.61)
>        2     1.00 [  0.00]( 0.00)     1.00 [  0.00]( 0.26)
>        4     1.00 [  0.00]( 0.00)     1.00 [ -0.29]( 0.00)
>        8     1.00 [  0.00]( 0.15)     1.00 [  0.29]( 0.15)
>       16     1.00 [  0.00]( 0.00)     1.00 [  0.00]( 0.00)
>       32     1.00 [  0.00]( 3.41)     0.97 [ -2.86]( 2.91)
>       64     1.00 [  0.00]( 1.05)     0.97 [ -3.17]( 7.39)
>      128     1.00 [  0.00]( 0.00)     1.00 [ -0.38]( 0.39)
>      256     1.00 [  0.00]( 0.72)     1.01 [  0.61]( 0.96)
>      512     1.00 [  0.00]( 0.57)     1.01 [  0.72]( 0.21)
> 
> 
>      ==================================================================
>      Test          : new-schbench-wakeup-latency
>      Units         : Normalized 99th percentile latency in us
>      Interpretation: Lower is better
>      Statistic     : Median
>      ==================================================================
>      #workers:           tip[pct imp](CV)     allow_imb[pct imp](CV)
>        1     1.00 [ -0.00]( 9.11)     0.69 [ 31.25]( 8.13)
>        2     1.00 [ -0.00]( 0.00)     0.93 [  7.14]( 8.37)
>        4     1.00 [ -0.00]( 3.78)     1.07 [ -7.14](14.79)
>        8     1.00 [ -0.00]( 0.00)     1.08 [ -8.33]( 7.56)
>       16     1.00 [ -0.00]( 7.56)     1.08 [ -7.69](34.36)
>       32     1.00 [ -0.00](15.11)     1.00 [ -0.00](12.99)
>       64     1.00 [ -0.00]( 9.63)     0.80 [ 20.00](11.17)
>      128     1.00 [ -0.00]( 4.86)     0.98 [  2.01](13.01)
>      256     1.00 [ -0.00]( 2.34)     1.01 [ -1.00]( 3.51)
>      512     1.00 [ -0.00]( 0.40)     1.00 [  0.38]( 0.20)
> 
> 
>      ==================================================================
>      Test          : new-schbench-request-latency
>      Units         : Normalized 99th percentile latency in us
>      Interpretation: Lower is better
>      Statistic     : Median
>      ==================================================================
>      #workers:           tip[pct imp](CV)     allow_imb[pct imp](CV)
>        1     1.00 [ -0.00]( 2.73)     0.98 [  2.08]( 3.51)
>        2     1.00 [ -0.00]( 0.87)     0.99 [  0.54]( 3.29)
>        4     1.00 [ -0.00]( 1.21)     1.06 [ -5.92]( 0.82)
>        8     1.00 [ -0.00]( 0.27)     1.03 [ -3.15]( 1.86)
>       16     1.00 [ -0.00]( 4.04)     1.00 [ -0.27]( 2.27)
>       32     1.00 [ -0.00]( 7.35)     1.30 [-30.45](20.57)
>       64     1.00 [ -0.00]( 3.54)     1.01 [ -0.67]( 0.82)
>      128     1.00 [ -0.00]( 0.37)     1.00 [  0.21]( 0.18)
>      256     1.00 [ -0.00]( 9.57)     0.99 [  1.43]( 7.69)
>      512     1.00 [ -0.00]( 1.82)     1.02 [ -2.10]( 0.89)
> 
> 
>      ==================================================================
>      Test          : Various longer running benchmarks
>      Units         : %diff in throughput reported
>      Interpretation: Higher is better
>      Statistic     : Median
>      ==================================================================
>      Benchmarks:                  %diff
>      ycsb-cassandra               0.07%
>      ycsb-mongodb                -0.66%
> 
>      deathstarbench-1x            0.36%
>      deathstarbench-2x            2.39%
>      deathstarbench-3x           -0.09%
>      deathstarbench-6x            1.53%
> 
>      hammerdb+mysql 16VU         -0.27%
>      hammerdb+mysql 64VU         -0.32%
> 
> ---
> 
> I cannot make a hard case for this optimization. You can perhaps
> share your iperf numbers if you are seeing significant
> improvements there.
> 


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] sched/fair: allow imbalance between LLCs under NUMA
  2025-06-19  6:08           ` Jianyong Wu
@ 2025-06-19  6:30             ` K Prateek Nayak
  2025-06-19  6:59               ` Jianyong Wu
  0 siblings, 1 reply; 11+ messages in thread
From: K Prateek Nayak @ 2025-06-19  6:30 UTC (permalink / raw)
  To: Jianyong Wu, Jianyong Wu, mingo, peterz, juri.lelli,
	vincent.guittot
  Cc: dietmar.eggemann, rostedt, bsegall, mgorman, vschneid,
	linux-kernel

Hello Jianyong,

On 6/19/2025 11:38 AM, Jianyong Wu wrote:
> If this change has a bad effect on performance, is there any suggestion
> to mitigate the iperf migration issue?

How big of a performance difference are you seeing? I still don't see
any numbers from your testing on the thread.

> Or just leave it there?
  
Ideally, the cache-aware load balancing series [1] should be able to
address these concerns. I suggest testing iperf with those changes and
checking if that solves the issues of excessive migration.

[1] https://lore.kernel.org/lkml/cover.1750268218.git.tim.c.chen@linux.intel.com/

-- 
Thanks and Regards,
Prateek


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] sched/fair: allow imbalance between LLCs under NUMA
  2025-06-19  6:30             ` K Prateek Nayak
@ 2025-06-19  6:59               ` Jianyong Wu
  0 siblings, 0 replies; 11+ messages in thread
From: Jianyong Wu @ 2025-06-19  6:59 UTC (permalink / raw)
  To: K Prateek Nayak, Jianyong Wu, mingo, peterz, juri.lelli,
	vincent.guittot
  Cc: dietmar.eggemann, rostedt, bsegall, mgorman, vschneid,
	linux-kernel

Hi Prateek,

On 6/19/2025 2:30 PM, K Prateek Nayak wrote:
> Hello Jianyong,
> 
> On 6/19/2025 11:38 AM, Jianyong Wu wrote:
>> If this change has bad effect for performance, is there any suggestion
>> to mitigate the iperf migration issue?
> 
> How big of a performance difference are you seeing? I still don't see
> any numbers from your testing on the thread.

Sorry for that. Here is the data.

On a machine with 8 NUMA nodes, each with 4 LLCs, 128 cores in total
with SMT2.

test command:
server: iperf3 -s
client: iperf3 -c 127.0.0.1 -t 100 -i 2

==================================================
default                  allow imb           delta
25.3 Gbits/sec           26.7 Gbits/sec      +5.5%
==================================================
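
In case it helps reproduction: one way to get the migration counts is
the sched_migrate_task tracepoint, e.g. something like the following
(this counts all migrations of iperf3 tasks; mapping orig_cpu/dest_cpu
to LLC IDs would then give the cross-LLC subset):

    # record migrations system-wide for the run, then count the iperf3 ones
    perf record -e sched:sched_migrate_task -a -- sleep 100
    perf script | grep -c 'comm=iperf3'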


> 
>> Or just leave it there?
> 
> Ideally, the cache-aware load balancing series [1] should be able to
> address these concerns. I suggest testing iperf with those changes and
> checking if that solves the issues of excessive migration.
> 
> [1] https://lore.kernel.org/lkml/cover.1750268218.git.tim.c.chen@linux.intel.com/
> 

I know that patch set. It may be a little heavy. I'll try it with the
iperf test.

Thanks
Jianyong

^ permalink raw reply	[flat|nested] 11+ messages in thread

Thread overview: 11+ messages
2025-05-28  7:09 [PATCH] sched/fair: allow imbalance between LLCs under NUMA Jianyong Wu
2025-05-29  6:39 ` K Prateek Nayak
2025-05-29 10:32   ` Jianyong Wu
2025-05-30  6:09     ` K Prateek Nayak
2025-05-30  7:36       ` Jianyong Wu
2025-06-16  2:22       ` Jianyong Wu
2025-06-17  4:06         ` K Prateek Nayak
2025-06-18  6:37         ` K Prateek Nayak
2025-06-19  6:08           ` Jianyong Wu
2025-06-19  6:30             ` K Prateek Nayak
2025-06-19  6:59               ` Jianyong Wu
