* [PATCH] sched/fair: allow imbalance between LLCs under NUMA
@ 2025-05-28 7:09 Jianyong Wu
2025-05-29 6:39 ` K Prateek Nayak
0 siblings, 1 reply; 11+ messages in thread
From: Jianyong Wu @ 2025-05-28 7:09 UTC (permalink / raw)
To: mingo, peterz, juri.lelli, vincent.guittot
Cc: dietmar.eggemann, rostedt, bsegall, mgorman, vschneid,
linux-kernel, wujianyong, jianyong.wu
The efficiency gains from co-locating communicating tasks within the same
LLC are well-established. However, in multi-LLC NUMA systems, the load
balancer unintentionally sabotages this optimization.
Observe this pattern: On a NUMA node with 4 LLCs, the iperf3 client first
wakes the server within its initial LLC (e.g., LLC_0). The load balancer
subsequently migrates the client to a different LLC (e.g., LLC_1). When
the client next wakes the server, it now targets the server’s placement
to LLC_1 (the client’s new location). The server then migrates to LLC_1,
but the load balancer may reallocate the client to another
LLC (e.g., LLC_2) later. This cycle repeats, causing both tasks to
perpetually chase each other across all four LLCs — a sustained
cross-LLC ping-pong within the NUMA node.
Our solution: Permit controlled load imbalance between LLCs on the same
NUMA node, prioritizing communication affinity over strict balance.
Impact: In a virtual machine with one socket, multiple NUMA nodes (each
with 4 LLCs), unpatched systems suffered 3,000+ LLC migrations in 200
seconds as tasks cycled through all four LLCs. With the patch, migrations
stabilize at ≤10 instances, largely suppressing the NUMA-local LLC
thrashing.
Signed-off-by: Jianyong Wu <wujianyong@hygon.cn>
---
kernel/sched/fair.c | 16 ++++++++++++++++
1 file changed, 16 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0fb9bf995a47..749210e6316b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -11203,6 +11203,22 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
}
#endif
+ /* Allow imbalance between LLCs within a single NUMA node */
+ if (env->sd->child && env->sd->child->flags & SD_SHARE_LLC && env->sd->parent
+ && env->sd->parent->flags & SD_NUMA) {
+ int child_weight = env->sd->child->span_weight;
+ int llc_nr = env->sd->span_weight / child_weight;
+ int imb_nr, min;
+
+ if (llc_nr > 1) {
+ /* Let the imbalance not be greater than half of child_weight */
+ min = child_weight >= 4 ? 2 : 1;
+ imb_nr = max_t(int, min, child_weight >> 2);
+ if (imb_nr >= env->imbalance)
+ env->imbalance = 0;
+ }
+ }
+
/* Number of tasks to move to restore balance */
env->imbalance >>= 1;
--
2.43.0
* Re: [PATCH] sched/fair: allow imbalance between LLCs under NUMA
2025-05-28 7:09 [PATCH] sched/fair: allow imbalance between LLCs under NUMA Jianyong Wu
@ 2025-05-29 6:39 ` K Prateek Nayak
2025-05-29 10:32 ` Jianyong Wu
0 siblings, 1 reply; 11+ messages in thread
From: K Prateek Nayak @ 2025-05-29 6:39 UTC (permalink / raw)
To: Jianyong Wu, mingo, peterz, juri.lelli, vincent.guittot
Cc: dietmar.eggemann, rostedt, bsegall, mgorman, vschneid,
linux-kernel, jianyong.wu
On 5/28/2025 12:39 PM, Jianyong Wu wrote:
> The efficiency gains from co-locating communicating tasks within the same
> LLC are well-established. However, in multi-LLC NUMA systems, the load
> balancer unintentionally sabotages this optimization.
>
> Observe this pattern: On a NUMA node with 4 LLCs, the iperf3 client first
> wakes the server within its initial LLC (e.g., LLC_0). The load balancer
> subsequently migrates the client to a different LLC (e.g., LLC_1). When
> the client next wakes the server, it now targets the server’s placement
> to LLC_1 (the client’s new location). The server then migrates to LLC_1,
> but the load balancer may reallocate the client to another
> LLC (e.g., LLC_2) later. This cycle repeats, causing both tasks to
> perpetually chase each other across all four LLCs — a sustained
> cross-LLC ping-pong within the NUMA node.
Migration only happens if the CPU is overloaded, right? I've only seen
this happen when noise like a kworker comes in. What exactly is
causing these migrations in your case, and is it actually that bad
for iperf?
>
> Our solution: Permit controlled load imbalance between LLCs on the same
> NUMA node, prioritizing communication affinity over strict balance.
>
> Impact: In a virtual machine with one socket, multiple NUMA nodes (each
> with 4 LLCs), unpatched systems suffered 3,000+ LLC migrations in 200
> seconds as tasks cycled through all four LLCs. With the patch, migrations
> stabilize at ≤10 instances, largely suppressing the NUMA-local LLC
> thrashing.
Is there any improvement in iperf numbers with these changes?
>
> Signed-off-by: Jianyong Wu <wujianyong@hygon.cn>
> ---
> kernel/sched/fair.c | 16 ++++++++++++++++
> 1 file changed, 16 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 0fb9bf995a47..749210e6316b 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -11203,6 +11203,22 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
> }
> #endif
>
> + /* Allow imbalance between LLCs within a single NUMA node */
> + if (env->sd->child && env->sd->child->flags & SD_SHARE_LLC && env->sd->parent
> + && env->sd->parent->flags & SD_NUMA) {
This does not imply multiple LLCs in the package. SD_SHARE_LLC is
SDF_SHARED_CHILD and will be set from the SMT domain onwards. This
condition will be true on Intel with SNC enabled despite there not being
multiple LLCs, and llc_nr will be the number of cores there.
Perhaps multiple LLCs can be detected using:
!((sd->child->flags ^ sd->flags) & SD_SHARE_LLC)
> + int child_weight = env->sd->child->span_weight;
> + int llc_nr = env->sd->span_weight / child_weight;
> + int imb_nr, min;
> +
> + if (llc_nr > 1) {
> + /* Let the imbalance not be greater than half of child_weight */
> + min = child_weight >= 4 ? 2 : 1;
> + imb_nr = max_t(int, min, child_weight >> 2);
Isn't this just max_t(int, child_weight >> 2, 1)?
> + if (imb_nr >= env->imbalance)
> + env->imbalance = 0;
At this point, we are trying to even out the number of idle CPUs between
the destination and the busiest LLC. sched_balance_find_src_rq() will
return NULL if it doesn't find an overloaded rq. Is waiting behind a task
more beneficial than migrating to an idler LLC?
> + }
> + }
> +
> /* Number of tasks to move to restore balance */
> env->imbalance >>= 1;
>
--
Thanks and Regards,
Prateek
* Re: [PATCH] sched/fair: allow imbalance between LLCs under NUMA
2025-05-29 6:39 ` K Prateek Nayak
@ 2025-05-29 10:32 ` Jianyong Wu
2025-05-30 6:09 ` K Prateek Nayak
0 siblings, 1 reply; 11+ messages in thread
From: Jianyong Wu @ 2025-05-29 10:32 UTC (permalink / raw)
To: K Prateek Nayak, Jianyong Wu, mingo, peterz, juri.lelli,
vincent.guittot
Cc: dietmar.eggemann, rostedt, bsegall, mgorman, vschneid,
linux-kernel
Hello K Prateek Nayak, thanks for the reply.
On 5/29/2025 2:39 PM, K Prateek Nayak wrote:
> On 5/28/2025 12:39 PM, Jianyong Wu wrote:
>> The efficiency gains from co-locating communicating tasks within the same
>> LLC are well-established. However, in multi-LLC NUMA systems, the load
>> balancer unintentionally sabotages this optimization.
>>
>> Observe this pattern: On a NUMA node with 4 LLCs, the iperf3 client first
>> wakes the server within its initial LLC (e.g., LLC_0). The load balancer
>> subsequently migrates the client to a different LLC (e.g., LLC_1). When
>> the client next wakes the server, it now targets the server’s placement
>> to LLC_1 (the client’s new location). The server then migrates to LLC_1,
>> but the load balancer may reallocate the client to another
>> LLC (e.g., LLC_2) later. This cycle repeats, causing both tasks to
>> perpetually chase each other across all four LLCs — a sustained
>> cross-LLC ping-pong within the NUMA node.
>
> Migration only happens if the CPU is overloaded right?
This will happen even when 2 tasks are located in a cpuset of 16 CPUs
that share an LLC. I don't think it's overloaded in this case.
> I've only seen
> this happen when a noise like kworker comes in. What exactly is
> causing these migrations in your case and is it actually that bad
> for iperf?
I think it's the nohz idle balance that pulls these 2 iperf tasks apart.
But the root cause is that load balancing doesn't permit even a slight
imbalance among LLCs.
Exactly. It's easy to reproduce on multi-LLC NUMA systems like some AMD
servers.
>
>>
>> Our solution: Permit controlled load imbalance between LLCs on the same
>> NUMA node, prioritizing communication affinity over strict balance.
>>
>> Impact: In a virtual machine with one socket, multiple NUMA nodes (each
>> with 4 LLCs), unpatched systems suffered 3,000+ LLC migrations in 200
>> seconds as tasks cycled through all four LLCs. With the patch, migrations
>> stabilize at ≤10 instances, largely suppressing the NUMA-local LLC
>> thrashing.
>
> Is there any improvement in iperf numbers with these changes?
>
I observe a bit of improvement with this patch in my test.
>>
>> Signed-off-by: Jianyong Wu <wujianyong@hygon.cn>
>> ---
>> kernel/sched/fair.c | 16 ++++++++++++++++
>> 1 file changed, 16 insertions(+)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 0fb9bf995a47..749210e6316b 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -11203,6 +11203,22 @@ static inline void calculate_imbalance(struct
>> lb_env *env, struct sd_lb_stats *s
>> }
>> #endif
>> + /* Allow imbalance between LLCs within a single NUMA node */
>> + if (env->sd->child && env->sd->child->flags & SD_SHARE_LLC &&
>> env->sd->parent
>> + && env->sd->parent->flags & SD_NUMA) {
>
> This does not imply multiple LLC in package. SD_SHARE_LLC is
> SDF_SHARED_CHILD and will be set from SMT domain onwards. This condition
> will be true on Intel with SNC enabled despite not having multiple LLC
> and llc_nr will be number of cores there.
>
> Perhaps multiple LLCs can be detected using:
>
> !((sd->child->flags ^ sd->flags) & SD_SHARE_LLC)
Great! Thanks!
>
>> + int child_weight = env->sd->child->span_weight;
>> + int llc_nr = env->sd->span_weight / child_weight;
>> + int imb_nr, min;
>> +
>> + if (llc_nr > 1) {
>> + /* Let the imbalance not be greater than half of
>> child_weight */
>> + min = child_weight >= 4 ? 2 : 1;
>> + imb_nr = max_t(int, min, child_weight >> 2);
>
> Isn't this just max_t(int, child_weight >> 2, 1)?
I expect imb_nr to be 2 when child_weight is 4, as I observe that the
CPU count per LLC starts at 4 in multi-LLC NUMA systems.
However, this may overload the LLCs a bit. I'm not sure whether it's a
good idea.
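For what it's worth, with the formula as posted imb_nr works out to 2 for
child_weight 4 or 8, 4 for 16 and 8 for 32, i.e. a quarter of the LLC
with a floor of 2 once an LLC has at least 4 CPUs.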
>
>> + if (imb_nr >= env->imbalance)
>> + env->imbalance = 0;
>
> At this point, we are trying to even out the number of idle CPUs on the
> destination and the busiest LLC. sched_balance_find_src_rq() will return
> NULL if it doesn't find an overloaded rq. Is waiting behind a task
> more beneficial than migrating to an idler LLC?
>
It seems that a small imbalance shouldn't hurt so much as to leave tasks
waiting to be scheduled, because we limit the imbalance to no more than
half, in most cases a quarter, of the LLC weight. Allowing it can reduce
the frequency of task migration and load balancing, which is better than
enforcing strict balance rules.
We have done similar things between NUMA nodes, so maybe it's reasonable
to do the same between LLCs.
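For reference, the NUMA-level precedent I have in mind is roughly the
following (a simplified sketch of adjust_numa_imbalance() in
kernel/sched/fair.c; the exact details vary across kernel versions):

#define NUMA_IMBALANCE_MIN	2

static inline long adjust_numa_imbalance(int imbalance, int dst_running,
					  int imb_numa_nr)
{
	/* Only tolerate an imbalance while the destination is lightly loaded */
	if (dst_running > imb_numa_nr)
		return imbalance;

	/*
	 * Allow a small imbalance so that a pair of communicating tasks
	 * can stay local instead of being spread apart.
	 */
	if (imbalance <= NUMA_IMBALANCE_MIN)
		return 0;

	return imbalance;
}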
Thanks
Jianyong Wu
>> + }
>> + }
>> +
>> /* Number of tasks to move to restore balance */
>> env->imbalance >>= 1;
>
* Re: [PATCH] sched/fair: allow imbalance between LLCs under NUMA
2025-05-29 10:32 ` Jianyong Wu
@ 2025-05-30 6:09 ` K Prateek Nayak
2025-05-30 7:36 ` Jianyong Wu
2025-06-16 2:22 ` Jianyong Wu
0 siblings, 2 replies; 11+ messages in thread
From: K Prateek Nayak @ 2025-05-30 6:09 UTC (permalink / raw)
To: Jianyong Wu, Jianyong Wu, mingo, peterz, juri.lelli,
vincent.guittot
Cc: dietmar.eggemann, rostedt, bsegall, mgorman, vschneid,
linux-kernel
Hello Jianyong,
On 5/29/2025 4:02 PM, Jianyong Wu wrote:
>
> This will happen even when 2 task are located in a cpuset of 16 cpus that shares an LLC. I don't think that it's overloaded for this case.
But if they are located on 2 different CPUs, sched_balance_find_src_rq()
should not return any CPU, right? Probably just a timing thing with some
system noise that causes the CPU running the server/client to be
temporarily overloaded.
>
> I've only seen
>> this happen when a noise like kworker comes in. What exactly is
>> causing these migrations in your case and is it actually that bad
>> for iperf?
>
> I think it's the nohz idle balance that pulls these 2 iperf apart. But the root cause is that load balance doesn't permit even a slight imbalance among LLCs.
>
> Exactly. It's easy to reproduce in those multi-LLCs NUMA system like some AMD servers.
>
>>
>>>
>>> Our solution: Permit controlled load imbalance between LLCs on the same
>>> NUMA node, prioritizing communication affinity over strict balance.
>>>
>>> Impact: In a virtual machine with one socket, multiple NUMA nodes (each
>>> with 4 LLCs), unpatched systems suffered 3,000+ LLC migrations in 200
>>> seconds as tasks cycled through all four LLCs. With the patch, migrations
>>> stabilize at ≤10 instances, largely suppressing the NUMA-local LLC
>>> thrashing.
>>
>> Is there any improvement in iperf numbers with these changes?
>>
> I observe a bit of improvement with this patch in my test.
I'll also give this series a spin on my end to see if it helps.
>
>>>
>>> Signed-off-by: Jianyong Wu <wujianyong@hygon.cn>
>>> ---
>>> kernel/sched/fair.c | 16 ++++++++++++++++
>>> 1 file changed, 16 insertions(+)
>>>
>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> index 0fb9bf995a47..749210e6316b 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -11203,6 +11203,22 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
>>> }
>>> #endif
>>> + /* Allow imbalance between LLCs within a single NUMA node */
>>> + if (env->sd->child && env->sd->child->flags & SD_SHARE_LLC && env->sd->parent
>>> + && env->sd->parent->flags & SD_NUMA) {
>>
>> This does not imply multiple LLC in package. SD_SHARE_LLC is
>> SDF_SHARED_CHILD and will be set from SMT domain onwards. This condition
>> will be true on Intel with SNC enabled despite not having multiple LLC
>> and llc_nr will be number of cores there.
>>
>> Perhaps multiple LLCs can be detected using:
>>
>> !((sd->child->flags ^ sd->flags) & SD_SHARE_LLC)
This should have been just
(sd->child->flags ^ sd->flags) & SD_SHARE_LLC
to find the LLC boundary. Not sure why I prefixed that "!". You also
have to ensure sd itself is not a NUMA domain, which is possible on EPYC
platforms with the L3-as-NUMA option and on Intel with SNC.
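Purely as an untested sketch of how the combined checks could look (just
to illustrate the flag logic, not a concrete proposal):

	/*
	 * env->sd itself must not be a NUMA domain, its parent must be,
	 * and SD_SHARE_LLC must change at the child boundary, i.e.
	 * env->sd->child is the LLC domain and env->sd spans multiple LLCs.
	 */
	if (env->sd->child && env->sd->parent &&
	    (env->sd->parent->flags & SD_NUMA) &&
	    !(env->sd->flags & SD_NUMA) &&
	    ((env->sd->child->flags ^ env->sd->flags) & SD_SHARE_LLC)) {
		/* ... allow the imbalance as in your patch ... */
	}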
>
> Great! Thanks!>
>>> + int child_weight = env->sd->child->span_weight;
>>> + int llc_nr = env->sd->span_weight / child_weight;
>>> + int imb_nr, min;
>>> +
>>> + if (llc_nr > 1) {
>>> + /* Let the imbalance not be greater than half of child_weight */
>>> + min = child_weight >= 4 ? 2 : 1;
>>> + imb_nr = max_t(int, min, child_weight >> 2);
>>
>> Isn't this just max_t(int, child_weight >> 2, 1)?
>
> I expect that imb_nr can be 2 when child_weight is 4, as I observe that the cpu number of LLC starts from 4 in the multi-LLCs NUMA system.
> However, this may cause the LLCs a bit overload. I'm not sure if it's a good idea.
My bad. I interpreted ">> 2" as "/ 2" here. A couple of
brain-stopped-working moments.
--
Thanks and Regards,
Prateek
* Re: [PATCH] sched/fair: allow imbalance between LLCs under NUMA
2025-05-30 6:09 ` K Prateek Nayak
@ 2025-05-30 7:36 ` Jianyong Wu
2025-06-16 2:22 ` Jianyong Wu
1 sibling, 0 replies; 11+ messages in thread
From: Jianyong Wu @ 2025-05-30 7:36 UTC (permalink / raw)
To: K Prateek Nayak, Jianyong Wu, mingo, peterz, juri.lelli,
vincent.guittot
Cc: dietmar.eggemann, rostedt, bsegall, mgorman, vschneid,
linux-kernel
Hi Prateek,
On 5/30/2025 2:09 PM, K Prateek Nayak wrote:
> Hello Jianyong,
>
> On 5/29/2025 4:02 PM, Jianyong Wu wrote:
>>
>> This will happen even when 2 task are located in a cpuset of 16 cpus
>> that shares an LLC. I don't think that it's overloaded for this case.
>
> But if they are located on 2 different CPUs, sched_balance_find_src_rq()
> should not return any CPU right? Probably just a timing thing with some
> system noise that causes the CPU running the server / client to be
> temporarily overloaded.
>
I think it will.
Suppose this scenario: there are 2 LLCs, each with 4 CPUs, under a
single NUMA node. At times, one LLC has 4 tasks, each running on a
separate CPU, while the other LLC has no task running. Should the
load balancer take action to balance the workload? Absolutely yes.
The tricky point is that the balance attempt will fail during the first
few tries. Meanwhile, sd->nr_balance_failed increments until it exceeds
the threshold sd->cache_nice_tries + 2. At that point, active balancing
is triggered, and eventually the migration thread steps in to migrate
the running task. That's exactly what I've observed.
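For reference, the escalation path I'm describing looks roughly like this
(a heavily simplified sketch of the !ld_moved handling in
sched_balance_rq()/need_active_balance() in kernel/sched/fair.c; the real
code also takes the busiest rq's lock, sets rq->active_balance and so on):

	if (!ld_moved) {
		sd->nr_balance_failed++;

		/* need_active_balance(): repeated failures escalate */
		if (sd->nr_balance_failed > sd->cache_nice_tries + 2) {
			/*
			 * Ask the stopper thread on the busiest CPU to push
			 * its currently running task over to this CPU.
			 */
			stop_one_cpu_nowait(cpu_of(busiest),
					    active_load_balance_cpu_stop,
					    busiest,
					    &busiest->active_balance_work);
		}
	}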
>>
>> I've only seen
>>> this happen when a noise like kworker comes in. What exactly is
>>> causing these migrations in your case and is it actually that bad
>>> for iperf?
>>
>> I think it's the nohz idle balance that pulls these 2 iperf apart. But
>> the root cause is that load balance doesn't permit even a slight
>> imbalance among LLCs.
>>
>> Exactly. It's easy to reproduce in those multi-LLCs NUMA system like
>> some AMD servers.
>>
>>>
>>>>
>>>> Our solution: Permit controlled load imbalance between LLCs on the same
>>>> NUMA node, prioritizing communication affinity over strict balance.
>>>>
>>>> Impact: In a virtual machine with one socket, multiple NUMA nodes (each
>>>> with 4 LLCs), unpatched systems suffered 3,000+ LLC migrations in 200
>>>> seconds as tasks cycled through all four LLCs. With the patch,
>>>> migrations
>>>> stabilize at ≤10 instances, largely suppressing the NUMA-local LLC
>>>> thrashing.
>>>
>>> Is there any improvement in iperf numbers with these changes?
>>>
>> I observe a bit of improvement with this patch in my test.
>
> I'll also give this series a spin on my end to see if it helps.
>
Great! Let me know how it goes on your end.
>>
>>>>
>>>> Signed-off-by: Jianyong Wu <wujianyong@hygon.cn>
>>>> ---
>>>> kernel/sched/fair.c | 16 ++++++++++++++++
>>>> 1 file changed, 16 insertions(+)
>>>>
>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>>> index 0fb9bf995a47..749210e6316b 100644
>>>> --- a/kernel/sched/fair.c
>>>> +++ b/kernel/sched/fair.c
>>>> @@ -11203,6 +11203,22 @@ static inline void
>>>> calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
>>>> }
>>>> #endif
>>>> + /* Allow imbalance between LLCs within a single NUMA node */
>>>> + if (env->sd->child && env->sd->child->flags & SD_SHARE_LLC
>>>> && env->sd->parent
>>>> + && env->sd->parent->flags & SD_NUMA) {
>>>
>>> This does not imply multiple LLC in package. SD_SHARE_LLC is
>>> SDF_SHARED_CHILD and will be set from SMT domain onwards. This condition
>>> will be true on Intel with SNC enabled despite not having multiple LLC
>>> and llc_nr will be number of cores there.
>>>
>>> Perhaps multiple LLCs can be detected using:
>>>
>>> !((sd->child->flags ^ sd->flags) & SD_SHARE_LLC)
>
> This should have been just
>
> (sd->child->flags ^ sd->flags) & SD_SHARE_LLC
>
> to find the LLC boundary. Not sure why I prefixed that "!". You also
> have to ensure sd itself is not a NUMA domain which is possible with L3
> as NUMA option EPYC platforms and Intel with SNC.
>
Thanks again, I made a mistake too.
Thanks
Jianyong
>>
>> Great! Thanks!>
>>>> + int child_weight = env->sd->child->span_weight;
>>>> + int llc_nr = env->sd->span_weight / child_weight;
>>>> + int imb_nr, min;
>>>> +
>>>> + if (llc_nr > 1) {
>>>> + /* Let the imbalance not be greater than half of
>>>> child_weight */
>>>> + min = child_weight >= 4 ? 2 : 1;
>>>> + imb_nr = max_t(int, min, child_weight >> 2);
>>>
>>> Isn't this just max_t(int, child_weight >> 2, 1)?
>>
>> I expect that imb_nr can be 2 when child_weight is 4, as I observe
>> that the cpu number of LLC starts from 4 in the multi-LLCs NUMA system.
>> However, this may cause the LLCs a bit overload. I'm not sure if it's
>> a good idea.
>
> My bad. I interpreted ">> 2" as "/ 2" here. Couple of brain stopped
> working moments.
>
* Re: [PATCH] sched/fair: allow imbalance between LLCs under NUMA
2025-05-30 6:09 ` K Prateek Nayak
2025-05-30 7:36 ` Jianyong Wu
@ 2025-06-16 2:22 ` Jianyong Wu
2025-06-17 4:06 ` K Prateek Nayak
2025-06-18 6:37 ` K Prateek Nayak
1 sibling, 2 replies; 11+ messages in thread
From: Jianyong Wu @ 2025-06-16 2:22 UTC (permalink / raw)
To: K Prateek Nayak, Jianyong Wu, mingo, peterz, juri.lelli,
vincent.guittot
Cc: dietmar.eggemann, rostedt, bsegall, mgorman, vschneid,
linux-kernel
Hi Prateek,
On 5/30/2025 2:09 PM, K Prateek Nayak wrote:
> Hello Jianyong,
>
> On 5/29/2025 4:02 PM, Jianyong Wu wrote:
>>
>> This will happen even when 2 task are located in a cpuset of 16 cpus
>> that shares an LLC. I don't think that it's overloaded for this case.
>
> But if they are located on 2 different CPUs, sched_balance_find_src_rq()
> should not return any CPU right? Probably just a timing thing with some
> system noise that causes the CPU running the server / client to be
> temporarily overloaded.
>
>>
>> I've only seen
>>> this happen when a noise like kworker comes in. What exactly is
>>> causing these migrations in your case and is it actually that bad
>>> for iperf?
>>
>> I think it's the nohz idle balance that pulls these 2 iperf apart. But
>> the root cause is that load balance doesn't permit even a slight
>> imbalance among LLCs.
>>
>> Exactly. It's easy to reproduce in those multi-LLCs NUMA system like
>> some AMD servers.
>>
>>>
>>>>
>>>> Our solution: Permit controlled load imbalance between LLCs on the same
>>>> NUMA node, prioritizing communication affinity over strict balance.
>>>>
>>>> Impact: In a virtual machine with one socket, multiple NUMA nodes (each
>>>> with 4 LLCs), unpatched systems suffered 3,000+ LLC migrations in 200
>>>> seconds as tasks cycled through all four LLCs. With the patch,
>>>> migrations
>>>> stabilize at ≤10 instances, largely suppressing the NUMA-local LLC
>>>> thrashing.
>>>
>>> Is there any improvement in iperf numbers with these changes?
>>>
>> I observe a bit of improvement with this patch in my test.
>
> I'll also give this series a spin on my end to see if it helps.
Would you mind letting me know if you've had a chance to try it out, or
if there's any update on the progress?
Thanks
Jianyong
>
>>
>>>>
>>>> Signed-off-by: Jianyong Wu <wujianyong@hygon.cn>
>>>> ---
>>>> kernel/sched/fair.c | 16 ++++++++++++++++
>>>> 1 file changed, 16 insertions(+)
>>>>
>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>>> index 0fb9bf995a47..749210e6316b 100644
>>>> --- a/kernel/sched/fair.c
>>>> +++ b/kernel/sched/fair.c
>>>> @@ -11203,6 +11203,22 @@ static inline void
>>>> calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
>>>> }
>>>> #endif
>>>> + /* Allow imbalance between LLCs within a single NUMA node */
>>>> + if (env->sd->child && env->sd->child->flags & SD_SHARE_LLC
>>>> && env->sd->parent
>>>> + && env->sd->parent->flags & SD_NUMA) {
>>>
>>> This does not imply multiple LLC in package. SD_SHARE_LLC is
>>> SDF_SHARED_CHILD and will be set from SMT domain onwards. This condition
>>> will be true on Intel with SNC enabled despite not having multiple LLC
>>> and llc_nr will be number of cores there.
>>>
>>> Perhaps multiple LLCs can be detected using:
>>>
>>> !((sd->child->flags ^ sd->flags) & SD_SHARE_LLC)
>
> This should have been just
>
> (sd->child->flags ^ sd->flags) & SD_SHARE_LLC
>
> to find the LLC boundary. Not sure why I prefixed that "!". You also
> have to ensure sd itself is not a NUMA domain which is possible with L3
> as NUMA option EPYC platforms and Intel with SNC.
>
>>
>> Great! Thanks!>
>>>> + int child_weight = env->sd->child->span_weight;
>>>> + int llc_nr = env->sd->span_weight / child_weight;
>>>> + int imb_nr, min;
>>>> +
>>>> + if (llc_nr > 1) {
>>>> + /* Let the imbalance not be greater than half of
>>>> child_weight */
>>>> + min = child_weight >= 4 ? 2 : 1;
>>>> + imb_nr = max_t(int, min, child_weight >> 2);
>>>
>>> Isn't this just max_t(int, child_weight >> 2, 1)?
>>
>> I expect that imb_nr can be 2 when child_weight is 4, as I observe
>> that the cpu number of LLC starts from 4 in the multi-LLCs NUMA system.
>> However, this may cause the LLCs a bit overload. I'm not sure if it's
>> a good idea.
>
> My bad. I interpreted ">> 2" as "/ 2" here. Couple of brain stopped
> working moments.
>
* Re: [PATCH] sched/fair: allow imbalance between LLCs under NUMA
2025-06-16 2:22 ` Jianyong Wu
@ 2025-06-17 4:06 ` K Prateek Nayak
2025-06-18 6:37 ` K Prateek Nayak
1 sibling, 0 replies; 11+ messages in thread
From: K Prateek Nayak @ 2025-06-17 4:06 UTC (permalink / raw)
To: Jianyong Wu, Jianyong Wu, mingo, peterz, juri.lelli,
vincent.guittot
Cc: dietmar.eggemann, rostedt, bsegall, mgorman, vschneid,
linux-kernel
Hello Jianyong,
On 6/16/2025 7:52 AM, Jianyong Wu wrote:
>> I'll also give this series a spin on my end to see if it helps.
>
> Would you mind letting me know if you've had a chance to try it out, or if there's any update on the progress?
I queued this up last night but didn't realize tip:sched/core had moved
since my last run, so my baseline numbers might be inaccurate for
comparison. Please give me one more day and I'll get back to you by
tomorrow after rerunning the baseline.
P.S. I saw a crash towards the end of my test run (might be unrelated).
I'll check on this too and get back to you.
--
Thanks and Regards,
Prateek
* Re: [PATCH] sched/fair: allow imbalance between LLCs under NUMA
2025-06-16 2:22 ` Jianyong Wu
2025-06-17 4:06 ` K Prateek Nayak
@ 2025-06-18 6:37 ` K Prateek Nayak
2025-06-19 6:08 ` Jianyong Wu
1 sibling, 1 reply; 11+ messages in thread
From: K Prateek Nayak @ 2025-06-18 6:37 UTC (permalink / raw)
To: Jianyong Wu, Jianyong Wu, mingo, peterz, juri.lelli,
vincent.guittot
Cc: dietmar.eggemann, rostedt, bsegall, mgorman, vschneid,
linux-kernel
Hello Jianyong,
On 6/16/2025 7:52 AM, Jianyong Wu wrote:
> Would you mind letting me know if you've had a chance to try it out, or if there's any update on the progress?
Here are my results from a dual socket 3rd Generation EPYC
system.
tl;dr I don't see any improvement, and there are a few regressions too,
but a few of those data points also have a lot of variance.
o Machine details
- 3rd Generation EPYC System
- 2 sockets each with 64C/128T
- NPS1 (Each socket is a NUMA node)
- C2 Disabled (POLL and C1(MWAIT) remained enabled)
o Kernel details
tip: tip:sched/core at commit 914873bc7df9 ("Merge tag
'x86-build-2025-05-25' of
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip")
allow_imb: tip + this series as is
o Benchmark results
==================================================================
Test : hackbench
Units : Normalized time in seconds
Interpretation: Lower is better
Statistic : AMean
==================================================================
Case: tip[pct imp](CV) allow_imb[pct imp](CV)
1-groups 1.00 [ -0.00](13.74) 1.03 [ -3.20]( 9.18)
2-groups 1.00 [ -0.00]( 9.58) 1.06 [ -6.46]( 7.63)
4-groups 1.00 [ -0.00]( 2.10) 1.01 [ -1.30]( 1.90)
8-groups 1.00 [ -0.00]( 1.51) 0.99 [ 1.42]( 0.91)
16-groups 1.00 [ -0.00]( 1.10) 0.99 [ 1.09]( 1.13)
==================================================================
Test : tbench
Units : Normalized throughput
Interpretation: Higher is better
Statistic : AMean
==================================================================
Clients: tip[pct imp](CV) allow_imb[pct imp](CV)
1 1.00 [ 0.00]( 0.82) 1.01 [ 1.11]( 0.27)
2 1.00 [ 0.00]( 1.13) 1.00 [ -0.05]( 0.62)
4 1.00 [ 0.00]( 1.12) 1.02 [ 2.36]( 0.19)
8 1.00 [ 0.00]( 0.93) 1.01 [ 1.02]( 0.86)
16 1.00 [ 0.00]( 0.38) 1.01 [ 0.71]( 1.71)
32 1.00 [ 0.00]( 0.66) 1.01 [ 1.31]( 1.88)
64 1.00 [ 0.00]( 1.18) 0.98 [ -1.60]( 2.90)
128 1.00 [ 0.00]( 1.12) 1.02 [ 1.60]( 0.42)
256 1.00 [ 0.00]( 0.42) 1.00 [ 0.40]( 0.80)
512 1.00 [ 0.00]( 0.14) 1.01 [ 0.97]( 0.25)
1024 1.00 [ 0.00]( 0.26) 1.01 [ 1.29]( 0.19)
==================================================================
Test : stream-10
Units : Normalized Bandwidth, MB/s
Interpretation: Higher is better
Statistic : HMean
==================================================================
Test: tip[pct imp](CV) allow_imb[pct imp](CV)
Copy 1.00 [ 0.00]( 8.37) 1.01 [ 1.00]( 5.71)
Scale 1.00 [ 0.00]( 2.85) 0.98 [ -1.94]( 5.23)
Add 1.00 [ 0.00]( 3.39) 0.99 [ -1.39]( 4.77)
Triad 1.00 [ 0.00]( 6.39) 1.05 [ 5.15]( 5.62)
==================================================================
Test : stream-100
Units : Normalized Bandwidth, MB/s
Interpretation: Higher is better
Statistic : HMean
==================================================================
Test: tip[pct imp](CV) allow_imb[pct imp](CV)
Copy 1.00 [ 0.00]( 3.91) 1.01 [ 1.28]( 2.01)
Scale 1.00 [ 0.00]( 4.34) 0.99 [ -0.65]( 3.74)
Add 1.00 [ 0.00]( 4.14) 1.01 [ 0.54]( 1.63)
Triad 1.00 [ 0.00]( 1.00) 0.98 [ -2.28]( 4.89)
==================================================================
Test : netperf
Units : Normalized Throughput
Interpretation: Higher is better
Statistic : AMean
==================================================================
Clients: tip[pct imp](CV) allow_imb[pct imp](CV)
1-clients 1.00 [ 0.00]( 0.41) 1.01 [ 1.17]( 0.39)
2-clients 1.00 [ 0.00]( 0.58) 1.01 [ 1.00]( 0.40)
4-clients 1.00 [ 0.00]( 0.35) 1.01 [ 0.73]( 0.50)
8-clients 1.00 [ 0.00]( 0.48) 1.00 [ 0.42]( 0.67)
16-clients 1.00 [ 0.00]( 0.66) 1.01 [ 0.84]( 0.57)
32-clients 1.00 [ 0.00]( 1.15) 1.01 [ 0.82]( 0.96)
64-clients 1.00 [ 0.00]( 1.38) 1.00 [ -0.24]( 3.09)
128-clients 1.00 [ 0.00]( 0.87) 1.00 [ -0.16]( 1.02)
256-clients 1.00 [ 0.00]( 5.36) 1.01 [ 0.66]( 4.55)
512-clients 1.00 [ 0.00](54.39) 0.98 [ -1.59](57.35)
==================================================================
Test : schbench
Units : Normalized 99th percentile latency in us
Interpretation: Lower is better
Statistic : Median
==================================================================
#workers: tip[pct imp](CV) allow_imb[pct imp](CV)
1 1.00 [ -0.00]( 8.54) 1.04 [ -4.35]( 3.69)
2 1.00 [ -0.00]( 1.15) 0.96 [ 4.00]( 0.00)
4 1.00 [ -0.00](13.46) 1.02 [ -2.08]( 2.04)
8 1.00 [ -0.00]( 7.14) 0.82 [ 17.54]( 9.30)
16 1.00 [ -0.00]( 3.49) 1.05 [ -5.08]( 7.83)
32 1.00 [ -0.00]( 1.06) 1.01 [ -1.06]( 5.88)
64 1.00 [ -0.00]( 5.48) 1.05 [ -4.65]( 2.71)
128 1.00 [ -0.00](10.45) 1.09 [ -9.11](14.18)
256 1.00 [ -0.00](31.14) 1.05 [ -5.15]( 9.79)
512 1.00 [ -0.00]( 1.52) 0.96 [ 4.30]( 0.26)
==================================================================
Test : new-schbench-requests-per-second
Units : Normalized Requests per second
Interpretation: Higher is better
Statistic : Median
==================================================================
#workers: tip[pct imp](CV) allow_imb[pct imp](CV)
1 1.00 [ 0.00]( 1.07) 1.00 [ 0.29]( 0.61)
2 1.00 [ 0.00]( 0.00) 1.00 [ 0.00]( 0.26)
4 1.00 [ 0.00]( 0.00) 1.00 [ -0.29]( 0.00)
8 1.00 [ 0.00]( 0.15) 1.00 [ 0.29]( 0.15)
16 1.00 [ 0.00]( 0.00) 1.00 [ 0.00]( 0.00)
32 1.00 [ 0.00]( 3.41) 0.97 [ -2.86]( 2.91)
64 1.00 [ 0.00]( 1.05) 0.97 [ -3.17]( 7.39)
128 1.00 [ 0.00]( 0.00) 1.00 [ -0.38]( 0.39)
256 1.00 [ 0.00]( 0.72) 1.01 [ 0.61]( 0.96)
512 1.00 [ 0.00]( 0.57) 1.01 [ 0.72]( 0.21)
==================================================================
Test : new-schbench-wakeup-latency
Units : Normalized 99th percentile latency in us
Interpretation: Lower is better
Statistic : Median
==================================================================
#workers: tip[pct imp](CV) allow_imb[pct imp](CV)
1 1.00 [ -0.00]( 9.11) 0.69 [ 31.25]( 8.13)
2 1.00 [ -0.00]( 0.00) 0.93 [ 7.14]( 8.37)
4 1.00 [ -0.00]( 3.78) 1.07 [ -7.14](14.79)
8 1.00 [ -0.00]( 0.00) 1.08 [ -8.33]( 7.56)
16 1.00 [ -0.00]( 7.56) 1.08 [ -7.69](34.36)
32 1.00 [ -0.00](15.11) 1.00 [ -0.00](12.99)
64 1.00 [ -0.00]( 9.63) 0.80 [ 20.00](11.17)
128 1.00 [ -0.00]( 4.86) 0.98 [ 2.01](13.01)
256 1.00 [ -0.00]( 2.34) 1.01 [ -1.00]( 3.51)
512 1.00 [ -0.00]( 0.40) 1.00 [ 0.38]( 0.20)
==================================================================
Test : new-schbench-request-latency
Units : Normalized 99th percentile latency in us
Interpretation: Lower is better
Statistic : Median
==================================================================
#workers: tip[pct imp](CV) allow_imb[pct imp](CV)
1 1.00 [ -0.00]( 2.73) 0.98 [ 2.08]( 3.51)
2 1.00 [ -0.00]( 0.87) 0.99 [ 0.54]( 3.29)
4 1.00 [ -0.00]( 1.21) 1.06 [ -5.92]( 0.82)
8 1.00 [ -0.00]( 0.27) 1.03 [ -3.15]( 1.86)
16 1.00 [ -0.00]( 4.04) 1.00 [ -0.27]( 2.27)
32 1.00 [ -0.00]( 7.35) 1.30 [-30.45](20.57)
64 1.00 [ -0.00]( 3.54) 1.01 [ -0.67]( 0.82)
128 1.00 [ -0.00]( 0.37) 1.00 [ 0.21]( 0.18)
256 1.00 [ -0.00]( 9.57) 0.99 [ 1.43]( 7.69)
512 1.00 [ -0.00]( 1.82) 1.02 [ -2.10]( 0.89)
==================================================================
Test : Various longer running benchmarks
Units : %diff in throughput reported
Interpretation: Higher is better
Statistic : Median
==================================================================
Benchmarks: %diff
ycsb-cassandra 0.07%
ycsb-mongodb -0.66%
deathstarbench-1x 0.36%
deathstarbench-2x 2.39%
deathstarbench-3x -0.09%
deathstarbench-6x 1.53%
hammerdb+mysql 16VU -0.27%
hammerdb+mysql 64VU -0.32%
---
I cannot make a hard case for this optimization. You can perhaps
share your iperf numbers if you are seeing significant
improvements there.
--
Thanks and Regards,
Prateek
* Re: [PATCH] sched/fair: allow imbalance between LLCs under NUMA
2025-06-18 6:37 ` K Prateek Nayak
@ 2025-06-19 6:08 ` Jianyong Wu
2025-06-19 6:30 ` K Prateek Nayak
0 siblings, 1 reply; 11+ messages in thread
From: Jianyong Wu @ 2025-06-19 6:08 UTC (permalink / raw)
To: K Prateek Nayak, Jianyong Wu, mingo, peterz, juri.lelli,
vincent.guittot
Cc: dietmar.eggemann, rostedt, bsegall, mgorman, vschneid,
linux-kernel
Hi Prateek,
Thank you for taking the time to test this patch.
This patch aims to reduce meaningless task migrations, such as those in
the iperf tests, and was not written with performance as its main goal.
In my iperf tests there wasn't a significant performance improvement
observed (notably, though, the number of task migrations decreased
substantially). Even when I bound the iperf tasks to the same LLC, the
performance metrics didn't improve significantly. Therefore, this change
is unlikely to enhance iperf performance notably, which indicates that
task migration has minimal effect on the iperf tests.
IMO, we should allow at least two tasks per LLC so that a pair of
communicating tasks can stay together. Theoretically, this could yield
better performance, even though I haven't found a valid scenario to
demonstrate it yet.
If this change is bad for performance, is there any suggestion for
mitigating the iperf migration issue, or should we just leave it as is?
Any suggestions would be greatly appreciated.
Thanks
Jianyong
On 6/18/2025 2:37 PM, K Prateek Nayak wrote:
> Hello Jianyong,
>
> On 6/16/2025 7:52 AM, Jianyong Wu wrote:
>> Would you mind letting me know if you've had a chance to try it out,
>> or if there's any update on the progress?
>
> Here are my results from a dual socket 3rd Generation EPYC
> system.
>
> tl;dr I don't see any improvement and a few regressions too
> but few of those data points also have a lot of variance.
>
> o Machine details
>
> - 3rd Generation EPYC System
> - 2 sockets each with 64C/128T
> - NPS1 (Each socket is a NUMA node)
> - C2 Disabled (POLL and C1(MWAIT) remained enabled)
>
> o Kernel details
>
> tip: tip:sched/core at commit 914873bc7df9 ("Merge tag
> 'x86-build-2025-05-25' of
> git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip")
>
> allow_imb: tip + this series as is
>
> o Benchmark results
>
> ==================================================================
> Test : hackbench
> Units : Normalized time in seconds
> Interpretation: Lower is better
> Statistic : AMean
> ==================================================================
> Case: tip[pct imp](CV) allow_imb[pct imp](CV)
> 1-groups 1.00 [ -0.00](13.74) 1.03 [ -3.20]( 9.18)
> 2-groups 1.00 [ -0.00]( 9.58) 1.06 [ -6.46]( 7.63)
> 4-groups 1.00 [ -0.00]( 2.10) 1.01 [ -1.30]( 1.90)
> 8-groups 1.00 [ -0.00]( 1.51) 0.99 [ 1.42]( 0.91)
> 16-groups 1.00 [ -0.00]( 1.10) 0.99 [ 1.09]( 1.13)
>
>
> ==================================================================
> Test : tbench
> Units : Normalized throughput
> Interpretation: Higher is better
> Statistic : AMean
> ==================================================================
> Clients: tip[pct imp](CV) allow_imb[pct imp](CV)
> 1 1.00 [ 0.00]( 0.82) 1.01 [ 1.11]( 0.27)
> 2 1.00 [ 0.00]( 1.13) 1.00 [ -0.05]( 0.62)
> 4 1.00 [ 0.00]( 1.12) 1.02 [ 2.36]( 0.19)
> 8 1.00 [ 0.00]( 0.93) 1.01 [ 1.02]( 0.86)
> 16 1.00 [ 0.00]( 0.38) 1.01 [ 0.71]( 1.71)
> 32 1.00 [ 0.00]( 0.66) 1.01 [ 1.31]( 1.88)
> 64 1.00 [ 0.00]( 1.18) 0.98 [ -1.60]( 2.90)
> 128 1.00 [ 0.00]( 1.12) 1.02 [ 1.60]( 0.42)
> 256 1.00 [ 0.00]( 0.42) 1.00 [ 0.40]( 0.80)
> 512 1.00 [ 0.00]( 0.14) 1.01 [ 0.97]( 0.25)
> 1024 1.00 [ 0.00]( 0.26) 1.01 [ 1.29]( 0.19)
>
>
> ==================================================================
> Test : stream-10
> Units : Normalized Bandwidth, MB/s
> Interpretation: Higher is better
> Statistic : HMean
> ==================================================================
> Test: tip[pct imp](CV) allow_imb[pct imp](CV)
> Copy 1.00 [ 0.00]( 8.37) 1.01 [ 1.00]( 5.71)
> Scale 1.00 [ 0.00]( 2.85) 0.98 [ -1.94]( 5.23)
> Add 1.00 [ 0.00]( 3.39) 0.99 [ -1.39]( 4.77)
> Triad 1.00 [ 0.00]( 6.39) 1.05 [ 5.15]( 5.62)
>
>
> ==================================================================
> Test : stream-100
> Units : Normalized Bandwidth, MB/s
> Interpretation: Higher is better
> Statistic : HMean
> ==================================================================
> Test: tip[pct imp](CV) allow_imb[pct imp](CV)
> Copy 1.00 [ 0.00]( 3.91) 1.01 [ 1.28]( 2.01)
> Scale 1.00 [ 0.00]( 4.34) 0.99 [ -0.65]( 3.74)
> Add 1.00 [ 0.00]( 4.14) 1.01 [ 0.54]( 1.63)
> Triad 1.00 [ 0.00]( 1.00) 0.98 [ -2.28]( 4.89)
>
>
> ==================================================================
> Test : netperf
> Units : Normalized Throughput
> Interpretation: Higher is better
> Statistic : AMean
> ==================================================================
> Clients: tip[pct imp](CV) allow_imb[pct imp](CV)
> 1-clients 1.00 [ 0.00]( 0.41) 1.01 [ 1.17]( 0.39)
> 2-clients 1.00 [ 0.00]( 0.58) 1.01 [ 1.00]( 0.40)
> 4-clients 1.00 [ 0.00]( 0.35) 1.01 [ 0.73]( 0.50)
> 8-clients 1.00 [ 0.00]( 0.48) 1.00 [ 0.42]( 0.67)
> 16-clients 1.00 [ 0.00]( 0.66) 1.01 [ 0.84]( 0.57)
> 32-clients 1.00 [ 0.00]( 1.15) 1.01 [ 0.82]( 0.96)
> 64-clients 1.00 [ 0.00]( 1.38) 1.00 [ -0.24]( 3.09)
> 128-clients 1.00 [ 0.00]( 0.87) 1.00 [ -0.16]( 1.02)
> 256-clients 1.00 [ 0.00]( 5.36) 1.01 [ 0.66]( 4.55)
> 512-clients 1.00 [ 0.00](54.39) 0.98 [ -1.59](57.35)
>
>
> ==================================================================
> Test : schbench
> Units : Normalized 99th percentile latency in us
> Interpretation: Lower is better
> Statistic : Median
> ==================================================================
> #workers: tip[pct imp](CV) allow_imb[pct imp](CV)
> 1 1.00 [ -0.00]( 8.54) 1.04 [ -4.35]( 3.69)
> 2 1.00 [ -0.00]( 1.15) 0.96 [ 4.00]( 0.00)
> 4 1.00 [ -0.00](13.46) 1.02 [ -2.08]( 2.04)
> 8 1.00 [ -0.00]( 7.14) 0.82 [ 17.54]( 9.30)
> 16 1.00 [ -0.00]( 3.49) 1.05 [ -5.08]( 7.83)
> 32 1.00 [ -0.00]( 1.06) 1.01 [ -1.06]( 5.88)
> 64 1.00 [ -0.00]( 5.48) 1.05 [ -4.65]( 2.71)
> 128 1.00 [ -0.00](10.45) 1.09 [ -9.11](14.18)
> 256 1.00 [ -0.00](31.14) 1.05 [ -5.15]( 9.79)
> 512 1.00 [ -0.00]( 1.52) 0.96 [ 4.30]( 0.26)
>
>
> ==================================================================
> Test : new-schbench-requests-per-second
> Units : Normalized Requests per second
> Interpretation: Higher is better
> Statistic : Median
> ==================================================================
> #workers: tip[pct imp](CV) allow_imb[pct imp](CV)
> 1 1.00 [ 0.00]( 1.07) 1.00 [ 0.29]( 0.61)
> 2 1.00 [ 0.00]( 0.00) 1.00 [ 0.00]( 0.26)
> 4 1.00 [ 0.00]( 0.00) 1.00 [ -0.29]( 0.00)
> 8 1.00 [ 0.00]( 0.15) 1.00 [ 0.29]( 0.15)
> 16 1.00 [ 0.00]( 0.00) 1.00 [ 0.00]( 0.00)
> 32 1.00 [ 0.00]( 3.41) 0.97 [ -2.86]( 2.91)
> 64 1.00 [ 0.00]( 1.05) 0.97 [ -3.17]( 7.39)
> 128 1.00 [ 0.00]( 0.00) 1.00 [ -0.38]( 0.39)
> 256 1.00 [ 0.00]( 0.72) 1.01 [ 0.61]( 0.96)
> 512 1.00 [ 0.00]( 0.57) 1.01 [ 0.72]( 0.21)
>
>
> ==================================================================
> Test : new-schbench-wakeup-latency
> Units : Normalized 99th percentile latency in us
> Interpretation: Lower is better
> Statistic : Median
> ==================================================================
> #workers: tip[pct imp](CV) allow_imb[pct imp](CV)
> 1 1.00 [ -0.00]( 9.11) 0.69 [ 31.25]( 8.13)
> 2 1.00 [ -0.00]( 0.00) 0.93 [ 7.14]( 8.37)
> 4 1.00 [ -0.00]( 3.78) 1.07 [ -7.14](14.79)
> 8 1.00 [ -0.00]( 0.00) 1.08 [ -8.33]( 7.56)
> 16 1.00 [ -0.00]( 7.56) 1.08 [ -7.69](34.36)
> 32 1.00 [ -0.00](15.11) 1.00 [ -0.00](12.99)
> 64 1.00 [ -0.00]( 9.63) 0.80 [ 20.00](11.17)
> 128 1.00 [ -0.00]( 4.86) 0.98 [ 2.01](13.01)
> 256 1.00 [ -0.00]( 2.34) 1.01 [ -1.00]( 3.51)
> 512 1.00 [ -0.00]( 0.40) 1.00 [ 0.38]( 0.20)
>
>
> ==================================================================
> Test : new-schbench-request-latency
> Units : Normalized 99th percentile latency in us
> Interpretation: Lower is better
> Statistic : Median
> ==================================================================
> #workers: tip[pct imp](CV) allow_imb[pct imp](CV)
> 1 1.00 [ -0.00]( 2.73) 0.98 [ 2.08]( 3.51)
> 2 1.00 [ -0.00]( 0.87) 0.99 [ 0.54]( 3.29)
> 4 1.00 [ -0.00]( 1.21) 1.06 [ -5.92]( 0.82)
> 8 1.00 [ -0.00]( 0.27) 1.03 [ -3.15]( 1.86)
> 16 1.00 [ -0.00]( 4.04) 1.00 [ -0.27]( 2.27)
> 32 1.00 [ -0.00]( 7.35) 1.30 [-30.45](20.57)
> 64 1.00 [ -0.00]( 3.54) 1.01 [ -0.67]( 0.82)
> 128 1.00 [ -0.00]( 0.37) 1.00 [ 0.21]( 0.18)
> 256 1.00 [ -0.00]( 9.57) 0.99 [ 1.43]( 7.69)
> 512 1.00 [ -0.00]( 1.82) 1.02 [ -2.10]( 0.89)
>
>
> ==================================================================
> Test : Various longer running benchmarks
> Units : %diff in throughput reported
> Interpretation: Higher is better
> Statistic : Median
> ==================================================================
> Benchmarks: %diff
> ycsb-cassandra 0.07%
> ycsb-mongodb -0.66%
>
> deathstarbench-1x 0.36%
> deathstarbench-2x 2.39%
> deathstarbench-3x -0.09%
> deathstarbench-6x 1.53%
>
> hammerdb+mysql 16VU -0.27%
> hammerdb+mysql 64VU -0.32%
>
> ---
>
> I cannot make a hard case for this optimization. You can perhaps
> share your iperf numbers if you are seeing significant
> improvements there.
>
* Re: [PATCH] sched/fair: allow imbalance between LLCs under NUMA
2025-06-19 6:08 ` Jianyong Wu
@ 2025-06-19 6:30 ` K Prateek Nayak
2025-06-19 6:59 ` Jianyong Wu
0 siblings, 1 reply; 11+ messages in thread
From: K Prateek Nayak @ 2025-06-19 6:30 UTC (permalink / raw)
To: Jianyong Wu, Jianyong Wu, mingo, peterz, juri.lelli,
vincent.guittot
Cc: dietmar.eggemann, rostedt, bsegall, mgorman, vschneid,
linux-kernel
Hello Jianyong,
On 6/19/2025 11:38 AM, Jianyong Wu wrote:
> If this change has bad effect for performance, is there any suggestion
> to mitigate the iperf migration issue?
How big of a performance difference are you seeing? I still don't see
any numbers from your testing on the thread.
> Or just leave it there?
Ideally, the cache-aware load balancing series [1] should be able to
address these concerns. I suggest testing iperf with those changes and
checking if that solves the issues of excessive migration.
[1] https://lore.kernel.org/lkml/cover.1750268218.git.tim.c.chen@linux.intel.com/
--
Thanks and Regards,
Prateek
* Re: [PATCH] sched/fair: allow imbalance between LLCs under NUMA
2025-06-19 6:30 ` K Prateek Nayak
@ 2025-06-19 6:59 ` Jianyong Wu
0 siblings, 0 replies; 11+ messages in thread
From: Jianyong Wu @ 2025-06-19 6:59 UTC (permalink / raw)
To: K Prateek Nayak, Jianyong Wu, mingo, peterz, juri.lelli,
vincent.guittot
Cc: dietmar.eggemann, rostedt, bsegall, mgorman, vschneid,
linux-kernel
Hi Prateek,
On 6/19/2025 2:30 PM, K Prateek Nayak wrote:
> Hello Jianyong,
>
> On 6/19/2025 11:38 AM, Jianyong Wu wrote:
>> If this change has bad effect for performance, is there any suggestion
>> to mitigate the iperf migration issue?
>
> How big of a performance difference are you seeing? I still don't see
> any numbers from your testing on the thread.
Sorry for that. Here is the data.
On a machine with 8 NUMA nodes, each with 4 LLCs, 128 cores in total
with SMT2.
Test command:
server: iperf3 -s
client: iperf3 -c 127.0.0.1 -t 100 -i 2
==================================================
default              allow imb
25.3 Gbits/sec       26.7 Gbits/sec   (+5.5%)
==================================================
>
>> Or just leave it there?
>
> Ideally, the cache-aware load balancing series [1] should be able to
> address these concerns. I suggest testing iperf with those changes and
> checking if that solves the issues of excessive migration.
>
> [1] https://lore.kernel.org/lkml/
> cover.1750268218.git.tim.c.chen@linux.intel.com/
>
I know this patch set. Maybe a little heavy. I'll check it with the
iperf test.
Thanks
Jianyong