* [PATCH v4 0/2] sched/fair: Wake task within the cluster when possible
@ 2022-06-09 12:06 Yicong Yang
2022-06-09 12:06 ` [PATCH v4 1/2] sched: Add per_cpu cluster domain info and cpus_share_resources API Yicong Yang
2022-06-09 12:06 ` [PATCH v4 2/2] sched/fair: Scan cluster before scanning LLC in wake-up path Yicong Yang
0 siblings, 2 replies; 23+ messages in thread
From: Yicong Yang @ 2022-06-09 12:06 UTC (permalink / raw)
To: peterz, mingo, juri.lelli, vincent.guittot, tim.c.chen,
gautham.shenoy, linux-kernel, linux-arm-kernel
Cc: dietmar.eggemann, rostedt, bsegall, bristot, prime.zeng,
yangyicong, jonathan.cameron, ego, srikar, linuxarm, 21cnbao,
guodong.xu, hesham.almatary, john.garry, shenyang39
This is the follow-up work to support the cluster scheduler. Previously
we added a cluster level to the scheduler for both ARM64 [1] and
x86 [2], so that load balancing between clusters can exploit more
memory bandwidth and reduce cache contention. This patchset, on the
other hand, takes care of the wake-up path by trying CPUs within the
same cluster first, before scanning the whole LLC, to benefit tasks
that communicate with each other.
[1] 778c558f49a2 ("sched: Add cluster scheduler level in core and related Kconfig for ARM64")
[2] 66558b730f25 ("sched: Add cluster scheduler level for x86")
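To make the intent concrete, the wake-up side change can be sketched as a
small standalone C program. This is a simplified illustration only: the toy
topology, the cluster_of() helper and the select_idle_cpu() body below are
placeholders for this sketch and not the actual kernel code in patch 2/2.

/*
 * Toy model of the scan order this series introduces: when picking an
 * idle CPU for a waking task, CPUs sharing the target's cluster are
 * tried before the remaining CPUs of the LLC.
 */
#include <stdbool.h>
#include <stdio.h>

#define NR_CPUS 8

/* Assume CPUs 0-3 and 4-7 form two clusters inside a single LLC. */
static int cluster_of(int cpu) { return cpu / 4; }
static bool cpu_idle[NR_CPUS];

static int select_idle_cpu(int target)
{
	int cpu;

	/* 1. Prefer an idle CPU in the same cluster as the target. */
	for (cpu = 0; cpu < NR_CPUS; cpu++)
		if (cluster_of(cpu) == cluster_of(target) && cpu_idle[cpu])
			return cpu;

	/* 2. Fall back to the rest of the LLC. */
	for (cpu = 0; cpu < NR_CPUS; cpu++)
		if (cpu_idle[cpu])
			return cpu;

	return -1;	/* nothing idle; the caller falls back further */
}

int main(void)
{
	cpu_idle[2] = true;
	cpu_idle[6] = true;

	/* A task waking near CPU 1 lands on CPU 2 (same cluster) rather
	 * than CPU 6, although both are idle within the LLC. */
	printf("chosen CPU: %d\n", select_idle_cpu(1));
	return 0;
}

With CPUs 2 and 6 both idle, the sketch picks CPU 2 for a target of CPU 1,
which is the behaviour the series aims for: waker and wakee end up sharing
the cluster-level cache when possible.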
Changes since v3:
- fix a compile error when !CONFIG_SCHED_CLUSTER, reported by the lkp test robot.
Link: https://lore.kernel.org/lkml/20220608095758.60504-1-yangyicong@hisilicon.com/
Changes since v2:
- leverage SIS_PROP to suspend redundant scanning when LLC is overloaded
- remove the ping-pong suppression
- address the comment from Tim, thanks.
Link: https://lore.kernel.org/lkml/20220126080947.4529-1-yangyicong@hisilicon.com/
Changes since v1:
- re-collect the performance data based on v5.17-rc1
- rename cpus_share_cluster to cpus_share_resources per Vincent and Gautham, thanks!
Link: https://lore.kernel.org/lkml/20211215041149.73171-1-yangyicong@hisilicon.com/
Barry Song (2):
sched: Add per_cpu cluster domain info and cpus_share_resources API
sched/fair: Scan cluster before scanning LLC in wake-up path
include/linux/sched/sd_flags.h | 7 ++++++
include/linux/sched/topology.h | 8 ++++++-
kernel/sched/core.c | 12 ++++++++++
kernel/sched/fair.c | 44 +++++++++++++++++++++++++++++++---
kernel/sched/sched.h | 2 ++
kernel/sched/topology.c | 15 ++++++++++++
6 files changed, 84 insertions(+), 4 deletions(-)
--
2.24.0
^ permalink raw reply [flat|nested] 23+ messages in thread* [PATCH v4 1/2] sched: Add per_cpu cluster domain info and cpus_share_resources API 2022-06-09 12:06 [PATCH v4 0/2] sched/fair: Wake task within the cluster when possible Yicong Yang @ 2022-06-09 12:06 ` Yicong Yang 2022-06-09 22:28 ` Tim Chen 2022-06-15 14:19 ` K Prateek Nayak 2022-06-09 12:06 ` [PATCH v4 2/2] sched/fair: Scan cluster before scanning LLC in wake-up path Yicong Yang 1 sibling, 2 replies; 23+ messages in thread From: Yicong Yang @ 2022-06-09 12:06 UTC (permalink / raw) To: peterz, mingo, juri.lelli, vincent.guittot, tim.c.chen, gautham.shenoy, linux-kernel, linux-arm-kernel Cc: dietmar.eggemann, rostedt, bsegall, bristot, prime.zeng, yangyicong, jonathan.cameron, ego, srikar, linuxarm, 21cnbao, guodong.xu, hesham.almatary, john.garry, shenyang39 From: Barry Song <song.bao.hua@hisilicon.com> Add per-cpu cluster domain info and cpus_share_resources() API. This is the preparation for the optimization of select_idle_cpu() on platforms with cluster scheduler level. Signed-off-by: Barry Song <song.bao.hua@hisilicon.com> Signed-off-by: Yicong Yang <yangyicong@hisilicon.com> Reviewed-by: Gautham R. Shenoy <gautham.shenoy@amd.com> --- include/linux/sched/sd_flags.h | 7 +++++++ include/linux/sched/topology.h | 8 +++++++- kernel/sched/core.c | 12 ++++++++++++ kernel/sched/sched.h | 2 ++ kernel/sched/topology.c | 15 +++++++++++++++ 5 files changed, 43 insertions(+), 1 deletion(-) diff --git a/include/linux/sched/sd_flags.h b/include/linux/sched/sd_flags.h index 57bde66d95f7..42ed454e8b18 100644 --- a/include/linux/sched/sd_flags.h +++ b/include/linux/sched/sd_flags.h @@ -109,6 +109,13 @@ SD_FLAG(SD_ASYM_CPUCAPACITY_FULL, SDF_SHARED_PARENT | SDF_NEEDS_GROUPS) */ SD_FLAG(SD_SHARE_CPUCAPACITY, SDF_SHARED_CHILD | SDF_NEEDS_GROUPS) +/* + * Domain members share CPU cluster (LLC tags or L2 cache) + * + * NEEDS_GROUPS: Clusters are shared between groups. + */ +SD_FLAG(SD_CLUSTER, SDF_NEEDS_GROUPS) + /* * Domain members share CPU package resources (i.e. caches) * diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h index 56cffe42abbc..df489a1db6b7 100644 --- a/include/linux/sched/topology.h +++ b/include/linux/sched/topology.h @@ -45,7 +45,7 @@ static inline int cpu_smt_flags(void) #ifdef CONFIG_SCHED_CLUSTER static inline int cpu_cluster_flags(void) { - return SD_SHARE_PKG_RESOURCES; + return SD_CLUSTER | SD_SHARE_PKG_RESOURCES; } #endif @@ -178,6 +178,7 @@ cpumask_var_t *alloc_sched_domains(unsigned int ndoms); void free_sched_domains(cpumask_var_t doms[], unsigned int ndoms); bool cpus_share_cache(int this_cpu, int that_cpu); +bool cpus_share_resources(int this_cpu, int that_cpu); typedef const struct cpumask *(*sched_domain_mask_f)(int cpu); typedef int (*sched_domain_flags_f)(void); @@ -231,6 +232,11 @@ static inline bool cpus_share_cache(int this_cpu, int that_cpu) return true; } +static inline bool cpus_share_resources(int this_cpu, int that_cpu) +{ + return true; +} + #endif /* !CONFIG_SMP */ #if defined(CONFIG_ENERGY_MODEL) && defined(CONFIG_CPU_FREQ_GOV_SCHEDUTIL) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index bfa7452ca92e..79a6f012b0cf 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -3808,6 +3808,18 @@ bool cpus_share_cache(int this_cpu, int that_cpu) return per_cpu(sd_llc_id, this_cpu) == per_cpu(sd_llc_id, that_cpu); } +/* + * Whether CPUs are share cache resources, which means LLC on non-cluster + * machines and LLC tag or L2 on machines with clusters. 
+ */ +bool cpus_share_resources(int this_cpu, int that_cpu) +{ + if (this_cpu == that_cpu) + return true; + + return per_cpu(sd_share_id, this_cpu) == per_cpu(sd_share_id, that_cpu); +} + static inline bool ttwu_queue_cond(int cpu, int wake_flags) { /* diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 01259611beb9..b9bcfcf8d14d 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -1753,7 +1753,9 @@ static inline struct sched_domain *lowest_flag_domain(int cpu, int flag) DECLARE_PER_CPU(struct sched_domain __rcu *, sd_llc); DECLARE_PER_CPU(int, sd_llc_size); DECLARE_PER_CPU(int, sd_llc_id); +DECLARE_PER_CPU(int, sd_share_id); DECLARE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared); +DECLARE_PER_CPU(struct sched_domain __rcu *, sd_cluster); DECLARE_PER_CPU(struct sched_domain __rcu *, sd_numa); DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing); DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_cpucapacity); diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c index 05b6c2ad90b9..0595827d481d 100644 --- a/kernel/sched/topology.c +++ b/kernel/sched/topology.c @@ -664,6 +664,8 @@ static void destroy_sched_domains(struct sched_domain *sd) DEFINE_PER_CPU(struct sched_domain __rcu *, sd_llc); DEFINE_PER_CPU(int, sd_llc_size); DEFINE_PER_CPU(int, sd_llc_id); +DEFINE_PER_CPU(int, sd_share_id); +DEFINE_PER_CPU(struct sched_domain __rcu *, sd_cluster); DEFINE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared); DEFINE_PER_CPU(struct sched_domain __rcu *, sd_numa); DEFINE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing); @@ -689,6 +691,18 @@ static void update_top_cache_domain(int cpu) per_cpu(sd_llc_id, cpu) = id; rcu_assign_pointer(per_cpu(sd_llc_shared, cpu), sds); + sd = lowest_flag_domain(cpu, SD_CLUSTER); + if (sd) + id = cpumask_first(sched_domain_span(sd)); + rcu_assign_pointer(per_cpu(sd_cluster, cpu), sd); + + /* + * This assignment should be placed after the sd_llc_id as + * we want this id equals to cluster id on cluster machines + * but equals to LLC id on non-Cluster machines. + */ + per_cpu(sd_share_id, cpu) = id; + sd = lowest_flag_domain(cpu, SD_NUMA); rcu_assign_pointer(per_cpu(sd_numa, cpu), sd); @@ -1532,6 +1546,7 @@ static struct cpumask ***sched_domains_numa_masks; */ #define TOPOLOGY_SD_FLAGS \ (SD_SHARE_CPUCAPACITY | \ + SD_CLUSTER | \ SD_SHARE_PKG_RESOURCES | \ SD_NUMA | \ SD_ASYM_PACKING) -- 2.24.0 _______________________________________________ linux-arm-kernel mailing list linux-arm-kernel@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-arm-kernel ^ permalink raw reply related [flat|nested] 23+ messages in thread
* Re: [PATCH v4 1/2] sched: Add per_cpu cluster domain info and cpus_share_resources API 2022-06-09 12:06 ` [PATCH v4 1/2] sched: Add per_cpu cluster domain info and cpus_share_resources API Yicong Yang @ 2022-06-09 22:28 ` Tim Chen 2022-06-10 6:54 ` Yicong Yang 2022-06-15 14:19 ` K Prateek Nayak 1 sibling, 1 reply; 23+ messages in thread From: Tim Chen @ 2022-06-09 22:28 UTC (permalink / raw) To: Yicong Yang, peterz, mingo, juri.lelli, vincent.guittot, gautham.shenoy, linux-kernel, linux-arm-kernel Cc: dietmar.eggemann, rostedt, bsegall, bristot, prime.zeng, jonathan.cameron, ego, srikar, linuxarm, 21cnbao, guodong.xu, hesham.almatary, john.garry, shenyang39 On Thu, 2022-06-09 at 20:06 +0800, Yicong Yang wrote: > > > +/* > + * Whether CPUs are share cache resources, which means LLC on non-cluster > + * machines and LLC tag or L2 on machines with clusters. > + */ > +bool cpus_share_resources(int this_cpu, int that_cpu) Suggest cpus_share_lowest_cache to be a bit more informative > +{ > + if (this_cpu == that_cpu) > + return true; > + > + return per_cpu(sd_share_id, this_cpu) == per_cpu(sd_share_id, that_cpu); > +} > + > static inline bool ttwu_queue_cond(int cpu, int wake_flags) > { > /* > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h > index 01259611beb9..b9bcfcf8d14d 100644 > --- a/kernel/sched/sched.h > +++ b/kernel/sched/sched.h > @@ -1753,7 +1753,9 @@ static inline struct sched_domain *lowest_flag_domain(int cpu, int flag) > DECLARE_PER_CPU(struct sched_domain __rcu *, sd_llc); > DECLARE_PER_CPU(int, sd_llc_size); > DECLARE_PER_CPU(int, sd_llc_id); > +DECLARE_PER_CPU(int, sd_share_id); > DECLARE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared); > +DECLARE_PER_CPU(struct sched_domain __rcu *, sd_cluster); > DECLARE_PER_CPU(struct sched_domain __rcu *, sd_numa); > DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing); > DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_cpucapacity); > diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c > index 05b6c2ad90b9..0595827d481d 100644 > --- a/kernel/sched/topology.c > +++ b/kernel/sched/topology.c > @@ -664,6 +664,8 @@ static void destroy_sched_domains(struct sched_domain *sd) > DEFINE_PER_CPU(struct sched_domain __rcu *, sd_llc); > DEFINE_PER_CPU(int, sd_llc_size); > DEFINE_PER_CPU(int, sd_llc_id); > +DEFINE_PER_CPU(int, sd_share_id); Some minor nits about the name of "sd_share_id". It is not quite obvious what it is. Maybe something like sd_lowest_cache_id to denote it is the id of lowest shared cache domain between CPU. Otherwise the patch looks good to me. You can add Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com> Tim _______________________________________________ linux-arm-kernel mailing list linux-arm-kernel@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-arm-kernel ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH v4 1/2] sched: Add per_cpu cluster domain info and cpus_share_resources API 2022-06-09 22:28 ` Tim Chen @ 2022-06-10 6:54 ` Yicong Yang 0 siblings, 0 replies; 23+ messages in thread From: Yicong Yang @ 2022-06-10 6:54 UTC (permalink / raw) To: Tim Chen, Yicong Yang, peterz, mingo, juri.lelli, vincent.guittot, gautham.shenoy, linux-kernel, linux-arm-kernel Cc: dietmar.eggemann, rostedt, bsegall, bristot, prime.zeng, jonathan.cameron, ego, srikar, linuxarm, 21cnbao, guodong.xu, hesham.almatary, john.garry, shenyang39 On 2022/6/10 6:28, Tim Chen wrote: > On Thu, 2022-06-09 at 20:06 +0800, Yicong Yang wrote: >> >> >> +/* >> + * Whether CPUs are share cache resources, which means LLC on non-cluster >> + * machines and LLC tag or L2 on machines with clusters. >> + */ >> +bool cpus_share_resources(int this_cpu, int that_cpu) > > Suggest cpus_share_lowest_cache to be a bit more informative > >> +{ >> + if (this_cpu == that_cpu) >> + return true; >> + >> + return per_cpu(sd_share_id, this_cpu) == per_cpu(sd_share_id, that_cpu); >> +} >> + >> static inline bool ttwu_queue_cond(int cpu, int wake_flags) >> { >> /* >> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h >> index 01259611beb9..b9bcfcf8d14d 100644 >> --- a/kernel/sched/sched.h >> +++ b/kernel/sched/sched.h >> @@ -1753,7 +1753,9 @@ static inline struct sched_domain *lowest_flag_domain(int cpu, int flag) >> DECLARE_PER_CPU(struct sched_domain __rcu *, sd_llc); >> DECLARE_PER_CPU(int, sd_llc_size); >> DECLARE_PER_CPU(int, sd_llc_id); >> +DECLARE_PER_CPU(int, sd_share_id); >> DECLARE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared); >> +DECLARE_PER_CPU(struct sched_domain __rcu *, sd_cluster); >> DECLARE_PER_CPU(struct sched_domain __rcu *, sd_numa); >> DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing); >> DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_cpucapacity); >> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c >> index 05b6c2ad90b9..0595827d481d 100644 >> --- a/kernel/sched/topology.c >> +++ b/kernel/sched/topology.c >> @@ -664,6 +664,8 @@ static void destroy_sched_domains(struct sched_domain *sd) >> DEFINE_PER_CPU(struct sched_domain __rcu *, sd_llc); >> DEFINE_PER_CPU(int, sd_llc_size); >> DEFINE_PER_CPU(int, sd_llc_id); >> +DEFINE_PER_CPU(int, sd_share_id); > > Some minor nits about the name of "sd_share_id". > It is not quite obvious what it is. > > Maybe something like sd_lowest_cache_id to denote > it is the id of lowest shared cache domain between CPU. > Thanks for the suggestion! Since Vincent and Gautham have suggested this in v1 [1], I'd like to wait a bit before changing to this to see if they have any comment. > Otherwise the patch looks good to me. You can add > > Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com> > Thanks. [1] https://lore.kernel.org/lkml/CAKfTPtBKLDyNPXg7uLbQ3jUnEwppfC+E29=oJ1tWzzqHsNpApw@mail.gmail.com/ _______________________________________________ linux-arm-kernel mailing list linux-arm-kernel@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-arm-kernel ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH v4 1/2] sched: Add per_cpu cluster domain info and cpus_share_resources API 2022-06-09 12:06 ` [PATCH v4 1/2] sched: Add per_cpu cluster domain info and cpus_share_resources API Yicong Yang 2022-06-09 22:28 ` Tim Chen @ 2022-06-15 14:19 ` K Prateek Nayak 2022-06-15 15:43 ` Gautham R. Shenoy 2022-06-16 7:55 ` Yicong Yang 1 sibling, 2 replies; 23+ messages in thread From: K Prateek Nayak @ 2022-06-15 14:19 UTC (permalink / raw) To: Yicong Yang, peterz, mingo, juri.lelli, vincent.guittot, tim.c.chen, gautham.shenoy, linux-kernel, linux-arm-kernel Cc: dietmar.eggemann, rostedt, bsegall, bristot, prime.zeng, jonathan.cameron, ego, srikar, linuxarm, 21cnbao, guodong.xu, hesham.almatary, john.garry, shenyang39 Hello Yicong, I replied to the v3 of the series by mistake! (https://lore.kernel.org/lkml/0b065646-5b05-cbdb-b20c-e1dfef3f4d79@amd.com/) But rest assured all the analysis discussed there was done with the v4 patch series. I'll add the same observations below so we can continue discussion on v4. We are observing some serious regression with tbench with this patch series applied. The issue doesn't seem to be related to the actual functionality of the patch but how the patch changes the per CPU variable layout. Discussed below are the results from running tbench on a dual socket Zen3 (2 x 64C/128T) configured in different NPS modes. NPS Modes are used to logically divide single socket into multiple NUMA region. Following is the NUMA configuration for each NPS mode on the system: NPS1: Each socket is a NUMA node. Total 2 NUMA nodes in the dual socket machine. Node 0: 0-63, 128-191 Node 1: 64-127, 192-255 NPS2: Each socket is further logically divided into 2 NUMA regions. Total 4 NUMA nodes exist over 2 socket. Node 0: 0-31, 128-159 Node 1: 32-63, 160-191 Node 2: 64-95, 192-223 Node 3: 96-127, 223-255 NPS4: Each socket is logically divided into 4 NUMA regions. Total 8 NUMA nodes exist over 2 socket. 
Node 0: 0-15, 128-143 Node 1: 16-31, 144-159 Node 2: 32-47, 160-175 Node 3: 48-63, 176-191 Node 4: 64-79, 192-207 Node 5: 80-95, 208-223 Node 6: 96-111, 223-231 Node 7: 112-127, 232-255 Benchmark Results: Kernel versions: - tip: 5.19.0-rc2 tip sched/core - cluster: 5.19.0-rc2 tip sched/core + both the patches of the series When we started testing, the tip was at: commit: f3dd3f674555 "sched: Remove the limitation of WF_ON_CPU on wakelist if wakee cpu is idle" * - Data points of concern ~~~~~~ tbench ~~~~~~ NPS1 Clients: tip cluster 1 444.41 (0.00 pct) 439.27 (-1.15 pct) 2 879.23 (0.00 pct) 831.49 (-5.42 pct) * 4 1648.83 (0.00 pct) 1608.07 (-2.47 pct) 8 3263.81 (0.00 pct) 3086.81 (-5.42 pct) * 16 6011.19 (0.00 pct) 5360.28 (-10.82 pct) * 32 12058.31 (0.00 pct) 8769.08 (-27.27 pct) * 64 21258.21 (0.00 pct) 19021.09 (-10.52 pct) * 128 30795.27 (0.00 pct) 30861.34 (0.21 pct) 256 25138.43 (0.00 pct) 24711.90 (-1.69 pct) 512 51287.93 (0.00 pct) 51855.55 (1.10 pct) 1024 53176.97 (0.00 pct) 52554.55 (-1.17 pct) NPS2 Clients: tip cluster 1 445.45 (0.00 pct) 441.75 (-0.83 pct) 2 869.24 (0.00 pct) 845.61 (-2.71 pct) 4 1644.28 (0.00 pct) 1586.49 (-3.51 pct) 8 3120.83 (0.00 pct) 2967.01 (-4.92 pct) * 16 5972.29 (0.00 pct) 5208.58 (-12.78 pct) * 32 11776.38 (0.00 pct) 10229.53 (-13.13 pct) * 64 20933.15 (0.00 pct) 17033.45 (-18.62 pct) * 128 32195.00 (0.00 pct) 29507.85 (-8.34 pct) * 256 24641.52 (0.00 pct) 27225.00 (10.48 pct) 512 50806.96 (0.00 pct) 51377.50 (1.12 pct) 1024 51993.96 (0.00 pct) 50773.35 (-2.34 pct) NPS4 Clients: tip cluster 1 442.10 (0.00 pct) 435.06 (-1.59 pct) 2 870.94 (0.00 pct) 858.64 (-1.41 pct) 4 1615.30 (0.00 pct) 1607.27 (-0.49 pct) 8 3195.95 (0.00 pct) 3020.63 (-5.48 pct) * 16 5937.41 (0.00 pct) 5719.87 (-3.66 pct) 32 11800.41 (0.00 pct) 11229.65 (-4.83 pct) * 64 20844.71 (0.00 pct) 20432.79 (-1.97 pct) 128 31003.62 (0.00 pct) 29441.20 (-5.03 pct) * 256 27476.37 (0.00 pct) 25857.30 (-5.89 pct) * [Know to have run to run variance] 512 52276.72 (0.00 pct) 51659.16 (-1.18 pct) 1024 51372.10 (0.00 pct) 51026.87 (-0.67 pct) Note: tbench results for 256 workers are known to have run to run variation on the test machine. Any regression seen for the data point can be safely ignored. The behavior is consistent for both tip and patched kernel across multiple runs of tbench. ~~~~~~~~~~~~~~~~~~~~ Analysis done so far ~~~~~~~~~~~~~~~~~~~~ To root cause this issue quicker, we have focused on 8 to 64 clients data points with the machine running in NPS1 mode. - Even on disabling HW prefetcher, the behavior remains consistent signifying HW prefetcher is not the cause of the problem. - Bisecting: When we ran the tests with only Patch 1 of the series, the regression was visible and the numbers were worse. Clients: tip cluster Patch 1 Only 8 3263.81 (0.00 pct) 3086.81 (-5.42 pct) 3018.63 (-7.51 pct) 16 6011.19 (0.00 pct) 5360.28 (-10.82 pct) 4869.26 (-18.99 pct) 32 12058.31 (0.00 pct) 8769.08 (-27.27 pct) 8159.60 (-32.33 pct) 64 21258.21 (0.00 pct) 19021.09 (-10.52 pct) 13161.92 (-38.08 pct) We further bisected the hunks to narrow down the cause to the per CPU variable declarations. On 6/9/2022 5:36 PM, Yicong Yang wrote: > > [..snip..] 
> > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h > index 01259611beb9..b9bcfcf8d14d 100644 > --- a/kernel/sched/sched.h > +++ b/kernel/sched/sched.h > @@ -1753,7 +1753,9 @@ static inline struct sched_domain *lowest_flag_domain(int cpu, int flag) > DECLARE_PER_CPU(struct sched_domain __rcu *, sd_llc); > DECLARE_PER_CPU(int, sd_llc_size); > DECLARE_PER_CPU(int, sd_llc_id); > +DECLARE_PER_CPU(int, sd_share_id); > DECLARE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared); > +DECLARE_PER_CPU(struct sched_domain __rcu *, sd_cluster); The main reason for the regression seems to be the above declarations. The regression seem to go away if we do one of the following: - Declare sd_share_id and sd_cluster using DECLARE_PER_CPU_READ_MOSTLY() instead of DECLARE_PER_CPU() and change the corresponding definition below to DEFINE_PER_CPU_READ_MOSTLY(). Clients: tip Patch 1 Patch 1 (READ_MOSTLY) 8 3255.69 (0.00 pct) 3018.63 (-7.28 pct) 3237.33 (-0.56 pct) 16 6092.67 (0.00 pct) 4869.26 (-20.08 pct) 5914.53 (-2.92 pct) 32 11156.56 (0.00 pct) 8159.60 (-26.86 pct) 11536.05 (3.40 pct) 64 21019.97 (0.00 pct) 13161.92 (-37.38 pct) 21162.33 (0.67 pct) - Convert sd_share_id and sd_cluster to static arrays. Clients: tip Patch 1 Patch 1 (Static Array) 8 3255.69 (0.00 pct) 3018.63 (-7.28 pct) 3203.27 (-1.61 pct) 16 6092.67 (0.00 pct) 4869.26 (-20.08 pct) 6198.35 (1.73 pct) 32 11156.56 (0.00 pct) 8159.60 (-26.86 pct) 11385.76 (2.05 pct) 64 21019.97 (0.00 pct) 13161.92 (-37.38 pct) 21919.80 (4.28 pct) - Move the declarations of sd_share_id and sd_cluster to the top Clients: tip Patch 1 Patch 1 (Declarion on Top) 8 3255.69 (0.00 pct) 3018.63 (-7.28 pct) 3072.30 (-5.63 pct) 16 6092.67 (0.00 pct) 4869.26 (-20.08 pct) 5586.59 (-8.30 pct) 32 11156.56 (0.00 pct) 8159.60 (-26.86 pct) 11184.17 (0.24 pct) 64 21019.97 (0.00 pct) 13161.92 (-37.38 pct) 20289.70 (-3.47 pct) Unfortunately, none of these are complete solutions. For example, using DECLARE_PER_CPU_READ_MOSTLY() with both patches applied reduces the regression but doesn't eliminate it entirely: Clients: tip cluster cluster (READ_MOSTLY) 1 444.41 (0.00 pct) 439.27 (-1.15 pct) 435.95 (-1.90 pct) 2 879.23 (0.00 pct) 831.49 (-5.42 pct) 842.09 (-4.22 pct) 4 1648.83 (0.00 pct) 1608.07 (-2.47 pct) 1598.77 (-3.03 pct) 8 3263.81 (0.00 pct) 3086.81 (-5.42 pct) 3090.86 (-5.29 pct) * 16 6011.19 (0.00 pct) 5360.28 (-10.82 pct) 5360.28 (-10.82 pct) * 32 12058.31 (0.00 pct) 8769.08 (-27.27 pct) 11083.66 (-8.08 pct) * 64 21258.21 (0.00 pct) 19021.09 (-10.52 pct) 20984.30 (-1.28 pct) 128 30795.27 (0.00 pct) 30861.34 (0.21 pct) 30735.20 (-0.19 pct) 256 25138.43 (0.00 pct) 24711.90 (-1.69 pct) 24021.21 (-4.44 pct) 512 51287.93 (0.00 pct) 51855.55 (1.10 pct) 51672.73 (0.75 pct) 1024 53176.97 (0.00 pct) 52554.55 (-1.17 pct) 52620.02 (-1.04 pct) We are still trying to root cause the underlying issue that brought about such drastic regression in tbench performance. 
> DECLARE_PER_CPU(struct sched_domain __rcu *, sd_numa); > DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing); > DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_cpucapacity); > diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c > index 05b6c2ad90b9..0595827d481d 100644 > --- a/kernel/sched/topology.c > +++ b/kernel/sched/topology.c > @@ -664,6 +664,8 @@ static void destroy_sched_domains(struct sched_domain *sd) > DEFINE_PER_CPU(struct sched_domain __rcu *, sd_llc); > DEFINE_PER_CPU(int, sd_llc_size); > DEFINE_PER_CPU(int, sd_llc_id); > +DEFINE_PER_CPU(int, sd_share_id); > +DEFINE_PER_CPU(struct sched_domain __rcu *, sd_cluster); > DEFINE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared); > DEFINE_PER_CPU(struct sched_domain __rcu *, sd_numa); > DEFINE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing); > > [..snip..] > We would like some time to investigate this issue and root cause the reason for this regression. -- Thanks and Regards, Prateek _______________________________________________ linux-arm-kernel mailing list linux-arm-kernel@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-arm-kernel ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH v4 1/2] sched: Add per_cpu cluster domain info and cpus_share_resources API 2022-06-15 14:19 ` K Prateek Nayak @ 2022-06-15 15:43 ` Gautham R. Shenoy 2022-06-16 8:10 ` Yicong Yang 2022-06-16 7:55 ` Yicong Yang 1 sibling, 1 reply; 23+ messages in thread From: Gautham R. Shenoy @ 2022-06-15 15:43 UTC (permalink / raw) To: K Prateek Nayak Cc: Yicong Yang, peterz, mingo, juri.lelli, vincent.guittot, tim.c.chen, linux-kernel, linux-arm-kernel, dietmar.eggemann, rostedt, bsegall, bristot, prime.zeng, jonathan.cameron, ego, srikar, linuxarm, 21cnbao, guodong.xu, hesham.almatary, john.garry, shenyang39 On Wed, Jun 15, 2022 at 07:49:22PM +0530, K Prateek Nayak wrote: [..snip..] > > - Bisecting: > > When we ran the tests with only Patch 1 of the series, the > regression was visible and the numbers were worse. > > Clients: tip cluster Patch 1 Only > 8 3263.81 (0.00 pct) 3086.81 (-5.42 pct) 3018.63 (-7.51 pct) > 16 6011.19 (0.00 pct) 5360.28 (-10.82 pct) 4869.26 (-18.99 pct) > 32 12058.31 (0.00 pct) 8769.08 (-27.27 pct) 8159.60 (-32.33 pct) > 64 21258.21 (0.00 pct) 19021.09 (-10.52 pct) 13161.92 (-38.08 pct) > > We further bisected the hunks to narrow down the cause to the per CPU > variable declarations. > > > > > > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h > > index 01259611beb9..b9bcfcf8d14d 100644 > > --- a/kernel/sched/sched.h > > +++ b/kernel/sched/sched.h > > @@ -1753,7 +1753,9 @@ static inline struct sched_domain *lowest_flag_domain(int cpu, int flag) > > DECLARE_PER_CPU(struct sched_domain __rcu *, sd_llc); > > DECLARE_PER_CPU(int, sd_llc_size); > > DECLARE_PER_CPU(int, sd_llc_id); > > +DECLARE_PER_CPU(int, sd_share_id); > > DECLARE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared); > > +DECLARE_PER_CPU(struct sched_domain __rcu *, sd_cluster); > > The main reason for the regression seems to be the above declarations. I think you meant that the regressions are due to the DEFINE_PER_CPU() instances from the following hunk: > > @@ -664,6 +664,8 @@ static void destroy_sched_domains(struct sched_domain *sd) > > DEFINE_PER_CPU(struct sched_domain __rcu *, sd_llc); > > DEFINE_PER_CPU(int, sd_llc_size); > > DEFINE_PER_CPU(int, sd_llc_id); > > +DEFINE_PER_CPU(int, sd_share_id); > > +DEFINE_PER_CPU(struct sched_domain __rcu *, sd_cluster); > > DEFINE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared); > > DEFINE_PER_CPU(struct sched_domain __rcu *, sd_numa); > > DEFINE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing); > > The System.map diff for these variables between tip vs tip + cluster-sched-v4 on your test system looks as follows: 0000000000020520 D sd_asym_packing 0000000000020528 D sd_numa -0000000000020530 D sd_llc_shared -0000000000020538 D sd_llc_id -000000000002053c D sd_llc_size -0000000000020540 D sd_llc +0000000000020530 D sd_cluster +0000000000020538 D sd_llc_shared +0000000000020540 D sd_share_id +0000000000020544 D sd_llc_id +0000000000020548 D sd_llc_size +0000000000020550 D sd_llc The allocations are in the reverse-order of the definitions. 
That perhaps explains why you no longer see the regression when you define the sd_share_id and sd_cluster per-cpu definitions at the beginning as indicated by the following > - Move the declarations of sd_share_id and sd_cluster to the top > > Clients: tip Patch 1 Patch 1 (Declarion on Top) > 8 3255.69 (0.00 pct) 3018.63 (-7.28 pct) 3072.30 (-5.63 pct) > 16 6092.67 (0.00 pct) 4869.26 (-20.08 pct) 5586.59 (-8.30 pct) > 32 11156.56 (0.00 pct) 8159.60 (-26.86 pct) 11184.17 (0.24 pct) > 64 21019.97 (0.00 pct) 13161.92 (-37.38 pct) 20289.70 (-3.47 pct) -- Thanks and Regards gautham. _______________________________________________ linux-arm-kernel mailing list linux-arm-kernel@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-arm-kernel ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH v4 1/2] sched: Add per_cpu cluster domain info and cpus_share_resources API 2022-06-15 15:43 ` Gautham R. Shenoy @ 2022-06-16 8:10 ` Yicong Yang 0 siblings, 0 replies; 23+ messages in thread From: Yicong Yang @ 2022-06-16 8:10 UTC (permalink / raw) To: Gautham R. Shenoy, K Prateek Nayak Cc: yangyicong, peterz, mingo, juri.lelli, vincent.guittot, tim.c.chen, linux-kernel, linux-arm-kernel, dietmar.eggemann, rostedt, bsegall, bristot, prime.zeng, jonathan.cameron, ego, srikar, linuxarm, 21cnbao, guodong.xu, hesham.almatary, john.garry, shenyang39 On 2022/6/15 23:43, Gautham R. Shenoy wrote: > On Wed, Jun 15, 2022 at 07:49:22PM +0530, K Prateek Nayak wrote: > > [..snip..] > >> >> - Bisecting: >> >> When we ran the tests with only Patch 1 of the series, the >> regression was visible and the numbers were worse. >> >> Clients: tip cluster Patch 1 Only >> 8 3263.81 (0.00 pct) 3086.81 (-5.42 pct) 3018.63 (-7.51 pct) >> 16 6011.19 (0.00 pct) 5360.28 (-10.82 pct) 4869.26 (-18.99 pct) >> 32 12058.31 (0.00 pct) 8769.08 (-27.27 pct) 8159.60 (-32.33 pct) >> 64 21258.21 (0.00 pct) 19021.09 (-10.52 pct) 13161.92 (-38.08 pct) >> >> We further bisected the hunks to narrow down the cause to the per CPU >> variable declarations. >> >> >>> >>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h >>> index 01259611beb9..b9bcfcf8d14d 100644 >>> --- a/kernel/sched/sched.h >>> +++ b/kernel/sched/sched.h >>> @@ -1753,7 +1753,9 @@ static inline struct sched_domain *lowest_flag_domain(int cpu, int flag) >>> DECLARE_PER_CPU(struct sched_domain __rcu *, sd_llc); >>> DECLARE_PER_CPU(int, sd_llc_size); >>> DECLARE_PER_CPU(int, sd_llc_id); >>> +DECLARE_PER_CPU(int, sd_share_id); >>> DECLARE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared); >>> +DECLARE_PER_CPU(struct sched_domain __rcu *, sd_cluster); >> >> The main reason for the regression seems to be the above declarations. > > I think you meant that the regressions are due to the DEFINE_PER_CPU() > instances from the following hunk: > >>> @@ -664,6 +664,8 @@ static void destroy_sched_domains(struct sched_domain *sd) >>> DEFINE_PER_CPU(struct sched_domain __rcu *, sd_llc); >>> DEFINE_PER_CPU(int, sd_llc_size); >>> DEFINE_PER_CPU(int, sd_llc_id); >>> +DEFINE_PER_CPU(int, sd_share_id); >>> +DEFINE_PER_CPU(struct sched_domain __rcu *, sd_cluster); >>> DEFINE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared); >>> DEFINE_PER_CPU(struct sched_domain __rcu *, sd_numa); >>> DEFINE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing); >>> > > > The System.map diff for these variables between tip vs tip + > cluster-sched-v4 on your test system looks as follows: > > 0000000000020520 D sd_asym_packing > 0000000000020528 D sd_numa > -0000000000020530 D sd_llc_shared > -0000000000020538 D sd_llc_id > -000000000002053c D sd_llc_size > -0000000000020540 D sd_llc > +0000000000020530 D sd_cluster > +0000000000020538 D sd_llc_shared looks like below are in another cacheline (for 64B cacheline)? while previous sd_llc_id and sd_llc_shared are in the same. > +0000000000020540 D sd_share_id > +0000000000020544 D sd_llc_id > +0000000000020548 D sd_llc_size > +0000000000020550 D sd_llc > > The allocations are in the reverse-order of the definitions. 
> > That perhaps explains why you no longer see the regression when you > define the sd_share_id and sd_cluster per-cpu definitions at the > beginning as indicated by the following > >> - Move the declarations of sd_share_id and sd_cluster to the top >> >> Clients: tip Patch 1 Patch 1 (Declarion on Top) >> 8 3255.69 (0.00 pct) 3018.63 (-7.28 pct) 3072.30 (-5.63 pct) >> 16 6092.67 (0.00 pct) 4869.26 (-20.08 pct) 5586.59 (-8.30 pct) >> 32 11156.56 (0.00 pct) 8159.60 (-26.86 pct) 11184.17 (0.24 pct) >> 64 21019.97 (0.00 pct) 13161.92 (-37.38 pct) 20289.70 (-3.47 pct) > > > -- > Thanks and Regards > gautham. > . > _______________________________________________ linux-arm-kernel mailing list linux-arm-kernel@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-arm-kernel ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH v4 1/2] sched: Add per_cpu cluster domain info and cpus_share_resources API 2022-06-15 14:19 ` K Prateek Nayak 2022-06-15 15:43 ` Gautham R. Shenoy @ 2022-06-16 7:55 ` Yicong Yang [not found] ` <6bf4f032-7d07-d4a4-4f5a-28f3871131c0@amd.com> 1 sibling, 1 reply; 23+ messages in thread From: Yicong Yang @ 2022-06-16 7:55 UTC (permalink / raw) To: K Prateek Nayak, Yicong Yang, peterz, mingo, juri.lelli, vincent.guittot, tim.c.chen, gautham.shenoy, linux-kernel, linux-arm-kernel Cc: dietmar.eggemann, rostedt, bsegall, bristot, prime.zeng, jonathan.cameron, ego, srikar, linuxarm, 21cnbao, guodong.xu, hesham.almatary, john.garry, shenyang39 Hi Prateek, Seems my previous reply is in the wrong format so the server rejected it..just repost.. Thanks a lot for the test! On 2022/6/15 22:19, K Prateek Nayak wrote: > Hello Yicong, > > I replied to the v3 of the series by mistake! > (https://lore.kernel.org/lkml/0b065646-5b05-cbdb-b20c-e1dfef3f4d79@amd.com/) > But rest assured all the analysis discussed there was done with > the v4 patch series. I'll add the same observations below so we > can continue discussion on v4. > > We are observing some serious regression with tbench with this patch > series applied. The issue doesn't seem to be related to the actual > functionality of the patch but how the patch changes the per CPU > variable layout. > > Discussed below are the results from running tbench on a dual > socket Zen3 (2 x 64C/128T) configured in different NPS modes. > > NPS Modes are used to logically divide single socket into > multiple NUMA region. > Following is the NUMA configuration for each NPS mode on the system: > > NPS1: Each socket is a NUMA node. > Total 2 NUMA nodes in the dual socket machine. > > Node 0: 0-63, 128-191 > Node 1: 64-127, 192-255 > > NPS2: Each socket is further logically divided into 2 NUMA regions. > Total 4 NUMA nodes exist over 2 socket. > > Node 0: 0-31, 128-159 > Node 1: 32-63, 160-191 > Node 2: 64-95, 192-223 > Node 3: 96-127, 223-255 > > NPS4: Each socket is logically divided into 4 NUMA regions. > Total 8 NUMA nodes exist over 2 socket. 
> > Node 0: 0-15, 128-143 > Node 1: 16-31, 144-159 > Node 2: 32-47, 160-175 > Node 3: 48-63, 176-191 > Node 4: 64-79, 192-207 > Node 5: 80-95, 208-223 > Node 6: 96-111, 223-231 > Node 7: 112-127, 232-255 > > Benchmark Results: > > Kernel versions: > - tip: 5.19.0-rc2 tip sched/core > - cluster: 5.19.0-rc2 tip sched/core + both the patches of the series > > When we started testing, the tip was at: > commit: f3dd3f674555 "sched: Remove the limitation of WF_ON_CPU on wakelist if wakee cpu is idle" > > * - Data points of concern > > ~~~~~~ > tbench > ~~~~~~ > > NPS1 > > Clients: tip cluster > 1 444.41 (0.00 pct) 439.27 (-1.15 pct) > 2 879.23 (0.00 pct) 831.49 (-5.42 pct) * > 4 1648.83 (0.00 pct) 1608.07 (-2.47 pct) > 8 3263.81 (0.00 pct) 3086.81 (-5.42 pct) * > 16 6011.19 (0.00 pct) 5360.28 (-10.82 pct) * > 32 12058.31 (0.00 pct) 8769.08 (-27.27 pct) * > 64 21258.21 (0.00 pct) 19021.09 (-10.52 pct) * > 128 30795.27 (0.00 pct) 30861.34 (0.21 pct) > 256 25138.43 (0.00 pct) 24711.90 (-1.69 pct) > 512 51287.93 (0.00 pct) 51855.55 (1.10 pct) > 1024 53176.97 (0.00 pct) 52554.55 (-1.17 pct) > > NPS2 > > Clients: tip cluster > 1 445.45 (0.00 pct) 441.75 (-0.83 pct) > 2 869.24 (0.00 pct) 845.61 (-2.71 pct) > 4 1644.28 (0.00 pct) 1586.49 (-3.51 pct) > 8 3120.83 (0.00 pct) 2967.01 (-4.92 pct) * > 16 5972.29 (0.00 pct) 5208.58 (-12.78 pct) * > 32 11776.38 (0.00 pct) 10229.53 (-13.13 pct) * > 64 20933.15 (0.00 pct) 17033.45 (-18.62 pct) * > 128 32195.00 (0.00 pct) 29507.85 (-8.34 pct) * > 256 24641.52 (0.00 pct) 27225.00 (10.48 pct) > 512 50806.96 (0.00 pct) 51377.50 (1.12 pct) > 1024 51993.96 (0.00 pct) 50773.35 (-2.34 pct) > > NPS4 > > Clients: tip cluster > 1 442.10 (0.00 pct) 435.06 (-1.59 pct) > 2 870.94 (0.00 pct) 858.64 (-1.41 pct) > 4 1615.30 (0.00 pct) 1607.27 (-0.49 pct) > 8 3195.95 (0.00 pct) 3020.63 (-5.48 pct) * > 16 5937.41 (0.00 pct) 5719.87 (-3.66 pct) > 32 11800.41 (0.00 pct) 11229.65 (-4.83 pct) * > 64 20844.71 (0.00 pct) 20432.79 (-1.97 pct) > 128 31003.62 (0.00 pct) 29441.20 (-5.03 pct) * > 256 27476.37 (0.00 pct) 25857.30 (-5.89 pct) * [Know to have run to run variance] > 512 52276.72 (0.00 pct) 51659.16 (-1.18 pct) > 1024 51372.10 (0.00 pct) 51026.87 (-0.67 pct) > > Note: tbench results for 256 workers are known to have > run to run variation on the test machine. Any regression > seen for the data point can be safely ignored. > > The behavior is consistent for both tip and patched kernel > across multiple runs of tbench. > > ~~~~~~~~~~~~~~~~~~~~ > Analysis done so far > ~~~~~~~~~~~~~~~~~~~~ > > To root cause this issue quicker, we have focused on 8 to 64 clients > data points with the machine running in NPS1 mode. > > - Even on disabling HW prefetcher, the behavior remains consistent > signifying HW prefetcher is not the cause of the problem. > > - Bisecting: > > When we ran the tests with only Patch 1 of the series, the > regression was visible and the numbers were worse. > > Clients: tip cluster Patch 1 Only > 8 3263.81 (0.00 pct) 3086.81 (-5.42 pct) 3018.63 (-7.51 pct) > 16 6011.19 (0.00 pct) 5360.28 (-10.82 pct) 4869.26 (-18.99 pct) > 32 12058.31 (0.00 pct) 8769.08 (-27.27 pct) 8159.60 (-32.33 pct) > 64 21258.21 (0.00 pct) 19021.09 (-10.52 pct) 13161.92 (-38.08 pct) > > We further bisected the hunks to narrow down the cause to the per CPU > variable declarations. > > > On 6/9/2022 5:36 PM, Yicong Yang wrote: >> >> [..snip..] 
>> >> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h >> index 01259611beb9..b9bcfcf8d14d 100644 >> --- a/kernel/sched/sched.h >> +++ b/kernel/sched/sched.h >> @@ -1753,7 +1753,9 @@ static inline struct sched_domain *lowest_flag_domain(int cpu, int flag) >> DECLARE_PER_CPU(struct sched_domain __rcu *, sd_llc); >> DECLARE_PER_CPU(int, sd_llc_size); >> DECLARE_PER_CPU(int, sd_llc_id); >> +DECLARE_PER_CPU(int, sd_share_id); >> DECLARE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared); >> +DECLARE_PER_CPU(struct sched_domain __rcu *, sd_cluster); > > The main reason for the regression seems to be the above declarations. > The regression seem to go away if we do one of the following: > > - Declare sd_share_id and sd_cluster using DECLARE_PER_CPU_READ_MOSTLY() > instead of DECLARE_PER_CPU() and change the corresponding definition > below to DEFINE_PER_CPU_READ_MOSTLY(). > > Clients: tip Patch 1 Patch 1 (READ_MOSTLY) > 8 3255.69 (0.00 pct) 3018.63 (-7.28 pct) 3237.33 (-0.56 pct) > 16 6092.67 (0.00 pct) 4869.26 (-20.08 pct) 5914.53 (-2.92 pct) > 32 11156.56 (0.00 pct) 8159.60 (-26.86 pct) 11536.05 (3.40 pct) > 64 21019.97 (0.00 pct) 13161.92 (-37.38 pct) 21162.33 (0.67 pct) > > - Convert sd_share_id and sd_cluster to static arrays. > > Clients: tip Patch 1 Patch 1 (Static Array) > 8 3255.69 (0.00 pct) 3018.63 (-7.28 pct) 3203.27 (-1.61 pct) > 16 6092.67 (0.00 pct) 4869.26 (-20.08 pct) 6198.35 (1.73 pct) > 32 11156.56 (0.00 pct) 8159.60 (-26.86 pct) 11385.76 (2.05 pct) > 64 21019.97 (0.00 pct) 13161.92 (-37.38 pct) 21919.80 (4.28 pct) > > - Move the declarations of sd_share_id and sd_cluster to the top > > Clients: tip Patch 1 Patch 1 (Declarion on Top) > 8 3255.69 (0.00 pct) 3018.63 (-7.28 pct) 3072.30 (-5.63 pct) > 16 6092.67 (0.00 pct) 4869.26 (-20.08 pct) 5586.59 (-8.30 pct) > 32 11156.56 (0.00 pct) 8159.60 (-26.86 pct) 11184.17 (0.24 pct) > 64 21019.97 (0.00 pct) 13161.92 (-37.38 pct) 20289.70 (-3.47 pct) > > Unfortunately, none of these are complete solutions. For example, using > DECLARE_PER_CPU_READ_MOSTLY() with both patches applied reduces the regression > but doesn't eliminate it entirely: > > Clients: tip cluster cluster (READ_MOSTLY) > 1 444.41 (0.00 pct) 439.27 (-1.15 pct) 435.95 (-1.90 pct) > 2 879.23 (0.00 pct) 831.49 (-5.42 pct) 842.09 (-4.22 pct) > 4 1648.83 (0.00 pct) 1608.07 (-2.47 pct) 1598.77 (-3.03 pct) > 8 3263.81 (0.00 pct) 3086.81 (-5.42 pct) 3090.86 (-5.29 pct) * > 16 6011.19 (0.00 pct) 5360.28 (-10.82 pct) 5360.28 (-10.82 pct) * > 32 12058.31 (0.00 pct) 8769.08 (-27.27 pct) 11083.66 (-8.08 pct) * > 64 21258.21 (0.00 pct) 19021.09 (-10.52 pct) 20984.30 (-1.28 pct) > 128 30795.27 (0.00 pct) 30861.34 (0.21 pct) 30735.20 (-0.19 pct) > 256 25138.43 (0.00 pct) 24711.90 (-1.69 pct) 24021.21 (-4.44 pct) > 512 51287.93 (0.00 pct) 51855.55 (1.10 pct) 51672.73 (0.75 pct) > 1024 53176.97 (0.00 pct) 52554.55 (-1.17 pct) 52620.02 (-1.04 pct) > > We are still trying to root cause the underlying issue that > brought about such drastic regression in tbench performance. 
> >> DECLARE_PER_CPU(struct sched_domain __rcu *, sd_numa); >> DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing); >> DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_cpucapacity); >> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c >> index 05b6c2ad90b9..0595827d481d 100644 >> --- a/kernel/sched/topology.c >> +++ b/kernel/sched/topology.c >> @@ -664,6 +664,8 @@ static void destroy_sched_domains(struct sched_domain *sd) >> DEFINE_PER_CPU(struct sched_domain __rcu *, sd_llc); >> DEFINE_PER_CPU(int, sd_llc_size); >> DEFINE_PER_CPU(int, sd_llc_id); >> +DEFINE_PER_CPU(int, sd_share_id); >> +DEFINE_PER_CPU(struct sched_domain __rcu *, sd_cluster); >> DEFINE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared); >> DEFINE_PER_CPU(struct sched_domain __rcu *, sd_numa); >> DEFINE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing); >> >> [..snip..] >> > > We would like some time to investigate this issue and root cause > the reason for this regression. One significant difference I can see for now is that Kunpeng 920 is a non-SMT machine while Zen3 is a SMT machine. So in the select_idle_sibling() path we won't use sd_llc_shared. Since sd_share_id and sd_cluster are inserted between sd_llc_id and sd_llc_shared which are both used in the path on your test, I guess the change of the layout may affect something like the cache behaviour. Can you also test with SMT disabled? Or change the layout a bit to see if there's any difference, like: DEFINE_PER_CPU(struct sched_domain __rcu *, sd_llc); DEFINE_PER_CPU(int, sd_llc_size); DEFINE_PER_CPU(int, sd_llc_id); DEFINE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared); +DEFINE_PER_CPU(int, sd_share_id); +DEFINE_PER_CPU(struct sched_domain __rcu *, sd_cluster); DEFINE_PER_CPU(struct sched_domain __rcu *, sd_numa); DEFINE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing); I need to further look into it and have some tests on a SMT machine. Would you mind to share the kernel config as well? I'd like to compare the config as well. Thanks, Yicong _______________________________________________ linux-arm-kernel mailing list linux-arm-kernel@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-arm-kernel ^ permalink raw reply [flat|nested] 23+ messages in thread
[parent not found: <6bf4f032-7d07-d4a4-4f5a-28f3871131c0@amd.com>]
* Re: [PATCH v4 1/2] sched: Add per_cpu cluster domain info and cpus_share_resources API [not found] ` <6bf4f032-7d07-d4a4-4f5a-28f3871131c0@amd.com> @ 2022-06-17 16:50 ` Tim Chen 2022-06-20 11:20 ` K Prateek Nayak 0 siblings, 1 reply; 23+ messages in thread From: Tim Chen @ 2022-06-17 16:50 UTC (permalink / raw) To: K Prateek Nayak, Yicong Yang, Yicong Yang, peterz, mingo, juri.lelli, vincent.guittot, gautham.shenoy, linux-kernel, linux-arm-kernel Cc: dietmar.eggemann, rostedt, bsegall, bristot, prime.zeng, jonathan.cameron, ego, srikar, linuxarm, 21cnbao, guodong.xu, hesham.almatary, john.garry, shenyang39, feng.tang On Fri, 2022-06-17 at 17:50 +0530, K Prateek Nayak wrote: > > > -- > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h > index e9f3dc6dcbf4..97a3895416ab 100644 > --- a/kernel/sched/sched.h > +++ b/kernel/sched/sched.h > @@ -1750,12 +1750,12 @@ static inline struct sched_domain *lowest_flag_domain(int cpu, int flag) > return sd; > } > > +DECLARE_PER_CPU(struct sched_domain __rcu *, sd_cluster); > +DECLARE_PER_CPU(int, sd_share_id); > DECLARE_PER_CPU(struct sched_domain __rcu *, sd_llc); > DECLARE_PER_CPU(int, sd_llc_size); > DECLARE_PER_CPU(int, sd_llc_id); > -DECLARE_PER_CPU(int, sd_share_id); > DECLARE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared); > -DECLARE_PER_CPU(struct sched_domain __rcu *, sd_cluster); > DECLARE_PER_CPU(struct sched_domain __rcu *, sd_numa); > DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing); > DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_cpucapacity); > -- > > The System-map of each kernel is as follows: > > - On "tip" > > 0000000000020518 D sd_asym_cpucapacity > 0000000000020520 D sd_asym_packing > 0000000000020528 D sd_numa > 0000000000020530 D sd_llc_shared > 0000000000020538 D sd_llc_id > 000000000002053c D sd_llc_size > -------------------------------------------- 64B Cacheline Boundary > 0000000000020540 D sd_llc > > - On "tip + Patch 1 only" and "tip + both patches" > > 0000000000020518 D sd_asym_cpucapacity > 0000000000020520 D sd_asym_packing > 0000000000020528 D sd_numa > 0000000000020530 D sd_cluster <----- > 0000000000020538 D sd_llc_shared > -------------------------------------------- 64B Cacheline Boundary > 0000000000020540 D sd_share_id <----- > 0000000000020544 D sd_llc_id > 0000000000020548 D sd_llc_size > 0000000000020550 D sd_llc > > > - On "tip + both patches (Move declaration to top)" > > 0000000000020518 D sd_asym_cpucapacity > 0000000000020520 D sd_asym_packing > 0000000000020528 D sd_numa > 0000000000020530 D sd_llc_shared > 0000000000020538 D sd_llc_id > 000000000002053c D sd_llc_size > -------------------------------------------- 64B Cacheline Boundary > 0000000000020540 D sd_llc Wonder if it will help to try keep sd_llc and sd_llc_size into the same cache line. They are both used in the wake up path. > 0000000000020548 D sd_share_id <----- > 0000000000020550 D sd_cluster <----- > > > Or change the layout a bit to see if there's any difference, > > like: > > > > DEFINE_PER_CPU(struct sched_domain __rcu *, sd_llc); > > DEFINE_PER_CPU(int, sd_llc_size); > > DEFINE_PER_CPU(int, sd_llc_id); > > DEFINE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared); > > +DEFINE_PER_CPU(int, sd_share_id); > > +DEFINE_PER_CPU(struct sched_domain __rcu *, sd_cluster); > > DEFINE_PER_CPU(struct sched_domain __rcu *, sd_numa); > > DEFINE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing); > > > > I need to further look into it and have some tests on a SMT machine. 
Would you mind to share > > the kernel config as well? I'd like to compare the config as well. > > I've attached the kernel config used to build the test kernel > to this mail. > > > Thanks, > > Yicong > > We are trying to debug the issue using perf and find an optimal > arrangement of the per cpu declarations to get the relevant data > used in the wakeup path on the same 64B cache line. A check of perf c2c profile difference between tip and the move new declarations to the top case could be useful. It may give some additional clues of possibel false sharing issues. Tim > > We'll keep you posted of out finding. Let me know if you need > anything else in the meantime. > -- > Thanks and Regards, > Prateek _______________________________________________ linux-arm-kernel mailing list linux-arm-kernel@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-arm-kernel ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH v4 1/2] sched: Add per_cpu cluster domain info and cpus_share_resources API 2022-06-17 16:50 ` Tim Chen @ 2022-06-20 11:20 ` K Prateek Nayak 2022-06-20 13:37 ` Abel Wu 0 siblings, 1 reply; 23+ messages in thread From: K Prateek Nayak @ 2022-06-20 11:20 UTC (permalink / raw) To: Tim Chen, Yicong Yang, Yicong Yang, peterz, mingo, juri.lelli, vincent.guittot, gautham.shenoy, linux-kernel, linux-arm-kernel Cc: dietmar.eggemann, rostedt, bsegall, bristot, prime.zeng, jonathan.cameron, ego, srikar, linuxarm, 21cnbao, guodong.xu, hesham.almatary, john.garry, shenyang39, feng.tang Hello Tim, Thank you for looking into this. On 6/17/2022 10:20 PM, Tim Chen wrote: > On Fri, 2022-06-17 at 17:50 +0530, K Prateek Nayak wrote: >> >> >> -- >> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h >> index e9f3dc6dcbf4..97a3895416ab 100644 >> --- a/kernel/sched/sched.h >> +++ b/kernel/sched/sched.h >> @@ -1750,12 +1750,12 @@ static inline struct sched_domain *lowest_flag_domain(int cpu, int flag) >> return sd; >> } >> >> +DECLARE_PER_CPU(struct sched_domain __rcu *, sd_cluster); >> +DECLARE_PER_CPU(int, sd_share_id); >> DECLARE_PER_CPU(struct sched_domain __rcu *, sd_llc); >> DECLARE_PER_CPU(int, sd_llc_size); >> DECLARE_PER_CPU(int, sd_llc_id); >> -DECLARE_PER_CPU(int, sd_share_id); >> DECLARE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared); >> -DECLARE_PER_CPU(struct sched_domain __rcu *, sd_cluster); >> DECLARE_PER_CPU(struct sched_domain __rcu *, sd_numa); >> DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing); >> DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_cpucapacity); >> -- >> >> The System-map of each kernel is as follows: >> >> - On "tip" >> >> 0000000000020518 D sd_asym_cpucapacity >> 0000000000020520 D sd_asym_packing >> 0000000000020528 D sd_numa >> 0000000000020530 D sd_llc_shared >> 0000000000020538 D sd_llc_id >> 000000000002053c D sd_llc_size >> -------------------------------------------- 64B Cacheline Boundary >> 0000000000020540 D sd_llc >> >> - On "tip + Patch 1 only" and "tip + both patches" >> >> 0000000000020518 D sd_asym_cpucapacity >> 0000000000020520 D sd_asym_packing >> 0000000000020528 D sd_numa >> 0000000000020530 D sd_cluster <----- >> 0000000000020538 D sd_llc_shared >> -------------------------------------------- 64B Cacheline Boundary >> 0000000000020540 D sd_share_id <----- >> 0000000000020544 D sd_llc_id >> 0000000000020548 D sd_llc_size >> 0000000000020550 D sd_llc >> >> >> - On "tip + both patches (Move declaration to top)" >> >> 0000000000020518 D sd_asym_cpucapacity >> 0000000000020520 D sd_asym_packing >> 0000000000020528 D sd_numa >> 0000000000020530 D sd_llc_shared >> 0000000000020538 D sd_llc_id >> 000000000002053c D sd_llc_size >> -------------------------------------------- 64B Cacheline Boundary >> 0000000000020540 D sd_llc > > Wonder if it will help to try keep sd_llc and sd_llc_size into the same > cache line. They are both used in the wake up path. We are still evaluating keeping which set of variables on the same cache line will provide the best results. I would have expected the two kernel variants - "tip" and the "tip + both patches (Move declaration to top)" - to give similar results as their System map for all the old variables remain the same and the addition of "sd_share_id" and "sd_cluster: fit in the gap after "sd_llc". However, now we see a regression for higher number of client. 
This probably hints that access to "sd_cluster" variable in Patch 2 and bringing in the extra cache line could be responsible for the regression we see with "tip + both patches (Move declaration to top)" > > >> 0000000000020548 D sd_share_id <----- >> 0000000000020550 D sd_cluster <----- >> >>> Or change the layout a bit to see if there's any difference, >>> like: >>> >>> DEFINE_PER_CPU(struct sched_domain __rcu *, sd_llc); >>> DEFINE_PER_CPU(int, sd_llc_size); >>> DEFINE_PER_CPU(int, sd_llc_id); >>> DEFINE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared); >>> +DEFINE_PER_CPU(int, sd_share_id); >>> +DEFINE_PER_CPU(struct sched_domain __rcu *, sd_cluster); >>> DEFINE_PER_CPU(struct sched_domain __rcu *, sd_numa); >>> DEFINE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing); >>> >>> I need to further look into it and have some tests on a SMT machine. Would you mind to share >>> the kernel config as well? I'd like to compare the config as well. >> >> I've attached the kernel config used to build the test kernel >> to this mail. >> >>> Thanks, >>> Yicong >> >> We are trying to debug the issue using perf and find an optimal >> arrangement of the per cpu declarations to get the relevant data >> used in the wakeup path on the same 64B cache line. > > A check of perf c2c profile difference between tip and the move new declarations to > the top case could be useful. It may give some additional clues of possibel > false sharing issues. Thank you for the suggestion. We are currently looking at perf counter data to see how the cache efficiency has changed between the two kernels. We suspect that the need for the data in the other cache line too in the wakeup path is resulting in higher cache misses in the levels closer to the core. I don't think it is a false sharing problem as these per CPU data are set when the system first boots up and will only be change again during a CPU hotplug operation. However, it might be beneficial to see the c2c profile if perf counters don't indicate anything out of the ordinary. > > Tim > >> >> We'll keep you posted of out finding. Let me know if you need >> anything else in the meantime. >> -- >> Thanks and Regards, >> Prateek > -- Thanks and Regards, Prateek _______________________________________________ linux-arm-kernel mailing list linux-arm-kernel@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-arm-kernel ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH v4 1/2] sched: Add per_cpu cluster domain info and cpus_share_resources API 2022-06-20 11:20 ` K Prateek Nayak @ 2022-06-20 13:37 ` Abel Wu 2022-06-28 11:55 ` K Prateek Nayak 0 siblings, 1 reply; 23+ messages in thread From: Abel Wu @ 2022-06-20 13:37 UTC (permalink / raw) To: K Prateek Nayak, Tim Chen, Yicong Yang, Yicong Yang, peterz, mingo, juri.lelli, vincent.guittot, gautham.shenoy, linux-kernel, linux-arm-kernel Cc: dietmar.eggemann, rostedt, bsegall, bristot, prime.zeng, jonathan.cameron, ego, srikar, linuxarm, 21cnbao, guodong.xu, hesham.almatary, john.garry, shenyang39, feng.tang On 6/20/22 7:20 PM, K Prateek Nayak Wrote: > Hello Tim, > > Thank you for looking into this. > > On 6/17/2022 10:20 PM, Tim Chen wrote: >> On Fri, 2022-06-17 at 17:50 +0530, K Prateek Nayak wrote: >>> >>> >>> -- >>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h >>> index e9f3dc6dcbf4..97a3895416ab 100644 >>> --- a/kernel/sched/sched.h >>> +++ b/kernel/sched/sched.h >>> @@ -1750,12 +1750,12 @@ static inline struct sched_domain *lowest_flag_domain(int cpu, int flag) >>> return sd; >>> } >>> >>> +DECLARE_PER_CPU(struct sched_domain __rcu *, sd_cluster); >>> +DECLARE_PER_CPU(int, sd_share_id); >>> DECLARE_PER_CPU(struct sched_domain __rcu *, sd_llc); >>> DECLARE_PER_CPU(int, sd_llc_size); >>> DECLARE_PER_CPU(int, sd_llc_id); >>> -DECLARE_PER_CPU(int, sd_share_id); >>> DECLARE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared); >>> -DECLARE_PER_CPU(struct sched_domain __rcu *, sd_cluster); >>> DECLARE_PER_CPU(struct sched_domain __rcu *, sd_numa); >>> DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing); >>> DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_cpucapacity); >>> -- >>> >>> The System-map of each kernel is as follows: >>> >>> - On "tip" >>> >>> 0000000000020518 D sd_asym_cpucapacity >>> 0000000000020520 D sd_asym_packing >>> 0000000000020528 D sd_numa >>> 0000000000020530 D sd_llc_shared >>> 0000000000020538 D sd_llc_id >>> 000000000002053c D sd_llc_size >>> -------------------------------------------- 64B Cacheline Boundary >>> 0000000000020540 D sd_llc >>> >>> - On "tip + Patch 1 only" and "tip + both patches" >>> >>> 0000000000020518 D sd_asym_cpucapacity >>> 0000000000020520 D sd_asym_packing >>> 0000000000020528 D sd_numa >>> 0000000000020530 D sd_cluster <----- >>> 0000000000020538 D sd_llc_shared >>> -------------------------------------------- 64B Cacheline Boundary >>> 0000000000020540 D sd_share_id <----- >>> 0000000000020544 D sd_llc_id >>> 0000000000020548 D sd_llc_size >>> 0000000000020550 D sd_llc >>> >>> >>> - On "tip + both patches (Move declaration to top)" >>> >>> 0000000000020518 D sd_asym_cpucapacity >>> 0000000000020520 D sd_asym_packing >>> 0000000000020528 D sd_numa >>> 0000000000020530 D sd_llc_shared >>> 0000000000020538 D sd_llc_id >>> 000000000002053c D sd_llc_size >>> -------------------------------------------- 64B Cacheline Boundary >>> 0000000000020540 D sd_llc >> >> Wonder if it will help to try keep sd_llc and sd_llc_size into the same >> cache line. They are both used in the wake up path. > > We are still evaluating keeping which set of variables on the same > cache line will provide the best results. > > I would have expected the two kernel variants - "tip" and the > "tip + both patches (Move declaration to top)" - to give similar results > as their System map for all the old variables remain the same and the > addition of "sd_share_id" and "sd_cluster: fit in the gap after "sd_llc". 
> However, now we see a regression for higher number of client. > > This probably hints that access to "sd_cluster" variable in Patch 2 and > bringing in the extra cache line could be responsible for the regression > we see with "tip + both patches (Move declaration to top)" > >> >> >>> 0000000000020548 D sd_share_id <----- >>> 0000000000020550 D sd_cluster <----- >>> >>>> Or change the layout a bit to see if there's any difference, >>>> like: >>>> >>>> DEFINE_PER_CPU(struct sched_domain __rcu *, sd_llc); >>>> DEFINE_PER_CPU(int, sd_llc_size); >>>> DEFINE_PER_CPU(int, sd_llc_id); >>>> DEFINE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared); >>>> +DEFINE_PER_CPU(int, sd_share_id); >>>> +DEFINE_PER_CPU(struct sched_domain __rcu *, sd_cluster); >>>> DEFINE_PER_CPU(struct sched_domain __rcu *, sd_numa); >>>> DEFINE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing); >>>> >>>> I need to further look into it and have some tests on a SMT machine. Would you mind to share >>>> the kernel config as well? I'd like to compare the config as well. >>> >>> I've attached the kernel config used to build the test kernel >>> to this mail. >>> >>>> Thanks, >>>> Yicong >>> >>> We are trying to debug the issue using perf and find an optimal >>> arrangement of the per cpu declarations to get the relevant data >>> used in the wakeup path on the same 64B cache line. >> >> A check of perf c2c profile difference between tip and the move new declarations to >> the top case could be useful. It may give some additional clues of possibel >> false sharing issues. > > Thank you for the suggestion. We are currently looking at perf counter > data to see how the cache efficiency has changed between the two kernels. > We suspect that the need for the data in the other cache line too in the > wakeup path is resulting in higher cache misses in the levels closer to > the core. > > I don't think it is a false sharing problem as these per CPU data are > set when the system first boots up and will only be change again during > a CPU hotplug operation. However, it might be beneficial to see the c2c > profile if perf counters don't indicate anything out of the ordinary. > Would it be possible if any other frequent-written variables share same cacheline with these per-cpu data causing false sharing? What about making all these sd_* data DEFINE_PER_CPU_READ_MOSTLY? _______________________________________________ linux-arm-kernel mailing list linux-arm-kernel@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-arm-kernel ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH v4 1/2] sched: Add per_cpu cluster domain info and cpus_share_resources API 2022-06-20 13:37 ` Abel Wu @ 2022-06-28 11:55 ` K Prateek Nayak 2022-06-29 3:22 ` Yicong Yang 0 siblings, 1 reply; 23+ messages in thread From: K Prateek Nayak @ 2022-06-28 11:55 UTC (permalink / raw) To: Abel Wu, Tim Chen, Yicong Yang, Yicong Yang, peterz, mingo, juri.lelli, vincent.guittot, gautham.shenoy, linux-kernel, linux-arm-kernel Cc: dietmar.eggemann, rostedt, bsegall, bristot, prime.zeng, jonathan.cameron, ego, srikar, linuxarm, 21cnbao, guodong.xu, hesham.almatary, john.garry, shenyang39, feng.tang Hello Abel, Thank you for looking into this issue. tl;dr We have found two variables that need to be co-located on the same cache line to restore tbench performance. As this issue is unrelated to the functionality of the patch, it should not hold this patch series from landing given the performance improvements seen on systems with CPU clusters. The results of our analysis are discussed in detail below. On 6/20/2022 7:07 PM, Abel Wu wrote: > > On 6/20/22 7:20 PM, K Prateek Nayak Wrote: >> Hello Tim, >> >> Thank you for looking into this. >> >> On 6/17/2022 10:20 PM, Tim Chen wrote: >>> On Fri, 2022-06-17 at 17:50 +0530, K Prateek Nayak wrote: >>>> >>>> >>>> -- >>>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h >>>> index e9f3dc6dcbf4..97a3895416ab 100644 >>>> --- a/kernel/sched/sched.h >>>> +++ b/kernel/sched/sched.h >>>> @@ -1750,12 +1750,12 @@ static inline struct sched_domain *lowest_flag_domain(int cpu, int flag) >>>> return sd; >>>> } >>>> +DECLARE_PER_CPU(struct sched_domain __rcu *, sd_cluster); >>>> +DECLARE_PER_CPU(int, sd_share_id); >>>> DECLARE_PER_CPU(struct sched_domain __rcu *, sd_llc); >>>> DECLARE_PER_CPU(int, sd_llc_size); >>>> DECLARE_PER_CPU(int, sd_llc_id); >>>> -DECLARE_PER_CPU(int, sd_share_id); >>>> DECLARE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared); >>>> -DECLARE_PER_CPU(struct sched_domain __rcu *, sd_cluster); >>>> DECLARE_PER_CPU(struct sched_domain __rcu *, sd_numa); >>>> DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing); >>>> DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_cpucapacity); >>>> -- >>>> >>>> The System-map of each kernel is as follows: >>>> >>>> - On "tip" >>>> >>>> 0000000000020518 D sd_asym_cpucapacity >>>> 0000000000020520 D sd_asym_packing >>>> 0000000000020528 D sd_numa >>>> 0000000000020530 D sd_llc_shared >>>> 0000000000020538 D sd_llc_id >>>> 000000000002053c D sd_llc_size >>>> -------------------------------------------- 64B Cacheline Boundary >>>> 0000000000020540 D sd_llc >>>> >>>> - On "tip + Patch 1 only" and "tip + both patches" >>>> >>>> 0000000000020518 D sd_asym_cpucapacity >>>> 0000000000020520 D sd_asym_packing >>>> 0000000000020528 D sd_numa >>>> 0000000000020530 D sd_cluster <----- >>>> 0000000000020538 D sd_llc_shared >>>> -------------------------------------------- 64B Cacheline Boundary >>>> 0000000000020540 D sd_share_id <----- >>>> 0000000000020544 D sd_llc_id >>>> 0000000000020548 D sd_llc_size >>>> 0000000000020550 D sd_llc >>>> >>>> >>>> - On "tip + both patches (Move declaration to top)" >>>> >>>> 0000000000020518 D sd_asym_cpucapacity >>>> 0000000000020520 D sd_asym_packing >>>> 0000000000020528 D sd_numa >>>> 0000000000020530 D sd_llc_shared >>>> 0000000000020538 D sd_llc_id >>>> 000000000002053c D sd_llc_size >>>> -------------------------------------------- 64B Cacheline Boundary >>>> 0000000000020540 D sd_llc >>> >>> Wonder if it will help to try keep sd_llc and sd_llc_size 
into the same >>> cache line. They are both used in the wake up path. >> >> We are still evaluating keeping which set of variables on the same >> cache line will provide the best results. >> >> I would have expected the two kernel variants - "tip" and the >> "tip + both patches (Move declaration to top)" - to give similar results >> as their System map for all the old variables remain the same and the >> addition of "sd_share_id" and "sd_cluster: fit in the gap after "sd_llc". >> However, now we see a regression for higher number of client. >> >> This probably hints that access to "sd_cluster" variable in Patch 2 and >> bringing in the extra cache line could be responsible for the regression >> we see with "tip + both patches (Move declaration to top)" >> >>> >>> >>>> 0000000000020548 D sd_share_id <----- >>>> 0000000000020550 D sd_cluster <----- >>>> >>>>> Or change the layout a bit to see if there's any difference, >>>>> like: >>>>> >>>>> DEFINE_PER_CPU(struct sched_domain __rcu *, sd_llc); >>>>> DEFINE_PER_CPU(int, sd_llc_size); >>>>> DEFINE_PER_CPU(int, sd_llc_id); >>>>> DEFINE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared); >>>>> +DEFINE_PER_CPU(int, sd_share_id); >>>>> +DEFINE_PER_CPU(struct sched_domain __rcu *, sd_cluster); >>>>> DEFINE_PER_CPU(struct sched_domain __rcu *, sd_numa); >>>>> DEFINE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing); >>>>> >>>>> I need to further look into it and have some tests on a SMT machine. Would you mind to share >>>>> the kernel config as well? I'd like to compare the config as well. >>>> >>>> I've attached the kernel config used to build the test kernel >>>> to this mail. >>>> >>>>> Thanks, >>>>> Yicong >>>> >>>> We are trying to debug the issue using perf and find an optimal >>>> arrangement of the per cpu declarations to get the relevant data >>>> used in the wakeup path on the same 64B cache line. >>> >>> A check of perf c2c profile difference between tip and the move new declarations to >>> the top case could be useful. It may give some additional clues of possibel >>> false sharing issues. >> >> Thank you for the suggestion. We are currently looking at perf counter >> data to see how the cache efficiency has changed between the two kernels. >> We suspect that the need for the data in the other cache line too in the >> wakeup path is resulting in higher cache misses in the levels closer to >> the core. >> >> I don't think it is a false sharing problem as these per CPU data are >> set when the system first boots up and will only be change again during >> a CPU hotplug operation. However, it might be beneficial to see the c2c >> profile if perf counters don't indicate anything out of the ordinary. >> > > Would it be possible if any other frequent-written variables share > same cacheline with these per-cpu data causing false sharing? This indeed seems to be the case. I'll leave the detailed analysis below. > What > about making all these sd_* data DEFINE_PER_CPU_READ_MOSTLY? > Making all the sd_* variables DEFINE_PER_CPU_READ_MOSTLY or placing all the sd_* variables on the same cache line doesn't help the regression. In fact, it makes it worse. 
Following are the results on different test kernels: tip - 5.19.0-rc2 tip (commit: f3dd3f674555 "sched: Remove the limitation of WF_ON_CPU on wakelist if wakee cpu is idle") cluster - tip + both the patches of the series Patch1 - tip + only Patch 1 align_first (Patch 1) - tip + only Patch 1 + all sd_* variables in same cache line per_cpu_aigned_struct (Patch 1) - tip + only Patch 1 + all sd_* variables part of a per_cpu struct which is cacheline aligned Clients: tip cluster Patch1 align_first (Patch 1) per_cpu_aigned_struct (Patch 1) 8 3263.81 (0.00 pct) 3086.81 (-5.42 pct) 3018.63 (-7.51 pct) 2993.65 (-8.27 pct) 1728.89 (-47.02 pct) 16 6011.19 (0.00 pct) 5360.28 (-10.82 pct) 4869.26 (-18.99 pct) 4820.18 (-19.81 pct) 3840.64 (-36.10 pct) 32 12058.31 (0.00 pct) 8769.08 (-27.27 pct) 8159.60 (-32.33 pct) 7868.82 (-34.74 pct) 7130.19 (-40.86 pct) 64 21258.21 (0.00 pct) 19021.09 (-10.52 pct) 13161.92 (-38.08 pct) 12327.86 (-42.00 pct) 12572.70 (-40.85 pct) Following is the system map for the kernel variant "align_first (Patch 1)": -- 00000000000204c0 d sugov_cpu ------------------------------------------------ 20500 (Cache Line Start) 0000000000020508 d root_cpuacct_cpuusage 0000000000020510 D cpufreq_update_util_data ------------------------------------------------ 20540 (Cache Line Start) 0000000000020540 D sd_asym_cpucapacity 0000000000020548 D sd_asym_packing 0000000000020550 D sd_numa 0000000000020558 D sd_cluster 0000000000020560 D sd_llc_shared 0000000000020568 D sd_share_id 000000000002056c D sd_llc_id 0000000000020570 D sd_llc_size 0000000000020578 D sd_llc ------------------------------------------------ 20580 (Cache Line Start) 0000000000020580 d wake_up_klogd_work 00000000000205a0 d printk_pending 00000000000205a4 d printk_count_nmi 00000000000205a5 d printk_count 00000000000205a8 d printk_context ------------------------------------------------ 205c0 (Cache Line Start) 00000000000205c0 d rcu_tasks_trace__percpu -- At this point it was clear that one or more sd_* variable needs to be co-located with the per CPU variables in cache line starting at 20540. We began moving variable out of the cache line one by one to see which variable makes the difference as we found out that as long as root_cpuacct_cpuusage and sd_llc_id are on the same cache line, the results were equivalent of what we saw on the tip. As both the variables seem to be accesses very frequently, access to one will prime the cache line containing the other variable as well leading to better cache efficiency. 
Placing root_cpuacct_cpuusage, sd_llc_id, sd_share_id, sd_llc_shared and sd_cluster on the same cache line, the results are as follows: Kernel versions: tip - 5.19.0-rc2 tip cluster - tip + both the patches of the series cluster (Custom Layout) - tip + both the patches of the series + reworked system map cluster (Custom Layout) + SIS_UTIL - cluster (Custom Layout) + v4 of SIS_UTIL patchset by Chenyu (https://lore.kernel.org/lkml/20220612163428.849378-1-yu.c.chen@intel.com/) Clients: tip cluster cluster (Custom Layout) cluster (Custom Layout) + SIS Util 1 444.41 (0.00 pct) 439.27 (-1.15 pct) 438.06 (-1.42 pct) 447.75 (0.75 pct) 2 879.23 (0.00 pct) 831.49 (-5.42 pct) 846.98 (-3.66 pct) 871.64 (-0.86 pct) 4 1648.83 (0.00 pct) 1608.07 (-2.47 pct) 1621.38 (-1.66 pct) 1656.34 (0.45 pct) 8 3263.81 (0.00 pct) 3086.81 (-5.42 pct) 3103.40 (-4.91 pct) * 3227.88 (-1.10 pct) 16 6011.19 (0.00 pct) 5360.28 (-10.82 pct) 5838.04 (-2.88 pct) 6232.92 (3.68 pct) 32 12058.31 (0.00 pct) 8769.08 (-27.27 pct) 11577.73 (-3.98 pct) 11774.10 (-2.35 pct) 64 21258.21 (0.00 pct) 19021.09 (-10.52 pct) 19563.57 (-7.97 pct) * 22044.93 (3.70 pct) 128 30795.27 (0.00 pct) 30861.34 (0.21 pct) 31705.47 (2.95 pct) 28986.14 (-5.87 pct) * 256 25138.43 (0.00 pct) 24711.90 (-1.69 pct) 23929.42 (-4.80 pct) * 43984.52 (74.96 pct) [Known to be unstable without SIS_UTIL] 512 51287.93 (0.00 pct) 51855.55 (1.10 pct) 52278.33 (1.93 pct) 51511.51 (0.43 pct) 1024 53176.97 (0.00 pct) 52554.55 (-1.17 pct) 52995.27 (-0.34 pct) 52807.04 (-0.69 pct) Chenyu's SIS_UTIL patch was merged today in the tip (commit: 70fb5ccf2ebb "sched/fair: Introduce SIS_UTIL to search idle CPU based on sum of util_avg") and seems to bring back performance to the same level as seen on the tip used during our testing. The system map of the tested configuration labelled "Custom Layout" is as follows: -- 00000000000204c0 d sugov_cpu ------------------------------------------------ 20500 (Cache Line Start) 0000000000020508 d root_cpuacct_cpuusage 0000000000020510 D cpufreq_update_util_data 0000000000020518 D sd_llc_id 000000000002051c D sd_share_id 0000000000020520 D sd_llc_shared 0000000000020528 D sd_cluster 0000000000020530 D sd_asym_cpucapacity 0000000000020538 D sd_asym_packing ------------------------------------------------ 20540 (Cache Line Start) 0000000000020540 D sd_numa 0000000000020548 D sd_llc_size 0000000000020550 D sd_llc 0000000000020560 d wake_up_klogd_work ------------------------------------------------ 20580 (Cache Line Start) 0000000000020580 d printk_pending 0000000000020584 d printk_count_nmi 0000000000020585 d printk_count 0000000000020588 d printk_context ------------------------------------------------ 205c0 (Cache Line Start) 00000000000205c0 d rcu_tasks_trace__percpu -- The System-map is however dependent on multiple factors such as config options enabled, etc. which can change from build to build. Finding a permanent solution to the issue might take some more time. Meanwhile, as this is an issue unrelated to the functionality of this patch, it should not block the landing of this patch series. Thank you everyone for your pointers and patience during this debug. Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> -- Thanks and Regards, Prateek _______________________________________________ linux-arm-kernel mailing list linux-arm-kernel@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-arm-kernel ^ permalink raw reply [flat|nested] 23+ messages in thread
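[For completeness, a rough sketch of the "per_cpu_aigned_struct" variant listed in the results above, i.e. packing the sd_* fields into one cacheline-aligned per-cpu struct. The structure actually benchmarked is not posted in the thread, so the field set and names below are assumptions for illustration only.]

/* Sketch only -- not the struct that was benchmarked above. */
struct sched_domain_ptrs {
	struct sched_domain __rcu	*sd_llc;
	int				sd_llc_size;
	int				sd_llc_id;
	int				sd_share_id;
	struct sched_domain_shared __rcu *sd_llc_shared;
	struct sched_domain __rcu	*sd_cluster;
	struct sched_domain __rcu	*sd_numa;
	struct sched_domain __rcu	*sd_asym_packing;
	struct sched_domain __rcu	*sd_asym_cpucapacity;
} ____cacheline_aligned;

DECLARE_PER_CPU(struct sched_domain_ptrs, sd_ptrs);	/* kernel/sched/sched.h */
DEFINE_PER_CPU(struct sched_domain_ptrs, sd_ptrs);	/* kernel/sched/topology.c */

/* accesses such as per_cpu(sd_llc_id, cpu) would become per_cpu(sd_ptrs, cpu).sd_llc_id */

[Judging by the numbers above, forcing this kind of alignment separates sd_llc_id from root_cpuacct_cpuusage and performs worse, which is consistent with the co-location finding described in this mail.]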
* Re: [PATCH v4 1/2] sched: Add per_cpu cluster domain info and cpus_share_resources API 2022-06-28 11:55 ` K Prateek Nayak @ 2022-06-29 3:22 ` Yicong Yang 0 siblings, 0 replies; 23+ messages in thread From: Yicong Yang @ 2022-06-29 3:22 UTC (permalink / raw) To: K Prateek Nayak, Abel Wu, Tim Chen, Yicong Yang, peterz, mingo, juri.lelli, vincent.guittot, gautham.shenoy, linux-kernel, linux-arm-kernel Cc: dietmar.eggemann, rostedt, bsegall, bristot, prime.zeng, jonathan.cameron, ego, srikar, linuxarm, 21cnbao, guodong.xu, hesham.almatary, john.garry, shenyang39, feng.tang On 2022/6/28 19:55, K Prateek Nayak wrote: > Hello Abel, > > Thank you for looking into this issue. > > tl;dr > > We have found two variables that need to be > co-located on the same cache line to restore tbench > performance. > > As this issue is unrelated to the functionality of > the patch, it should not hold this patch series from > landing given the performance improvements seen on > systems with CPU clusters. > > The results of our analysis are discussed in > detail below. > > On 6/20/2022 7:07 PM, Abel Wu wrote: >> >> On 6/20/22 7:20 PM, K Prateek Nayak Wrote: >>> Hello Tim, >>> >>> Thank you for looking into this. >>> >>> On 6/17/2022 10:20 PM, Tim Chen wrote: >>>> On Fri, 2022-06-17 at 17:50 +0530, K Prateek Nayak wrote: >>>>> >>>>> >>>>> -- >>>>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h >>>>> index e9f3dc6dcbf4..97a3895416ab 100644 >>>>> --- a/kernel/sched/sched.h >>>>> +++ b/kernel/sched/sched.h >>>>> @@ -1750,12 +1750,12 @@ static inline struct sched_domain *lowest_flag_domain(int cpu, int flag) >>>>> return sd; >>>>> } >>>>> +DECLARE_PER_CPU(struct sched_domain __rcu *, sd_cluster); >>>>> +DECLARE_PER_CPU(int, sd_share_id); >>>>> DECLARE_PER_CPU(struct sched_domain __rcu *, sd_llc); >>>>> DECLARE_PER_CPU(int, sd_llc_size); >>>>> DECLARE_PER_CPU(int, sd_llc_id); >>>>> -DECLARE_PER_CPU(int, sd_share_id); >>>>> DECLARE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared); >>>>> -DECLARE_PER_CPU(struct sched_domain __rcu *, sd_cluster); >>>>> DECLARE_PER_CPU(struct sched_domain __rcu *, sd_numa); >>>>> DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing); >>>>> DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_cpucapacity); >>>>> -- >>>>> >>>>> The System-map of each kernel is as follows: >>>>> >>>>> - On "tip" >>>>> >>>>> 0000000000020518 D sd_asym_cpucapacity >>>>> 0000000000020520 D sd_asym_packing >>>>> 0000000000020528 D sd_numa >>>>> 0000000000020530 D sd_llc_shared >>>>> 0000000000020538 D sd_llc_id >>>>> 000000000002053c D sd_llc_size >>>>> -------------------------------------------- 64B Cacheline Boundary >>>>> 0000000000020540 D sd_llc >>>>> >>>>> - On "tip + Patch 1 only" and "tip + both patches" >>>>> >>>>> 0000000000020518 D sd_asym_cpucapacity >>>>> 0000000000020520 D sd_asym_packing >>>>> 0000000000020528 D sd_numa >>>>> 0000000000020530 D sd_cluster <----- >>>>> 0000000000020538 D sd_llc_shared >>>>> -------------------------------------------- 64B Cacheline Boundary >>>>> 0000000000020540 D sd_share_id <----- >>>>> 0000000000020544 D sd_llc_id >>>>> 0000000000020548 D sd_llc_size >>>>> 0000000000020550 D sd_llc >>>>> >>>>> >>>>> - On "tip + both patches (Move declaration to top)" >>>>> >>>>> 0000000000020518 D sd_asym_cpucapacity >>>>> 0000000000020520 D sd_asym_packing >>>>> 0000000000020528 D sd_numa >>>>> 0000000000020530 D sd_llc_shared >>>>> 0000000000020538 D sd_llc_id >>>>> 000000000002053c D sd_llc_size >>>>> -------------------------------------------- 
64B Cacheline Boundary >>>>> 0000000000020540 D sd_llc >>>> >>>> Wonder if it will help to try keep sd_llc and sd_llc_size into the same >>>> cache line. They are both used in the wake up path. >>> >>> We are still evaluating keeping which set of variables on the same >>> cache line will provide the best results. >>> >>> I would have expected the two kernel variants - "tip" and the >>> "tip + both patches (Move declaration to top)" - to give similar results >>> as their System map for all the old variables remain the same and the >>> addition of "sd_share_id" and "sd_cluster: fit in the gap after "sd_llc". >>> However, now we see a regression for higher number of client. >>> >>> This probably hints that access to "sd_cluster" variable in Patch 2 and >>> bringing in the extra cache line could be responsible for the regression >>> we see with "tip + both patches (Move declaration to top)" >>> >>>> >>>> >>>>> 0000000000020548 D sd_share_id <----- >>>>> 0000000000020550 D sd_cluster <----- >>>>> >>>>>> Or change the layout a bit to see if there's any difference, >>>>>> like: >>>>>> >>>>>> DEFINE_PER_CPU(struct sched_domain __rcu *, sd_llc); >>>>>> DEFINE_PER_CPU(int, sd_llc_size); >>>>>> DEFINE_PER_CPU(int, sd_llc_id); >>>>>> DEFINE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared); >>>>>> +DEFINE_PER_CPU(int, sd_share_id); >>>>>> +DEFINE_PER_CPU(struct sched_domain __rcu *, sd_cluster); >>>>>> DEFINE_PER_CPU(struct sched_domain __rcu *, sd_numa); >>>>>> DEFINE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing); >>>>>> >>>>>> I need to further look into it and have some tests on a SMT machine. Would you mind to share >>>>>> the kernel config as well? I'd like to compare the config as well. >>>>> >>>>> I've attached the kernel config used to build the test kernel >>>>> to this mail. >>>>> >>>>>> Thanks, >>>>>> Yicong >>>>> >>>>> We are trying to debug the issue using perf and find an optimal >>>>> arrangement of the per cpu declarations to get the relevant data >>>>> used in the wakeup path on the same 64B cache line. >>>> >>>> A check of perf c2c profile difference between tip and the move new declarations to >>>> the top case could be useful. It may give some additional clues of possibel >>>> false sharing issues. >>> >>> Thank you for the suggestion. We are currently looking at perf counter >>> data to see how the cache efficiency has changed between the two kernels. >>> We suspect that the need for the data in the other cache line too in the >>> wakeup path is resulting in higher cache misses in the levels closer to >>> the core. >>> >>> I don't think it is a false sharing problem as these per CPU data are >>> set when the system first boots up and will only be change again during >>> a CPU hotplug operation. However, it might be beneficial to see the c2c >>> profile if perf counters don't indicate anything out of the ordinary. >>> >> >> Would it be possible if any other frequent-written variables share >> same cacheline with these per-cpu data causing false sharing? > > This indeed seems to be the case. I'll leave the > detailed analysis below. > The percpu variables seems won't be affected by the non-percpu variables as the percpu section is already cacheline aligned. arm64's vmlinux.ld.S shows PERCPU_SECTION(L1_CACHE_BYTES). >> What >> about making all these sd_* data DEFINE_PER_CPU_READ_MOSTLY? >> > > Making all the sd_* variables DEFINE_PER_CPU_READ_MOSTLY or > placing all the sd_* variables on the same cache line doesn't > help the regression. In fact, it makes it worse. 
> > Following are the results on different test kernels: > tip - 5.19.0-rc2 tip > (commit: f3dd3f674555 "sched: Remove the limitation of WF_ON_CPU on wakelist if wakee cpu is idle") > cluster - tip + both the patches of the series > Patch1 - tip + only Patch 1 > align_first (Patch 1) - tip + only Patch 1 + all sd_* variables in same cache line > per_cpu_aigned_struct (Patch 1) - tip + only Patch 1 + all sd_* variables part of a per_cpu struct which is cacheline aligned > > Clients: tip cluster Patch1 align_first (Patch 1) per_cpu_aigned_struct (Patch 1) > 8 3263.81 (0.00 pct) 3086.81 (-5.42 pct) 3018.63 (-7.51 pct) 2993.65 (-8.27 pct) 1728.89 (-47.02 pct) > 16 6011.19 (0.00 pct) 5360.28 (-10.82 pct) 4869.26 (-18.99 pct) 4820.18 (-19.81 pct) 3840.64 (-36.10 pct) > 32 12058.31 (0.00 pct) 8769.08 (-27.27 pct) 8159.60 (-32.33 pct) 7868.82 (-34.74 pct) 7130.19 (-40.86 pct) > 64 21258.21 (0.00 pct) 19021.09 (-10.52 pct) 13161.92 (-38.08 pct) 12327.86 (-42.00 pct) 12572.70 (-40.85 pct) > > Following is the system map for the kernel variant "align_first (Patch 1)": > > -- > 00000000000204c0 d sugov_cpu > ------------------------------------------------ 20500 (Cache Line Start) > 0000000000020508 d root_cpuacct_cpuusage > 0000000000020510 D cpufreq_update_util_data > ------------------------------------------------ 20540 (Cache Line Start) > 0000000000020540 D sd_asym_cpucapacity > 0000000000020548 D sd_asym_packing > 0000000000020550 D sd_numa > 0000000000020558 D sd_cluster > 0000000000020560 D sd_llc_shared > 0000000000020568 D sd_share_id > 000000000002056c D sd_llc_id > 0000000000020570 D sd_llc_size > 0000000000020578 D sd_llc > ------------------------------------------------ 20580 (Cache Line Start) > 0000000000020580 d wake_up_klogd_work > 00000000000205a0 d printk_pending > 00000000000205a4 d printk_count_nmi > 00000000000205a5 d printk_count > 00000000000205a8 d printk_context > ------------------------------------------------ 205c0 (Cache Line Start) > 00000000000205c0 d rcu_tasks_trace__percpu > -- > > At this point it was clear that one or more sd_* variable needs > to be co-located with the per CPU variables in cache line starting > at 20540. We began moving variable out of the cache line one by one > to see which variable makes the difference as we found out that as > long as root_cpuacct_cpuusage and sd_llc_id are on the same cache > line, the results were equivalent of what we saw on the tip. As both > the variables seem to be accesses very frequently, access to one will > prime the cache line containing the other variable as well leading to > better cache efficiency. 
> > Placing root_cpuacct_cpuusage, sd_llc_id, sd_share_id, sd_llc_shared > and sd_cluster on the same cache line, the results are as follows: > > Kernel versions: > tip - 5.19.0-rc2 tip > cluster - tip + both the patches of the series > cluster (Custom Layout) - tip + both the patches of the series + reworked system map > cluster (Custom Layout) + SIS_UTIL - cluster (Custom Layout) + v4 of SIS_UTIL patchset by Chenyu > (https://lore.kernel.org/lkml/20220612163428.849378-1-yu.c.chen@intel.com/) > > Clients: tip cluster cluster (Custom Layout) cluster (Custom Layout) > + SIS Util > 1 444.41 (0.00 pct) 439.27 (-1.15 pct) 438.06 (-1.42 pct) 447.75 (0.75 pct) > 2 879.23 (0.00 pct) 831.49 (-5.42 pct) 846.98 (-3.66 pct) 871.64 (-0.86 pct) > 4 1648.83 (0.00 pct) 1608.07 (-2.47 pct) 1621.38 (-1.66 pct) 1656.34 (0.45 pct) > 8 3263.81 (0.00 pct) 3086.81 (-5.42 pct) 3103.40 (-4.91 pct) * 3227.88 (-1.10 pct) > 16 6011.19 (0.00 pct) 5360.28 (-10.82 pct) 5838.04 (-2.88 pct) 6232.92 (3.68 pct) > 32 12058.31 (0.00 pct) 8769.08 (-27.27 pct) 11577.73 (-3.98 pct) 11774.10 (-2.35 pct) > 64 21258.21 (0.00 pct) 19021.09 (-10.52 pct) 19563.57 (-7.97 pct) * 22044.93 (3.70 pct) > 128 30795.27 (0.00 pct) 30861.34 (0.21 pct) 31705.47 (2.95 pct) 28986.14 (-5.87 pct) * > 256 25138.43 (0.00 pct) 24711.90 (-1.69 pct) 23929.42 (-4.80 pct) * 43984.52 (74.96 pct) [Known to be unstable without SIS_UTIL] > 512 51287.93 (0.00 pct) 51855.55 (1.10 pct) 52278.33 (1.93 pct) 51511.51 (0.43 pct) > 1024 53176.97 (0.00 pct) 52554.55 (-1.17 pct) 52995.27 (-0.34 pct) 52807.04 (-0.69 pct) > > Chenyu's SIS_UTIL patch was merged today in the tip > (commit: 70fb5ccf2ebb "sched/fair: Introduce SIS_UTIL to search idle CPU based on sum of util_avg") > and seems to bring back performance to the same level > as seen on the tip used during our testing. > > The system map of the tested configuration labelled "Custom Layout" > is as follows: > > -- > 00000000000204c0 d sugov_cpu > ------------------------------------------------ 20500 (Cache Line Start) > 0000000000020508 d root_cpuacct_cpuusage > 0000000000020510 D cpufreq_update_util_data > 0000000000020518 D sd_llc_id > 000000000002051c D sd_share_id > 0000000000020520 D sd_llc_shared > 0000000000020528 D sd_cluster > 0000000000020530 D sd_asym_cpucapacity > 0000000000020538 D sd_asym_packing > ------------------------------------------------ 20540 (Cache Line Start) > 0000000000020540 D sd_numa > 0000000000020548 D sd_llc_size > 0000000000020550 D sd_llc > 0000000000020560 d wake_up_klogd_work > ------------------------------------------------ 20580 (Cache Line Start) > 0000000000020580 d printk_pending > 0000000000020584 d printk_count_nmi > 0000000000020585 d printk_count > 0000000000020588 d printk_context > ------------------------------------------------ 205c0 (Cache Line Start) > 00000000000205c0 d rcu_tasks_trace__percpu > -- > Thanks for the data. It looks sufficient to illustrate the issue. I'm still a little curious that is it specific on Zen 3? Since I cannot get the Zen 3 machine so I tried to reproduce this on the a AMD 7742 (2 x 64C/128T) and an Intel 6148 (2 x 40C/80T) machine with your config based on tip/sche/core of commit f3dd3f674555 "sched: Remove the limitation of WF_ON_CPU on wakelist if wakee cpu is idle". With cluster's percpu definition patch(the worst case) the root_cpuacct_cpuusage and sd_llc_id are separated but make little difference to the tbench. 
For 7742 with NPS = 1: tip/sched/core tip/sched/core-cls-percpu threads iter-1 iter-2 iter-3 avg iter-1 iter-2 iter-3 avg 1 287.254 285.252 288.29 286.93 288.105 285.884 286.164 286.72 (-0.075%) 2 573.623 570.326 573.67 572.54 577.242 574.257 575.611 575.70 (0.553%) 4 1140.01 1135.54 1137.9 1137.82 1153.45 1146.66 1149.8 1149.97 (1.068%) 8 2232.73 2228.71 2237.47 2232.97 2280.83 2291.94 2287.55 2286.77 (2.409%) 16 4510.65 4465.2 4522.22 4499.36 4551.19 4573.85 4575.78 4566.94 (1.502%) 32 8432.43 8542.35 8404.7 8459.83 8627.4 8712.29 8764.53 8701.41 (2.856%) 64 16404.2 16464.7 15686.4 16185.10 15878.3 15638.3 15735.6 15750.73 (-2.684%) 128 23327.6 22767.4 21017.3 22370.77 23774.9 23733.2 21744.9 23084.33 (3.190%) layout of core-cls-percpu: 0000000000020508 d root_cpuacct_cpuusage 0000000000020510 D cpufreq_update_util_data 0000000000020518 D sd_asym_cpucapacity 0000000000020520 D sd_asym_packing 0000000000020528 D sd_numa 0000000000020530 D sd_cluster 0000000000020538 D sd_llc_shared ---------------------------------boundary 0000000000020540 D sd_share_id 0000000000020544 D sd_llc_id 0000000000020548 D sd_llc_size 0000000000020550 D sd_llc For 6148 with SNC disabled: tip/sched/core tip/sched/core-cls-percpu 1 314.44 ( 0.00%) 316.68 * 0.71%* 2 630.22 ( 0.00%) 633.84 * 0.57%* 4 1260.70 ( 0.00%) 1267.32 * 0.52%* 8 2475.31 ( 0.00%) 2485.94 * 0.43%* 16 4684.03 ( 0.00%) 4682.21 * -0.04%* 32 7496.28 ( 0.00%) 7360.41 * -1.81%* 64 7981.56 ( 0.00%) 7851.73 * -1.63%* 128 15359.81 ( 0.00%) 15255.68 * -0.68%* 256 15104.31 ( 0.00%) 15253.85 * 0.99%* 320 15146.47 ( 0.00%) 15151.39 * 0.03%* layout of core-cls-percpu: 0000000000020508 d root_cpuacct_cpuusage 0000000000020510 D cpufreq_update_util_data 0000000000020518 D sd_asym_cpucapacity 0000000000020520 D sd_asym_packing 0000000000020528 D sd_numa 0000000000020530 D sd_cluster 0000000000020538 D sd_llc_shared ---------------------------------boundary 0000000000020540 D sd_share_id 0000000000020544 D sd_llc_id 0000000000020548 D sd_llc_size 0000000000020550 D sd_llc > The System-map is however dependent on multiple factors > such as config options enabled, etc. which can change from > build to build. > Finding a permanent solution to the issue might take > some more time. > > Meanwhile, as this is an issue unrelated to the > functionality of this patch, it should not block > the landing of this patch series. > Thank you everyone for your pointers and patience > during this debug. > > Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> Thanks. Then I'll respin the patch. Regards, Yicong _______________________________________________ linux-arm-kernel mailing list linux-arm-kernel@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-arm-kernel ^ permalink raw reply [flat|nested] 23+ messages in thread
* [PATCH v4 2/2] sched/fair: Scan cluster before scanning LLC in wake-up path 2022-06-09 12:06 [PATCH v4 0/2] sched/fair: Wake task within the cluster when possible Yicong Yang 2022-06-09 12:06 ` [PATCH v4 1/2] sched: Add per_cpu cluster domain info and cpus_share_resources API Yicong Yang @ 2022-06-09 12:06 ` Yicong Yang 2022-06-09 22:47 ` Tim Chen 2022-06-26 12:13 ` Abel Wu 1 sibling, 2 replies; 23+ messages in thread From: Yicong Yang @ 2022-06-09 12:06 UTC (permalink / raw) To: peterz, mingo, juri.lelli, vincent.guittot, tim.c.chen, gautham.shenoy, linux-kernel, linux-arm-kernel Cc: dietmar.eggemann, rostedt, bsegall, bristot, prime.zeng, yangyicong, jonathan.cameron, ego, srikar, linuxarm, 21cnbao, guodong.xu, hesham.almatary, john.garry, shenyang39 From: Barry Song <song.bao.hua@hisilicon.com> For platforms having clusters like Kunpeng920, CPUs within the same cluster have lower latency when synchronizing and accessing shared resources like cache. Thus, this patch tries to find an idle cpu within the cluster of the target CPU before scanning the whole LLC to gain lower latency. Note neither Kunpeng920 nor x86 Jacobsville supports SMT, so this patch doesn't consider SMT for this moment. Testing has been done on Kunpeng920 by pinning tasks to one numa and two numa. On Kunpeng920, Each numa has 8 clusters and each cluster has 4 CPUs. With this patch, We noticed enhancement on tbench within one numa or cross two numa. On numa 0: 5.19-rc1 patched Hmean 1 350.27 ( 0.00%) 406.88 * 16.16%* Hmean 2 702.01 ( 0.00%) 808.22 * 15.13%* Hmean 4 1405.14 ( 0.00%) 1614.34 * 14.89%* Hmean 8 2830.53 ( 0.00%) 3169.02 * 11.96%* Hmean 16 5597.95 ( 0.00%) 6224.20 * 11.19%* Hmean 32 10537.38 ( 0.00%) 10524.97 * -0.12%* Hmean 64 8366.04 ( 0.00%) 8437.41 * 0.85%* Hmean 128 7060.87 ( 0.00%) 7150.25 * 1.27%* On numa 0-1: 5.19-rc1 patched Hmean 1 346.11 ( 0.00%) 408.47 * 18.02%* Hmean 2 693.34 ( 0.00%) 805.78 * 16.22%* Hmean 4 1384.96 ( 0.00%) 1602.49 * 15.71%* Hmean 8 2699.45 ( 0.00%) 3069.98 * 13.73%* Hmean 16 5327.11 ( 0.00%) 5688.19 * 6.78%* Hmean 32 10019.10 ( 0.00%) 11862.56 * 18.40%* Hmean 64 13850.57 ( 0.00%) 17748.54 * 28.14%* Hmean 128 12498.25 ( 0.00%) 15541.59 * 24.35%* Hmean 256 11195.77 ( 0.00%) 13854.06 * 23.74%* Tested-by: Yicong Yang <yangyicong@hisilicon.com> Signed-off-by: Barry Song <song.bao.hua@hisilicon.com> Signed-off-by: Yicong Yang <yangyicong@hisilicon.com> --- kernel/sched/fair.c | 44 +++++++++++++++++++++++++++++++++++++++++--- 1 file changed, 41 insertions(+), 3 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 77b2048a9326..6d173e196ad3 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -6327,6 +6327,40 @@ static inline int select_idle_smt(struct task_struct *p, struct sched_domain *sd #endif /* CONFIG_SCHED_SMT */ +#ifdef CONFIG_SCHED_CLUSTER +/* + * Scan the cluster domain for idle CPUs and clear cluster cpumask after scanning + */ +static inline int scan_cluster(struct task_struct *p, struct cpumask *cpus, + int target, int *nr) +{ + struct sched_domain *sd = rcu_dereference(per_cpu(sd_cluster, target)); + int cpu, idle_cpu; + + /* TODO: Support SMT system with cluster topology */ + if (!sched_smt_active() && sd) { + for_each_cpu_and(cpu, cpus, sched_domain_span(sd)) { + if (!--*nr) + break; + + idle_cpu = __select_idle_cpu(cpu, p); + if ((unsigned int)idle_cpu < nr_cpumask_bits) + return idle_cpu; + } + + cpumask_andnot(cpus, cpus, sched_domain_span(sd)); + } + + return -1; +} +#else +static inline int scan_cluster(struct task_struct *p, struct 
cpumask *cpus, + int target, int *nr) +{ + return -1; +} +#endif + /* * Scan the LLC domain for idle CPUs; this is dynamically regulated by * comparing the average scan cost (tracked in sd->avg_scan_cost) against the @@ -6375,6 +6409,10 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool time = cpu_clock(this); } + idle_cpu = scan_cluster(p, cpus, target, &nr); + if ((unsigned int)idle_cpu < nr_cpumask_bits) + return idle_cpu; + for_each_cpu_wrap(cpu, cpus, target + 1) { if (has_idle_core) { i = select_idle_core(p, cpu, cpus, &idle_cpu); @@ -6382,7 +6420,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool return i; } else { - if (!--nr) + if (--nr <= 0) return -1; idle_cpu = __select_idle_cpu(cpu, p); if ((unsigned int)idle_cpu < nr_cpumask_bits) @@ -6481,7 +6519,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target) /* * If the previous CPU is cache affine and idle, don't be stupid: */ - if (prev != target && cpus_share_cache(prev, target) && + if (prev != target && cpus_share_resources(prev, target) && (available_idle_cpu(prev) || sched_idle_cpu(prev)) && asym_fits_capacity(task_util, prev)) return prev; @@ -6507,7 +6545,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target) p->recent_used_cpu = prev; if (recent_used_cpu != prev && recent_used_cpu != target && - cpus_share_cache(recent_used_cpu, target) && + cpus_share_resources(recent_used_cpu, target) && (available_idle_cpu(recent_used_cpu) || sched_idle_cpu(recent_used_cpu)) && cpumask_test_cpu(p->recent_used_cpu, p->cpus_ptr) && asym_fits_capacity(task_util, recent_used_cpu)) { -- 2.24.0 _______________________________________________ linux-arm-kernel mailing list linux-arm-kernel@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-arm-kernel ^ permalink raw reply related [flat|nested] 23+ messages in thread
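[One detail of the diff above that is easy to miss: scan_cluster() and the following LLC loop share the same scan budget nr, and the cluster scan may already drive nr down to zero. That is presumably why the LLC loop's exit test changes from !--nr to --nr <= 0; with the old test, a nr that is already zero would decrement to -1 and the scan would keep going. A small userspace toy model of the shared budget and the cluster-first scan order, with made-up CPU numbers and an invented idleness check, not kernel code:]

/* Toy model, illustration only. */
#include <stdio.h>
#include <stdbool.h>

static bool cpu_is_idle(int cpu)
{
	return cpu == 6;			/* pretend only CPU 6 is idle */
}

/* scan a span of CPUs, consuming the shared budget *nr */
static int scan_span(const int *cpus, int n, int *nr)
{
	for (int i = 0; i < n; i++) {
		if (--*nr <= 0)			/* budget exhausted */
			return -1;
		if (cpu_is_idle(cpus[i]))
			return cpus[i];
	}
	return -1;
}

int main(void)
{
	int cluster[]     = { 0, 1, 2, 3 };	/* target CPU's cluster */
	int rest_of_llc[] = { 4, 5, 6, 7 };	/* LLC minus that cluster */
	int nr = 8, cpu;

	cpu = scan_span(cluster, 4, &nr);	/* cluster is tried first */
	if (cpu < 0)
		cpu = scan_span(rest_of_llc, 4, &nr);

	printf("picked CPU %d, scan budget left %d\n", cpu, nr);
	return 0;
}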
* Re: [PATCH v4 2/2] sched/fair: Scan cluster before scanning LLC in wake-up path 2022-06-09 12:06 ` [PATCH v4 2/2] sched/fair: Scan cluster before scanning LLC in wake-up path Yicong Yang @ 2022-06-09 22:47 ` Tim Chen 2022-06-10 4:01 ` Barry Song 2022-06-10 6:39 ` Yicong Yang 2022-06-26 12:13 ` Abel Wu 1 sibling, 2 replies; 23+ messages in thread From: Tim Chen @ 2022-06-09 22:47 UTC (permalink / raw) To: Yicong Yang, peterz, mingo, juri.lelli, vincent.guittot, gautham.shenoy, linux-kernel, linux-arm-kernel Cc: dietmar.eggemann, rostedt, bsegall, bristot, prime.zeng, jonathan.cameron, ego, srikar, linuxarm, 21cnbao, guodong.xu, hesham.almatary, john.garry, shenyang39 On Thu, 2022-06-09 at 20:06 +0800, Yicong Yang wrote: > From: Barry Song <song.bao.hua@hisilicon.com> > > For platforms having clusters like Kunpeng920, CPUs within the same cluster > have lower latency when synchronizing and accessing shared resources like > cache. Thus, this patch tries to find an idle cpu within the cluster of the > target CPU before scanning the whole LLC to gain lower latency. > > Note neither Kunpeng920 nor x86 Jacobsville supports SMT, so this patch > doesn't consider SMT for this moment. > > Testing has been done on Kunpeng920 by pinning tasks to one numa and two > numa. On Kunpeng920, Each numa has 8 clusters and each cluster has 4 CPUs. > > With this patch, We noticed enhancement on tbench within one numa or cross > two numa. > > On numa 0: > 5.19-rc1 patched > Hmean 1 350.27 ( 0.00%) 406.88 * 16.16%* > Hmean 2 702.01 ( 0.00%) 808.22 * 15.13%* > Hmean 4 1405.14 ( 0.00%) 1614.34 * 14.89%* > Hmean 8 2830.53 ( 0.00%) 3169.02 * 11.96%* > Hmean 16 5597.95 ( 0.00%) 6224.20 * 11.19%* > Hmean 32 10537.38 ( 0.00%) 10524.97 * -0.12%* > Hmean 64 8366.04 ( 0.00%) 8437.41 * 0.85%* > Hmean 128 7060.87 ( 0.00%) 7150.25 * 1.27%* > > On numa 0-1: > 5.19-rc1 patched > Hmean 1 346.11 ( 0.00%) 408.47 * 18.02%* > Hmean 2 693.34 ( 0.00%) 805.78 * 16.22%* > Hmean 4 1384.96 ( 0.00%) 1602.49 * 15.71%* > Hmean 8 2699.45 ( 0.00%) 3069.98 * 13.73%* > Hmean 16 5327.11 ( 0.00%) 5688.19 * 6.78%* > Hmean 32 10019.10 ( 0.00%) 11862.56 * 18.40%* > Hmean 64 13850.57 ( 0.00%) 17748.54 * 28.14%* > Hmean 128 12498.25 ( 0.00%) 15541.59 * 24.35%* > Hmean 256 11195.77 ( 0.00%) 13854.06 * 23.74%* Yicong, Have you tried any workload where tasks don't share data with each other but have sleep/wakeup? That's the case where we actually want to spread the tasks out among the clusters to void contention for L2 cache. Will be nice to make sure there's no regression there for such workload. Code itself looks good. 
Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com> > > Tested-by: Yicong Yang <yangyicong@hisilicon.com> > Signed-off-by: Barry Song <song.bao.hua@hisilicon.com> > Signed-off-by: Yicong Yang <yangyicong@hisilicon.com> > --- > kernel/sched/fair.c | 44 +++++++++++++++++++++++++++++++++++++++++--- > 1 file changed, 41 insertions(+), 3 deletions(-) > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > index 77b2048a9326..6d173e196ad3 100644 > --- a/kernel/sched/fair.c > +++ b/kernel/sched/fair.c > @@ -6327,6 +6327,40 @@ static inline int select_idle_smt(struct task_struct *p, struct sched_domain *sd > > #endif /* CONFIG_SCHED_SMT */ > > +#ifdef CONFIG_SCHED_CLUSTER > +/* > + * Scan the cluster domain for idle CPUs and clear cluster cpumask after scanning > + */ > +static inline int scan_cluster(struct task_struct *p, struct cpumask *cpus, > + int target, int *nr) > +{ > + struct sched_domain *sd = rcu_dereference(per_cpu(sd_cluster, target)); > + int cpu, idle_cpu; > + > + /* TODO: Support SMT system with cluster topology */ > + if (!sched_smt_active() && sd) { > + for_each_cpu_and(cpu, cpus, sched_domain_span(sd)) { > + if (!--*nr) > + break; > + > + idle_cpu = __select_idle_cpu(cpu, p); > + if ((unsigned int)idle_cpu < nr_cpumask_bits) > + return idle_cpu; > + } > + > + cpumask_andnot(cpus, cpus, sched_domain_span(sd)); > + } > + > + return -1; > +} > +#else > +static inline int scan_cluster(struct task_struct *p, struct cpumask *cpus, > + int target, int *nr) > +{ > + return -1; > +} > +#endif > + > /* > * Scan the LLC domain for idle CPUs; this is dynamically regulated by > * comparing the average scan cost (tracked in sd->avg_scan_cost) against the > @@ -6375,6 +6409,10 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool > time = cpu_clock(this); > } > > + idle_cpu = scan_cluster(p, cpus, target, &nr); > + if ((unsigned int)idle_cpu < nr_cpumask_bits) > + return idle_cpu; > + > for_each_cpu_wrap(cpu, cpus, target + 1) { > if (has_idle_core) { > i = select_idle_core(p, cpu, cpus, &idle_cpu); > @@ -6382,7 +6420,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool > return i; > > } else { > - if (!--nr) > + if (--nr <= 0) > return -1; > idle_cpu = __select_idle_cpu(cpu, p); > if ((unsigned int)idle_cpu < nr_cpumask_bits) > @@ -6481,7 +6519,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target) > /* > * If the previous CPU is cache affine and idle, don't be stupid: > */ > - if (prev != target && cpus_share_cache(prev, target) && > + if (prev != target && cpus_share_resources(prev, target) && > (available_idle_cpu(prev) || sched_idle_cpu(prev)) && > asym_fits_capacity(task_util, prev)) > return prev; > @@ -6507,7 +6545,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target) > p->recent_used_cpu = prev; > if (recent_used_cpu != prev && > recent_used_cpu != target && > - cpus_share_cache(recent_used_cpu, target) && > + cpus_share_resources(recent_used_cpu, target) && > (available_idle_cpu(recent_used_cpu) || sched_idle_cpu(recent_used_cpu)) && > cpumask_test_cpu(p->recent_used_cpu, p->cpus_ptr) && > asym_fits_capacity(task_util, recent_used_cpu)) { _______________________________________________ linux-arm-kernel mailing list linux-arm-kernel@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-arm-kernel ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH v4 2/2] sched/fair: Scan cluster before scanning LLC in wake-up path 2022-06-09 22:47 ` Tim Chen @ 2022-06-10 4:01 ` Barry Song 2022-06-10 6:39 ` Yicong Yang 1 sibling, 0 replies; 23+ messages in thread From: Barry Song @ 2022-06-10 4:01 UTC (permalink / raw) To: Tim Chen Cc: Yicong Yang, Peter Zijlstra, Ingo Molnar, Juri Lelli, Vincent Guittot, Gautham R. Shenoy, LKML, LAK, Dietmar Eggemann, Steven Rostedt, Ben Segall, Daniel Bristot de Oliveira, prime.zeng, Jonathan Cameron, ego, Srikar Dronamraju, Linuxarm, Guodong Xu, hesham.almatary, john.garry, Yang Shen On Fri, Jun 10, 2022 at 10:47 AM Tim Chen <tim.c.chen@linux.intel.com> wrote: > > On Thu, 2022-06-09 at 20:06 +0800, Yicong Yang wrote: > > From: Barry Song <song.bao.hua@hisilicon.com> > > > > For platforms having clusters like Kunpeng920, CPUs within the same cluster > > have lower latency when synchronizing and accessing shared resources like > > cache. Thus, this patch tries to find an idle cpu within the cluster of the > > target CPU before scanning the whole LLC to gain lower latency. > > > > Note neither Kunpeng920 nor x86 Jacobsville supports SMT, so this patch > > doesn't consider SMT for this moment. > > > > Testing has been done on Kunpeng920 by pinning tasks to one numa and two > > numa. On Kunpeng920, Each numa has 8 clusters and each cluster has 4 CPUs. > > > > With this patch, We noticed enhancement on tbench within one numa or cross > > two numa. > > > > On numa 0: > > 5.19-rc1 patched > > Hmean 1 350.27 ( 0.00%) 406.88 * 16.16%* > > Hmean 2 702.01 ( 0.00%) 808.22 * 15.13%* > > Hmean 4 1405.14 ( 0.00%) 1614.34 * 14.89%* > > Hmean 8 2830.53 ( 0.00%) 3169.02 * 11.96%* > > Hmean 16 5597.95 ( 0.00%) 6224.20 * 11.19%* > > Hmean 32 10537.38 ( 0.00%) 10524.97 * -0.12%* > > Hmean 64 8366.04 ( 0.00%) 8437.41 * 0.85%* > > Hmean 128 7060.87 ( 0.00%) 7150.25 * 1.27%* > > > > On numa 0-1: > > 5.19-rc1 patched > > Hmean 1 346.11 ( 0.00%) 408.47 * 18.02%* > > Hmean 2 693.34 ( 0.00%) 805.78 * 16.22%* > > Hmean 4 1384.96 ( 0.00%) 1602.49 * 15.71%* > > Hmean 8 2699.45 ( 0.00%) 3069.98 * 13.73%* > > Hmean 16 5327.11 ( 0.00%) 5688.19 * 6.78%* > > Hmean 32 10019.10 ( 0.00%) 11862.56 * 18.40%* > > Hmean 64 13850.57 ( 0.00%) 17748.54 * 28.14%* > > Hmean 128 12498.25 ( 0.00%) 15541.59 * 24.35%* > > Hmean 256 11195.77 ( 0.00%) 13854.06 * 23.74%* > > Yicong, > > Have you tried any workload where tasks don't share data > with each other but have sleep/wakeup? That's the case > where we actually want to spread the tasks out among the clusters > to void contention for L2 cache. guess there isn't a way to universally win in term of the modifcation of scheduling policies. The good news is that as long as LB is still there, tasks still get a good chance to spread if they stop talking to each other again and again after a small talk. if tasks do like spreading but keep in constant touch with each other, we may have to interfere with them from userspace by openMP, numactl or taskset etc. as WAKE_AFFINE is always an enemy for this requirement. > > Will be nice to make sure there's no regression there for > such workload. > > Code itself looks good. 
> > Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com> > > > > > Tested-by: Yicong Yang <yangyicong@hisilicon.com> > > Signed-off-by: Barry Song <song.bao.hua@hisilicon.com> > > Signed-off-by: Yicong Yang <yangyicong@hisilicon.com> > > --- > > kernel/sched/fair.c | 44 +++++++++++++++++++++++++++++++++++++++++--- > > 1 file changed, 41 insertions(+), 3 deletions(-) > > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > > index 77b2048a9326..6d173e196ad3 100644 > > --- a/kernel/sched/fair.c > > +++ b/kernel/sched/fair.c > > @@ -6327,6 +6327,40 @@ static inline int select_idle_smt(struct task_struct *p, struct sched_domain *sd > > > > #endif /* CONFIG_SCHED_SMT */ > > > > +#ifdef CONFIG_SCHED_CLUSTER > > +/* > > + * Scan the cluster domain for idle CPUs and clear cluster cpumask after scanning > > + */ > > +static inline int scan_cluster(struct task_struct *p, struct cpumask *cpus, > > + int target, int *nr) > > +{ > > + struct sched_domain *sd = rcu_dereference(per_cpu(sd_cluster, target)); > > + int cpu, idle_cpu; > > + > > + /* TODO: Support SMT system with cluster topology */ > > + if (!sched_smt_active() && sd) { > > + for_each_cpu_and(cpu, cpus, sched_domain_span(sd)) { > > + if (!--*nr) > > + break; > > + > > + idle_cpu = __select_idle_cpu(cpu, p); > > + if ((unsigned int)idle_cpu < nr_cpumask_bits) > > + return idle_cpu; > > + } > > + > > + cpumask_andnot(cpus, cpus, sched_domain_span(sd)); > > + } > > + > > + return -1; > > +} > > +#else > > +static inline int scan_cluster(struct task_struct *p, struct cpumask *cpus, > > + int target, int *nr) > > +{ > > + return -1; > > +} > > +#endif > > + > > /* > > * Scan the LLC domain for idle CPUs; this is dynamically regulated by > > * comparing the average scan cost (tracked in sd->avg_scan_cost) against the > > @@ -6375,6 +6409,10 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool > > time = cpu_clock(this); > > } > > > > + idle_cpu = scan_cluster(p, cpus, target, &nr); > > + if ((unsigned int)idle_cpu < nr_cpumask_bits) > > + return idle_cpu; > > + > > for_each_cpu_wrap(cpu, cpus, target + 1) { > > if (has_idle_core) { > > i = select_idle_core(p, cpu, cpus, &idle_cpu); > > @@ -6382,7 +6420,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool > > return i; > > > > } else { > > - if (!--nr) > > + if (--nr <= 0) > > return -1; > > idle_cpu = __select_idle_cpu(cpu, p); > > if ((unsigned int)idle_cpu < nr_cpumask_bits) > > @@ -6481,7 +6519,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target) > > /* > > * If the previous CPU is cache affine and idle, don't be stupid: > > */ > > - if (prev != target && cpus_share_cache(prev, target) && > > + if (prev != target && cpus_share_resources(prev, target) && > > (available_idle_cpu(prev) || sched_idle_cpu(prev)) && > > asym_fits_capacity(task_util, prev)) > > return prev; > > @@ -6507,7 +6545,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target) > > p->recent_used_cpu = prev; > > if (recent_used_cpu != prev && > > recent_used_cpu != target && > > - cpus_share_cache(recent_used_cpu, target) && > > + cpus_share_resources(recent_used_cpu, target) && > > (available_idle_cpu(recent_used_cpu) || sched_idle_cpu(recent_used_cpu)) && > > cpumask_test_cpu(p->recent_used_cpu, p->cpus_ptr) && > > asym_fits_capacity(task_util, recent_used_cpu)) { > Thanks Barry _______________________________________________ linux-arm-kernel mailing list linux-arm-kernel@lists.infradead.org 
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel ^ permalink raw reply [flat|nested] 23+ messages in thread
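[To make the userspace-pinning option Barry mentions concrete: a small sketch that confines the calling task to one cluster with sched_setaffinity(), which is roughly what taskset or numactl do under the hood. The CPU numbers are assumptions; on the Kunpeng920 layout described in this thread a cluster has 4 CPUs, and the real IDs should be read from the cluster topology files under /sys/devices/system/cpu/cpuN/topology/ on kernels with cluster support.]

/* Illustration only: pin the calling task to an assumed cluster of CPUs 0-3. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
	cpu_set_t mask;

	CPU_ZERO(&mask);
	for (int cpu = 0; cpu < 4; cpu++)	/* assumed cluster: CPUs 0-3 */
		CPU_SET(cpu, &mask);

	if (sched_setaffinity(0, sizeof(mask), &mask)) {
		perror("sched_setaffinity");
		return 1;
	}
	printf("pinned to CPUs 0-3\n");
	return 0;
}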
* Re: [PATCH v4 2/2] sched/fair: Scan cluster before scanning LLC in wake-up path 2022-06-09 22:47 ` Tim Chen 2022-06-10 4:01 ` Barry Song @ 2022-06-10 6:39 ` Yicong Yang 2022-06-10 21:19 ` Tim Chen 2022-06-11 3:03 ` Chen Yu 1 sibling, 2 replies; 23+ messages in thread From: Yicong Yang @ 2022-06-10 6:39 UTC (permalink / raw) To: Tim Chen, Yicong Yang, peterz, mingo, juri.lelli, vincent.guittot, gautham.shenoy, linux-kernel, linux-arm-kernel Cc: dietmar.eggemann, rostedt, bsegall, bristot, prime.zeng, jonathan.cameron, ego, srikar, linuxarm, 21cnbao, guodong.xu, hesham.almatary, john.garry, shenyang39, Chen Yu On 2022/6/10 6:47, Tim Chen wrote: > On Thu, 2022-06-09 at 20:06 +0800, Yicong Yang wrote: >> From: Barry Song <song.bao.hua@hisilicon.com> >> >> For platforms having clusters like Kunpeng920, CPUs within the same cluster >> have lower latency when synchronizing and accessing shared resources like >> cache. Thus, this patch tries to find an idle cpu within the cluster of the >> target CPU before scanning the whole LLC to gain lower latency. >> >> Note neither Kunpeng920 nor x86 Jacobsville supports SMT, so this patch >> doesn't consider SMT for this moment. >> >> Testing has been done on Kunpeng920 by pinning tasks to one numa and two >> numa. On Kunpeng920, Each numa has 8 clusters and each cluster has 4 CPUs. >> >> With this patch, We noticed enhancement on tbench within one numa or cross >> two numa. >> >> On numa 0: >> 5.19-rc1 patched >> Hmean 1 350.27 ( 0.00%) 406.88 * 16.16%* >> Hmean 2 702.01 ( 0.00%) 808.22 * 15.13%* >> Hmean 4 1405.14 ( 0.00%) 1614.34 * 14.89%* >> Hmean 8 2830.53 ( 0.00%) 3169.02 * 11.96%* >> Hmean 16 5597.95 ( 0.00%) 6224.20 * 11.19%* >> Hmean 32 10537.38 ( 0.00%) 10524.97 * -0.12%* >> Hmean 64 8366.04 ( 0.00%) 8437.41 * 0.85%* >> Hmean 128 7060.87 ( 0.00%) 7150.25 * 1.27%* >> >> On numa 0-1: >> 5.19-rc1 patched >> Hmean 1 346.11 ( 0.00%) 408.47 * 18.02%* >> Hmean 2 693.34 ( 0.00%) 805.78 * 16.22%* >> Hmean 4 1384.96 ( 0.00%) 1602.49 * 15.71%* >> Hmean 8 2699.45 ( 0.00%) 3069.98 * 13.73%* >> Hmean 16 5327.11 ( 0.00%) 5688.19 * 6.78%* >> Hmean 32 10019.10 ( 0.00%) 11862.56 * 18.40%* >> Hmean 64 13850.57 ( 0.00%) 17748.54 * 28.14%* >> Hmean 128 12498.25 ( 0.00%) 15541.59 * 24.35%* >> Hmean 256 11195.77 ( 0.00%) 13854.06 * 23.74%* > > Yicong, > > Have you tried any workload where tasks don't share data > with each other but have sleep/wakeup? That's the case > where we actually want to spread the tasks out among the clusters > to void contention for L2 cache. > > Will be nice to make sure there's no regression there for > such workload. > Any certain workload you'd like me test? I'm willing to do :) I've tested this patch with MySQL as well (like in v2). This won't hurt the MySQL case with SIS_PROP but observed some improvement with SIS_UTIL posted in [1]. We leverage the nr to suppress redundant scanning in the current approach and seems SIS_UTIL is more efficient in this case. 
5.19-rc1 patched patched+SIS_UTIL[1] TPS-16threads 6215.11 6172.74 (-0.68%) 6217.33 (0.04%) QPS-16threads 124302.21 123454.68 (-0.68%) 124346.52 (0.04%) avg-lat-16threads 2.57 2.59 (-0.65%) 2.57 (0.00%) TPS-24threads 8726.40 8690.87 (-0.41%) 8833.08 (1.22%) QPS-24threads 174527.88 173817.42 (-0.41%) 176661.54 (1.21%) avg-lat-24threads 2.75 2.76 (-0.36%) 2.71 (1.33%) TPS-32threads 9555.42 9514.86 (-0.42%) 10010.87 (4.77%) QPS-32threads 191108.37 190297.28 (-0.42%) 200217.35 (4.55%) avg-lat-32threads 3.35 3.36 (-0.30%) 3.20 (4.58%) TPS-64threads 10290.10 10324.75 (0.34%) 10819.77 (5.15%) QPS-64threads 205802.05 206494.95 (0.34%) 216395.40 (4.90%) avg-lat-64threads 6.22 6.20 (0.38%) 5.92 (4.88%) > Code itself looks good. > > Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com> > Thanks. [1] https://lore.kernel.org/lkml/20220428182442.659294-1-yu.c.chen@intel.com/ >> >> Tested-by: Yicong Yang <yangyicong@hisilicon.com> >> Signed-off-by: Barry Song <song.bao.hua@hisilicon.com> >> Signed-off-by: Yicong Yang <yangyicong@hisilicon.com> >> --- >> kernel/sched/fair.c | 44 +++++++++++++++++++++++++++++++++++++++++--- >> 1 file changed, 41 insertions(+), 3 deletions(-) >> >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c >> index 77b2048a9326..6d173e196ad3 100644 >> --- a/kernel/sched/fair.c >> +++ b/kernel/sched/fair.c >> @@ -6327,6 +6327,40 @@ static inline int select_idle_smt(struct task_struct *p, struct sched_domain *sd >> >> #endif /* CONFIG_SCHED_SMT */ >> >> +#ifdef CONFIG_SCHED_CLUSTER >> +/* >> + * Scan the cluster domain for idle CPUs and clear cluster cpumask after scanning >> + */ >> +static inline int scan_cluster(struct task_struct *p, struct cpumask *cpus, >> + int target, int *nr) >> +{ >> + struct sched_domain *sd = rcu_dereference(per_cpu(sd_cluster, target)); >> + int cpu, idle_cpu; >> + >> + /* TODO: Support SMT system with cluster topology */ >> + if (!sched_smt_active() && sd) { >> + for_each_cpu_and(cpu, cpus, sched_domain_span(sd)) { >> + if (!--*nr) >> + break; >> + >> + idle_cpu = __select_idle_cpu(cpu, p); >> + if ((unsigned int)idle_cpu < nr_cpumask_bits) >> + return idle_cpu; >> + } >> + >> + cpumask_andnot(cpus, cpus, sched_domain_span(sd)); >> + } >> + >> + return -1; >> +} >> +#else >> +static inline int scan_cluster(struct task_struct *p, struct cpumask *cpus, >> + int target, int *nr) >> +{ >> + return -1; >> +} >> +#endif >> + >> /* >> * Scan the LLC domain for idle CPUs; this is dynamically regulated by >> * comparing the average scan cost (tracked in sd->avg_scan_cost) against the >> @@ -6375,6 +6409,10 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool >> time = cpu_clock(this); >> } >> >> + idle_cpu = scan_cluster(p, cpus, target, &nr); >> + if ((unsigned int)idle_cpu < nr_cpumask_bits) >> + return idle_cpu; >> + >> for_each_cpu_wrap(cpu, cpus, target + 1) { >> if (has_idle_core) { >> i = select_idle_core(p, cpu, cpus, &idle_cpu); >> @@ -6382,7 +6420,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool >> return i; >> >> } else { >> - if (!--nr) >> + if (--nr <= 0) >> return -1; >> idle_cpu = __select_idle_cpu(cpu, p); >> if ((unsigned int)idle_cpu < nr_cpumask_bits) >> @@ -6481,7 +6519,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target) >> /* >> * If the previous CPU is cache affine and idle, don't be stupid: >> */ >> - if (prev != target && cpus_share_cache(prev, target) && >> + if (prev != target && cpus_share_resources(prev, target) && >> 
(available_idle_cpu(prev) || sched_idle_cpu(prev)) && >> asym_fits_capacity(task_util, prev)) >> return prev; >> @@ -6507,7 +6545,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target) >> p->recent_used_cpu = prev; >> if (recent_used_cpu != prev && >> recent_used_cpu != target && >> - cpus_share_cache(recent_used_cpu, target) && >> + cpus_share_resources(recent_used_cpu, target) && >> (available_idle_cpu(recent_used_cpu) || sched_idle_cpu(recent_used_cpu)) && >> cpumask_test_cpu(p->recent_used_cpu, p->cpus_ptr) && >> asym_fits_capacity(task_util, recent_used_cpu)) { > > > . > _______________________________________________ linux-arm-kernel mailing list linux-arm-kernel@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-arm-kernel ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH v4 2/2] sched/fair: Scan cluster before scanning LLC in wake-up path
  2022-06-10  6:39     ` Yicong Yang
@ 2022-06-10 21:19       ` Tim Chen
  2022-06-11  3:03       ` Chen Yu
  1 sibling, 0 replies; 23+ messages in thread
From: Tim Chen @ 2022-06-10 21:19 UTC (permalink / raw)
  To: Yicong Yang, Yicong Yang, peterz, mingo, juri.lelli,
	vincent.guittot, gautham.shenoy, linux-kernel, linux-arm-kernel
  Cc: dietmar.eggemann, rostedt, bsegall, bristot, prime.zeng,
	jonathan.cameron, ego, srikar, linuxarm, 21cnbao, guodong.xu,
	hesham.almatary, john.garry, shenyang39, Chen Yu

On Fri, 2022-06-10 at 14:39 +0800, Yicong Yang wrote:
>
>
> I've tested this patch with MySQL as well (like in v2). This won't hurt
> the MySQL case with SIS_PROP but observed some improvement with SIS_UTIL
> posted in [1]. We leverage the nr to suppress redundant scanning in the
> current approach and seems SIS_UTIL is more efficient in this case.
>
>                    5.19-rc1    patched             patched+SIS_UTIL[1]
> TPS-16threads      6215.11     6172.74 (-0.68%)    6217.33 (0.04%)
> QPS-16threads      124302.21   123454.68 (-0.68%)  124346.52 (0.04%)
> avg-lat-16threads  2.57        2.59 (-0.65%)       2.57 (0.00%)
> TPS-24threads      8726.40     8690.87 (-0.41%)    8833.08 (1.22%)
> QPS-24threads      174527.88   173817.42 (-0.41%)  176661.54 (1.21%)
> avg-lat-24threads  2.75        2.76 (-0.36%)       2.71 (1.33%)
> TPS-32threads      9555.42     9514.86 (-0.42%)    10010.87 (4.77%)
> QPS-32threads      191108.37   190297.28 (-0.42%)  200217.35 (4.55%)
> avg-lat-32threads  3.35        3.36 (-0.30%)       3.20 (4.58%)
> TPS-64threads      10290.10    10324.75 (0.34%)    10819.77 (5.15%)
> QPS-64threads      205802.05   206494.95 (0.34%)   216395.40 (4.90%)
> avg-lat-64threads  6.22        6.20 (0.38%)        5.92 (4.88%)
>

Thanks for the numbers. SIS_UTIL will hold off migrations off the
cluster that don't really improve overall utilization. We have a higher
chance that the L2 cache is warm. So it makes sense that we see a bit
better performance there.

Tim

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH v4 2/2] sched/fair: Scan cluster before scanning LLC in wake-up path 2022-06-10 6:39 ` Yicong Yang 2022-06-10 21:19 ` Tim Chen @ 2022-06-11 3:03 ` Chen Yu 2022-06-11 7:40 ` Yicong Yang 1 sibling, 1 reply; 23+ messages in thread From: Chen Yu @ 2022-06-11 3:03 UTC (permalink / raw) To: Yicong Yang Cc: Tim Chen, Yicong Yang, peterz, mingo, juri.lelli, vincent.guittot, gautham.shenoy, linux-kernel, linux-arm-kernel, dietmar.eggemann, rostedt, bsegall, bristot, prime.zeng, jonathan.cameron, ego, srikar, linuxarm, 21cnbao, guodong.xu, hesham.almatary, john.garry, shenyang39 Hi Yicong, On Fri, Jun 10, 2022 at 02:39:38PM +0800, Yicong Yang wrote: > On 2022/6/10 6:47, Tim Chen wrote: > > On Thu, 2022-06-09 at 20:06 +0800, Yicong Yang wrote: > >> From: Barry Song <song.bao.hua@hisilicon.com> > >> > >> For platforms having clusters like Kunpeng920, CPUs within the same cluster > >> have lower latency when synchronizing and accessing shared resources like > >> cache. Thus, this patch tries to find an idle cpu within the cluster of the > >> target CPU before scanning the whole LLC to gain lower latency. > >> > >> Note neither Kunpeng920 nor x86 Jacobsville supports SMT, so this patch > >> doesn't consider SMT for this moment. > >> > >> Testing has been done on Kunpeng920 by pinning tasks to one numa and two > >> numa. On Kunpeng920, Each numa has 8 clusters and each cluster has 4 CPUs. > >> > >> With this patch, We noticed enhancement on tbench within one numa or cross > >> two numa. > >> > >> On numa 0: > >> 5.19-rc1 patched > >> Hmean 1 350.27 ( 0.00%) 406.88 * 16.16%* > >> Hmean 2 702.01 ( 0.00%) 808.22 * 15.13%* > >> Hmean 4 1405.14 ( 0.00%) 1614.34 * 14.89%* > >> Hmean 8 2830.53 ( 0.00%) 3169.02 * 11.96%* > >> Hmean 16 5597.95 ( 0.00%) 6224.20 * 11.19%* > >> Hmean 32 10537.38 ( 0.00%) 10524.97 * -0.12%* > >> Hmean 64 8366.04 ( 0.00%) 8437.41 * 0.85%* > >> Hmean 128 7060.87 ( 0.00%) 7150.25 * 1.27%* > >> > >> On numa 0-1: > >> 5.19-rc1 patched > >> Hmean 1 346.11 ( 0.00%) 408.47 * 18.02%* > >> Hmean 2 693.34 ( 0.00%) 805.78 * 16.22%* > >> Hmean 4 1384.96 ( 0.00%) 1602.49 * 15.71%* > >> Hmean 8 2699.45 ( 0.00%) 3069.98 * 13.73%* > >> Hmean 16 5327.11 ( 0.00%) 5688.19 * 6.78%* > >> Hmean 32 10019.10 ( 0.00%) 11862.56 * 18.40%* > >> Hmean 64 13850.57 ( 0.00%) 17748.54 * 28.14%* > >> Hmean 128 12498.25 ( 0.00%) 15541.59 * 24.35%* > >> Hmean 256 11195.77 ( 0.00%) 13854.06 * 23.74%* > > > > Yicong, > > > > Have you tried any workload where tasks don't share data > > with each other but have sleep/wakeup? That's the case > > where we actually want to spread the tasks out among the clusters > > to void contention for L2 cache. > > > > Will be nice to make sure there's no regression there for > > such workload. > > > > Any certain workload you'd like me test? I'm willing to do :) > > I've tested this patch with MySQL as well (like in v2). This won't hurt > the MySQL case with SIS_PROP but observed some improvement with SIS_UTIL > posted in [1]. We leverage the nr to suppress redundant scanning in the > current approach and seems SIS_UTIL is more efficient in this case. 
> > 5.19-rc1 patched patched+SIS_UTIL[1] > TPS-16threads 6215.11 6172.74 (-0.68%) 6217.33 (0.04%) > QPS-16threads 124302.21 123454.68 (-0.68%) 124346.52 (0.04%) > avg-lat-16threads 2.57 2.59 (-0.65%) 2.57 (0.00%) > TPS-24threads 8726.40 8690.87 (-0.41%) 8833.08 (1.22%) > QPS-24threads 174527.88 173817.42 (-0.41%) 176661.54 (1.21%) > avg-lat-24threads 2.75 2.76 (-0.36%) 2.71 (1.33%) > TPS-32threads 9555.42 9514.86 (-0.42%) 10010.87 (4.77%) > QPS-32threads 191108.37 190297.28 (-0.42%) 200217.35 (4.55%) > avg-lat-32threads 3.35 3.36 (-0.30%) 3.20 (4.58%) > TPS-64threads 10290.10 10324.75 (0.34%) 10819.77 (5.15%) > QPS-64threads 205802.05 206494.95 (0.34%) 216395.40 (4.90%) > avg-lat-64threads 6.22 6.20 (0.38%) 5.92 (4.88%) > > Thanks for the testings. I'm refining the patch according to Mel's suggestion and will send it out later. Some minor nits below. > > Code itself looks good. > > > > Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com> > > > > Thanks. > > [1] https://lore.kernel.org/lkml/20220428182442.659294-1-yu.c.chen@intel.com/ > > >> > >> Tested-by: Yicong Yang <yangyicong@hisilicon.com> > >> Signed-off-by: Barry Song <song.bao.hua@hisilicon.com> > >> Signed-off-by: Yicong Yang <yangyicong@hisilicon.com> > >> --- > >> kernel/sched/fair.c | 44 +++++++++++++++++++++++++++++++++++++++++--- > >> 1 file changed, 41 insertions(+), 3 deletions(-) > >> > >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > >> index 77b2048a9326..6d173e196ad3 100644 > >> --- a/kernel/sched/fair.c > >> +++ b/kernel/sched/fair.c > >> @@ -6327,6 +6327,40 @@ static inline int select_idle_smt(struct task_struct *p, struct sched_domain *sd > >> > >> #endif /* CONFIG_SCHED_SMT */ > >> > >> +#ifdef CONFIG_SCHED_CLUSTER > >> +/* > >> + * Scan the cluster domain for idle CPUs and clear cluster cpumask after scanning > >> + */ > >> +static inline int scan_cluster(struct task_struct *p, struct cpumask *cpus, > >> + int target, int *nr) > >> +{ > >> + struct sched_domain *sd = rcu_dereference(per_cpu(sd_cluster, target)); > >> + int cpu, idle_cpu; > >> + > >> + /* TODO: Support SMT system with cluster topology */ > >> + if (!sched_smt_active() && sd) { > >> + for_each_cpu_and(cpu, cpus, sched_domain_span(sd)) { > >> + if (!--*nr) > >> + break; Maybe I understand it incorrectly, here we changed the value of nr. Then if we did not find any idle CPU in the cluster, nr is set to 0. Then in the LLC scan, since there is 'if (--nr <= 0)', the LLC scan terminates. Is the policy like: if we can not find one idle CPU in the cluster, we give up scan for one in the LLC? [cut] > >> + > >> /* > >> * Scan the LLC domain for idle CPUs; this is dynamically regulated by If we add logic for cluster scan, might need to update the comment as well. > >> * comparing the average scan cost (tracked in sd->avg_scan_cost) against the > >> @@ -6375,6 +6409,10 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool > >> time = cpu_clock(this); > >> } > >> > >> + idle_cpu = scan_cluster(p, cpus, target, &nr); > >> + if ((unsigned int)idle_cpu < nr_cpumask_bits) > >> + return idle_cpu; > >> + > >> for_each_cpu_wrap(cpu, cpus, target + 1) { > >> if (has_idle_core) { > >> i = select_idle_core(p, cpu, cpus, &idle_cpu); > >> @@ -6382,7 +6420,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool > >> return i; > >> > >> } else { > >> - if (!--nr) > >> + if (--nr <= 0) May I know the reason to change this? I thought this could be a seperate patch. 
thanks, Chenyu > >> return -1; > >> idle_cpu = __select_idle_cpu(cpu, p); > >> if ((unsigned int)idle_cpu < nr_cpumask_bits) > >> @@ -6481,7 +6519,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target) > >> /* > >> * If the previous CPU is cache affine and idle, don't be stupid: > >> */ > >> - if (prev != target && cpus_share_cache(prev, target) && > >> + if (prev != target && cpus_share_resources(prev, target) && > >> (available_idle_cpu(prev) || sched_idle_cpu(prev)) && > >> asym_fits_capacity(task_util, prev)) > >> return prev; > >> @@ -6507,7 +6545,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target) > >> p->recent_used_cpu = prev; > >> if (recent_used_cpu != prev && > >> recent_used_cpu != target && > >> - cpus_share_cache(recent_used_cpu, target) && > >> + cpus_share_resources(recent_used_cpu, target) && > >> (available_idle_cpu(recent_used_cpu) || sched_idle_cpu(recent_used_cpu)) && > >> cpumask_test_cpu(p->recent_used_cpu, p->cpus_ptr) && > >> asym_fits_capacity(task_util, recent_used_cpu)) { > > > > > > . > > _______________________________________________ linux-arm-kernel mailing list linux-arm-kernel@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-arm-kernel ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH v4 2/2] sched/fair: Scan cluster before scanning LLC in wake-up path 2022-06-11 3:03 ` Chen Yu @ 2022-06-11 7:40 ` Yicong Yang 2022-06-11 9:04 ` Chen Yu 0 siblings, 1 reply; 23+ messages in thread From: Yicong Yang @ 2022-06-11 7:40 UTC (permalink / raw) To: Chen Yu Cc: yangyicong, Tim Chen, peterz, mingo, juri.lelli, vincent.guittot, gautham.shenoy, linux-kernel, linux-arm-kernel, dietmar.eggemann, rostedt, bsegall, bristot, prime.zeng, jonathan.cameron, ego, srikar, linuxarm, 21cnbao, guodong.xu, hesham.almatary, john.garry, shenyang39 On 2022/6/11 11:03, Chen Yu wrote: > Hi Yicong, > On Fri, Jun 10, 2022 at 02:39:38PM +0800, Yicong Yang wrote: >> On 2022/6/10 6:47, Tim Chen wrote: >>> On Thu, 2022-06-09 at 20:06 +0800, Yicong Yang wrote: >>>> From: Barry Song <song.bao.hua@hisilicon.com> >>>> >>>> For platforms having clusters like Kunpeng920, CPUs within the same cluster >>>> have lower latency when synchronizing and accessing shared resources like >>>> cache. Thus, this patch tries to find an idle cpu within the cluster of the >>>> target CPU before scanning the whole LLC to gain lower latency. >>>> >>>> Note neither Kunpeng920 nor x86 Jacobsville supports SMT, so this patch >>>> doesn't consider SMT for this moment. >>>> >>>> Testing has been done on Kunpeng920 by pinning tasks to one numa and two >>>> numa. On Kunpeng920, Each numa has 8 clusters and each cluster has 4 CPUs. >>>> >>>> With this patch, We noticed enhancement on tbench within one numa or cross >>>> two numa. >>>> >>>> On numa 0: >>>> 5.19-rc1 patched >>>> Hmean 1 350.27 ( 0.00%) 406.88 * 16.16%* >>>> Hmean 2 702.01 ( 0.00%) 808.22 * 15.13%* >>>> Hmean 4 1405.14 ( 0.00%) 1614.34 * 14.89%* >>>> Hmean 8 2830.53 ( 0.00%) 3169.02 * 11.96%* >>>> Hmean 16 5597.95 ( 0.00%) 6224.20 * 11.19%* >>>> Hmean 32 10537.38 ( 0.00%) 10524.97 * -0.12%* >>>> Hmean 64 8366.04 ( 0.00%) 8437.41 * 0.85%* >>>> Hmean 128 7060.87 ( 0.00%) 7150.25 * 1.27%* >>>> >>>> On numa 0-1: >>>> 5.19-rc1 patched >>>> Hmean 1 346.11 ( 0.00%) 408.47 * 18.02%* >>>> Hmean 2 693.34 ( 0.00%) 805.78 * 16.22%* >>>> Hmean 4 1384.96 ( 0.00%) 1602.49 * 15.71%* >>>> Hmean 8 2699.45 ( 0.00%) 3069.98 * 13.73%* >>>> Hmean 16 5327.11 ( 0.00%) 5688.19 * 6.78%* >>>> Hmean 32 10019.10 ( 0.00%) 11862.56 * 18.40%* >>>> Hmean 64 13850.57 ( 0.00%) 17748.54 * 28.14%* >>>> Hmean 128 12498.25 ( 0.00%) 15541.59 * 24.35%* >>>> Hmean 256 11195.77 ( 0.00%) 13854.06 * 23.74%* >>> >>> Yicong, >>> >>> Have you tried any workload where tasks don't share data >>> with each other but have sleep/wakeup? That's the case >>> where we actually want to spread the tasks out among the clusters >>> to void contention for L2 cache. >>> >>> Will be nice to make sure there's no regression there for >>> such workload. >>> >> >> Any certain workload you'd like me test? I'm willing to do :) >> >> I've tested this patch with MySQL as well (like in v2). This won't hurt >> the MySQL case with SIS_PROP but observed some improvement with SIS_UTIL >> posted in [1]. We leverage the nr to suppress redundant scanning in the >> current approach and seems SIS_UTIL is more efficient in this case. 
>> >> 5.19-rc1 patched patched+SIS_UTIL[1] >> TPS-16threads 6215.11 6172.74 (-0.68%) 6217.33 (0.04%) >> QPS-16threads 124302.21 123454.68 (-0.68%) 124346.52 (0.04%) >> avg-lat-16threads 2.57 2.59 (-0.65%) 2.57 (0.00%) >> TPS-24threads 8726.40 8690.87 (-0.41%) 8833.08 (1.22%) >> QPS-24threads 174527.88 173817.42 (-0.41%) 176661.54 (1.21%) >> avg-lat-24threads 2.75 2.76 (-0.36%) 2.71 (1.33%) >> TPS-32threads 9555.42 9514.86 (-0.42%) 10010.87 (4.77%) >> QPS-32threads 191108.37 190297.28 (-0.42%) 200217.35 (4.55%) >> avg-lat-32threads 3.35 3.36 (-0.30%) 3.20 (4.58%) >> TPS-64threads 10290.10 10324.75 (0.34%) 10819.77 (5.15%) >> QPS-64threads 205802.05 206494.95 (0.34%) 216395.40 (4.90%) >> avg-lat-64threads 6.22 6.20 (0.38%) 5.92 (4.88%) >> >> > Thanks for the testings. I'm refining the patch according to Mel's suggestion and > will send it out later. Some minor nits below. >>> Code itself looks good. >>> >>> Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com> >>> >> >> Thanks. >> >> [1] https://lore.kernel.org/lkml/20220428182442.659294-1-yu.c.chen@intel.com/ >> >>>> >>>> Tested-by: Yicong Yang <yangyicong@hisilicon.com> >>>> Signed-off-by: Barry Song <song.bao.hua@hisilicon.com> >>>> Signed-off-by: Yicong Yang <yangyicong@hisilicon.com> >>>> --- >>>> kernel/sched/fair.c | 44 +++++++++++++++++++++++++++++++++++++++++--- >>>> 1 file changed, 41 insertions(+), 3 deletions(-) >>>> >>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c >>>> index 77b2048a9326..6d173e196ad3 100644 >>>> --- a/kernel/sched/fair.c >>>> +++ b/kernel/sched/fair.c >>>> @@ -6327,6 +6327,40 @@ static inline int select_idle_smt(struct task_struct *p, struct sched_domain *sd >>>> >>>> #endif /* CONFIG_SCHED_SMT */ >>>> >>>> +#ifdef CONFIG_SCHED_CLUSTER >>>> +/* >>>> + * Scan the cluster domain for idle CPUs and clear cluster cpumask after scanning >>>> + */ >>>> +static inline int scan_cluster(struct task_struct *p, struct cpumask *cpus, >>>> + int target, int *nr) >>>> +{ >>>> + struct sched_domain *sd = rcu_dereference(per_cpu(sd_cluster, target)); >>>> + int cpu, idle_cpu; >>>> + >>>> + /* TODO: Support SMT system with cluster topology */ >>>> + if (!sched_smt_active() && sd) { >>>> + for_each_cpu_and(cpu, cpus, sched_domain_span(sd)) { >>>> + if (!--*nr) >>>> + break; > Maybe I understand it incorrectly, here we changed the value of nr. Then if > we did not find any idle CPU in the cluster, nr is set to 0. Then in the LLC scan, > since there is 'if (--nr <= 0)', the LLC scan terminates. > Is the policy like: if we can not find one idle CPU in the cluster, we give > up scan for one in the LLC? > [cut] We're still scanning in the LLC since cluster is within the LLC (on Kunpeng 920 cluster cpus share L3 Tag and on Jacobsville cluster cpus share L2, while on both platform LLC should be the L3 cache). So if we've run out of nr we should stop scanning the rest of LLC as well and ... >>>> + >>>> /* >>>> * Scan the LLC domain for idle CPUs; this is dynamically regulated by > If we add logic for cluster scan, might need to update the comment as well. ...the comment here still stands. 
>>>> * comparing the average scan cost (tracked in sd->avg_scan_cost) against the >>>> @@ -6375,6 +6409,10 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool >>>> time = cpu_clock(this); >>>> } >>>> >>>> + idle_cpu = scan_cluster(p, cpus, target, &nr); >>>> + if ((unsigned int)idle_cpu < nr_cpumask_bits) >>>> + return idle_cpu; >>>> + >>>> for_each_cpu_wrap(cpu, cpus, target + 1) { >>>> if (has_idle_core) { >>>> i = select_idle_core(p, cpu, cpus, &idle_cpu); >>>> @@ -6382,7 +6420,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool >>>> return i; >>>> >>>> } else { >>>> - if (!--nr) >>>> + if (--nr <= 0) > May I know the reason to change this? I thought this could be a seperate patch. > scan_cluster() may run out of nr (on Kunpeng 920 there're 4 cpus in one cluster and nr will be at least 4 with SIS_PROP) and we cannot stop scanning correctly without this change. So the change should be in this patch. Thanks, Yicong > thanks, > Chenyu >>>> return -1; >>>> idle_cpu = __select_idle_cpu(cpu, p); >>>> if ((unsigned int)idle_cpu < nr_cpumask_bits) >>>> @@ -6481,7 +6519,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target) >>>> /* >>>> * If the previous CPU is cache affine and idle, don't be stupid: >>>> */ >>>> - if (prev != target && cpus_share_cache(prev, target) && >>>> + if (prev != target && cpus_share_resources(prev, target) && >>>> (available_idle_cpu(prev) || sched_idle_cpu(prev)) && >>>> asym_fits_capacity(task_util, prev)) >>>> return prev; >>>> @@ -6507,7 +6545,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target) >>>> p->recent_used_cpu = prev; >>>> if (recent_used_cpu != prev && >>>> recent_used_cpu != target && >>>> - cpus_share_cache(recent_used_cpu, target) && >>>> + cpus_share_resources(recent_used_cpu, target) && >>>> (available_idle_cpu(recent_used_cpu) || sched_idle_cpu(recent_used_cpu)) && >>>> cpumask_test_cpu(p->recent_used_cpu, p->cpus_ptr) && >>>> asym_fits_capacity(task_util, recent_used_cpu)) { >>> >>> >>> . >>> > . > _______________________________________________ linux-arm-kernel mailing list linux-arm-kernel@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-arm-kernel ^ permalink raw reply [flat|nested] 23+ messages in thread
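To make the shared-budget point in the exchange above concrete: scan_cluster() and the LLC walk in select_idle_cpu() decrement one common nr, so a budget that runs out inside the cluster ends the whole search. The following is a minimal userspace sketch of that behaviour only, not kernel code; the CPU ids, the idle pattern and the *_model helpers are invented for the example, and SIS_PROP/SIS_UTIL themselves are not modelled.

#include <stdio.h>
#include <stdbool.h>

static bool cpu_is_idle(int cpu)
{
	/* Pretend only CPU 6 is idle, so the cluster scan finds nothing. */
	return cpu == 6;
}

/* Mirrors the shape of scan_cluster(): consumes from the shared budget. */
static int scan_cluster_model(const int *cluster, int ncluster, int *nr)
{
	for (int i = 0; i < ncluster; i++) {
		if (!--*nr)
			break;
		if (cpu_is_idle(cluster[i]))
			return cluster[i];
	}
	return -1;
}

static int select_idle_cpu_model(const int *cluster, int ncluster,
				 const int *llc, int nllc, int nr)
{
	int cpu = scan_cluster_model(cluster, ncluster, &nr);

	if (cpu >= 0)
		return cpu;

	/* The LLC walk only gets whatever budget the cluster scan left over. */
	for (int i = 0; i < nllc; i++) {
		if (--nr <= 0)
			return -1;
		if (cpu_is_idle(llc[i]))
			return llc[i];
	}
	return -1;
}

int main(void)
{
	int cluster[] = { 0, 1, 2, 3 };	/* 4 CPUs per cluster, as on Kunpeng 920 */
	int llc[] = { 4, 5, 6, 7 };	/* the rest of the LLC */

	printf("nr=4 -> %d\n", select_idle_cpu_model(cluster, 4, llc, 4, 4));
	printf("nr=8 -> %d\n", select_idle_cpu_model(cluster, 4, llc, 4, 8));
	return 0;
}

With a budget of 4 the cluster scan spends it all and the search returns -1 without touching the rest of the LLC; with a budget of 8 the leftover lets the LLC walk reach the idle CPU 6.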
* Re: [PATCH v4 2/2] sched/fair: Scan cluster before scanning LLC in wake-up path 2022-06-11 7:40 ` Yicong Yang @ 2022-06-11 9:04 ` Chen Yu 0 siblings, 0 replies; 23+ messages in thread From: Chen Yu @ 2022-06-11 9:04 UTC (permalink / raw) To: Yicong Yang Cc: yangyicong, Tim Chen, peterz, mingo, juri.lelli, vincent.guittot, gautham.shenoy, linux-kernel, linux-arm-kernel, dietmar.eggemann, rostedt, bsegall, bristot, prime.zeng, jonathan.cameron, ego, srikar, linuxarm, 21cnbao, guodong.xu, hesham.almatary, john.garry, shenyang39 On Sat, Jun 11, 2022 at 03:40:23PM +0800, Yicong Yang wrote: > On 2022/6/11 11:03, Chen Yu wrote: > > Hi Yicong, > > On Fri, Jun 10, 2022 at 02:39:38PM +0800, Yicong Yang wrote: > >> On 2022/6/10 6:47, Tim Chen wrote: > >>> On Thu, 2022-06-09 at 20:06 +0800, Yicong Yang wrote: > >>>> From: Barry Song <song.bao.hua@hisilicon.com> > >>>> > >>>> For platforms having clusters like Kunpeng920, CPUs within the same cluster > >>>> have lower latency when synchronizing and accessing shared resources like > >>>> cache. Thus, this patch tries to find an idle cpu within the cluster of the > >>>> target CPU before scanning the whole LLC to gain lower latency. > >>>> > >>>> Note neither Kunpeng920 nor x86 Jacobsville supports SMT, so this patch > >>>> doesn't consider SMT for this moment. > >>>> > >>>> Testing has been done on Kunpeng920 by pinning tasks to one numa and two > >>>> numa. On Kunpeng920, Each numa has 8 clusters and each cluster has 4 CPUs. > >>>> > >>>> With this patch, We noticed enhancement on tbench within one numa or cross > >>>> two numa. > >>>> > >>>> On numa 0: > >>>> 5.19-rc1 patched > >>>> Hmean 1 350.27 ( 0.00%) 406.88 * 16.16%* > >>>> Hmean 2 702.01 ( 0.00%) 808.22 * 15.13%* > >>>> Hmean 4 1405.14 ( 0.00%) 1614.34 * 14.89%* > >>>> Hmean 8 2830.53 ( 0.00%) 3169.02 * 11.96%* > >>>> Hmean 16 5597.95 ( 0.00%) 6224.20 * 11.19%* > >>>> Hmean 32 10537.38 ( 0.00%) 10524.97 * -0.12%* > >>>> Hmean 64 8366.04 ( 0.00%) 8437.41 * 0.85%* > >>>> Hmean 128 7060.87 ( 0.00%) 7150.25 * 1.27%* > >>>> > >>>> On numa 0-1: > >>>> 5.19-rc1 patched > >>>> Hmean 1 346.11 ( 0.00%) 408.47 * 18.02%* > >>>> Hmean 2 693.34 ( 0.00%) 805.78 * 16.22%* > >>>> Hmean 4 1384.96 ( 0.00%) 1602.49 * 15.71%* > >>>> Hmean 8 2699.45 ( 0.00%) 3069.98 * 13.73%* > >>>> Hmean 16 5327.11 ( 0.00%) 5688.19 * 6.78%* > >>>> Hmean 32 10019.10 ( 0.00%) 11862.56 * 18.40%* > >>>> Hmean 64 13850.57 ( 0.00%) 17748.54 * 28.14%* > >>>> Hmean 128 12498.25 ( 0.00%) 15541.59 * 24.35%* > >>>> Hmean 256 11195.77 ( 0.00%) 13854.06 * 23.74%* > >>> > >>> Yicong, > >>> > >>> Have you tried any workload where tasks don't share data > >>> with each other but have sleep/wakeup? That's the case > >>> where we actually want to spread the tasks out among the clusters > >>> to void contention for L2 cache. > >>> > >>> Will be nice to make sure there's no regression there for > >>> such workload. > >>> > >> > >> Any certain workload you'd like me test? I'm willing to do :) > >> > >> I've tested this patch with MySQL as well (like in v2). This won't hurt > >> the MySQL case with SIS_PROP but observed some improvement with SIS_UTIL > >> posted in [1]. We leverage the nr to suppress redundant scanning in the > >> current approach and seems SIS_UTIL is more efficient in this case. 
> >> > >> 5.19-rc1 patched patched+SIS_UTIL[1] > >> TPS-16threads 6215.11 6172.74 (-0.68%) 6217.33 (0.04%) > >> QPS-16threads 124302.21 123454.68 (-0.68%) 124346.52 (0.04%) > >> avg-lat-16threads 2.57 2.59 (-0.65%) 2.57 (0.00%) > >> TPS-24threads 8726.40 8690.87 (-0.41%) 8833.08 (1.22%) > >> QPS-24threads 174527.88 173817.42 (-0.41%) 176661.54 (1.21%) > >> avg-lat-24threads 2.75 2.76 (-0.36%) 2.71 (1.33%) > >> TPS-32threads 9555.42 9514.86 (-0.42%) 10010.87 (4.77%) > >> QPS-32threads 191108.37 190297.28 (-0.42%) 200217.35 (4.55%) > >> avg-lat-32threads 3.35 3.36 (-0.30%) 3.20 (4.58%) > >> TPS-64threads 10290.10 10324.75 (0.34%) 10819.77 (5.15%) > >> QPS-64threads 205802.05 206494.95 (0.34%) 216395.40 (4.90%) > >> avg-lat-64threads 6.22 6.20 (0.38%) 5.92 (4.88%) > >> > >> > > Thanks for the testings. I'm refining the patch according to Mel's suggestion and > > will send it out later. Some minor nits below. > >>> Code itself looks good. > >>> > >>> Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com> > >>> > >> > >> Thanks. > >> > >> [1] https://lore.kernel.org/lkml/20220428182442.659294-1-yu.c.chen@intel.com/ > >> > >>>> > >>>> Tested-by: Yicong Yang <yangyicong@hisilicon.com> > >>>> Signed-off-by: Barry Song <song.bao.hua@hisilicon.com> > >>>> Signed-off-by: Yicong Yang <yangyicong@hisilicon.com> > >>>> --- > >>>> kernel/sched/fair.c | 44 +++++++++++++++++++++++++++++++++++++++++--- > >>>> 1 file changed, 41 insertions(+), 3 deletions(-) > >>>> > >>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > >>>> index 77b2048a9326..6d173e196ad3 100644 > >>>> --- a/kernel/sched/fair.c > >>>> +++ b/kernel/sched/fair.c > >>>> @@ -6327,6 +6327,40 @@ static inline int select_idle_smt(struct task_struct *p, struct sched_domain *sd > >>>> > >>>> #endif /* CONFIG_SCHED_SMT */ > >>>> > >>>> +#ifdef CONFIG_SCHED_CLUSTER > >>>> +/* > >>>> + * Scan the cluster domain for idle CPUs and clear cluster cpumask after scanning > >>>> + */ > >>>> +static inline int scan_cluster(struct task_struct *p, struct cpumask *cpus, > >>>> + int target, int *nr) > >>>> +{ > >>>> + struct sched_domain *sd = rcu_dereference(per_cpu(sd_cluster, target)); > >>>> + int cpu, idle_cpu; > >>>> + > >>>> + /* TODO: Support SMT system with cluster topology */ > >>>> + if (!sched_smt_active() && sd) { > >>>> + for_each_cpu_and(cpu, cpus, sched_domain_span(sd)) { > >>>> + if (!--*nr) > >>>> + break; > > Maybe I understand it incorrectly, here we changed the value of nr. Then if > > we did not find any idle CPU in the cluster, nr is set to 0. Then in the LLC scan, > > since there is 'if (--nr <= 0)', the LLC scan terminates. > > Is the policy like: if we can not find one idle CPU in the cluster, we give > > up scan for one in the LLC? > > [cut] > > We're still scanning in the LLC since cluster is within the LLC (on Kunpeng 920 cluster cpus > share L3 Tag and on Jacobsville cluster cpus share L2, while on both platform LLC should be > the L3 cache). So if we've run out of nr we should stop scanning the rest of LLC as well > and ... > I see, the budget has been exhausted and we don't spend more time on LLC. > >>>> + > >>>> /* > >>>> * Scan the LLC domain for idle CPUs; this is dynamically regulated by > > If we add logic for cluster scan, might need to update the comment as well. > > ...the comment here still stands. 
> > >>>> * comparing the average scan cost (tracked in sd->avg_scan_cost) against the > >>>> @@ -6375,6 +6409,10 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool > >>>> time = cpu_clock(this); > >>>> } > >>>> > >>>> + idle_cpu = scan_cluster(p, cpus, target, &nr); > >>>> + if ((unsigned int)idle_cpu < nr_cpumask_bits) > >>>> + return idle_cpu; > >>>> + > >>>> for_each_cpu_wrap(cpu, cpus, target + 1) { > >>>> if (has_idle_core) { > >>>> i = select_idle_core(p, cpu, cpus, &idle_cpu); > >>>> @@ -6382,7 +6420,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool > >>>> return i; > >>>> > >>>> } else { > >>>> - if (!--nr) > >>>> + if (--nr <= 0) > > May I know the reason to change this? I thought this could be a seperate patch. > > > > scan_cluster() may run out of nr (on Kunpeng 920 there're 4 cpus in one cluster and nr will be at least 4 with > SIS_PROP) and we cannot stop scanning correctly without this change. So the change should be in this patch. > So when nr reaches 0, the original loop could not stop. Thanks for the explanation. thanks, Chenyu > Thanks, > Yicong > > > thanks, > > Chenyu > >>>> return -1; > >>>> idle_cpu = __select_idle_cpu(cpu, p); > >>>> if ((unsigned int)idle_cpu < nr_cpumask_bits) > >>>> @@ -6481,7 +6519,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target) > >>>> /* > >>>> * If the previous CPU is cache affine and idle, don't be stupid: > >>>> */ > >>>> - if (prev != target && cpus_share_cache(prev, target) && > >>>> + if (prev != target && cpus_share_resources(prev, target) && > >>>> (available_idle_cpu(prev) || sched_idle_cpu(prev)) && > >>>> asym_fits_capacity(task_util, prev)) > >>>> return prev; > >>>> @@ -6507,7 +6545,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target) > >>>> p->recent_used_cpu = prev; > >>>> if (recent_used_cpu != prev && > >>>> recent_used_cpu != target && > >>>> - cpus_share_cache(recent_used_cpu, target) && > >>>> + cpus_share_resources(recent_used_cpu, target) && > >>>> (available_idle_cpu(recent_used_cpu) || sched_idle_cpu(recent_used_cpu)) && > >>>> cpumask_test_cpu(p->recent_used_cpu, p->cpus_ptr) && > >>>> asym_fits_capacity(task_util, recent_used_cpu)) { > >>> > >>> > >>> . > >>> > > . > > _______________________________________________ linux-arm-kernel mailing list linux-arm-kernel@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-arm-kernel ^ permalink raw reply [flat|nested] 23+ messages in thread
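The predicate change itself can be checked in isolation: if scan_cluster() hands back nr == 0, the old test !--nr decrements it to -1 and evaluates to false, so the LLC loop would keep scanning on a negative budget, while --nr <= 0 stops it immediately. A throwaway check in plain C, nothing kernel-specific and with arbitrary variable names:

#include <stdio.h>

int main(void)
{
	int a = 0, b = 0;	/* nr as left behind by an exhausted cluster scan */

	/* old form: 0 becomes -1, !(-1) is false -> the loop would continue */
	printf("!--nr     stops: %s\n", !--a ? "yes" : "no");
	/* new form: 0 becomes -1, -1 <= 0 is true -> the loop stops */
	printf("--nr <= 0 stops: %s\n", --b <= 0 ? "yes" : "no");
	return 0;
}

It prints "no" for the old form and "yes" for the new one.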
* Re: [PATCH v4 2/2] sched/fair: Scan cluster before scanning LLC in wake-up path 2022-06-09 12:06 ` [PATCH v4 2/2] sched/fair: Scan cluster before scanning LLC in wake-up path Yicong Yang 2022-06-09 22:47 ` Tim Chen @ 2022-06-26 12:13 ` Abel Wu 2022-06-27 8:16 ` Yicong Yang 1 sibling, 1 reply; 23+ messages in thread From: Abel Wu @ 2022-06-26 12:13 UTC (permalink / raw) To: Yicong Yang, peterz, mingo, juri.lelli, vincent.guittot, tim.c.chen, gautham.shenoy, linux-kernel, linux-arm-kernel Cc: dietmar.eggemann, rostedt, bsegall, bristot, prime.zeng, jonathan.cameron, ego, srikar, linuxarm, 21cnbao, guodong.xu, hesham.almatary, john.garry, shenyang39 On 6/9/22 8:06 PM, Yicong Yang Wrote: > From: Barry Song <song.bao.hua@hisilicon.com> > > For platforms having clusters like Kunpeng920, CPUs within the same cluster > have lower latency when synchronizing and accessing shared resources like > cache. Thus, this patch tries to find an idle cpu within the cluster of the > target CPU before scanning the whole LLC to gain lower latency. > > Note neither Kunpeng920 nor x86 Jacobsville supports SMT, so this patch > doesn't consider SMT for this moment. > > Testing has been done on Kunpeng920 by pinning tasks to one numa and two > numa. On Kunpeng920, Each numa has 8 clusters and each cluster has 4 CPUs. > > With this patch, We noticed enhancement on tbench within one numa or cross > two numa. > > On numa 0: > 5.19-rc1 patched > Hmean 1 350.27 ( 0.00%) 406.88 * 16.16%* > Hmean 2 702.01 ( 0.00%) 808.22 * 15.13%* > Hmean 4 1405.14 ( 0.00%) 1614.34 * 14.89%* > Hmean 8 2830.53 ( 0.00%) 3169.02 * 11.96%* > Hmean 16 5597.95 ( 0.00%) 6224.20 * 11.19%* > Hmean 32 10537.38 ( 0.00%) 10524.97 * -0.12%* > Hmean 64 8366.04 ( 0.00%) 8437.41 * 0.85%* > Hmean 128 7060.87 ( 0.00%) 7150.25 * 1.27%* > > On numa 0-1: > 5.19-rc1 patched > Hmean 1 346.11 ( 0.00%) 408.47 * 18.02%* > Hmean 2 693.34 ( 0.00%) 805.78 * 16.22%* > Hmean 4 1384.96 ( 0.00%) 1602.49 * 15.71%* > Hmean 8 2699.45 ( 0.00%) 3069.98 * 13.73%* > Hmean 16 5327.11 ( 0.00%) 5688.19 * 6.78%* > Hmean 32 10019.10 ( 0.00%) 11862.56 * 18.40%* > Hmean 64 13850.57 ( 0.00%) 17748.54 * 28.14%* > Hmean 128 12498.25 ( 0.00%) 15541.59 * 24.35%* > Hmean 256 11195.77 ( 0.00%) 13854.06 * 23.74%* > > Tested-by: Yicong Yang <yangyicong@hisilicon.com> > Signed-off-by: Barry Song <song.bao.hua@hisilicon.com> > Signed-off-by: Yicong Yang <yangyicong@hisilicon.com> > --- > kernel/sched/fair.c | 44 +++++++++++++++++++++++++++++++++++++++++--- > 1 file changed, 41 insertions(+), 3 deletions(-) > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > index 77b2048a9326..6d173e196ad3 100644 > --- a/kernel/sched/fair.c > +++ b/kernel/sched/fair.c > @@ -6327,6 +6327,40 @@ static inline int select_idle_smt(struct task_struct *p, struct sched_domain *sd > > #endif /* CONFIG_SCHED_SMT */ > > +#ifdef CONFIG_SCHED_CLUSTER > +/* > + * Scan the cluster domain for idle CPUs and clear cluster cpumask after scanning > + */ > +static inline int scan_cluster(struct task_struct *p, struct cpumask *cpus, > + int target, int *nr) > +{ > + struct sched_domain *sd = rcu_dereference(per_cpu(sd_cluster, target)); > + int cpu, idle_cpu; > + > + /* TODO: Support SMT system with cluster topology */ > + if (!sched_smt_active() && sd) { > + for_each_cpu_and(cpu, cpus, sched_domain_span(sd)) { > + if (!--*nr) > + break; return -1; :) > + > + idle_cpu = __select_idle_cpu(cpu, p); > + if ((unsigned int)idle_cpu < nr_cpumask_bits) > + return idle_cpu; > + } > + > + cpumask_andnot(cpus, cpus, 
sched_domain_span(sd)); > + } > + > + return -1; > +} > +#else > +static inline int scan_cluster(struct task_struct *p, struct cpumask *cpus, > + int target, int *nr) > +{ > + return -1; > +} > +#endif > + > /* > * Scan the LLC domain for idle CPUs; this is dynamically regulated by > * comparing the average scan cost (tracked in sd->avg_scan_cost) against the > @@ -6375,6 +6409,10 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool > time = cpu_clock(this); > } > > + idle_cpu = scan_cluster(p, cpus, target, &nr); > + if ((unsigned int)idle_cpu < nr_cpumask_bits) > + return idle_cpu; > + > for_each_cpu_wrap(cpu, cpus, target + 1) { > if (has_idle_core) { > i = select_idle_core(p, cpu, cpus, &idle_cpu); > @@ -6382,7 +6420,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool > return i; > > } else { > - if (!--nr) > + if (--nr <= 0) > return -1; > idle_cpu = __select_idle_cpu(cpu, p); > if ((unsigned int)idle_cpu < nr_cpumask_bits) > @@ -6481,7 +6519,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target) > /* > * If the previous CPU is cache affine and idle, don't be stupid: > */ > - if (prev != target && cpus_share_cache(prev, target) && > + if (prev != target && cpus_share_resources(prev, target) && > (available_idle_cpu(prev) || sched_idle_cpu(prev)) && > asym_fits_capacity(task_util, prev)) > return prev; > @@ -6507,7 +6545,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target) > p->recent_used_cpu = prev; > if (recent_used_cpu != prev && > recent_used_cpu != target && > - cpus_share_cache(recent_used_cpu, target) && > + cpus_share_resources(recent_used_cpu, target) && > (available_idle_cpu(recent_used_cpu) || sched_idle_cpu(recent_used_cpu)) && > cpumask_test_cpu(p->recent_used_cpu, p->cpus_ptr) && > asym_fits_capacity(task_util, recent_used_cpu)) { _______________________________________________ linux-arm-kernel mailing list linux-arm-kernel@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-arm-kernel ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH v4 2/2] sched/fair: Scan cluster before scanning LLC in wake-up path 2022-06-26 12:13 ` Abel Wu @ 2022-06-27 8:16 ` Yicong Yang 0 siblings, 0 replies; 23+ messages in thread From: Yicong Yang @ 2022-06-27 8:16 UTC (permalink / raw) To: Abel Wu, Yicong Yang, peterz, mingo, juri.lelli, vincent.guittot, tim.c.chen, gautham.shenoy, linux-kernel, linux-arm-kernel Cc: dietmar.eggemann, rostedt, bsegall, bristot, prime.zeng, jonathan.cameron, ego, srikar, linuxarm, 21cnbao, guodong.xu, hesham.almatary, john.garry, shenyang39 On 2022/6/26 20:13, Abel Wu wrote: > > On 6/9/22 8:06 PM, Yicong Yang Wrote: >> From: Barry Song <song.bao.hua@hisilicon.com> >> >> For platforms having clusters like Kunpeng920, CPUs within the same cluster >> have lower latency when synchronizing and accessing shared resources like >> cache. Thus, this patch tries to find an idle cpu within the cluster of the >> target CPU before scanning the whole LLC to gain lower latency. >> >> Note neither Kunpeng920 nor x86 Jacobsville supports SMT, so this patch >> doesn't consider SMT for this moment. >> >> Testing has been done on Kunpeng920 by pinning tasks to one numa and two >> numa. On Kunpeng920, Each numa has 8 clusters and each cluster has 4 CPUs. >> >> With this patch, We noticed enhancement on tbench within one numa or cross >> two numa. >> >> On numa 0: >> 5.19-rc1 patched >> Hmean 1 350.27 ( 0.00%) 406.88 * 16.16%* >> Hmean 2 702.01 ( 0.00%) 808.22 * 15.13%* >> Hmean 4 1405.14 ( 0.00%) 1614.34 * 14.89%* >> Hmean 8 2830.53 ( 0.00%) 3169.02 * 11.96%* >> Hmean 16 5597.95 ( 0.00%) 6224.20 * 11.19%* >> Hmean 32 10537.38 ( 0.00%) 10524.97 * -0.12%* >> Hmean 64 8366.04 ( 0.00%) 8437.41 * 0.85%* >> Hmean 128 7060.87 ( 0.00%) 7150.25 * 1.27%* >> >> On numa 0-1: >> 5.19-rc1 patched >> Hmean 1 346.11 ( 0.00%) 408.47 * 18.02%* >> Hmean 2 693.34 ( 0.00%) 805.78 * 16.22%* >> Hmean 4 1384.96 ( 0.00%) 1602.49 * 15.71%* >> Hmean 8 2699.45 ( 0.00%) 3069.98 * 13.73%* >> Hmean 16 5327.11 ( 0.00%) 5688.19 * 6.78%* >> Hmean 32 10019.10 ( 0.00%) 11862.56 * 18.40%* >> Hmean 64 13850.57 ( 0.00%) 17748.54 * 28.14%* >> Hmean 128 12498.25 ( 0.00%) 15541.59 * 24.35%* >> Hmean 256 11195.77 ( 0.00%) 13854.06 * 23.74%* >> >> Tested-by: Yicong Yang <yangyicong@hisilicon.com> >> Signed-off-by: Barry Song <song.bao.hua@hisilicon.com> >> Signed-off-by: Yicong Yang <yangyicong@hisilicon.com> >> --- >> kernel/sched/fair.c | 44 +++++++++++++++++++++++++++++++++++++++++--- >> 1 file changed, 41 insertions(+), 3 deletions(-) >> >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c >> index 77b2048a9326..6d173e196ad3 100644 >> --- a/kernel/sched/fair.c >> +++ b/kernel/sched/fair.c >> @@ -6327,6 +6327,40 @@ static inline int select_idle_smt(struct task_struct *p, struct sched_domain *sd >> #endif /* CONFIG_SCHED_SMT */ >> +#ifdef CONFIG_SCHED_CLUSTER >> +/* >> + * Scan the cluster domain for idle CPUs and clear cluster cpumask after scanning >> + */ >> +static inline int scan_cluster(struct task_struct *p, struct cpumask *cpus, >> + int target, int *nr) >> +{ >> + struct sched_domain *sd = rcu_dereference(per_cpu(sd_cluster, target)); >> + int cpu, idle_cpu; >> + >> + /* TODO: Support SMT system with cluster topology */ >> + if (!sched_smt_active() && sd) { >> + for_each_cpu_and(cpu, cpus, sched_domain_span(sd)) { >> + if (!--*nr) >> + break; > > return -1; > :) > thanks for the comment. If we've run out of nr the select_idle_cpu() will stop scanning as well, so it's unnecessary to clear the mask of cluster cpus and just return. 
_______________________________________________ linux-arm-kernel mailing list linux-arm-kernel@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-arm-kernel ^ permalink raw reply [flat|nested] 23+ messages in thread
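For readers following the break-versus-return question, a rough model of the exit paths of scan_cluster() may help. It is a userspace sketch with a plain bitmask standing in for the cpumask and an invented topology, and it shows the early-return variant Abel suggests, which the reply above treats as safe: when the budget runs out the mask is left untrimmed, and that costs nothing because nr is already 0 and the caller stops scanning.

#include <stdio.h>

#define CLUSTER_MASK 0x0fu	/* hypothetical: CPUs 0-3 form the cluster */

static int cpu_idle(int cpu)
{
	(void)cpu;
	return 0;	/* nothing idle, to exercise both fall-through paths */
}

static int scan_cluster_model(unsigned int *mask, int *nr)
{
	for (int cpu = 0; cpu < 4; cpu++) {
		if (!(*mask & (1u << cpu)))
			continue;
		if (!--*nr)
			return -1;	/* budget gone: caller bails out anyway, skip the mask update */
		if (cpu_idle(cpu))
			return cpu;
	}
	*mask &= ~CLUSTER_MASK;		/* whole cluster scanned: drop it from the later LLC walk */
	return -1;
}

int main(void)
{
	unsigned int mask = 0xffu;	/* CPUs 0-7 allowed */
	int nr = 16;

	scan_cluster_model(&mask, &nr);
	printf("large budget: mask=0x%02x nr=%d\n", mask, nr);

	mask = 0xffu;
	nr = 3;
	scan_cluster_model(&mask, &nr);
	printf("small budget: mask=0x%02x nr=%d\n", mask, nr);
	return 0;
}

The first call prints mask=0xf0 (cluster CPUs removed before the LLC walk); the second prints mask=0xff with nr=0, i.e. the cleanup was skipped exactly when it no longer matters.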
end of thread, other threads:[~2022-06-29 3:24 UTC | newest]
Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-06-09 12:06 [PATCH v4 0/2] sched/fair: Wake task within the cluster when possible Yicong Yang
2022-06-09 12:06 ` [PATCH v4 1/2] sched: Add per_cpu cluster domain info and cpus_share_resources API Yicong Yang
2022-06-09 22:28 ` Tim Chen
2022-06-10 6:54 ` Yicong Yang
2022-06-15 14:19 ` K Prateek Nayak
2022-06-15 15:43 ` Gautham R. Shenoy
2022-06-16 8:10 ` Yicong Yang
2022-06-16 7:55 ` Yicong Yang
[not found] ` <6bf4f032-7d07-d4a4-4f5a-28f3871131c0@amd.com>
2022-06-17 16:50 ` Tim Chen
2022-06-20 11:20 ` K Prateek Nayak
2022-06-20 13:37 ` Abel Wu
2022-06-28 11:55 ` K Prateek Nayak
2022-06-29 3:22 ` Yicong Yang
2022-06-09 12:06 ` [PATCH v4 2/2] sched/fair: Scan cluster before scanning LLC in wake-up path Yicong Yang
2022-06-09 22:47 ` Tim Chen
2022-06-10 4:01 ` Barry Song
2022-06-10 6:39 ` Yicong Yang
2022-06-10 21:19 ` Tim Chen
2022-06-11 3:03 ` Chen Yu
2022-06-11 7:40 ` Yicong Yang
2022-06-11 9:04 ` Chen Yu
2022-06-26 12:13 ` Abel Wu
2022-06-27 8:16 ` Yicong Yang