* [PATCH v4 0/2] sched/fair: Wake task within the cluster when possible
@ 2022-06-09 12:06 Yicong Yang
2022-06-09 12:06 ` [PATCH v4 1/2] sched: Add per_cpu cluster domain info and cpus_share_resources API Yicong Yang
2022-06-09 12:06 ` [PATCH v4 2/2] sched/fair: Scan cluster before scanning LLC in wake-up path Yicong Yang
0 siblings, 2 replies; 23+ messages in thread
From: Yicong Yang @ 2022-06-09 12:06 UTC (permalink / raw)
To: peterz, mingo, juri.lelli, vincent.guittot, tim.c.chen,
gautham.shenoy, linux-kernel, linux-arm-kernel
Cc: dietmar.eggemann, rostedt, bsegall, bristot, prime.zeng,
yangyicong, jonathan.cameron, ego, srikar, linuxarm, 21cnbao,
guodong.xu, hesham.almatary, john.garry, shenyang39
This is the follow-up work to support the cluster scheduler. Previously
we added a cluster level to the scheduler for both ARM64 [1] and
x86 [2], so that load balancing between clusters can exploit more
memory bandwidth and reduce cache contention. This patchset, on the
other hand, takes care of the wake-up path by trying CPUs within the
same cluster first, before scanning the whole LLC, to benefit tasks
that communicate with each other.
[1] 778c558f49a2 ("sched: Add cluster scheduler level in core and related Kconfig for ARM64")
[2] 66558b730f25 ("sched: Add cluster scheduler level for x86")
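To make the intent concrete, the wake-up side change can be sketched as a
small standalone C program. This is a simplified illustration only: the toy
topology, the cluster_of() helper and the select_idle_cpu() body below are
placeholders for this sketch and not the actual kernel code in patch 2/2.

/*
 * Toy model of the scan order this series introduces: when picking an
 * idle CPU for a waking task, CPUs sharing the target's cluster are
 * tried before the remaining CPUs of the LLC.
 */
#include <stdbool.h>
#include <stdio.h>

#define NR_CPUS 8

/* Assume CPUs 0-3 and 4-7 form two clusters inside a single LLC. */
static int cluster_of(int cpu) { return cpu / 4; }
static bool cpu_idle[NR_CPUS];

static int select_idle_cpu(int target)
{
	int cpu;

	/* 1. Prefer an idle CPU in the same cluster as the target. */
	for (cpu = 0; cpu < NR_CPUS; cpu++)
		if (cluster_of(cpu) == cluster_of(target) && cpu_idle[cpu])
			return cpu;

	/* 2. Fall back to the rest of the LLC. */
	for (cpu = 0; cpu < NR_CPUS; cpu++)
		if (cpu_idle[cpu])
			return cpu;

	return -1;	/* nothing idle; the caller falls back further */
}

int main(void)
{
	cpu_idle[2] = true;
	cpu_idle[6] = true;

	/* A task waking near CPU 1 lands on CPU 2 (same cluster) rather
	 * than CPU 6, although both are idle within the LLC. */
	printf("chosen CPU: %d\n", select_idle_cpu(1));
	return 0;
}

With CPUs 2 and 6 both idle, the sketch picks CPU 2 for a target of CPU 1,
which is the behaviour the series aims for: waker and wakee end up sharing
the cluster-level cache when possible.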
Changes since v3:
- fix a compile error when !CONFIG_SCHED_CLUSTER, reported by the lkp test robot.
Link: https://lore.kernel.org/lkml/20220608095758.60504-1-yangyicong@hisilicon.com/
Changes since v2:
- leverage SIS_PROP to suspend redundant scanning when LLC is overloaded
- remove the ping-pong suppression
- address the comment from Tim, thanks.
Link: https://lore.kernel.org/lkml/20220126080947.4529-1-yangyicong@hisilicon.com/
Changes since v1:
- re-collect the performance data based on v5.17-rc1
- rename cpus_share_cluster to cpus_share_resources per Vincent and Gautham, thanks!
Link: https://lore.kernel.org/lkml/20211215041149.73171-1-yangyicong@hisilicon.com/
Barry Song (2):
sched: Add per_cpu cluster domain info and cpus_share_resources API
sched/fair: Scan cluster before scanning LLC in wake-up path
include/linux/sched/sd_flags.h | 7 ++++++
include/linux/sched/topology.h | 8 ++++++-
kernel/sched/core.c | 12 ++++++++++
kernel/sched/fair.c | 44 +++++++++++++++++++++++++++++++---
kernel/sched/sched.h | 2 ++
kernel/sched/topology.c | 15 ++++++++++++
6 files changed, 84 insertions(+), 4 deletions(-)
--
2.24.0
^ permalink raw reply [flat|nested] 23+ messages in thread* [PATCH v4 1/2] sched: Add per_cpu cluster domain info and cpus_share_resources API 2022-06-09 12:06 [PATCH v4 0/2] sched/fair: Wake task within the cluster when possible Yicong Yang @ 2022-06-09 12:06 ` Yicong Yang 2022-06-09 22:28 ` Tim Chen 2022-06-15 14:19 ` K Prateek Nayak 2022-06-09 12:06 ` [PATCH v4 2/2] sched/fair: Scan cluster before scanning LLC in wake-up path Yicong Yang 1 sibling, 2 replies; 23+ messages in thread From: Yicong Yang @ 2022-06-09 12:06 UTC (permalink / raw) To: peterz, mingo, juri.lelli, vincent.guittot, tim.c.chen, gautham.shenoy, linux-kernel, linux-arm-kernel Cc: dietmar.eggemann, rostedt, bsegall, bristot, prime.zeng, yangyicong, jonathan.cameron, ego, srikar, linuxarm, 21cnbao, guodong.xu, hesham.almatary, john.garry, shenyang39 From: Barry Song <song.bao.hua@hisilicon.com> Add per-cpu cluster domain info and cpus_share_resources() API. This is the preparation for the optimization of select_idle_cpu() on platforms with cluster scheduler level. Signed-off-by: Barry Song <song.bao.hua@hisilicon.com> Signed-off-by: Yicong Yang <yangyicong@hisilicon.com> Reviewed-by: Gautham R. Shenoy <gautham.shenoy@amd.com> --- include/linux/sched/sd_flags.h | 7 +++++++ include/linux/sched/topology.h | 8 +++++++- kernel/sched/core.c | 12 ++++++++++++ kernel/sched/sched.h | 2 ++ kernel/sched/topology.c | 15 +++++++++++++++ 5 files changed, 43 insertions(+), 1 deletion(-) diff --git a/include/linux/sched/sd_flags.h b/include/linux/sched/sd_flags.h index 57bde66d95f7..42ed454e8b18 100644 --- a/include/linux/sched/sd_flags.h +++ b/include/linux/sched/sd_flags.h @@ -109,6 +109,13 @@ SD_FLAG(SD_ASYM_CPUCAPACITY_FULL, SDF_SHARED_PARENT | SDF_NEEDS_GROUPS) */ SD_FLAG(SD_SHARE_CPUCAPACITY, SDF_SHARED_CHILD | SDF_NEEDS_GROUPS) +/* + * Domain members share CPU cluster (LLC tags or L2 cache) + * + * NEEDS_GROUPS: Clusters are shared between groups. + */ +SD_FLAG(SD_CLUSTER, SDF_NEEDS_GROUPS) + /* * Domain members share CPU package resources (i.e. caches) * diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h index 56cffe42abbc..df489a1db6b7 100644 --- a/include/linux/sched/topology.h +++ b/include/linux/sched/topology.h @@ -45,7 +45,7 @@ static inline int cpu_smt_flags(void) #ifdef CONFIG_SCHED_CLUSTER static inline int cpu_cluster_flags(void) { - return SD_SHARE_PKG_RESOURCES; + return SD_CLUSTER | SD_SHARE_PKG_RESOURCES; } #endif @@ -178,6 +178,7 @@ cpumask_var_t *alloc_sched_domains(unsigned int ndoms); void free_sched_domains(cpumask_var_t doms[], unsigned int ndoms); bool cpus_share_cache(int this_cpu, int that_cpu); +bool cpus_share_resources(int this_cpu, int that_cpu); typedef const struct cpumask *(*sched_domain_mask_f)(int cpu); typedef int (*sched_domain_flags_f)(void); @@ -231,6 +232,11 @@ static inline bool cpus_share_cache(int this_cpu, int that_cpu) return true; } +static inline bool cpus_share_resources(int this_cpu, int that_cpu) +{ + return true; +} + #endif /* !CONFIG_SMP */ #if defined(CONFIG_ENERGY_MODEL) && defined(CONFIG_CPU_FREQ_GOV_SCHEDUTIL) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index bfa7452ca92e..79a6f012b0cf 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -3808,6 +3808,18 @@ bool cpus_share_cache(int this_cpu, int that_cpu) return per_cpu(sd_llc_id, this_cpu) == per_cpu(sd_llc_id, that_cpu); } +/* + * Whether CPUs are share cache resources, which means LLC on non-cluster + * machines and LLC tag or L2 on machines with clusters. 
+ */ +bool cpus_share_resources(int this_cpu, int that_cpu) +{ + if (this_cpu == that_cpu) + return true; + + return per_cpu(sd_share_id, this_cpu) == per_cpu(sd_share_id, that_cpu); +} + static inline bool ttwu_queue_cond(int cpu, int wake_flags) { /* diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 01259611beb9..b9bcfcf8d14d 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -1753,7 +1753,9 @@ static inline struct sched_domain *lowest_flag_domain(int cpu, int flag) DECLARE_PER_CPU(struct sched_domain __rcu *, sd_llc); DECLARE_PER_CPU(int, sd_llc_size); DECLARE_PER_CPU(int, sd_llc_id); +DECLARE_PER_CPU(int, sd_share_id); DECLARE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared); +DECLARE_PER_CPU(struct sched_domain __rcu *, sd_cluster); DECLARE_PER_CPU(struct sched_domain __rcu *, sd_numa); DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing); DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_cpucapacity); diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c index 05b6c2ad90b9..0595827d481d 100644 --- a/kernel/sched/topology.c +++ b/kernel/sched/topology.c @@ -664,6 +664,8 @@ static void destroy_sched_domains(struct sched_domain *sd) DEFINE_PER_CPU(struct sched_domain __rcu *, sd_llc); DEFINE_PER_CPU(int, sd_llc_size); DEFINE_PER_CPU(int, sd_llc_id); +DEFINE_PER_CPU(int, sd_share_id); +DEFINE_PER_CPU(struct sched_domain __rcu *, sd_cluster); DEFINE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared); DEFINE_PER_CPU(struct sched_domain __rcu *, sd_numa); DEFINE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing); @@ -689,6 +691,18 @@ static void update_top_cache_domain(int cpu) per_cpu(sd_llc_id, cpu) = id; rcu_assign_pointer(per_cpu(sd_llc_shared, cpu), sds); + sd = lowest_flag_domain(cpu, SD_CLUSTER); + if (sd) + id = cpumask_first(sched_domain_span(sd)); + rcu_assign_pointer(per_cpu(sd_cluster, cpu), sd); + + /* + * This assignment should be placed after the sd_llc_id as + * we want this id equals to cluster id on cluster machines + * but equals to LLC id on non-Cluster machines. + */ + per_cpu(sd_share_id, cpu) = id; + sd = lowest_flag_domain(cpu, SD_NUMA); rcu_assign_pointer(per_cpu(sd_numa, cpu), sd); @@ -1532,6 +1546,7 @@ static struct cpumask ***sched_domains_numa_masks; */ #define TOPOLOGY_SD_FLAGS \ (SD_SHARE_CPUCAPACITY | \ + SD_CLUSTER | \ SD_SHARE_PKG_RESOURCES | \ SD_NUMA | \ SD_ASYM_PACKING) -- 2.24.0 _______________________________________________ linux-arm-kernel mailing list linux-arm-kernel@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-arm-kernel ^ permalink raw reply related [flat|nested] 23+ messages in thread
* Re: [PATCH v4 1/2] sched: Add per_cpu cluster domain info and cpus_share_resources API 2022-06-09 12:06 ` [PATCH v4 1/2] sched: Add per_cpu cluster domain info and cpus_share_resources API Yicong Yang @ 2022-06-09 22:28 ` Tim Chen 2022-06-10 6:54 ` Yicong Yang 2022-06-15 14:19 ` K Prateek Nayak 1 sibling, 1 reply; 23+ messages in thread From: Tim Chen @ 2022-06-09 22:28 UTC (permalink / raw) To: Yicong Yang, peterz, mingo, juri.lelli, vincent.guittot, gautham.shenoy, linux-kernel, linux-arm-kernel Cc: dietmar.eggemann, rostedt, bsegall, bristot, prime.zeng, jonathan.cameron, ego, srikar, linuxarm, 21cnbao, guodong.xu, hesham.almatary, john.garry, shenyang39 On Thu, 2022-06-09 at 20:06 +0800, Yicong Yang wrote: > > > +/* > + * Whether CPUs are share cache resources, which means LLC on non-cluster > + * machines and LLC tag or L2 on machines with clusters. > + */ > +bool cpus_share_resources(int this_cpu, int that_cpu) Suggest cpus_share_lowest_cache to be a bit more informative > +{ > + if (this_cpu == that_cpu) > + return true; > + > + return per_cpu(sd_share_id, this_cpu) == per_cpu(sd_share_id, that_cpu); > +} > + > static inline bool ttwu_queue_cond(int cpu, int wake_flags) > { > /* > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h > index 01259611beb9..b9bcfcf8d14d 100644 > --- a/kernel/sched/sched.h > +++ b/kernel/sched/sched.h > @@ -1753,7 +1753,9 @@ static inline struct sched_domain *lowest_flag_domain(int cpu, int flag) > DECLARE_PER_CPU(struct sched_domain __rcu *, sd_llc); > DECLARE_PER_CPU(int, sd_llc_size); > DECLARE_PER_CPU(int, sd_llc_id); > +DECLARE_PER_CPU(int, sd_share_id); > DECLARE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared); > +DECLARE_PER_CPU(struct sched_domain __rcu *, sd_cluster); > DECLARE_PER_CPU(struct sched_domain __rcu *, sd_numa); > DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing); > DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_cpucapacity); > diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c > index 05b6c2ad90b9..0595827d481d 100644 > --- a/kernel/sched/topology.c > +++ b/kernel/sched/topology.c > @@ -664,6 +664,8 @@ static void destroy_sched_domains(struct sched_domain *sd) > DEFINE_PER_CPU(struct sched_domain __rcu *, sd_llc); > DEFINE_PER_CPU(int, sd_llc_size); > DEFINE_PER_CPU(int, sd_llc_id); > +DEFINE_PER_CPU(int, sd_share_id); Some minor nits about the name of "sd_share_id". It is not quite obvious what it is. Maybe something like sd_lowest_cache_id to denote it is the id of lowest shared cache domain between CPU. Otherwise the patch looks good to me. You can add Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com> Tim _______________________________________________ linux-arm-kernel mailing list linux-arm-kernel@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-arm-kernel ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH v4 1/2] sched: Add per_cpu cluster domain info and cpus_share_resources API 2022-06-09 22:28 ` Tim Chen @ 2022-06-10 6:54 ` Yicong Yang 0 siblings, 0 replies; 23+ messages in thread From: Yicong Yang @ 2022-06-10 6:54 UTC (permalink / raw) To: Tim Chen, Yicong Yang, peterz, mingo, juri.lelli, vincent.guittot, gautham.shenoy, linux-kernel, linux-arm-kernel Cc: dietmar.eggemann, rostedt, bsegall, bristot, prime.zeng, jonathan.cameron, ego, srikar, linuxarm, 21cnbao, guodong.xu, hesham.almatary, john.garry, shenyang39 On 2022/6/10 6:28, Tim Chen wrote: > On Thu, 2022-06-09 at 20:06 +0800, Yicong Yang wrote: >> >> >> +/* >> + * Whether CPUs are share cache resources, which means LLC on non-cluster >> + * machines and LLC tag or L2 on machines with clusters. >> + */ >> +bool cpus_share_resources(int this_cpu, int that_cpu) > > Suggest cpus_share_lowest_cache to be a bit more informative > >> +{ >> + if (this_cpu == that_cpu) >> + return true; >> + >> + return per_cpu(sd_share_id, this_cpu) == per_cpu(sd_share_id, that_cpu); >> +} >> + >> static inline bool ttwu_queue_cond(int cpu, int wake_flags) >> { >> /* >> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h >> index 01259611beb9..b9bcfcf8d14d 100644 >> --- a/kernel/sched/sched.h >> +++ b/kernel/sched/sched.h >> @@ -1753,7 +1753,9 @@ static inline struct sched_domain *lowest_flag_domain(int cpu, int flag) >> DECLARE_PER_CPU(struct sched_domain __rcu *, sd_llc); >> DECLARE_PER_CPU(int, sd_llc_size); >> DECLARE_PER_CPU(int, sd_llc_id); >> +DECLARE_PER_CPU(int, sd_share_id); >> DECLARE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared); >> +DECLARE_PER_CPU(struct sched_domain __rcu *, sd_cluster); >> DECLARE_PER_CPU(struct sched_domain __rcu *, sd_numa); >> DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing); >> DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_cpucapacity); >> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c >> index 05b6c2ad90b9..0595827d481d 100644 >> --- a/kernel/sched/topology.c >> +++ b/kernel/sched/topology.c >> @@ -664,6 +664,8 @@ static void destroy_sched_domains(struct sched_domain *sd) >> DEFINE_PER_CPU(struct sched_domain __rcu *, sd_llc); >> DEFINE_PER_CPU(int, sd_llc_size); >> DEFINE_PER_CPU(int, sd_llc_id); >> +DEFINE_PER_CPU(int, sd_share_id); > > Some minor nits about the name of "sd_share_id". > It is not quite obvious what it is. > > Maybe something like sd_lowest_cache_id to denote > it is the id of lowest shared cache domain between CPU. > Thanks for the suggestion! Since Vincent and Gautham have suggested this in v1 [1], I'd like to wait a bit before changing to this to see if they have any comment. > Otherwise the patch looks good to me. You can add > > Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com> > Thanks. [1] https://lore.kernel.org/lkml/CAKfTPtBKLDyNPXg7uLbQ3jUnEwppfC+E29=oJ1tWzzqHsNpApw@mail.gmail.com/ _______________________________________________ linux-arm-kernel mailing list linux-arm-kernel@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-arm-kernel ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH v4 1/2] sched: Add per_cpu cluster domain info and cpus_share_resources API 2022-06-09 12:06 ` [PATCH v4 1/2] sched: Add per_cpu cluster domain info and cpus_share_resources API Yicong Yang 2022-06-09 22:28 ` Tim Chen @ 2022-06-15 14:19 ` K Prateek Nayak 2022-06-15 15:43 ` Gautham R. Shenoy 2022-06-16 7:55 ` Yicong Yang 1 sibling, 2 replies; 23+ messages in thread From: K Prateek Nayak @ 2022-06-15 14:19 UTC (permalink / raw) To: Yicong Yang, peterz, mingo, juri.lelli, vincent.guittot, tim.c.chen, gautham.shenoy, linux-kernel, linux-arm-kernel Cc: dietmar.eggemann, rostedt, bsegall, bristot, prime.zeng, jonathan.cameron, ego, srikar, linuxarm, 21cnbao, guodong.xu, hesham.almatary, john.garry, shenyang39 Hello Yicong, I replied to the v3 of the series by mistake! (https://lore.kernel.org/lkml/0b065646-5b05-cbdb-b20c-e1dfef3f4d79@amd.com/) But rest assured all the analysis discussed there was done with the v4 patch series. I'll add the same observations below so we can continue discussion on v4. We are observing some serious regression with tbench with this patch series applied. The issue doesn't seem to be related to the actual functionality of the patch but how the patch changes the per CPU variable layout. Discussed below are the results from running tbench on a dual socket Zen3 (2 x 64C/128T) configured in different NPS modes. NPS Modes are used to logically divide single socket into multiple NUMA region. Following is the NUMA configuration for each NPS mode on the system: NPS1: Each socket is a NUMA node. Total 2 NUMA nodes in the dual socket machine. Node 0: 0-63, 128-191 Node 1: 64-127, 192-255 NPS2: Each socket is further logically divided into 2 NUMA regions. Total 4 NUMA nodes exist over 2 socket. Node 0: 0-31, 128-159 Node 1: 32-63, 160-191 Node 2: 64-95, 192-223 Node 3: 96-127, 223-255 NPS4: Each socket is logically divided into 4 NUMA regions. Total 8 NUMA nodes exist over 2 socket. 
Node 0: 0-15, 128-143 Node 1: 16-31, 144-159 Node 2: 32-47, 160-175 Node 3: 48-63, 176-191 Node 4: 64-79, 192-207 Node 5: 80-95, 208-223 Node 6: 96-111, 223-231 Node 7: 112-127, 232-255 Benchmark Results: Kernel versions: - tip: 5.19.0-rc2 tip sched/core - cluster: 5.19.0-rc2 tip sched/core + both the patches of the series When we started testing, the tip was at: commit: f3dd3f674555 "sched: Remove the limitation of WF_ON_CPU on wakelist if wakee cpu is idle" * - Data points of concern ~~~~~~ tbench ~~~~~~ NPS1 Clients: tip cluster 1 444.41 (0.00 pct) 439.27 (-1.15 pct) 2 879.23 (0.00 pct) 831.49 (-5.42 pct) * 4 1648.83 (0.00 pct) 1608.07 (-2.47 pct) 8 3263.81 (0.00 pct) 3086.81 (-5.42 pct) * 16 6011.19 (0.00 pct) 5360.28 (-10.82 pct) * 32 12058.31 (0.00 pct) 8769.08 (-27.27 pct) * 64 21258.21 (0.00 pct) 19021.09 (-10.52 pct) * 128 30795.27 (0.00 pct) 30861.34 (0.21 pct) 256 25138.43 (0.00 pct) 24711.90 (-1.69 pct) 512 51287.93 (0.00 pct) 51855.55 (1.10 pct) 1024 53176.97 (0.00 pct) 52554.55 (-1.17 pct) NPS2 Clients: tip cluster 1 445.45 (0.00 pct) 441.75 (-0.83 pct) 2 869.24 (0.00 pct) 845.61 (-2.71 pct) 4 1644.28 (0.00 pct) 1586.49 (-3.51 pct) 8 3120.83 (0.00 pct) 2967.01 (-4.92 pct) * 16 5972.29 (0.00 pct) 5208.58 (-12.78 pct) * 32 11776.38 (0.00 pct) 10229.53 (-13.13 pct) * 64 20933.15 (0.00 pct) 17033.45 (-18.62 pct) * 128 32195.00 (0.00 pct) 29507.85 (-8.34 pct) * 256 24641.52 (0.00 pct) 27225.00 (10.48 pct) 512 50806.96 (0.00 pct) 51377.50 (1.12 pct) 1024 51993.96 (0.00 pct) 50773.35 (-2.34 pct) NPS4 Clients: tip cluster 1 442.10 (0.00 pct) 435.06 (-1.59 pct) 2 870.94 (0.00 pct) 858.64 (-1.41 pct) 4 1615.30 (0.00 pct) 1607.27 (-0.49 pct) 8 3195.95 (0.00 pct) 3020.63 (-5.48 pct) * 16 5937.41 (0.00 pct) 5719.87 (-3.66 pct) 32 11800.41 (0.00 pct) 11229.65 (-4.83 pct) * 64 20844.71 (0.00 pct) 20432.79 (-1.97 pct) 128 31003.62 (0.00 pct) 29441.20 (-5.03 pct) * 256 27476.37 (0.00 pct) 25857.30 (-5.89 pct) * [Know to have run to run variance] 512 52276.72 (0.00 pct) 51659.16 (-1.18 pct) 1024 51372.10 (0.00 pct) 51026.87 (-0.67 pct) Note: tbench results for 256 workers are known to have run to run variation on the test machine. Any regression seen for the data point can be safely ignored. The behavior is consistent for both tip and patched kernel across multiple runs of tbench. ~~~~~~~~~~~~~~~~~~~~ Analysis done so far ~~~~~~~~~~~~~~~~~~~~ To root cause this issue quicker, we have focused on 8 to 64 clients data points with the machine running in NPS1 mode. - Even on disabling HW prefetcher, the behavior remains consistent signifying HW prefetcher is not the cause of the problem. - Bisecting: When we ran the tests with only Patch 1 of the series, the regression was visible and the numbers were worse. Clients: tip cluster Patch 1 Only 8 3263.81 (0.00 pct) 3086.81 (-5.42 pct) 3018.63 (-7.51 pct) 16 6011.19 (0.00 pct) 5360.28 (-10.82 pct) 4869.26 (-18.99 pct) 32 12058.31 (0.00 pct) 8769.08 (-27.27 pct) 8159.60 (-32.33 pct) 64 21258.21 (0.00 pct) 19021.09 (-10.52 pct) 13161.92 (-38.08 pct) We further bisected the hunks to narrow down the cause to the per CPU variable declarations. On 6/9/2022 5:36 PM, Yicong Yang wrote: > > [..snip..] 
> > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h > index 01259611beb9..b9bcfcf8d14d 100644 > --- a/kernel/sched/sched.h > +++ b/kernel/sched/sched.h > @@ -1753,7 +1753,9 @@ static inline struct sched_domain *lowest_flag_domain(int cpu, int flag) > DECLARE_PER_CPU(struct sched_domain __rcu *, sd_llc); > DECLARE_PER_CPU(int, sd_llc_size); > DECLARE_PER_CPU(int, sd_llc_id); > +DECLARE_PER_CPU(int, sd_share_id); > DECLARE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared); > +DECLARE_PER_CPU(struct sched_domain __rcu *, sd_cluster); The main reason for the regression seems to be the above declarations. The regression seem to go away if we do one of the following: - Declare sd_share_id and sd_cluster using DECLARE_PER_CPU_READ_MOSTLY() instead of DECLARE_PER_CPU() and change the corresponding definition below to DEFINE_PER_CPU_READ_MOSTLY(). Clients: tip Patch 1 Patch 1 (READ_MOSTLY) 8 3255.69 (0.00 pct) 3018.63 (-7.28 pct) 3237.33 (-0.56 pct) 16 6092.67 (0.00 pct) 4869.26 (-20.08 pct) 5914.53 (-2.92 pct) 32 11156.56 (0.00 pct) 8159.60 (-26.86 pct) 11536.05 (3.40 pct) 64 21019.97 (0.00 pct) 13161.92 (-37.38 pct) 21162.33 (0.67 pct) - Convert sd_share_id and sd_cluster to static arrays. Clients: tip Patch 1 Patch 1 (Static Array) 8 3255.69 (0.00 pct) 3018.63 (-7.28 pct) 3203.27 (-1.61 pct) 16 6092.67 (0.00 pct) 4869.26 (-20.08 pct) 6198.35 (1.73 pct) 32 11156.56 (0.00 pct) 8159.60 (-26.86 pct) 11385.76 (2.05 pct) 64 21019.97 (0.00 pct) 13161.92 (-37.38 pct) 21919.80 (4.28 pct) - Move the declarations of sd_share_id and sd_cluster to the top Clients: tip Patch 1 Patch 1 (Declarion on Top) 8 3255.69 (0.00 pct) 3018.63 (-7.28 pct) 3072.30 (-5.63 pct) 16 6092.67 (0.00 pct) 4869.26 (-20.08 pct) 5586.59 (-8.30 pct) 32 11156.56 (0.00 pct) 8159.60 (-26.86 pct) 11184.17 (0.24 pct) 64 21019.97 (0.00 pct) 13161.92 (-37.38 pct) 20289.70 (-3.47 pct) Unfortunately, none of these are complete solutions. For example, using DECLARE_PER_CPU_READ_MOSTLY() with both patches applied reduces the regression but doesn't eliminate it entirely: Clients: tip cluster cluster (READ_MOSTLY) 1 444.41 (0.00 pct) 439.27 (-1.15 pct) 435.95 (-1.90 pct) 2 879.23 (0.00 pct) 831.49 (-5.42 pct) 842.09 (-4.22 pct) 4 1648.83 (0.00 pct) 1608.07 (-2.47 pct) 1598.77 (-3.03 pct) 8 3263.81 (0.00 pct) 3086.81 (-5.42 pct) 3090.86 (-5.29 pct) * 16 6011.19 (0.00 pct) 5360.28 (-10.82 pct) 5360.28 (-10.82 pct) * 32 12058.31 (0.00 pct) 8769.08 (-27.27 pct) 11083.66 (-8.08 pct) * 64 21258.21 (0.00 pct) 19021.09 (-10.52 pct) 20984.30 (-1.28 pct) 128 30795.27 (0.00 pct) 30861.34 (0.21 pct) 30735.20 (-0.19 pct) 256 25138.43 (0.00 pct) 24711.90 (-1.69 pct) 24021.21 (-4.44 pct) 512 51287.93 (0.00 pct) 51855.55 (1.10 pct) 51672.73 (0.75 pct) 1024 53176.97 (0.00 pct) 52554.55 (-1.17 pct) 52620.02 (-1.04 pct) We are still trying to root cause the underlying issue that brought about such drastic regression in tbench performance. 
> DECLARE_PER_CPU(struct sched_domain __rcu *, sd_numa); > DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing); > DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_cpucapacity); > diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c > index 05b6c2ad90b9..0595827d481d 100644 > --- a/kernel/sched/topology.c > +++ b/kernel/sched/topology.c > @@ -664,6 +664,8 @@ static void destroy_sched_domains(struct sched_domain *sd) > DEFINE_PER_CPU(struct sched_domain __rcu *, sd_llc); > DEFINE_PER_CPU(int, sd_llc_size); > DEFINE_PER_CPU(int, sd_llc_id); > +DEFINE_PER_CPU(int, sd_share_id); > +DEFINE_PER_CPU(struct sched_domain __rcu *, sd_cluster); > DEFINE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared); > DEFINE_PER_CPU(struct sched_domain __rcu *, sd_numa); > DEFINE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing); > > [..snip..] > We would like some time to investigate this issue and root cause the reason for this regression. -- Thanks and Regards, Prateek _______________________________________________ linux-arm-kernel mailing list linux-arm-kernel@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-arm-kernel ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH v4 1/2] sched: Add per_cpu cluster domain info and cpus_share_resources API 2022-06-15 14:19 ` K Prateek Nayak @ 2022-06-15 15:43 ` Gautham R. Shenoy 2022-06-16 8:10 ` Yicong Yang 2022-06-16 7:55 ` Yicong Yang 1 sibling, 1 reply; 23+ messages in thread From: Gautham R. Shenoy @ 2022-06-15 15:43 UTC (permalink / raw) To: K Prateek Nayak Cc: Yicong Yang, peterz, mingo, juri.lelli, vincent.guittot, tim.c.chen, linux-kernel, linux-arm-kernel, dietmar.eggemann, rostedt, bsegall, bristot, prime.zeng, jonathan.cameron, ego, srikar, linuxarm, 21cnbao, guodong.xu, hesham.almatary, john.garry, shenyang39 On Wed, Jun 15, 2022 at 07:49:22PM +0530, K Prateek Nayak wrote: [..snip..] > > - Bisecting: > > When we ran the tests with only Patch 1 of the series, the > regression was visible and the numbers were worse. > > Clients: tip cluster Patch 1 Only > 8 3263.81 (0.00 pct) 3086.81 (-5.42 pct) 3018.63 (-7.51 pct) > 16 6011.19 (0.00 pct) 5360.28 (-10.82 pct) 4869.26 (-18.99 pct) > 32 12058.31 (0.00 pct) 8769.08 (-27.27 pct) 8159.60 (-32.33 pct) > 64 21258.21 (0.00 pct) 19021.09 (-10.52 pct) 13161.92 (-38.08 pct) > > We further bisected the hunks to narrow down the cause to the per CPU > variable declarations. > > > > > > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h > > index 01259611beb9..b9bcfcf8d14d 100644 > > --- a/kernel/sched/sched.h > > +++ b/kernel/sched/sched.h > > @@ -1753,7 +1753,9 @@ static inline struct sched_domain *lowest_flag_domain(int cpu, int flag) > > DECLARE_PER_CPU(struct sched_domain __rcu *, sd_llc); > > DECLARE_PER_CPU(int, sd_llc_size); > > DECLARE_PER_CPU(int, sd_llc_id); > > +DECLARE_PER_CPU(int, sd_share_id); > > DECLARE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared); > > +DECLARE_PER_CPU(struct sched_domain __rcu *, sd_cluster); > > The main reason for the regression seems to be the above declarations. I think you meant that the regressions are due to the DEFINE_PER_CPU() instances from the following hunk: > > @@ -664,6 +664,8 @@ static void destroy_sched_domains(struct sched_domain *sd) > > DEFINE_PER_CPU(struct sched_domain __rcu *, sd_llc); > > DEFINE_PER_CPU(int, sd_llc_size); > > DEFINE_PER_CPU(int, sd_llc_id); > > +DEFINE_PER_CPU(int, sd_share_id); > > +DEFINE_PER_CPU(struct sched_domain __rcu *, sd_cluster); > > DEFINE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared); > > DEFINE_PER_CPU(struct sched_domain __rcu *, sd_numa); > > DEFINE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing); > > The System.map diff for these variables between tip vs tip + cluster-sched-v4 on your test system looks as follows: 0000000000020520 D sd_asym_packing 0000000000020528 D sd_numa -0000000000020530 D sd_llc_shared -0000000000020538 D sd_llc_id -000000000002053c D sd_llc_size -0000000000020540 D sd_llc +0000000000020530 D sd_cluster +0000000000020538 D sd_llc_shared +0000000000020540 D sd_share_id +0000000000020544 D sd_llc_id +0000000000020548 D sd_llc_size +0000000000020550 D sd_llc The allocations are in the reverse-order of the definitions. 
That perhaps explains why you no longer see the regression when you define the sd_share_id and sd_cluster per-cpu definitions at the beginning as indicated by the following > - Move the declarations of sd_share_id and sd_cluster to the top > > Clients: tip Patch 1 Patch 1 (Declarion on Top) > 8 3255.69 (0.00 pct) 3018.63 (-7.28 pct) 3072.30 (-5.63 pct) > 16 6092.67 (0.00 pct) 4869.26 (-20.08 pct) 5586.59 (-8.30 pct) > 32 11156.56 (0.00 pct) 8159.60 (-26.86 pct) 11184.17 (0.24 pct) > 64 21019.97 (0.00 pct) 13161.92 (-37.38 pct) 20289.70 (-3.47 pct) -- Thanks and Regards gautham. _______________________________________________ linux-arm-kernel mailing list linux-arm-kernel@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-arm-kernel ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH v4 1/2] sched: Add per_cpu cluster domain info and cpus_share_resources API 2022-06-15 15:43 ` Gautham R. Shenoy @ 2022-06-16 8:10 ` Yicong Yang 0 siblings, 0 replies; 23+ messages in thread From: Yicong Yang @ 2022-06-16 8:10 UTC (permalink / raw) To: Gautham R. Shenoy, K Prateek Nayak Cc: yangyicong, peterz, mingo, juri.lelli, vincent.guittot, tim.c.chen, linux-kernel, linux-arm-kernel, dietmar.eggemann, rostedt, bsegall, bristot, prime.zeng, jonathan.cameron, ego, srikar, linuxarm, 21cnbao, guodong.xu, hesham.almatary, john.garry, shenyang39 On 2022/6/15 23:43, Gautham R. Shenoy wrote: > On Wed, Jun 15, 2022 at 07:49:22PM +0530, K Prateek Nayak wrote: > > [..snip..] > >> >> - Bisecting: >> >> When we ran the tests with only Patch 1 of the series, the >> regression was visible and the numbers were worse. >> >> Clients: tip cluster Patch 1 Only >> 8 3263.81 (0.00 pct) 3086.81 (-5.42 pct) 3018.63 (-7.51 pct) >> 16 6011.19 (0.00 pct) 5360.28 (-10.82 pct) 4869.26 (-18.99 pct) >> 32 12058.31 (0.00 pct) 8769.08 (-27.27 pct) 8159.60 (-32.33 pct) >> 64 21258.21 (0.00 pct) 19021.09 (-10.52 pct) 13161.92 (-38.08 pct) >> >> We further bisected the hunks to narrow down the cause to the per CPU >> variable declarations. >> >> >>> >>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h >>> index 01259611beb9..b9bcfcf8d14d 100644 >>> --- a/kernel/sched/sched.h >>> +++ b/kernel/sched/sched.h >>> @@ -1753,7 +1753,9 @@ static inline struct sched_domain *lowest_flag_domain(int cpu, int flag) >>> DECLARE_PER_CPU(struct sched_domain __rcu *, sd_llc); >>> DECLARE_PER_CPU(int, sd_llc_size); >>> DECLARE_PER_CPU(int, sd_llc_id); >>> +DECLARE_PER_CPU(int, sd_share_id); >>> DECLARE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared); >>> +DECLARE_PER_CPU(struct sched_domain __rcu *, sd_cluster); >> >> The main reason for the regression seems to be the above declarations. > > I think you meant that the regressions are due to the DEFINE_PER_CPU() > instances from the following hunk: > >>> @@ -664,6 +664,8 @@ static void destroy_sched_domains(struct sched_domain *sd) >>> DEFINE_PER_CPU(struct sched_domain __rcu *, sd_llc); >>> DEFINE_PER_CPU(int, sd_llc_size); >>> DEFINE_PER_CPU(int, sd_llc_id); >>> +DEFINE_PER_CPU(int, sd_share_id); >>> +DEFINE_PER_CPU(struct sched_domain __rcu *, sd_cluster); >>> DEFINE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared); >>> DEFINE_PER_CPU(struct sched_domain __rcu *, sd_numa); >>> DEFINE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing); >>> > > > The System.map diff for these variables between tip vs tip + > cluster-sched-v4 on your test system looks as follows: > > 0000000000020520 D sd_asym_packing > 0000000000020528 D sd_numa > -0000000000020530 D sd_llc_shared > -0000000000020538 D sd_llc_id > -000000000002053c D sd_llc_size > -0000000000020540 D sd_llc > +0000000000020530 D sd_cluster > +0000000000020538 D sd_llc_shared looks like below are in another cacheline (for 64B cacheline)? while previous sd_llc_id and sd_llc_shared are in the same. > +0000000000020540 D sd_share_id > +0000000000020544 D sd_llc_id > +0000000000020548 D sd_llc_size > +0000000000020550 D sd_llc > > The allocations are in the reverse-order of the definitions. 
> > That perhaps explains why you no longer see the regression when you > define the sd_share_id and sd_cluster per-cpu definitions at the > beginning as indicated by the following > >> - Move the declarations of sd_share_id and sd_cluster to the top >> >> Clients: tip Patch 1 Patch 1 (Declarion on Top) >> 8 3255.69 (0.00 pct) 3018.63 (-7.28 pct) 3072.30 (-5.63 pct) >> 16 6092.67 (0.00 pct) 4869.26 (-20.08 pct) 5586.59 (-8.30 pct) >> 32 11156.56 (0.00 pct) 8159.60 (-26.86 pct) 11184.17 (0.24 pct) >> 64 21019.97 (0.00 pct) 13161.92 (-37.38 pct) 20289.70 (-3.47 pct) > > > -- > Thanks and Regards > gautham. > . > _______________________________________________ linux-arm-kernel mailing list linux-arm-kernel@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-arm-kernel ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH v4 1/2] sched: Add per_cpu cluster domain info and cpus_share_resources API 2022-06-15 14:19 ` K Prateek Nayak 2022-06-15 15:43 ` Gautham R. Shenoy @ 2022-06-16 7:55 ` Yicong Yang [not found] ` <6bf4f032-7d07-d4a4-4f5a-28f3871131c0@amd.com> 1 sibling, 1 reply; 23+ messages in thread From: Yicong Yang @ 2022-06-16 7:55 UTC (permalink / raw) To: K Prateek Nayak, Yicong Yang, peterz, mingo, juri.lelli, vincent.guittot, tim.c.chen, gautham.shenoy, linux-kernel, linux-arm-kernel Cc: dietmar.eggemann, rostedt, bsegall, bristot, prime.zeng, jonathan.cameron, ego, srikar, linuxarm, 21cnbao, guodong.xu, hesham.almatary, john.garry, shenyang39 Hi Prateek, Seems my previous reply is in the wrong format so the server rejected it..just repost.. Thanks a lot for the test! On 2022/6/15 22:19, K Prateek Nayak wrote: > Hello Yicong, > > I replied to the v3 of the series by mistake! > (https://lore.kernel.org/lkml/0b065646-5b05-cbdb-b20c-e1dfef3f4d79@amd.com/) > But rest assured all the analysis discussed there was done with > the v4 patch series. I'll add the same observations below so we > can continue discussion on v4. > > We are observing some serious regression with tbench with this patch > series applied. The issue doesn't seem to be related to the actual > functionality of the patch but how the patch changes the per CPU > variable layout. > > Discussed below are the results from running tbench on a dual > socket Zen3 (2 x 64C/128T) configured in different NPS modes. > > NPS Modes are used to logically divide single socket into > multiple NUMA region. > Following is the NUMA configuration for each NPS mode on the system: > > NPS1: Each socket is a NUMA node. > Total 2 NUMA nodes in the dual socket machine. > > Node 0: 0-63, 128-191 > Node 1: 64-127, 192-255 > > NPS2: Each socket is further logically divided into 2 NUMA regions. > Total 4 NUMA nodes exist over 2 socket. > > Node 0: 0-31, 128-159 > Node 1: 32-63, 160-191 > Node 2: 64-95, 192-223 > Node 3: 96-127, 223-255 > > NPS4: Each socket is logically divided into 4 NUMA regions. > Total 8 NUMA nodes exist over 2 socket. 
> > Node 0: 0-15, 128-143 > Node 1: 16-31, 144-159 > Node 2: 32-47, 160-175 > Node 3: 48-63, 176-191 > Node 4: 64-79, 192-207 > Node 5: 80-95, 208-223 > Node 6: 96-111, 223-231 > Node 7: 112-127, 232-255 > > Benchmark Results: > > Kernel versions: > - tip: 5.19.0-rc2 tip sched/core > - cluster: 5.19.0-rc2 tip sched/core + both the patches of the series > > When we started testing, the tip was at: > commit: f3dd3f674555 "sched: Remove the limitation of WF_ON_CPU on wakelist if wakee cpu is idle" > > * - Data points of concern > > ~~~~~~ > tbench > ~~~~~~ > > NPS1 > > Clients: tip cluster > 1 444.41 (0.00 pct) 439.27 (-1.15 pct) > 2 879.23 (0.00 pct) 831.49 (-5.42 pct) * > 4 1648.83 (0.00 pct) 1608.07 (-2.47 pct) > 8 3263.81 (0.00 pct) 3086.81 (-5.42 pct) * > 16 6011.19 (0.00 pct) 5360.28 (-10.82 pct) * > 32 12058.31 (0.00 pct) 8769.08 (-27.27 pct) * > 64 21258.21 (0.00 pct) 19021.09 (-10.52 pct) * > 128 30795.27 (0.00 pct) 30861.34 (0.21 pct) > 256 25138.43 (0.00 pct) 24711.90 (-1.69 pct) > 512 51287.93 (0.00 pct) 51855.55 (1.10 pct) > 1024 53176.97 (0.00 pct) 52554.55 (-1.17 pct) > > NPS2 > > Clients: tip cluster > 1 445.45 (0.00 pct) 441.75 (-0.83 pct) > 2 869.24 (0.00 pct) 845.61 (-2.71 pct) > 4 1644.28 (0.00 pct) 1586.49 (-3.51 pct) > 8 3120.83 (0.00 pct) 2967.01 (-4.92 pct) * > 16 5972.29 (0.00 pct) 5208.58 (-12.78 pct) * > 32 11776.38 (0.00 pct) 10229.53 (-13.13 pct) * > 64 20933.15 (0.00 pct) 17033.45 (-18.62 pct) * > 128 32195.00 (0.00 pct) 29507.85 (-8.34 pct) * > 256 24641.52 (0.00 pct) 27225.00 (10.48 pct) > 512 50806.96 (0.00 pct) 51377.50 (1.12 pct) > 1024 51993.96 (0.00 pct) 50773.35 (-2.34 pct) > > NPS4 > > Clients: tip cluster > 1 442.10 (0.00 pct) 435.06 (-1.59 pct) > 2 870.94 (0.00 pct) 858.64 (-1.41 pct) > 4 1615.30 (0.00 pct) 1607.27 (-0.49 pct) > 8 3195.95 (0.00 pct) 3020.63 (-5.48 pct) * > 16 5937.41 (0.00 pct) 5719.87 (-3.66 pct) > 32 11800.41 (0.00 pct) 11229.65 (-4.83 pct) * > 64 20844.71 (0.00 pct) 20432.79 (-1.97 pct) > 128 31003.62 (0.00 pct) 29441.20 (-5.03 pct) * > 256 27476.37 (0.00 pct) 25857.30 (-5.89 pct) * [Know to have run to run variance] > 512 52276.72 (0.00 pct) 51659.16 (-1.18 pct) > 1024 51372.10 (0.00 pct) 51026.87 (-0.67 pct) > > Note: tbench results for 256 workers are known to have > run to run variation on the test machine. Any regression > seen for the data point can be safely ignored. > > The behavior is consistent for both tip and patched kernel > across multiple runs of tbench. > > ~~~~~~~~~~~~~~~~~~~~ > Analysis done so far > ~~~~~~~~~~~~~~~~~~~~ > > To root cause this issue quicker, we have focused on 8 to 64 clients > data points with the machine running in NPS1 mode. > > - Even on disabling HW prefetcher, the behavior remains consistent > signifying HW prefetcher is not the cause of the problem. > > - Bisecting: > > When we ran the tests with only Patch 1 of the series, the > regression was visible and the numbers were worse. > > Clients: tip cluster Patch 1 Only > 8 3263.81 (0.00 pct) 3086.81 (-5.42 pct) 3018.63 (-7.51 pct) > 16 6011.19 (0.00 pct) 5360.28 (-10.82 pct) 4869.26 (-18.99 pct) > 32 12058.31 (0.00 pct) 8769.08 (-27.27 pct) 8159.60 (-32.33 pct) > 64 21258.21 (0.00 pct) 19021.09 (-10.52 pct) 13161.92 (-38.08 pct) > > We further bisected the hunks to narrow down the cause to the per CPU > variable declarations. > > > On 6/9/2022 5:36 PM, Yicong Yang wrote: >> >> [..snip..] 
>> >> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h >> index 01259611beb9..b9bcfcf8d14d 100644 >> --- a/kernel/sched/sched.h >> +++ b/kernel/sched/sched.h >> @@ -1753,7 +1753,9 @@ static inline struct sched_domain *lowest_flag_domain(int cpu, int flag) >> DECLARE_PER_CPU(struct sched_domain __rcu *, sd_llc); >> DECLARE_PER_CPU(int, sd_llc_size); >> DECLARE_PER_CPU(int, sd_llc_id); >> +DECLARE_PER_CPU(int, sd_share_id); >> DECLARE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared); >> +DECLARE_PER_CPU(struct sched_domain __rcu *, sd_cluster); > > The main reason for the regression seems to be the above declarations. > The regression seem to go away if we do one of the following: > > - Declare sd_share_id and sd_cluster using DECLARE_PER_CPU_READ_MOSTLY() > instead of DECLARE_PER_CPU() and change the corresponding definition > below to DEFINE_PER_CPU_READ_MOSTLY(). > > Clients: tip Patch 1 Patch 1 (READ_MOSTLY) > 8 3255.69 (0.00 pct) 3018.63 (-7.28 pct) 3237.33 (-0.56 pct) > 16 6092.67 (0.00 pct) 4869.26 (-20.08 pct) 5914.53 (-2.92 pct) > 32 11156.56 (0.00 pct) 8159.60 (-26.86 pct) 11536.05 (3.40 pct) > 64 21019.97 (0.00 pct) 13161.92 (-37.38 pct) 21162.33 (0.67 pct) > > - Convert sd_share_id and sd_cluster to static arrays. > > Clients: tip Patch 1 Patch 1 (Static Array) > 8 3255.69 (0.00 pct) 3018.63 (-7.28 pct) 3203.27 (-1.61 pct) > 16 6092.67 (0.00 pct) 4869.26 (-20.08 pct) 6198.35 (1.73 pct) > 32 11156.56 (0.00 pct) 8159.60 (-26.86 pct) 11385.76 (2.05 pct) > 64 21019.97 (0.00 pct) 13161.92 (-37.38 pct) 21919.80 (4.28 pct) > > - Move the declarations of sd_share_id and sd_cluster to the top > > Clients: tip Patch 1 Patch 1 (Declarion on Top) > 8 3255.69 (0.00 pct) 3018.63 (-7.28 pct) 3072.30 (-5.63 pct) > 16 6092.67 (0.00 pct) 4869.26 (-20.08 pct) 5586.59 (-8.30 pct) > 32 11156.56 (0.00 pct) 8159.60 (-26.86 pct) 11184.17 (0.24 pct) > 64 21019.97 (0.00 pct) 13161.92 (-37.38 pct) 20289.70 (-3.47 pct) > > Unfortunately, none of these are complete solutions. For example, using > DECLARE_PER_CPU_READ_MOSTLY() with both patches applied reduces the regression > but doesn't eliminate it entirely: > > Clients: tip cluster cluster (READ_MOSTLY) > 1 444.41 (0.00 pct) 439.27 (-1.15 pct) 435.95 (-1.90 pct) > 2 879.23 (0.00 pct) 831.49 (-5.42 pct) 842.09 (-4.22 pct) > 4 1648.83 (0.00 pct) 1608.07 (-2.47 pct) 1598.77 (-3.03 pct) > 8 3263.81 (0.00 pct) 3086.81 (-5.42 pct) 3090.86 (-5.29 pct) * > 16 6011.19 (0.00 pct) 5360.28 (-10.82 pct) 5360.28 (-10.82 pct) * > 32 12058.31 (0.00 pct) 8769.08 (-27.27 pct) 11083.66 (-8.08 pct) * > 64 21258.21 (0.00 pct) 19021.09 (-10.52 pct) 20984.30 (-1.28 pct) > 128 30795.27 (0.00 pct) 30861.34 (0.21 pct) 30735.20 (-0.19 pct) > 256 25138.43 (0.00 pct) 24711.90 (-1.69 pct) 24021.21 (-4.44 pct) > 512 51287.93 (0.00 pct) 51855.55 (1.10 pct) 51672.73 (0.75 pct) > 1024 53176.97 (0.00 pct) 52554.55 (-1.17 pct) 52620.02 (-1.04 pct) > > We are still trying to root cause the underlying issue that > brought about such drastic regression in tbench performance. 
> >> DECLARE_PER_CPU(struct sched_domain __rcu *, sd_numa); >> DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing); >> DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_cpucapacity); >> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c >> index 05b6c2ad90b9..0595827d481d 100644 >> --- a/kernel/sched/topology.c >> +++ b/kernel/sched/topology.c >> @@ -664,6 +664,8 @@ static void destroy_sched_domains(struct sched_domain *sd) >> DEFINE_PER_CPU(struct sched_domain __rcu *, sd_llc); >> DEFINE_PER_CPU(int, sd_llc_size); >> DEFINE_PER_CPU(int, sd_llc_id); >> +DEFINE_PER_CPU(int, sd_share_id); >> +DEFINE_PER_CPU(struct sched_domain __rcu *, sd_cluster); >> DEFINE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared); >> DEFINE_PER_CPU(struct sched_domain __rcu *, sd_numa); >> DEFINE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing); >> >> [..snip..] >> > > We would like some time to investigate this issue and root cause > the reason for this regression. One significant difference I can see for now is that Kunpeng 920 is a non-SMT machine while Zen3 is a SMT machine. So in the select_idle_sibling() path we won't use sd_llc_shared. Since sd_share_id and sd_cluster are inserted between sd_llc_id and sd_llc_shared which are both used in the path on your test, I guess the change of the layout may affect something like the cache behaviour. Can you also test with SMT disabled? Or change the layout a bit to see if there's any difference, like: DEFINE_PER_CPU(struct sched_domain __rcu *, sd_llc); DEFINE_PER_CPU(int, sd_llc_size); DEFINE_PER_CPU(int, sd_llc_id); DEFINE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared); +DEFINE_PER_CPU(int, sd_share_id); +DEFINE_PER_CPU(struct sched_domain __rcu *, sd_cluster); DEFINE_PER_CPU(struct sched_domain __rcu *, sd_numa); DEFINE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing); I need to further look into it and have some tests on a SMT machine. Would you mind to share the kernel config as well? I'd like to compare the config as well. Thanks, Yicong _______________________________________________ linux-arm-kernel mailing list linux-arm-kernel@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-arm-kernel ^ permalink raw reply [flat|nested] 23+ messages in thread
[parent not found: <6bf4f032-7d07-d4a4-4f5a-28f3871131c0@amd.com>]
* Re: [PATCH v4 1/2] sched: Add per_cpu cluster domain info and cpus_share_resources API [not found] ` <6bf4f032-7d07-d4a4-4f5a-28f3871131c0@amd.com> @ 2022-06-17 16:50 ` Tim Chen 2022-06-20 11:20 ` K Prateek Nayak 0 siblings, 1 reply; 23+ messages in thread From: Tim Chen @ 2022-06-17 16:50 UTC (permalink / raw) To: K Prateek Nayak, Yicong Yang, Yicong Yang, peterz, mingo, juri.lelli, vincent.guittot, gautham.shenoy, linux-kernel, linux-arm-kernel Cc: dietmar.eggemann, rostedt, bsegall, bristot, prime.zeng, jonathan.cameron, ego, srikar, linuxarm, 21cnbao, guodong.xu, hesham.almatary, john.garry, shenyang39, feng.tang On Fri, 2022-06-17 at 17:50 +0530, K Prateek Nayak wrote: > > > -- > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h > index e9f3dc6dcbf4..97a3895416ab 100644 > --- a/kernel/sched/sched.h > +++ b/kernel/sched/sched.h > @@ -1750,12 +1750,12 @@ static inline struct sched_domain *lowest_flag_domain(int cpu, int flag) > return sd; > } > > +DECLARE_PER_CPU(struct sched_domain __rcu *, sd_cluster); > +DECLARE_PER_CPU(int, sd_share_id); > DECLARE_PER_CPU(struct sched_domain __rcu *, sd_llc); > DECLARE_PER_CPU(int, sd_llc_size); > DECLARE_PER_CPU(int, sd_llc_id); > -DECLARE_PER_CPU(int, sd_share_id); > DECLARE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared); > -DECLARE_PER_CPU(struct sched_domain __rcu *, sd_cluster); > DECLARE_PER_CPU(struct sched_domain __rcu *, sd_numa); > DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing); > DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_cpucapacity); > -- > > The System-map of each kernel is as follows: > > - On "tip" > > 0000000000020518 D sd_asym_cpucapacity > 0000000000020520 D sd_asym_packing > 0000000000020528 D sd_numa > 0000000000020530 D sd_llc_shared > 0000000000020538 D sd_llc_id > 000000000002053c D sd_llc_size > -------------------------------------------- 64B Cacheline Boundary > 0000000000020540 D sd_llc > > - On "tip + Patch 1 only" and "tip + both patches" > > 0000000000020518 D sd_asym_cpucapacity > 0000000000020520 D sd_asym_packing > 0000000000020528 D sd_numa > 0000000000020530 D sd_cluster <----- > 0000000000020538 D sd_llc_shared > -------------------------------------------- 64B Cacheline Boundary > 0000000000020540 D sd_share_id <----- > 0000000000020544 D sd_llc_id > 0000000000020548 D sd_llc_size > 0000000000020550 D sd_llc > > > - On "tip + both patches (Move declaration to top)" > > 0000000000020518 D sd_asym_cpucapacity > 0000000000020520 D sd_asym_packing > 0000000000020528 D sd_numa > 0000000000020530 D sd_llc_shared > 0000000000020538 D sd_llc_id > 000000000002053c D sd_llc_size > -------------------------------------------- 64B Cacheline Boundary > 0000000000020540 D sd_llc Wonder if it will help to try keep sd_llc and sd_llc_size into the same cache line. They are both used in the wake up path. > 0000000000020548 D sd_share_id <----- > 0000000000020550 D sd_cluster <----- > > > Or change the layout a bit to see if there's any difference, > > like: > > > > DEFINE_PER_CPU(struct sched_domain __rcu *, sd_llc); > > DEFINE_PER_CPU(int, sd_llc_size); > > DEFINE_PER_CPU(int, sd_llc_id); > > DEFINE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared); > > +DEFINE_PER_CPU(int, sd_share_id); > > +DEFINE_PER_CPU(struct sched_domain __rcu *, sd_cluster); > > DEFINE_PER_CPU(struct sched_domain __rcu *, sd_numa); > > DEFINE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing); > > > > I need to further look into it and have some tests on a SMT machine. 
Would you mind to share > > the kernel config as well? I'd like to compare the config as well. > > I've attached the kernel config used to build the test kernel > to this mail. > > > Thanks, > > Yicong > > We are trying to debug the issue using perf and find an optimal > arrangement of the per cpu declarations to get the relevant data > used in the wakeup path on the same 64B cache line. A check of perf c2c profile difference between tip and the move new declarations to the top case could be useful. It may give some additional clues of possibel false sharing issues. Tim > > We'll keep you posted of out finding. Let me know if you need > anything else in the meantime. > -- > Thanks and Regards, > Prateek _______________________________________________ linux-arm-kernel mailing list linux-arm-kernel@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-arm-kernel ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH v4 1/2] sched: Add per_cpu cluster domain info and cpus_share_resources API 2022-06-17 16:50 ` Tim Chen @ 2022-06-20 11:20 ` K Prateek Nayak 2022-06-20 13:37 ` Abel Wu 0 siblings, 1 reply; 23+ messages in thread From: K Prateek Nayak @ 2022-06-20 11:20 UTC (permalink / raw) To: Tim Chen, Yicong Yang, Yicong Yang, peterz, mingo, juri.lelli, vincent.guittot, gautham.shenoy, linux-kernel, linux-arm-kernel Cc: dietmar.eggemann, rostedt, bsegall, bristot, prime.zeng, jonathan.cameron, ego, srikar, linuxarm, 21cnbao, guodong.xu, hesham.almatary, john.garry, shenyang39, feng.tang Hello Tim, Thank you for looking into this. On 6/17/2022 10:20 PM, Tim Chen wrote: > On Fri, 2022-06-17 at 17:50 +0530, K Prateek Nayak wrote: >> >> >> -- >> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h >> index e9f3dc6dcbf4..97a3895416ab 100644 >> --- a/kernel/sched/sched.h >> +++ b/kernel/sched/sched.h >> @@ -1750,12 +1750,12 @@ static inline struct sched_domain *lowest_flag_domain(int cpu, int flag) >> return sd; >> } >> >> +DECLARE_PER_CPU(struct sched_domain __rcu *, sd_cluster); >> +DECLARE_PER_CPU(int, sd_share_id); >> DECLARE_PER_CPU(struct sched_domain __rcu *, sd_llc); >> DECLARE_PER_CPU(int, sd_llc_size); >> DECLARE_PER_CPU(int, sd_llc_id); >> -DECLARE_PER_CPU(int, sd_share_id); >> DECLARE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared); >> -DECLARE_PER_CPU(struct sched_domain __rcu *, sd_cluster); >> DECLARE_PER_CPU(struct sched_domain __rcu *, sd_numa); >> DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing); >> DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_cpucapacity); >> -- >> >> The System-map of each kernel is as follows: >> >> - On "tip" >> >> 0000000000020518 D sd_asym_cpucapacity >> 0000000000020520 D sd_asym_packing >> 0000000000020528 D sd_numa >> 0000000000020530 D sd_llc_shared >> 0000000000020538 D sd_llc_id >> 000000000002053c D sd_llc_size >> -------------------------------------------- 64B Cacheline Boundary >> 0000000000020540 D sd_llc >> >> - On "tip + Patch 1 only" and "tip + both patches" >> >> 0000000000020518 D sd_asym_cpucapacity >> 0000000000020520 D sd_asym_packing >> 0000000000020528 D sd_numa >> 0000000000020530 D sd_cluster <----- >> 0000000000020538 D sd_llc_shared >> -------------------------------------------- 64B Cacheline Boundary >> 0000000000020540 D sd_share_id <----- >> 0000000000020544 D sd_llc_id >> 0000000000020548 D sd_llc_size >> 0000000000020550 D sd_llc >> >> >> - On "tip + both patches (Move declaration to top)" >> >> 0000000000020518 D sd_asym_cpucapacity >> 0000000000020520 D sd_asym_packing >> 0000000000020528 D sd_numa >> 0000000000020530 D sd_llc_shared >> 0000000000020538 D sd_llc_id >> 000000000002053c D sd_llc_size >> -------------------------------------------- 64B Cacheline Boundary >> 0000000000020540 D sd_llc > > Wonder if it will help to try keep sd_llc and sd_llc_size into the same > cache line. They are both used in the wake up path. We are still evaluating keeping which set of variables on the same cache line will provide the best results. I would have expected the two kernel variants - "tip" and the "tip + both patches (Move declaration to top)" - to give similar results as their System map for all the old variables remain the same and the addition of "sd_share_id" and "sd_cluster: fit in the gap after "sd_llc". However, now we see a regression for higher number of client. 
This probably hints that access to "sd_cluster" variable in Patch 2 and bringing in the extra cache line could be responsible for the regression we see with "tip + both patches (Move declaration to top)" > > >> 0000000000020548 D sd_share_id <----- >> 0000000000020550 D sd_cluster <----- >> >>> Or change the layout a bit to see if there's any difference, >>> like: >>> >>> DEFINE_PER_CPU(struct sched_domain __rcu *, sd_llc); >>> DEFINE_PER_CPU(int, sd_llc_size); >>> DEFINE_PER_CPU(int, sd_llc_id); >>> DEFINE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared); >>> +DEFINE_PER_CPU(int, sd_share_id); >>> +DEFINE_PER_CPU(struct sched_domain __rcu *, sd_cluster); >>> DEFINE_PER_CPU(struct sched_domain __rcu *, sd_numa); >>> DEFINE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing); >>> >>> I need to further look into it and have some tests on a SMT machine. Would you mind to share >>> the kernel config as well? I'd like to compare the config as well. >> >> I've attached the kernel config used to build the test kernel >> to this mail. >> >>> Thanks, >>> Yicong >> >> We are trying to debug the issue using perf and find an optimal >> arrangement of the per cpu declarations to get the relevant data >> used in the wakeup path on the same 64B cache line. > > A check of perf c2c profile difference between tip and the move new declarations to > the top case could be useful. It may give some additional clues of possibel > false sharing issues. Thank you for the suggestion. We are currently looking at perf counter data to see how the cache efficiency has changed between the two kernels. We suspect that the need for the data in the other cache line too in the wakeup path is resulting in higher cache misses in the levels closer to the core. I don't think it is a false sharing problem as these per CPU data are set when the system first boots up and will only be change again during a CPU hotplug operation. However, it might be beneficial to see the c2c profile if perf counters don't indicate anything out of the ordinary. > > Tim > >> >> We'll keep you posted of out finding. Let me know if you need >> anything else in the meantime. >> -- >> Thanks and Regards, >> Prateek > -- Thanks and Regards, Prateek _______________________________________________ linux-arm-kernel mailing list linux-arm-kernel@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-arm-kernel ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH v4 1/2] sched: Add per_cpu cluster domain info and cpus_share_resources API 2022-06-20 11:20 ` K Prateek Nayak @ 2022-06-20 13:37 ` Abel Wu 2022-06-28 11:55 ` K Prateek Nayak 0 siblings, 1 reply; 23+ messages in thread From: Abel Wu @ 2022-06-20 13:37 UTC (permalink / raw) To: K Prateek Nayak, Tim Chen, Yicong Yang, Yicong Yang, peterz, mingo, juri.lelli, vincent.guittot, gautham.shenoy, linux-kernel, linux-arm-kernel Cc: dietmar.eggemann, rostedt, bsegall, bristot, prime.zeng, jonathan.cameron, ego, srikar, linuxarm, 21cnbao, guodong.xu, hesham.almatary, john.garry, shenyang39, feng.tang On 6/20/22 7:20 PM, K Prateek Nayak Wrote: > Hello Tim, > > Thank you for looking into this. > > On 6/17/2022 10:20 PM, Tim Chen wrote: >> On Fri, 2022-06-17 at 17:50 +0530, K Prateek Nayak wrote: >>> >>> >>> -- >>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h >>> index e9f3dc6dcbf4..97a3895416ab 100644 >>> --- a/kernel/sched/sched.h >>> +++ b/kernel/sched/sched.h >>> @@ -1750,12 +1750,12 @@ static inline struct sched_domain *lowest_flag_domain(int cpu, int flag) >>> return sd; >>> } >>> >>> +DECLARE_PER_CPU(struct sched_domain __rcu *, sd_cluster); >>> +DECLARE_PER_CPU(int, sd_share_id); >>> DECLARE_PER_CPU(struct sched_domain __rcu *, sd_llc); >>> DECLARE_PER_CPU(int, sd_llc_size); >>> DECLARE_PER_CPU(int, sd_llc_id); >>> -DECLARE_PER_CPU(int, sd_share_id); >>> DECLARE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared); >>> -DECLARE_PER_CPU(struct sched_domain __rcu *, sd_cluster); >>> DECLARE_PER_CPU(struct sched_domain __rcu *, sd_numa); >>> DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing); >>> DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_cpucapacity); >>> -- >>> >>> The System-map of each kernel is as follows: >>> >>> - On "tip" >>> >>> 0000000000020518 D sd_asym_cpucapacity >>> 0000000000020520 D sd_asym_packing >>> 0000000000020528 D sd_numa >>> 0000000000020530 D sd_llc_shared >>> 0000000000020538 D sd_llc_id >>> 000000000002053c D sd_llc_size >>> -------------------------------------------- 64B Cacheline Boundary >>> 0000000000020540 D sd_llc >>> >>> - On "tip + Patch 1 only" and "tip + both patches" >>> >>> 0000000000020518 D sd_asym_cpucapacity >>> 0000000000020520 D sd_asym_packing >>> 0000000000020528 D sd_numa >>> 0000000000020530 D sd_cluster <----- >>> 0000000000020538 D sd_llc_shared >>> -------------------------------------------- 64B Cacheline Boundary >>> 0000000000020540 D sd_share_id <----- >>> 0000000000020544 D sd_llc_id >>> 0000000000020548 D sd_llc_size >>> 0000000000020550 D sd_llc >>> >>> >>> - On "tip + both patches (Move declaration to top)" >>> >>> 0000000000020518 D sd_asym_cpucapacity >>> 0000000000020520 D sd_asym_packing >>> 0000000000020528 D sd_numa >>> 0000000000020530 D sd_llc_shared >>> 0000000000020538 D sd_llc_id >>> 000000000002053c D sd_llc_size >>> -------------------------------------------- 64B Cacheline Boundary >>> 0000000000020540 D sd_llc >> >> Wonder if it will help to try keep sd_llc and sd_llc_size into the same >> cache line. They are both used in the wake up path. > > We are still evaluating keeping which set of variables on the same > cache line will provide the best results. > > I would have expected the two kernel variants - "tip" and the > "tip + both patches (Move declaration to top)" - to give similar results > as their System map for all the old variables remain the same and the > addition of "sd_share_id" and "sd_cluster: fit in the gap after "sd_llc". 
> However, now we see a regression for higher number of client. > > This probably hints that access to "sd_cluster" variable in Patch 2 and > bringing in the extra cache line could be responsible for the regression > we see with "tip + both patches (Move declaration to top)" > >> >> >>> 0000000000020548 D sd_share_id <----- >>> 0000000000020550 D sd_cluster <----- >>> >>>> Or change the layout a bit to see if there's any difference, >>>> like: >>>> >>>> DEFINE_PER_CPU(struct sched_domain __rcu *, sd_llc); >>>> DEFINE_PER_CPU(int, sd_llc_size); >>>> DEFINE_PER_CPU(int, sd_llc_id); >>>> DEFINE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared); >>>> +DEFINE_PER_CPU(int, sd_share_id); >>>> +DEFINE_PER_CPU(struct sched_domain __rcu *, sd_cluster); >>>> DEFINE_PER_CPU(struct sched_domain __rcu *, sd_numa); >>>> DEFINE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing); >>>> >>>> I need to further look into it and have some tests on a SMT machine. Would you mind to share >>>> the kernel config as well? I'd like to compare the config as well. >>> >>> I've attached the kernel config used to build the test kernel >>> to this mail. >>> >>>> Thanks, >>>> Yicong >>> >>> We are trying to debug the issue using perf and find an optimal >>> arrangement of the per cpu declarations to get the relevant data >>> used in the wakeup path on the same 64B cache line. >> >> A check of perf c2c profile difference between tip and the move new declarations to >> the top case could be useful. It may give some additional clues of possibel >> false sharing issues. > > Thank you for the suggestion. We are currently looking at perf counter > data to see how the cache efficiency has changed between the two kernels. > We suspect that the need for the data in the other cache line too in the > wakeup path is resulting in higher cache misses in the levels closer to > the core. > > I don't think it is a false sharing problem as these per CPU data are > set when the system first boots up and will only be change again during > a CPU hotplug operation. However, it might be beneficial to see the c2c > profile if perf counters don't indicate anything out of the ordinary. > Would it be possible if any other frequent-written variables share same cacheline with these per-cpu data causing false sharing? What about making all these sd_* data DEFINE_PER_CPU_READ_MOSTLY? _______________________________________________ linux-arm-kernel mailing list linux-arm-kernel@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-arm-kernel ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH v4 1/2] sched: Add per_cpu cluster domain info and cpus_share_resources API 2022-06-20 13:37 ` Abel Wu @ 2022-06-28 11:55 ` K Prateek Nayak 2022-06-29 3:22 ` Yicong Yang 0 siblings, 1 reply; 23+ messages in thread From: K Prateek Nayak @ 2022-06-28 11:55 UTC (permalink / raw) To: Abel Wu, Tim Chen, Yicong Yang, Yicong Yang, peterz, mingo, juri.lelli, vincent.guittot, gautham.shenoy, linux-kernel, linux-arm-kernel Cc: dietmar.eggemann, rostedt, bsegall, bristot, prime.zeng, jonathan.cameron, ego, srikar, linuxarm, 21cnbao, guodong.xu, hesham.almatary, john.garry, shenyang39, feng.tang Hello Abel, Thank you for looking into this issue. tl;dr We have found two variables that need to be co-located on the same cache line to restore tbench performance. As this issue is unrelated to the functionality of the patch, it should not hold this patch series from landing given the performance improvements seen on systems with CPU clusters. The results of our analysis are discussed in detail below. On 6/20/2022 7:07 PM, Abel Wu wrote: > > On 6/20/22 7:20 PM, K Prateek Nayak Wrote: >> Hello Tim, >> >> Thank you for looking into this. >> >> On 6/17/2022 10:20 PM, Tim Chen wrote: >>> On Fri, 2022-06-17 at 17:50 +0530, K Prateek Nayak wrote: >>>> >>>> >>>> -- >>>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h >>>> index e9f3dc6dcbf4..97a3895416ab 100644 >>>> --- a/kernel/sched/sched.h >>>> +++ b/kernel/sched/sched.h >>>> @@ -1750,12 +1750,12 @@ static inline struct sched_domain *lowest_flag_domain(int cpu, int flag) >>>> return sd; >>>> } >>>> +DECLARE_PER_CPU(struct sched_domain __rcu *, sd_cluster); >>>> +DECLARE_PER_CPU(int, sd_share_id); >>>> DECLARE_PER_CPU(struct sched_domain __rcu *, sd_llc); >>>> DECLARE_PER_CPU(int, sd_llc_size); >>>> DECLARE_PER_CPU(int, sd_llc_id); >>>> -DECLARE_PER_CPU(int, sd_share_id); >>>> DECLARE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared); >>>> -DECLARE_PER_CPU(struct sched_domain __rcu *, sd_cluster); >>>> DECLARE_PER_CPU(struct sched_domain __rcu *, sd_numa); >>>> DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing); >>>> DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_cpucapacity); >>>> -- >>>> >>>> The System-map of each kernel is as follows: >>>> >>>> - On "tip" >>>> >>>> 0000000000020518 D sd_asym_cpucapacity >>>> 0000000000020520 D sd_asym_packing >>>> 0000000000020528 D sd_numa >>>> 0000000000020530 D sd_llc_shared >>>> 0000000000020538 D sd_llc_id >>>> 000000000002053c D sd_llc_size >>>> -------------------------------------------- 64B Cacheline Boundary >>>> 0000000000020540 D sd_llc >>>> >>>> - On "tip + Patch 1 only" and "tip + both patches" >>>> >>>> 0000000000020518 D sd_asym_cpucapacity >>>> 0000000000020520 D sd_asym_packing >>>> 0000000000020528 D sd_numa >>>> 0000000000020530 D sd_cluster <----- >>>> 0000000000020538 D sd_llc_shared >>>> -------------------------------------------- 64B Cacheline Boundary >>>> 0000000000020540 D sd_share_id <----- >>>> 0000000000020544 D sd_llc_id >>>> 0000000000020548 D sd_llc_size >>>> 0000000000020550 D sd_llc >>>> >>>> >>>> - On "tip + both patches (Move declaration to top)" >>>> >>>> 0000000000020518 D sd_asym_cpucapacity >>>> 0000000000020520 D sd_asym_packing >>>> 0000000000020528 D sd_numa >>>> 0000000000020530 D sd_llc_shared >>>> 0000000000020538 D sd_llc_id >>>> 000000000002053c D sd_llc_size >>>> -------------------------------------------- 64B Cacheline Boundary >>>> 0000000000020540 D sd_llc >>> >>> Wonder if it will help to try keep sd_llc and sd_llc_size 
into the same >>> cache line. They are both used in the wake up path. >> >> We are still evaluating keeping which set of variables on the same >> cache line will provide the best results. >> >> I would have expected the two kernel variants - "tip" and the >> "tip + both patches (Move declaration to top)" - to give similar results >> as their System map for all the old variables remain the same and the >> addition of "sd_share_id" and "sd_cluster: fit in the gap after "sd_llc". >> However, now we see a regression for higher number of client. >> >> This probably hints that access to "sd_cluster" variable in Patch 2 and >> bringing in the extra cache line could be responsible for the regression >> we see with "tip + both patches (Move declaration to top)" >> >>> >>> >>>> 0000000000020548 D sd_share_id <----- >>>> 0000000000020550 D sd_cluster <----- >>>> >>>>> Or change the layout a bit to see if there's any difference, >>>>> like: >>>>> >>>>> DEFINE_PER_CPU(struct sched_domain __rcu *, sd_llc); >>>>> DEFINE_PER_CPU(int, sd_llc_size); >>>>> DEFINE_PER_CPU(int, sd_llc_id); >>>>> DEFINE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared); >>>>> +DEFINE_PER_CPU(int, sd_share_id); >>>>> +DEFINE_PER_CPU(struct sched_domain __rcu *, sd_cluster); >>>>> DEFINE_PER_CPU(struct sched_domain __rcu *, sd_numa); >>>>> DEFINE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing); >>>>> >>>>> I need to further look into it and have some tests on a SMT machine. Would you mind to share >>>>> the kernel config as well? I'd like to compare the config as well. >>>> >>>> I've attached the kernel config used to build the test kernel >>>> to this mail. >>>> >>>>> Thanks, >>>>> Yicong >>>> >>>> We are trying to debug the issue using perf and find an optimal >>>> arrangement of the per cpu declarations to get the relevant data >>>> used in the wakeup path on the same 64B cache line. >>> >>> A check of perf c2c profile difference between tip and the move new declarations to >>> the top case could be useful. It may give some additional clues of possibel >>> false sharing issues. >> >> Thank you for the suggestion. We are currently looking at perf counter >> data to see how the cache efficiency has changed between the two kernels. >> We suspect that the need for the data in the other cache line too in the >> wakeup path is resulting in higher cache misses in the levels closer to >> the core. >> >> I don't think it is a false sharing problem as these per CPU data are >> set when the system first boots up and will only be change again during >> a CPU hotplug operation. However, it might be beneficial to see the c2c >> profile if perf counters don't indicate anything out of the ordinary. >> > > Would it be possible if any other frequent-written variables share > same cacheline with these per-cpu data causing false sharing? This indeed seems to be the case. I'll leave the detailed analysis below. > What > about making all these sd_* data DEFINE_PER_CPU_READ_MOSTLY? > Making all the sd_* variables DEFINE_PER_CPU_READ_MOSTLY or placing all the sd_* variables on the same cache line doesn't help the regression. In fact, it makes it worse. 
Following are the results on different test kernels: tip - 5.19.0-rc2 tip (commit: f3dd3f674555 "sched: Remove the limitation of WF_ON_CPU on wakelist if wakee cpu is idle") cluster - tip + both the patches of the series Patch1 - tip + only Patch 1 align_first (Patch 1) - tip + only Patch 1 + all sd_* variables in same cache line per_cpu_aigned_struct (Patch 1) - tip + only Patch 1 + all sd_* variables part of a per_cpu struct which is cacheline aligned Clients: tip cluster Patch1 align_first (Patch 1) per_cpu_aigned_struct (Patch 1) 8 3263.81 (0.00 pct) 3086.81 (-5.42 pct) 3018.63 (-7.51 pct) 2993.65 (-8.27 pct) 1728.89 (-47.02 pct) 16 6011.19 (0.00 pct) 5360.28 (-10.82 pct) 4869.26 (-18.99 pct) 4820.18 (-19.81 pct) 3840.64 (-36.10 pct) 32 12058.31 (0.00 pct) 8769.08 (-27.27 pct) 8159.60 (-32.33 pct) 7868.82 (-34.74 pct) 7130.19 (-40.86 pct) 64 21258.21 (0.00 pct) 19021.09 (-10.52 pct) 13161.92 (-38.08 pct) 12327.86 (-42.00 pct) 12572.70 (-40.85 pct) Following is the system map for the kernel variant "align_first (Patch 1)": -- 00000000000204c0 d sugov_cpu ------------------------------------------------ 20500 (Cache Line Start) 0000000000020508 d root_cpuacct_cpuusage 0000000000020510 D cpufreq_update_util_data ------------------------------------------------ 20540 (Cache Line Start) 0000000000020540 D sd_asym_cpucapacity 0000000000020548 D sd_asym_packing 0000000000020550 D sd_numa 0000000000020558 D sd_cluster 0000000000020560 D sd_llc_shared 0000000000020568 D sd_share_id 000000000002056c D sd_llc_id 0000000000020570 D sd_llc_size 0000000000020578 D sd_llc ------------------------------------------------ 20580 (Cache Line Start) 0000000000020580 d wake_up_klogd_work 00000000000205a0 d printk_pending 00000000000205a4 d printk_count_nmi 00000000000205a5 d printk_count 00000000000205a8 d printk_context ------------------------------------------------ 205c0 (Cache Line Start) 00000000000205c0 d rcu_tasks_trace__percpu -- At this point it was clear that one or more sd_* variable needs to be co-located with the per CPU variables in cache line starting at 20540. We began moving variable out of the cache line one by one to see which variable makes the difference as we found out that as long as root_cpuacct_cpuusage and sd_llc_id are on the same cache line, the results were equivalent of what we saw on the tip. As both the variables seem to be accesses very frequently, access to one will prime the cache line containing the other variable as well leading to better cache efficiency. 
Placing root_cpuacct_cpuusage, sd_llc_id, sd_share_id, sd_llc_shared and sd_cluster on the same cache line, the results are as follows: Kernel versions: tip - 5.19.0-rc2 tip cluster - tip + both the patches of the series cluster (Custom Layout) - tip + both the patches of the series + reworked system map cluster (Custom Layout) + SIS_UTIL - cluster (Custom Layout) + v4 of SIS_UTIL patchset by Chenyu (https://lore.kernel.org/lkml/20220612163428.849378-1-yu.c.chen@intel.com/) Clients: tip cluster cluster (Custom Layout) cluster (Custom Layout) + SIS Util 1 444.41 (0.00 pct) 439.27 (-1.15 pct) 438.06 (-1.42 pct) 447.75 (0.75 pct) 2 879.23 (0.00 pct) 831.49 (-5.42 pct) 846.98 (-3.66 pct) 871.64 (-0.86 pct) 4 1648.83 (0.00 pct) 1608.07 (-2.47 pct) 1621.38 (-1.66 pct) 1656.34 (0.45 pct) 8 3263.81 (0.00 pct) 3086.81 (-5.42 pct) 3103.40 (-4.91 pct) * 3227.88 (-1.10 pct) 16 6011.19 (0.00 pct) 5360.28 (-10.82 pct) 5838.04 (-2.88 pct) 6232.92 (3.68 pct) 32 12058.31 (0.00 pct) 8769.08 (-27.27 pct) 11577.73 (-3.98 pct) 11774.10 (-2.35 pct) 64 21258.21 (0.00 pct) 19021.09 (-10.52 pct) 19563.57 (-7.97 pct) * 22044.93 (3.70 pct) 128 30795.27 (0.00 pct) 30861.34 (0.21 pct) 31705.47 (2.95 pct) 28986.14 (-5.87 pct) * 256 25138.43 (0.00 pct) 24711.90 (-1.69 pct) 23929.42 (-4.80 pct) * 43984.52 (74.96 pct) [Known to be unstable without SIS_UTIL] 512 51287.93 (0.00 pct) 51855.55 (1.10 pct) 52278.33 (1.93 pct) 51511.51 (0.43 pct) 1024 53176.97 (0.00 pct) 52554.55 (-1.17 pct) 52995.27 (-0.34 pct) 52807.04 (-0.69 pct) Chenyu's SIS_UTIL patch was merged today in the tip (commit: 70fb5ccf2ebb "sched/fair: Introduce SIS_UTIL to search idle CPU based on sum of util_avg") and seems to bring back performance to the same level as seen on the tip used during our testing. The system map of the tested configuration labelled "Custom Layout" is as follows: -- 00000000000204c0 d sugov_cpu ------------------------------------------------ 20500 (Cache Line Start) 0000000000020508 d root_cpuacct_cpuusage 0000000000020510 D cpufreq_update_util_data 0000000000020518 D sd_llc_id 000000000002051c D sd_share_id 0000000000020520 D sd_llc_shared 0000000000020528 D sd_cluster 0000000000020530 D sd_asym_cpucapacity 0000000000020538 D sd_asym_packing ------------------------------------------------ 20540 (Cache Line Start) 0000000000020540 D sd_numa 0000000000020548 D sd_llc_size 0000000000020550 D sd_llc 0000000000020560 d wake_up_klogd_work ------------------------------------------------ 20580 (Cache Line Start) 0000000000020580 d printk_pending 0000000000020584 d printk_count_nmi 0000000000020585 d printk_count 0000000000020588 d printk_context ------------------------------------------------ 205c0 (Cache Line Start) 00000000000205c0 d rcu_tasks_trace__percpu -- The System-map is however dependent on multiple factors such as config options enabled, etc. which can change from build to build. Finding a permanent solution to the issue might take some more time. Meanwhile, as this is an issue unrelated to the functionality of this patch, it should not block the landing of this patch series. Thank you everyone for your pointers and patience during this debug. Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> -- Thanks and Regards, Prateek _______________________________________________ linux-arm-kernel mailing list linux-arm-kernel@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-arm-kernel ^ permalink raw reply [flat|nested] 23+ messages in thread
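[For completeness, a rough sketch of the "per_cpu_aigned_struct" variant listed in the results above, i.e. packing the sd_* fields into one cacheline-aligned per-cpu struct. The structure actually benchmarked is not posted in the thread, so the field set and names below are assumptions for illustration only.]

/* Sketch only -- not the struct that was benchmarked above. */
struct sched_domain_ptrs {
	struct sched_domain __rcu	*sd_llc;
	int				sd_llc_size;
	int				sd_llc_id;
	int				sd_share_id;
	struct sched_domain_shared __rcu *sd_llc_shared;
	struct sched_domain __rcu	*sd_cluster;
	struct sched_domain __rcu	*sd_numa;
	struct sched_domain __rcu	*sd_asym_packing;
	struct sched_domain __rcu	*sd_asym_cpucapacity;
} ____cacheline_aligned;

DECLARE_PER_CPU(struct sched_domain_ptrs, sd_ptrs);	/* kernel/sched/sched.h */
DEFINE_PER_CPU(struct sched_domain_ptrs, sd_ptrs);	/* kernel/sched/topology.c */

/* accesses such as per_cpu(sd_llc_id, cpu) would become per_cpu(sd_ptrs, cpu).sd_llc_id */

[Judging by the numbers above, forcing this kind of alignment separates sd_llc_id from root_cpuacct_cpuusage and performs worse, which is consistent with the co-location finding described in this mail.]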
* Re: [PATCH v4 1/2] sched: Add per_cpu cluster domain info and cpus_share_resources API 2022-06-28 11:55 ` K Prateek Nayak @ 2022-06-29 3:22 ` Yicong Yang 0 siblings, 0 replies; 23+ messages in thread From: Yicong Yang @ 2022-06-29 3:22 UTC (permalink / raw) To: K Prateek Nayak, Abel Wu, Tim Chen, Yicong Yang, peterz, mingo, juri.lelli, vincent.guittot, gautham.shenoy, linux-kernel, linux-arm-kernel Cc: dietmar.eggemann, rostedt, bsegall, bristot, prime.zeng, jonathan.cameron, ego, srikar, linuxarm, 21cnbao, guodong.xu, hesham.almatary, john.garry, shenyang39, feng.tang On 2022/6/28 19:55, K Prateek Nayak wrote: > Hello Abel, > > Thank you for looking into this issue. > > tl;dr > > We have found two variables that need to be > co-located on the same cache line to restore tbench > performance. > > As this issue is unrelated to the functionality of > the patch, it should not hold this patch series from > landing given the performance improvements seen on > systems with CPU clusters. > > The results of our analysis are discussed in > detail below. > > On 6/20/2022 7:07 PM, Abel Wu wrote: >> >> On 6/20/22 7:20 PM, K Prateek Nayak Wrote: >>> Hello Tim, >>> >>> Thank you for looking into this. >>> >>> On 6/17/2022 10:20 PM, Tim Chen wrote: >>>> On Fri, 2022-06-17 at 17:50 +0530, K Prateek Nayak wrote: >>>>> >>>>> >>>>> -- >>>>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h >>>>> index e9f3dc6dcbf4..97a3895416ab 100644 >>>>> --- a/kernel/sched/sched.h >>>>> +++ b/kernel/sched/sched.h >>>>> @@ -1750,12 +1750,12 @@ static inline struct sched_domain *lowest_flag_domain(int cpu, int flag) >>>>> return sd; >>>>> } >>>>> +DECLARE_PER_CPU(struct sched_domain __rcu *, sd_cluster); >>>>> +DECLARE_PER_CPU(int, sd_share_id); >>>>> DECLARE_PER_CPU(struct sched_domain __rcu *, sd_llc); >>>>> DECLARE_PER_CPU(int, sd_llc_size); >>>>> DECLARE_PER_CPU(int, sd_llc_id); >>>>> -DECLARE_PER_CPU(int, sd_share_id); >>>>> DECLARE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared); >>>>> -DECLARE_PER_CPU(struct sched_domain __rcu *, sd_cluster); >>>>> DECLARE_PER_CPU(struct sched_domain __rcu *, sd_numa); >>>>> DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing); >>>>> DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_cpucapacity); >>>>> -- >>>>> >>>>> The System-map of each kernel is as follows: >>>>> >>>>> - On "tip" >>>>> >>>>> 0000000000020518 D sd_asym_cpucapacity >>>>> 0000000000020520 D sd_asym_packing >>>>> 0000000000020528 D sd_numa >>>>> 0000000000020530 D sd_llc_shared >>>>> 0000000000020538 D sd_llc_id >>>>> 000000000002053c D sd_llc_size >>>>> -------------------------------------------- 64B Cacheline Boundary >>>>> 0000000000020540 D sd_llc >>>>> >>>>> - On "tip + Patch 1 only" and "tip + both patches" >>>>> >>>>> 0000000000020518 D sd_asym_cpucapacity >>>>> 0000000000020520 D sd_asym_packing >>>>> 0000000000020528 D sd_numa >>>>> 0000000000020530 D sd_cluster <----- >>>>> 0000000000020538 D sd_llc_shared >>>>> -------------------------------------------- 64B Cacheline Boundary >>>>> 0000000000020540 D sd_share_id <----- >>>>> 0000000000020544 D sd_llc_id >>>>> 0000000000020548 D sd_llc_size >>>>> 0000000000020550 D sd_llc >>>>> >>>>> >>>>> - On "tip + both patches (Move declaration to top)" >>>>> >>>>> 0000000000020518 D sd_asym_cpucapacity >>>>> 0000000000020520 D sd_asym_packing >>>>> 0000000000020528 D sd_numa >>>>> 0000000000020530 D sd_llc_shared >>>>> 0000000000020538 D sd_llc_id >>>>> 000000000002053c D sd_llc_size >>>>> -------------------------------------------- 
64B Cacheline Boundary >>>>> 0000000000020540 D sd_llc >>>> >>>> Wonder if it will help to try keep sd_llc and sd_llc_size into the same >>>> cache line. They are both used in the wake up path. >>> >>> We are still evaluating keeping which set of variables on the same >>> cache line will provide the best results. >>> >>> I would have expected the two kernel variants - "tip" and the >>> "tip + both patches (Move declaration to top)" - to give similar results >>> as their System map for all the old variables remain the same and the >>> addition of "sd_share_id" and "sd_cluster: fit in the gap after "sd_llc". >>> However, now we see a regression for higher number of client. >>> >>> This probably hints that access to "sd_cluster" variable in Patch 2 and >>> bringing in the extra cache line could be responsible for the regression >>> we see with "tip + both patches (Move declaration to top)" >>> >>>> >>>> >>>>> 0000000000020548 D sd_share_id <----- >>>>> 0000000000020550 D sd_cluster <----- >>>>> >>>>>> Or change the layout a bit to see if there's any difference, >>>>>> like: >>>>>> >>>>>> DEFINE_PER_CPU(struct sched_domain __rcu *, sd_llc); >>>>>> DEFINE_PER_CPU(int, sd_llc_size); >>>>>> DEFINE_PER_CPU(int, sd_llc_id); >>>>>> DEFINE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared); >>>>>> +DEFINE_PER_CPU(int, sd_share_id); >>>>>> +DEFINE_PER_CPU(struct sched_domain __rcu *, sd_cluster); >>>>>> DEFINE_PER_CPU(struct sched_domain __rcu *, sd_numa); >>>>>> DEFINE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing); >>>>>> >>>>>> I need to further look into it and have some tests on a SMT machine. Would you mind to share >>>>>> the kernel config as well? I'd like to compare the config as well. >>>>> >>>>> I've attached the kernel config used to build the test kernel >>>>> to this mail. >>>>> >>>>>> Thanks, >>>>>> Yicong >>>>> >>>>> We are trying to debug the issue using perf and find an optimal >>>>> arrangement of the per cpu declarations to get the relevant data >>>>> used in the wakeup path on the same 64B cache line. >>>> >>>> A check of perf c2c profile difference between tip and the move new declarations to >>>> the top case could be useful. It may give some additional clues of possibel >>>> false sharing issues. >>> >>> Thank you for the suggestion. We are currently looking at perf counter >>> data to see how the cache efficiency has changed between the two kernels. >>> We suspect that the need for the data in the other cache line too in the >>> wakeup path is resulting in higher cache misses in the levels closer to >>> the core. >>> >>> I don't think it is a false sharing problem as these per CPU data are >>> set when the system first boots up and will only be change again during >>> a CPU hotplug operation. However, it might be beneficial to see the c2c >>> profile if perf counters don't indicate anything out of the ordinary. >>> >> >> Would it be possible if any other frequent-written variables share >> same cacheline with these per-cpu data causing false sharing? > > This indeed seems to be the case. I'll leave the > detailed analysis below. > The percpu variables seems won't be affected by the non-percpu variables as the percpu section is already cacheline aligned. arm64's vmlinux.ld.S shows PERCPU_SECTION(L1_CACHE_BYTES). >> What >> about making all these sd_* data DEFINE_PER_CPU_READ_MOSTLY? >> > > Making all the sd_* variables DEFINE_PER_CPU_READ_MOSTLY or > placing all the sd_* variables on the same cache line doesn't > help the regression. In fact, it makes it worse. 
> > Following are the results on different test kernels: > tip - 5.19.0-rc2 tip > (commit: f3dd3f674555 "sched: Remove the limitation of WF_ON_CPU on wakelist if wakee cpu is idle") > cluster - tip + both the patches of the series > Patch1 - tip + only Patch 1 > align_first (Patch 1) - tip + only Patch 1 + all sd_* variables in same cache line > per_cpu_aigned_struct (Patch 1) - tip + only Patch 1 + all sd_* variables part of a per_cpu struct which is cacheline aligned > > Clients: tip cluster Patch1 align_first (Patch 1) per_cpu_aigned_struct (Patch 1) > 8 3263.81 (0.00 pct) 3086.81 (-5.42 pct) 3018.63 (-7.51 pct) 2993.65 (-8.27 pct) 1728.89 (-47.02 pct) > 16 6011.19 (0.00 pct) 5360.28 (-10.82 pct) 4869.26 (-18.99 pct) 4820.18 (-19.81 pct) 3840.64 (-36.10 pct) > 32 12058.31 (0.00 pct) 8769.08 (-27.27 pct) 8159.60 (-32.33 pct) 7868.82 (-34.74 pct) 7130.19 (-40.86 pct) > 64 21258.21 (0.00 pct) 19021.09 (-10.52 pct) 13161.92 (-38.08 pct) 12327.86 (-42.00 pct) 12572.70 (-40.85 pct) > > Following is the system map for the kernel variant "align_first (Patch 1)": > > -- > 00000000000204c0 d sugov_cpu > ------------------------------------------------ 20500 (Cache Line Start) > 0000000000020508 d root_cpuacct_cpuusage > 0000000000020510 D cpufreq_update_util_data > ------------------------------------------------ 20540 (Cache Line Start) > 0000000000020540 D sd_asym_cpucapacity > 0000000000020548 D sd_asym_packing > 0000000000020550 D sd_numa > 0000000000020558 D sd_cluster > 0000000000020560 D sd_llc_shared > 0000000000020568 D sd_share_id > 000000000002056c D sd_llc_id > 0000000000020570 D sd_llc_size > 0000000000020578 D sd_llc > ------------------------------------------------ 20580 (Cache Line Start) > 0000000000020580 d wake_up_klogd_work > 00000000000205a0 d printk_pending > 00000000000205a4 d printk_count_nmi > 00000000000205a5 d printk_count > 00000000000205a8 d printk_context > ------------------------------------------------ 205c0 (Cache Line Start) > 00000000000205c0 d rcu_tasks_trace__percpu > -- > > At this point it was clear that one or more sd_* variable needs > to be co-located with the per CPU variables in cache line starting > at 20540. We began moving variable out of the cache line one by one > to see which variable makes the difference as we found out that as > long as root_cpuacct_cpuusage and sd_llc_id are on the same cache > line, the results were equivalent of what we saw on the tip. As both > the variables seem to be accesses very frequently, access to one will > prime the cache line containing the other variable as well leading to > better cache efficiency. 
> > Placing root_cpuacct_cpuusage, sd_llc_id, sd_share_id, sd_llc_shared > and sd_cluster on the same cache line, the results are as follows: > > Kernel versions: > tip - 5.19.0-rc2 tip > cluster - tip + both the patches of the series > cluster (Custom Layout) - tip + both the patches of the series + reworked system map > cluster (Custom Layout) + SIS_UTIL - cluster (Custom Layout) + v4 of SIS_UTIL patchset by Chenyu > (https://lore.kernel.org/lkml/20220612163428.849378-1-yu.c.chen@intel.com/) > > Clients: tip cluster cluster (Custom Layout) cluster (Custom Layout) > + SIS Util > 1 444.41 (0.00 pct) 439.27 (-1.15 pct) 438.06 (-1.42 pct) 447.75 (0.75 pct) > 2 879.23 (0.00 pct) 831.49 (-5.42 pct) 846.98 (-3.66 pct) 871.64 (-0.86 pct) > 4 1648.83 (0.00 pct) 1608.07 (-2.47 pct) 1621.38 (-1.66 pct) 1656.34 (0.45 pct) > 8 3263.81 (0.00 pct) 3086.81 (-5.42 pct) 3103.40 (-4.91 pct) * 3227.88 (-1.10 pct) > 16 6011.19 (0.00 pct) 5360.28 (-10.82 pct) 5838.04 (-2.88 pct) 6232.92 (3.68 pct) > 32 12058.31 (0.00 pct) 8769.08 (-27.27 pct) 11577.73 (-3.98 pct) 11774.10 (-2.35 pct) > 64 21258.21 (0.00 pct) 19021.09 (-10.52 pct) 19563.57 (-7.97 pct) * 22044.93 (3.70 pct) > 128 30795.27 (0.00 pct) 30861.34 (0.21 pct) 31705.47 (2.95 pct) 28986.14 (-5.87 pct) * > 256 25138.43 (0.00 pct) 24711.90 (-1.69 pct) 23929.42 (-4.80 pct) * 43984.52 (74.96 pct) [Known to be unstable without SIS_UTIL] > 512 51287.93 (0.00 pct) 51855.55 (1.10 pct) 52278.33 (1.93 pct) 51511.51 (0.43 pct) > 1024 53176.97 (0.00 pct) 52554.55 (-1.17 pct) 52995.27 (-0.34 pct) 52807.04 (-0.69 pct) > > Chenyu's SIS_UTIL patch was merged today in the tip > (commit: 70fb5ccf2ebb "sched/fair: Introduce SIS_UTIL to search idle CPU based on sum of util_avg") > and seems to bring back performance to the same level > as seen on the tip used during our testing. > > The system map of the tested configuration labelled "Custom Layout" > is as follows: > > -- > 00000000000204c0 d sugov_cpu > ------------------------------------------------ 20500 (Cache Line Start) > 0000000000020508 d root_cpuacct_cpuusage > 0000000000020510 D cpufreq_update_util_data > 0000000000020518 D sd_llc_id > 000000000002051c D sd_share_id > 0000000000020520 D sd_llc_shared > 0000000000020528 D sd_cluster > 0000000000020530 D sd_asym_cpucapacity > 0000000000020538 D sd_asym_packing > ------------------------------------------------ 20540 (Cache Line Start) > 0000000000020540 D sd_numa > 0000000000020548 D sd_llc_size > 0000000000020550 D sd_llc > 0000000000020560 d wake_up_klogd_work > ------------------------------------------------ 20580 (Cache Line Start) > 0000000000020580 d printk_pending > 0000000000020584 d printk_count_nmi > 0000000000020585 d printk_count > 0000000000020588 d printk_context > ------------------------------------------------ 205c0 (Cache Line Start) > 00000000000205c0 d rcu_tasks_trace__percpu > -- > Thanks for the data. It looks sufficient to illustrate the issue. I'm still a little curious that is it specific on Zen 3? Since I cannot get the Zen 3 machine so I tried to reproduce this on the a AMD 7742 (2 x 64C/128T) and an Intel 6148 (2 x 40C/80T) machine with your config based on tip/sche/core of commit f3dd3f674555 "sched: Remove the limitation of WF_ON_CPU on wakelist if wakee cpu is idle". With cluster's percpu definition patch(the worst case) the root_cpuacct_cpuusage and sd_llc_id are separated but make little difference to the tbench. 
For 7742 with NPS = 1: tip/sched/core tip/sched/core-cls-percpu threads iter-1 iter-2 iter-3 avg iter-1 iter-2 iter-3 avg 1 287.254 285.252 288.29 286.93 288.105 285.884 286.164 286.72 (-0.075%) 2 573.623 570.326 573.67 572.54 577.242 574.257 575.611 575.70 (0.553%) 4 1140.01 1135.54 1137.9 1137.82 1153.45 1146.66 1149.8 1149.97 (1.068%) 8 2232.73 2228.71 2237.47 2232.97 2280.83 2291.94 2287.55 2286.77 (2.409%) 16 4510.65 4465.2 4522.22 4499.36 4551.19 4573.85 4575.78 4566.94 (1.502%) 32 8432.43 8542.35 8404.7 8459.83 8627.4 8712.29 8764.53 8701.41 (2.856%) 64 16404.2 16464.7 15686.4 16185.10 15878.3 15638.3 15735.6 15750.73 (-2.684%) 128 23327.6 22767.4 21017.3 22370.77 23774.9 23733.2 21744.9 23084.33 (3.190%) layout of core-cls-percpu: 0000000000020508 d root_cpuacct_cpuusage 0000000000020510 D cpufreq_update_util_data 0000000000020518 D sd_asym_cpucapacity 0000000000020520 D sd_asym_packing 0000000000020528 D sd_numa 0000000000020530 D sd_cluster 0000000000020538 D sd_llc_shared ---------------------------------boundary 0000000000020540 D sd_share_id 0000000000020544 D sd_llc_id 0000000000020548 D sd_llc_size 0000000000020550 D sd_llc For 6148 with SNC disabled: tip/sched/core tip/sched/core-cls-percpu 1 314.44 ( 0.00%) 316.68 * 0.71%* 2 630.22 ( 0.00%) 633.84 * 0.57%* 4 1260.70 ( 0.00%) 1267.32 * 0.52%* 8 2475.31 ( 0.00%) 2485.94 * 0.43%* 16 4684.03 ( 0.00%) 4682.21 * -0.04%* 32 7496.28 ( 0.00%) 7360.41 * -1.81%* 64 7981.56 ( 0.00%) 7851.73 * -1.63%* 128 15359.81 ( 0.00%) 15255.68 * -0.68%* 256 15104.31 ( 0.00%) 15253.85 * 0.99%* 320 15146.47 ( 0.00%) 15151.39 * 0.03%* layout of core-cls-percpu: 0000000000020508 d root_cpuacct_cpuusage 0000000000020510 D cpufreq_update_util_data 0000000000020518 D sd_asym_cpucapacity 0000000000020520 D sd_asym_packing 0000000000020528 D sd_numa 0000000000020530 D sd_cluster 0000000000020538 D sd_llc_shared ---------------------------------boundary 0000000000020540 D sd_share_id 0000000000020544 D sd_llc_id 0000000000020548 D sd_llc_size 0000000000020550 D sd_llc > The System-map is however dependent on multiple factors > such as config options enabled, etc. which can change from > build to build. > Finding a permanent solution to the issue might take > some more time. > > Meanwhile, as this is an issue unrelated to the > functionality of this patch, it should not block > the landing of this patch series. > Thank you everyone for your pointers and patience > during this debug. > > Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> Thanks. Then I'll respin the patch. Regards, Yicong _______________________________________________ linux-arm-kernel mailing list linux-arm-kernel@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-arm-kernel ^ permalink raw reply [flat|nested] 23+ messages in thread
* [PATCH v4 2/2] sched/fair: Scan cluster before scanning LLC in wake-up path 2022-06-09 12:06 [PATCH v4 0/2] sched/fair: Wake task within the cluster when possible Yicong Yang 2022-06-09 12:06 ` [PATCH v4 1/2] sched: Add per_cpu cluster domain info and cpus_share_resources API Yicong Yang @ 2022-06-09 12:06 ` Yicong Yang 2022-06-09 22:47 ` Tim Chen 2022-06-26 12:13 ` Abel Wu 1 sibling, 2 replies; 23+ messages in thread From: Yicong Yang @ 2022-06-09 12:06 UTC (permalink / raw) To: peterz, mingo, juri.lelli, vincent.guittot, tim.c.chen, gautham.shenoy, linux-kernel, linux-arm-kernel Cc: dietmar.eggemann, rostedt, bsegall, bristot, prime.zeng, yangyicong, jonathan.cameron, ego, srikar, linuxarm, 21cnbao, guodong.xu, hesham.almatary, john.garry, shenyang39 From: Barry Song <song.bao.hua@hisilicon.com> For platforms having clusters like Kunpeng920, CPUs within the same cluster have lower latency when synchronizing and accessing shared resources like cache. Thus, this patch tries to find an idle cpu within the cluster of the target CPU before scanning the whole LLC to gain lower latency. Note neither Kunpeng920 nor x86 Jacobsville supports SMT, so this patch doesn't consider SMT for this moment. Testing has been done on Kunpeng920 by pinning tasks to one numa and two numa. On Kunpeng920, Each numa has 8 clusters and each cluster has 4 CPUs. With this patch, We noticed enhancement on tbench within one numa or cross two numa. On numa 0: 5.19-rc1 patched Hmean 1 350.27 ( 0.00%) 406.88 * 16.16%* Hmean 2 702.01 ( 0.00%) 808.22 * 15.13%* Hmean 4 1405.14 ( 0.00%) 1614.34 * 14.89%* Hmean 8 2830.53 ( 0.00%) 3169.02 * 11.96%* Hmean 16 5597.95 ( 0.00%) 6224.20 * 11.19%* Hmean 32 10537.38 ( 0.00%) 10524.97 * -0.12%* Hmean 64 8366.04 ( 0.00%) 8437.41 * 0.85%* Hmean 128 7060.87 ( 0.00%) 7150.25 * 1.27%* On numa 0-1: 5.19-rc1 patched Hmean 1 346.11 ( 0.00%) 408.47 * 18.02%* Hmean 2 693.34 ( 0.00%) 805.78 * 16.22%* Hmean 4 1384.96 ( 0.00%) 1602.49 * 15.71%* Hmean 8 2699.45 ( 0.00%) 3069.98 * 13.73%* Hmean 16 5327.11 ( 0.00%) 5688.19 * 6.78%* Hmean 32 10019.10 ( 0.00%) 11862.56 * 18.40%* Hmean 64 13850.57 ( 0.00%) 17748.54 * 28.14%* Hmean 128 12498.25 ( 0.00%) 15541.59 * 24.35%* Hmean 256 11195.77 ( 0.00%) 13854.06 * 23.74%* Tested-by: Yicong Yang <yangyicong@hisilicon.com> Signed-off-by: Barry Song <song.bao.hua@hisilicon.com> Signed-off-by: Yicong Yang <yangyicong@hisilicon.com> --- kernel/sched/fair.c | 44 +++++++++++++++++++++++++++++++++++++++++--- 1 file changed, 41 insertions(+), 3 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 77b2048a9326..6d173e196ad3 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -6327,6 +6327,40 @@ static inline int select_idle_smt(struct task_struct *p, struct sched_domain *sd #endif /* CONFIG_SCHED_SMT */ +#ifdef CONFIG_SCHED_CLUSTER +/* + * Scan the cluster domain for idle CPUs and clear cluster cpumask after scanning + */ +static inline int scan_cluster(struct task_struct *p, struct cpumask *cpus, + int target, int *nr) +{ + struct sched_domain *sd = rcu_dereference(per_cpu(sd_cluster, target)); + int cpu, idle_cpu; + + /* TODO: Support SMT system with cluster topology */ + if (!sched_smt_active() && sd) { + for_each_cpu_and(cpu, cpus, sched_domain_span(sd)) { + if (!--*nr) + break; + + idle_cpu = __select_idle_cpu(cpu, p); + if ((unsigned int)idle_cpu < nr_cpumask_bits) + return idle_cpu; + } + + cpumask_andnot(cpus, cpus, sched_domain_span(sd)); + } + + return -1; +} +#else +static inline int scan_cluster(struct task_struct *p, struct 
cpumask *cpus, + int target, int *nr) +{ + return -1; +} +#endif + /* * Scan the LLC domain for idle CPUs; this is dynamically regulated by * comparing the average scan cost (tracked in sd->avg_scan_cost) against the @@ -6375,6 +6409,10 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool time = cpu_clock(this); } + idle_cpu = scan_cluster(p, cpus, target, &nr); + if ((unsigned int)idle_cpu < nr_cpumask_bits) + return idle_cpu; + for_each_cpu_wrap(cpu, cpus, target + 1) { if (has_idle_core) { i = select_idle_core(p, cpu, cpus, &idle_cpu); @@ -6382,7 +6420,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool return i; } else { - if (!--nr) + if (--nr <= 0) return -1; idle_cpu = __select_idle_cpu(cpu, p); if ((unsigned int)idle_cpu < nr_cpumask_bits) @@ -6481,7 +6519,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target) /* * If the previous CPU is cache affine and idle, don't be stupid: */ - if (prev != target && cpus_share_cache(prev, target) && + if (prev != target && cpus_share_resources(prev, target) && (available_idle_cpu(prev) || sched_idle_cpu(prev)) && asym_fits_capacity(task_util, prev)) return prev; @@ -6507,7 +6545,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target) p->recent_used_cpu = prev; if (recent_used_cpu != prev && recent_used_cpu != target && - cpus_share_cache(recent_used_cpu, target) && + cpus_share_resources(recent_used_cpu, target) && (available_idle_cpu(recent_used_cpu) || sched_idle_cpu(recent_used_cpu)) && cpumask_test_cpu(p->recent_used_cpu, p->cpus_ptr) && asym_fits_capacity(task_util, recent_used_cpu)) { -- 2.24.0 _______________________________________________ linux-arm-kernel mailing list linux-arm-kernel@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-arm-kernel ^ permalink raw reply related [flat|nested] 23+ messages in thread
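[One detail of the diff above that is easy to miss: scan_cluster() and the following LLC loop share the same scan budget nr, and the cluster scan may already drive nr down to zero. That is presumably why the LLC loop's exit test changes from !--nr to --nr <= 0; with the old test, a nr that is already zero would decrement to -1 and the scan would keep going. A small userspace toy model of the shared budget and the cluster-first scan order, with made-up CPU numbers and an invented idleness check, not kernel code:]

/* Toy model, illustration only. */
#include <stdio.h>
#include <stdbool.h>

static bool cpu_is_idle(int cpu)
{
	return cpu == 6;			/* pretend only CPU 6 is idle */
}

/* scan a span of CPUs, consuming the shared budget *nr */
static int scan_span(const int *cpus, int n, int *nr)
{
	for (int i = 0; i < n; i++) {
		if (--*nr <= 0)			/* budget exhausted */
			return -1;
		if (cpu_is_idle(cpus[i]))
			return cpus[i];
	}
	return -1;
}

int main(void)
{
	int cluster[]     = { 0, 1, 2, 3 };	/* target CPU's cluster */
	int rest_of_llc[] = { 4, 5, 6, 7 };	/* LLC minus that cluster */
	int nr = 8, cpu;

	cpu = scan_span(cluster, 4, &nr);	/* cluster is tried first */
	if (cpu < 0)
		cpu = scan_span(rest_of_llc, 4, &nr);

	printf("picked CPU %d, scan budget left %d\n", cpu, nr);
	return 0;
}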
* Re: [PATCH v4 2/2] sched/fair: Scan cluster before scanning LLC in wake-up path 2022-06-09 12:06 ` [PATCH v4 2/2] sched/fair: Scan cluster before scanning LLC in wake-up path Yicong Yang @ 2022-06-09 22:47 ` Tim Chen 2022-06-10 4:01 ` Barry Song 2022-06-10 6:39 ` Yicong Yang 2022-06-26 12:13 ` Abel Wu 1 sibling, 2 replies; 23+ messages in thread From: Tim Chen @ 2022-06-09 22:47 UTC (permalink / raw) To: Yicong Yang, peterz, mingo, juri.lelli, vincent.guittot, gautham.shenoy, linux-kernel, linux-arm-kernel Cc: dietmar.eggemann, rostedt, bsegall, bristot, prime.zeng, jonathan.cameron, ego, srikar, linuxarm, 21cnbao, guodong.xu, hesham.almatary, john.garry, shenyang39 On Thu, 2022-06-09 at 20:06 +0800, Yicong Yang wrote: > From: Barry Song <song.bao.hua@hisilicon.com> > > For platforms having clusters like Kunpeng920, CPUs within the same cluster > have lower latency when synchronizing and accessing shared resources like > cache. Thus, this patch tries to find an idle cpu within the cluster of the > target CPU before scanning the whole LLC to gain lower latency. > > Note neither Kunpeng920 nor x86 Jacobsville supports SMT, so this patch > doesn't consider SMT for this moment. > > Testing has been done on Kunpeng920 by pinning tasks to one numa and two > numa. On Kunpeng920, Each numa has 8 clusters and each cluster has 4 CPUs. > > With this patch, We noticed enhancement on tbench within one numa or cross > two numa. > > On numa 0: > 5.19-rc1 patched > Hmean 1 350.27 ( 0.00%) 406.88 * 16.16%* > Hmean 2 702.01 ( 0.00%) 808.22 * 15.13%* > Hmean 4 1405.14 ( 0.00%) 1614.34 * 14.89%* > Hmean 8 2830.53 ( 0.00%) 3169.02 * 11.96%* > Hmean 16 5597.95 ( 0.00%) 6224.20 * 11.19%* > Hmean 32 10537.38 ( 0.00%) 10524.97 * -0.12%* > Hmean 64 8366.04 ( 0.00%) 8437.41 * 0.85%* > Hmean 128 7060.87 ( 0.00%) 7150.25 * 1.27%* > > On numa 0-1: > 5.19-rc1 patched > Hmean 1 346.11 ( 0.00%) 408.47 * 18.02%* > Hmean 2 693.34 ( 0.00%) 805.78 * 16.22%* > Hmean 4 1384.96 ( 0.00%) 1602.49 * 15.71%* > Hmean 8 2699.45 ( 0.00%) 3069.98 * 13.73%* > Hmean 16 5327.11 ( 0.00%) 5688.19 * 6.78%* > Hmean 32 10019.10 ( 0.00%) 11862.56 * 18.40%* > Hmean 64 13850.57 ( 0.00%) 17748.54 * 28.14%* > Hmean 128 12498.25 ( 0.00%) 15541.59 * 24.35%* > Hmean 256 11195.77 ( 0.00%) 13854.06 * 23.74%* Yicong, Have you tried any workload where tasks don't share data with each other but have sleep/wakeup? That's the case where we actually want to spread the tasks out among the clusters to void contention for L2 cache. Will be nice to make sure there's no regression there for such workload. Code itself looks good. 
Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com> > > Tested-by: Yicong Yang <yangyicong@hisilicon.com> > Signed-off-by: Barry Song <song.bao.hua@hisilicon.com> > Signed-off-by: Yicong Yang <yangyicong@hisilicon.com> > --- > kernel/sched/fair.c | 44 +++++++++++++++++++++++++++++++++++++++++--- > 1 file changed, 41 insertions(+), 3 deletions(-) > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > index 77b2048a9326..6d173e196ad3 100644 > --- a/kernel/sched/fair.c > +++ b/kernel/sched/fair.c > @@ -6327,6 +6327,40 @@ static inline int select_idle_smt(struct task_struct *p, struct sched_domain *sd > > #endif /* CONFIG_SCHED_SMT */ > > +#ifdef CONFIG_SCHED_CLUSTER > +/* > + * Scan the cluster domain for idle CPUs and clear cluster cpumask after scanning > + */ > +static inline int scan_cluster(struct task_struct *p, struct cpumask *cpus, > + int target, int *nr) > +{ > + struct sched_domain *sd = rcu_dereference(per_cpu(sd_cluster, target)); > + int cpu, idle_cpu; > + > + /* TODO: Support SMT system with cluster topology */ > + if (!sched_smt_active() && sd) { > + for_each_cpu_and(cpu, cpus, sched_domain_span(sd)) { > + if (!--*nr) > + break; > + > + idle_cpu = __select_idle_cpu(cpu, p); > + if ((unsigned int)idle_cpu < nr_cpumask_bits) > + return idle_cpu; > + } > + > + cpumask_andnot(cpus, cpus, sched_domain_span(sd)); > + } > + > + return -1; > +} > +#else > +static inline int scan_cluster(struct task_struct *p, struct cpumask *cpus, > + int target, int *nr) > +{ > + return -1; > +} > +#endif > + > /* > * Scan the LLC domain for idle CPUs; this is dynamically regulated by > * comparing the average scan cost (tracked in sd->avg_scan_cost) against the > @@ -6375,6 +6409,10 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool > time = cpu_clock(this); > } > > + idle_cpu = scan_cluster(p, cpus, target, &nr); > + if ((unsigned int)idle_cpu < nr_cpumask_bits) > + return idle_cpu; > + > for_each_cpu_wrap(cpu, cpus, target + 1) { > if (has_idle_core) { > i = select_idle_core(p, cpu, cpus, &idle_cpu); > @@ -6382,7 +6420,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool > return i; > > } else { > - if (!--nr) > + if (--nr <= 0) > return -1; > idle_cpu = __select_idle_cpu(cpu, p); > if ((unsigned int)idle_cpu < nr_cpumask_bits) > @@ -6481,7 +6519,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target) > /* > * If the previous CPU is cache affine and idle, don't be stupid: > */ > - if (prev != target && cpus_share_cache(prev, target) && > + if (prev != target && cpus_share_resources(prev, target) && > (available_idle_cpu(prev) || sched_idle_cpu(prev)) && > asym_fits_capacity(task_util, prev)) > return prev; > @@ -6507,7 +6545,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target) > p->recent_used_cpu = prev; > if (recent_used_cpu != prev && > recent_used_cpu != target && > - cpus_share_cache(recent_used_cpu, target) && > + cpus_share_resources(recent_used_cpu, target) && > (available_idle_cpu(recent_used_cpu) || sched_idle_cpu(recent_used_cpu)) && > cpumask_test_cpu(p->recent_used_cpu, p->cpus_ptr) && > asym_fits_capacity(task_util, recent_used_cpu)) { _______________________________________________ linux-arm-kernel mailing list linux-arm-kernel@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-arm-kernel ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH v4 2/2] sched/fair: Scan cluster before scanning LLC in wake-up path 2022-06-09 22:47 ` Tim Chen @ 2022-06-10 4:01 ` Barry Song 2022-06-10 6:39 ` Yicong Yang 1 sibling, 0 replies; 23+ messages in thread From: Barry Song @ 2022-06-10 4:01 UTC (permalink / raw) To: Tim Chen Cc: Yicong Yang, Peter Zijlstra, Ingo Molnar, Juri Lelli, Vincent Guittot, Gautham R. Shenoy, LKML, LAK, Dietmar Eggemann, Steven Rostedt, Ben Segall, Daniel Bristot de Oliveira, prime.zeng, Jonathan Cameron, ego, Srikar Dronamraju, Linuxarm, Guodong Xu, hesham.almatary, john.garry, Yang Shen On Fri, Jun 10, 2022 at 10:47 AM Tim Chen <tim.c.chen@linux.intel.com> wrote: > > On Thu, 2022-06-09 at 20:06 +0800, Yicong Yang wrote: > > From: Barry Song <song.bao.hua@hisilicon.com> > > > > For platforms having clusters like Kunpeng920, CPUs within the same cluster > > have lower latency when synchronizing and accessing shared resources like > > cache. Thus, this patch tries to find an idle cpu within the cluster of the > > target CPU before scanning the whole LLC to gain lower latency. > > > > Note neither Kunpeng920 nor x86 Jacobsville supports SMT, so this patch > > doesn't consider SMT for this moment. > > > > Testing has been done on Kunpeng920 by pinning tasks to one numa and two > > numa. On Kunpeng920, Each numa has 8 clusters and each cluster has 4 CPUs. > > > > With this patch, We noticed enhancement on tbench within one numa or cross > > two numa. > > > > On numa 0: > > 5.19-rc1 patched > > Hmean 1 350.27 ( 0.00%) 406.88 * 16.16%* > > Hmean 2 702.01 ( 0.00%) 808.22 * 15.13%* > > Hmean 4 1405.14 ( 0.00%) 1614.34 * 14.89%* > > Hmean 8 2830.53 ( 0.00%) 3169.02 * 11.96%* > > Hmean 16 5597.95 ( 0.00%) 6224.20 * 11.19%* > > Hmean 32 10537.38 ( 0.00%) 10524.97 * -0.12%* > > Hmean 64 8366.04 ( 0.00%) 8437.41 * 0.85%* > > Hmean 128 7060.87 ( 0.00%) 7150.25 * 1.27%* > > > > On numa 0-1: > > 5.19-rc1 patched > > Hmean 1 346.11 ( 0.00%) 408.47 * 18.02%* > > Hmean 2 693.34 ( 0.00%) 805.78 * 16.22%* > > Hmean 4 1384.96 ( 0.00%) 1602.49 * 15.71%* > > Hmean 8 2699.45 ( 0.00%) 3069.98 * 13.73%* > > Hmean 16 5327.11 ( 0.00%) 5688.19 * 6.78%* > > Hmean 32 10019.10 ( 0.00%) 11862.56 * 18.40%* > > Hmean 64 13850.57 ( 0.00%) 17748.54 * 28.14%* > > Hmean 128 12498.25 ( 0.00%) 15541.59 * 24.35%* > > Hmean 256 11195.77 ( 0.00%) 13854.06 * 23.74%* > > Yicong, > > Have you tried any workload where tasks don't share data > with each other but have sleep/wakeup? That's the case > where we actually want to spread the tasks out among the clusters > to void contention for L2 cache. guess there isn't a way to universally win in term of the modifcation of scheduling policies. The good news is that as long as LB is still there, tasks still get a good chance to spread if they stop talking to each other again and again after a small talk. if tasks do like spreading but keep in constant touch with each other, we may have to interfere with them from userspace by openMP, numactl or taskset etc. as WAKE_AFFINE is always an enemy for this requirement. > > Will be nice to make sure there's no regression there for > such workload. > > Code itself looks good. 
> > Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com> > > > > > Tested-by: Yicong Yang <yangyicong@hisilicon.com> > > Signed-off-by: Barry Song <song.bao.hua@hisilicon.com> > > Signed-off-by: Yicong Yang <yangyicong@hisilicon.com> > > --- > > kernel/sched/fair.c | 44 +++++++++++++++++++++++++++++++++++++++++--- > > 1 file changed, 41 insertions(+), 3 deletions(-) > > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > > index 77b2048a9326..6d173e196ad3 100644 > > --- a/kernel/sched/fair.c > > +++ b/kernel/sched/fair.c > > @@ -6327,6 +6327,40 @@ static inline int select_idle_smt(struct task_struct *p, struct sched_domain *sd > > > > #endif /* CONFIG_SCHED_SMT */ > > > > +#ifdef CONFIG_SCHED_CLUSTER > > +/* > > + * Scan the cluster domain for idle CPUs and clear cluster cpumask after scanning > > + */ > > +static inline int scan_cluster(struct task_struct *p, struct cpumask *cpus, > > + int target, int *nr) > > +{ > > + struct sched_domain *sd = rcu_dereference(per_cpu(sd_cluster, target)); > > + int cpu, idle_cpu; > > + > > + /* TODO: Support SMT system with cluster topology */ > > + if (!sched_smt_active() && sd) { > > + for_each_cpu_and(cpu, cpus, sched_domain_span(sd)) { > > + if (!--*nr) > > + break; > > + > > + idle_cpu = __select_idle_cpu(cpu, p); > > + if ((unsigned int)idle_cpu < nr_cpumask_bits) > > + return idle_cpu; > > + } > > + > > + cpumask_andnot(cpus, cpus, sched_domain_span(sd)); > > + } > > + > > + return -1; > > +} > > +#else > > +static inline int scan_cluster(struct task_struct *p, struct cpumask *cpus, > > + int target, int *nr) > > +{ > > + return -1; > > +} > > +#endif > > + > > /* > > * Scan the LLC domain for idle CPUs; this is dynamically regulated by > > * comparing the average scan cost (tracked in sd->avg_scan_cost) against the > > @@ -6375,6 +6409,10 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool > > time = cpu_clock(this); > > } > > > > + idle_cpu = scan_cluster(p, cpus, target, &nr); > > + if ((unsigned int)idle_cpu < nr_cpumask_bits) > > + return idle_cpu; > > + > > for_each_cpu_wrap(cpu, cpus, target + 1) { > > if (has_idle_core) { > > i = select_idle_core(p, cpu, cpus, &idle_cpu); > > @@ -6382,7 +6420,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool > > return i; > > > > } else { > > - if (!--nr) > > + if (--nr <= 0) > > return -1; > > idle_cpu = __select_idle_cpu(cpu, p); > > if ((unsigned int)idle_cpu < nr_cpumask_bits) > > @@ -6481,7 +6519,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target) > > /* > > * If the previous CPU is cache affine and idle, don't be stupid: > > */ > > - if (prev != target && cpus_share_cache(prev, target) && > > + if (prev != target && cpus_share_resources(prev, target) && > > (available_idle_cpu(prev) || sched_idle_cpu(prev)) && > > asym_fits_capacity(task_util, prev)) > > return prev; > > @@ -6507,7 +6545,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target) > > p->recent_used_cpu = prev; > > if (recent_used_cpu != prev && > > recent_used_cpu != target && > > - cpus_share_cache(recent_used_cpu, target) && > > + cpus_share_resources(recent_used_cpu, target) && > > (available_idle_cpu(recent_used_cpu) || sched_idle_cpu(recent_used_cpu)) && > > cpumask_test_cpu(p->recent_used_cpu, p->cpus_ptr) && > > asym_fits_capacity(task_util, recent_used_cpu)) { > Thanks Barry _______________________________________________ linux-arm-kernel mailing list linux-arm-kernel@lists.infradead.org 
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel ^ permalink raw reply [flat|nested] 23+ messages in thread
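[To make the userspace-pinning option Barry mentions concrete: a small sketch that confines the calling task to one cluster with sched_setaffinity(), which is roughly what taskset or numactl do under the hood. The CPU numbers are assumptions; on the Kunpeng920 layout described in this thread a cluster has 4 CPUs, and the real IDs should be read from the cluster topology files under /sys/devices/system/cpu/cpuN/topology/ on kernels with cluster support.]

/* Illustration only: pin the calling task to an assumed cluster of CPUs 0-3. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
	cpu_set_t mask;

	CPU_ZERO(&mask);
	for (int cpu = 0; cpu < 4; cpu++)	/* assumed cluster: CPUs 0-3 */
		CPU_SET(cpu, &mask);

	if (sched_setaffinity(0, sizeof(mask), &mask)) {
		perror("sched_setaffinity");
		return 1;
	}
	printf("pinned to CPUs 0-3\n");
	return 0;
}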
* Re: [PATCH v4 2/2] sched/fair: Scan cluster before scanning LLC in wake-up path 2022-06-09 22:47 ` Tim Chen 2022-06-10 4:01 ` Barry Song @ 2022-06-10 6:39 ` Yicong Yang 2022-06-10 21:19 ` Tim Chen 2022-06-11 3:03 ` Chen Yu 1 sibling, 2 replies; 23+ messages in thread From: Yicong Yang @ 2022-06-10 6:39 UTC (permalink / raw) To: Tim Chen, Yicong Yang, peterz, mingo, juri.lelli, vincent.guittot, gautham.shenoy, linux-kernel, linux-arm-kernel Cc: dietmar.eggemann, rostedt, bsegall, bristot, prime.zeng, jonathan.cameron, ego, srikar, linuxarm, 21cnbao, guodong.xu, hesham.almatary, john.garry, shenyang39, Chen Yu On 2022/6/10 6:47, Tim Chen wrote: > On Thu, 2022-06-09 at 20:06 +0800, Yicong Yang wrote: >> From: Barry Song <song.bao.hua@hisilicon.com> >> >> For platforms having clusters like Kunpeng920, CPUs within the same cluster >> have lower latency when synchronizing and accessing shared resources like >> cache. Thus, this patch tries to find an idle cpu within the cluster of the >> target CPU before scanning the whole LLC to gain lower latency. >> >> Note neither Kunpeng920 nor x86 Jacobsville supports SMT, so this patch >> doesn't consider SMT for this moment. >> >> Testing has been done on Kunpeng920 by pinning tasks to one numa and two >> numa. On Kunpeng920, Each numa has 8 clusters and each cluster has 4 CPUs. >> >> With this patch, We noticed enhancement on tbench within one numa or cross >> two numa. >> >> On numa 0: >> 5.19-rc1 patched >> Hmean 1 350.27 ( 0.00%) 406.88 * 16.16%* >> Hmean 2 702.01 ( 0.00%) 808.22 * 15.13%* >> Hmean 4 1405.14 ( 0.00%) 1614.34 * 14.89%* >> Hmean 8 2830.53 ( 0.00%) 3169.02 * 11.96%* >> Hmean 16 5597.95 ( 0.00%) 6224.20 * 11.19%* >> Hmean 32 10537.38 ( 0.00%) 10524.97 * -0.12%* >> Hmean 64 8366.04 ( 0.00%) 8437.41 * 0.85%* >> Hmean 128 7060.87 ( 0.00%) 7150.25 * 1.27%* >> >> On numa 0-1: >> 5.19-rc1 patched >> Hmean 1 346.11 ( 0.00%) 408.47 * 18.02%* >> Hmean 2 693.34 ( 0.00%) 805.78 * 16.22%* >> Hmean 4 1384.96 ( 0.00%) 1602.49 * 15.71%* >> Hmean 8 2699.45 ( 0.00%) 3069.98 * 13.73%* >> Hmean 16 5327.11 ( 0.00%) 5688.19 * 6.78%* >> Hmean 32 10019.10 ( 0.00%) 11862.56 * 18.40%* >> Hmean 64 13850.57 ( 0.00%) 17748.54 * 28.14%* >> Hmean 128 12498.25 ( 0.00%) 15541.59 * 24.35%* >> Hmean 256 11195.77 ( 0.00%) 13854.06 * 23.74%* > > Yicong, > > Have you tried any workload where tasks don't share data > with each other but have sleep/wakeup? That's the case > where we actually want to spread the tasks out among the clusters > to void contention for L2 cache. > > Will be nice to make sure there's no regression there for > such workload. > Any certain workload you'd like me test? I'm willing to do :) I've tested this patch with MySQL as well (like in v2). This won't hurt the MySQL case with SIS_PROP but observed some improvement with SIS_UTIL posted in [1]. We leverage the nr to suppress redundant scanning in the current approach and seems SIS_UTIL is more efficient in this case. 
5.19-rc1 patched patched+SIS_UTIL[1] TPS-16threads 6215.11 6172.74 (-0.68%) 6217.33 (0.04%) QPS-16threads 124302.21 123454.68 (-0.68%) 124346.52 (0.04%) avg-lat-16threads 2.57 2.59 (-0.65%) 2.57 (0.00%) TPS-24threads 8726.40 8690.87 (-0.41%) 8833.08 (1.22%) QPS-24threads 174527.88 173817.42 (-0.41%) 176661.54 (1.21%) avg-lat-24threads 2.75 2.76 (-0.36%) 2.71 (1.33%) TPS-32threads 9555.42 9514.86 (-0.42%) 10010.87 (4.77%) QPS-32threads 191108.37 190297.28 (-0.42%) 200217.35 (4.55%) avg-lat-32threads 3.35 3.36 (-0.30%) 3.20 (4.58%) TPS-64threads 10290.10 10324.75 (0.34%) 10819.77 (5.15%) QPS-64threads 205802.05 206494.95 (0.34%) 216395.40 (4.90%) avg-lat-64threads 6.22 6.20 (0.38%) 5.92 (4.88%) > Code itself looks good. > > Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com> > Thanks. [1] https://lore.kernel.org/lkml/20220428182442.659294-1-yu.c.chen@intel.com/ >> >> Tested-by: Yicong Yang <yangyicong@hisilicon.com> >> Signed-off-by: Barry Song <song.bao.hua@hisilicon.com> >> Signed-off-by: Yicong Yang <yangyicong@hisilicon.com> >> --- >> kernel/sched/fair.c | 44 +++++++++++++++++++++++++++++++++++++++++--- >> 1 file changed, 41 insertions(+), 3 deletions(-) >> >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c >> index 77b2048a9326..6d173e196ad3 100644 >> --- a/kernel/sched/fair.c >> +++ b/kernel/sched/fair.c >> @@ -6327,6 +6327,40 @@ static inline int select_idle_smt(struct task_struct *p, struct sched_domain *sd >> >> #endif /* CONFIG_SCHED_SMT */ >> >> +#ifdef CONFIG_SCHED_CLUSTER >> +/* >> + * Scan the cluster domain for idle CPUs and clear cluster cpumask after scanning >> + */ >> +static inline int scan_cluster(struct task_struct *p, struct cpumask *cpus, >> + int target, int *nr) >> +{ >> + struct sched_domain *sd = rcu_dereference(per_cpu(sd_cluster, target)); >> + int cpu, idle_cpu; >> + >> + /* TODO: Support SMT system with cluster topology */ >> + if (!sched_smt_active() && sd) { >> + for_each_cpu_and(cpu, cpus, sched_domain_span(sd)) { >> + if (!--*nr) >> + break; >> + >> + idle_cpu = __select_idle_cpu(cpu, p); >> + if ((unsigned int)idle_cpu < nr_cpumask_bits) >> + return idle_cpu; >> + } >> + >> + cpumask_andnot(cpus, cpus, sched_domain_span(sd)); >> + } >> + >> + return -1; >> +} >> +#else >> +static inline int scan_cluster(struct task_struct *p, struct cpumask *cpus, >> + int target, int *nr) >> +{ >> + return -1; >> +} >> +#endif >> + >> /* >> * Scan the LLC domain for idle CPUs; this is dynamically regulated by >> * comparing the average scan cost (tracked in sd->avg_scan_cost) against the >> @@ -6375,6 +6409,10 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool >> time = cpu_clock(this); >> } >> >> + idle_cpu = scan_cluster(p, cpus, target, &nr); >> + if ((unsigned int)idle_cpu < nr_cpumask_bits) >> + return idle_cpu; >> + >> for_each_cpu_wrap(cpu, cpus, target + 1) { >> if (has_idle_core) { >> i = select_idle_core(p, cpu, cpus, &idle_cpu); >> @@ -6382,7 +6420,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool >> return i; >> >> } else { >> - if (!--nr) >> + if (--nr <= 0) >> return -1; >> idle_cpu = __select_idle_cpu(cpu, p); >> if ((unsigned int)idle_cpu < nr_cpumask_bits) >> @@ -6481,7 +6519,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target) >> /* >> * If the previous CPU is cache affine and idle, don't be stupid: >> */ >> - if (prev != target && cpus_share_cache(prev, target) && >> + if (prev != target && cpus_share_resources(prev, target) && >> 
(available_idle_cpu(prev) || sched_idle_cpu(prev)) && >> asym_fits_capacity(task_util, prev)) >> return prev; >> @@ -6507,7 +6545,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target) >> p->recent_used_cpu = prev; >> if (recent_used_cpu != prev && >> recent_used_cpu != target && >> - cpus_share_cache(recent_used_cpu, target) && >> + cpus_share_resources(recent_used_cpu, target) && >> (available_idle_cpu(recent_used_cpu) || sched_idle_cpu(recent_used_cpu)) && >> cpumask_test_cpu(p->recent_used_cpu, p->cpus_ptr) && >> asym_fits_capacity(task_util, recent_used_cpu)) { > > > . > _______________________________________________ linux-arm-kernel mailing list linux-arm-kernel@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-arm-kernel ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH v4 2/2] sched/fair: Scan cluster before scanning LLC in wake-up path
  2022-06-10  6:39     ` Yicong Yang
@ 2022-06-10 21:19       ` Tim Chen
  2022-06-11  3:03       ` Chen Yu
  1 sibling, 0 replies; 23+ messages in thread
From: Tim Chen @ 2022-06-10 21:19 UTC (permalink / raw)
  To: Yicong Yang, Yicong Yang, peterz, mingo, juri.lelli,
	vincent.guittot, gautham.shenoy, linux-kernel, linux-arm-kernel
  Cc: dietmar.eggemann, rostedt, bsegall, bristot, prime.zeng,
	jonathan.cameron, ego, srikar, linuxarm, 21cnbao, guodong.xu,
	hesham.almatary, john.garry, shenyang39, Chen Yu

On Fri, 2022-06-10 at 14:39 +0800, Yicong Yang wrote:
>
>
> I've tested this patch with MySQL as well (like in v2). This won't hurt
> the MySQL case with SIS_PROP but observed some improvement with SIS_UTIL
> posted in [1]. We leverage the nr to suppress redundant scanning in the
> current approach and seems SIS_UTIL is more efficient in this case.
>
>                    5.19-rc1    patched             patched+SIS_UTIL[1]
> TPS-16threads      6215.11     6172.74 (-0.68%)    6217.33 (0.04%)
> QPS-16threads      124302.21   123454.68 (-0.68%)  124346.52 (0.04%)
> avg-lat-16threads  2.57        2.59 (-0.65%)       2.57 (0.00%)
> TPS-24threads      8726.40     8690.87 (-0.41%)    8833.08 (1.22%)
> QPS-24threads      174527.88   173817.42 (-0.41%)  176661.54 (1.21%)
> avg-lat-24threads  2.75        2.76 (-0.36%)       2.71 (1.33%)
> TPS-32threads      9555.42     9514.86 (-0.42%)    10010.87 (4.77%)
> QPS-32threads      191108.37   190297.28 (-0.42%)  200217.35 (4.55%)
> avg-lat-32threads  3.35        3.36 (-0.30%)       3.20 (4.58%)
> TPS-64threads      10290.10    10324.75 (0.34%)    10819.77 (5.15%)
> QPS-64threads      205802.05   206494.95 (0.34%)   216395.40 (4.90%)
> avg-lat-64threads  6.22        6.20 (0.38%)        5.92 (4.88%)
>

Thanks for the numbers. SIS_UTIL will hold off migrations off the
cluster that don't really improve overall utilization. We have a higher
chance that the L2 cache is warm. So it makes sense that we see a bit
better performance there.

Tim

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH v4 2/2] sched/fair: Scan cluster before scanning LLC in wake-up path 2022-06-10 6:39 ` Yicong Yang 2022-06-10 21:19 ` Tim Chen @ 2022-06-11 3:03 ` Chen Yu 2022-06-11 7:40 ` Yicong Yang 1 sibling, 1 reply; 23+ messages in thread From: Chen Yu @ 2022-06-11 3:03 UTC (permalink / raw) To: Yicong Yang Cc: Tim Chen, Yicong Yang, peterz, mingo, juri.lelli, vincent.guittot, gautham.shenoy, linux-kernel, linux-arm-kernel, dietmar.eggemann, rostedt, bsegall, bristot, prime.zeng, jonathan.cameron, ego, srikar, linuxarm, 21cnbao, guodong.xu, hesham.almatary, john.garry, shenyang39 Hi Yicong, On Fri, Jun 10, 2022 at 02:39:38PM +0800, Yicong Yang wrote: > On 2022/6/10 6:47, Tim Chen wrote: > > On Thu, 2022-06-09 at 20:06 +0800, Yicong Yang wrote: > >> From: Barry Song <song.bao.hua@hisilicon.com> > >> > >> For platforms having clusters like Kunpeng920, CPUs within the same cluster > >> have lower latency when synchronizing and accessing shared resources like > >> cache. Thus, this patch tries to find an idle cpu within the cluster of the > >> target CPU before scanning the whole LLC to gain lower latency. > >> > >> Note neither Kunpeng920 nor x86 Jacobsville supports SMT, so this patch > >> doesn't consider SMT for this moment. > >> > >> Testing has been done on Kunpeng920 by pinning tasks to one numa and two > >> numa. On Kunpeng920, Each numa has 8 clusters and each cluster has 4 CPUs. > >> > >> With this patch, We noticed enhancement on tbench within one numa or cross > >> two numa. > >> > >> On numa 0: > >> 5.19-rc1 patched > >> Hmean 1 350.27 ( 0.00%) 406.88 * 16.16%* > >> Hmean 2 702.01 ( 0.00%) 808.22 * 15.13%* > >> Hmean 4 1405.14 ( 0.00%) 1614.34 * 14.89%* > >> Hmean 8 2830.53 ( 0.00%) 3169.02 * 11.96%* > >> Hmean 16 5597.95 ( 0.00%) 6224.20 * 11.19%* > >> Hmean 32 10537.38 ( 0.00%) 10524.97 * -0.12%* > >> Hmean 64 8366.04 ( 0.00%) 8437.41 * 0.85%* > >> Hmean 128 7060.87 ( 0.00%) 7150.25 * 1.27%* > >> > >> On numa 0-1: > >> 5.19-rc1 patched > >> Hmean 1 346.11 ( 0.00%) 408.47 * 18.02%* > >> Hmean 2 693.34 ( 0.00%) 805.78 * 16.22%* > >> Hmean 4 1384.96 ( 0.00%) 1602.49 * 15.71%* > >> Hmean 8 2699.45 ( 0.00%) 3069.98 * 13.73%* > >> Hmean 16 5327.11 ( 0.00%) 5688.19 * 6.78%* > >> Hmean 32 10019.10 ( 0.00%) 11862.56 * 18.40%* > >> Hmean 64 13850.57 ( 0.00%) 17748.54 * 28.14%* > >> Hmean 128 12498.25 ( 0.00%) 15541.59 * 24.35%* > >> Hmean 256 11195.77 ( 0.00%) 13854.06 * 23.74%* > > > > Yicong, > > > > Have you tried any workload where tasks don't share data > > with each other but have sleep/wakeup? That's the case > > where we actually want to spread the tasks out among the clusters > > to void contention for L2 cache. > > > > Will be nice to make sure there's no regression there for > > such workload. > > > > Any certain workload you'd like me test? I'm willing to do :) > > I've tested this patch with MySQL as well (like in v2). This won't hurt > the MySQL case with SIS_PROP but observed some improvement with SIS_UTIL > posted in [1]. We leverage the nr to suppress redundant scanning in the > current approach and seems SIS_UTIL is more efficient in this case. 
> > 5.19-rc1 patched patched+SIS_UTIL[1] > TPS-16threads 6215.11 6172.74 (-0.68%) 6217.33 (0.04%) > QPS-16threads 124302.21 123454.68 (-0.68%) 124346.52 (0.04%) > avg-lat-16threads 2.57 2.59 (-0.65%) 2.57 (0.00%) > TPS-24threads 8726.40 8690.87 (-0.41%) 8833.08 (1.22%) > QPS-24threads 174527.88 173817.42 (-0.41%) 176661.54 (1.21%) > avg-lat-24threads 2.75 2.76 (-0.36%) 2.71 (1.33%) > TPS-32threads 9555.42 9514.86 (-0.42%) 10010.87 (4.77%) > QPS-32threads 191108.37 190297.28 (-0.42%) 200217.35 (4.55%) > avg-lat-32threads 3.35 3.36 (-0.30%) 3.20 (4.58%) > TPS-64threads 10290.10 10324.75 (0.34%) 10819.77 (5.15%) > QPS-64threads 205802.05 206494.95 (0.34%) 216395.40 (4.90%) > avg-lat-64threads 6.22 6.20 (0.38%) 5.92 (4.88%) > > Thanks for the testings. I'm refining the patch according to Mel's suggestion and will send it out later. Some minor nits below. > > Code itself looks good. > > > > Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com> > > > > Thanks. > > [1] https://lore.kernel.org/lkml/20220428182442.659294-1-yu.c.chen@intel.com/ > > >> > >> Tested-by: Yicong Yang <yangyicong@hisilicon.com> > >> Signed-off-by: Barry Song <song.bao.hua@hisilicon.com> > >> Signed-off-by: Yicong Yang <yangyicong@hisilicon.com> > >> --- > >> kernel/sched/fair.c | 44 +++++++++++++++++++++++++++++++++++++++++--- > >> 1 file changed, 41 insertions(+), 3 deletions(-) > >> > >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > >> index 77b2048a9326..6d173e196ad3 100644 > >> --- a/kernel/sched/fair.c > >> +++ b/kernel/sched/fair.c > >> @@ -6327,6 +6327,40 @@ static inline int select_idle_smt(struct task_struct *p, struct sched_domain *sd > >> > >> #endif /* CONFIG_SCHED_SMT */ > >> > >> +#ifdef CONFIG_SCHED_CLUSTER > >> +/* > >> + * Scan the cluster domain for idle CPUs and clear cluster cpumask after scanning > >> + */ > >> +static inline int scan_cluster(struct task_struct *p, struct cpumask *cpus, > >> + int target, int *nr) > >> +{ > >> + struct sched_domain *sd = rcu_dereference(per_cpu(sd_cluster, target)); > >> + int cpu, idle_cpu; > >> + > >> + /* TODO: Support SMT system with cluster topology */ > >> + if (!sched_smt_active() && sd) { > >> + for_each_cpu_and(cpu, cpus, sched_domain_span(sd)) { > >> + if (!--*nr) > >> + break; Maybe I understand it incorrectly, here we changed the value of nr. Then if we did not find any idle CPU in the cluster, nr is set to 0. Then in the LLC scan, since there is 'if (--nr <= 0)', the LLC scan terminates. Is the policy like: if we can not find one idle CPU in the cluster, we give up scan for one in the LLC? [cut] > >> + > >> /* > >> * Scan the LLC domain for idle CPUs; this is dynamically regulated by If we add logic for cluster scan, might need to update the comment as well. > >> * comparing the average scan cost (tracked in sd->avg_scan_cost) against the > >> @@ -6375,6 +6409,10 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool > >> time = cpu_clock(this); > >> } > >> > >> + idle_cpu = scan_cluster(p, cpus, target, &nr); > >> + if ((unsigned int)idle_cpu < nr_cpumask_bits) > >> + return idle_cpu; > >> + > >> for_each_cpu_wrap(cpu, cpus, target + 1) { > >> if (has_idle_core) { > >> i = select_idle_core(p, cpu, cpus, &idle_cpu); > >> @@ -6382,7 +6420,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool > >> return i; > >> > >> } else { > >> - if (!--nr) > >> + if (--nr <= 0) May I know the reason to change this? I thought this could be a seperate patch. 
thanks, Chenyu > >> return -1; > >> idle_cpu = __select_idle_cpu(cpu, p); > >> if ((unsigned int)idle_cpu < nr_cpumask_bits) > >> @@ -6481,7 +6519,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target) > >> /* > >> * If the previous CPU is cache affine and idle, don't be stupid: > >> */ > >> - if (prev != target && cpus_share_cache(prev, target) && > >> + if (prev != target && cpus_share_resources(prev, target) && > >> (available_idle_cpu(prev) || sched_idle_cpu(prev)) && > >> asym_fits_capacity(task_util, prev)) > >> return prev; > >> @@ -6507,7 +6545,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target) > >> p->recent_used_cpu = prev; > >> if (recent_used_cpu != prev && > >> recent_used_cpu != target && > >> - cpus_share_cache(recent_used_cpu, target) && > >> + cpus_share_resources(recent_used_cpu, target) && > >> (available_idle_cpu(recent_used_cpu) || sched_idle_cpu(recent_used_cpu)) && > >> cpumask_test_cpu(p->recent_used_cpu, p->cpus_ptr) && > >> asym_fits_capacity(task_util, recent_used_cpu)) { > > > > > > . > > _______________________________________________ linux-arm-kernel mailing list linux-arm-kernel@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-arm-kernel ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH v4 2/2] sched/fair: Scan cluster before scanning LLC in wake-up path 2022-06-11 3:03 ` Chen Yu @ 2022-06-11 7:40 ` Yicong Yang 2022-06-11 9:04 ` Chen Yu 0 siblings, 1 reply; 23+ messages in thread From: Yicong Yang @ 2022-06-11 7:40 UTC (permalink / raw) To: Chen Yu Cc: yangyicong, Tim Chen, peterz, mingo, juri.lelli, vincent.guittot, gautham.shenoy, linux-kernel, linux-arm-kernel, dietmar.eggemann, rostedt, bsegall, bristot, prime.zeng, jonathan.cameron, ego, srikar, linuxarm, 21cnbao, guodong.xu, hesham.almatary, john.garry, shenyang39 On 2022/6/11 11:03, Chen Yu wrote: > Hi Yicong, > On Fri, Jun 10, 2022 at 02:39:38PM +0800, Yicong Yang wrote: >> On 2022/6/10 6:47, Tim Chen wrote: >>> On Thu, 2022-06-09 at 20:06 +0800, Yicong Yang wrote: >>>> From: Barry Song <song.bao.hua@hisilicon.com> >>>> >>>> For platforms having clusters like Kunpeng920, CPUs within the same cluster >>>> have lower latency when synchronizing and accessing shared resources like >>>> cache. Thus, this patch tries to find an idle cpu within the cluster of the >>>> target CPU before scanning the whole LLC to gain lower latency. >>>> >>>> Note neither Kunpeng920 nor x86 Jacobsville supports SMT, so this patch >>>> doesn't consider SMT for this moment. >>>> >>>> Testing has been done on Kunpeng920 by pinning tasks to one numa and two >>>> numa. On Kunpeng920, Each numa has 8 clusters and each cluster has 4 CPUs. >>>> >>>> With this patch, We noticed enhancement on tbench within one numa or cross >>>> two numa. >>>> >>>> On numa 0: >>>> 5.19-rc1 patched >>>> Hmean 1 350.27 ( 0.00%) 406.88 * 16.16%* >>>> Hmean 2 702.01 ( 0.00%) 808.22 * 15.13%* >>>> Hmean 4 1405.14 ( 0.00%) 1614.34 * 14.89%* >>>> Hmean 8 2830.53 ( 0.00%) 3169.02 * 11.96%* >>>> Hmean 16 5597.95 ( 0.00%) 6224.20 * 11.19%* >>>> Hmean 32 10537.38 ( 0.00%) 10524.97 * -0.12%* >>>> Hmean 64 8366.04 ( 0.00%) 8437.41 * 0.85%* >>>> Hmean 128 7060.87 ( 0.00%) 7150.25 * 1.27%* >>>> >>>> On numa 0-1: >>>> 5.19-rc1 patched >>>> Hmean 1 346.11 ( 0.00%) 408.47 * 18.02%* >>>> Hmean 2 693.34 ( 0.00%) 805.78 * 16.22%* >>>> Hmean 4 1384.96 ( 0.00%) 1602.49 * 15.71%* >>>> Hmean 8 2699.45 ( 0.00%) 3069.98 * 13.73%* >>>> Hmean 16 5327.11 ( 0.00%) 5688.19 * 6.78%* >>>> Hmean 32 10019.10 ( 0.00%) 11862.56 * 18.40%* >>>> Hmean 64 13850.57 ( 0.00%) 17748.54 * 28.14%* >>>> Hmean 128 12498.25 ( 0.00%) 15541.59 * 24.35%* >>>> Hmean 256 11195.77 ( 0.00%) 13854.06 * 23.74%* >>> >>> Yicong, >>> >>> Have you tried any workload where tasks don't share data >>> with each other but have sleep/wakeup? That's the case >>> where we actually want to spread the tasks out among the clusters >>> to void contention for L2 cache. >>> >>> Will be nice to make sure there's no regression there for >>> such workload. >>> >> >> Any certain workload you'd like me test? I'm willing to do :) >> >> I've tested this patch with MySQL as well (like in v2). This won't hurt >> the MySQL case with SIS_PROP but observed some improvement with SIS_UTIL >> posted in [1]. We leverage the nr to suppress redundant scanning in the >> current approach and seems SIS_UTIL is more efficient in this case. 
>> >> 5.19-rc1 patched patched+SIS_UTIL[1] >> TPS-16threads 6215.11 6172.74 (-0.68%) 6217.33 (0.04%) >> QPS-16threads 124302.21 123454.68 (-0.68%) 124346.52 (0.04%) >> avg-lat-16threads 2.57 2.59 (-0.65%) 2.57 (0.00%) >> TPS-24threads 8726.40 8690.87 (-0.41%) 8833.08 (1.22%) >> QPS-24threads 174527.88 173817.42 (-0.41%) 176661.54 (1.21%) >> avg-lat-24threads 2.75 2.76 (-0.36%) 2.71 (1.33%) >> TPS-32threads 9555.42 9514.86 (-0.42%) 10010.87 (4.77%) >> QPS-32threads 191108.37 190297.28 (-0.42%) 200217.35 (4.55%) >> avg-lat-32threads 3.35 3.36 (-0.30%) 3.20 (4.58%) >> TPS-64threads 10290.10 10324.75 (0.34%) 10819.77 (5.15%) >> QPS-64threads 205802.05 206494.95 (0.34%) 216395.40 (4.90%) >> avg-lat-64threads 6.22 6.20 (0.38%) 5.92 (4.88%) >> >> > Thanks for the testings. I'm refining the patch according to Mel's suggestion and > will send it out later. Some minor nits below. >>> Code itself looks good. >>> >>> Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com> >>> >> >> Thanks. >> >> [1] https://lore.kernel.org/lkml/20220428182442.659294-1-yu.c.chen@intel.com/ >> >>>> >>>> Tested-by: Yicong Yang <yangyicong@hisilicon.com> >>>> Signed-off-by: Barry Song <song.bao.hua@hisilicon.com> >>>> Signed-off-by: Yicong Yang <yangyicong@hisilicon.com> >>>> --- >>>> kernel/sched/fair.c | 44 +++++++++++++++++++++++++++++++++++++++++--- >>>> 1 file changed, 41 insertions(+), 3 deletions(-) >>>> >>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c >>>> index 77b2048a9326..6d173e196ad3 100644 >>>> --- a/kernel/sched/fair.c >>>> +++ b/kernel/sched/fair.c >>>> @@ -6327,6 +6327,40 @@ static inline int select_idle_smt(struct task_struct *p, struct sched_domain *sd >>>> >>>> #endif /* CONFIG_SCHED_SMT */ >>>> >>>> +#ifdef CONFIG_SCHED_CLUSTER >>>> +/* >>>> + * Scan the cluster domain for idle CPUs and clear cluster cpumask after scanning >>>> + */ >>>> +static inline int scan_cluster(struct task_struct *p, struct cpumask *cpus, >>>> + int target, int *nr) >>>> +{ >>>> + struct sched_domain *sd = rcu_dereference(per_cpu(sd_cluster, target)); >>>> + int cpu, idle_cpu; >>>> + >>>> + /* TODO: Support SMT system with cluster topology */ >>>> + if (!sched_smt_active() && sd) { >>>> + for_each_cpu_and(cpu, cpus, sched_domain_span(sd)) { >>>> + if (!--*nr) >>>> + break; > Maybe I understand it incorrectly, here we changed the value of nr. Then if > we did not find any idle CPU in the cluster, nr is set to 0. Then in the LLC scan, > since there is 'if (--nr <= 0)', the LLC scan terminates. > Is the policy like: if we can not find one idle CPU in the cluster, we give > up scan for one in the LLC? > [cut] We're still scanning in the LLC since cluster is within the LLC (on Kunpeng 920 cluster cpus share L3 Tag and on Jacobsville cluster cpus share L2, while on both platform LLC should be the L3 cache). So if we've run out of nr we should stop scanning the rest of LLC as well and ... >>>> + >>>> /* >>>> * Scan the LLC domain for idle CPUs; this is dynamically regulated by > If we add logic for cluster scan, might need to update the comment as well. ...the comment here still stands. 
>>>> * comparing the average scan cost (tracked in sd->avg_scan_cost) against the >>>> @@ -6375,6 +6409,10 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool >>>> time = cpu_clock(this); >>>> } >>>> >>>> + idle_cpu = scan_cluster(p, cpus, target, &nr); >>>> + if ((unsigned int)idle_cpu < nr_cpumask_bits) >>>> + return idle_cpu; >>>> + >>>> for_each_cpu_wrap(cpu, cpus, target + 1) { >>>> if (has_idle_core) { >>>> i = select_idle_core(p, cpu, cpus, &idle_cpu); >>>> @@ -6382,7 +6420,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool >>>> return i; >>>> >>>> } else { >>>> - if (!--nr) >>>> + if (--nr <= 0) > May I know the reason to change this? I thought this could be a seperate patch. > scan_cluster() may run out of nr (on Kunpeng 920 there're 4 cpus in one cluster and nr will be at least 4 with SIS_PROP) and we cannot stop scanning correctly without this change. So the change should be in this patch. Thanks, Yicong > thanks, > Chenyu >>>> return -1; >>>> idle_cpu = __select_idle_cpu(cpu, p); >>>> if ((unsigned int)idle_cpu < nr_cpumask_bits) >>>> @@ -6481,7 +6519,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target) >>>> /* >>>> * If the previous CPU is cache affine and idle, don't be stupid: >>>> */ >>>> - if (prev != target && cpus_share_cache(prev, target) && >>>> + if (prev != target && cpus_share_resources(prev, target) && >>>> (available_idle_cpu(prev) || sched_idle_cpu(prev)) && >>>> asym_fits_capacity(task_util, prev)) >>>> return prev; >>>> @@ -6507,7 +6545,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target) >>>> p->recent_used_cpu = prev; >>>> if (recent_used_cpu != prev && >>>> recent_used_cpu != target && >>>> - cpus_share_cache(recent_used_cpu, target) && >>>> + cpus_share_resources(recent_used_cpu, target) && >>>> (available_idle_cpu(recent_used_cpu) || sched_idle_cpu(recent_used_cpu)) && >>>> cpumask_test_cpu(p->recent_used_cpu, p->cpus_ptr) && >>>> asym_fits_capacity(task_util, recent_used_cpu)) { >>> >>> >>> . >>> > . > _______________________________________________ linux-arm-kernel mailing list linux-arm-kernel@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-arm-kernel ^ permalink raw reply [flat|nested] 23+ messages in thread
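To make the shared-budget point in the exchange above concrete: scan_cluster() and the LLC walk in select_idle_cpu() decrement one common nr, so a budget that runs out inside the cluster ends the whole search. The following is a minimal userspace sketch of that behaviour only, not kernel code; the CPU ids, the idle pattern and the *_model helpers are invented for the example, and SIS_PROP/SIS_UTIL themselves are not modelled.

#include <stdio.h>
#include <stdbool.h>

static bool cpu_is_idle(int cpu)
{
	/* Pretend only CPU 6 is idle, so the cluster scan finds nothing. */
	return cpu == 6;
}

/* Mirrors the shape of scan_cluster(): consumes from the shared budget. */
static int scan_cluster_model(const int *cluster, int ncluster, int *nr)
{
	for (int i = 0; i < ncluster; i++) {
		if (!--*nr)
			break;
		if (cpu_is_idle(cluster[i]))
			return cluster[i];
	}
	return -1;
}

static int select_idle_cpu_model(const int *cluster, int ncluster,
				 const int *llc, int nllc, int nr)
{
	int cpu = scan_cluster_model(cluster, ncluster, &nr);

	if (cpu >= 0)
		return cpu;

	/* The LLC walk only gets whatever budget the cluster scan left over. */
	for (int i = 0; i < nllc; i++) {
		if (--nr <= 0)
			return -1;
		if (cpu_is_idle(llc[i]))
			return llc[i];
	}
	return -1;
}

int main(void)
{
	int cluster[] = { 0, 1, 2, 3 };	/* 4 CPUs per cluster, as on Kunpeng 920 */
	int llc[] = { 4, 5, 6, 7 };	/* the rest of the LLC */

	printf("nr=4 -> %d\n", select_idle_cpu_model(cluster, 4, llc, 4, 4));
	printf("nr=8 -> %d\n", select_idle_cpu_model(cluster, 4, llc, 4, 8));
	return 0;
}

With a budget of 4 the cluster scan spends it all and the search returns -1 without touching the rest of the LLC; with a budget of 8 the leftover lets the LLC walk reach the idle CPU 6.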
* Re: [PATCH v4 2/2] sched/fair: Scan cluster before scanning LLC in wake-up path 2022-06-11 7:40 ` Yicong Yang @ 2022-06-11 9:04 ` Chen Yu 0 siblings, 0 replies; 23+ messages in thread From: Chen Yu @ 2022-06-11 9:04 UTC (permalink / raw) To: Yicong Yang Cc: yangyicong, Tim Chen, peterz, mingo, juri.lelli, vincent.guittot, gautham.shenoy, linux-kernel, linux-arm-kernel, dietmar.eggemann, rostedt, bsegall, bristot, prime.zeng, jonathan.cameron, ego, srikar, linuxarm, 21cnbao, guodong.xu, hesham.almatary, john.garry, shenyang39 On Sat, Jun 11, 2022 at 03:40:23PM +0800, Yicong Yang wrote: > On 2022/6/11 11:03, Chen Yu wrote: > > Hi Yicong, > > On Fri, Jun 10, 2022 at 02:39:38PM +0800, Yicong Yang wrote: > >> On 2022/6/10 6:47, Tim Chen wrote: > >>> On Thu, 2022-06-09 at 20:06 +0800, Yicong Yang wrote: > >>>> From: Barry Song <song.bao.hua@hisilicon.com> > >>>> > >>>> For platforms having clusters like Kunpeng920, CPUs within the same cluster > >>>> have lower latency when synchronizing and accessing shared resources like > >>>> cache. Thus, this patch tries to find an idle cpu within the cluster of the > >>>> target CPU before scanning the whole LLC to gain lower latency. > >>>> > >>>> Note neither Kunpeng920 nor x86 Jacobsville supports SMT, so this patch > >>>> doesn't consider SMT for this moment. > >>>> > >>>> Testing has been done on Kunpeng920 by pinning tasks to one numa and two > >>>> numa. On Kunpeng920, Each numa has 8 clusters and each cluster has 4 CPUs. > >>>> > >>>> With this patch, We noticed enhancement on tbench within one numa or cross > >>>> two numa. > >>>> > >>>> On numa 0: > >>>> 5.19-rc1 patched > >>>> Hmean 1 350.27 ( 0.00%) 406.88 * 16.16%* > >>>> Hmean 2 702.01 ( 0.00%) 808.22 * 15.13%* > >>>> Hmean 4 1405.14 ( 0.00%) 1614.34 * 14.89%* > >>>> Hmean 8 2830.53 ( 0.00%) 3169.02 * 11.96%* > >>>> Hmean 16 5597.95 ( 0.00%) 6224.20 * 11.19%* > >>>> Hmean 32 10537.38 ( 0.00%) 10524.97 * -0.12%* > >>>> Hmean 64 8366.04 ( 0.00%) 8437.41 * 0.85%* > >>>> Hmean 128 7060.87 ( 0.00%) 7150.25 * 1.27%* > >>>> > >>>> On numa 0-1: > >>>> 5.19-rc1 patched > >>>> Hmean 1 346.11 ( 0.00%) 408.47 * 18.02%* > >>>> Hmean 2 693.34 ( 0.00%) 805.78 * 16.22%* > >>>> Hmean 4 1384.96 ( 0.00%) 1602.49 * 15.71%* > >>>> Hmean 8 2699.45 ( 0.00%) 3069.98 * 13.73%* > >>>> Hmean 16 5327.11 ( 0.00%) 5688.19 * 6.78%* > >>>> Hmean 32 10019.10 ( 0.00%) 11862.56 * 18.40%* > >>>> Hmean 64 13850.57 ( 0.00%) 17748.54 * 28.14%* > >>>> Hmean 128 12498.25 ( 0.00%) 15541.59 * 24.35%* > >>>> Hmean 256 11195.77 ( 0.00%) 13854.06 * 23.74%* > >>> > >>> Yicong, > >>> > >>> Have you tried any workload where tasks don't share data > >>> with each other but have sleep/wakeup? That's the case > >>> where we actually want to spread the tasks out among the clusters > >>> to void contention for L2 cache. > >>> > >>> Will be nice to make sure there's no regression there for > >>> such workload. > >>> > >> > >> Any certain workload you'd like me test? I'm willing to do :) > >> > >> I've tested this patch with MySQL as well (like in v2). This won't hurt > >> the MySQL case with SIS_PROP but observed some improvement with SIS_UTIL > >> posted in [1]. We leverage the nr to suppress redundant scanning in the > >> current approach and seems SIS_UTIL is more efficient in this case. 
> >> > >> 5.19-rc1 patched patched+SIS_UTIL[1] > >> TPS-16threads 6215.11 6172.74 (-0.68%) 6217.33 (0.04%) > >> QPS-16threads 124302.21 123454.68 (-0.68%) 124346.52 (0.04%) > >> avg-lat-16threads 2.57 2.59 (-0.65%) 2.57 (0.00%) > >> TPS-24threads 8726.40 8690.87 (-0.41%) 8833.08 (1.22%) > >> QPS-24threads 174527.88 173817.42 (-0.41%) 176661.54 (1.21%) > >> avg-lat-24threads 2.75 2.76 (-0.36%) 2.71 (1.33%) > >> TPS-32threads 9555.42 9514.86 (-0.42%) 10010.87 (4.77%) > >> QPS-32threads 191108.37 190297.28 (-0.42%) 200217.35 (4.55%) > >> avg-lat-32threads 3.35 3.36 (-0.30%) 3.20 (4.58%) > >> TPS-64threads 10290.10 10324.75 (0.34%) 10819.77 (5.15%) > >> QPS-64threads 205802.05 206494.95 (0.34%) 216395.40 (4.90%) > >> avg-lat-64threads 6.22 6.20 (0.38%) 5.92 (4.88%) > >> > >> > > Thanks for the testings. I'm refining the patch according to Mel's suggestion and > > will send it out later. Some minor nits below. > >>> Code itself looks good. > >>> > >>> Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com> > >>> > >> > >> Thanks. > >> > >> [1] https://lore.kernel.org/lkml/20220428182442.659294-1-yu.c.chen@intel.com/ > >> > >>>> > >>>> Tested-by: Yicong Yang <yangyicong@hisilicon.com> > >>>> Signed-off-by: Barry Song <song.bao.hua@hisilicon.com> > >>>> Signed-off-by: Yicong Yang <yangyicong@hisilicon.com> > >>>> --- > >>>> kernel/sched/fair.c | 44 +++++++++++++++++++++++++++++++++++++++++--- > >>>> 1 file changed, 41 insertions(+), 3 deletions(-) > >>>> > >>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > >>>> index 77b2048a9326..6d173e196ad3 100644 > >>>> --- a/kernel/sched/fair.c > >>>> +++ b/kernel/sched/fair.c > >>>> @@ -6327,6 +6327,40 @@ static inline int select_idle_smt(struct task_struct *p, struct sched_domain *sd > >>>> > >>>> #endif /* CONFIG_SCHED_SMT */ > >>>> > >>>> +#ifdef CONFIG_SCHED_CLUSTER > >>>> +/* > >>>> + * Scan the cluster domain for idle CPUs and clear cluster cpumask after scanning > >>>> + */ > >>>> +static inline int scan_cluster(struct task_struct *p, struct cpumask *cpus, > >>>> + int target, int *nr) > >>>> +{ > >>>> + struct sched_domain *sd = rcu_dereference(per_cpu(sd_cluster, target)); > >>>> + int cpu, idle_cpu; > >>>> + > >>>> + /* TODO: Support SMT system with cluster topology */ > >>>> + if (!sched_smt_active() && sd) { > >>>> + for_each_cpu_and(cpu, cpus, sched_domain_span(sd)) { > >>>> + if (!--*nr) > >>>> + break; > > Maybe I understand it incorrectly, here we changed the value of nr. Then if > > we did not find any idle CPU in the cluster, nr is set to 0. Then in the LLC scan, > > since there is 'if (--nr <= 0)', the LLC scan terminates. > > Is the policy like: if we can not find one idle CPU in the cluster, we give > > up scan for one in the LLC? > > [cut] > > We're still scanning in the LLC since cluster is within the LLC (on Kunpeng 920 cluster cpus > share L3 Tag and on Jacobsville cluster cpus share L2, while on both platform LLC should be > the L3 cache). So if we've run out of nr we should stop scanning the rest of LLC as well > and ... > I see, the budget has been exhausted and we don't spend more time on LLC. > >>>> + > >>>> /* > >>>> * Scan the LLC domain for idle CPUs; this is dynamically regulated by > > If we add logic for cluster scan, might need to update the comment as well. > > ...the comment here still stands. 
> > >>>> * comparing the average scan cost (tracked in sd->avg_scan_cost) against the > >>>> @@ -6375,6 +6409,10 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool > >>>> time = cpu_clock(this); > >>>> } > >>>> > >>>> + idle_cpu = scan_cluster(p, cpus, target, &nr); > >>>> + if ((unsigned int)idle_cpu < nr_cpumask_bits) > >>>> + return idle_cpu; > >>>> + > >>>> for_each_cpu_wrap(cpu, cpus, target + 1) { > >>>> if (has_idle_core) { > >>>> i = select_idle_core(p, cpu, cpus, &idle_cpu); > >>>> @@ -6382,7 +6420,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool > >>>> return i; > >>>> > >>>> } else { > >>>> - if (!--nr) > >>>> + if (--nr <= 0) > > May I know the reason to change this? I thought this could be a seperate patch. > > > > scan_cluster() may run out of nr (on Kunpeng 920 there're 4 cpus in one cluster and nr will be at least 4 with > SIS_PROP) and we cannot stop scanning correctly without this change. So the change should be in this patch. > So when nr reaches 0, the original loop could not stop. Thanks for the explanation. thanks, Chenyu > Thanks, > Yicong > > > thanks, > > Chenyu > >>>> return -1; > >>>> idle_cpu = __select_idle_cpu(cpu, p); > >>>> if ((unsigned int)idle_cpu < nr_cpumask_bits) > >>>> @@ -6481,7 +6519,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target) > >>>> /* > >>>> * If the previous CPU is cache affine and idle, don't be stupid: > >>>> */ > >>>> - if (prev != target && cpus_share_cache(prev, target) && > >>>> + if (prev != target && cpus_share_resources(prev, target) && > >>>> (available_idle_cpu(prev) || sched_idle_cpu(prev)) && > >>>> asym_fits_capacity(task_util, prev)) > >>>> return prev; > >>>> @@ -6507,7 +6545,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target) > >>>> p->recent_used_cpu = prev; > >>>> if (recent_used_cpu != prev && > >>>> recent_used_cpu != target && > >>>> - cpus_share_cache(recent_used_cpu, target) && > >>>> + cpus_share_resources(recent_used_cpu, target) && > >>>> (available_idle_cpu(recent_used_cpu) || sched_idle_cpu(recent_used_cpu)) && > >>>> cpumask_test_cpu(p->recent_used_cpu, p->cpus_ptr) && > >>>> asym_fits_capacity(task_util, recent_used_cpu)) { > >>> > >>> > >>> . > >>> > > . > > _______________________________________________ linux-arm-kernel mailing list linux-arm-kernel@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-arm-kernel ^ permalink raw reply [flat|nested] 23+ messages in thread
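The predicate change itself can be checked in isolation: if scan_cluster() hands back nr == 0, the old test !--nr decrements it to -1 and evaluates to false, so the LLC loop would keep scanning on a negative budget, while --nr <= 0 stops it immediately. A throwaway check in plain C, nothing kernel-specific and with arbitrary variable names:

#include <stdio.h>

int main(void)
{
	int a = 0, b = 0;	/* nr as left behind by an exhausted cluster scan */

	/* old form: 0 becomes -1, !(-1) is false -> the loop would continue */
	printf("!--nr     stops: %s\n", !--a ? "yes" : "no");
	/* new form: 0 becomes -1, -1 <= 0 is true -> the loop stops */
	printf("--nr <= 0 stops: %s\n", --b <= 0 ? "yes" : "no");
	return 0;
}

It prints "no" for the old form and "yes" for the new one.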
* Re: [PATCH v4 2/2] sched/fair: Scan cluster before scanning LLC in wake-up path 2022-06-09 12:06 ` [PATCH v4 2/2] sched/fair: Scan cluster before scanning LLC in wake-up path Yicong Yang 2022-06-09 22:47 ` Tim Chen @ 2022-06-26 12:13 ` Abel Wu 2022-06-27 8:16 ` Yicong Yang 1 sibling, 1 reply; 23+ messages in thread From: Abel Wu @ 2022-06-26 12:13 UTC (permalink / raw) To: Yicong Yang, peterz, mingo, juri.lelli, vincent.guittot, tim.c.chen, gautham.shenoy, linux-kernel, linux-arm-kernel Cc: dietmar.eggemann, rostedt, bsegall, bristot, prime.zeng, jonathan.cameron, ego, srikar, linuxarm, 21cnbao, guodong.xu, hesham.almatary, john.garry, shenyang39 On 6/9/22 8:06 PM, Yicong Yang Wrote: > From: Barry Song <song.bao.hua@hisilicon.com> > > For platforms having clusters like Kunpeng920, CPUs within the same cluster > have lower latency when synchronizing and accessing shared resources like > cache. Thus, this patch tries to find an idle cpu within the cluster of the > target CPU before scanning the whole LLC to gain lower latency. > > Note neither Kunpeng920 nor x86 Jacobsville supports SMT, so this patch > doesn't consider SMT for this moment. > > Testing has been done on Kunpeng920 by pinning tasks to one numa and two > numa. On Kunpeng920, Each numa has 8 clusters and each cluster has 4 CPUs. > > With this patch, We noticed enhancement on tbench within one numa or cross > two numa. > > On numa 0: > 5.19-rc1 patched > Hmean 1 350.27 ( 0.00%) 406.88 * 16.16%* > Hmean 2 702.01 ( 0.00%) 808.22 * 15.13%* > Hmean 4 1405.14 ( 0.00%) 1614.34 * 14.89%* > Hmean 8 2830.53 ( 0.00%) 3169.02 * 11.96%* > Hmean 16 5597.95 ( 0.00%) 6224.20 * 11.19%* > Hmean 32 10537.38 ( 0.00%) 10524.97 * -0.12%* > Hmean 64 8366.04 ( 0.00%) 8437.41 * 0.85%* > Hmean 128 7060.87 ( 0.00%) 7150.25 * 1.27%* > > On numa 0-1: > 5.19-rc1 patched > Hmean 1 346.11 ( 0.00%) 408.47 * 18.02%* > Hmean 2 693.34 ( 0.00%) 805.78 * 16.22%* > Hmean 4 1384.96 ( 0.00%) 1602.49 * 15.71%* > Hmean 8 2699.45 ( 0.00%) 3069.98 * 13.73%* > Hmean 16 5327.11 ( 0.00%) 5688.19 * 6.78%* > Hmean 32 10019.10 ( 0.00%) 11862.56 * 18.40%* > Hmean 64 13850.57 ( 0.00%) 17748.54 * 28.14%* > Hmean 128 12498.25 ( 0.00%) 15541.59 * 24.35%* > Hmean 256 11195.77 ( 0.00%) 13854.06 * 23.74%* > > Tested-by: Yicong Yang <yangyicong@hisilicon.com> > Signed-off-by: Barry Song <song.bao.hua@hisilicon.com> > Signed-off-by: Yicong Yang <yangyicong@hisilicon.com> > --- > kernel/sched/fair.c | 44 +++++++++++++++++++++++++++++++++++++++++--- > 1 file changed, 41 insertions(+), 3 deletions(-) > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > index 77b2048a9326..6d173e196ad3 100644 > --- a/kernel/sched/fair.c > +++ b/kernel/sched/fair.c > @@ -6327,6 +6327,40 @@ static inline int select_idle_smt(struct task_struct *p, struct sched_domain *sd > > #endif /* CONFIG_SCHED_SMT */ > > +#ifdef CONFIG_SCHED_CLUSTER > +/* > + * Scan the cluster domain for idle CPUs and clear cluster cpumask after scanning > + */ > +static inline int scan_cluster(struct task_struct *p, struct cpumask *cpus, > + int target, int *nr) > +{ > + struct sched_domain *sd = rcu_dereference(per_cpu(sd_cluster, target)); > + int cpu, idle_cpu; > + > + /* TODO: Support SMT system with cluster topology */ > + if (!sched_smt_active() && sd) { > + for_each_cpu_and(cpu, cpus, sched_domain_span(sd)) { > + if (!--*nr) > + break; return -1; :) > + > + idle_cpu = __select_idle_cpu(cpu, p); > + if ((unsigned int)idle_cpu < nr_cpumask_bits) > + return idle_cpu; > + } > + > + cpumask_andnot(cpus, cpus, 
sched_domain_span(sd)); > + } > + > + return -1; > +} > +#else > +static inline int scan_cluster(struct task_struct *p, struct cpumask *cpus, > + int target, int *nr) > +{ > + return -1; > +} > +#endif > + > /* > * Scan the LLC domain for idle CPUs; this is dynamically regulated by > * comparing the average scan cost (tracked in sd->avg_scan_cost) against the > @@ -6375,6 +6409,10 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool > time = cpu_clock(this); > } > > + idle_cpu = scan_cluster(p, cpus, target, &nr); > + if ((unsigned int)idle_cpu < nr_cpumask_bits) > + return idle_cpu; > + > for_each_cpu_wrap(cpu, cpus, target + 1) { > if (has_idle_core) { > i = select_idle_core(p, cpu, cpus, &idle_cpu); > @@ -6382,7 +6420,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool > return i; > > } else { > - if (!--nr) > + if (--nr <= 0) > return -1; > idle_cpu = __select_idle_cpu(cpu, p); > if ((unsigned int)idle_cpu < nr_cpumask_bits) > @@ -6481,7 +6519,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target) > /* > * If the previous CPU is cache affine and idle, don't be stupid: > */ > - if (prev != target && cpus_share_cache(prev, target) && > + if (prev != target && cpus_share_resources(prev, target) && > (available_idle_cpu(prev) || sched_idle_cpu(prev)) && > asym_fits_capacity(task_util, prev)) > return prev; > @@ -6507,7 +6545,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target) > p->recent_used_cpu = prev; > if (recent_used_cpu != prev && > recent_used_cpu != target && > - cpus_share_cache(recent_used_cpu, target) && > + cpus_share_resources(recent_used_cpu, target) && > (available_idle_cpu(recent_used_cpu) || sched_idle_cpu(recent_used_cpu)) && > cpumask_test_cpu(p->recent_used_cpu, p->cpus_ptr) && > asym_fits_capacity(task_util, recent_used_cpu)) { _______________________________________________ linux-arm-kernel mailing list linux-arm-kernel@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-arm-kernel ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH v4 2/2] sched/fair: Scan cluster before scanning LLC in wake-up path 2022-06-26 12:13 ` Abel Wu @ 2022-06-27 8:16 ` Yicong Yang 0 siblings, 0 replies; 23+ messages in thread From: Yicong Yang @ 2022-06-27 8:16 UTC (permalink / raw) To: Abel Wu, Yicong Yang, peterz, mingo, juri.lelli, vincent.guittot, tim.c.chen, gautham.shenoy, linux-kernel, linux-arm-kernel Cc: dietmar.eggemann, rostedt, bsegall, bristot, prime.zeng, jonathan.cameron, ego, srikar, linuxarm, 21cnbao, guodong.xu, hesham.almatary, john.garry, shenyang39 On 2022/6/26 20:13, Abel Wu wrote: > > On 6/9/22 8:06 PM, Yicong Yang Wrote: >> From: Barry Song <song.bao.hua@hisilicon.com> >> >> For platforms having clusters like Kunpeng920, CPUs within the same cluster >> have lower latency when synchronizing and accessing shared resources like >> cache. Thus, this patch tries to find an idle cpu within the cluster of the >> target CPU before scanning the whole LLC to gain lower latency. >> >> Note neither Kunpeng920 nor x86 Jacobsville supports SMT, so this patch >> doesn't consider SMT for this moment. >> >> Testing has been done on Kunpeng920 by pinning tasks to one numa and two >> numa. On Kunpeng920, Each numa has 8 clusters and each cluster has 4 CPUs. >> >> With this patch, We noticed enhancement on tbench within one numa or cross >> two numa. >> >> On numa 0: >> 5.19-rc1 patched >> Hmean 1 350.27 ( 0.00%) 406.88 * 16.16%* >> Hmean 2 702.01 ( 0.00%) 808.22 * 15.13%* >> Hmean 4 1405.14 ( 0.00%) 1614.34 * 14.89%* >> Hmean 8 2830.53 ( 0.00%) 3169.02 * 11.96%* >> Hmean 16 5597.95 ( 0.00%) 6224.20 * 11.19%* >> Hmean 32 10537.38 ( 0.00%) 10524.97 * -0.12%* >> Hmean 64 8366.04 ( 0.00%) 8437.41 * 0.85%* >> Hmean 128 7060.87 ( 0.00%) 7150.25 * 1.27%* >> >> On numa 0-1: >> 5.19-rc1 patched >> Hmean 1 346.11 ( 0.00%) 408.47 * 18.02%* >> Hmean 2 693.34 ( 0.00%) 805.78 * 16.22%* >> Hmean 4 1384.96 ( 0.00%) 1602.49 * 15.71%* >> Hmean 8 2699.45 ( 0.00%) 3069.98 * 13.73%* >> Hmean 16 5327.11 ( 0.00%) 5688.19 * 6.78%* >> Hmean 32 10019.10 ( 0.00%) 11862.56 * 18.40%* >> Hmean 64 13850.57 ( 0.00%) 17748.54 * 28.14%* >> Hmean 128 12498.25 ( 0.00%) 15541.59 * 24.35%* >> Hmean 256 11195.77 ( 0.00%) 13854.06 * 23.74%* >> >> Tested-by: Yicong Yang <yangyicong@hisilicon.com> >> Signed-off-by: Barry Song <song.bao.hua@hisilicon.com> >> Signed-off-by: Yicong Yang <yangyicong@hisilicon.com> >> --- >> kernel/sched/fair.c | 44 +++++++++++++++++++++++++++++++++++++++++--- >> 1 file changed, 41 insertions(+), 3 deletions(-) >> >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c >> index 77b2048a9326..6d173e196ad3 100644 >> --- a/kernel/sched/fair.c >> +++ b/kernel/sched/fair.c >> @@ -6327,6 +6327,40 @@ static inline int select_idle_smt(struct task_struct *p, struct sched_domain *sd >> #endif /* CONFIG_SCHED_SMT */ >> +#ifdef CONFIG_SCHED_CLUSTER >> +/* >> + * Scan the cluster domain for idle CPUs and clear cluster cpumask after scanning >> + */ >> +static inline int scan_cluster(struct task_struct *p, struct cpumask *cpus, >> + int target, int *nr) >> +{ >> + struct sched_domain *sd = rcu_dereference(per_cpu(sd_cluster, target)); >> + int cpu, idle_cpu; >> + >> + /* TODO: Support SMT system with cluster topology */ >> + if (!sched_smt_active() && sd) { >> + for_each_cpu_and(cpu, cpus, sched_domain_span(sd)) { >> + if (!--*nr) >> + break; > > return -1; > :) > thanks for the comment. If we've run out of nr the select_idle_cpu() will stop scanning as well, so it's unnecessary to clear the mask of cluster cpus and just return. 
_______________________________________________ linux-arm-kernel mailing list linux-arm-kernel@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-arm-kernel ^ permalink raw reply [flat|nested] 23+ messages in thread
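For readers following the break-versus-return question, a rough model of the exit paths of scan_cluster() may help. It is a userspace sketch with a plain bitmask standing in for the cpumask and an invented topology, and it shows the early-return variant Abel suggests, which the reply above treats as safe: when the budget runs out the mask is left untrimmed, and that costs nothing because nr is already 0 and the caller stops scanning.

#include <stdio.h>

#define CLUSTER_MASK 0x0fu	/* hypothetical: CPUs 0-3 form the cluster */

static int cpu_idle(int cpu)
{
	(void)cpu;
	return 0;	/* nothing idle, to exercise both fall-through paths */
}

static int scan_cluster_model(unsigned int *mask, int *nr)
{
	for (int cpu = 0; cpu < 4; cpu++) {
		if (!(*mask & (1u << cpu)))
			continue;
		if (!--*nr)
			return -1;	/* budget gone: caller bails out anyway, skip the mask update */
		if (cpu_idle(cpu))
			return cpu;
	}
	*mask &= ~CLUSTER_MASK;		/* whole cluster scanned: drop it from the later LLC walk */
	return -1;
}

int main(void)
{
	unsigned int mask = 0xffu;	/* CPUs 0-7 allowed */
	int nr = 16;

	scan_cluster_model(&mask, &nr);
	printf("large budget: mask=0x%02x nr=%d\n", mask, nr);

	mask = 0xffu;
	nr = 3;
	scan_cluster_model(&mask, &nr);
	printf("small budget: mask=0x%02x nr=%d\n", mask, nr);
	return 0;
}

The first call prints mask=0xf0 (cluster CPUs removed before the LLC walk); the second prints mask=0xff with nr=0, i.e. the cleanup was skipped exactly when it no longer matters.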
end of thread, other threads:[~2022-06-29 3:24 UTC | newest]
Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-06-09 12:06 [PATCH v4 0/2] sched/fair: Wake task within the cluster when possible Yicong Yang
2022-06-09 12:06 ` [PATCH v4 1/2] sched: Add per_cpu cluster domain info and cpus_share_resources API Yicong Yang
2022-06-09 22:28 ` Tim Chen
2022-06-10 6:54 ` Yicong Yang
2022-06-15 14:19 ` K Prateek Nayak
2022-06-15 15:43 ` Gautham R. Shenoy
2022-06-16 8:10 ` Yicong Yang
2022-06-16 7:55 ` Yicong Yang
[not found] ` <6bf4f032-7d07-d4a4-4f5a-28f3871131c0@amd.com>
2022-06-17 16:50 ` Tim Chen
2022-06-20 11:20 ` K Prateek Nayak
2022-06-20 13:37 ` Abel Wu
2022-06-28 11:55 ` K Prateek Nayak
2022-06-29 3:22 ` Yicong Yang
2022-06-09 12:06 ` [PATCH v4 2/2] sched/fair: Scan cluster before scanning LLC in wake-up path Yicong Yang
2022-06-09 22:47 ` Tim Chen
2022-06-10 4:01 ` Barry Song
2022-06-10 6:39 ` Yicong Yang
2022-06-10 21:19 ` Tim Chen
2022-06-11 3:03 ` Chen Yu
2022-06-11 7:40 ` Yicong Yang
2022-06-11 9:04 ` Chen Yu
2022-06-26 12:13 ` Abel Wu
2022-06-27 8:16 ` Yicong Yang