From: "Song Bao Hua (Barry Song)"
To: Vincent Guittot
Subject: RE: [RFC PATCH v3 2/2] scheduler: add scheduler level for clusters
Date: Wed, 6 Jan 2021 20:09:10 +0000
Message-ID: <2a9ba02aa5c243eebc91de492e3aa27f@hisilicon.com>
References: <20210106083026.40444-1-song.bao.hua@hisilicon.com>
 <20210106083026.40444-3-song.bao.hua@hisilicon.com>
Cc: Juri Lelli, Mark Rutland, Peter Zijlstra, Catalin Marinas, Ben Segall,
 "xuwei (O)", Will Deacon, Aubrey Li, ACPI Devel Mailing List, Ingo Molnar,
 Mel Gorman, Valentin Schneider, Len Brown, linuxarm@openeuler.org,
 Steven Rostedt, "Zengtao (B)", Jonathan Cameron, Dietmar Eggemann, LAK,
 gregkh@linuxfoundation.org, "Rafael J. Wysocki", linux-kernel, Sudeep Holla,
 "tiantao (H)"

> -----Original Message-----
> From: Vincent Guittot [mailto:vincent.guittot@linaro.org]
> Sent: Thursday, January 7, 2021 5:29 AM
> To: Song Bao Hua (Barry Song)
> Cc: Valentin Schneider; Catalin Marinas; Will Deacon; Rafael J. Wysocki;
> Len Brown; gregkh@linuxfoundation.org; Jonathan Cameron; Ingo Molnar;
> Peter Zijlstra; Juri Lelli; Dietmar Eggemann; Steven Rostedt; Ben Segall;
> Mel Gorman; Mark Rutland; Sudeep Holla; Aubrey Li; LAK; linux-kernel;
> ACPI Devel Mailing List; linuxarm@openeuler.org; xuwei (O); Zengtao (B);
> tiantao (H)
> Subject: Re: [RFC PATCH v3 2/2] scheduler: add scheduler level for clusters
>
> On Wed, 6 Jan 2021 at 09:35, Barry Song wrote:
> >
> > ARM64 server chip Kunpeng 920 has 6 clusters in each NUMA node, and each
> > cluster has 4 cpus. All clusters share L3 cache data, but each cluster
> > has its own local L3 tag. On the other hand, each cluster shares some
> > internal system bus.
> > This means the cache-coherence overhead inside one cluster
> > is much lower than the overhead across clusters.
> >
> > This patch adds the sched_domain for clusters. On Kunpeng 920, without
> > this patch, domain0 of cpu0 would be MC, spanning cpu0~cpu23; with this
> > patch, MC becomes domain1, and a new domain0 "CLS" spans cpu0~cpu3.
> >
> > This will affect load balance. For example, without this patch, when cpu0
> > becomes idle, it will pull a task from cpu1-cpu15. With this patch, cpu0
> > will try to pull a task from cpu1-cpu3 first. This has much lower
> > task-migration overhead.
> >
> > On the other hand, when doing WAKE_AFFINE, this patch will try to find
> > a core in the target cluster before scanning the whole llc domain,
> > so it proactively uses a core with better affinity to the
> > target core first.
> >
> > Though it is named "cluster", architectures or machines can define the
> > exact meaning of cluster as long as some cpus share some resources at a
> > lower level than llc. So the implementation is applicable to all
> > architectures.
> >
> > Different cpus might share different resources, such as L1, L2, cache
> > tags, internal busses, etc.
> > Since it is hard to know where we should start to scan, this patch adds
> > a SD_SHARE_CLS_RESOURCES flag rather than directly leveraging the existing
>
> Not sure that we need this new flag. See more about this below.
>
> You should have a look at https://lkml.org/lkml/2020/12/14/560 which
> reworks select_idle_core/cpu/smt
>
> > SD_SHARE_PKG_RESOURCES flag. Architectures or machines can decide what
> > a cluster is and who should get SD_SHARE_CLS_RESOURCES. select_idle_cpu()
> > will scan from the first sched_domain with SD_SHARE_CLS_RESOURCES.
> >
> > The below is a hackbench result:
> >
> > We run the below command with the -g parameter varying from 1 to 10; for
> > each g, we run the command 10 times and take the average time:
> > $ numactl -N 0 hackbench -p -T -l 20000 -g $1
> >
> > hackbench reports the time needed to complete a certain number of
> > message transmissions between a certain number of tasks, for example:
> > $ numactl -N 0 hackbench -p -T -l 20000 -g 10
> > Running in threaded mode with 10 groups using 40 file descriptors each
> > (== 400 tasks)
> > Each sender will pass 20000 messages of 100 bytes
> > Time: 8.874
> >
> > The below is the result of hackbench w/ and w/o the patch:
> > g      1      2      3      4      5      6      7      8      9      10
> > w/o    1.4777 2.0112 3.1919 4.2654 5.3246 6.4019 7.5939 8.7073 9.7526 10.8987
> > w/     1.4793 1.9344 2.9080 3.9267 4.8339 5.7186 6.6923 7.5088 8.3715 9.2173
> >                      +8.9%  +7.9%  +9.3%  +10.7% +11.8% +13.8% +14.2% +15.5%
> >
> > Tracing the kernel while g=10 shows that select_idle_cpu() has a large
> > chance to get a cpu in the same cluster as the target, while it sometimes
> > gets a cpu outside the cluster:
> > target    cpu
> >   19  ->  17
> >   13  ->  15
> >   23  ->  20
> >   23  ->  20
> >   19  ->  17
> >   13  ->  15
> >   16  ->  17
> >   19  ->  17
> >    7  ->   5
> >   10  ->  11
> >   23  ->  20
> >  *23  ->   4
> > ...
> >
> > Signed-off-by: Barry Song
> > ---
> > -v3:
> >  - rebased against 5.11-rc2
> >  - with respect to the comments of Valentin Schneider, Peter Zijlstra,
> >    Vincent Guittot and Mel Gorman etc.
> >    * moved the scheduler changes from arm64 to the common place for all
> >      architectures.
> >    * added SD_SHARE_CLS_RESOURCES sd_flags specifying the sched_domain
> >      where select_idle_cpu() should begin to scan from
> >    * removed the redundant select_idle_cluster() function since all code
> >      is in select_idle_cpu() now.
> >      it also avoided scanning cluster cpus
> >      twice as in the v2 code;
> >    * redid the hackbench in one numa node after the above changes
> >
> >  arch/arm64/Kconfig             |  7 +++++++
> >  include/linux/sched/sd_flags.h |  9 +++++++++
> >  include/linux/sched/topology.h |  7 +++++++
> >  include/linux/topology.h       |  7 +++++++
> >  kernel/sched/fair.c            | 27 +++++++++++++++++++++------
> >  kernel/sched/topology.c        |  6 ++++++
> >  6 files changed, 57 insertions(+), 6 deletions(-)
> >
> > diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> > index 05e1735..546cd61 100644
> > --- a/arch/arm64/Kconfig
> > +++ b/arch/arm64/Kconfig
> > @@ -973,6 +973,13 @@ config SCHED_MC
> >           making when dealing with multi-core CPU chips at a cost of slightly
> >           increased overhead in some places. If unsure say N here.
> >
> > +config SCHED_CLUSTER
> > +       bool "Cluster scheduler support"
> > +       help
> > +         Cluster scheduler support improves the CPU scheduler's decision
> > +         making when dealing with machines that have clusters (sharing an
> > +         internal bus or sharing LLC cache tags). If unsure say N here.
> > +
> >  config SCHED_SMT
> >         bool "SMT scheduler support"
> >         help
> > diff --git a/include/linux/sched/sd_flags.h b/include/linux/sched/sd_flags.h
> > index 34b21e9..fc3c894 100644
> > --- a/include/linux/sched/sd_flags.h
> > +++ b/include/linux/sched/sd_flags.h
> > @@ -100,6 +100,15 @@ SD_FLAG(SD_ASYM_CPUCAPACITY, SDF_SHARED_PARENT | SDF_NEEDS_GROUPS)
> >  SD_FLAG(SD_SHARE_CPUCAPACITY, SDF_SHARED_CHILD | SDF_NEEDS_GROUPS)
> >
> >  /*
> > + * Domain members share CPU cluster resources (i.e. llc cache tags)
> > + *
> > + * SHARED_CHILD: Set from the base domain up until spanned CPUs no longer share
> > + *               the cluster resources (such as llc tags and internal bus)
> > + * NEEDS_GROUPS: Caches are shared between groups.
> > + */
> > +SD_FLAG(SD_SHARE_CLS_RESOURCES, SDF_SHARED_CHILD | SDF_NEEDS_GROUPS)
> > +
> > +/*
> >  * Domain members share CPU package resources (i.e.
> >    caches)
> >  *
> >  * SHARED_CHILD: Set from the base domain up until spanned CPUs no longer share
> > diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
> > index 8f0f778..846fcac 100644
> > --- a/include/linux/sched/topology.h
> > +++ b/include/linux/sched/topology.h
> > @@ -42,6 +42,13 @@ static inline int cpu_smt_flags(void)
> >  }
> >  #endif
> >
> > +#ifdef CONFIG_SCHED_CLUSTER
> > +static inline int cpu_cluster_flags(void)
> > +{
> > +       return SD_SHARE_CLS_RESOURCES | SD_SHARE_PKG_RESOURCES;
> > +}
> > +#endif
> > +
> >  #ifdef CONFIG_SCHED_MC
> >  static inline int cpu_core_flags(void)
> >  {
> > diff --git a/include/linux/topology.h b/include/linux/topology.h
> > index bf2cc3c..81be614 100644
> > --- a/include/linux/topology.h
> > +++ b/include/linux/topology.h
> > @@ -211,6 +211,13 @@ static inline const struct cpumask *cpu_smt_mask(int cpu)
> >  }
> >  #endif
> >
> > +#ifdef CONFIG_SCHED_CLUSTER
> > +static inline const struct cpumask *cpu_cluster_mask(int cpu)
> > +{
> > +       return topology_cluster_cpumask(cpu);
> > +}
> > +#endif
> > +
> >  static inline const struct cpumask *cpu_cpu_mask(int cpu)
> >  {
> >         return cpumask_of_node(cpu_to_node(cpu));
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 04a3ce2..c14fae6 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -6145,6 +6145,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
> >  {
> >         struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
> >         struct sched_domain *this_sd;
> > +       struct sched_domain *prev_ssd = NULL, *ssd;
> >         u64 avg_cost, avg_idle;
> >         u64 time;
> >         int this = smp_processor_id();
> > @@ -6174,15 +6175,29 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
> >
> >         time = cpu_clock(this);
> >
> > -       cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
> > -
> > -       for_each_cpu_wrap(cpu, cpus, target) {
> > -               if (!--nr)
> > -                       return -1;
> > -
> >                 if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
> > +       /*
> > +        * we first scan those child domains that declare they are sharing
> > +        * cluster resources such as llc tags, internal busses; then scan
> > +        * the whole llc
> > +        */
> > +       for_each_domain(target, ssd) {
>
> I don't like looping over the sched domains in the fast path. Instead you
> should follow a policy similar to what the smt rework patchset mentioned
> above does:
> in select_idle_core(), you can add a static key for the cluster case
> and loop over cpu_cluster_mask(cpu), similarly to what is done with
> cpu_smt_mask, but looking for one idle cpu in cpu_cluster_mask()
> instead of looking at all cpus as in the smt case.

Hi Vincent,

Thanks! I probably got the idea of iterating sched_domains from Valentin's
comment:
https://lore.kernel.org/lkml/jhj1rg9v7gr.mognet@arm.com/
Just in case hardware has a hierarchy of clusters, it might want to scan the
smaller cluster first, then the bigger cluster:

           +-------------------+
           |                   |
           |                   |
           |   bigger cluster  |
           |                   |
           ++-----------------++
            |                 |
            |                 |
+-----------+------+      +---+---------------+
|                  |      |                   |
|  small cluster   |      |   small cluster   |
|                  |      |                   |
|                  |      |                   |
+------------------+      +-------------------+

But you are right. Anyway, the current code supports only one layer of
cluster. This makes neither SD_SHARE_CLS_RESOURCES nor looping over
sched_domains useful.
> > +               if ((ssd->flags & SD_SHARE_CLS_RESOURCES) || (ssd == sd)) {
> > +                       cpumask_and(cpus, sched_domain_span(ssd), p->cpus_ptr);
> > +                       if (prev_ssd)
> > +                               cpumask_andnot(cpus, cpus, sched_domain_span(prev_ssd));
> > +                       for_each_cpu_wrap(cpu, cpus, target) {
> > +                               if (!--nr)
> > +                                       return -1;
> > +                               if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
> > +                                       goto done;
> > +                       }
> > +                       prev_ssd = ssd;
> > +               }
> > +               if (ssd == sd)
> >                         break;
> >         }
> >
> > +done:
> >         time = cpu_clock(this) - time;
> >         update_avg(&this_sd->avg_scan_cost, time);
> >
> > diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> > index 5d3675c..79030c9 100644
> > --- a/kernel/sched/topology.c
> > +++ b/kernel/sched/topology.c
> > @@ -1361,6 +1361,7 @@ int __read_mostly node_reclaim_distance = RECLAIM_DISTANCE;
> >  */
> >  #define TOPOLOGY_SD_FLAGS               \
> >         (SD_SHARE_CPUCAPACITY   |       \
> > +        SD_SHARE_CLS_RESOURCES |       \
> >          SD_SHARE_PKG_RESOURCES |       \
> >          SD_NUMA                |       \
> >          SD_ASYM_PACKING)
> > @@ -1480,6 +1481,11 @@ static struct sched_domain_topology_level default_topology[] = {
> >  #ifdef CONFIG_SCHED_SMT
> >         { cpu_smt_mask, cpu_smt_flags, SD_INIT_NAME(SMT) },
> >  #endif
> > +
> > +#ifdef CONFIG_SCHED_CLUSTER
> > +       { cpu_clustergroup_mask, cpu_cluster_flags, SD_INIT_NAME(CLS) },
> > +#endif
> > +
> >  #ifdef CONFIG_SCHED_MC
> >         { cpu_coregroup_mask, cpu_core_flags, SD_INIT_NAME(MC) },
> >  #endif
> > --
> > 2.7.4
> >

Thanks
Barry

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel