From: "Song Bao Hua (Barry Song)"
To: Vincent Guittot
Subject: RE: [RFC PATCH v3 2/2] scheduler: add scheduler level for clusters
Date: Wed, 6 Jan 2021 20:09:10 +0000
Message-ID: <2a9ba02aa5c243eebc91de492e3aa27f@hisilicon.com>
References: <20210106083026.40444-1-song.bao.hua@hisilicon.com>
 <20210106083026.40444-3-song.bao.hua@hisilicon.com>
Cc: Juri Lelli, Mark Rutland, Peter Zijlstra, Catalin Marinas, Ben Segall,
 "xuwei (O)", Will Deacon, Aubrey Li, ACPI Devel Mailing List, Ingo Molnar,
 Mel Gorman, Valentin Schneider, Len Brown, linuxarm@openeuler.org,
 Steven Rostedt, "Zengtao (B)", Jonathan Cameron, Dietmar Eggemann, LAK,
 gregkh@linuxfoundation.org, "Rafael J. Wysocki", linux-kernel, Sudeep Holla,
 "tiantao (H)"

> -----Original Message-----
> From: Vincent Guittot [mailto:vincent.guittot@linaro.org]
> Sent: Thursday, January 7, 2021 5:29 AM
> To: Song Bao Hua (Barry Song)
> Cc: Valentin Schneider; Catalin Marinas; Will Deacon; Rafael J. Wysocki;
> Len Brown; gregkh@linuxfoundation.org; Jonathan Cameron; Ingo Molnar;
> Peter Zijlstra; Juri Lelli; Dietmar Eggemann; Steven Rostedt; Ben Segall;
> Mel Gorman; Mark Rutland; Sudeep Holla; Aubrey Li; LAK; linux-kernel;
> ACPI Devel Mailing List; linuxarm@openeuler.org; xuwei (O); Zengtao (B);
> tiantao (H)
> Subject: Re: [RFC PATCH v3 2/2] scheduler: add scheduler level for clusters
>
> On Wed, 6 Jan 2021 at 09:35, Barry Song wrote:
> >
> > ARM64 server chip Kunpeng 920 has 6 clusters in each NUMA node, and each
> > cluster has 4 cpus. All clusters share L3 cache data, but each cluster
> > has its own local L3 tag. On the other hand, each cluster shares some
> > internal system bus.
> > This means the cache-coherence overhead inside one cluster
> > is much lower than the overhead across clusters.
> >
> > This patch adds the sched_domain for clusters. On Kunpeng 920, without
> > this patch, domain0 of cpu0 would be MC, spanning cpu0~cpu23; with this
> > patch, MC becomes domain1, and a new domain0 "CLS" spans cpu0~cpu3.
> >
> > This will affect load balance. For example, without this patch, when cpu0
> > becomes idle, it will pull a task from cpu1-cpu15. With this patch, cpu0
> > will try to pull a task from cpu1-cpu3 first. This has much lower
> > task-migration overhead.
> >
> > On the other hand, when doing WAKE_AFFINE, this patch will try to find
> > a core in the target cluster before scanning the whole llc domain,
> > so it proactively uses a core with better affinity to the
> > target core first.
> >
> > Though it is named "cluster", architectures or machines can define the
> > exact meaning of cluster as long as some cpus share some resources at a
> > lower level than llc. So the implementation is applicable to all
> > architectures.
> >
> > Different cpus might share different resources, such as L1, L2, cache
> > tags, internal busses, etc.
> > Since it is hard to know where we should start to scan, this patch adds
> > a SD_SHARE_CLS_RESOURCES flag rather than directly leveraging the existing
>
> Not sure that we need this new flag. See more about this below.
>
> You should have a look at https://lkml.org/lkml/2020/12/14/560 which
> reworks select_idle_core/cpu/smt
>
> > SD_SHARE_PKG_RESOURCES flag. Architectures or machines can decide what
> > a cluster is and who should get SD_SHARE_CLS_RESOURCES. select_idle_cpu()
> > will scan from the first sched_domain with SD_SHARE_CLS_RESOURCES.
> >
> > The below is a hackbench result:
> >
> > We run the below command with the -g parameter varying from 1 to 10; for
> > each g, we run the command 10 times and take the average time:
> > $ numactl -N 0 hackbench -p -T -l 20000 -g $1
> >
> > hackbench reports the time needed to complete a certain number of
> > message transmissions between a certain number of tasks, for example:
> > $ numactl -N 0 hackbench -p -T -l 20000 -g 10
> > Running in threaded mode with 10 groups using 40 file descriptors each
> > (== 400 tasks)
> > Each sender will pass 20000 messages of 100 bytes
> > Time: 8.874
> >
> > The below is the result of hackbench w/ and w/o the patch:
> > g      1      2      3      4      5      6      7      8      9      10
> > w/o    1.4777 2.0112 3.1919 4.2654 5.3246 6.4019 7.5939 8.7073 9.7526 10.8987
> > w/     1.4793 1.9344 2.9080 3.9267 4.8339 5.7186 6.6923 7.5088 8.3715 9.2173
> >                      +8.9%  +7.9%  +9.3%  +10.7% +11.8% +13.8% +14.2% +15.5%
> >
> > Tracing the kernel while g=10 shows that select_idle_cpu() has a large
> > chance to get a cpu in the same cluster as the target, while it sometimes
> > gets a cpu outside the cluster:
> > target    cpu
> >   19  ->  17
> >   13  ->  15
> >   23  ->  20
> >   23  ->  20
> >   19  ->  17
> >   13  ->  15
> >   16  ->  17
> >   19  ->  17
> >    7  ->   5
> >   10  ->  11
> >   23  ->  20
> >  *23  ->   4
> > ...
> >
> > Signed-off-by: Barry Song
> > ---
> > -v3:
> >  - rebased against 5.11-rc2
> >  - with respect to the comments of Valentin Schneider, Peter Zijlstra,
> >    Vincent Guittot and Mel Gorman etc.
> >    * moved the scheduler changes from arm64 to the common place for all
> >      architectures.
> >    * added SD_SHARE_CLS_RESOURCES sd_flags specifying the sched_domain
> >      where select_idle_cpu() should begin to scan from
> >    * removed the redundant select_idle_cluster() function since all code
> >      is in select_idle_cpu() now.
> >      it also avoided scanning cluster cpus
> >      twice as in the v2 code;
> >    * redid the hackbench in one numa node after the above changes
> >
> >  arch/arm64/Kconfig             |  7 +++++++
> >  include/linux/sched/sd_flags.h |  9 +++++++++
> >  include/linux/sched/topology.h |  7 +++++++
> >  include/linux/topology.h       |  7 +++++++
> >  kernel/sched/fair.c            | 27 +++++++++++++++++++++------
> >  kernel/sched/topology.c        |  6 ++++++
> >  6 files changed, 57 insertions(+), 6 deletions(-)
> >
> > diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> > index 05e1735..546cd61 100644
> > --- a/arch/arm64/Kconfig
> > +++ b/arch/arm64/Kconfig
> > @@ -973,6 +973,13 @@ config SCHED_MC
> >           making when dealing with multi-core CPU chips at a cost of slightly
> >           increased overhead in some places. If unsure say N here.
> >
> > +config SCHED_CLUSTER
> > +       bool "Cluster scheduler support"
> > +       help
> > +         Cluster scheduler support improves the CPU scheduler's decision
> > +         making when dealing with machines that have clusters (sharing an
> > +         internal bus or sharing LLC cache tags). If unsure say N here.
> > +
> >  config SCHED_SMT
> >         bool "SMT scheduler support"
> >         help
> > diff --git a/include/linux/sched/sd_flags.h b/include/linux/sched/sd_flags.h
> > index 34b21e9..fc3c894 100644
> > --- a/include/linux/sched/sd_flags.h
> > +++ b/include/linux/sched/sd_flags.h
> > @@ -100,6 +100,15 @@ SD_FLAG(SD_ASYM_CPUCAPACITY, SDF_SHARED_PARENT | SDF_NEEDS_GROUPS)
> >  SD_FLAG(SD_SHARE_CPUCAPACITY, SDF_SHARED_CHILD | SDF_NEEDS_GROUPS)
> >
> >  /*
> > + * Domain members share CPU cluster resources (i.e. llc cache tags)
> > + *
> > + * SHARED_CHILD: Set from the base domain up until spanned CPUs no longer share
> > + *               the cluster resources (such as llc tags and internal bus)
> > + * NEEDS_GROUPS: Caches are shared between groups.
> > + */
> > +SD_FLAG(SD_SHARE_CLS_RESOURCES, SDF_SHARED_CHILD | SDF_NEEDS_GROUPS)
> > +
> > +/*
> >  * Domain members share CPU package resources (i.e.
> >    caches)
> >  *
> >  * SHARED_CHILD: Set from the base domain up until spanned CPUs no longer share
> > diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
> > index 8f0f778..846fcac 100644
> > --- a/include/linux/sched/topology.h
> > +++ b/include/linux/sched/topology.h
> > @@ -42,6 +42,13 @@ static inline int cpu_smt_flags(void)
> >  }
> >  #endif
> >
> > +#ifdef CONFIG_SCHED_CLUSTER
> > +static inline int cpu_cluster_flags(void)
> > +{
> > +       return SD_SHARE_CLS_RESOURCES | SD_SHARE_PKG_RESOURCES;
> > +}
> > +#endif
> > +
> >  #ifdef CONFIG_SCHED_MC
> >  static inline int cpu_core_flags(void)
> >  {
> > diff --git a/include/linux/topology.h b/include/linux/topology.h
> > index bf2cc3c..81be614 100644
> > --- a/include/linux/topology.h
> > +++ b/include/linux/topology.h
> > @@ -211,6 +211,13 @@ static inline const struct cpumask *cpu_smt_mask(int cpu)
> >  }
> >  #endif
> >
> > +#ifdef CONFIG_SCHED_CLUSTER
> > +static inline const struct cpumask *cpu_cluster_mask(int cpu)
> > +{
> > +       return topology_cluster_cpumask(cpu);
> > +}
> > +#endif
> > +
> >  static inline const struct cpumask *cpu_cpu_mask(int cpu)
> >  {
> >         return cpumask_of_node(cpu_to_node(cpu));
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 04a3ce2..c14fae6 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -6145,6 +6145,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
> >  {
> >         struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
> >         struct sched_domain *this_sd;
> > +       struct sched_domain *prev_ssd = NULL, *ssd;
> >         u64 avg_cost, avg_idle;
> >         u64 time;
> >         int this = smp_processor_id();
> > @@ -6174,15 +6175,29 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
> >
> >         time = cpu_clock(this);
> >
> > -       cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
> > -
> > -       for_each_cpu_wrap(cpu, cpus, target) {
> > -               if (!--nr)
> > -                       return -1;
> > -
> >                 if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
> > +       /*
> > +        * we first scan those child domains that declare they are sharing
> > +        * cluster resources such as llc tags, internal busses; then scan
> > +        * the whole llc
> > +        */
> > +       for_each_domain(target, ssd) {
>
> I don't like looping over the sched domains in the fast path. Instead you
> should follow a policy similar to what the smt rework patchset mentioned
> above does:
> in select_idle_core(), you can add a static key for the cluster case
> and loop over cpu_cluster_mask(cpu), similarly to what is done with
> cpu_smt_mask, but looking for one idle cpu in cpu_cluster_mask()
> instead of looking at all cpus as in the smt case.

Hi Vincent,

Thanks! I probably got the idea of iterating sched_domains from Valentin's
comment:
https://lore.kernel.org/lkml/jhj1rg9v7gr.mognet@arm.com/
Just in case hardware has a hierarchy of clusters, it might want to scan the
smaller cluster first, then the bigger cluster:

           +-------------------+
           |                   |
           |                   |
           |   bigger cluster  |
           |                   |
           ++-----------------++
            |                 |
            |                 |
+-----------+------+      +---+---------------+
|                  |      |                   |
|  small cluster   |      |   small cluster   |
|                  |      |                   |
|                  |      |                   |
+------------------+      +-------------------+

But you are right. Anyway, the current code supports only one layer of
cluster. This makes neither SD_SHARE_CLS_RESOURCES nor looping over
sched_domains useful.
> > +               if ((ssd->flags & SD_SHARE_CLS_RESOURCES) || (ssd == sd)) {
> > +                       cpumask_and(cpus, sched_domain_span(ssd), p->cpus_ptr);
> > +                       if (prev_ssd)
> > +                               cpumask_andnot(cpus, cpus, sched_domain_span(prev_ssd));
> > +                       for_each_cpu_wrap(cpu, cpus, target) {
> > +                               if (!--nr)
> > +                                       return -1;
> > +                               if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
> > +                                       goto done;
> > +                       }
> > +                       prev_ssd = ssd;
> > +               }
> > +               if (ssd == sd)
> >                         break;
> >         }
> >
> > +done:
> >         time = cpu_clock(this) - time;
> >         update_avg(&this_sd->avg_scan_cost, time);
> >
> > diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> > index 5d3675c..79030c9 100644
> > --- a/kernel/sched/topology.c
> > +++ b/kernel/sched/topology.c
> > @@ -1361,6 +1361,7 @@ int __read_mostly node_reclaim_distance = RECLAIM_DISTANCE;
> >  */
> >  #define TOPOLOGY_SD_FLAGS               \
> >         (SD_SHARE_CPUCAPACITY   |       \
> > +        SD_SHARE_CLS_RESOURCES |       \
> >          SD_SHARE_PKG_RESOURCES |       \
> >          SD_NUMA                |       \
> >          SD_ASYM_PACKING)
> > @@ -1480,6 +1481,11 @@ static struct sched_domain_topology_level default_topology[] = {
> >  #ifdef CONFIG_SCHED_SMT
> >         { cpu_smt_mask, cpu_smt_flags, SD_INIT_NAME(SMT) },
> >  #endif
> > +
> > +#ifdef CONFIG_SCHED_CLUSTER
> > +       { cpu_clustergroup_mask, cpu_cluster_flags, SD_INIT_NAME(CLS) },
> > +#endif
> > +
> >  #ifdef CONFIG_SCHED_MC
> >         { cpu_coregroup_mask, cpu_core_flags, SD_INIT_NAME(MC) },
> >  #endif
> > --
> > 2.7.4
> >

Thanks
Barry

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel