From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-arm-kernel-bounces+linux-arm-kernel=archiver.kernel.org@lists.infradead.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
Received: from bombadil.infradead.org (bombadil.infradead.org [198.137.202.133])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by smtp.lore.kernel.org (Postfix) with ESMTPS id E733AC433F5
	for <linux-arm-kernel@archiver.kernel.org>; Wed, 16 Feb 2022 09:20:39 +0000 (UTC)
DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed;
	d=lists.infradead.org; s=bombadil.20210309; h=Sender:
	Content-Transfer-Encoding:Content-Type:List-Subscribe:List-Help:List-Post:
	List-Archive:List-Unsubscribe:List-Id:MIME-Version:In-Reply-To:References:
	Message-ID:Date:Subject:CC:To:From:Reply-To:Content-ID:Content-Description:
	Resent-Date:Resent-From:Resent-Sender:Resent-To:Resent-Cc:Resent-Message-ID:
	List-Owner; bh=OrvyhZXe9jJ6RBEjOlKZKd4jzDZADYQs0wXYut6ZvwY=; b=GOtRS+/Exvy+Ph
	cctX4q0Q18vMYR3s8QF+9f8F8kUfKDiRCeawLu/sjBw6VIcHPOH1ECIKfREHVsHBe1PhMFcai+Y/1
	7KXixfZe3Gpj96+mAw6ffYfwr51coTFJPXxibpy2MZIdzAtYhOyrhuG2vuLjf2xHiV4C5WNfXTqyg
	7GG46yQVPrzyLsG+WJfy2IhZBkB+o9InbqPQLgjDy8YGZUW0Ue4x7gT8cn7tI9ebdkrh0pnYqy9hs
	qUV2hPrrNOYV+RRTuJ/mSRLz2G/fZRv/J5MeF+Vmg6TqTWdoKdmVfYlqmRs1j6YYhHGn7wvh5ZnlW
	kQJ1ikFkC+a9RJtv8Ycg==;
Received: from localhost ([::1] helo=bombadil.infradead.org)
	by bombadil.infradead.org with esmtp (Exim 4.94.2 #2 (Red Hat Linux))
	id 1nKGT4-006I5y-Ct; Wed, 16 Feb 2022 09:19:22 +0000
Received: from szxga03-in.huawei.com ([45.249.212.189])
 by bombadil.infradead.org with esmtps (Exim 4.94.2 #2 (Red Hat Linux))
 id 1nKGSv-006I10-9W
 for linux-arm-kernel@lists.infradead.org; Wed, 16 Feb 2022 09:19:19 +0000
Received: from kwepemi100018.china.huawei.com (unknown [172.30.72.57])
 by szxga03-in.huawei.com (SkyGuard) with ESMTP id 4JzC442Kv3z8wY0;
 Wed, 16 Feb 2022 17:15:48 +0800 (CST)
Received: from kwepemm600016.china.huawei.com (7.193.23.20) by
 kwepemi100018.china.huawei.com (7.221.188.35) with Microsoft SMTP Server
 (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256) id
 15.1.2308.21; Wed, 16 Feb 2022 17:19:07 +0800
Received: from kwepemm600014.china.huawei.com (7.193.23.54) by
 kwepemm600016.china.huawei.com (7.193.23.20) with Microsoft SMTP Server
 (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256) id
 15.1.2308.21; Wed, 16 Feb 2022 17:19:07 +0800
Received: from kwepemm600014.china.huawei.com ([7.193.23.54]) by
 kwepemm600014.china.huawei.com ([7.193.23.54]) with mapi id 15.01.2308.021;
 Wed, 16 Feb 2022 17:19:07 +0800
From: "Song Bao Hua (Barry Song)" <song.bao.hua@hisilicon.com>
To: Barry Song <21cnbao@gmail.com>, "Gautham R. Shenoy"
 <gautham.shenoy@amd.com>
CC: Srikar Dronamraju <srikar@linux.vnet.ibm.com>, yangyicong
 <yangyicong@huawei.com>, Peter Zijlstra <peterz@infradead.org>, Ingo Molnar
 <mingo@redhat.com>, Juri Lelli <juri.lelli@redhat.com>, Vincent Guittot
 <vincent.guittot@linaro.org>, Tim Chen <tim.c.chen@linux.intel.com>, LKML
 <linux-kernel@vger.kernel.org>, LAK <linux-arm-kernel@lists.infradead.org>,
 Dietmar Eggemann <dietmar.eggemann@arm.com>, Steven Rostedt
 <rostedt@goodmis.org>, Ben Segall <bsegall@google.com>, "Daniel Bristot de
 Oliveira" <bristot@redhat.com>, "Zengtao (B)" <prime.zeng@hisilicon.com>,
 Jonathan Cameron <jonathan.cameron@huawei.com>, "ego@linux.vnet.ibm.com"
 <ego@linux.vnet.ibm.com>, Linuxarm <linuxarm@huawei.com>, Guodong Xu
 <guodong.xu@linaro.org>
Subject: RE: [PATCH v2 2/2] sched/fair: Scan cluster before scanning LLC in
 wake-up path
Thread-Topic: [PATCH v2 2/2] sched/fair: Scan cluster before scanning LLC in
 wake-up path
Thread-Index: AQHYEoxEPt1JVjGTeECFv7fIQnP5T6x2fV2AgABOQwCAALZOgP//LYOAgAdEbYCAALNAAIAD4KCAgAAw74CABQbjAIAA8oyAgAzNYwCAAIYZ8A==
Date: Wed, 16 Feb 2022 09:19:07 +0000
Message-ID: <dd9a5329e35241f6ab0bbb723ad72813@hisilicon.com>
References: <20220126080947.4529-1-yangyicong@hisilicon.com>
 <20220126080947.4529-3-yangyicong@hisilicon.com>
 <YfK9DSMFabjYm/MV@BLR-5CG11610CF.amd.com>
 <CAGsJ_4xL3tynB9P=rKMoX2otW4bMMU5Z-P9zSudMV3+fr2hpXw@mail.gmail.com>
 <20220128071337.GC618915@linux.vnet.ibm.com>
 <CAGsJ_4yoUONACY-j+9XxSNC0VgmdyRdHC=z87dWvZvVSASzXRQ@mail.gmail.com>
 <20220201093859.GE618915@linux.vnet.ibm.com>
 <CAGsJ_4z8cer7Y5si+J_=awQetFJZMVeaQ+RDSXQz9EGOPTGMQg@mail.gmail.com>
 <20220204073317.GG618915@linux.vnet.ibm.com>
 <CAGsJ_4xjgy3D0VzbTdmJihJ+nut_NeTEb4krh8jup4rbvTY_ww@mail.gmail.com>
 <YgE3TrBrB0psljDk@BLR-5CG11610CF.amd.com>
 <CAGsJ_4xg6heV-0yqvcwNNEyOcrfwv3uN45YfR1Jcawys0ROrow@mail.gmail.com>
 <CAGsJ_4z-YxcPytzmGViRzEueL1F7HEE4OEuezvDg6TvEs1HJEA@mail.gmail.com>
In-Reply-To: <CAGsJ_4z-YxcPytzmGViRzEueL1F7HEE4OEuezvDg6TvEs1HJEA@mail.gmail.com>
Accept-Language: en-GB, zh-CN, en-US
Content-Language: en-US
X-MS-Has-Attach: 
X-MS-TNEF-Correlator: 
x-originating-ip: [10.126.201.242]
MIME-Version: 1.0
X-CFilter-Loop: Reflected
X-CRM114-Version: 20100106-BlameMichelson ( TRE 0.8.0 (BSD) ) MR-646709E3 
X-CRM114-CacheID: sfid-20220216_011913_733935_E0BDBF4D 
X-CRM114-Status: GOOD (  42.63  )
X-BeenThere: linux-arm-kernel@lists.infradead.org
X-Mailman-Version: 2.1.34
Precedence: list
List-Id: <linux-arm-kernel.lists.infradead.org>
List-Unsubscribe: <http://lists.infradead.org/mailman/options/linux-arm-kernel>, 
 <mailto:linux-arm-kernel-request@lists.infradead.org?subject=unsubscribe>
List-Archive: <http://lists.infradead.org/pipermail/linux-arm-kernel/>
List-Post: <mailto:linux-arm-kernel@lists.infradead.org>
List-Help: <mailto:linux-arm-kernel-request@lists.infradead.org?subject=help>
List-Subscribe: <http://lists.infradead.org/mailman/listinfo/linux-arm-kernel>, 
 <mailto:linux-arm-kernel-request@lists.infradead.org?subject=subscribe>
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
Sender: "linux-arm-kernel" <linux-arm-kernel-bounces@lists.infradead.org>
Errors-To: linux-arm-kernel-bounces+linux-arm-kernel=archiver.kernel.org@lists.infradead.org


> -----Original Message-----
> From: Barry Song [mailto:21cnbao@gmail.com]
> Sent: Wednesday, February 16, 2022 10:13 PM
> To: Gautham R. Shenoy <gautham.shenoy@amd.com>
> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>; yangyicong
> <yangyicong@huawei.com>; Peter Zijlstra <peterz@infradead.org>; Ingo Molnar
> <mingo@redhat.com>; Juri Lelli <juri.lelli@redhat.com>; Vincent Guittot
> <vincent.guittot@linaro.org>; Tim Chen <tim.c.chen@linux.intel.com>; LKML
> <linux-kernel@vger.kernel.org>; LAK <linux-arm-kernel@lists.infradead.org>;
> Dietmar Eggemann <dietmar.eggemann@arm.com>; Steven Rostedt
> <rostedt@goodmis.org>; Ben Segall <bsegall@google.com>; Daniel Bristot de
> Oliveira <bristot@redhat.com>; Zengtao (B) <prime.zeng@hisilicon.com>;
> Jonathan Cameron <jonathan.cameron@huawei.com>; ego@linux.vnet.ibm.com;
> Linuxarm <linuxarm@huawei.com>; Song Bao Hua (Barry Song)
> <song.bao.hua@hisilicon.com>; Guodong Xu <guodong.xu@linaro.org>
> Subject: Re: [PATCH v2 2/2] sched/fair: Scan cluster before scanning LLC in
> wake-up path
> 
> On Tue, Feb 8, 2022 at 6:42 PM Barry Song <21cnbao@gmail.com> wrote:
> >
> > On Tue, Feb 8, 2022 at 4:14 AM Gautham R. Shenoy <gautham.shenoy@amd.com> wrote:
> > >
> > >
> > > On Fri, Feb 04, 2022 at 11:28:25PM +1300, Barry Song wrote:
> > >
> > > > > We already figured out that there are no idle CPUs in this cluster. So don't
> > > > > we gain performance by picking an idle CPU/core in the neighbouring cluster?
> > > > > If there is no idle CPU/core in the neighbouring cluster, then it does make
> > > > > sense to fall back on the current cluster.
> > > >
> > > > What you suggested is exactly the approach we tried at the very beginning
> > > > during debugging, but we didn't gain performance according to the benchmark;
> > > > we were actually losing. That is why we added this check to stop ping-pong:
> > > >          /* Don't ping-pong tasks in and out cluster frequently */
> > > >          if (cpus_share_resources(target, prev_cpu))
> > > >             return target;
> > > >
> > > > If we delete this, we are seeing a big loss of tbench while system
> > > > load is medium and above.
> > >
> > > Thanks for clarifying this Barry. Indeed, if the workload is sensitive
> > > to data ping-ponging across L2 clusters, this heuristic makes sense. I
> > > was thinking of workloads that require lower tail latency, in which
> > > case exploring the larger LLC would have made more sense, assuming
> > > that the larger LLC has an idle core/CPU.
> > >
> > > In the absence of any hints from the workload, like something that
> > > Peter had previously suggested
> > > (https://lore.kernel.org/lkml/YVwnsrZWrnWHaoqN@hirez.programming.kicks-ass.net/),
> > > optimizing for cache-access seems to be the right thing to do.
> >
> > Thanks, gautham.
> >
> > Yep. Peter mentioned some hints like SCHED_BATCH and SCHED_IDLE.
> > To me, the case we are discussing seems to be more complicated than
> > applying some scheduling policy on separate tasks by SCHED_BATCH
> > or IDLE.
> >
> > For example, in case we have a process, and this process has 20 threads.
> > thread0-9 might care about cache-coherence latency and want to avoid
> > ping-ponging, and thread10-thread19 might want to have tail-latency
> > as small as possible. So we need some way to tell kernel, "hey, bro, please
> > try to keep thread0-9 still as ping-ponging will hurt them while trying your
> > best to find idle cpu in a wider range for thread10-19". But it seems
> > SCHED_XXX as a scheduler policy hint can't tell kernel how to organize tasks
> > into groups, and is also incapable of telling kernel different groups have
> > different needs.
> >
> > So it seems we want some special cgroups to organize tasks, so that we can apply
> > some special hints on each different group. For example, putting thread0-9
> > in a cgroup and thread10-19 in another, then:
> > 1. apply "COMMUNICATION-SENSITIVE" on the 1st group
> > 2. apply "TAIL-LATENCY-SENSITIVE" on the 2nd one.
> > I am not quite sure how to do this and if this can find its way into
> > the mainline.
> >
> > On the other hand, for this particular patch, the most controversial
> > part is those two lines to avoid ping-ponging, and I am seeing that dropping
> > them hurts workloads like tbench only when system load is high. So I wonder
> > if the approach[1] from Chen Yu and Tim can resolve the problem in another
> > way, so that we can avoid the controversial part, since their patch can also
> > shrink the scanning range while LLC load is high.
> >
> > [1] https://lore.kernel.org/lkml/20220207034013.599214-1-yu.c.chen@intel.com/
> 
> Yicong's testing shows the patch from Chen Yu and Tim can resolve the
> problem and make sure there is no performance regression for tbench
> while load is high after we remove the code to avoid ping-pong:
> 
> 5.17-rc1: vanilla
> rc1 + chenyu: vanilla + chenyu's LLC overload patch
> rc1+chenyu+cls: vanilla + chenyu's  patch + my this patchset
> rc1+chenyu+cls-pingpong: vanilla + chenyu's patch + my this patchset -
> the code avoiding ping-pong
> rc1+cls: vanilla + my this patchset
> 
> tbench running on numa 0 &1:
>                             5.17-rc1          rc1 + chenyu
> rc1+chenyu+cls     rc1+chenyu+cls-pingpong  rc1+cls
> Hmean     1        320.01 (   0.00%)      318.03 *  -0.62%*
> 357.15 *  11.61%*      375.43 *  17.32%*      378.44 *  18.26%*
> Hmean     2        643.85 (   0.00%)      637.74 *  -0.95%*
> 714.36 *  10.95%*      745.82 *  15.84%*      752.52 *  16.88%*
> Hmean     4       1287.36 (   0.00%)     1285.20 *  -0.17%*
> 1431.35 *  11.18%*     1481.71 *  15.10%*     1505.62 *  16.95%*
> Hmean     8       2564.60 (   0.00%)     2551.02 *  -0.53%*
> 2812.74 *   9.68%*     2921.51 *  13.92%*     2955.29 *  15.23%*
> Hmean     16      5195.69 (   0.00%)     5163.39 *  -0.62%*
> 5583.28 *   7.46%*     5726.08 *  10.21%*     5814.74 *  11.91%*
> Hmean     32      9769.16 (   0.00%)     9815.63 *   0.48%*
> 10518.35 *   7.67%*    10852.89 *  11.09%*    10872.63 *  11.30%*
> Hmean     64     15952.50 (   0.00%)    15780.41 *  -1.08%*
> 10608.36 * -33.50%*    17503.42 *   9.72%*    17281.98 *   8.33%*
> Hmean     128    13113.77 (   0.00%)    12000.12 *  -8.49%*
> 13095.50 *  -0.14%*    13991.90 *   6.70%*    13895.20 *   5.96%*
> Hmean     256    10997.59 (   0.00%)    12229.20 *  11.20%*
> 11902.60 *   8.23%*    12214.29 *  11.06%*    11244.69 *   2.25%*
> Hmean     512    14623.60 (   0.00%)    15863.25 *   8.48%*
> 14103.38 *  -3.56%*    16422.56 *  12.30%*    15526.25 *   6.17%*
> 
> tbench running on numa 0 only:
> 
>                             5.17-rc1          rc1 + chenyu
> rc1+chenyu+cls     rc1+chenyu+cls-pingpong   rc1+cls
> Hmean     1        324.73 (   0.00%)      330.96 *   1.92%*
> 358.97 *  10.54%*      376.05 *  15.80%*      378.01 *  16.41%*
> Hmean     2        645.36 (   0.00%)      643.13 *  -0.35%*
> 710.78 *  10.14%*      744.34 *  15.34%*      754.63 *  16.93%*
> Hmean     4       1302.09 (   0.00%)     1297.11 *  -0.38%*
> 1425.22 *   9.46%*     1484.92 *  14.04%*     1507.54 *  15.78%*
> Hmean     8       2612.03 (   0.00%)     2623.60 *   0.44%*
> 2843.15 *   8.85%*     2937.81 *  12.47%*     2982.57 *  14.19%*
> Hmean     16      5307.12 (   0.00%)     5304.14 *  -0.06%*
> 5610.46 *   5.72%*     5763.24 *   8.59%*     5886.66 *  10.92%*
> Hmean     32      9354.22 (   0.00%)     9738.21 *   4.11%*
> 9360.21 *   0.06%*     9699.05 *   3.69%*     9908.13 *   5.92%*
> Hmean     64      7240.35 (   0.00%)     7210.75 *  -0.41%*
> 6992.70 *  -3.42%*     7321.52 *   1.12%*     7278.78 *   0.53%*
> Hmean     128     6186.40 (   0.00%)     6314.89 *   2.08%*
> 6166.44 *  -0.32%*     6279.85 *   1.51%*     6187.85 (   0.02%)
> Hmean     256     9231.40 (   0.00%)     9469.26 *   2.58%*
> 9134.42 *  -1.05%*     9322.88 *   0.99%*     9448.61 *   2.35%*
> Hmean     512     8907.13 (   0.00%)     9130.46 *   2.51%*
> 9023.87 *   1.31%*     9276.19 *   4.14%*     9397.22 *   5.50%*
> 

Sorry, it seems the format is broken. Let me re-post the data.

 5.17-rc1: vanilla
 rc1 + chenyu: vanilla + chenyu's LLC overload patch
 rc1+chenyu+cls: vanilla + chenyu's patch + this patchset
 rc1+chenyu+cls-pingpong: vanilla + chenyu's patch + this patchset, minus the code that avoids ping-pong
 rc1+cls: vanilla + this patchset

tbench running on numa 0&1:
                            5.17-rc1          rc1 + chenyu          rc1+chenyu+cls     rc1+chenyu+cls-pingpong  rc1+cls
Hmean     1        320.01 (   0.00%)      318.03 *  -0.62%*      357.15 *  11.61%*      375.43 *  17.32%*      378.44 *  18.26%*
Hmean     2        643.85 (   0.00%)      637.74 *  -0.95%*      714.36 *  10.95%*      745.82 *  15.84%*      752.52 *  16.88%*
Hmean     4       1287.36 (   0.00%)     1285.20 *  -0.17%*     1431.35 *  11.18%*     1481.71 *  15.10%*     1505.62 *  16.95%*
Hmean     8       2564.60 (   0.00%)     2551.02 *  -0.53%*     2812.74 *   9.68%*     2921.51 *  13.92%*     2955.29 *  15.23%*
Hmean     16      5195.69 (   0.00%)     5163.39 *  -0.62%*     5583.28 *   7.46%*     5726.08 *  10.21%*     5814.74 *  11.91%*
Hmean     32      9769.16 (   0.00%)     9815.63 *   0.48%*    10518.35 *   7.67%*    10852.89 *  11.09%*    10872.63 *  11.30%*
Hmean     64     15952.50 (   0.00%)    15780.41 *  -1.08%*    10608.36 * -33.50%*    17503.42 *   9.72%*    17281.98 *   8.33%*
Hmean     128    13113.77 (   0.00%)    12000.12 *  -8.49%*    13095.50 *  -0.14%*    13991.90 *   6.70%*    13895.20 *   5.96%*
Hmean     256    10997.59 (   0.00%)    12229.20 *  11.20%*    11902.60 *   8.23%*    12214.29 *  11.06%*    11244.69 *   2.25%*
Hmean     512    14623.60 (   0.00%)    15863.25 *   8.48%*    14103.38 *  -3.56%*    16422.56 *  12.30%*    15526.25 *   6.17%*

tbench running on numa 0 only:
                            5.17-rc1          rc1 + chenyu          rc1+chenyu+cls     rc1+chenyu+cls-pingpong   rc1+cls
Hmean     1        324.73 (   0.00%)      330.96 *   1.92%*      358.97 *  10.54%*      376.05 *  15.80%*      378.01 *  16.41%*
Hmean     2        645.36 (   0.00%)      643.13 *  -0.35%*      710.78 *  10.14%*      744.34 *  15.34%*      754.63 *  16.93%*
Hmean     4       1302.09 (   0.00%)     1297.11 *  -0.38%*     1425.22 *   9.46%*     1484.92 *  14.04%*     1507.54 *  15.78%*
Hmean     8       2612.03 (   0.00%)     2623.60 *   0.44%*     2843.15 *   8.85%*     2937.81 *  12.47%*     2982.57 *  14.19%*
Hmean     16      5307.12 (   0.00%)     5304.14 *  -0.06%*     5610.46 *   5.72%*     5763.24 *   8.59%*     5886.66 *  10.92%*
Hmean     32      9354.22 (   0.00%)     9738.21 *   4.11%*     9360.21 *   0.06%*     9699.05 *   3.69%*     9908.13 *   5.92%*
Hmean     64      7240.35 (   0.00%)     7210.75 *  -0.41%*     6992.70 *  -3.42%*     7321.52 *   1.12%*     7278.78 *   0.53%*
Hmean     128     6186.40 (   0.00%)     6314.89 *   2.08%*     6166.44 *  -0.32%*     6279.85 *   1.51%*     6187.85 (   0.02%)
Hmean     256     9231.40 (   0.00%)     9469.26 *   2.58%*     9134.42 *  -1.05%*     9322.88 *   0.99%*     9448.61 *   2.35%*
Hmean     512     8907.13 (   0.00%)     9130.46 *   2.51%*     9023.87 *   1.31%*     9276.19 *   4.14%*     9397.22 *   5.50%*
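
For anyone reading the tables: the figure next to each Hmean value is the
relative change against the 5.17-rc1 column for the same thread count. Below is
a minimal C sketch of that arithmetic. pct_change() and print_cell() are
hypothetical helper names used only for illustration; they are not part of the
patchset or of the benchmark tooling, and the sketch assumes the percentages
are plain (value - baseline) / baseline.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Relative change of a result against the baseline, in percent.
 * Assumption: baseline is the 5.17-rc1 Hmean for the same thread count.
 */
double pct_change(double baseline, double value)
{
	assert(baseline != 0.0);	/* a zero baseline has no relative change */
	return (value - baseline) / baseline * 100.0;
}

/* Print one cell as "value (pct%)", loosely following the table layout. */
void print_cell(double baseline, double value)
{
	printf("%10.2f (%7.2f%%)\n", value, pct_change(baseline, value));
}
```

For the 1-thread row on numa 0&1, pct_change(320.01, 357.15) comes out at
roughly 11.61, which matches the rc1+chenyu+cls column; for the 64-thread row,
pct_change(15952.50, 10608.36) is roughly -33.50.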

> like rc1+cls, in some
> cases(256, 512 threads on numa0&1), it is even much better.
> 
> Thanks
> Barry