From: Yicong Yang
To: K Prateek Nayak, Yicong Yang
CC: <21cnbao@gmail.com>
Subject: Re: [PATCH v4 1/2] sched: Add per_cpu cluster domain info and cpus_share_resources API
Date: Thu, 16 Jun 2022 15:55:15 +0800
Message-ID: <81fbcadb-a58d-2cef-9c05-154555ec1d68@huawei.com>
References: <20220609120622.47724-1-yangyicong@hisilicon.com>
            <20220609120622.47724-2-yangyicong@hisilicon.com>

Hi Prateek,

Seems my previous reply was in the wrong format so the server rejected it... just reposting.

Thanks a lot for the test!
On 2022/6/15 22:19, K Prateek Nayak wrote:
> Hello Yicong,
>
> I replied to the v3 of the series by mistake!
> (https://lore.kernel.org/lkml/0b065646-5b05-cbdb-b20c-e1dfef3f4d79@amd.com/)
> But rest assured all the analysis discussed there was done with the v4
> patch series. I'll add the same observations below so we can continue
> the discussion on v4.
>
> We are observing a serious regression with tbench with this patch series
> applied. The issue doesn't seem to be related to the actual functionality
> of the patch but to how the patch changes the per-CPU variable layout.
>
> Discussed below are the results from running tbench on a dual socket
> Zen3 system (2 x 64C/128T) configured in different NPS modes.
>
> NPS modes are used to logically divide each socket into multiple NUMA
> regions. Following is the NUMA configuration for each NPS mode on the
> system:
>
> NPS1: Each socket is a NUMA node.
>       Total 2 NUMA nodes in the dual socket machine.
>
>       Node 0: 0-63,   128-191
>       Node 1: 64-127, 192-255
>
> NPS2: Each socket is further logically divided into 2 NUMA regions.
>       Total 4 NUMA nodes exist over the 2 sockets.
>
>       Node 0: 0-31,   128-159
>       Node 1: 32-63,  160-191
>       Node 2: 64-95,  192-223
>       Node 3: 96-127, 224-255
>
> NPS4: Each socket is logically divided into 4 NUMA regions.
>       Total 8 NUMA nodes exist over the 2 sockets.
>
>       Node 0: 0-15,    128-143
>       Node 1: 16-31,   144-159
>       Node 2: 32-47,   160-175
>       Node 3: 48-63,   176-191
>       Node 4: 64-79,   192-207
>       Node 5: 80-95,   208-223
>       Node 6: 96-111,  224-239
>       Node 7: 112-127, 240-255
>
> Benchmark Results:
>
> Kernel versions:
> - tip:     5.19.0-rc2 tip sched/core
> - cluster: 5.19.0-rc2 tip sched/core + both the patches of the series
>
> When we started testing, the tip was at:
> commit f3dd3f674555 "sched: Remove the limitation of WF_ON_CPU on
> wakelist if wakee cpu is idle"
>
> * - Data points of concern
>
> ~~~~~~
> tbench
> ~~~~~~
>
> NPS1
>
> Clients:      tip                    cluster
>     1       444.41 (0.00 pct)       439.27 (-1.15 pct)
>     2       879.23 (0.00 pct)       831.49 (-5.42 pct)   *
>     4      1648.83 (0.00 pct)      1608.07 (-2.47 pct)
>     8      3263.81 (0.00 pct)      3086.81 (-5.42 pct)   *
>    16      6011.19 (0.00 pct)      5360.28 (-10.82 pct)  *
>    32     12058.31 (0.00 pct)      8769.08 (-27.27 pct)  *
>    64     21258.21 (0.00 pct)     19021.09 (-10.52 pct)  *
>   128     30795.27 (0.00 pct)     30861.34 (0.21 pct)
>   256     25138.43 (0.00 pct)     24711.90 (-1.69 pct)
>   512     51287.93 (0.00 pct)     51855.55 (1.10 pct)
>  1024     53176.97 (0.00 pct)     52554.55 (-1.17 pct)
>
> NPS2
>
> Clients:      tip                    cluster
>     1       445.45 (0.00 pct)       441.75 (-0.83 pct)
>     2       869.24 (0.00 pct)       845.61 (-2.71 pct)
>     4      1644.28 (0.00 pct)      1586.49 (-3.51 pct)
>     8      3120.83 (0.00 pct)      2967.01 (-4.92 pct)   *
>    16      5972.29 (0.00 pct)      5208.58 (-12.78 pct)  *
>    32     11776.38 (0.00 pct)     10229.53 (-13.13 pct)  *
>    64     20933.15 (0.00 pct)     17033.45 (-18.62 pct)  *
>   128     32195.00 (0.00 pct)     29507.85 (-8.34 pct)   *
>   256     24641.52 (0.00 pct)     27225.00 (10.48 pct)
>   512     50806.96 (0.00 pct)     51377.50 (1.12 pct)
>  1024     51993.96 (0.00 pct)     50773.35 (-2.34 pct)
>
> NPS4
>
> Clients:      tip                    cluster
>     1       442.10 (0.00 pct)       435.06 (-1.59 pct)
>     2       870.94 (0.00 pct)       858.64 (-1.41 pct)
>     4      1615.30 (0.00 pct)      1607.27 (-0.49 pct)
>     8      3195.95 (0.00 pct)      3020.63 (-5.48 pct)   *
>    16      5937.41 (0.00 pct)      5719.87 (-3.66 pct)
>    32     11800.41 (0.00 pct)     11229.65 (-4.83 pct)   *
>    64     20844.71 (0.00 pct)     20432.79 (-1.97 pct)
>   128     31003.62 (0.00 pct)     29441.20 (-5.03 pct)   *
>   256     27476.37 (0.00 pct)     25857.30 (-5.89 pct)   * [Known to have run-to-run variance]
>   512     52276.72 (0.00 pct)     51659.16 (-1.18 pct)
>  1024     51372.10 (0.00 pct)     51026.87 (-0.67 pct)
> Note: The tbench results for 256 workers are known to have run-to-run
> variation on the test machine, so any regression seen for that data
> point can be safely ignored.
>
> The behavior is consistent for both the tip and the patched kernel
> across multiple runs of tbench.
>
> ~~~~~~~~~~~~~~~~~~~~
> Analysis done so far
> ~~~~~~~~~~~~~~~~~~~~
>
> To root cause this issue quicker, we have focused on the 8 to 64 client
> data points with the machine running in NPS1 mode.
>
> - Even with the HW prefetcher disabled the behavior remains consistent,
>   signifying that the HW prefetcher is not the cause of the problem.
>
> - Bisecting:
>
>   When we ran the tests with only Patch 1 of the series, the regression
>   was visible and the numbers were worse.
>
>   Clients:      tip                    cluster                 Patch 1 only
>       8      3263.81 (0.00 pct)      3086.81 (-5.42 pct)     3018.63 (-7.51 pct)
>      16      6011.19 (0.00 pct)      5360.28 (-10.82 pct)    4869.26 (-18.99 pct)
>      32     12058.31 (0.00 pct)      8769.08 (-27.27 pct)    8159.60 (-32.33 pct)
>      64     21258.21 (0.00 pct)     19021.09 (-10.52 pct)   13161.92 (-38.08 pct)
>
>   We further bisected the hunks to narrow down the cause to the per-CPU
>   variable declarations.
>
> On 6/9/2022 5:36 PM, Yicong Yang wrote:
>>
>> [..snip..]
>>
>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>> index 01259611beb9..b9bcfcf8d14d 100644
>> --- a/kernel/sched/sched.h
>> +++ b/kernel/sched/sched.h
>> @@ -1753,7 +1753,9 @@ static inline struct sched_domain *lowest_flag_domain(int cpu, int flag)
>>  DECLARE_PER_CPU(struct sched_domain __rcu *, sd_llc);
>>  DECLARE_PER_CPU(int, sd_llc_size);
>>  DECLARE_PER_CPU(int, sd_llc_id);
>> +DECLARE_PER_CPU(int, sd_share_id);
>>  DECLARE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
>> +DECLARE_PER_CPU(struct sched_domain __rcu *, sd_cluster);
>
> The main reason for the regression seems to be the above declarations.
> The regression seems to go away if we do one of the following:
>
> - Declare sd_share_id and sd_cluster using DECLARE_PER_CPU_READ_MOSTLY()
>   instead of DECLARE_PER_CPU() and change the corresponding definitions
>   below to DEFINE_PER_CPU_READ_MOSTLY().
>
>   Clients:      tip                    Patch 1                 Patch 1 (READ_MOSTLY)
>       8      3255.69 (0.00 pct)      3018.63 (-7.28 pct)     3237.33 (-0.56 pct)
>      16      6092.67 (0.00 pct)      4869.26 (-20.08 pct)    5914.53 (-2.92 pct)
>      32     11156.56 (0.00 pct)      8159.60 (-26.86 pct)   11536.05 (3.40 pct)
>      64     21019.97 (0.00 pct)     13161.92 (-37.38 pct)   21162.33 (0.67 pct)
>
> - Convert sd_share_id and sd_cluster to static arrays.
>
>   Clients:      tip                    Patch 1                 Patch 1 (Static Array)
>       8      3255.69 (0.00 pct)      3018.63 (-7.28 pct)     3203.27 (-1.61 pct)
>      16      6092.67 (0.00 pct)      4869.26 (-20.08 pct)    6198.35 (1.73 pct)
>      32     11156.56 (0.00 pct)      8159.60 (-26.86 pct)   11385.76 (2.05 pct)
>      64     21019.97 (0.00 pct)     13161.92 (-37.38 pct)   21919.80 (4.28 pct)
>
> - Move the declarations of sd_share_id and sd_cluster to the top.
>
>   Clients:      tip                    Patch 1                 Patch 1 (Declaration on Top)
>       8      3255.69 (0.00 pct)      3018.63 (-7.28 pct)     3072.30 (-5.63 pct)
>      16      6092.67 (0.00 pct)      4869.26 (-20.08 pct)    5586.59 (-8.30 pct)
>      32     11156.56 (0.00 pct)      8159.60 (-26.86 pct)   11184.17 (0.24 pct)
>      64     21019.97 (0.00 pct)     13161.92 (-37.38 pct)   20289.70 (-3.47 pct)
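If I understand the first option correctly, the READ_MOSTLY variant you
tested would look roughly like the below (just an untested sketch on my
side, with only the two symbols added by this patch moved to the
read-mostly per-CPU macros):

/* kernel/sched/sched.h (sketch) */
DECLARE_PER_CPU(struct sched_domain __rcu *, sd_llc);
DECLARE_PER_CPU(int, sd_llc_size);
DECLARE_PER_CPU(int, sd_llc_id);
DECLARE_PER_CPU_READ_MOSTLY(int, sd_share_id);
DECLARE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
DECLARE_PER_CPU_READ_MOSTLY(struct sched_domain __rcu *, sd_cluster);

/* kernel/sched/topology.c (sketch) */
DEFINE_PER_CPU(struct sched_domain __rcu *, sd_llc);
DEFINE_PER_CPU(int, sd_llc_size);
DEFINE_PER_CPU(int, sd_llc_id);
DEFINE_PER_CPU_READ_MOSTLY(int, sd_share_id);
DEFINE_PER_CPU_READ_MOSTLY(struct sched_domain __rcu *, sd_cluster);
DEFINE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);

If so, the two new symbols are placed in the separate read-mostly per-CPU
section instead of between sd_llc_id and sd_llc_shared, so the layout of
the existing sd_llc* symbols should stay the same as on tip.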
> Unfortunately, none of these are complete solutions. For example, using
> DECLARE_PER_CPU_READ_MOSTLY() with both patches applied reduces the
> regression but doesn't eliminate it entirely:
>
> Clients:      tip                    cluster                 cluster (READ_MOSTLY)
>     1       444.41 (0.00 pct)       439.27 (-1.15 pct)      435.95 (-1.90 pct)
>     2       879.23 (0.00 pct)       831.49 (-5.42 pct)      842.09 (-4.22 pct)
>     4      1648.83 (0.00 pct)      1608.07 (-2.47 pct)     1598.77 (-3.03 pct)
>     8      3263.81 (0.00 pct)      3086.81 (-5.42 pct)     3090.86 (-5.29 pct)   *
>    16      6011.19 (0.00 pct)      5360.28 (-10.82 pct)    5360.28 (-10.82 pct)  *
>    32     12058.31 (0.00 pct)      8769.08 (-27.27 pct)   11083.66 (-8.08 pct)   *
>    64     21258.21 (0.00 pct)     19021.09 (-10.52 pct)   20984.30 (-1.28 pct)
>   128     30795.27 (0.00 pct)     30861.34 (0.21 pct)     30735.20 (-0.19 pct)
>   256     25138.43 (0.00 pct)     24711.90 (-1.69 pct)    24021.21 (-4.44 pct)
>   512     51287.93 (0.00 pct)     51855.55 (1.10 pct)     51672.73 (0.75 pct)
>  1024     53176.97 (0.00 pct)     52554.55 (-1.17 pct)    52620.02 (-1.04 pct)
>
> We are still trying to root cause the underlying issue that brought
> about such a drastic regression in tbench performance.
>
>>  DECLARE_PER_CPU(struct sched_domain __rcu *, sd_numa);
>>  DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing);
>>  DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_cpucapacity);
>> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
>> index 05b6c2ad90b9..0595827d481d 100644
>> --- a/kernel/sched/topology.c
>> +++ b/kernel/sched/topology.c
>> @@ -664,6 +664,8 @@ static void destroy_sched_domains(struct sched_domain *sd)
>>  DEFINE_PER_CPU(struct sched_domain __rcu *, sd_llc);
>>  DEFINE_PER_CPU(int, sd_llc_size);
>>  DEFINE_PER_CPU(int, sd_llc_id);
>> +DEFINE_PER_CPU(int, sd_share_id);
>> +DEFINE_PER_CPU(struct sched_domain __rcu *, sd_cluster);
>>  DEFINE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
>>  DEFINE_PER_CPU(struct sched_domain __rcu *, sd_numa);
>>  DEFINE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing);
>>
>> [..snip..]
>>
>
> We would like some time to investigate this issue and root cause the
> reason for this regression.

One significant difference I can see for now is that the Kunpeng 920 is
a non-SMT machine while Zen3 is an SMT machine, so in the
select_idle_sibling() path we won't use sd_llc_shared. Since sd_share_id
and sd_cluster are inserted between sd_llc_id and sd_llc_shared, which
are both used in that path in your test, I guess the change of layout
may affect something like the cache behaviour.

Can you also test with SMT disabled? Or change the layout a bit to see
if there's any difference, like:

 DEFINE_PER_CPU(struct sched_domain __rcu *, sd_llc);
 DEFINE_PER_CPU(int, sd_llc_size);
 DEFINE_PER_CPU(int, sd_llc_id);
 DEFINE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
+DEFINE_PER_CPU(int, sd_share_id);
+DEFINE_PER_CPU(struct sched_domain __rcu *, sd_cluster);
 DEFINE_PER_CPU(struct sched_domain __rcu *, sd_numa);
 DEFINE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing);

I need to look into it further and run some tests on an SMT machine.
Would you mind sharing your kernel config? I'd like to compare the
configs as well.

Thanks,
Yicong