From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from bombadil.infradead.org (bombadil.infradead.org [198.137.202.133]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id 7054DC77B7A for ; Wed, 24 May 2023 08:06:03 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=lists.infradead.org; s=bombadil.20210309; h=Sender: Content-Transfer-Encoding:Content-Type:List-Subscribe:List-Help:List-Post: List-Archive:List-Unsubscribe:List-Id:In-Reply-To:MIME-Version:Date: Message-ID:From:References:To:Subject:CC:Reply-To:Content-ID: Content-Description:Resent-Date:Resent-From:Resent-Sender:Resent-To:Resent-Cc :Resent-Message-ID:List-Owner; bh=7twJES28spy77MXXvaDcTZQLZIy+Pl+NwEIKn85MXuU=; b=cbeT+nnxr1+fCHh0jbG7RJ6u2y 3VR4ax+Cr+gkcbmuY3CvkbAHdlGUmBaBcQhMosK1wHAhgYey+DKq/KrJTkLMSgxBv+HN1Z9mVuhTI 0ezLAFsQrb0NtPhQyKV9DdsVCSZ/R1fQPF7thlaXjbn5EYMgNLMyuOgZeKx1OzbdyeTkMQTxrV7V7 R/WV0DvwR6Ml6eWKy9JvE2jHrW40HuiAem/XowFrWT+4h/dybxea6uuuz2hg5lNP6F+kzLTTsiGrj YWoJ296PHuEgHYSWSJVBkpPlat8cEPS8nN3lsNw50gFCyXg85ucPFT7refqdSUnGTSLqhvFflJao0 jiun2R+g==; Received: from localhost ([::1] helo=bombadil.infradead.org) by bombadil.infradead.org with esmtp (Exim 4.96 #2 (Red Hat Linux)) id 1q1jV5-00CiCT-0P; Wed, 24 May 2023 08:05:39 +0000 Received: from szxga03-in.huawei.com ([45.249.212.189]) by bombadil.infradead.org with esmtps (Exim 4.96 #2 (Red Hat Linux)) id 1q1jV0-00CiB5-2J for linux-arm-kernel@lists.infradead.org; Wed, 24 May 2023 08:05:37 +0000 Received: from canpemm500009.china.huawei.com (unknown [172.30.72.53]) by szxga03-in.huawei.com (SkyGuard) with ESMTP id 4QR3c21DlzzLmPw; Wed, 24 May 2023 16:04:02 +0800 (CST) Received: from [10.67.102.169] (10.67.102.169) by canpemm500009.china.huawei.com (7.192.105.203) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256) id 15.1.2507.23; Wed, 24 May 2023 16:05:29 +0800 CC: , , , , , , , , , , , , , , , , , , <21cnbao@gmail.com>, , , , , , , Subject: Re: [RESEND PATCH v7 2/2] sched/fair: Scan cluster before scanning LLC in wake-up path To: Chen Yu References: <20220915073423.25535-1-yangyicong@huawei.com> <20220915073423.25535-3-yangyicong@huawei.com> <46e8d4fc-993b-e1d6-5e4c-cb33513d7888@huawei.com> From: Yicong Yang Message-ID: <9b3bdf37-2551-871c-1f68-ceddf4e933c3@huawei.com> Date: Wed, 24 May 2023 16:05:29 +0800 User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101 Thunderbird/78.5.1 MIME-Version: 1.0 In-Reply-To: X-Originating-IP: [10.67.102.169] X-ClientProxiedBy: dggems706-chm.china.huawei.com (10.3.19.183) To canpemm500009.china.huawei.com (7.192.105.203) X-CFilter-Loop: Reflected X-CRM114-Version: 20100106-BlameMichelson ( TRE 0.8.0 (BSD) ) MR-646709E3 X-CRM114-CacheID: sfid-20230524_010535_145358_851A7956 X-CRM114-Status: GOOD ( 32.55 ) X-BeenThere: linux-arm-kernel@lists.infradead.org X-Mailman-Version: 2.1.34 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Sender: "linux-arm-kernel" Errors-To: linux-arm-kernel-bounces+linux-arm-kernel=archiver.kernel.org@lists.infradead.org On 2023/5/23 21:44, Chen Yu wrote: > On 2023-05-22 at 20:42:19 +0800, Yicong Yang wrote: >> Hi Chen, >> >> On 2023/5/22 14:29, Chen Yu wrote: >>> Hi Yicong, >>> On 2022-09-15 at 15:34:23 +0800, Yicong Yang wrote: >>>> From: Barry Song >>>> > [snip...] >> >> Thanks for the further information. The result of netperf/tbench looks good as we >> image, the cluster wakeup expects to gain more benefit when the system is under >> loaded or well-loaded. May I know how many CPUs sharing cluster on Jacobsvilla? >> > There are 4 CPUs per cluster on Jacobsville. > [snip...] >>>> @@ -6550,7 +6574,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target) >>>> /* >>>> * If the previous CPU is cache affine and idle, don't be stupid: >>>> */ >>>> - if (prev != target && cpus_share_cache(prev, target) && >>>> + if (prev != target && cpus_share_lowest_cache(prev, target) && >>> This change impacts hackbench in socket mode a bit. It seems that for hackbench even >>> putting the wakee on its previous CPU in the same LLC is better than putting it on >>> current cluster. But it seems to be hackbench specific. >>> >> >> ...without this do you still see the same improvement at under-loaded case (threads less-equal the CPU >> numbers) for tbench/netperf? >> The idea here is to always try to wakeup in the same cluster of the >> target to benefit from the cluster cache but the early test for the prev and recent used cpu may break >> that. Keep it as is, at low load the prev cpu or recent used cpu get more chance to be idle so we take >> less chance to benefit from the cluster and gain less performance improvement. >> > Right. Without above change I saw lower improvement at lightly load case for netperf/tbench. >> In the hackbench case as you noticed, the utilization can reach 100% ideally so the SIS_UTIL >> will regulate the scanning number to 4 or around. If the prev/recent used CPU is not in the same >> cluster with target, we're about to scanning the cluster and when found no idle CPU and has >> run out of the scanning number, we'll fallback to wakeup on the target. That maybe the reason >> why observed more wakeups on target rather than previous CPU. >> > Looks reasonable. When the budget of scanning number is low, we can not find an idle target > on local cluster and terminates scanning for an idle prev on remote cluster, although that > prev could be a better choice than target cpu. >> In this case I wondering choosing prev cpu or recent used cpu after scanning the cluster can help >> the situation here, like the snippet below (kinds of messy though). >> > This change makes sense to me. I only modified it a little bit to only give prev a second > chance. With your patch applied, the improvement of netperf/tbench remains while the > hackbench big regress was gone. > Thanks for the test, it looks justified to have this. Will include this change in next version. > hackbench > ========= > case load baseline(std%) compare%( std%) > process-pipe 1-groups 1.00 ( 2.35) -0.65 ( 1.81) > process-pipe 2-groups 1.00 ( 0.42) -2.16 ( 1.12) > process-pipe 4-groups 1.00 ( 1.84) +0.72 ( 1.34) > process-pipe 8-groups 1.00 ( 2.81) +1.12 ( 3.88) > process-sockets 1-groups 1.00 ( 1.88) -0.99 ( 4.84) > process-sockets 2-groups 1.00 ( 5.49) -4.50 ( 4.09) > process-sockets 4-groups 1.00 ( 3.54) +2.28 ( 3.13) > process-sockets 8-groups 1.00 ( 0.79) -0.13 ( 1.28) > threads-pipe 1-groups 1.00 ( 1.73) -2.39 ( 0.40) > threads-pipe 2-groups 1.00 ( 0.73) +2.88 ( 1.94) > threads-pipe 4-groups 1.00 ( 0.64) +1.12 ( 1.82) > threads-pipe 8-groups 1.00 ( 1.55) -1.59 ( 1.20) > threads-sockets 1-groups 1.00 ( 3.76) +3.21 ( 3.56) > threads-sockets 2-groups 1.00 ( 1.20) -5.56 ( 2.64) > threads-sockets 4-groups 1.00 ( 2.65) +1.48 ( 4.91) > threads-sockets 8-groups 1.00 ( 0.08) +0.18 ( 0.15) > > netperf > ======= > case load baseline(std%) compare%( std%) > TCP_RR 6-threads 1.00 ( 0.91) +2.87 ( 0.83) > TCP_RR 12-threads 1.00 ( 0.22) +3.48 ( 0.31) > TCP_RR 18-threads 1.00 ( 0.41) +7.81 ( 0.48) > TCP_RR 24-threads 1.00 ( 1.02) -0.32 ( 1.25) > TCP_RR 30-threads 1.00 ( 4.67) -0.04 ( 5.14) > TCP_RR 36-threads 1.00 ( 4.53) -0.13 ( 4.37) > TCP_RR 42-threads 1.00 ( 3.92) -0.15 ( 3.07) > TCP_RR 48-threads 1.00 ( 2.07) -0.17 ( 1.52) > UDP_RR 6-threads 1.00 ( 0.98) +4.50 ( 2.38) > UDP_RR 12-threads 1.00 ( 0.26) +3.64 ( 0.25) > UDP_RR 18-threads 1.00 ( 0.27) +9.93 ( 0.55) > UDP_RR 24-threads 1.00 ( 1.22) +0.13 ( 1.33) > UDP_RR 30-threads 1.00 ( 3.86) -0.03 ( 5.05) > UDP_RR 36-threads 1.00 ( 2.81) +0.10 ( 3.37) > UDP_RR 42-threads 1.00 ( 3.51) -0.26 ( 2.94) > UDP_RR 48-threads 1.00 ( 12.54) +0.74 ( 9.44) > > tbench > ====== > case load baseline(std%) compare%( std%) > loopback 6-threads 1.00 ( 0.04) +2.94 ( 0.26) > loopback 12-threads 1.00 ( 0.30) +4.58 ( 0.12) > loopback 18-threads 1.00 ( 0.37) +12.38 ( 0.10) > loopback 24-threads 1.00 ( 0.56) -0.27 ( 0.50) > loopback 30-threads 1.00 ( 0.17) -0.18 ( 0.06) > loopback 36-threads 1.00 ( 0.25) -0.73 ( 0.44) > loopback 42-threads 1.00 ( 0.10) -0.22 ( 0.18) > loopback 48-threads 1.00 ( 0.29) -0.48 ( 0.19) > > schbench > ======== > case load baseline(std%) compare%( std%) > normal 1-mthreads 1.00 ( 0.00) +0.00 ( 0.00) > normal 2-mthreads 1.00 ( 0.00) +0.00 ( 0.00) > normal 4-mthreads 1.00 ( 6.80) +2.78 ( 8.08) > normal 8-mthreads 1.00 ( 3.65) -0.23 ( 4.30) > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > index 0989116b0796..07495b44c68f 100644 > --- a/kernel/sched/fair.c > +++ b/kernel/sched/fair.c > @@ -7127,7 +7127,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target) > bool has_idle_core = false; > struct sched_domain *sd; > unsigned long task_util, util_min, util_max; > - int i, recent_used_cpu; > + int i, recent_used_cpu, prev_aff = -1; > > /* > * On asymmetric system, update task utilization because we will check > @@ -7152,10 +7152,13 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target) > /* > * If the previous CPU is cache affine and idle, don't be stupid: > */ > - if (prev != target && cpus_share_lowest_cache(prev, target) && > + if (prev != target && cpus_share_cache(prev, target) && > (available_idle_cpu(prev) || sched_idle_cpu(prev)) && > - asym_fits_cpu(task_util, util_min, util_max, prev)) > - return prev; > + asym_fits_cpu(task_util, util_min, util_max, prev)) { > + if (cpus_share_lowest_cache(prev, target)) > + return prev; > + prev_aff = prev; > + } > > /* > * Allow a per-cpu kthread to stack with the wakee if the > @@ -7223,6 +7226,13 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target) > if ((unsigned)i < nr_cpumask_bits) > return i; > > + /* > + * Give prev another chance, in case prev has not been > + * scanned in select_idle_cpu() due to nr constrain. > + */ > + if (prev_aff != -1) > + return prev_aff; > + It looks neater. We should also give recent_used_cpu a chance based on the current implementation if it does no harm. Thanks, Yicong _______________________________________________ linux-arm-kernel mailing list linux-arm-kernel@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-arm-kernel