Message-ID:
 <0ddeddb4-5d62-4aea-9dd5-ba5c3301628e@linux.ibm.com>
Date: Thu, 19 Mar 2026 18:43:20 +0530
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
MIME-Version: 1.0
User-Agent: Mozilla Thunderbird
Subject: Re: [PATCH 1/2] sched/fair: consider hk_mask early in triggering ilb
To: Mukesh Kumar Chaurasiya
Cc: mingo@kernel.org, peterz@infradead.org, vincent.guittot@linaro.org,
 linux-kernel@vger.kernel.org, kprateek.nayak@amd.com, juri.lelli@redhat.com,
 vschneid@redhat.com, tglx@linutronix.de, dietmar.eggemann@arm.com,
 frederic@kernel.org, longman@redhat.com
References: <20260319065314.343932-1-sshegde@linux.ibm.com>
 <20260319065314.343932-2-sshegde@linux.ibm.com>
Content-Language: en-US
From: Shrikanth Hegde
In-Reply-To:
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit

Hi Mukesh.

On 3/19/26 1:45 PM, Mukesh Kumar Chaurasiya wrote:
> On Thu, Mar 19, 2026 at 12:23:13PM +0530, Shrikanth Hegde wrote:
>> Current code around nohz_balancer_kick and kick_ilb:
>> 1. Checks for nohz.idle_cpus_mask to see if idle load balance(ilb) is
>>    needed.
>> 2. Does a few checks to see if any conditions meet the criteria.
>> 3. Tries to find the idle CPU. But the idle CPU found should be part of
>>    housekeeping CPUs.
>>
>> If there is no housekeeping idle CPU, then step 2 is done
>> un-necessarily, since 3 bails out without doing the ilb.
>>
>> Fix that by making the decision early and pass it on to find_new_ilb.
>> Use a percpu cpumask instead of allocating it everytime since this is in
>> fastpath.
>>
>> If flags is set to NOHZ_STATS_KICK since the time is after nohz.next_blocked
>> but before nohz.next_balance and there are idle CPUs which are part of
>> housekeeping, need to copy the same logic there too.
>>
>> While there, fix the stale comments around nohz.nr_cpus
>>
>> Signed-off-by: Shrikanth Hegde
>> ---
>>
>> Didn't add the fixes tag since it addresses more than stale comments.
>>
>>  kernel/sched/fair.c | 45 +++++++++++++++++++++++++++++++--------------
>>  1 file changed, 31 insertions(+), 14 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index b19aeaa51ebc..02cca2c7a98d 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -7392,6 +7392,7 @@ static inline unsigned int cfs_h_nr_delayed(struct rq *rq)
>>  static DEFINE_PER_CPU(cpumask_var_t, load_balance_mask);
>>  static DEFINE_PER_CPU(cpumask_var_t, select_rq_mask);
>>  static DEFINE_PER_CPU(cpumask_var_t, should_we_balance_tmpmask);
>> +static DEFINE_PER_CPU(cpumask_var_t, kick_ilb_tmpmask);
>>
>>  #ifdef CONFIG_NO_HZ_COMMON
>>
>> @@ -12629,15 +12630,14 @@ static inline int on_null_domain(struct rq *rq)
>>   * - When one of the busy CPUs notices that there may be an idle rebalancing
>>   *   needed, they will kick the idle load balancer, which then does idle
>>   *   load balancing for all the idle CPUs.
>> + *
>> + * @cpus idle CPUs in HK_TYPE_KERNEL_NOISE housekeeping
>>   */
>> -static inline int find_new_ilb(void)
>> +static inline int find_new_ilb(struct cpumask *cpus)
>>  {
>> -	const struct cpumask *hk_mask;
>>  	int ilb_cpu;
>>
>> -	hk_mask = housekeeping_cpumask(HK_TYPE_KERNEL_NOISE);
>> -
>> -	for_each_cpu_and(ilb_cpu, nohz.idle_cpus_mask, hk_mask) {
>> +	for_each_cpu(ilb_cpu, cpus) {
>>
>>  		if (ilb_cpu == smp_processor_id())
>>  			continue;
>> @@ -12656,7 +12656,7 @@ static inline int find_new_ilb(void)
>>   * We pick the first idle CPU in the HK_TYPE_KERNEL_NOISE housekeeping set
>>   * (if there is one).
>>   */
>> -static void kick_ilb(unsigned int flags)
>> +static void kick_ilb(unsigned int flags, struct cpumask *cpus)
>>  {
>>  	int ilb_cpu;
>>
>> @@ -12667,7 +12667,7 @@ static void kick_ilb(unsigned int flags)
>>  	if (flags & NOHZ_BALANCE_KICK)
>>  		nohz.next_balance = jiffies+1;
>>
>> -	ilb_cpu = find_new_ilb();
>> +	ilb_cpu = find_new_ilb(cpus);
>>  	if (ilb_cpu < 0)
>>  		return;
>>
>> @@ -12700,6 +12700,7 @@ static void kick_ilb(unsigned int flags)
>>   */
>>  static void nohz_balancer_kick(struct rq *rq)
>>  {
>> +	struct cpumask *ilb_cpus = this_cpu_cpumask_var_ptr(kick_ilb_tmpmask);
>>  	unsigned long now = jiffies;
>>  	struct sched_domain_shared *sds;
>>  	struct sched_domain *sd;
>> @@ -12715,27 +12716,41 @@ static void nohz_balancer_kick(struct rq *rq)
>>  	 */
>>  	nohz_balance_exit_idle(rq);
>>
>> +	/* ILB considers only HK_TYPE_KERNEL_NOISE housekeeping CPUs */
>> +
>>  	if (READ_ONCE(nohz.has_blocked_load) &&
>> -	    time_after(now, READ_ONCE(nohz.next_blocked)))
>> +	    time_after(now, READ_ONCE(nohz.next_blocked))) {
>>  		flags = NOHZ_STATS_KICK;
>> +		cpumask_and(ilb_cpus, nohz.idle_cpus_mask,
>> +			    housekeeping_cpumask(HK_TYPE_KERNEL_NOISE));
>> +	}
>>
>>  	/*
>> -	 * Most of the time system is not 100% busy. i.e nohz.nr_cpus > 0
>> -	 * Skip the read if time is not due.
>> +	 * Most of the time system is not 100% busy. i.e there are idle
>> +	 * housekeeping CPUs.
>> +	 *
>> +	 * So, Skip the reading idle_cpus_mask if time is not due.
>>  	 *
>>  	 * If none are in tickless mode, there maybe a narrow window
>>  	 * (28 jiffies, HZ=1000) where flags maybe set and kick_ilb called.
>>  	 * But idle load balancing is not done as find_new_ilb fails.
>> -	 * That's very rare. So read nohz.nr_cpus only if time is due.
>> +	 * That's very rare. So check (idle_cpus_mask & HK_TYPE_KERNEL_NOISE)
>> +	 * only if time is due.
>> +	 *
>>  	 */
>>  	if (time_before(now, nohz.next_balance))
>>  		goto out;
>>
>> +	/* Avoid the double computation */
>> +	if (flags != NOHZ_STATS_KICK)
>> +		cpumask_and(ilb_cpus, nohz.idle_cpus_mask,
>> +			    housekeeping_cpumask(HK_TYPE_KERNEL_NOISE));
>> +
> There is no usage of ilb_cpus till this point. We can avoid this if
> condition and get the ilb_cpus here itself instead of earlier.

No, there is a use before this point. Here is why.

	struct cpumask *ilb_cpus = this_cpu_cpumask_var_ptr(kick_ilb_tmpmask);
		<< this is just a variable.

	if (READ_ONCE(nohz.has_blocked_load) &&
	    time_after(now, READ_ONCE(nohz.next_blocked)))
		flags = NOHZ_STATS_KICK;

	if (time_before(now, nohz.next_balance))
		goto out;

If there are idle CPUs, nohz.has_blocked_load is set to 1 on idle entry,
which could happen after the previous nohz idle balance. After 32 jiffies,
now is past next_blocked. But nohz.next_balance is typically set to
60 jiffies. So the code goes to out with flags set and passes ilb_cpus,
which has not been set yet.

Hence setting ilb_cpus in both places is necessary. I kept it in both
places and added the flags check, since it is difficult to predict how
nohz.next_balance and nohz.next_blocked move: there are multiple CPUs
involved, which may be doing idle entry/exit.

On the first tick after idle exit, nohz_balancer_kick would be called.

>>  	/*
>>  	 * None are in tickless mode and hence no need for NOHZ idle load
>>  	 * balancing
>>  	 */
>> -	if (unlikely(cpumask_empty(nohz.idle_cpus_mask)))
>> +	if (unlikely(cpumask_empty(ilb_cpus)))
>>  		return;
>>
>>  	if (rq->nr_running >= 2) {
>> @@ -12767,7 +12782,7 @@ static void nohz_balancer_kick(struct rq *rq)
>>  		 * When balancing between cores, all the SMT siblings of the
>>  		 * preferred CPU must be idle.
>>  		 */
>> -		for_each_cpu_and(i, sched_domain_span(sd), nohz.idle_cpus_mask) {
>> +		for_each_cpu_and(i, sched_domain_span(sd), ilb_cpus) {
>>  			if (sched_asym(sd, i, cpu)) {
>>  				flags = NOHZ_STATS_KICK | NOHZ_BALANCE_KICK;
>>  				goto unlock;
>> @@ -12820,7 +12835,7 @@ static void nohz_balancer_kick(struct rq *rq)
>>  		flags |= NOHZ_NEXT_KICK;
>>
>>  	if (flags)
>> -		kick_ilb(flags);
>> +		kick_ilb(flags, ilb_cpus);
>>  }
>>
>>  static void set_cpu_sd_state_busy(int cpu)
>> @@ -14253,6 +14268,8 @@ __init void init_sched_fair_class(void)
>>  		zalloc_cpumask_var_node(&per_cpu(select_rq_mask, i), GFP_KERNEL, cpu_to_node(i));
>>  		zalloc_cpumask_var_node(&per_cpu(should_we_balance_tmpmask, i),
>>  					GFP_KERNEL, cpu_to_node(i));
>> +		zalloc_cpumask_var_node(&per_cpu(kick_ilb_tmpmask, i),
>> +					GFP_KERNEL, cpu_to_node(i));
>>
>>  #ifdef CONFIG_CFS_BANDWIDTH
>>  		INIT_CSD(&cpu_rq(i)->cfsb_csd, __cfsb_csd_unthrottle, cpu_rq(i));
>> --
>> 2.43.0
>>
> Rest LGTM
>

Thank you for going through the patch.
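
For illustration, the window discussed in this reply (now is past
nohz.next_blocked but still before nohz.next_balance) can be modelled in
user space. This is only a rough sketch under stated assumptions, not
kernel code: struct nohz_model, struct kick_result, model_kick(), the
64-bit masks and the flag values are all hypothetical stand-ins, and
everything after the next_balance check is left out. It compares computing
the mask in both places (as the patch does) with computing it only after
the next_balance check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for NOHZ_STATS_KICK; the value is illustrative. */
#define STATS_KICK 0x1u

/* Wrap-safe jiffies comparisons, written out so the sketch is standalone. */
#define time_after(a, b)  ((long)((b) - (a)) < 0)
#define time_before(a, b) time_after(b, a)

/* Hypothetical model of the nohz state that the decision depends on. */
struct nohz_model {
	uint64_t idle_cpus;          /* bit n set: CPU n is idle and tickless */
	uint64_t hk_cpus;            /* bit n set: CPU n is a housekeeping CPU */
	bool has_blocked_load;
	unsigned long next_blocked;
	unsigned long next_balance;
};

/* What would be handed to kick_ilb() in this model. */
struct kick_result {
	unsigned int flags;
	bool mask_valid;             /* was ilb_cpus computed before use? */
	uint64_t ilb_cpus;
};

/*
 * compute_early == true:  compute the mask in the stats-kick branch and
 *                         again later only if that branch was not taken.
 * compute_early == false: compute the mask only after the next_balance
 *                         check, as the review comment suggested.
 */
static struct kick_result model_kick(const struct nohz_model *nz,
				     unsigned long now, bool compute_early)
{
	struct kick_result r = { 0, false, 0 };

	assert(nz);

	if (nz->has_blocked_load && time_after(now, nz->next_blocked)) {
		r.flags = STATS_KICK;
		if (compute_early) {
			r.ilb_cpus = nz->idle_cpus & nz->hk_cpus;
			r.mask_valid = true;
		}
	}

	/* Models "goto out": the kick uses whatever has been set so far. */
	if (time_before(now, nz->next_balance))
		return r;

	/* Avoid the double computation when the early branch already ran. */
	if (!compute_early || r.flags != STATS_KICK) {
		r.ilb_cpus = nz->idle_cpus & nz->hk_cpus;
		r.mask_valid = true;
	}

	/* The remaining checks in nohz_balancer_kick() are omitted. */
	return r;
}
```

With next_blocked = 32, next_balance = 60 and now = 40, the model returns
with the stats-kick flag set in both variants, but only the compute_early
variant has a valid mask at that point. Once now is past next_balance, both
variants compute the mask.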