public inbox for linux-kernel@vger.kernel.org
From: Shrikanth Hegde <sshegde@linux.ibm.com>
To: Shubhang Kaushik <shubhang@os.amperecomputing.com>,
	mingo@kernel.org, peterz@infradead.org,
	vincent.guittot@linaro.org, linux-kernel@vger.kernel.org
Cc: kprateek.nayak@amd.com, juri.lelli@redhat.com,
	vschneid@redhat.com, tglx@linutronix.de,
	dietmar.eggemann@arm.com, frederic@kernel.org,
	longman@redhat.com
Subject: Re: [PATCH 1/2] sched/fair: consider hk_mask early in triggering ilb
Date: Fri, 20 Mar 2026 08:17:46 +0530	[thread overview]
Message-ID: <c909d047-f630-4184-b8ff-c80a28c99342@linux.ibm.com> (raw)
In-Reply-To: <788875ec-787b-3024-6f01-ef3ed9bd6ec7@os.amperecomputing.com>

Hi Shubhang. Thanks for taking a look.

On 3/20/26 4:28 AM, Shubhang Kaushik wrote:
> Hi Shrikanth,
> 
> On Thu, 19 Mar 2026, Shrikanth Hegde wrote:
> 
>> Current code around nohz_balancer_kick and kick_ilb:
>> 1. Checks for nohz.idle_cpus_mask to see if idle load balance(ilb) is
>>   needed.
>> 2. Does a few checks to see if any conditions meet the criteria.
>> 3. Tries to find the idle CPU. But the idle CPU found should be part of
>>   housekeeping CPUs.
>>
>> If there is no housekeeping idle CPU, then step 2 is done
>> unnecessarily, since 3 bails out without doing the ilb.
>>
>> Fix that by making the decision early and pass it on to find_new_ilb.
>> Use a percpu cpumask instead of allocating it every time, since this
>> is in the fast path.
>>
>> If flags is set to NOHZ_STATS_KICK (i.e. the time is after
>> nohz.next_blocked but before nohz.next_balance) and there are idle
>> CPUs which are part of housekeeping, the same logic needs to be
>> copied there too.
>>
>> While there, fix the stale comments around nohz.nr_cpus
>>
>> Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
>> ---
>>
>> Didn't add the fixes tag since it addresses more than stale comments.
>>
>> kernel/sched/fair.c | 45 +++++++++++++++++++++++++++++++--------------
>> 1 file changed, 31 insertions(+), 14 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index b19aeaa51ebc..02cca2c7a98d 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -7392,6 +7392,7 @@ static inline unsigned int cfs_h_nr_delayed(struct rq *rq)
>> static DEFINE_PER_CPU(cpumask_var_t, load_balance_mask);
>> static DEFINE_PER_CPU(cpumask_var_t, select_rq_mask);
>> static DEFINE_PER_CPU(cpumask_var_t, should_we_balance_tmpmask);
>> +static DEFINE_PER_CPU(cpumask_var_t, kick_ilb_tmpmask);
>>
>> #ifdef CONFIG_NO_HZ_COMMON
>>
>> @@ -12629,15 +12630,14 @@ static inline int on_null_domain(struct rq *rq)
>>  * - When one of the busy CPUs notices that there may be an idle rebalancing
>>  *   needed, they will kick the idle load balancer, which then does idle
>>  *   load balancing for all the idle CPUs.
>> + *
>> + *   @cpus idle CPUs in HK_TYPE_KERNEL_NOISE housekeeping
>>  */
>> -static inline int find_new_ilb(void)
>> +static inline int find_new_ilb(struct cpumask *cpus)
>> {
>> -    const struct cpumask *hk_mask;
>>     int ilb_cpu;
>>
>> -    hk_mask = housekeeping_cpumask(HK_TYPE_KERNEL_NOISE);
>> -
>> -    for_each_cpu_and(ilb_cpu, nohz.idle_cpus_mask, hk_mask) {
>> +    for_each_cpu(ilb_cpu, cpus) {
>>
>>         if (ilb_cpu == smp_processor_id())
>>             continue;
>> @@ -12656,7 +12656,7 @@ static inline int find_new_ilb(void)
>>  * We pick the first idle CPU in the HK_TYPE_KERNEL_NOISE housekeeping set
>>  * (if there is one).
>>  */
>> -static void kick_ilb(unsigned int flags)
>> +static void kick_ilb(unsigned int flags, struct cpumask *cpus)
>> {
>>     int ilb_cpu;
>>
>> @@ -12667,7 +12667,7 @@ static void kick_ilb(unsigned int flags)
>>     if (flags & NOHZ_BALANCE_KICK)
>>         nohz.next_balance = jiffies+1;
>>
>> -    ilb_cpu = find_new_ilb();
>> +    ilb_cpu = find_new_ilb(cpus);
>>     if (ilb_cpu < 0)
>>         return;
>>
>> @@ -12700,6 +12700,7 @@ static void kick_ilb(unsigned int flags)
>>  */
>> static void nohz_balancer_kick(struct rq *rq)
>> {
>> +    struct cpumask *ilb_cpus = this_cpu_cpumask_var_ptr(kick_ilb_tmpmask);
>>     unsigned long now = jiffies;
>>     struct sched_domain_shared *sds;
>>     struct sched_domain *sd;
>> @@ -12715,27 +12716,41 @@ static void nohz_balancer_kick(struct rq *rq)
>>      */
>>     nohz_balance_exit_idle(rq);
>>
>> +    /* ILB considers only HK_TYPE_KERNEL_NOISE housekeeping CPUs */
>> +
>>     if (READ_ONCE(nohz.has_blocked_load) &&
>> -        time_after(now, READ_ONCE(nohz.next_blocked)))
>> +        time_after(now, READ_ONCE(nohz.next_blocked))) {
>>         flags = NOHZ_STATS_KICK;
>> +        cpumask_and(ilb_cpus, nohz.idle_cpus_mask,
>> +                housekeeping_cpumask(HK_TYPE_KERNEL_NOISE));
>> +    }
>>
> Moving cpumask_and() here would be taxing the busy/requesting CPU on 
> every kick check, thereby breaking the lazy evaluation.
> 

There are time checks, so it is not done on every tick.

> In nohz_full scenario, the housekeeping mask is static (default CPU 0).

One could specify the mask explicitly. What do you mean by CPU 0?

> If this HK CPU is busy, it is already running the tick and will handle 
> load balancing itself. If it is idle, it is already in the 
> idle_cpus_mask. Moving this intersection to the busy CPU disturbs lazy 
> evaluation. Why force every busy worker to perform bitmask math just to 
> handle a no-idle-HK case that the system handles naturally by being busy?
> 

Lazy evaluation applies to NOHZ_BALANCE_KICK, which does the actual idle balance.
But NOHZ_STATS_KICK would be set more often; on a normal system, roughly every 32 msec.

You do have a point. What if it was not due to nohz.next_blocked, but due to later
checks such as nr_running > 1? I would say that's rare. I didn't want to add too many
checks, such as doing this only for nohz_full systems and keeping the lazy evaluation
for systems without it.

I do agree the benefit of this patch is limited to the case where one
specifies nohz_full=<small set of CPUs>.

>>     /*
>> -     * Most of the time system is not 100% busy. i.e nohz.nr_cpus > 0
>> -     * Skip the read if time is not due.
>> +     * Most of the time the system is not 100% busy, i.e. there are idle
>> +     * housekeeping CPUs.
>> +     *
>> +     * So, skip reading the idle_cpus_mask if time is not due.
>>      *
>>      * If none are in tickless mode, there maybe a narrow window
>>      * (28 jiffies, HZ=1000) where flags maybe set and kick_ilb called.
>>      * But idle load balancing is not done as find_new_ilb fails.
>> -     * That's very rare. So read nohz.nr_cpus only if time is due.
>> +     * That's very rare. So check (idle_cpus_mask & HK_TYPE_KERNEL_NOISE)
>> +     * only if time is due.
>> +     *
>>      */
>>     if (time_before(now, nohz.next_balance))
>>         goto out;
>>
>> +    /* Avoid the double computation */
>> +    if (flags != NOHZ_STATS_KICK)
>> +        cpumask_and(ilb_cpus, nohz.idle_cpus_mask,
>> +                housekeeping_cpumask(HK_TYPE_KERNEL_NOISE));
>> +
>>     /*
>>      * None are in tickless mode and hence no need for NOHZ idle load
>>      * balancing
>>      */
>> -    if (unlikely(cpumask_empty(nohz.idle_cpus_mask)))
>> +    if (unlikely(cpumask_empty(ilb_cpus)))
>>         return;
>>
> Checking ilb_cpus here is a broken fix because the underlying 
> nohz.idle_cpus_mask is currently unreliable. Under CONFIG_NO_HZ_FULL, 
> there is a documented visibility bug: tick_nohz_full_stop_tick() often 
> stops the tick before idle entry, causing nohz_balance_enter_idle() to 
> be skipped. This means many valid, tickless idle CPUs are never added to 
> the mask in the first place.

This is not in the scope of this patch.

> https://lore.kernel.org/lkml/20260203-fix-nohz-idle-v1-1-ad05a5872080@os.amperecomputing.com/
> 
>>     if (rq->nr_running >= 2) {
>> @@ -12767,7 +12782,7 @@ static void nohz_balancer_kick(struct rq *rq)
>>          * When balancing between cores, all the SMT siblings of the
>>          * preferred CPU must be idle.
>>          */
>> -        for_each_cpu_and(i, sched_domain_span(sd), nohz.idle_cpus_mask) {
>> +        for_each_cpu_and(i, sched_domain_span(sd), ilb_cpus) {
>>             if (sched_asym(sd, i, cpu)) {
>>                 flags = NOHZ_STATS_KICK | NOHZ_BALANCE_KICK;
>>                 goto unlock;
>> @@ -12820,7 +12835,7 @@ static void nohz_balancer_kick(struct rq *rq)
>>         flags |= NOHZ_NEXT_KICK;
>>
>>     if (flags)
>> -        kick_ilb(flags);
>> +        kick_ilb(flags, ilb_cpus);
>> }
>>
>> static void set_cpu_sd_state_busy(int cpu)
>> @@ -14253,6 +14268,8 @@ __init void init_sched_fair_class(void)
>>         zalloc_cpumask_var_node(&per_cpu(select_rq_mask,    i), GFP_KERNEL, cpu_to_node(i));
>>         zalloc_cpumask_var_node(&per_cpu(should_we_balance_tmpmask, i),
>>                     GFP_KERNEL, cpu_to_node(i));
>> +        zalloc_cpumask_var_node(&per_cpu(kick_ilb_tmpmask, i),
>> +                    GFP_KERNEL, cpu_to_node(i));
>>
>> #ifdef CONFIG_CFS_BANDWIDTH
>>         INIT_CSD(&cpu_rq(i)->cfsb_csd, __cfsb_csd_unthrottle, cpu_rq(i));
>> -- 
>> 2.43.0
>>
>>
> 
> Regards,
> Shubhang Kaushik



Thread overview: 17+ messages
2026-03-19  6:53 [PATCH 0/2] sched/fair: Minor improvements while triggering idle load balance Shrikanth Hegde
2026-03-19  6:53 ` [PATCH 1/2] sched/fair: consider hk_mask early in triggering ilb Shrikanth Hegde
2026-03-19  8:15   ` Mukesh Kumar Chaurasiya
2026-03-19 13:13     ` Shrikanth Hegde
2026-03-19 22:58   ` Shubhang Kaushik
2026-03-20  2:47     ` Shrikanth Hegde [this message]
2026-03-20  3:37   ` K Prateek Nayak
2026-03-20  9:19     ` Shrikanth Hegde
2026-03-20 11:43       ` Peter Zijlstra
2026-03-20 14:12         ` Shrikanth Hegde
2026-03-20 14:28           ` Shrikanth Hegde
2026-03-19  6:53 ` [PATCH 2/2] sched/fair: get this cpu once in find_new_ilb Shrikanth Hegde
2026-03-19  8:18   ` Mukesh Kumar Chaurasiya
2026-03-19  9:20   ` Peter Zijlstra
2026-03-19 13:03     ` Shrikanth Hegde
2026-03-19 13:39       ` Peter Zijlstra
2026-03-20  3:40   ` K Prateek Nayak
