All of lore.kernel.org
 help / color / mirror / Atom feed
From: Ingo Molnar <mingo@kernel.org>
To: "Joel Fernandes (Google)" <joel@joelfernandes.org>
Cc: linux-kernel@vger.kernel.org, Ingo Molnar <mingo@redhat.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Juri Lelli <juri.lelli@redhat.com>,
	Vincent Guittot <vincent.guittot@linaro.org>,
	Dietmar Eggemann <dietmar.eggemann@arm.com>,
	Steven Rostedt <rostedt@goodmis.org>,
	Ben Segall <bsegall@google.com>, Mel Gorman <mgorman@suse.de>,
	Daniel Bristot de Oliveira <bristot@redhat.com>,
	Valentin Schneider <vschneid@redhat.com>,
	Vineeth Pillai <vineethrp@google.com>,
	Suleiman Souhlal <suleiman@google.com>,
	Hsin Yi <hsinyi@google.com>,
	Frederic Weisbecker <frederic@kernel.org>,
	"Paul E . McKenney" <paulmck@kernel.org>
Subject: Re: [PATCH RFC] sched/fair: Avoid unnecessary IPIs for ILB
Date: Fri, 6 Oct 2023 12:51:41 +0200	[thread overview]
Message-ID: <ZR/mvd8Uw8PG+jx0@gmail.com> (raw)
In-Reply-To: <20231005161727.1855004-1-joel@joelfernandes.org>


* Joel Fernandes (Google) <joel@joelfernandes.org> wrote:

> From: Vineeth Pillai <vineethrp@google.com>
> 
> Whenever a CPU stops its tick, it now requires another idle CPU to handle the
> balancing for it because it can't perform its own periodic load balancing.
> This means it might need to update 'nohz.next_balance' to 'rq->next_balance' if
> the upcoming nohz-idle load balancing is too distant in the future. This update
> process is done by triggering an ILB, as the general ILB handler
> (_nohz_idle_balance) that manages regular nohz balancing also refreshes
> 'nohz.next_balance' by looking at the 'rq->next_balance' of all other idle CPUs
> and selecting the smallest value.
> 
> Triggering this ILB can be achieved by setting the NOHZ_NEXT_KICK flag. This
> primarily results in the ILB handler updating 'nohz.next_balance' while
> possibly not doing any load balancing at all. However, sending an IPI merely to
> refresh 'nohz.next_balance' seems excessive, and there ought to be a more
> efficient method to update 'nohz.next_balance' from the local CPU.
> 
> Fortunately, there already exists a mechanism to directly invoke the ILB
> handler (_nohz_idle_balance) without initiating an IPI. It's accomplished by
> setting the NOHZ_NEWILB_KICK flag. This flag is set during regular "newly idle"
> balancing and solely exists to update a CPU's blocked load if it couldn't pull
> more tasks during regular "newly idle balancing" - and it does so without
> having to send any IPIs. Once the flag is set, the ILB handler is called
> directly from do_idle()-> nohz_run_idle_balance(). While its goal is to update
> the blocked load without an IPI, in our situation, we aim to refresh
> 'nohz.next_balance' without an IPI but we can piggy back on this.
> 
> So in this patch, we reuse this mechanism by also setting the NOHZ_NEXT_KICK to
> indicate nohz.next_balance needs an update via this direct call shortcut. Note
> that we set this flag without knowledge that the tick is about to be stopped,
> because at the point we do it, we have no way of knowing that. However we do
> know that the CPU is about to enter idle. In our testing, the reduction in IPIs
> is well worth updating nohz.next_balance a few more times.
> 
> Also just to note, without this patch we observe the following pattern:
> 
> 1. A CPU is about to stop its tick.
> 2. It sets nohz.needs_update to 1.
> 3. It then stops its tick and goes idle.
> 4. The scheduler tick on another CPU checks this flag and decides an ILB kick is needed.
> 5. The ILB CPU ends up being the one that just stopped its tick!
> 6. This results in an IPI to the tick-stopped CPU which ends up waking it up
>    and disturbing it!
> 
> Testing shows a considerable reduction in IPIs when doing this:
> 
> Running "cyclictest -i 100 -d 100 --latency=1000 -t -m" on a 4vcpu VM
> the IPI call count profiled over 10s period is as follows:
> without fix: ~10500
> with fix: ~1000
> 
> Fixes: 7fd7a9e0caba ("sched/fair: Trigger nohz.next_balance updates when a CPU goes NOHZ-idle")
> 
> [ Joel: wrote commit messages, collaborated on fix, helped reproduce issue etc. ]
> 
> Cc: Suleiman Souhlal <suleiman@google.com>
> Cc: Steven Rostedt <rostedt@goodmis.org>
> Cc: Hsin Yi <hsinyi@google.com>
> Cc: Frederic Weisbecker <frederic@kernel.org>
> Cc: Paul E. McKenney <paulmck@kernel.org>
> Signed-off-by: Vineeth Pillai <vineethrp@google.com>
> Co-developed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> ---
>  kernel/sched/fair.c | 21 ++++++++++++++-------
>  1 file changed, 14 insertions(+), 7 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index cb225921bbca..2ece55f32782 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -11786,13 +11786,12 @@ void nohz_balance_enter_idle(int cpu)
>  	/*
>  	 * Ensures that if nohz_idle_balance() fails to observe our
>  	 * @idle_cpus_mask store, it must observe the @has_blocked
> -	 * and @needs_update stores.
> +	 * stores.
>  	 */
>  	smp_mb__after_atomic();
>  
>  	set_cpu_sd_state_idle(cpu);
>  
> -	WRITE_ONCE(nohz.needs_update, 1);
>  out:
>  	/*
>  	 * Each time a cpu enter idle, we assume that it has blocked load and
> @@ -11945,21 +11944,25 @@ static bool nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
>  }
>  
>  /*
> - * Check if we need to run the ILB for updating blocked load before entering
> - * idle state.
> + * Check if we need to run the ILB for updating blocked load and/or updating
> + * nohz.next_balance before entering idle state.
>   */
>  void nohz_run_idle_balance(int cpu)
>  {
>  	unsigned int flags;
>  
> -	flags = atomic_fetch_andnot(NOHZ_NEWILB_KICK, nohz_flags(cpu));
> +	flags = atomic_fetch_andnot(NOHZ_NEWILB_KICK | NOHZ_NEXT_KICK, nohz_flags(cpu));
> +
> +	if (!flags)
> +		return;
>  
>  	/*
>  	 * Update the blocked load only if no SCHED_SOFTIRQ is about to happen
>  	 * (ie NOHZ_STATS_KICK set) and will do the same.
>  	 */
> -	if ((flags == NOHZ_NEWILB_KICK) && !need_resched())
> -		_nohz_idle_balance(cpu_rq(cpu), NOHZ_STATS_KICK);
> +	if ((flags == (flags & (NOHZ_NEXT_KICK | NOHZ_NEWILB_KICK))) &&
> +	    !need_resched())
> +		_nohz_idle_balance(cpu_rq(cpu), flags);
>  }
>  
>  static void nohz_newidle_balance(struct rq *this_rq)
> @@ -11977,6 +11980,10 @@ static void nohz_newidle_balance(struct rq *this_rq)
>  	if (this_rq->avg_idle < sysctl_sched_migration_cost)
>  		return;
>  
> +	/* If rq->next_balance before nohz.next_balance, trigger ILB */
> +	if (time_before(this_rq->next_balance, READ_ONCE(nohz.next_balance)))
> +		atomic_or(NOHZ_NEXT_KICK, nohz_flags(this_cpu));
> +
>  	/* Don't need to update blocked load of idle CPUs*/
>  	if (!READ_ONCE(nohz.has_blocked) ||
>  	    time_before(jiffies, READ_ONCE(nohz.next_blocked)))

Ok, judging by your IPI reduction numbers this is definitely an 
optimization we want to do.

The patch does make _nohz_idle_balance() run more parallel, as previously 
it would be generally run by the first-idle CPU in nohz.idle_cpus_mask (at 
least for next_balance updates), but I think it's still SMP-safe, as all 
key data structure updates are already rq-locked AFAICS.

One thing I noticed is that we now use nohz.needs_update only in a single 
remaining case, when _nohz_idle_balance() "self-defers":

                /*
                 * If this CPU gets work to do, stop the load balancing
                 * work being done for other CPUs. Next load
                 * balancing owner will pick it up.
                 */
                if (need_resched()) {
                        if (flags & NOHZ_STATS_KICK)
                                has_blocked_load = true;
                        if (flags & NOHZ_NEXT_KICK)
                                WRITE_ONCE(nohz.needs_update, 1);
                        goto abort;
                }

Getting a need-resched flag set on this CPU is a pretty dubious reason to 
skip an ILB run IMO, and we could do entirely without that complication, 
allowing us to remove the nohz.needs_update flag handling logic altogether?

If we do that then the !need_resched() flag needs to go from 
nohz_run_idle_balance() too:

        /*
         * Update the blocked load only if no SCHED_SOFTIRQ is about to happen
         * (ie NOHZ_STATS_KICK set) and will do the same.
         */
        if ((flags == (flags & (NOHZ_NEXT_KICK | NOHZ_NEWILB_KICK))) &&
            !need_resched())
                _nohz_idle_balance(cpu_rq(cpu), flags);

... if I'm reading the code right that is.

Thanks,

	Ingo

  reply	other threads:[~2023-10-06 10:51 UTC|newest]

Thread overview: 20+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-10-05 16:17 [PATCH RFC] sched/fair: Avoid unnecessary IPIs for ILB Joel Fernandes (Google)
2023-10-06 10:51 ` Ingo Molnar [this message]
2023-10-06 16:32   ` Joel Fernandes
2023-10-08 17:35   ` Joel Fernandes
2023-10-09 18:33     ` Vineeth Pillai
2023-10-10  7:15     ` Vincent Guittot
2023-10-10 19:32       ` Joel Fernandes
2023-10-06 13:46 ` Vincent Guittot
2023-10-06 16:46   ` Joel Fernandes
2023-10-06 19:18 ` Shrikanth Hegde
2023-10-06 20:10   ` Shrikanth Hegde
2023-10-08 16:50     ` Joel Fernandes
2023-10-06 21:20   ` Vineeth Pillai
2023-10-08 16:46   ` Joel Fernandes
2023-10-06 20:01 ` Peter Zijlstra
2023-10-08 16:39   ` Joel Fernandes
2023-10-09 11:25     ` Ingo Molnar
2023-10-09 20:11       ` Steven Rostedt
2023-10-10 17:55       ` Joel Fernandes
2023-10-19 14:56 ` kernel test robot

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ZR/mvd8Uw8PG+jx0@gmail.com \
    --to=mingo@kernel.org \
    --cc=bristot@redhat.com \
    --cc=bsegall@google.com \
    --cc=dietmar.eggemann@arm.com \
    --cc=frederic@kernel.org \
    --cc=hsinyi@google.com \
    --cc=joel@joelfernandes.org \
    --cc=juri.lelli@redhat.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mgorman@suse.de \
    --cc=mingo@redhat.com \
    --cc=paulmck@kernel.org \
    --cc=peterz@infradead.org \
    --cc=rostedt@goodmis.org \
    --cc=suleiman@google.com \
    --cc=vincent.guittot@linaro.org \
    --cc=vineethrp@google.com \
    --cc=vschneid@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.