All of lore.kernel.org
 help / color / mirror / Atom feed
From: Joel Fernandes <joel@joelfernandes.org>
To: Vincent Guittot <vincent.guittot@linaro.org>
Cc: linux-kernel@vger.kernel.org, Ingo Molnar <mingo@redhat.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Juri Lelli <juri.lelli@redhat.com>,
	Dietmar Eggemann <dietmar.eggemann@arm.com>,
	Steven Rostedt <rostedt@goodmis.org>,
	Ben Segall <bsegall@google.com>, Mel Gorman <mgorman@suse.de>,
	Daniel Bristot de Oliveira <bristot@redhat.com>,
	Valentin Schneider <vschneid@redhat.com>,
	"Vineeth Pillai (Google)" <vineeth@bitbyteword.org>,
	Suleiman Souhlal <suleiman@google.com>,
	Frederic Weisbecker <frederic@kernel.org>,
	"Paul E . McKenney" <paulmck@kernel.org>
Subject: Re: [PATCH 3/3] sched: Update ->next_balance correctly during newidle balance
Date: Thu, 13 Jun 2024 07:13:42 +0000	[thread overview]
Message-ID: <20240613071342.GA1810503@google.com> (raw)
In-Reply-To: <ZVOVkDf28foTkn/A@vingu-book>

Getting to this pretty late, sorry, see below.

On Tue, Nov 14, 2023 at 04:43:12PM +0100, Vincent Guittot wrote:
> Le jeudi 09 nov. 2023 à 10:02:54 (+0000), Joel Fernandes a écrit :
> > Hi Vincent,
> > 
> > Sorry for late reply, I was in Tokyo all these days and was waiting to get to
> > writing a proper reply. See my replies below:
> > 
> > On Thu, Oct 26, 2023 at 04:23:35PM +0200, Vincent Guittot wrote:
> > > On Sun, 22 Oct 2023 at 02:28, Joel Fernandes <joel@joelfernandes.org> wrote:
> > > >
> > > > On Fri, Oct 20, 2023 at 03:40:14PM +0200, Vincent Guittot wrote:
> > > > > On Fri, 20 Oct 2023 at 03:40, Joel Fernandes (Google)
> > > > > <joel@joelfernandes.org> wrote:
> > > > > >
> > > > > > From: "Vineeth Pillai (Google)" <vineeth@bitbyteword.org>
> > > > > >
> > > > > > When newidle balancing triggers, we see that it constantly clobbers
> > > > > > rq->next_balance even when there is no newidle balance happening due to
> > > > > > the cost estimates.  Due to this, we see that periodic load balance
> > > > > > (rebalance_domains) may trigger way more often when the CPU is going in
> > > > > > and out of idle at a high rate but is no really idle. Repeatedly
> > > > > > triggering load balance there is a bad idea as it is a heavy operation.
> > > > > > It also causes increases in softirq.
> > > > >
> > > > > we have 2 balance intervals:
> > > > > - one when idle based on the sd->balance_interval = sd_weight
> > > > > - one when busy which increases the period by multiplying it with
> > > > > busy_factor = 16
> > > >
> > > > On my production system I see load balance triggering every 4 jiffies! In a
> > > 
> > > Which kind of system do you have? sd->balance_interval is in ms
> > 
> > Yes, sorry I meant it triggers every jiffies which is extreme sometimes. It
> > is an ADL SoC (12th gen Intel, 4 P cores 8 E cores) get_sd_balance_interval()
> > returns 4 jiffies there. On my Qemu system, I see 8 jiffies.
> 
> Do you have details about the sched_domain  hierarchy ?
> That could be part of your problem (see below)

The hierarchy is pretty simple:

$ cat /sys/kernel/debug/sched/domains/cpu*/domain0/name
MC
MC
MC
MC

I boot qemu like this by passing "-smp cpus=4,threads=1,sockets=1"

> > 
> > [...]
> > > > > > Another issue is ->last_balance is not updated after newidle balance
> > > > > > causing mistakes in the ->next_balance calculations.
> > > > >
> > > > > newly idle load balance is not equal to idle load balance. It's a
> > > > > light load balance trying to pull one  task and you can't really
> > > > > consider it to the normal load balance
> > > >
> > > > True. However the point is that it is coupled with the other load balance
> > > > mechanism and the two are not independent. As you can see below, modifying
> > > > rq->next_balance in newidle also causes the periodic balance to happen more
> > > > aggressively as well if there is a high transition from busy to idle and
> > > > viceversa.
> > > 
> > > As mentioned, rq->next_balance is updated whenever cpu enters idle
> > > (i.e. in newidle_balance() but it's not related with doing a newly
> > > idle load balance.
> > 
> > Yes, I understand that. But my point was that the update of rq->next_balance
> > from the newidle path is itself buggy and interferes with the load balance
> > happening from the tick (trigger_load_balance -> run_rebalance_domains).
> 
> Newidle path is not buggy. It only uses sd->last_balance + interval to
> estimate the next balance  which is the correct thing to do. Your problem
> comes from the update of sd->last_balance which never happens and remains
> in the past whereas you call run_rebalance_domains() which should
> run load_balance for all domains with a sd->last_balance + interval in the
> past.
> Your problem most probably comes from the should_we_balance which always or
> "almost always" returns false in your use case for some sched_domain and
> prevents to updat sd->last_balance. Could you try the patch below ?
> It should fix your problem of trying to rebalance every tick whereas
> rebalance_domain is called.
> At least this should show if it's your problem but I'm not sure it's the right
> things to do all the time ...

I tried your diff below. It did not make a difference to the problem. Only
this patch series made a ~10-20x softirq reduction.

> 
> ---
>  kernel/sched/fair.c | 18 ++++++------------
>  1 file changed, 6 insertions(+), 12 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 3745ca289240..9ea1f42e5362 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -11568,17 +11568,6 @@ static void rebalance_domains(struct rq *rq, enum cpu_idle_type idle)
>  		need_decay = update_newidle_cost(sd, 0);
>  		max_cost += sd->max_newidle_lb_cost;
> 
> -		/*
> -		 * Stop the load balance at this level. There is another
> -		 * CPU in our sched group which is doing load balancing more
> -		 * actively.
> -		 */
> -		if (!continue_balancing) {
> -			if (need_decay)
> -				continue;
> -			break;
> -		}
> -
>  		interval = get_sd_balance_interval(sd, busy);
> 
>  		need_serialize = sd->flags & SD_SERIALIZE;
> @@ -11588,7 +11577,12 @@ static void rebalance_domains(struct rq *rq, enum cpu_idle_type idle)
>  		}
> 
>  		if (time_after_eq(jiffies, sd->last_balance + interval)) {
> -			if (load_balance(cpu, rq, sd, idle, &continue_balancing)) {
> +			/*
> +			 * Stop the load balance at this level. There is another
> +			 * CPU in our sched group which is doing load balancing more
> +			 * actively.
> +			 */
> +			if (continue_balancing && load_balance(cpu, rq, sd, idle, &continue_balancing)) {

This diff did not solve the problem. Let me go see what other paths are not
updating sd->last_balance in the run_rebalance_domains()..

thanks,

 - Joel


  parent reply	other threads:[~2024-06-13  7:13 UTC|newest]

Thread overview: 22+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-10-20  1:40 [PATCH 1/3] sched/nohz: Update nohz.next_balance directly without IPIs (v2) Joel Fernandes (Google)
2023-10-20  1:40 ` [PATCH 2/3] sched/nohz: Update comments about NEWILB_KICK Joel Fernandes (Google)
2023-10-20  7:51   ` Ingo Molnar
2023-10-20  8:02   ` [tip: sched/core] " tip-bot2 for Joel Fernandes (Google)
2023-10-20  1:40 ` [PATCH 3/3] sched: Update ->next_balance correctly during newidle balance Joel Fernandes (Google)
2023-10-20  7:53   ` Ingo Molnar
2023-10-20 13:48     ` Vincent Guittot
2023-10-21  6:50       ` Ingo Molnar
2023-10-20  8:02   ` [tip: sched/core] sched/fair: " tip-bot2 for Vineeth Pillai (Google)
2023-10-20 13:40   ` [PATCH 3/3] sched: " Vincent Guittot
2023-10-20 13:56     ` Ingo Molnar
2023-10-20 15:50       ` Joel Fernandes
2023-10-22  0:28     ` Joel Fernandes
2023-10-26 14:23       ` Vincent Guittot
2023-11-09 10:02         ` Joel Fernandes
2023-11-09 12:31           ` Joel Fernandes
2023-11-14 15:43           ` Vincent Guittot
2023-11-14 17:43             ` Joel Fernandes
2024-06-13  7:13             ` Joel Fernandes [this message]
2025-11-25 13:16               ` [PATCH] sched/fair: Only increment deadline once on yield Wang Tao
2025-11-25 13:27               ` [PATCH 3/3] sched: Update ->next_balance correctly during newidle balance Wang Tao
2023-10-20  4:17 ` [PATCH 1/3] sched/nohz: Update nohz.next_balance directly without IPIs (v2) Joel Fernandes

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20240613071342.GA1810503@google.com \
    --to=joel@joelfernandes.org \
    --cc=bristot@redhat.com \
    --cc=bsegall@google.com \
    --cc=dietmar.eggemann@arm.com \
    --cc=frederic@kernel.org \
    --cc=juri.lelli@redhat.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mgorman@suse.de \
    --cc=mingo@redhat.com \
    --cc=paulmck@kernel.org \
    --cc=peterz@infradead.org \
    --cc=rostedt@goodmis.org \
    --cc=suleiman@google.com \
    --cc=vincent.guittot@linaro.org \
    --cc=vineeth@bitbyteword.org \
    --cc=vschneid@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.