From: Vincent Guittot <vincent.guittot@linaro.org>
To: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Joel Fernandes <joel@joelfernandes.org>,
	linux-kernel <linux-kernel@vger.kernel.org>,
	Paul McKenney <paulmck@kernel.org>,
	Frederic Weisbecker <fweisbec@gmail.com>,
	Dietmar Eggeman <dietmar.eggemann@arm.com>,
	Qais Yousef <qais.yousef@arm.com>,
	Ben Segall <bsegall@google.com>,
	Daniel Bristot de Oliveira <bristot@redhat.com>,
	Ingo Molnar <mingo@redhat.com>,
	Juri Lelli <juri.lelli@redhat.com>, Mel Gorman <mgorman@suse.de>,
	Peter Zijlstra <peterz@infradead.org>,
	Steven Rostedt <rostedt@goodmis.org>,
	"Uladzislau Rezki (Sony)" <urezki@gmail.com>,
	Neeraj upadhyay <neeraj.iitr10@gmail.com>,
	Aubrey Li <aubrey.li@linux.intel.com>
Subject: Re: [PATCH] sched/fair: Rate limit calls to update_blocked_averages() for NOHZ
Date: Wed, 24 Mar 2021 14:44:37 +0100
Message-ID: <20210324134437.GA17675@vingu-book>
In-Reply-To: <274d8ae5-8f4d-7662-0e04-2fbc92b416fc@linux.intel.com>

Hi Tim,

On Tuesday 23 March 2021 at 14:37:59 (-0700), Tim Chen wrote:
> 
> 
> On 1/29/21 9:27 AM, Vincent Guittot wrote:
> > 
> > The patch below moves the update of the blocked load of CPUs outside newidle_balance().
> 
> On a well-known database workload, we also saw a lot of overhead from update_blocked_averages()
> in newidle_balance(), so changes that reduce this overhead are very welcome.
> 
> Turning on cgroup induces a 9% throughput degradation on a 2-socket Icelake system with 40 cores per socket.
> 
> A big part of the overhead in our database workload comes from updating
> blocked averages in newidle_balance, caused by I/O threads making
> some CPUs go in and out of idle frequently in the following code path:
> 
> ----__blkdev_direct_IO_simple
>           |          
>           |----io_schedule_timeout
>           |          |          
>           |           ----schedule_timeout
>           |                     |          
>           |                      ----schedule
>           |                                |          
>           |                                 ----__schedule
>           |                                           |          
>           |                                            ----pick_next_task_fair
>           |                                                      |          
>           |                                                       ----newidle_balance
>           |                                                                 |          
>                                                                              ----update_blocked_averages
> 
> We found that update_blocked_averages() became the top CPU consumer, eating up 2% of the CPU cycles once cgroup
> is turned on.
> 
> I hacked up Joel's original patch to rate limit the update of blocked
> averages called from newidle_balance().  The 9% throughput degradation was reduced to
> 5.4%.  We'll be testing Vincent's change to see if it can give
> a similar performance improvement.
> 
> In our test environment, though, sysctl_sched_migration_cost was kept
> much lower (25000) than the default (500000) to encourage migrations to idle CPUs
> and reduce latency.  As a result, newidle_balance() quite often called update_blocked_averages()
> directly and then tried to load balance, instead of relegating
> the responsibility to the idle load balancer.  (See the code snippet from newidle_balance() below.)
> 
> 
>         if (this_rq->avg_idle < sysctl_sched_migration_cost ||       <-----sched_migration_cost check
>             !READ_ONCE(this_rq->rd->overload)) {
> 
>                 rcu_read_lock();
>                 sd = rcu_dereference_check_sched_domain(this_rq->sd);
>                 if (sd)
>                         update_next_balance(sd, &next_balance);
>                 rcu_read_unlock();
> 
>                 goto out;  <--- invoke idle load balancer
>         }
> 
>         raw_spin_unlock(&this_rq->lock);
> 
>         update_blocked_averages(this_cpu);
> 
> 	.... followed by load balance code ---
>
>
> So the update_blocked_averages offload to idle_load_balancer in Vincent's patch is less 
> effective in this case with small sched_migration_cost.
> 
> Looking at the code a bit more, we don't actually load balance every time in this code path
> unless our avg_idle time exceeds some threshold.  Doing update_blocked_averages immediately 

IIUC your problem, we call update_blocked_averages() but because of:

		if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost) {
			update_next_balance(sd, &next_balance);
			break;
		}

the for_each_domain loop stops even before running load_balance on the 1st
sched domain level which means that update_blocked_averages() was called
unnecessarily. 

This is even more likely with a small sysctl_sched_migration_cost, which allows newly
idle LB for a very small this_rq->avg_idle. One could wonder why you set
sysctl_sched_migration_cost so low, below the max_newidle_lb_cost of the
smallest domain, but that's probably because of task_hot().
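
To make the wasted call concrete, here is a small user-space model of the
ordering described above. It is only a sketch, not kernel code: struct model_sd,
struct model_stats and model_newidle() are hypothetical names, every attempt is
assumed to cost a fixed lb_cost, and the real loop has more exit conditions
(for example, stopping once a task has been pulled).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the one sched_domain field used here. */
struct model_sd {
	uint64_t max_newidle_lb_cost;	/* ns */
};

struct model_stats {
	unsigned int blocked_updates;	/* update_blocked_averages() calls */
	unsigned int balance_attempts;	/* load_balance() calls */
};

/*
 * Model of the current ordering: the blocked averages are updated first,
 * then the domain loop may break before any balancing is attempted.
 */
static struct model_stats model_newidle(uint64_t avg_idle,
					const struct model_sd *sds, size_t nr,
					uint64_t lb_cost)
{
	struct model_stats st = { 0, 0 };
	uint64_t curr_cost = 0;

	assert(sds != NULL || nr == 0);

	st.blocked_updates++;		/* unconditional update */

	for (size_t i = 0; i < nr; i++) {
		/* Not enough expected idle time for this level: stop. */
		if (avg_idle < curr_cost + sds[i].max_newidle_lb_cost)
			break;

		st.balance_attempts++;
		curr_cost += lb_cost;	/* assumed fixed cost per attempt */
	}

	return st;
}
```

With an assumed first-level max_newidle_lb_cost of 50000 and avg_idle = 30000
(above a migration cost of 25000), the model records one blocked update and
zero balance attempts, which is the unnecessary call in question.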

If avg_idle is lower than the sd->max_newidle_lb_cost of the 1st sched_domain, we should
skip the spin_unlock/lock and the for_each_domain() loop entirely.

Maybe something like below:


diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 76e33a70d575..08933e0d87ed 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10723,17 +10723,21 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
         */
        rq_unpin_lock(this_rq, rf);

+       rcu_read_lock();
+       sd = rcu_dereference_check_sched_domain(this_rq->sd);
+
        if (this_rq->avg_idle < sysctl_sched_migration_cost ||
-           !READ_ONCE(this_rq->rd->overload)) {
+           !READ_ONCE(this_rq->rd->overload) ||
+           (sd && this_rq->avg_idle < sd->max_newidle_lb_cost)) {

-               rcu_read_lock();
-               sd = rcu_dereference_check_sched_domain(this_rq->sd);
                if (sd)
                        update_next_balance(sd, &next_balance);
                rcu_read_unlock();

                goto out;
        }
+       rcu_read_unlock();
+

        raw_spin_unlock(&this_rq->lock);
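
The combined early-exit test from the diff above can be checked in isolation
with a user-space sketch. It is not kernel code: struct model_sd,
model_skip_newidle() and model_skip_selfcheck() are hypothetical names, the
parameters stand in for this_rq->avg_idle, sysctl_sched_migration_cost and
this_rq->rd->overload, and the 50000 ns domain cost is an assumed value.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the one sched_domain field used here. */
struct model_sd {
	uint64_t max_newidle_lb_cost;	/* ns */
};

/*
 * Returns true when newly idle balancing should be skipped before the
 * rq lock is dropped. The third clause is the addition: the expected
 * idle time does not even cover the cheapest domain's balance cost.
 */
static bool model_skip_newidle(uint64_t avg_idle, uint64_t migration_cost,
			       bool overload, const struct model_sd *sd)
{
	return avg_idle < migration_cost ||
	       !overload ||
	       (sd && avg_idle < sd->max_newidle_lb_cost);
}

/* Usage example with the small migration cost (25000) from this thread. */
static void model_skip_selfcheck(void)
{
	const struct model_sd sd = { .max_newidle_lb_cost = 50000 };

	/* 30000 passes the migration cost test but not the domain cost test. */
	assert(model_skip_newidle(30000, 25000, true, &sd));
	/* With no domain attached, the extra clause is inactive. */
	assert(!model_skip_newidle(30000, 25000, true, NULL));
	/* Enough idle time: go on to the domain loop. */
	assert(!model_skip_newidle(60000, 25000, true, &sd));
}
```

Calling model_skip_selfcheck() from a main() exercises the three cases.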


> is only needed if we do call load_balance().  If we don't do any load balance in the code path,
> we can let the idle load balancer update the blocked averages lazily.
> 
> Something like the following perhaps on top of Vincent's patch?  We haven't really tested
> this change yet but want to see if this change makes sense to you.
> 
> Tim
> 
> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
> ---
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 63950d80fd0b..b93f5f52658a 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -10591,6 +10591,7 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
>  	struct sched_domain *sd;
>  	int pulled_task = 0;
>  	u64 curr_cost = 0;
> +	bool updated_blocked_avg = false;
>  
>  	update_misfit_status(NULL, this_rq);
>  	/*
> @@ -10627,7 +10628,6 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
>  
>  	raw_spin_unlock(&this_rq->lock);
>  
> -	update_blocked_averages(this_cpu);
>  	rcu_read_lock();
>  	for_each_domain(this_cpu, sd) {
>  		int continue_balancing = 1;
> @@ -10639,6 +10639,11 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
>  		}
>  
>  		if (sd->flags & SD_BALANCE_NEWIDLE) {
> +			if (!updated_blocked_avg) {
> +				update_blocked_averages(this_cpu);
> +				updated_blocked_avg = true;
> +			}
> +
>  			t0 = sched_clock_cpu(this_cpu);
>  
>  			pulled_task = load_balance(this_cpu, this_rq,
>  
> 
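
For comparison, the lazy update in the quoted patch can be modelled in user
space as well. This is a sketch under assumptions, not kernel code:
struct lazy_sd and model_lazy_updates() are hypothetical names, the
balance_newidle flag stands in for SD_BALANCE_NEWIDLE, and each attempt is
assumed to cost a fixed lb_cost.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the sched_domain fields used here. */
struct lazy_sd {
	bool balance_newidle;		/* models SD_BALANCE_NEWIDLE */
	uint64_t max_newidle_lb_cost;	/* ns */
};

/*
 * Returns how many times the blocked averages would be updated when the
 * update is deferred until the first domain that actually balances.
 */
static unsigned int model_lazy_updates(uint64_t avg_idle,
				       const struct lazy_sd *sds, size_t nr,
				       uint64_t lb_cost)
{
	bool updated_blocked_avg = false;
	unsigned int updates = 0;
	uint64_t curr_cost = 0;

	assert(sds != NULL || nr == 0);

	for (size_t i = 0; i < nr; i++) {
		/* Not enough expected idle time for this level: stop. */
		if (avg_idle < curr_cost + sds[i].max_newidle_lb_cost)
			break;

		if (sds[i].balance_newidle) {
			if (!updated_blocked_avg) {
				updates++;	/* first balancing domain only */
				updated_blocked_avg = true;
			}
			curr_cost += lb_cost;	/* assumed fixed cost per attempt */
		}
	}

	return updates;
}
```

In this model the update count is 0 whenever the loop breaks at the first
level, and at most 1 otherwise.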
