public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
From: Tim Chen <tim.c.chen@linux.intel.com>
To: Vincent Guittot <vincent.guittot@linaro.org>,
	Joel Fernandes <joel@joelfernandes.org>
Cc: linux-kernel <linux-kernel@vger.kernel.org>,
	Paul McKenney <paulmck@kernel.org>,
	Frederic Weisbecker <fweisbec@gmail.com>,
	Dietmar Eggeman <dietmar.eggemann@arm.com>,
	Qais Yousef <qais.yousef@arm.com>,
	Ben Segall <bsegall@google.com>,
	Daniel Bristot de Oliveira <bristot@redhat.com>,
	Ingo Molnar <mingo@redhat.com>,
	Juri Lelli <juri.lelli@redhat.com>, Mel Gorman <mgorman@suse.de>,
	Peter Zijlstra <peterz@infradead.org>,
	Steven Rostedt <rostedt@goodmis.org>,
	"Uladzislau Rezki (Sony)" <urezki@gmail.com>,
	Neeraj upadhyay <neeraj.iitr10@gmail.com>,
	Aubrey Li <aubrey.li@linux.intel.com>
Subject: Re: [PATCH] sched/fair: Rate limit calls to update_blocked_averages() for NOHZ
Date: Tue, 23 Mar 2021 14:37:59 -0700	[thread overview]
Message-ID: <274d8ae5-8f4d-7662-0e04-2fbc92b416fc@linux.intel.com> (raw)
In-Reply-To: <20210129172727.GA30719@vingu-book>



On 1/29/21 9:27 AM, Vincent Guittot wrote:
> 
> The patch below moves the update of the blocked load of CPUs outside newidle_balance().

On a well known database workload, we also saw a lot of overhead to do update_blocked_averages
in newidle_balance().  So changes to reduce this overhead is much welcomed.

Turning on cgroup induces 9% throughput degradation on a 2 socket 40 cores per socket Icelake system.  

A big part of the overhead in our database workload comes from updating
blocked averages in newidle_balance, caused by I/O threads making
some CPUs go in and out of idle frequently in the following code path:

----__blkdev_direct_IO_simple
          |          
          |----io_schedule_timeout
          |          |          
          |           ----schedule_timeout
          |                     |          
          |                      ----schedule
          |                                |          
          |                                 ----__schedule
          |                                           |          
          |                                            ----pick_next_task_fair
          |                                                      |          
          |                                                       ----newidle_balance
          |                                                                 |          
                                                                             ----update_blocked_averages

We found update_blocked_averages() now consumed most CPU time, eating up 2% of the CPU cycles once cgroup
gets turned on.

I hacked up Joe's original patch to rate limit the update of blocked
averages called from newidle_balance().  The 9% throughput degradation reduced to
5.4%.  We'll be testing Vincent's change to see if it can give
similar performance improvement.

Though in our test environment, sysctl_sched_migration_cost was kept
much lower (25000) compared to the default (500000), to encourage migrations to idle cpu
and reduce latency.  We got quite a lot of calls to update_blocked_averages directly 
and then try to load_balance in newidle_balance instead of relegating
the responsibility to idle load balancer.  (See code snippet in newidle_balance below)  


        if (this_rq->avg_idle < sysctl_sched_migration_cost ||       <-----sched_migration_cost check
            !READ_ONCE(this_rq->rd->overload)) {

                rcu_read_lock();
                sd = rcu_dereference_check_sched_domain(this_rq->sd);
                if (sd)
                        update_next_balance(sd, &next_balance);
                rcu_read_unlock();

                goto out;  <--- invoke idle load balancer
        }

        raw_spin_unlock(&this_rq->lock);

        update_blocked_averages(this_cpu);

	.... followed by load balance code ---


So the update_blocked_averages offload to idle_load_balancer in Vincent's patch is less 
effective in this case with small sched_migration_cost.

Looking at the code a bit more, we don't actually load balance every time in this code path
unless our avg_idle time exceeds some threshold.  Doing update_blocked_averages immediately 
is only needed if we do call load_balance().  If we don't do any load balance in the code path,
we can let the idle load balancer update the blocked averages lazily.

Something like the following perhaps on top of Vincent's patch?  We haven't really tested
this change yet but want to see if this change makes sense to you.

Tim

Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 63950d80fd0b..b93f5f52658a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10591,6 +10591,7 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
 	struct sched_domain *sd;
 	int pulled_task = 0;
 	u64 curr_cost = 0;
+	bool updated_blocked_avg = false;
 
 	update_misfit_status(NULL, this_rq);
 	/*
@@ -10627,7 +10628,6 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
 
 	raw_spin_unlock(&this_rq->lock);
 
-	update_blocked_averages(this_cpu);
 	rcu_read_lock();
 	for_each_domain(this_cpu, sd) {
 		int continue_balancing = 1;
@@ -10639,6 +10639,11 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
 		}
 
 		if (sd->flags & SD_BALANCE_NEWIDLE) {
+			if (!updated_blocked_avg) {
+				update_blocked_averages(this_cpu);
+				updated_blocked_avg = true;
+			}
+
 			t0 = sched_clock_cpu(this_cpu);
 
 			pulled_task = load_balance(this_cpu, this_rq,
 


  parent reply	other threads:[~2021-03-23 21:38 UTC|newest]

Thread overview: 44+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-01-22 15:46 [PATCH] sched/fair: Rate limit calls to update_blocked_averages() for NOHZ Joel Fernandes (Google)
2021-01-22 16:56 ` Vincent Guittot
2021-01-22 18:39   ` Qais Yousef
2021-01-22 19:14     ` Joel Fernandes
2021-01-25 13:23     ` Vincent Guittot
2021-01-26 16:36       ` Qais Yousef
2021-01-22 19:10   ` Joel Fernandes
2021-01-25 10:44     ` Dietmar Eggemann
2021-01-25 17:30       ` Vincent Guittot
2021-01-25 17:53         ` Dietmar Eggemann
2021-01-25 14:42     ` Vincent Guittot
2021-01-27 18:43       ` Joel Fernandes
2021-01-28 13:57         ` Vincent Guittot
2021-01-28 15:09           ` Joel Fernandes
2021-01-28 16:57             ` Qais Yousef
     [not found]             ` <CAKfTPtBvwm9vZb5C=2oTF6N-Ht6Rvip4Lv18yi7O3G8e-_ZWdg@mail.gmail.com>
2021-01-29 17:27               ` Vincent Guittot
2021-02-03 11:54                 ` Dietmar Eggemann
2021-02-03 13:12                   ` Vincent Guittot
2021-02-04  9:47                     ` Dietmar Eggemann
2021-02-03 17:09                 ` Qais Yousef
2021-02-03 17:35                   ` Vincent Guittot
2021-02-04 10:45                     ` Qais Yousef
2021-02-03 19:56                 ` Joel Fernandes
2021-03-23 21:37                 ` Tim Chen [this message]
2021-03-24 13:44                   ` Vincent Guittot
2021-03-24 16:05                     ` Tim Chen
2021-04-07 14:02                       ` Vincent Guittot
2021-04-07 17:19                         ` Tim Chen
2021-04-08 14:51                           ` Vincent Guittot
2021-04-08 23:05                             ` Tim Chen
2021-04-09 15:26                               ` Vincent Guittot
2021-04-09 17:59                                 ` Tim Chen
2021-05-10 21:59                                   ` Tim Chen
2021-05-11 15:25                                     ` Vincent Guittot
2021-05-11 17:25                                       ` Tim Chen
2021-05-11 17:56                                         ` Vincent Guittot
2021-05-12 13:59                                         ` Qais Yousef
2021-05-13 18:45                                           ` Tim Chen
2021-05-17 16:14                                             ` Qais Yousef
2021-06-11 20:00                                           ` Tim Chen
2021-06-18 10:28                                             ` Vincent Guittot
2021-06-18 16:14                                               ` Tim Chen
2021-06-25  8:50                                                 ` Vincent Guittot
2021-02-01 15:13               ` Joel Fernandes

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=274d8ae5-8f4d-7662-0e04-2fbc92b416fc@linux.intel.com \
    --to=tim.c.chen@linux.intel.com \
    --cc=aubrey.li@linux.intel.com \
    --cc=bristot@redhat.com \
    --cc=bsegall@google.com \
    --cc=dietmar.eggemann@arm.com \
    --cc=fweisbec@gmail.com \
    --cc=joel@joelfernandes.org \
    --cc=juri.lelli@redhat.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mgorman@suse.de \
    --cc=mingo@redhat.com \
    --cc=neeraj.iitr10@gmail.com \
    --cc=paulmck@kernel.org \
    --cc=peterz@infradead.org \
    --cc=qais.yousef@arm.com \
    --cc=rostedt@goodmis.org \
    --cc=urezki@gmail.com \
    --cc=vincent.guittot@linaro.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox