From: Jason Low <jason.low2@hp.com>
To: Ingo Molnar <mingo@redhat.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Jason Low <jason.low2@hp.com>
Cc: LKML <linux-kernel@vger.kernel.org>,
	Mike Galbraith <efault@gmx.de>,
	Thomas Gleixner <tglx@linutronix.de>,
	Paul Turner <pjt@google.com>, Alex Shi <alex.shi@intel.com>,
	Preeti U Murthy <preeti@linux.vnet.ibm.com>,
	Vincent Guittot <vincent.guittot@linaro.org>,
	Morten Rasmussen <morten.rasmussen@arm.com>,
	Namhyung Kim <namhyung@kernel.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	Kees Cook <keescook@chromium.org>, Mel Gorman <mgorman@suse.de>,
	Rik van Riel <riel@redhat.com>,
	aswin@hp.com, scott.norton@hp.com, chegu_vinod@hp.com,
	Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Subject: [RFC PATCH v2] sched: Limit idle_balance()
Date: Fri, 19 Jul 2013 00:50:11 -0700	[thread overview]
Message-ID: <1374220211.5447.9.camel@j-VirtualBox> (raw)

When idle balance occurs frequently, functions such as load_balance(),
update_sd_lb_stats(), and tg_load_down(), along with the run queue lock
acquisitions they entail, can increase time spent in the kernel by quite a
bit. Beyond the kernel time spent directly in those functions, frequent idle
balancing can also indirectly increase the percentage of time spent acquiring
other kernel locks and reduce the percentage of time spent completing user
space functions. For example, when running fserver with a high number of
users on an 8-socket box with HT enabled, perf reported a lot of time spent
in __read_lock_failed(), __write_lock_failed(), update_sd_lb_stats(),
tg_load_down(), etc. Once the rate at which idle_balance occurs was reduced,
the percentage of time spent in those functions decreased, and the percentage
of time spent in user space functions such as add_long() increased.

Below is a sample perf report from running fserver with a high number of
users when idle_balance is allowed to run often:

    18.01%     reaim    [k] mspin_lock
    16.42%     reaim    [k] __read_lock_failed
    10.27%     reaim    [k] __write_lock_failed
     2.84%     reaim    [k] _raw_spin_lock
     2.42%     reaim    [k] mutex_spin_on_owner
     2.21%     reaim    [k] update_sd_lb_stats
     1.93%     reaim    [k] start_this_handle
     1.75%     reaim    [.] add_int
     1.74%     reaim    [.] add_long
     1.59%     reaim    [k] update_cfs_rq_blocked_load
     1.42%     reaim    [k] tg_load_down
     1.35%     swapper  [k] intel_idle
     1.22%     reaim    [.] div_long
     1.21%     reaim    [k] _raw_spin_lock_irqsave
     1.18%     reaim    [k] load_balance

Below is a perf report from the same workload after lowering the rate at
which idle_balance is attempted:

    14.84%     reaim    [k] mspin_lock
     7.03%     reaim    [.] add_int
     6.99%     reaim    [.] add_long
     4.49%     reaim    [.] div_long
     4.18%     reaim    [.] strncat
     3.98%     reaim    [.] string_rtns_1
     3.94%     reaim    [k] mutex_spin_on_owner
     3.27%     reaim    [.] add_short
     2.77%     reaim    [k] __read_lock_failed
     1.77%     swapper  [k] intel_idle
     1.69%     reaim    [.] msort_with_tmp
     1.69%     reaim    [.] div_short
     1.66%     reaim    [.] div_int
     1.49%     reaim    [.] _int_malloc
     1.23%     reaim    [k] __write_lock_failed

However, this does not necessarily mean we should remove idle balancing or
permanently lower the rate at which it is attempted. If a CPU tends to remain
idle much longer than the overhead of an idle balance attempt, then the
attempt is worthwhile and we should allow it.

This patch keeps track of each CPU's approximate average idle time, and of
the average cost of a load_balance() attempt in each sched domain. Idle
balancing within a domain is skipped if the estimated cost of balancing would
exceed N% of the approximate time the CPU remains idle. Currently, N is set
to 10, though I'm still searching for a more ideal way to compute this
threshold.
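
Both averages use the same cheap update rule, new = (old + sample) / 2, and
the skip decision compares 10% of the CPU's average idle time against the
accumulated balancing cost. Below is a minimal userspace sketch of that logic
(the numbers are made up and the names only mirror the patch; this is an
illustration, not the kernel code):

    #include <stdio.h>

    typedef unsigned long long u64;

    /*
     * The same cheap running average the patch uses for both
     * rq->idle_duration and sd->avg_idle_balance_cost.
     */
    static u64 update_running_avg(u64 avg, u64 sample)
    {
            return (avg + sample) / 2;
    }

    int main(void)
    {
            u64 idle_duration = 500000;   /* avg idle time of this CPU, ns (made up) */
            u64 avg_balance_cost = 40000; /* avg load_balance() cost here, ns (made up) */
            u64 cost = 20000;             /* cost already spent on lower domains, ns */

            /*
             * Skip this domain when balancing is expected to eat more than
             * 10% of the time the CPU would otherwise remain idle.
             */
            if (idle_duration / 10 < avg_balance_cost + cost)
                    printf("skip idle balance in this domain\n");
            else
                    printf("attempt idle balance\n");

            /* Fold the cost of a new attempt (say 35000 ns) into the average: */
            avg_balance_cost = update_running_avg(avg_balance_cost, 35000);
            printf("new avg cost: %llu ns\n", avg_balance_cost);
            return 0;
    }

With these example numbers, 10% of the idle time (50000 ns) is below the
estimated cost (60000 ns), so the domain walk stops early.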

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Suggested-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Jason Low <jason.low2@hp.com>
---
 arch/metag/include/asm/topology.h |    1 +
 include/linux/sched.h             |    1 +
 include/linux/topology.h          |    3 +++
 kernel/sched/core.c               |    3 +++
 kernel/sched/debug.c              |    1 +
 kernel/sched/fair.c               |   19 +++++++++++++++++++
 kernel/sched/sched.h              |    4 ++++
 7 files changed, 32 insertions(+), 0 deletions(-)

diff --git a/arch/metag/include/asm/topology.h b/arch/metag/include/asm/topology.h
index 23f5118..8f0ebeb 100644
--- a/arch/metag/include/asm/topology.h
+++ b/arch/metag/include/asm/topology.h
@@ -26,6 +26,7 @@
 	.last_balance		= jiffies,		\
 	.balance_interval	= 1,			\
 	.nr_balance_failed	= 0,			\
+	.avg_idle_balance_cost	= 0,			\
 }
 
 #define cpu_to_node(cpu)	((void)(cpu), 0)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 178a8d9..1eb94c2 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -809,6 +809,7 @@ struct sched_domain {
 	unsigned int wake_idx;
 	unsigned int forkexec_idx;
 	unsigned int smt_gain;
+	u64 avg_idle_balance_cost;
 
 	int nohz_idle;			/* NOHZ IDLE status */
 	int flags;			/* See SD_* */
diff --git a/include/linux/topology.h b/include/linux/topology.h
index d3cf0d6..59ee2e8 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -106,6 +106,7 @@ int arch_update_cpu_topology(void);
 	.last_balance		= jiffies,				\
 	.balance_interval	= 1,					\
 	.smt_gain		= 1178,	/* 15% */			\
+	.avg_idle_balance_cost	= 0,					\
 }
 #endif
 #endif /* CONFIG_SCHED_SMT */
@@ -135,6 +136,7 @@ int arch_update_cpu_topology(void);
 				,					\
 	.last_balance		= jiffies,				\
 	.balance_interval	= 1,					\
+	.avg_idle_balance_cost	= 0,					\
 }
 #endif
 #endif /* CONFIG_SCHED_MC */
@@ -166,6 +168,7 @@ int arch_update_cpu_topology(void);
 				,					\
 	.last_balance		= jiffies,				\
 	.balance_interval	= 1,					\
+	.avg_idle_balance_cost	= 0,					\
 }
 #endif
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index e8b3350..da2cb3e 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1348,6 +1348,8 @@ ttwu_do_wakeup(struct rq *rq, struct task_struct *p, int wake_flags)
 		else
 			update_avg(&rq->avg_idle, delta);
 		rq->idle_stamp = 0;
+
+		rq->idle_duration = (rq->idle_duration + delta) / 2;
 	}
 #endif
 }
@@ -7027,6 +7029,7 @@ void __init sched_init(void)
 		rq->online = 0;
 		rq->idle_stamp = 0;
 		rq->avg_idle = 2*sysctl_sched_migration_cost;
+		rq->idle_duration = 0;
 
 		INIT_LIST_HEAD(&rq->cfs_tasks);
 
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 75024a6..a3f062c 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -307,6 +307,7 @@ do {									\
 	P(sched_goidle);
 #ifdef CONFIG_SMP
 	P64(avg_idle);
+	P64(idle_duration);
 #endif
 
 	P(ttwu_count);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c61a614..da7ddd6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5240,6 +5240,8 @@ void idle_balance(int this_cpu, struct rq *this_rq)
 	struct sched_domain *sd;
 	int pulled_task = 0;
 	unsigned long next_balance = jiffies + HZ;
+	u64 cost = 0;
+	u64 idle_duration = this_rq->idle_duration;
 
 	this_rq->idle_stamp = this_rq->clock;
 
@@ -5256,14 +5258,31 @@ void idle_balance(int this_cpu, struct rq *this_rq)
 	for_each_domain(this_cpu, sd) {
 		unsigned long interval;
 		int balance = 1;
+		u64 this_domain_balance_cost = 0;
+		u64 start_time;
 
 		if (!(sd->flags & SD_LOAD_BALANCE))
 			continue;
 
+		/*
+		 * Stop searching if the time this_cpu is expected to remain idle is
+		 * not much higher than the cost of idle balancing in this domain.
+		 */
+		if (idle_duration / 10 < (sd->avg_idle_balance_cost + cost))
+			break;
+
 		if (sd->flags & SD_BALANCE_NEWIDLE) {
+			start_time = sched_clock_cpu(smp_processor_id());
+
 			/* If we've pulled tasks over stop searching: */
 			pulled_task = load_balance(this_cpu, this_rq,
 						   sd, CPU_NEWLY_IDLE, &balance);
+
+			this_domain_balance_cost = sched_clock_cpu(smp_processor_id()) - start_time;
+
+			sd->avg_idle_balance_cost =
+				(sd->avg_idle_balance_cost + this_domain_balance_cost) / 2;
+			cost += this_domain_balance_cost;
 		}
 
 		interval = msecs_to_jiffies(sd->balance_interval);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index ce39224..6a7e620 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -521,6 +521,10 @@ struct rq {
 #endif
 
 	struct sched_avg avg;
+
+#ifdef CONFIG_SMP
+	u64 idle_duration;
+#endif
 };
 
 static inline int cpu_of(struct rq *rq)
-- 
1.7.1



