* [tip:sched/urgent] sched, cgroup: Reduce rq->lock hold times for large cgroup hierarchies
@ 2012-08-13 16:49 tip-bot for Peter Zijlstra
From: tip-bot for Peter Zijlstra @ 2012-08-13 16:49 UTC
To: linux-tip-commits
Cc: linux-kernel, hpa, mingo, a.p.zijlstra, efault, pportant, tglx,
lwoodman
Commit-ID: a35b6466aabb051568b844e8c63f87a356d3d129
Gitweb: http://git.kernel.org/tip/a35b6466aabb051568b844e8c63f87a356d3d129
Author: Peter Zijlstra <a.p.zijlstra@chello.nl>
AuthorDate: Wed, 8 Aug 2012 21:46:40 +0200
Committer: Thomas Gleixner <tglx@linutronix.de>
CommitDate: Mon, 13 Aug 2012 18:41:54 +0200
sched, cgroup: Reduce rq->lock hold times for large cgroup hierarchies
Peter Portante reported that for large cgroup hierarchies (and/or
large CPU counts) we get immense lock contention on rq->lock and stuff
stops working properly.
His workload was a ton of processes, each in their own cgroup,
everybody idling except for a sporadic wakeup once every so often.
It was found that:
  schedule()
    idle_balance()
      load_balance()
        local_irq_save()
        double_rq_lock()
        update_h_load()
          walk_tg_tree(tg_load_down)
            tg_load_down()
Results in an entire cgroup hierarchy walk under rq->lock for every
new-idle balance and since new-idle balance isn't throttled this
results in a lot of work while holding the rq->lock.
This patch does two things. First, it moves the work out from under
rq->lock, based on the good principle of race and pray which is widely
employed in the load-balancer as a whole. Second, it throttles the
update_h_load() calculation to at most once per jiffy per runqueue.
I considered excluding update_h_load() for new-idle balance
altogether, but purely relying on regular balance passes to update
this data might not work out under some rare circumstances where the
new-idle busiest isn't the regular busiest for a while (unlikely, but
a nightmare to debug if someone hits it and suffers).
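The once-per-jiffy throttle described above can be sketched in plain
userspace C. This is only an illustration under stated assumptions, not
the kernel code: struct fake_rq, expensive_walk() and
update_h_load_sketch() are hypothetical names, the tick value is passed
in explicitly instead of being read from jiffies, and the walk is
reduced to a counter.

```c
#include <stdbool.h>

/* Hypothetical stand-in for struct rq; only the throttle stamp matters here. */
struct fake_rq {
	unsigned long h_load_throttle;	/* tick at which the walk last ran */
	unsigned long walks;		/* number of walks actually performed */
};

/* Hypothetical stand-in for walk_tg_tree(tg_load_down, ...). */
static void expensive_walk(struct fake_rq *rq)
{
	rq->walks++;
}

/*
 * Run the expensive walk at most once per tick for this runqueue.
 * Returns true if the walk ran, false if it was throttled.
 */
static bool update_h_load_sketch(struct fake_rq *rq, unsigned long now)
{
	if (rq->h_load_throttle == now)
		return false;		/* already updated during this tick */

	rq->h_load_throttle = now;
	expensive_walk(rq);
	return true;
}
```

Calling update_h_load_sketch() several times with the same tick value
performs the walk only on the first call. Note that in this sketch a
stamp that happens to equal the current tick before any walk has run
(for example a zero-initialised stamp at tick 0) suppresses that first
walk.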
Cc: pjt@google.com
Cc: Larry Woodman <lwoodman@redhat.com>
Cc: Mike Galbraith <efault@gmx.de>
Reported-by: Peter Portante <pportant@redhat.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/n/tip-aaarrzfpnaam7pqrekofu8a6@git.kernel.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
kernel/sched/fair.c | 11 +++++++++--
kernel/sched/sched.h | 6 +++++-
2 files changed, 14 insertions(+), 3 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d0cc03b..c219bf8 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3387,6 +3387,14 @@ static int tg_load_down(struct task_group *tg, void *data)
 
 static void update_h_load(long cpu)
 {
+	struct rq *rq = cpu_rq(cpu);
+	unsigned long now = jiffies;
+
+	if (rq->h_load_throttle == now)
+		return;
+
+	rq->h_load_throttle = now;
+
 	rcu_read_lock();
 	walk_tg_tree(tg_load_down, tg_nop, (void *)cpu);
 	rcu_read_unlock();
@@ -4293,11 +4301,10 @@ redo:
 		env.src_rq = busiest;
 		env.loop_max = min(sysctl_sched_nr_migrate, busiest->nr_running);
 
+		update_h_load(env.src_cpu);
 more_balance:
 		local_irq_save(flags);
 		double_rq_lock(this_rq, busiest);
-		if (!env.loop)
-			update_h_load(env.src_cpu);
 
 		/*
 		 * cur_ld_moved - load moved in current iteration
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index c35a1a7..531411b 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -374,7 +374,11 @@ struct rq {
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	/* list of leaf cfs_rq on this cpu: */
 	struct list_head leaf_cfs_rq_list;
-#endif
+#ifdef CONFIG_SMP
+	unsigned long h_load_throttle;
+#endif /* CONFIG_SMP */
+#endif /* CONFIG_FAIR_GROUP_SCHED */
+
 #ifdef CONFIG_RT_GROUP_SCHED
 	struct list_head leaf_rt_rq_list;
 #endif