From: Glauber Costa <glommer@parallels.com>
To: linux-kernel@vger.kernel.org
Cc: cgroups@vger.kernel.org, devel@openvz.org,
Paul Turner <pjt@google.com>,
Peter Zijlstra <a.p.zijlstra@chello.nl>,
Tejun Heo <tj@kernel.org>,
"Eric W. Biederman" <ebiederm@xmission.com>,
handai.szj@gmail.com, Andrew.Phillips@lmax.com,
Serge Hallyn <serge.hallyn@canonical.com>,
Glauber Costa <glommer@parallels.com>
Subject: [PATCH v3 4/6] add a new scheduler hook for context switch
Date: Wed, 30 May 2012 13:48:35 +0400
Message-ID: <1338371317-5980-5-git-send-email-glommer@parallels.com>
In-Reply-To: <1338371317-5980-1-git-send-email-glommer@parallels.com>
To be able to count the number of context switches per cgroup, and
following Paul's suggestion that it would be better to do it in fair.c
and rt.c, I am introducing a new write-side walk through a new scheduler
hook, called at every context switch away from a task (prev).
The read side is greatly simplified, and as I will show, the performance
impact does not seem huge. First, aside from the function call, this walk
is O(depth), and the depth is not likely to be large (if it is, the
performance impact is indeed bad - but one can argue that is a fair price
for deep nesting).
Also, this walk is likely to be cache-hot, since it is essentially the
same loop done by put_prev_task, except that it iterates depth - 1 times
instead of depth. This is especially important so as not to hurt tasks in
the root cgroup, which pay only a branch.
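To make the cost of the write-side walk concrete, here is a rough
user-space sketch of the same depth - 1 parent walk (hypothetical struct
and field names - this only models the shape of the kernel code, it is
not the kernel code):

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of the per-group runqueue counter added by the patch. */
struct group_rq {
	unsigned long long nr_switches;
};

/* Toy model of a scheduling entity: tasks and group entities form a
 * parent chain, and a group entity owns one group runqueue (tasks own
 * none, so my_rq is NULL for them). */
struct entity {
	struct entity *parent;	/* NULL for an entity in the root group */
	struct group_rq *my_rq;	/* runqueue owned by this group entity */
};

/* Mirrors the shape of context_switch_fair(): walk up the parent links
 * and bump each ancestor group's counter.  A task in the root cgroup
 * pays only the NULL check on its parent pointer. */
static void count_switch(struct entity *task_se)
{
	struct entity *se = task_se;

	while (se->parent) {
		se = se->parent;
		se->my_rq->nr_switches++;
	}
}
```

For a task nested two groups deep the loop runs twice; for a root task it
runs zero times, which is the "just a branch" case mentioned above.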
I am introducing a new hook because the existing ones did not seem
appropriate. The main candidates would be put_prev_task and
pick_next_task.
With pick_next_task, there are two main problems:
1) The loop is only cache-hot in pick_next_task if the next task actually
belongs to the same class and group as prev. Depending on the workload,
that may well be unlikely.
2) It is vulnerable to accounting errors when switching to idle.
Consider two groups A and B with the following pattern: A switches to
idle, but after that, B is scheduled. B, on the other hand, always gets
the CPU back when it yields to idle. This means that the to-idle
transition in A is never recorded.
Because of that, I believe prev is the right place to hook in.
However, put_prev_task is called many times from multiple places,
and the logic needed to distinguish a real context switch from other
kinds of put would make a mess of it.
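The dispatch of the new hook itself is cheap: a class that does not
provide it only costs a NULL test. A user-space sketch of that pattern
(hypothetical names, modeling sched_class_context_switch() in core.c):

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of a sched_class with an optional hook: the function
 * pointer may be left NULL by classes that do not need it. */
struct fake_class {
	void (*context_switch)(int *counter);
};

static void bump(int *counter)
{
	(*counter)++;
}

static const struct fake_class with_hook = { .context_switch = bump };
static const struct fake_class without_hook = { .context_switch = NULL };

/* Mirrors sched_class_context_switch(): call the hook only when the
 * class actually set it, so non-users pay just the branch. */
static void dispatch(const struct fake_class *class, int *counter)
{
	if (class->context_switch)
		class->context_switch(counter);
}
```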
On a 4-way x86_64, hackbench -pipe 1 process 4000 shows the following results:
- units are seconds to complete the whole benchmark
- stdev is given as a percentage for easier assessment
Task sitting in the root cgroup:
Without patchset: 4.857700 (0.69 %)
With patchset: 4.863700 (0.63 %)
Difference : 0.12 %
Task sitting in a 3-level depth cgroup:
Without patchset: 5.120867 (1.60 %)
With patchset: 5.113800 (0.41 %)
Difference : 0.13 %
Task sitting in a 30-level depth cgroup (totally crazy scenario):
Without patchset: 8.829385 (2.63 %)
With patchset: 9.975467 (2.80 %)
Difference : 12 %
For any sane use case, the user is unlikely to nest much deeper than
3 levels. For that case - the one that matters to most people - the
difference is within the standard deviation and can be considered
negligible.
Although the patch does add a penalty, it does so only in deeply nested
scenarios (and those already pay a 100 % penalty relative to no nesting
even without the patchset!)
I hope this approach is acceptable.
Signed-off-by: Glauber Costa <glommer@parallels.com>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
CC: Paul Turner <pjt@google.com>
---
include/linux/sched.h | 1 +
kernel/sched/core.c | 9 +++++++++
kernel/sched/fair.c | 15 +++++++++++++++
kernel/sched/rt.c | 15 +++++++++++++++
kernel/sched/sched.h | 3 +++
5 files changed, 43 insertions(+)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index f45c0b2..d28d6ec 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1084,6 +1084,7 @@ struct sched_class {
struct task_struct * (*pick_next_task) (struct rq *rq);
void (*put_prev_task) (struct rq *rq, struct task_struct *p);
+ void (*context_switch) (struct rq *rq, struct task_struct *p);
#ifdef CONFIG_SMP
int (*select_task_rq)(struct task_struct *p, int sd_flag, int flags);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 4c1d7e9..db4f2c3 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1894,6 +1894,14 @@ fire_sched_out_preempt_notifiers(struct task_struct *curr,
#endif /* CONFIG_PREEMPT_NOTIFIERS */
+static void sched_class_context_switch(struct rq *rq, struct task_struct *prev)
+{
+#if defined(CONFIG_FAIR_GROUP_SCHED) || defined(CONFIG_RT_GROUP_SCHED)
+ if (prev->sched_class->context_switch)
+ prev->sched_class->context_switch(rq, prev);
+#endif
+}
+
/**
* prepare_task_switch - prepare to switch tasks
* @rq: the runqueue preparing to switch
@@ -1911,6 +1919,7 @@ static inline void
prepare_task_switch(struct rq *rq, struct task_struct *prev,
struct task_struct *next)
{
+ sched_class_context_switch(rq, prev);
sched_info_switch(prev, next);
perf_event_task_sched_out(prev, next);
fire_sched_out_preempt_notifiers(prev, next);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 940e6d1..c26fe38 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2993,6 +2993,20 @@ static struct task_struct *pick_next_task_fair(struct rq *rq)
return p;
}
+static void context_switch_fair(struct rq *rq, struct task_struct *p)
+{
+#ifdef CONFIG_FAIR_GROUP_SCHED
+ struct cfs_rq *cfs_rq;
+ struct sched_entity *se = &p->se;
+
+ while (se->parent) {
+ se = se->parent;
+ cfs_rq = group_cfs_rq(se);
+ cfs_rq->nr_switches++;
+ }
+#endif
+}
+
/*
* Account for a descheduled task:
*/
@@ -5255,6 +5269,7 @@ const struct sched_class fair_sched_class = {
.check_preempt_curr = check_preempt_wakeup,
.pick_next_task = pick_next_task_fair,
+ .context_switch = context_switch_fair,
.put_prev_task = put_prev_task_fair,
#ifdef CONFIG_SMP
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 30ee4e2..6f416e4 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1392,6 +1392,20 @@ static struct task_struct *pick_next_task_rt(struct rq *rq)
return p;
}
+static void context_switch_rt(struct rq *rq, struct task_struct *p)
+{
+#ifdef CONFIG_RT_GROUP_SCHED
+ struct sched_rt_entity *rt_se = &p->rt;
+ struct rt_rq *rt_rq;
+
+ while (rt_se->parent) {
+ rt_se = rt_se->parent;
+ rt_rq = group_rt_rq(rt_se);
+ rt_rq->rt_nr_switches++;
+ }
+#endif
+}
+
static void put_prev_task_rt(struct rq *rq, struct task_struct *p)
{
update_curr_rt(rq);
@@ -2040,6 +2054,7 @@ const struct sched_class rt_sched_class = {
.check_preempt_curr = check_preempt_curr_rt,
.pick_next_task = pick_next_task_rt,
+ .context_switch = context_switch_rt,
.put_prev_task = put_prev_task_rt,
#ifdef CONFIG_SMP
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index cd2f1e1..76f6839 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -237,6 +237,7 @@ struct cfs_rq {
struct list_head leaf_cfs_rq_list;
struct task_group *tg; /* group that "owns" this runqueue */
+ u64 nr_switches;
#ifdef CONFIG_SMP
/*
* h_load = weight * f(tg)
@@ -307,6 +308,8 @@ struct rt_rq {
struct rq *rq;
struct list_head leaf_rt_rq_list;
struct task_group *tg;
+
+ u64 rt_nr_switches;
#endif
};
--
1.7.10.2