[PATCH try 5] CFS: Add hierarchical tree-based penalty.

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

* [PATCH try 5] CFS: Add hierarchical tree-based penalty.
@ 2010-10-12  5:32 William Pitcock
  2010-10-12  9:30 ` Ingo Molnar
  0 siblings, 1 reply; 9+ messages in thread
From: William Pitcock @ 2010-10-12  5:32 UTC (permalink / raw)
  To: linux-kernel; +Cc: mingo, peterz, efault, kernel

Inspired by the recent change to BFS by Con Kolivas, this patch causes
vruntime to be penalized based on parent depth from their root task
group.

I have, for the moment, decided to make it a default feature since the
design of CFS ensures that broken applications depending on task enqueue
behaviour behaving traditionally will continue to work.

Signed-off-by: William Pitcock <nenolod@dereferenced.org>
---
 include/linux/sched.h   |    2 ++
 kernel/sched.c          |    4 ++++
 kernel/sched_fair.c     |    8 ++++++++
 kernel/sched_features.h |   12 ++++++++++++
 4 files changed, 26 insertions(+), 0 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 1e2a6db..3f0cff8 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1494,6 +1494,8 @@ struct task_struct {
 		unsigned long memsw_bytes; /* uncharged mem+swap usage */
 	} memcg_batch;
 #endif
+
+	int fork_depth;
 };
 
 /* Future-safe accessor for struct task_struct's cpus_allowed. */
diff --git a/kernel/sched.c b/kernel/sched.c
index dc85ceb..3fc3ebc 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -2621,6 +2621,10 @@ void wake_up_new_task(struct task_struct *p, unsigned long clone_flags)
 #endif
 
 	rq = task_rq_lock(p, &flags);
+
+	if (!(clone_flags & CLONE_THREAD))
+		p->fork_depth++;
+
 	activate_task(rq, p, 0);
 	trace_sched_wakeup_new(p, 1);
 	check_preempt_curr(rq, p, WF_FORK);
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index db3f674..a81acc5 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -737,6 +737,14 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
 	if (initial && sched_feat(START_DEBIT))
 		vruntime += sched_vslice(cfs_rq, se);
 
+	if (sched_feat(HIERARCHICAL_PENALTY) &&
+	    likely(entity_is_task(se))) {
+		struct task_struct *tsk = task_of(se);
+
+		if (tsk->fork_depth > 1)
+			vruntime *= tsk->fork_depth;
+	}
+
 	/* sleeps up to a single latency don't count. */
 	if (!initial) {
 		unsigned long thresh = sysctl_sched_latency;
diff --git a/kernel/sched_features.h b/kernel/sched_features.h
index 83c66e8..cf17f97 100644
--- a/kernel/sched_features.h
+++ b/kernel/sched_features.h
@@ -45,6 +45,18 @@ SCHED_FEAT(LAST_BUDDY, 1)
 SCHED_FEAT(CACHE_HOT_BUDDY, 1)
 
 /*
+ * Hierarchical tree-based penalty: penalize service deficit by
+ * an order of magnitude for each parent process in the process
+ * tree.  This has the natural effect of forcing preference towards
+ * processes that are not fork()-hungry, like make(1), which helps
+ * to preserve good latency.
+ *
+ * This also has the side effect of providing in a limited way,
+ * per-user CPU entitlement partitioning.
+ */
+SCHED_FEAT(HIERARCHICAL_PENALTY, 1)
+
+/*
  * Use arch dependent cpu power functions
  */
 SCHED_FEAT(ARCH_POWER, 0)
-- 
1.7.2.1


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH try 5] CFS: Add hierarchical tree-based penalty.
@ 2010-10-12  5:32 William Pitcock
  0 siblings, 0 replies; 9+ messages in thread
From: William Pitcock @ 2010-10-12  5:32 UTC (permalink / raw)
  To: linux-kernel; +Cc: mingo, peterz, efault, kernel

Inspired by the recent change to BFS by Con Kolivas, this patch causes
vruntime to be penalized based on parent depth from their root task
group.

I have, for the moment, decided to make it a default feature since the
design of CFS ensures that broken applications depending on task enqueue
behaviour behaving traditionally will continue to work.

Signed-off-by: William Pitcock <nenolod@dereferenced.org>
---
 include/linux/sched.h   |    2 ++
 kernel/sched.c          |    4 ++++
 kernel/sched_fair.c     |    8 ++++++++
 kernel/sched_features.h |   12 ++++++++++++
 4 files changed, 26 insertions(+), 0 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 1e2a6db..3f0cff8 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1494,6 +1494,8 @@ struct task_struct {
 		unsigned long memsw_bytes; /* uncharged mem+swap usage */
 	} memcg_batch;
 #endif
+
+	int fork_depth;
 };
 
 /* Future-safe accessor for struct task_struct's cpus_allowed. */
diff --git a/kernel/sched.c b/kernel/sched.c
index dc85ceb..3fc3ebc 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -2621,6 +2621,10 @@ void wake_up_new_task(struct task_struct *p, unsigned long clone_flags)
 #endif
 
 	rq = task_rq_lock(p, &flags);
+
+	if (!(clone_flags & CLONE_THREAD))
+		p->fork_depth++;
+
 	activate_task(rq, p, 0);
 	trace_sched_wakeup_new(p, 1);
 	check_preempt_curr(rq, p, WF_FORK);
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index db3f674..a81acc5 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -737,6 +737,14 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
 	if (initial && sched_feat(START_DEBIT))
 		vruntime += sched_vslice(cfs_rq, se);
 
+	if (sched_feat(HIERARCHICAL_PENALTY) &&
+	    likely(entity_is_task(se))) {
+		struct task_struct *tsk = task_of(se);
+
+		if (tsk->fork_depth > 1)
+			vruntime *= tsk->fork_depth;
+	}
+
 	/* sleeps up to a single latency don't count. */
 	if (!initial) {
 		unsigned long thresh = sysctl_sched_latency;
diff --git a/kernel/sched_features.h b/kernel/sched_features.h
index 83c66e8..cf17f97 100644
--- a/kernel/sched_features.h
+++ b/kernel/sched_features.h
@@ -45,6 +45,18 @@ SCHED_FEAT(LAST_BUDDY, 1)
 SCHED_FEAT(CACHE_HOT_BUDDY, 1)
 
 /*
+ * Hierarchical tree-based penalty: penalize service deficit by
+ * an order of magnitude for each parent process in the process
+ * tree.  This has the natural effect of forcing preference towards
+ * processes that are not fork()-hungry, like make(1), which helps
+ * to preserve good latency.
+ *
+ * This also has the side effect of providing in a limited way,
+ * per-user CPU entitlement partitioning.
+ */
+SCHED_FEAT(HIERARCHICAL_PENALTY, 1)
+
+/*
  * Use arch dependent cpu power functions
  */
 SCHED_FEAT(ARCH_POWER, 0)
-- 
1.7.2.1


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [PATCH try 5] CFS: Add hierarchical tree-based penalty.
  2010-10-12  5:32 [PATCH try 5] CFS: Add hierarchical tree-based penalty William Pitcock
@ 2010-10-12  9:30 ` Ingo Molnar
  2010-10-12  9:35   ` Peter Zijlstra
  2010-10-12  9:39   ` William Pitcock
  0 siblings, 2 replies; 9+ messages in thread
From: Ingo Molnar @ 2010-10-12  9:30 UTC (permalink / raw)
  To: William Pitcock; +Cc: linux-kernel, peterz, efault, kernel


* William Pitcock <nenolod@dereferenced.org> wrote:

> Inspired by the recent change to BFS by Con Kolivas, this patch causes 
> vruntime to be penalized based on parent depth from their root task 
> group.
> 
> I have, for the moment, decided to make it a default feature since the 
> design of CFS ensures that broken applications depending on task 
> enqueue behaviour behaving traditionally will continue to work.

Just curious, is this v5 submission a reply to Peter's earlier review of 
your v3 patch? If yes then please explicitly outline the changes you did 
so that Peter and others do not have to guess about the direction your 
work is taking.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH try 5] CFS: Add hierarchical tree-based penalty.
  2010-10-12  9:30 ` Ingo Molnar
@ 2010-10-12  9:35   ` Peter Zijlstra
  2010-10-12  9:39   ` William Pitcock
  1 sibling, 0 replies; 9+ messages in thread
From: Peter Zijlstra @ 2010-10-12  9:35 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: William Pitcock, linux-kernel, efault, kernel

On Tue, 2010-10-12 at 11:30 +0200, Ingo Molnar wrote:
> * William Pitcock <nenolod@dereferenced.org> wrote:
> 
> > Inspired by the recent change to BFS by Con Kolivas, this patch causes 
> > vruntime to be penalized based on parent depth from their root task 
> > group.
> > 
> > I have, for the moment, decided to make it a default feature since the 
> > design of CFS ensures that broken applications depending on task 
> > enqueue behaviour behaving traditionally will continue to work.
> 
> Just curious, is this v5 submission a reply to Peter's earlier review of 
> your v3 patch? If yes then please explicitly outline the changes you did 
> so that Peter and others do not have to guess about the direction your 
> work is taking.

Going by the date field in the headers he send them all in relatively
quick succession, they just took their merry time arriving (either that
or his machine's time is funkeh).



^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH try 5] CFS: Add hierarchical tree-based penalty.
  2010-10-12  9:30 ` Ingo Molnar
  2010-10-12  9:35   ` Peter Zijlstra
@ 2010-10-12  9:39   ` William Pitcock
  2010-10-12  9:47     ` Ingo Molnar
  1 sibling, 1 reply; 9+ messages in thread
From: William Pitcock @ 2010-10-12  9:39 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: linux-kernel, peterz, efault, kernel

Hi,

----- "Ingo Molnar" <mingo@elte.hu> wrote:

> * William Pitcock <nenolod@dereferenced.org> wrote:
> 
> > Inspired by the recent change to BFS by Con Kolivas, this patch
> causes 
> > vruntime to be penalized based on parent depth from their root task
> 
> > group.
> > 
> > I have, for the moment, decided to make it a default feature since
> the 
> > design of CFS ensures that broken applications depending on task 
> > enqueue behaviour behaving traditionally will continue to work.
> 
> Just curious, is this v5 submission a reply to Peter's earlier review
> of 
> your v3 patch? If yes then please explicitly outline the changes you
> did 
> so that Peter and others do not have to guess about the direction your
> 
> work is taking.

I just did that in the email I just sent.  Simply put, I was talking with
Con a few weeks ago about the concept of having a maximum amount of service
for all threads belonging to a process.  This did not work out so well, so
Con proposed penalizing based on fork depth, which still allows us to maintain
interactivity with make -j64 running in the background.

Actually, I lie: it works great for server scenarios where you have some
sysadmin also running azureus.  Azureus gets penalized instead, but other
apps like audacious get penalized too.

William

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH try 5] CFS: Add hierarchical tree-based penalty.
  2010-10-12  9:39   ` William Pitcock
@ 2010-10-12  9:47     ` Ingo Molnar
  2010-10-12  9:57       ` Con Kolivas
  0 siblings, 1 reply; 9+ messages in thread
From: Ingo Molnar @ 2010-10-12  9:47 UTC (permalink / raw)
  To: William Pitcock; +Cc: linux-kernel, peterz, efault, kernel


* William Pitcock <nenolod@dereferenced.org> wrote:

> Hi,
> 
> ----- "Ingo Molnar" <mingo@elte.hu> wrote:
> 
> > * William Pitcock <nenolod@dereferenced.org> wrote:
> > 
> > > Inspired by the recent change to BFS by Con Kolivas, this patch
> > causes 
> > > vruntime to be penalized based on parent depth from their root task
> > 
> > > group.
> > > 
> > > I have, for the moment, decided to make it a default feature since
> > the 
> > > design of CFS ensures that broken applications depending on task 
> > > enqueue behaviour behaving traditionally will continue to work.
> > 
> > Just curious, is this v5 submission a reply to Peter's earlier review
> > of 
> > your v3 patch? If yes then please explicitly outline the changes you
> > did 
> > so that Peter and others do not have to guess about the direction your
> > 
> > work is taking.
> 
> I just did that in the email I just sent.  Simply put, I was talking 
> with Con a few weeks ago about the concept of having a maximum amount 
> of service for all threads belonging to a process.  This did not work 
> out so well, so Con proposed penalizing based on fork depth, which 
> still allows us to maintain interactivity with make -j64 running in 
> the background.
> 
> Actually, I lie: it works great for server scenarios where you have 
> some sysadmin also running azureus.  Azureus gets penalized instead, 
> but other apps like audacious get penalized too.

Thanks for the explanation!

	Ingo

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH try 5] CFS: Add hierarchical tree-based penalty.
  2010-10-12  9:47     ` Ingo Molnar
@ 2010-10-12  9:57       ` Con Kolivas
  2010-10-12 10:16         ` Ingo Molnar
  0 siblings, 1 reply; 9+ messages in thread
From: Con Kolivas @ 2010-10-12  9:57 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: William Pitcock, linux-kernel, peterz, efault

On Tue, 12 Oct 2010 20:47:35 Ingo Molnar wrote:
> * William Pitcock <nenolod@dereferenced.org> wrote:
> > Hi,
> > 
> > ----- "Ingo Molnar" <mingo@elte.hu> wrote:
> > > * William Pitcock <nenolod@dereferenced.org> wrote:
> > > > Inspired by the recent change to BFS by Con Kolivas, this patch
> > > 
> > > causes
> > > 
> > > > vruntime to be penalized based on parent depth from their root task
> > > > 
> > > > group.
> > > > 
> > > > I have, for the moment, decided to make it a default feature since
> > > 
> > > the
> > > 
> > > > design of CFS ensures that broken applications depending on task
> > > > enqueue behaviour behaving traditionally will continue to work.
> > > 
> > > Just curious, is this v5 submission a reply to Peter's earlier review
> > > of
> > > your v3 patch? If yes then please explicitly outline the changes you
> > > did
> > > so that Peter and others do not have to guess about the direction your
> > > 
> > > work is taking.
> > 
> > I just did that in the email I just sent.  Simply put, I was talking
> > with Con a few weeks ago about the concept of having a maximum amount
> > of service for all threads belonging to a process.  This did not work
> > out so well, so Con proposed penalizing based on fork depth, which
> > still allows us to maintain interactivity with make -j64 running in
> > the background.
> > 
> > Actually, I lie: it works great for server scenarios where you have
> > some sysadmin also running azureus.  Azureus gets penalized instead,
> > but other apps like audacious get penalized too.
> 
> Thanks for the explanation!
> 
> 	Ingo

It's a fun feature I've been playing with that was going to make it into the 
next -ck, albeit disabled by default. Here's what the patch changelog was 
going to say:

---
Make it possible to have interactivity and responsiveness at very high load
levels by having a hierarchical tree based penalty. This is achieved by
making deadlines offset by the fork depth from init. This has a similar effect
to 'nice'ing loads that are fork heavy (such as 'make'), and biases CPU and
latency towards threaded desktop applications.

When a new process is forked, its fork depth is inherited from its parent
across fork() and then is incremented by one. That fork_depth is then used
to cause a relative offset of its deadline. Threads keep the same fork_depth
as their parent process as these tend to belong to threaded desktop apps.

Using a dual core machine as an example, and running the "browser benchmark"
at http://service.futuremark.com/peacekeeper/index.action shows the effect
this patch has.

The benchmark runs a number of different browser based workloads, and gives
a score in points, where higher is better.

Running the benchmark under various different loads with the feature enabled/
disabled:

Load            Disabled        Enabled
None            2437            2437
make -j2        1642            2293
make -j24       208             2187
make -j42       failed          1626

As can be seen, on the dual core machine, a load of 2 makes the benchmark run
almost precisely 1/3 slower as would be expected with BFS' fair CPU
distribution of 3 processes between 2 CPUs. Enabling this feature makes this
benchmark progress almost unaffected at this load, and only once the load is
more than 20 times higher does it hinder the benchmark to the same degree.

Other side effects of this patch are that it weakly partitions CPU entitlement
to different users, and provides some protection against fork bombs.

Note that this drastically affects CPU distribution,  No assumption as to CPU
distribution should be made based on past behaviour. It can be difficult to
apportion a lot of CPU to a fork heavy workload with this enabled, and the
effects of 'nice' are compounded.

Unlike other approaches to improving latency under load of smaller timeslices,
enabling this feature has no detrimental effect on throughput under load.

This feature is disabled in this patch by default as it may lead to unexpected
changes in CPU distribution and there may be real world regressions.

There is a sysctl to enable/disable this feature in
/proc/sys/kernel/fork_depth_penalty

-- 
-ck

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH try 5] CFS: Add hierarchical tree-based penalty.
  2010-10-12  9:57       ` Con Kolivas
@ 2010-10-12 10:16         ` Ingo Molnar
  2010-10-12 10:26           ` Con Kolivas
  0 siblings, 1 reply; 9+ messages in thread
From: Ingo Molnar @ 2010-10-12 10:16 UTC (permalink / raw)
  To: Con Kolivas; +Cc: William Pitcock, linux-kernel, peterz, efault


* Con Kolivas <kernel@kolivas.org> wrote:

> On Tue, 12 Oct 2010 20:47:35 Ingo Molnar wrote:
> > * William Pitcock <nenolod@dereferenced.org> wrote:
> > > Hi,
> > > 
> > > ----- "Ingo Molnar" <mingo@elte.hu> wrote:
> > > > * William Pitcock <nenolod@dereferenced.org> wrote:
> > > > > Inspired by the recent change to BFS by Con Kolivas, this patch
> > > > 
> > > > causes
> > > > 
> > > > > vruntime to be penalized based on parent depth from their root task
> > > > > 
> > > > > group.
> > > > > 
> > > > > I have, for the moment, decided to make it a default feature since
> > > > 
> > > > the
> > > > 
> > > > > design of CFS ensures that broken applications depending on task
> > > > > enqueue behaviour behaving traditionally will continue to work.
> > > > 
> > > > Just curious, is this v5 submission a reply to Peter's earlier review
> > > > of
> > > > your v3 patch? If yes then please explicitly outline the changes you
> > > > did
> > > > so that Peter and others do not have to guess about the direction your
> > > > 
> > > > work is taking.
> > > 
> > > I just did that in the email I just sent.  Simply put, I was talking
> > > with Con a few weeks ago about the concept of having a maximum amount
> > > of service for all threads belonging to a process.  This did not work
> > > out so well, so Con proposed penalizing based on fork depth, which
> > > still allows us to maintain interactivity with make -j64 running in
> > > the background.
> > > 
> > > Actually, I lie: it works great for server scenarios where you have
> > > some sysadmin also running azureus.  Azureus gets penalized instead,
> > > but other apps like audacious get penalized too.
> > 
> > Thanks for the explanation!
> > 
> > 	Ingo
> 
> It's a fun feature I've been playing with that was going to make it into the 
> next -ck, albeit disabled by default. Here's what the patch changelog was 
> going to say:

Find below the reply Peter sent to William's v5 patch. I suspect there 
will be a v6 to address those problems :)

(William: please Cc: Con too to future updates of your patch.)

Thanks,

	Ingo

----- Forwarded message from Peter Zijlstra <peterz@infradead.org> -----

Date: Tue, 12 Oct 2010 11:46:57 +0200
From: Peter Zijlstra <peterz@infradead.org>
To: William Pitcock <nenolod@dereferenced.org>
Cc: linux-kernel@vger.kernel.org, Ingo Molnar <mingo@elte.hu>,
	Mike Galbraith <efault@gmx.de>
Subject: Re: [PATCH try 3] CFS: Add hierarchical tree-based penalty.

On Tue, 2010-10-12 at 13:34 +0400, William Pitcock wrote:
> Yes, this should be a multiplication I believe, not a divide.  My original
> code had this as a multiplication, not a division, as does the new patch.
> 
> However, I think:
> 
>     vruntime >>= tsk->fork_depth;
> 
> would do the job just as well and be faster. 

That's still somewhat iffy as explained, vruntime is the absolute
service level, multiplying that by 2 (or even more) will utterly upset
things.

Imagine two runnable tasks of weight 1, say both have a vruntime of 3
million, seconds (there being two, vruntime will advance at 1/2
wall-time).

Now, suppose you wake a third, it too had a vruntime of around 3 million
seconds (it only slept for a little while), if you then multiply that
with 2 and place it at 6 mil, it will have to wait for 6 mil seconds
before it gets serviced (twice the time of the 3 mil difference in
service time between this new and the old tasks).

So, theory says the fair thing to do is place new tasks at the weighted
average of the existing tasks, but computing that is expensive, so what
we do is place it somewhere near the leftmost task in the tree.

Now, you don't want to push it out too far to the right, otherwise we
get starvation issues and people get upset.

So you have to somehow determine a window in which you want to place
this task and then vary in that depending on your fork_depth.

Simply manipulating the absolute service levels like you propose isn't
going to work.


----- End forwarded message -----

-- 
Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH try 5] CFS: Add hierarchical tree-based penalty.
  2010-10-12 10:16         ` Ingo Molnar
@ 2010-10-12 10:26           ` Con Kolivas
  0 siblings, 0 replies; 9+ messages in thread
From: Con Kolivas @ 2010-10-12 10:26 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: William Pitcock, linux-kernel, peterz, efault

On Tue, 12 Oct 2010 21:16:45 Ingo Molnar wrote:
> * Con Kolivas <kernel@kolivas.org> wrote:> > It's a fun feature I've been playing with that was going to make it into
> > the next -ck, albeit disabled by default. Here's what the patch
> > changelog was
> 
> > going to say:
> Find below the reply Peter sent to William's v5 patch. I suspect there
> will be a v6 to address those problems :)
> 
> (William: please Cc: Con too to future updates of your patch.)
> 
> Thanks,

In case it helps, here's the full patch as it would be for BFS357.
It's a simple deadline *= fork_depth (more comments than code.)

---

Make it possible to have interactivity and responsiveness at very high load
levels by having a hierarchical tree based penalty. This is achieved by
making deadlines offset by the fork depth from init. This has a similar effect
to 'nice'ing loads that are fork heavy (such as 'make'), and biases CPU and
latency towards threaded desktop applications.

When a new process is forked, its fork depth is inherited from its parent
across fork() and then is incremented by one. That fork_depth is then used
to cause a relative offset of its deadline. Threads keep the same fork_depth
as their parent process as these tend to belong to threaded desktop apps.

Using a dual core machine as an example, and running the "browser benchmark"
at http://service.futuremark.com/peacekeeper/index.action shows the effect
this patch has.

The benchmark runs a number of different browser based workloads, and gives
a score in points, where higher is better.

Running the benchmark under various different loads with the feature enabled/
disabled:

Load            Disabled        Enabled
None            2437            2437
make -j2        1642            2293
make -j24       208             2187
make -j42       failed          1626

As can be seen, on the dual core machine, a load of 2 makes the benchmark run
almost precisely 1/3 slower as would be expected with BFS' fair CPU
distribution of 3 processes between 2 CPUs. Enabling this feature makes this
benchmark progress almost unaffected at this load, and only once the load is
more than 20 times higher does it hinder the benchmark to the same degree.

Other side effects of this patch are that it weakly partitions CPU entitlement
to different users, and provides some protection against fork bombs.

Note that this drastically affects CPU distribution,  No assumption as to CPU
distribution should be made based on past behaviour. It can be difficult to
apportion a lot of CPU to a fork heavy workload with this enabled, and the
effects of 'nice' are compounded.

Unlike other approaches to improving latency under load of smaller timeslices,
enabling this feature has no detrimental effect on throughput under load.

This feature is disabled in this patch by default as it may lead to unexpected
changes in CPU distribution and there may be real world regressions.

There is a sysctl to enable/disable this feature in
/proc/sys/kernel/fork_depth_penalty

---
 Documentation/sysctl/kernel.txt |   16 ++++++++++++++++
 include/linux/sched.h           |    2 ++
 kernel/sched_bfs.c              |   37 +++++++++++++++++++++++++++++++++----
 kernel/sysctl.c                 |   10 ++++++++++
 4 files changed, 61 insertions(+), 4 deletions(-)

Index: linux-2.6.36-rc7-ck2/Documentation/sysctl/kernel.txt
===================================================================
--- linux-2.6.36-rc7-ck2.orig/Documentation/sysctl/kernel.txt	2010-10-12 11:44:12.294066709 +1100
+++ linux-2.6.36-rc7-ck2/Documentation/sysctl/kernel.txt	2010-10-12 11:49:36.916804219 +1100
@@ -29,6 +29,7 @@ show up in /proc/sys/kernel:
 - ctrl-alt-del
 - dentry-state
 - domainname
+- fork_depth_penalty
 - hostname
 - hotplug
 - iso_cpu
@@ -235,6 +236,21 @@ see the hostname(1) man page.
 
 ==============================================================
 
+fork_depth_penalty: (BFS CPU scheduler only)
+
+Whether to penalise CPU according to fork depth. This ends up favouring CPU
+distribution and latency greatly towards threaded desktop applications over
+heavily forked workloads such as compilation, and has the added side-effect
+of weakly partitioning CPU by user. Enabling this changes normal concepts of
+fairness as it has the effect of 'nice'ing workloads such as compilation with
+make.
+
+0: Disabled.
+
+1: Enabled.
+
+==============================================================
+
 hotplug:
 
 Path for the hotplug policy agent.
Index: linux-2.6.36-rc7-ck2/include/linux/sched.h
===================================================================
--- linux-2.6.36-rc7-ck2.orig/include/linux/sched.h	2010-10-12 11:44:12.281066921 +1100
+++ linux-2.6.36-rc7-ck2/include/linux/sched.h	2010-10-12 11:46:46.541564134 +1100
@@ -1188,6 +1188,8 @@ struct task_struct {
 #ifdef CONFIG_SCHED_BFS
 	int time_slice;
 	u64 deadline;
+	/* Depth of forks from init */
+	int fork_depth;
 	struct list_head run_list;
 	u64 last_ran;
 	u64 sched_time; /* sched_clock time spent running */
Index: linux-2.6.36-rc7-ck2/kernel/sched_bfs.c
===================================================================
--- linux-2.6.36-rc7-ck2.orig/kernel/sched_bfs.c	2010-10-12 11:44:12.257067311 +1100
+++ linux-2.6.36-rc7-ck2/kernel/sched_bfs.c	2010-10-12 12:02:19.062552716 +1100
@@ -138,6 +138,10 @@ int rr_interval __read_mostly = 6;
  */
 int sched_iso_cpu __read_mostly = 70;
 
+
+/* fork_depth_penalty - Whether to penalise CPU according to fork depth. */
+int fork_depth_penalty __read_mostly;
+
 /*
  * The relative length of deadline for each priority(nice) level.
  */
@@ -1635,6 +1639,12 @@ void wake_up_new_task(struct task_struct
 	parent = p->parent;
 	/* Unnecessary but small chance that the parent changed CPU */
 	set_task_cpu(p, task_cpu(parent));
+	/*
+	 * fork_depth is inherited and incremented when we spawn a new
+	 * process and not a thread.
+	 */
+	if (!(clone_flags & CLONE_THREAD))
+		p->fork_depth++;
 	activate_task(p, rq);
 	trace_sched_wakeup_new(p, 1);
 	if (!(clone_flags & CLONE_VM) && rq->curr == parent &&
@@ -2526,7 +2536,21 @@ static inline u64 prio_deadline_diff(int
 
 static inline u64 task_deadline_diff(struct task_struct *p)
 {
-	return prio_deadline_diff(TASK_USER_PRIO(p));
+	u64 pdd = prio_deadline_diff(TASK_USER_PRIO(p));
+
+	if (fork_depth_penalty) {
+		/*
+		 * With fork_depth_penalty enabled, we offset deadline
+		 * according to how deeply the process has forked from init,
+		 * thus creating a hierarchical tree-based penalty. This has
+		 * the effect of distributing better latency and CPU more to
+		 * "foreground" type threaded desktop applications. It also
+		 * softly partitions CPU by user. Only init has a depth of 0.
+		 */
+		if (likely(p->fork_depth > 1))
+			pdd *= p->fork_depth;
+	}
+	return pdd;
 }
 
 static inline u64 static_deadline_diff(int static_prio)
@@ -3513,7 +3537,7 @@ SYSCALL_DEFINE1(nice, int, increment)
  *
  * This is the priority value as seen by users in /proc.
  * RT tasks are offset by -100. Normal tasks are centered around 1, value goes
- * from 0 (SCHED_ISO) up to 82 (nice +19 SCHED_IDLEPRIO).
+ * from 0 (SCHED_ISO) upwards (nice +19 SCHED_IDLEPRIO).
  */
 int task_prio(const struct task_struct *p)
 {
@@ -3525,8 +3549,13 @@ int task_prio(const struct task_struct *
 
 	/* Convert to ms to avoid overflows */
 	delta = NS_TO_MS(p->deadline - grq.niffies);
-	delta = delta * 40 / ms_longest_deadline_diff();
-	if (delta > 0 && delta <= 80)
+	/* Deadlines are approximately 10x larger with fork_depth_penalty */
+	if (fork_depth_penalty)
+		delta *= 4;
+	else
+		delta *= 40;
+	delta /= ms_longest_deadline_diff();
+	if (delta > 0)
 		prio += delta;
 	if (idleprio_task(p))
 		prio += 40;
Index: linux-2.6.36-rc7-ck2/kernel/sysctl.c
===================================================================
--- linux-2.6.36-rc7-ck2.orig/kernel/sysctl.c	2010-10-12 11:44:12.269067116 +1100
+++ linux-2.6.36-rc7-ck2/kernel/sysctl.c	2010-10-12 12:03:10.238592352 +1100
@@ -121,6 +121,7 @@ static int __maybe_unused one_hundred = 
 #ifdef CONFIG_SCHED_BFS
 extern int rr_interval;
 extern int sched_iso_cpu;
+extern int fork_depth_penalty;
 static int __read_mostly one_thousand = 1000;
 #endif
 #ifdef CONFIG_PRINTK
@@ -834,6 +835,15 @@ static struct ctl_table kern_table[] = {
 		.extra1		= &zero,
 		.extra2		= &one_hundred,
 	},
+	{
+		.procname	= "fork_depth_penalty",
+		.data		= &fork_depth_penalty,
+		.maxlen		= sizeof (int),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec_minmax,
+		.extra1		= &zero,
+		.extra2		= &one,
+	},
 #endif
 #if defined(CONFIG_S390) && defined(CONFIG_SMP)
 	{

-- 
-ck

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2010-10-12 10:26 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2010-10-12  5:32 [PATCH try 5] CFS: Add hierarchical tree-based penalty William Pitcock
2010-10-12  9:30 ` Ingo Molnar
2010-10-12  9:35   ` Peter Zijlstra
2010-10-12  9:39   ` William Pitcock
2010-10-12  9:47     ` Ingo Molnar
2010-10-12  9:57       ` Con Kolivas
2010-10-12 10:16         ` Ingo Molnar
2010-10-12 10:26           ` Con Kolivas
  -- strict thread matches above, loose matches on Subject: below --
2010-10-12  5:32 William Pitcock

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox