public inbox for linux-kernel@vger.kernel.org
* [PATCH try 3] CFS: Add hierarchical tree-based penalty.
@ 2010-10-12  5:32 William Pitcock
  2010-10-12  8:45 ` Peter Zijlstra
  0 siblings, 1 reply; 5+ messages in thread
From: William Pitcock @ 2010-10-12  5:32 UTC (permalink / raw)
  To: linux-kernel

Inspired by the recent change to BFS by Con Kolivas, this patch causes
vruntime to be penalized based on a task's depth from its root task
group.

I have, for the moment, decided to make it a default feature, since the
design of CFS ensures that broken applications that depend on traditional
task enqueue behaviour will continue to work.

My method for applying the penalty is different from that of BFS: it
divides the vruntime by the number of parents the process has.

Signed-off-by: William Pitcock <nenolod@dereferenced.org>
---
 include/linux/sched.h   |    2 ++
 kernel/sched.c          |    4 ++++
 kernel/sched_fair.c     |    8 ++++++++
 kernel/sched_features.h |   12 ++++++++++++
 4 files changed, 26 insertions(+), 0 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 1e2a6db..7b44f98 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1494,6 +1494,8 @@ struct task_struct {
 		unsigned long memsw_bytes; /* uncharged mem+swap usage */
 	} memcg_batch;
 #endif
+
+	int parent_count;
 };
 
 /* Future-safe accessor for struct task_struct's cpus_allowed. */
diff --git a/kernel/sched.c b/kernel/sched.c
index dc85ceb..16ad939 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -2621,6 +2621,10 @@ void wake_up_new_task(struct task_struct *p, unsigned long clone_flags)
 #endif
 
 	rq = task_rq_lock(p, &flags);
+
+	if (!(clone_flags & CLONE_THREAD))
+		p->parent_count++;
+
 	activate_task(rq, p, 0);
 	trace_sched_wakeup_new(p, 1);
 	check_preempt_curr(rq, p, WF_FORK);
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index db3f674..3f17ec1 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -737,6 +737,14 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
 	if (initial && sched_feat(START_DEBIT))
 		vruntime += sched_vslice(cfs_rq, se);
 
+	if (sched_feat(HIERARCHICAL_PENALTY)) {
+		struct task_struct *tsk = task_of(se);
+
+		if (tsk->parent_count > 1)
+			vruntime = max_vruntime(vruntime,
+						vruntime / tsk->parent_count);
+	}
+
 	/* sleeps up to a single latency don't count. */
 	if (!initial) {
 		unsigned long thresh = sysctl_sched_latency;
diff --git a/kernel/sched_features.h b/kernel/sched_features.h
index 83c66e8..cf17f97 100644
--- a/kernel/sched_features.h
+++ b/kernel/sched_features.h
@@ -45,6 +45,18 @@ SCHED_FEAT(LAST_BUDDY, 1)
 SCHED_FEAT(CACHE_HOT_BUDDY, 1)
 
 /*
+ * Hierarchical tree-based penalty: penalize service deficit by
+ * an order of magnitude for each parent process in the process
+ * tree.  This has the natural effect of forcing preference towards
+ * processes that are not fork()-hungry, like make(1), which helps
+ * to preserve good latency.
+ *
+ * This also has the side effect of providing in a limited way,
+ * per-user CPU entitlement partitioning.
+ */
+SCHED_FEAT(HIERARCHICAL_PENALTY, 1)
+
+/*
  * Use arch dependent cpu power functions
  */
 SCHED_FEAT(ARCH_POWER, 0)
-- 
1.7.2.1



* Re: [PATCH try 3] CFS: Add hierarchical tree-based penalty.
  2010-10-12  5:32 William Pitcock
@ 2010-10-12  8:45 ` Peter Zijlstra
  0 siblings, 0 replies; 5+ messages in thread
From: Peter Zijlstra @ 2010-10-12  8:45 UTC (permalink / raw)
  To: William Pitcock; +Cc: linux-kernel, Ingo Molnar, Mike Galbraith

It's customary to CC people who work on the code you patch...

On Tue, 2010-10-12 at 00:32 -0500, William Pitcock wrote:
> Inspired by the recent change to BFS by Con Kolivas, this patch causes
> vruntime to be penalized based on a task's depth from its root task
> group.
> 
> I have, for the moment, decided to make it a default feature, since the
> design of CFS ensures that broken applications that depend on traditional
> task enqueue behaviour will continue to work.
> 
> My method for applying the penalty is different from that of BFS: it
> divides the vruntime by the number of parents the process has.

Aside from the funny semantic error in your sentence there (every
process can, by definition, only have one parent), the patch doesn't
look quite right either.


> Signed-off-by: William Pitcock <nenolod@dereferenced.org>
> ---
>  include/linux/sched.h   |    2 ++
>  kernel/sched.c          |    4 ++++
>  kernel/sched_fair.c     |    8 ++++++++
>  kernel/sched_features.h |   12 ++++++++++++
>  4 files changed, 26 insertions(+), 0 deletions(-)
> 
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 1e2a6db..7b44f98 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1494,6 +1494,8 @@ struct task_struct {
>  		unsigned long memsw_bytes; /* uncharged mem+swap usage */
>  	} memcg_batch;
>  #endif
> +
> +	int parent_count;
>  };

so copy_process() will make a child inherit the parent_count from the
parent.

>  /* Future-safe accessor for struct task_struct's cpus_allowed. */
> diff --git a/kernel/sched.c b/kernel/sched.c
> index dc85ceb..16ad939 100644
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -2621,6 +2621,10 @@ void wake_up_new_task(struct task_struct *p, unsigned long clone_flags)
>  #endif
>  
>  	rq = task_rq_lock(p, &flags);
> +
> +	if (!(clone_flags & CLONE_THREAD))
> +		p->parent_count++;
> +

And we increment it for every new !THREAD child, so it's basically a task
depth counter, except it doesn't take re-parenting into account.

>  	activate_task(rq, p, 0);
>  	trace_sched_wakeup_new(p, 1);
>  	check_preempt_curr(rq, p, WF_FORK);
> diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
> index db3f674..3f17ec1 100644
> --- a/kernel/sched_fair.c
> +++ b/kernel/sched_fair.c
> @@ -737,6 +737,14 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
>  	if (initial && sched_feat(START_DEBIT))
>  		vruntime += sched_vslice(cfs_rq, se);
>  
> +	if (sched_feat(HIERARCHICAL_PENALTY)) {
> +		struct task_struct *tsk = task_of(se);
> +

And here you have a bug... there is no guarantee se is actually a task.
Which means you're dereferencing random memory here:

> +		if (tsk->parent_count > 1)
> +			vruntime = max_vruntime(vruntime,
> +						vruntime / tsk->parent_count);

Aside from the wrap issue, that is a NOP statement.

x / y < x for y > 1; therefore max(x, x/y) == x, and you wrote x = x;

Furthermore, vruntime here is an absolute measure of service; dividing
it by anything reduces the total amount of service the task received
during its history, and doing so for y > 1 means at least halving it.
Again, you basically end up with:

  se->vruntime = se->vruntime;

Due to the final few statements in place_entity():

  se->vruntime = max_vruntime(se->vruntime, vruntime);

What I think you meant to do was proportionally decrease the relative
placement of the new task by the task-depth, or something like that.

> +	}
> +
>  	/* sleeps up to a single latency don't count. */
>  	if (!initial) {
>  		unsigned long thresh = sysctl_sched_latency;
> diff --git a/kernel/sched_features.h b/kernel/sched_features.h
> index 83c66e8..cf17f97 100644
> --- a/kernel/sched_features.h
> +++ b/kernel/sched_features.h
> @@ -45,6 +45,18 @@ SCHED_FEAT(LAST_BUDDY, 1)
>  SCHED_FEAT(CACHE_HOT_BUDDY, 1)
>  
>  /*
> + * Hierarchical tree-based penalty: penalize service deficit by
> + * an order of magnitude for each parent process in the process
> + * tree.  This has the natural effect of forcing preference towards
> + * processes that are not fork()-hungry, like make(1), which helps
> + * to preserve good latency.
> + *
> + * This also has the side effect of providing in a limited way,
> + * per-user CPU entitlement partitioning.
> + */

But that doesn't quite match your description here, because decreasing
the relative placement for deep tasks will actually give those a benefit
over the shallow tasks...

> +SCHED_FEAT(HIERARCHICAL_PENALTY, 1)
> +
> +/*
>   * Use arch dependent cpu power functions
>   */
>  SCHED_FEAT(ARCH_POWER, 0)



* Re: [PATCH try 3] CFS: Add hierarchical tree-based penalty.
       [not found] <29349823.1671286875856124.JavaMail.root@ifrit.dereferenced.org>
@ 2010-10-12  9:34 ` William Pitcock
  2010-10-12  9:46   ` Peter Zijlstra
  0 siblings, 1 reply; 5+ messages in thread
From: William Pitcock @ 2010-10-12  9:34 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: linux-kernel, Ingo Molnar, Mike Galbraith

Hi,

----- "Peter Zijlstra" <peterz@infradead.org> wrote:

> It's customary to CC people who work on the code you patch...

Please take a look at the revised patch I just sent to the list.

> 
> On Tue, 2010-10-12 at 00:32 -0500, William Pitcock wrote:
> > Inspired by the recent change to BFS by Con Kolivas, this patch causes
> > vruntime to be penalized based on a task's depth from its root task
> > group.
> > 
> > I have, for the moment, decided to make it a default feature, since the
> > design of CFS ensures that broken applications that depend on traditional
> > task enqueue behaviour will continue to work.
> > 
> > My method for applying the penalty is different from that of BFS: it
> > divides the vruntime by the number of parents the process has.
> 
> Aside from the funny semantic error in your sentence there (every
> process can, by definition, only have one parent), the patch doesn't
> look quite right either.
> 
> 
> > Signed-off-by: William Pitcock <nenolod@dereferenced.org>
> > ---
> >  include/linux/sched.h   |    2 ++
> >  kernel/sched.c          |    4 ++++
> >  kernel/sched_fair.c     |    8 ++++++++
> >  kernel/sched_features.h |   12 ++++++++++++
> >  4 files changed, 26 insertions(+), 0 deletions(-)
> > 
> > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > index 1e2a6db..7b44f98 100644
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -1494,6 +1494,8 @@ struct task_struct {
> >  		unsigned long memsw_bytes; /* uncharged mem+swap usage */
> >  	} memcg_batch;
> >  #endif
> > +
> > +	int parent_count;
> >  };
> 
> so copy_process() will make a child inherit the parent_count from the
> parent.
> 
> >  /* Future-safe accessor for struct task_struct's cpus_allowed. */
> > diff --git a/kernel/sched.c b/kernel/sched.c
> > index dc85ceb..16ad939 100644
> > --- a/kernel/sched.c
> > +++ b/kernel/sched.c
> > @@ -2621,6 +2621,10 @@ void wake_up_new_task(struct task_struct *p, unsigned long clone_flags)
> >  #endif
> >  
> >  	rq = task_rq_lock(p, &flags);
> > +
> > +	if (!(clone_flags & CLONE_THREAD))
> > +		p->parent_count++;
> > +
> 
> And we increment it for every new !THREAD child, so it's basically a task
> depth counter, except it doesn't take re-parenting into account.

I renamed it to fork_depth, which is what BFS uses to describe this
variable.  It may be better to put this in sched_entity, however.

> 
> >  	activate_task(rq, p, 0);
> >  	trace_sched_wakeup_new(p, 1);
> >  	check_preempt_curr(rq, p, WF_FORK);
> > diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
> > index db3f674..3f17ec1 100644
> > --- a/kernel/sched_fair.c
> > +++ b/kernel/sched_fair.c
> > @@ -737,6 +737,14 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
> >  	if (initial && sched_feat(START_DEBIT))
> >  		vruntime += sched_vslice(cfs_rq, se);
> >  
> > +	if (sched_feat(HIERARCHICAL_PENALTY)) {
> > +		struct task_struct *tsk = task_of(se);
> > +
> 
> And here you have a bug... there is no guarantee se is actually a task.
> Which means you're dereferencing random memory here:

I am now checking this with entity_is_task() in the newest version.

> 
> > +		if (tsk->parent_count > 1)
> > +			vruntime = max_vruntime(vruntime,
> > +						vruntime / tsk->parent_count);
> 
> Aside from the wrap issue, that is a NOP statement.
> 
> x / y < x for y > 1; therefore max(x, x/y) == x, and you wrote x = x;
> 
> Furthermore, vruntime here is an absolute measure of service; dividing
> it by anything reduces the total amount of service the task received
> during its history, and doing so for y > 1 means at least halving it.
> Again, you basically end up with:
> 
>   se->vruntime = se->vruntime;
> 
> Due to the final few statements in place_entity():
> 
>   se->vruntime = max_vruntime(se->vruntime, vruntime);
> 
> What I think you meant to do was proportionally decrease the relative
> placement of the new task by the task-depth, or something like that.

Yes, this should be a multiplication I believe, not a divide.  My original
code had this as a multiplication, not a division, as does the new patch.

However, I think:

    vruntime >>= tsk->fork_depth;

would do the job just as well and be faster.

I would be glad to know what you think about this.

William


* Re: [PATCH try 3] CFS: Add hierarchical tree-based penalty.
  2010-10-12  9:34 ` [PATCH try 3] CFS: Add hierarchical tree-based penalty William Pitcock
@ 2010-10-12  9:46   ` Peter Zijlstra
  0 siblings, 0 replies; 5+ messages in thread
From: Peter Zijlstra @ 2010-10-12  9:46 UTC (permalink / raw)
  To: William Pitcock; +Cc: linux-kernel, Ingo Molnar, Mike Galbraith

On Tue, 2010-10-12 at 13:34 +0400, William Pitcock wrote:
> Yes, this should be a multiplication I believe, not a divide.  My original
> code had this as a multiplication, not a division, as does the new patch.
> 
> However, I think:
> 
>     vruntime >>= tsk->fork_depth;
> 
> would do the job just as well and be faster. 

That's still somewhat iffy, as explained: vruntime is the absolute
service level, and multiplying it by 2 (or even more) will utterly upset
things.

Imagine two runnable tasks of weight 1; say both have a vruntime of 3
million seconds (there being two, vruntime will advance at 1/2
wall-time).

Now, suppose you wake a third; it too had a vruntime of around 3 million
seconds (it only slept for a little while). If you then multiply that
by 2 and place it at 6 million, it will have to wait for 6 million
seconds before it gets serviced (twice the 3 million difference in
service time between this new task and the old ones).

So, theory says the fair thing to do is place new tasks at the weighted
average of the existing tasks, but computing that is expensive, so what
we do is place it somewhere near the leftmost task in the tree.

Now, you don't want to push it out too far to the right, otherwise we
get starvation issues and people get upset.

So you have to somehow determine a window in which you want to place
this task and then vary in that depending on your fork_depth.

Simply manipulating the absolute service levels like you propose isn't
going to work.




* Re: [PATCH try 3] CFS: Add hierarchical tree-based penalty.
       [not found] <31330936.1751286877737150.JavaMail.root@ifrit.dereferenced.org>
@ 2010-10-12 10:02 ` William Pitcock
  0 siblings, 0 replies; 5+ messages in thread
From: William Pitcock @ 2010-10-12 10:02 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: linux-kernel, Ingo Molnar, Mike Galbraith

Hi,

----- "Peter Zijlstra" <peterz@infradead.org> wrote:

> On Tue, 2010-10-12 at 13:34 +0400, William Pitcock wrote:
> > Yes, this should be a multiplication I believe, not a divide.  My original
> > code had this as a multiplication, not a division, as does the new patch.
> > 
> > However, I think:
> > 
> >     vruntime >>= tsk->fork_depth;
> > 
> > would do the job just as well and be faster. 
> 
> That's still somewhat iffy, as explained: vruntime is the absolute
> service level, and multiplying it by 2 (or even more) will utterly upset
> things.
> 

Yes, this is why I thought that doing a division would be better.

> Imagine two runnable tasks of weight 1; say both have a vruntime of 3
> million seconds (there being two, vruntime will advance at 1/2
> wall-time).
> 
> Now, suppose you wake a third; it too had a vruntime of around 3 million
> seconds (it only slept for a little while). If you then multiply that
> by 2 and place it at 6 million, it will have to wait for 6 million
> seconds before it gets serviced (twice the 3 million difference in
> service time between this new task and the old ones).
> 
> So, theory says the fair thing to do is place new tasks at the weighted
> average of the existing tasks, but computing that is expensive, so what
> we do is place it somewhere near the leftmost task in the tree.
> 
> Now, you don't want to push it out too far to the right, otherwise we
> get starvation issues and people get upset.
> 
> So you have to somehow determine a window in which you want to place
> this task and then vary in that depending on your fork_depth.
> 
> Simply manipulating the absolute service levels like you propose isn't
> going to work.

I think I have a solution to this.  Presently waiting on a kernel rebuild
on my laptop so I can test.

William


