* [PATCH 0/2] Enforce hierarchical order of leaf CFS runqueues
@ 2011-07-18 10:50 Jan H. Schönherr
2011-07-18 10:50 ` [PATCH 1/2] sched: Enforce " Jan H. Schönherr
2011-07-18 10:50 ` [PATCH 2/2] sched: Prevent removal of leaf CFS runqueues with on_list children Jan H. Schönherr
0 siblings, 2 replies; 14+ messages in thread
From: Jan H. Schönherr @ 2011-07-18 10:50 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra
Cc: Paul Turner, linux-kernel, Jan H. Schönherr
Code review showed that the hierarchical order of the leaf
CFS runqueues introduced by commit 67e8625 ("sched: Introduce
hierarchal order on shares update list") is not yet achieved.
(See description of first patch for an example.)
These two patches (against v3.0-rc7) try to fix that.
Patch 1 causes all leaves that are inserted during one
enqueue_task_fair() call to be inserted right next to each other
at the beginning of the list.
Patch 2 prevents the existing code from punching holes
in the leaf-hierarchy as this would cause patch 1 to
re-insert them at possibly wrong positions.
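[Editor's note: to make the reported scenario concrete, here is a userspace
sketch that replays the two enqueues against a miniature list.h. It is not
kernel code; `struct tg`, `old_add_leaf()` and `enqueue()` are illustrative
stand-ins for the cfs_rq/task-group structures, implementing the pre-patch
insertion rule.]

```c
#include <assert.h>
#include <stddef.h>

/* Miniature doubly linked list mimicking the kernel's list.h. */
struct list_head { struct list_head *next, *prev; };

static void init_list(struct list_head *h) { h->next = h->prev = h; }

static void __list_add(struct list_head *new,
                       struct list_head *prev, struct list_head *next)
{
        next->prev = new;
        new->next = next;
        new->prev = prev;
        prev->next = new;
}

static void list_add(struct list_head *new, struct list_head *head)
{ __list_add(new, head, head->next); }

static void list_add_tail(struct list_head *new, struct list_head *head)
{ __list_add(new, head->prev, head); }

/* Hypothetical stand-in for a per-CPU cfs_rq of a task group. */
struct tg {
        const char *name;
        struct tg *parent;
        int on_list;
        struct list_head leaf;
};

/* Pre-patch rule of list_add_leaf_cfs_rq(): insert at the list head if the
 * parent is already listed, otherwise at the tail. */
static void old_add_leaf(struct tg *tg, struct list_head *leaf_list)
{
        if (tg->parent && tg->parent->on_list)
                list_add(&tg->leaf, leaf_list);
        else
                list_add_tail(&tg->leaf, leaf_list);
        tg->on_list = 1;
}

/* Bottom-up enqueue, stopping at the first ancestor already on the list. */
static void enqueue(struct tg *tg, struct list_head *leaf_list)
{
        for (; tg && !tg->on_list; tg = tg->parent)
                old_add_leaf(tg, leaf_list);
}
```

Running the two steps from the first patch's description (enqueue in tg3,
then in tg1) ends with the list ordered tg2, tg3, tg1, i.e. tg1 after its
grandparent tg3, which is exactly the violation the patches address.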
What remains are some multiprocessor aspects: the leaf_cfs_rq_list
is traversed without holding rq->lock. So it can be modified while
a traversal is in progress. That means: while users of the list are
guaranteed to traverse all elements in hierarchical order, they are
not guaranteed to see the parent of an already visited child. But as
far as I currently understand the code, this does not seem to be a
problem, as the code only relies on the former guarantee.
Regards
Jan
Jan H. Schönherr (2):
sched: Enforce order of leaf CFS runqueues
sched: Prevent removal of leaf CFS runqueues with on_list children
kernel/sched.c | 1 +
kernel/sched_fair.c | 45 +++++++++++++++++++++++++++++----------------
2 files changed, 30 insertions(+), 16 deletions(-)
--
1.7.3.4
^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH 1/2] sched: Enforce order of leaf CFS runqueues
  2011-07-18 10:50 [PATCH 0/2] Enforce hierarchical order of leaf CFS runqueues Jan H. Schönherr
@ 2011-07-18 10:50 ` Jan H. Schönherr
  2011-07-18 23:24   ` Paul Turner
  2011-07-18 10:50 ` [PATCH 2/2] sched: Prevent removal of leaf CFS runqueues with on_list children Jan H. Schönherr
  1 sibling, 1 reply; 14+ messages in thread
From: Jan H. Schönherr @ 2011-07-18 10:50 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: Paul Turner, linux-kernel, Jan H. Schönherr

From: Jan H. Schönherr <schnhrr@cs.tu-berlin.de>

Currently, it is still possible that the ordering constraint
is violated. Consider a hierarchy of at least three task groups:

tg1 --> tg2 (parent of tg1) --> tg3 (grandparent of tg1)

And the following actions:
1. Enqueue task in tg3
2. Enqueue task in tg1

Due to step 1, tg3 will be on the list somewhere.

However, in step 2 tg1 will be inserted at the end of the list, because
its parent tg2 is not yet on the list. tg2 will then be inserted at the
beginning. So we get:

tg2 --> tg3 --> tg1

This patch addresses this by carrying the information of the
previously inserted leaf during bottom-up enqueuing.

This allows us to insert everything without a child at the beginning of
the list and all not yet enqueued parents of that runqueue right after it.
This way, every series of enqueues is guaranteed to be inserted before
older series. (Note that this requires that a runqueue is not removed
from the list if it has some children that are still on the list.)

Signed-off-by: Jan H. Schönherr <schnhrr@cs.tu-berlin.de>
---
 kernel/sched_fair.c | 35 ++++++++++++++++++++---------------
 1 files changed, 20 insertions(+), 15 deletions(-)

diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 433491c..d021c75 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -143,23 +143,27 @@ static inline struct cfs_rq *cpu_cfs_rq(struct cfs_rq *cfs_rq, int this_cpu)
 	return cfs_rq->tg->cfs_rq[this_cpu];
 }
 
-static inline void list_add_leaf_cfs_rq(struct cfs_rq *cfs_rq)
+static inline void list_add_leaf_cfs_rq(struct cfs_rq *cfs_rq,
+					struct cfs_rq *prev_cfs_rq)
 {
+	struct list_head *prev_leaf_cfs_rq;
+
 	if (!cfs_rq->on_list) {
 		/*
-		 * Ensure we either appear before our parent (if already
-		 * enqueued) or force our parent to appear after us when it is
-		 * enqueued. The fact that we always enqueue bottom-up
-		 * reduces this to two cases.
+		 * Ensure that children appear before their parent. The fact
+		 * that we always enqueue from some point in the hierarchy
+		 * until the top reduces this to two cases:
+		 * Either we are in the middle of one of these enqueue series,
+		 * then we enqueue directly after the last enqueued runqueue;
+		 * or we are starting such a sequence, in which case we insert
+		 * the runqueue at the beginning of the list.
 		 */
-		if (cfs_rq->tg->parent &&
-		    cfs_rq->tg->parent->cfs_rq[cpu_of(rq_of(cfs_rq))]->on_list) {
-			list_add_rcu(&cfs_rq->leaf_cfs_rq_list,
-				&rq_of(cfs_rq)->leaf_cfs_rq_list);
-		} else {
-			list_add_tail_rcu(&cfs_rq->leaf_cfs_rq_list,
-				&rq_of(cfs_rq)->leaf_cfs_rq_list);
-		}
+		if (prev_cfs_rq)
+			prev_leaf_cfs_rq = &prev_cfs_rq->leaf_cfs_rq_list;
+		else
+			prev_leaf_cfs_rq = &rq_of(cfs_rq)->leaf_cfs_rq_list;
+
+		list_add_rcu(&cfs_rq->leaf_cfs_rq_list, prev_leaf_cfs_rq);
 
 		cfs_rq->on_list = 1;
 	}
@@ -276,7 +280,8 @@ static inline struct cfs_rq *cpu_cfs_rq(struct cfs_rq *cfs_rq, int this_cpu)
 	return &cpu_rq(this_cpu)->cfs;
 }
 
-static inline void list_add_leaf_cfs_rq(struct cfs_rq *cfs_rq)
+static inline void list_add_leaf_cfs_rq(struct cfs_rq *cfs_rq,
+					struct cfs_rq *prev_cfs_rq)
 {
 }
 
@@ -999,7 +1004,7 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	se->on_rq = 1;
 
 	if (cfs_rq->nr_running == 1)
-		list_add_leaf_cfs_rq(cfs_rq);
+		list_add_leaf_cfs_rq(cfs_rq, group_cfs_rq(se));
 }
 
 static void __clear_buddies_last(struct sched_entity *se)
-- 
1.7.3.4

^ permalink raw reply related	[flat|nested] 14+ messages in thread
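[Editor's note: the insertion rule of the patch above can be checked in
isolation with a userspace sketch. This is not the kernel code; `struct tg`,
`add_leaf()` and `enqueue()` are illustrative stand-ins, and the plain
`list_add()` stands in for `list_add_rcu()`.]

```c
#include <assert.h>
#include <stddef.h>

/* Miniature doubly linked list in the style of the kernel's list.h. */
struct list_head { struct list_head *next, *prev; };

static void init_list(struct list_head *h) { h->next = h->prev = h; }

/* Insert new right after head. */
static void list_add(struct list_head *new, struct list_head *head)
{
        new->next = head->next;
        new->prev = head;
        head->next->prev = new;
        head->next = new;
}

/* Hypothetical stand-in for a per-CPU cfs_rq of a task group. */
struct tg {
        const char *name;
        struct tg *parent;
        int on_list;
        struct list_head leaf;
};

/* Analogue of the patched list_add_leaf_cfs_rq(cfs_rq, prev_cfs_rq):
 * insert after the leaf inserted previously in the same enqueue series,
 * or at the list head when a new series starts. */
static void add_leaf(struct tg *tg, struct list_head *leaf_list, struct tg *prev)
{
        if (!tg->on_list) {
                list_add(&tg->leaf, prev ? &prev->leaf : leaf_list);
                tg->on_list = 1;
        }
}

/* Bottom-up enqueue carrying the previously inserted leaf. */
static void enqueue(struct tg *tg, struct list_head *leaf_list)
{
        struct tg *prev = NULL;

        for (; tg && !tg->on_list; prev = tg, tg = tg->parent)
                add_leaf(tg, leaf_list, prev);
}
```

With the tg1/tg2/tg3 scenario from the changelog, the list ends up ordered
tg1, tg2, tg3: every child precedes its parent, and the newer series (tg1,
tg2) sits before the older one (tg3).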
* Re: [PATCH 1/2] sched: Enforce order of leaf CFS runqueues
  2011-07-18 10:50 ` [PATCH 1/2] sched: Enforce " Jan H. Schönherr
@ 2011-07-18 23:24   ` Paul Turner
  2011-07-19 13:08     ` Peter Zijlstra
                        ` (2 more replies)
  0 siblings, 3 replies; 14+ messages in thread
From: Paul Turner @ 2011-07-18 23:24 UTC (permalink / raw)
  To: Jan H. Schönherr; +Cc: Ingo Molnar, Peter Zijlstra, linux-kernel

On Mon, Jul 18, 2011 at 3:50 AM, Jan H. Schönherr
<schnhrr@cs.tu-berlin.de> wrote:
> From: Jan H. Schönherr <schnhrr@cs.tu-berlin.de>
>
> Currently, it is still possible that the ordering constraint
> is violated. Consider a hierarchy of at least three task groups:
>
> tg1 --> tg2 (parent of tg1) --> tg3 (grandparent of tg1)
>
> And the following actions:
> 1. Enqueue task in tg3
> 2. Enqueue task in tg1
>
> Due to step 1, tg3 will be on the list somewhere.

Argh, yeah, nice catch.

What we really want is whether the node at the *highest* level in the
tree, "the root parent", is on queue, as opposed to the immediate
parent, as there may be a discontinuity in this state as your example
presents. Although this would still be subject to deletion punching
holes in things..

>
> However, in step 2 tg1 will be inserted at the end of the list, because
> its parent tg2 is not yet on the list. tg2 will then be inserted at the
> beginning. So we get:
>
> tg2 --> tg3 --> tg1
>
> This patch addresses this by carrying the information of the
> previously inserted leaf during bottom-up enqueuing.
>
> This allows us to insert everything without a child at the beginning of
> the list and all not yet enqueued parents of that runqueue right after it.
> This way, every series of enqueues is guaranteed to be inserted before
> older series. (Note that this requires that a runqueue is not removed
> from the list if it has some children that are still on the list.)
>

One thing that bothers me about this is that it adds an implicit
positional dependency in enqueue_entity. It's no longer safe to
enqueue/dequeue an arbitrary entity; this leads to situation 1 in the
follow-up patch requiring in-order removal. If the enqueue is always
safe then this is not required.

I wonder if we could move this logic to the enqueue_task_fair() path
as that is the gating condition to become a leaf for load tracking.
Ideally we could build up a list of the untracked entities and then
splice it, but alas we have to use an rculist to track leaf cfs_rq's
so that we can walk it without locks in update_shares.

hmmm, what about something like the below (only boot tested), it
should make the insert case always safe meaning we don't need to do
anything funky around delete:

---
Subject: [PATCH] sched: handle on_list ancestor in leaf_add_cfs_rq()
From: Paul Turner <pjt@google.com>
Date: Mon, 18 Jul 2011 16:08:10 -0700

Jan H. Schönherr found that in the case of an on_list ancestor we may
incorrectly place the child to the right of a great-ancestor on the list.

Consider:

      A
     / \     Here, t1A results in A->cfs_rq being on_list, however when
    B  t1A   we start enqueuing from C this will not be visible. This is
   /         compounded by the fact that on_list expiration may also be out
  C          of order, punching holes in the tree.
 /
t1C

Prevent this by making additions to the leaf_cfs_rq_list position
independent. This is done by maintaining additions to this list within
the enqueue_task_fair() path, which allows us to always enqueue against
the current entity's first on_list ancestor.

Reported-by: Jan H. Schönherr <schnhrr@cs.tu-berlin.de>
Signed-off-by: Paul Turner <pjt@google.com>
---
 kernel/sched_fair.c | 55 +++++++++++++++++++++++++++++++------------------
 1 files changed, 34 insertions(+), 21 deletions(-)

diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index eb98f77..a7e0966 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -143,26 +143,39 @@ static inline struct cfs_rq *cpu_cfs_rq(struct cfs_rq *cfs_rq, int this_cpu)
 	return cfs_rq->tg->cfs_rq[this_cpu];
 }
 
-static inline void list_add_leaf_cfs_rq(struct cfs_rq *cfs_rq)
+/*
+ * rq->leaf_cfs_rq_list has an order constraint that specifies children must
+ * appear before parents. For the (!on_list) chain starting at cfs_rq this
+ * finds a satisfactory insertion point. If no ancestor is yet on_list, this
+ * choice is arbitrary.
+ */
+static inline struct list_head *find_leaf_cfs_rq_insertion(struct cfs_rq *cfs_rq)
 {
-	if (!cfs_rq->on_list) {
-		/*
-		 * Ensure we either appear before our parent (if already
-		 * enqueued) or force our parent to appear after us when it is
-		 * enqueued. The fact that we always enqueue bottom-up
-		 * reduces this to two cases.
-		 */
-		if (cfs_rq->tg->parent &&
-		    cfs_rq->tg->parent->cfs_rq[cpu_of(rq_of(cfs_rq))]->on_list) {
-			list_add_rcu(&cfs_rq->leaf_cfs_rq_list,
-				&rq_of(cfs_rq)->leaf_cfs_rq_list);
-		} else {
-			list_add_tail_rcu(&cfs_rq->leaf_cfs_rq_list,
-				&rq_of(cfs_rq)->leaf_cfs_rq_list);
-		}
+	struct sched_entity *se;
+
+	se = cfs_rq->tg->se[cpu_of(rq_of(cfs_rq))];
+	for_each_sched_entity(se)
+		if (cfs_rq->on_list)
+			return &cfs_rq->leaf_cfs_rq_list;
 
-		cfs_rq->on_list = 1;
+	return &rq_of(cfs_rq)->leaf_cfs_rq_list;
+}
+
+static inline void enqueue_leaf_cfs_rq(struct cfs_rq *cfs_rq,
+				       struct list_head **leaf_insertion)
+{
+	if (cfs_rq->on_list) {
+		/* make sure we search again when !on_list follows on_list */
+		*leaf_insertion = NULL;
+		return;
 	}
+
+	if (!*leaf_insertion)
+		*leaf_insertion = find_leaf_cfs_rq_insertion(cfs_rq);
+
+	/* use insertion point as a queue head => children appear first */
+	list_add_tail_rcu(&cfs_rq->leaf_cfs_rq_list, *leaf_insertion);
+	cfs_rq->on_list = 1;
 }
 
 static inline void list_del_leaf_cfs_rq(struct cfs_rq *cfs_rq)
@@ -276,7 +289,8 @@ static inline struct cfs_rq *cpu_cfs_rq(struct cfs_rq *cfs_rq, int this_cpu)
 	return &cpu_rq(this_cpu)->cfs;
 }
 
-static inline void list_add_leaf_cfs_rq(struct cfs_rq *cfs_rq)
+static inline void enqueue_leaf_cfs_rq(struct cfs_rq *cfs_rq,
+				       struct list_head **leaf_insertion)
 {
 }
 
@@ -997,9 +1011,6 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	if (se != cfs_rq->curr)
 		__enqueue_entity(cfs_rq, se);
 	se->on_rq = 1;
-
-	if (cfs_rq->nr_running == 1)
-		list_add_leaf_cfs_rq(cfs_rq);
 }
 
 static void __clear_buddies_last(struct sched_entity *se)
@@ -1326,12 +1337,14 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 {
 	struct cfs_rq *cfs_rq;
 	struct sched_entity *se = &p->se;
+	struct list_head *leaf_insertion_point = NULL;
 
 	for_each_sched_entity(se) {
 		if (se->on_rq)
 			break;
 		cfs_rq = cfs_rq_of(se);
 		enqueue_entity(cfs_rq, se, flags);
+		enqueue_leaf_cfs_rq(cfs_rq, &leaf_insertion_point);
 
 		flags = ENQUEUE_WAKEUP;
 	}
-- 
1.7.3.1

^ permalink raw reply related	[flat|nested] 14+ messages in thread
* Re: [PATCH 1/2] sched: Enforce order of leaf CFS runqueues
  2011-07-18 23:24 ` Paul Turner
@ 2011-07-19 13:08   ` Peter Zijlstra
  2011-07-19 17:48     ` Paul Turner
  2011-07-19 15:17 ` Jan Schönherr
  2011-07-21 13:20 ` Jan H. Schönherr
  2 siblings, 1 reply; 14+ messages in thread
From: Peter Zijlstra @ 2011-07-19 13:08 UTC (permalink / raw)
  To: Paul Turner; +Cc: Jan H. Schönherr, Ingo Molnar, linux-kernel

On Mon, 2011-07-18 at 16:24 -0700, Paul Turner wrote:

> Subject: [PATCH] sched: handle on_list ancestor in leaf_add_cfs_rq()
> From: Paul Turner <pjt@google.com>
> Date: Mon, 18 Jul 2011 16:08:10 -0700
>
> Jan H. Schönherr found that in the case of an on_list ancestor we may
> incorrectly place the child to the right of a great-ancestor on the list.
>
> Consider:
>
>       A
>      / \     Here, t1A results in A->cfs_rq being on_list, however when
>     B  t1A   we start enqueuing from C this will not be visible. This is
>    /         compounded by the fact that on_list expiration may also be out
>   C          of order, punching holes in the tree.
>  /
> t1C
>
> Prevent this by making additions to the leaf_cfs_rq_list position independent.
> This is done by maintaining additions to this list within the
> enqueue_task_fair() path, which allows us to always enqueue against the
> current entity's first on_list ancestor.

The problem I have with this is that it makes the enqueue more
expensive. We're now optimizing a relatively slow path (load-balance) at
the cost of the hottest path in the kernel (enqueue/dequeue).

^ permalink raw reply	[flat|nested] 14+ messages in thread
* Re: [PATCH 1/2] sched: Enforce order of leaf CFS runqueues
  2011-07-19 13:08 ` Peter Zijlstra
@ 2011-07-19 17:48   ` Paul Turner
  0 siblings, 0 replies; 14+ messages in thread
From: Paul Turner @ 2011-07-19 17:48 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Jan H., Ingo Molnar, linux-kernel

On Tue, Jul 19, 2011 at 6:08 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Mon, 2011-07-18 at 16:24 -0700, Paul Turner wrote:
>
>> Subject: [PATCH] sched: handle on_list ancestor in leaf_add_cfs_rq()
>> From: Paul Turner <pjt@google.com>
>> Date: Mon, 18 Jul 2011 16:08:10 -0700
>>
>> Jan H. Schönherr found that in the case of an on_list ancestor we may
>> incorrectly place the child to the right of a great-ancestor on the list.
>>
>> Consider:
>>
>>       A
>>      / \     Here, t1A results in A->cfs_rq being on_list, however when
>>     B  t1A   we start enqueuing from C this will not be visible. This is
>>    /         compounded by the fact that on_list expiration may also be out
>>   C          of order, punching holes in the tree.
>>  /
>> t1C
>>
>> Prevent this by making additions to the leaf_cfs_rq_list position independent.
>> This is done by maintaining additions to this list within the
>> enqueue_task_fair() path, which allows us to always enqueue against the
>> current entity's first on_list ancestor.
>
> The problem I have with this is that it makes the enqueue more
> expensive. We're now optimizing a relatively slow path (load-balance) at
> the cost of the hottest path in the kernel (enqueue/dequeue).
>

So this is a concern that I kept in mind. I don't think this should
actually add any measurable overhead:

- on_rq always implies on_list, so we are guaranteed never to go past
  where we have to go for enqueuing; moreover, we're touching the same
  lines that enqueue will use for continuing its walk
- it takes on_list ~40ms+ to expire, so the extra walk above can only
  happen on this order

I can't think of any benchmark that's going to measure anything for
this. pipe-bench et al are just going to always hit the single on_list
branch, which should be the same cost as the previous nr_running == 1
check.

>
>

^ permalink raw reply	[flat|nested] 14+ messages in thread
* Re: [PATCH 1/2] sched: Enforce order of leaf CFS runqueues
  2011-07-18 23:24 ` Paul Turner
  2011-07-19 13:08 ` Peter Zijlstra
@ 2011-07-19 15:17   ` Jan Schönherr
  2011-07-19 17:53     ` Paul Turner
  2011-07-21 13:20 ` Jan H. Schönherr
  2 siblings, 1 reply; 14+ messages in thread
From: Jan Schönherr @ 2011-07-19 15:17 UTC (permalink / raw)
  To: Paul Turner; +Cc: Ingo Molnar, Peter Zijlstra, linux-kernel

On 19.07.2011 01:24, Paul Turner wrote:
> hmmm, what about something like the below (only boot tested), it
> should make the insert case always safe meaning we don't need to do
> anything funky around delete:

Seems to work, too, with two modifications...

> diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
> index eb98f77..a7e0966 100644
> --- a/kernel/sched_fair.c
> +++ b/kernel/sched_fair.c
> @@ -143,26 +143,39 @@ static inline struct cfs_rq *cpu_cfs_rq(struct cfs_rq *cfs_rq, int this_cpu)
>  	return cfs_rq->tg->cfs_rq[this_cpu];
>  }
> 
> -static inline void list_add_leaf_cfs_rq(struct cfs_rq *cfs_rq)
> +/*
> + * rq->leaf_cfs_rq_list has an order constraint that specifies children must
> + * appear before parents. For the (!on_list) chain starting at cfs_rq this
> + * finds a satisfactory insertion point. If no ancestor is yet on_list, this
> + * choice is arbitrary.
> + */
> +static inline struct list_head *find_leaf_cfs_rq_insertion(struct cfs_rq *cfs_rq)
>  {
> -	if (!cfs_rq->on_list) {
> -		/*
> -		 * Ensure we either appear before our parent (if already
> -		 * enqueued) or force our parent to appear after us when it is
> -		 * enqueued. The fact that we always enqueue bottom-up
> -		 * reduces this to two cases.
> -		 */
> -		if (cfs_rq->tg->parent &&
> -		    cfs_rq->tg->parent->cfs_rq[cpu_of(rq_of(cfs_rq))]->on_list) {
> -			list_add_rcu(&cfs_rq->leaf_cfs_rq_list,
> -				&rq_of(cfs_rq)->leaf_cfs_rq_list);
> -		} else {
> -			list_add_tail_rcu(&cfs_rq->leaf_cfs_rq_list,
> -				&rq_of(cfs_rq)->leaf_cfs_rq_list);
> -		}
> +	struct sched_entity *se;
> +
> +	se = cfs_rq->tg->se[cpu_of(rq_of(cfs_rq))];
> +	for_each_sched_entity(se)
> +		if (cfs_rq->on_list)
> +			return &cfs_rq->leaf_cfs_rq_list;

Need to use cfs_rq corresponding to current se:

-	for_each_sched_entity(se)
-		if (cfs_rq->on_list)
-			return &cfs_rq->leaf_cfs_rq_list;
+	for_each_sched_entity(se) {
+		struct cfs_rq *se_cfs_rq = cfs_rq_of(se);
+		if (se_cfs_rq->on_list)
+			return &se_cfs_rq->leaf_cfs_rq_list;
+	}

> 
> -		cfs_rq->on_list = 1;
> +	return &rq_of(cfs_rq)->leaf_cfs_rq_list;
> +}

And something like the following hack to prevent the removal
of the leaf_insertion_point itself during
	enqueue_entity()
	  update_cfs_load()

(Obviously not for production:)

diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 2df33d4..947257d 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -832,7 +832,7 @@ static void update_cfs_load(struct cfs_rq *cfs_rq, int global_update)
 	}
 
 	/* consider updating load contribution on each fold or truncate */
-	if (global_update || cfs_rq->load_period > period
+	if (global_update==1 || cfs_rq->load_period > period
 	    || !cfs_rq->load_period)
 		update_cfs_rq_load_contribution(cfs_rq, global_update);
 
@@ -847,7 +847,7 @@ static void update_cfs_load(struct cfs_rq *cfs_rq, int global_update)
 		cfs_rq->load_avg /= 2;
 	}
 
-	if (!cfs_rq->curr && !cfs_rq->nr_running && !cfs_rq->load_avg)
+	if (!cfs_rq->curr && !cfs_rq->nr_running && !cfs_rq->load_avg && global_update!=2)
 		list_del_leaf_cfs_rq(cfs_rq);
 }
 
@@ -1063,7 +1063,7 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	 * Update run-time statistics of the 'current'.
 	 */
 	update_curr(cfs_rq);
-	update_cfs_load(cfs_rq, 0);
+	update_cfs_load(cfs_rq, 2);
 	account_entity_enqueue(cfs_rq, se);
 	update_cfs_shares(cfs_rq);

^ permalink raw reply related	[flat|nested] 14+ messages in thread
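[Editor's note: the corrected search-and-insert scheme discussed above can be
modelled in userspace. This sketch is not the kernel code; `struct tg`,
`find_insertion()`, `enqueue_leaf()` and `enqueue()` are illustrative
stand-ins, with plain `list_add_tail()` in place of `list_add_tail_rcu()`,
and it incorporates Jan's fix of walking the cfs_rq of each ancestor.]

```c
#include <assert.h>
#include <stddef.h>

struct list_head { struct list_head *next, *prev; };

static void init_list(struct list_head *h) { h->next = h->prev = h; }

/* Insert new right before head (i.e. head acts as a queue head). */
static void list_add_tail(struct list_head *new, struct list_head *head)
{
        new->next = head;
        new->prev = head->prev;
        head->prev->next = new;
        head->prev = new;
}

/* Hypothetical stand-in for a per-CPU cfs_rq of a task group. */
struct tg {
        const char *name;
        struct tg *parent;
        int on_list;
        struct list_head leaf;
};

/* Walk the ancestor chain and return the insertion point next to the
 * first ancestor already on the list; fall back to the list head. */
static struct list_head *find_insertion(struct tg *tg, struct list_head *leaf_list)
{
        for (; tg; tg = tg->parent)
                if (tg->on_list)
                        return &tg->leaf;
        return leaf_list;
}

static void enqueue_leaf(struct tg *tg, struct list_head *leaf_list,
                         struct list_head **ins)
{
        if (tg->on_list) {
                *ins = NULL;    /* search again if !on_list follows on_list */
                return;
        }
        if (!*ins)
                *ins = find_insertion(tg, leaf_list);
        /* insertion point is used as a queue head => children appear first */
        list_add_tail(&tg->leaf, *ins);
        tg->on_list = 1;
}

/* Analogue of the enqueue_task_fair() walk carrying the insertion point. */
static void enqueue(struct tg *tg, struct list_head *leaf_list)
{
        struct list_head *ins = NULL;

        for (; tg; tg = tg->parent)
                enqueue_leaf(tg, leaf_list, &ins);
}
```

Replaying Paul's A/B/C example (t1A first, then t1C) yields the order
C, B, A: both C and B land before their already-listed ancestor A.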
* Re: [PATCH 1/2] sched: Enforce order of leaf CFS runqueues
  2011-07-19 15:17 ` Jan Schönherr
@ 2011-07-19 17:53   ` Paul Turner
  0 siblings, 0 replies; 14+ messages in thread
From: Paul Turner @ 2011-07-19 17:53 UTC (permalink / raw)
  To: Jan Schönherr; +Cc: Ingo Molnar, Peter Zijlstra, linux-kernel

On Tue, Jul 19, 2011 at 8:17 AM, Jan Schönherr <schnhrr@cs.tu-berlin.de> wrote:
> On 19.07.2011 01:24, Paul Turner wrote:
>> hmmm, what about something like the below (only boot tested), it
>> should make the insert case always safe meaning we don't need to do
>> anything funky around delete:
>
> Seems to work, too, with two modifications...
>
>
>> diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
>> index eb98f77..a7e0966 100644
>> --- a/kernel/sched_fair.c
>> +++ b/kernel/sched_fair.c
>> @@ -143,26 +143,39 @@ static inline struct cfs_rq *cpu_cfs_rq(struct cfs_rq *cfs_rq, int this_cpu)
>>  	return cfs_rq->tg->cfs_rq[this_cpu];
>>  }
>>
>> -static inline void list_add_leaf_cfs_rq(struct cfs_rq *cfs_rq)
>> +/*
>> + * rq->leaf_cfs_rq_list has an order constraint that specifies children must
>> + * appear before parents. For the (!on_list) chain starting at cfs_rq this
>> + * finds a satisfactory insertion point. If no ancestor is yet on_list, this
>> + * choice is arbitrary.
>> + */
>> +static inline struct list_head *find_leaf_cfs_rq_insertion(struct cfs_rq *cfs_rq)
>>  {
>> -	if (!cfs_rq->on_list) {
>> -		/*
>> -		 * Ensure we either appear before our parent (if already
>> -		 * enqueued) or force our parent to appear after us when it is
>> -		 * enqueued. The fact that we always enqueue bottom-up
>> -		 * reduces this to two cases.
>> -		 */
>> -		if (cfs_rq->tg->parent &&
>> -		    cfs_rq->tg->parent->cfs_rq[cpu_of(rq_of(cfs_rq))]->on_list) {
>> -			list_add_rcu(&cfs_rq->leaf_cfs_rq_list,
>> -				&rq_of(cfs_rq)->leaf_cfs_rq_list);
>> -		} else {
>> -			list_add_tail_rcu(&cfs_rq->leaf_cfs_rq_list,
>> -				&rq_of(cfs_rq)->leaf_cfs_rq_list);
>> -		}
>> +	struct sched_entity *se;
>> +
>> +	se = cfs_rq->tg->se[cpu_of(rq_of(cfs_rq))];
>> +	for_each_sched_entity(se)
>> +		if (cfs_rq->on_list)
>> +			return &cfs_rq->leaf_cfs_rq_list;
>
> Need to use cfs_rq corresponding to current se:
>
> -	for_each_sched_entity(se)
> -		if (cfs_rq->on_list)
> -			return &cfs_rq->leaf_cfs_rq_list;
> +	for_each_sched_entity(se) {
> +		struct cfs_rq *se_cfs_rq = cfs_rq_of(se);
> +		if (se_cfs_rq->on_list)
> +			return &se_cfs_rq->leaf_cfs_rq_list;
> +	}

Yeah... that WOULD be a good idea wouldn't it :)

>
>>
>> -		cfs_rq->on_list = 1;
>> +	return &rq_of(cfs_rq)->leaf_cfs_rq_list;
>> +}
>
>
> And something like the following hack to prevent the removal
> of the leaf_insertion_point itself during
> 	enqueue_entity()
> 	  update_cfs_load()
>
> (Obviously not for production:)

Oh.. yeah.. ew.. I don't think this needs a new global_update state to
track. We should just change the call out to list_del_leaf_cfs_rq to
only occur on global updates (e.g. == 1 case) and let them fall out
via update_shares.

>
> diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
> index 2df33d4..947257d 100644
> --- a/kernel/sched_fair.c
> +++ b/kernel/sched_fair.c
> @@ -832,7 +832,7 @@ static void update_cfs_load(struct cfs_rq *cfs_rq, int global_update)
>  	}
>
>  	/* consider updating load contribution on each fold or truncate */
> -	if (global_update || cfs_rq->load_period > period
> +	if (global_update==1 || cfs_rq->load_period > period
>  	    || !cfs_rq->load_period)
>  		update_cfs_rq_load_contribution(cfs_rq, global_update);
>
> @@ -847,7 +847,7 @@ static void update_cfs_load(struct cfs_rq *cfs_rq, int global_update)
>  		cfs_rq->load_avg /= 2;
>  	}
>
> -	if (!cfs_rq->curr && !cfs_rq->nr_running && !cfs_rq->load_avg)
> +	if (!cfs_rq->curr && !cfs_rq->nr_running && !cfs_rq->load_avg && global_update!=2)
>  		list_del_leaf_cfs_rq(cfs_rq);
>  }
>
> @@ -1063,7 +1063,7 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
>  	 * Update run-time statistics of the 'current'.
>  	 */
>  	update_curr(cfs_rq);
> -	update_cfs_load(cfs_rq, 0);
> +	update_cfs_load(cfs_rq, 2);
>  	account_entity_enqueue(cfs_rq, se);
>  	update_cfs_shares(cfs_rq);
>
>

^ permalink raw reply	[flat|nested] 14+ messages in thread
* Re: [PATCH 1/2] sched: Enforce order of leaf CFS runqueues
  2011-07-18 23:24 ` Paul Turner
  2011-07-19 13:08 ` Peter Zijlstra
  2011-07-19 15:17 ` Jan Schönherr
@ 2011-07-21 13:20   ` Jan H. Schönherr
  2011-07-21 13:20     ` [PATCH RFC 1/3] list, treewide: Rename __list_del() to __list_link() Jan H. Schönherr
                        ` (2 more replies)
  2 siblings, 3 replies; 14+ messages in thread
From: Jan H. Schönherr @ 2011-07-21 13:20 UTC (permalink / raw)
  To: Paul Turner
  Cc: Ingo Molnar, Peter Zijlstra, linux-kernel, Jan H. Schönherr

On 19.07.2011 01:24, Paul Turner wrote:
> I wonder if we could move this logic to the enqueue_task_fair() path
> as that is the gating condition to become a leaf for load tracking.
> Ideally we could build up a list of the untracked entities and then
> splice it, but alas we have to use an rculist to track leaf cfs_rq's
> so that we can walk it without locks in update_shares.

Out of curiosity I took a closer look at your "ideal" approach.
If I am not mistaken, this should actually work.

The results come as a reply to this mail.

The problem with the rcu-list is that we cannot put individual leaves
on a different list, as this would confuse concurrent readers. But what
we can do without problems is to assemble the leaves into something like
a "detached list". That is: while we modify the next-pointers of deleted
leaves, we only let them point to other (possibly deleted) leaves of the
same list, so that concurrent readers will still find their way back.

As long as there is no error in my RCU reasoning, we get your ideal
approach, which is also not subject to Peter's "we do more work than
before" objection.

What do you think of this?

Patch 1: After inventing __list_link(), I realized that this function
         already existed, but with a different name. This patch just does
         some renaming. Not really needed, but if I use the old naming in
         patches 2+3, it is really hard to understand what is actually
         going on.

Patch 2: This introduces new list functions to splice RCU lists.

Patch 3: The actual change to the leaf handling.

Regards
Jan

Jan H. Schönherr (3):
  list, treewide: Rename __list_del() to __list_link()
  rcu: More rcu-variants for list manipulation
  sched: Handle on_list ancestor in list_add_leaf_cfs_rq()

 drivers/firewire/core-topology.c     |  2 +-
 drivers/gpu/drm/ttm/ttm_page_alloc.c |  4 +-
 include/linux/list.h                 |  6 ++--
 include/linux/rculist.h              | 28 +++++++++++++++-
 kernel/sched_fair.c                  | 59 ++++++++++++++++++++----------------
 lib/list_debug.c                     |  2 +-
 6 files changed, 73 insertions(+), 28 deletions(-)

-- 
1.7.6

^ permalink raw reply	[flat|nested] 14+ messages in thread
* [PATCH RFC 1/3] list, treewide: Rename __list_del() to __list_link()
  2011-07-21 13:20 ` Jan H. Schönherr
@ 2011-07-21 13:20   ` Jan H. Schönherr
  2011-07-21 13:20 ` [PATCH RFC 2/3] rcu: More rcu-variants for list manipulation Jan H. Schönherr
  2011-07-21 13:20 ` [PATCH RFC 3/3] sched: Handle on_list ancestor in list_add_leaf_cfs_rq() Jan H. Schönherr
  2 siblings, 0 replies; 14+ messages in thread
From: Jan H. Schönherr @ 2011-07-21 13:20 UTC (permalink / raw)
  To: Paul Turner
  Cc: Ingo Molnar, Peter Zijlstra, linux-kernel, Jan H. Schönherr

From: Jan H. Schönherr <schnhrr@cs.tu-berlin.de>

The purpose of __list_del() is to let two list entries point
to each other. It does not actually remove anything. It can be
used to delete something, as is done from within __list_del_entry().

This commit renames __list_del() to __list_link(), as this new name
better reflects what the function actually does.

Signed-off-by: Jan H. Schönherr <schnhrr@cs.tu-berlin.de>

---
We could now rephrase some of the existing list-functions with
__list_link(). Might be more readable.

Also, there are quite a few other follow-up patches when looking
longer at this.
---
 drivers/firewire/core-topology.c     | 2 +-
 drivers/gpu/drm/ttm/ttm_page_alloc.c | 4 ++--
 include/linux/list.h                 | 6 +++---
 include/linux/rculist.h              | 2 +-
 lib/list_debug.c                     | 2 +-
 5 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/drivers/firewire/core-topology.c b/drivers/firewire/core-topology.c
index 193ed92..b7cd1db 100644
--- a/drivers/firewire/core-topology.c
+++ b/drivers/firewire/core-topology.c
@@ -290,7 +290,7 @@ static struct fw_node *build_tree(struct fw_card *card,
 		}
 
 		/* Pop the child nodes off the stack and push the new node. */
-		__list_del(h->prev, &stack);
+		__list_link(h->prev, &stack);
 		list_add_tail(&node->link, &stack);
 
 		stack_depth += 1 - child_port_count;
diff --git a/drivers/gpu/drm/ttm/ttm_page_alloc.c b/drivers/gpu/drm/ttm/ttm_page_alloc.c
index d948575..fd4443a 100644
--- a/drivers/gpu/drm/ttm/ttm_page_alloc.c
+++ b/drivers/gpu/drm/ttm/ttm_page_alloc.c
@@ -331,7 +331,7 @@ restart:
 		/* We can only remove NUM_PAGES_TO_ALLOC at a time. */
 		if (freed_pages >= NUM_PAGES_TO_ALLOC) {
 			/* remove range of pages from the pool */
-			__list_del(p->lru.prev, &pool->list);
+			__list_link(p->lru.prev, &pool->list);
 
 			ttm_pool_update_free_locked(pool, freed_pages);
 			/**
@@ -366,7 +366,7 @@ restart:
 
 	/* remove range of pages from the pool */
 	if (freed_pages) {
-		__list_del(&p->lru, &pool->list);
+		__list_link(&p->lru, &pool->list);
 
 		ttm_pool_update_free_locked(pool, freed_pages);
 		nr_free -= freed_pages;
diff --git a/include/linux/list.h b/include/linux/list.h
index cc6d2aa..025b883 100644
--- a/include/linux/list.h
+++ b/include/linux/list.h
@@ -77,13 +77,13 @@ static inline void list_add_tail(struct list_head *new, struct list_head *head)
 }
 
 /*
- * Delete a list entry by making the prev/next entries
+ * Link two list entries by making the prev/next entries
  * point to each other.
  *
  * This is only for internal list manipulation where we know
  * the prev/next entries already!
  */
-static inline void __list_del(struct list_head * prev, struct list_head * next)
+static inline void __list_link(struct list_head *prev, struct list_head *next)
 {
 	next->prev = prev;
 	prev->next = next;
@@ -98,7 +98,7 @@ static inline void __list_del(struct list_head * prev, struct list_head * next)
 #ifndef CONFIG_DEBUG_LIST
 static inline void __list_del_entry(struct list_head *entry)
 {
-	__list_del(entry->prev, entry->next);
+	__list_link(entry->prev, entry->next);
 }
 
 static inline void list_del(struct list_head *entry)
diff --git a/include/linux/rculist.h b/include/linux/rculist.h
index e3beb31..748401b 100644
--- a/include/linux/rculist.h
+++ b/include/linux/rculist.h
@@ -108,7 +108,7 @@ static inline void list_add_tail_rcu(struct list_head *new,
  */
 static inline void list_del_rcu(struct list_head *entry)
 {
-	__list_del(entry->prev, entry->next);
+	__list_link(entry->prev, entry->next);
 	entry->prev = LIST_POISON2;
 }
 
diff --git a/lib/list_debug.c b/lib/list_debug.c
index b8029a5..7e37695 100644
--- a/lib/list_debug.c
+++ b/lib/list_debug.c
@@ -56,7 +56,7 @@ void __list_del_entry(struct list_head *entry)
 		"but was %p\n", entry, next->prev))
 		return;
 
-	__list_del(prev, next);
+	__list_link(prev, next);
 }
 EXPORT_SYMBOL(__list_del_entry);
 
-- 
1.7.6

^ permalink raw reply related	[flat|nested] 14+ messages in thread
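[Editor's note: the rename is purely semantic, which a small userspace sketch
makes plain. The two `list_add_tail()`/`list_del_entry()` wrappers below are
illustrative, in the spirit of the "rephrase existing list-functions with
__list_link()" remark above; they are not the kernel implementations.]

```c
#include <assert.h>

/* Userspace rendition of the renamed helper: __list_link() just makes two
 * entries point at each other. "Deleting" a middle entry is nothing more
 * than linking its two neighbours. */
struct list_head { struct list_head *next, *prev; };

static void init_list(struct list_head *h) { h->next = h->prev = h; }

static void __list_link(struct list_head *prev, struct list_head *next)
{
        next->prev = prev;
        prev->next = next;
}

/* Both addition and deletion expressed via __list_link(). */
static void list_add_tail(struct list_head *new, struct list_head *head)
{
        __list_link(head->prev, new);
        __list_link(new, head);
}

static void list_del_entry(struct list_head *entry)
{
        __list_link(entry->prev, entry->next);
}
```

Building head, a, b, c and then calling `list_del_entry(&b)` leaves a and c
linked directly to each other, showing why "link" describes the primitive
better than "del".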
* [PATCH RFC 2/3] rcu: More rcu-variants for list manipulation
  2011-07-21 13:20 ` Jan H. Schönherr
  2011-07-21 13:20 ` [PATCH RFC 1/3] list, treewide: Rename __list_del() to __list_link() Jan H. Schönherr
@ 2011-07-21 13:20 ` Jan H. Schönherr
  2011-07-22 16:21   ` Paul E. McKenney
  2011-07-21 13:20 ` [PATCH RFC 3/3] sched: Handle on_list ancestor in list_add_leaf_cfs_rq() Jan H. Schönherr
  2 siblings, 1 reply; 14+ messages in thread
From: Jan H. Schönherr @ 2011-07-21 13:20 UTC (permalink / raw)
To: Paul Turner
Cc: Ingo Molnar, Peter Zijlstra, linux-kernel, Jan H. Schönherr

From: Jan H. Schönherr <schnhrr@cs.tu-berlin.de>

This commit introduces some more *_rcu variants of some list
manipulation functions:

	__list_link_rcu()
	__list_splice_rcu()
	list_splice_tail_init_rcu()

An important difference of the new rcu-splice functions compared to
the existing function is that they require that there are either no
concurrent readers on the list to be spliced or that these readers
are still from the list where the list is spliced into.

Signed-off-by: Jan H. Schönherr <schnhrr@cs.tu-berlin.de>

---
TODO:
- add comments
- rephrase existing splice function using __list_splice_rcu()
  and probably add "_sync" to the name
---
 include/linux/rculist.h |   26 ++++++++++++++++++++++++++
 1 files changed, 26 insertions(+), 0 deletions(-)

diff --git a/include/linux/rculist.h b/include/linux/rculist.h
index 748401b..56385c8 100644
--- a/include/linux/rculist.h
+++ b/include/linux/rculist.h
@@ -24,6 +24,15 @@
  */
 #define list_next_rcu(list)	(*((struct list_head __rcu **)(&(list)->next)))
 
+
+static inline void __list_link_rcu(struct list_head *prev,
+				   struct list_head *next)
+{
+	rcu_assign_pointer(list_next_rcu(prev), next);
+	next->prev = prev;
+}
+
+
 /*
  * Insert a new entry between two known consecutive entries.
  *
@@ -158,6 +167,23 @@ static inline void list_replace_rcu(struct list_head *old,
 	old->prev = LIST_POISON2;
 }
 
+static inline void __list_splice_rcu(struct list_head *list,
+				     struct list_head *prev,
+				     struct list_head *next)
+{
+	__list_link(list->prev, next);
+	__list_link_rcu(prev, list->next);
+}
+
+static inline void list_splice_tail_init_rcu(struct list_head *list,
+					     struct list_head *head)
+{
+	if (!list_empty(list)) {
+		__list_splice_rcu(list, head->prev, head);
+		INIT_LIST_HEAD(list);
+	}
+}
+
 /**
  * list_splice_init_rcu - splice an RCU-protected list into an existing list.
  * @list:	the RCU-protected list to splice
-- 
1.7.6

^ permalink raw reply related	[flat|nested] 14+ messages in thread
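[The splice primitives above are easiest to convince yourself of outside the kernel. The following is a minimal userspace sketch, not the kernel code: the list type and helpers are redefined locally, and a plain store stands in for rcu_assign_pointer(), so it demonstrates only the pointer manipulation, not the memory-ordering guarantees. It shows that list_splice_tail_init_rcu() places the collected entries, in order, immediately before the destination head.]

```c
#include <assert.h>
#include <stddef.h>

/* Local stand-in for the kernel's struct list_head. */
struct list_head {
	struct list_head *next, *prev;
};

static void INIT_LIST_HEAD(struct list_head *h)
{
	h->next = h;
	h->prev = h;
}

static int list_empty(const struct list_head *h)
{
	return h->next == h;
}

/* Non-RCU link (the renamed __list_del() from patch 1/3). */
static void __list_link(struct list_head *prev, struct list_head *next)
{
	next->prev = prev;
	prev->next = next;
}

/* In the kernel the ->next store would be rcu_assign_pointer(). */
static void __list_link_rcu(struct list_head *prev, struct list_head *next)
{
	prev->next = next;
	next->prev = prev;
}

static void __list_splice_rcu(struct list_head *list,
			      struct list_head *prev,
			      struct list_head *next)
{
	__list_link(list->prev, next);     /* readers of the target list see no change */
	__list_link_rcu(prev, list->next); /* now the segment becomes visible */
}

static void list_splice_tail_init_rcu(struct list_head *list,
				      struct list_head *head)
{
	if (!list_empty(list)) {
		__list_splice_rcu(list, head->prev, head);
		INIT_LIST_HEAD(list);
	}
}
```

[Splicing a segment (x, y) into a list (a, b) yields (a, b, x, y): the segment lands at the tail, right before the head node, which is exactly the position the sched patch relies on.]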
* Re: [PATCH RFC 2/3] rcu: More rcu-variants for list manipulation
  2011-07-21 13:20 ` [PATCH RFC 2/3] rcu: More rcu-variants for list manipulation Jan H. Schönherr
@ 2011-07-22 16:21   ` Paul E. McKenney
  2011-07-23 18:41     ` Jan Schönherr
  0 siblings, 1 reply; 14+ messages in thread
From: Paul E. McKenney @ 2011-07-22 16:21 UTC (permalink / raw)
To: Jan H. Schönherr
Cc: Paul Turner, Ingo Molnar, Peter Zijlstra, linux-kernel

On Thu, Jul 21, 2011 at 03:20:22PM +0200, Jan H. Schönherr wrote:
> From: Jan H. Schönherr <schnhrr@cs.tu-berlin.de>
>
> This commit introduces some more *_rcu variants of some list
> manipulation functions:
>
> 	__list_link_rcu()
> 	__list_splice_rcu()
> 	list_splice_tail_init_rcu()
>
> An important difference of the new rcu-splice functions
> compared to the existing function is that they require
> that there are either no concurrent readers on the list to
> be spliced or that these readers are still from the list
> where the list is spliced into.

Please see below for some comments on issues you are probably already
well aware of -- just commenting, the code itself looks OK.  Once this
patch is in good shape, it should probably go with the following patch
(which uses it) rather than up -rcu.

							Thanx, Paul

> Signed-off-by: Jan H. Schönherr <schnhrr@cs.tu-berlin.de>
>
> ---
> TODO:
> - add comments
> - rephrase existing splice function using __list_splice_rcu()
>   and probably add "_sync" to the name
> ---
>  include/linux/rculist.h |   26 ++++++++++++++++++++++++++
>  1 files changed, 26 insertions(+), 0 deletions(-)
>
> diff --git a/include/linux/rculist.h b/include/linux/rculist.h
> index 748401b..56385c8 100644
> --- a/include/linux/rculist.h
> +++ b/include/linux/rculist.h
> @@ -24,6 +24,15 @@
>   */
>  #define list_next_rcu(list)	(*((struct list_head __rcu **)(&(list)->next)))
>

Badly need a header comment here!!!  This needs to include locking
constraints and RCU reader constraints, including why you have "prev"
then "next" rather than the other way around.  (Which I believe I
figured out below.)

> +static inline void __list_link_rcu(struct list_head *prev,
> +				   struct list_head *next)
> +{
> +	rcu_assign_pointer(list_next_rcu(prev), next);
> +	next->prev = prev;
> +}

OK, so this one links a single list element to another single list
element.  You therefore need two calls to this to splice a pair of
lists together, right?  If so, there will be a time when you have a
"6" shaped list, so that readers from one of the list heads will never
come back to that list head.  This is probably what you are getting at
with your "require that there are either no concurrent readers on the
list to be spliced or that these readers are still from the list where
the list is spliced into".

However, this is quite subtle and needs to be spelled out carefully.
As I understand it, when you splice together a list A and a list B,
then list A cannot have any concurrent readers, but list B can.  And
then you have to do the __list_link_rcu() operations (along maybe with
__list_link() operations) in a particular order.

These constraints need to be carefully documented, and uses need to be
commented.

> +
> +
>  /*
>   * Insert a new entry between two known consecutive entries.
>   *
> @@ -158,6 +167,23 @@ static inline void list_replace_rcu(struct list_head *old,
>  	old->prev = LIST_POISON2;
>  }

Really badly need a header comment here!!!  Which of these lists can
tolerate concurrent readers?  Why are there three arguments?  My guess
from the code is that "list" is the list being spliced, and it cannot
tolerate concurrent readers, while "prev" is one end of the place to
splice into the list and "next" is the other.  I believe that both
"prev" and "next" can tolerate concurrent readers.

Did I get it right?  Of course, if I got it wrong, that just proves
how important the documentation is!  ;-)

And I still don't understand why the argument order isn't list, next,
then prev -- after all, isn't that the order that the elements will
appear on the spliced list?  Ah, you are worried about the local order
at the splice point...  OK, then please include this in the comment.

Yeah, I know, the include/linux/list.h comments aren't all that
detailed.  On the other hand, the include/linux/list.h primitives
don't have to worry about concurrent readers.  So please add detailed
comments.  ;-)

> +static inline void __list_splice_rcu(struct list_head *list,
> +				     struct list_head *prev,
> +				     struct list_head *next)
> +{
> +	__list_link(list->prev, next);

At this point, RCU readers traversing the list containing "prev" and
"next" see no change, which is why it is OK to use __list_link()
rather than __list_link_rcu().  If there are readers traversing
"list", they will loop infinitely failing to find the old list header.

> +	__list_link_rcu(prev, list->next);

At this point, new RCU readers traversing the list containing "next"
and "prev" suddenly see the new elements from "list".  The
rcu_assign_pointer() in __list_link_rcu() should force proper
ordering.

OK, so looks reasonable.  Assuming that my guesses about what this is
supposed to do are correct, anyway.

> +}

And badly need a comment here, especially regarding which list is
being spliced into which.  The usual convention would be that list is
being spliced into head.

> +static inline void list_splice_tail_init_rcu(struct list_head *list,
> +					     struct list_head *head)
> +{
> +	if (!list_empty(list)) {
> +		__list_splice_rcu(list, head->prev, head);
> +		INIT_LIST_HEAD(list);
> +	}
> +}

OK, so if "list" is non-empty, we splice its elements to precede
"head".

							Thanx, Paul

> /**
>  * list_splice_init_rcu - splice an RCU-protected list into an existing list.
>  * @list:	the RCU-protected list to splice
> --
> 1.7.6
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 14+ messages in thread
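[Paul's "6"-shaped-list observation can be made concrete. This hedged userspace sketch (local stand-ins for the kernel types, no actual RCU or concurrency) performs only the first half of __list_splice_rcu(); a walker that starts inside the spliced list then drains into the destination list and never returns to its own head, which is why readers on the list being spliced must be forbidden.]

```c
#include <assert.h>

struct list_head {
	struct list_head *next, *prev;
};

static void __list_link(struct list_head *prev, struct list_head *next)
{
	next->prev = prev;
	prev->next = next;
}

/*
 * Walk forward from "start"; return the number of steps needed to
 * reach "target", or -1 if it is not reached within "limit" steps.
 * Bounding the walk is what lets us observe the would-be infinite
 * loop safely.
 */
static int steps_until(struct list_head *start, struct list_head *target,
		       int limit)
{
	struct list_head *p = start;
	int n;

	for (n = 0; n < limit; n++) {
		p = p->next;
		if (p == target)
			return n + 1;
	}
	return -1;
}
```

[The test below builds the "6" shape by hand: after `__list_link(seg.prev, &head)` the tail of the spliced list feeds into the destination list, so a reader standing inside the old list can reach the destination head but can no longer reach its old head.]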
* Re: [PATCH RFC 2/3] rcu: More rcu-variants for list manipulation
  2011-07-22 16:21 ` Paul E. McKenney
@ 2011-07-23 18:41   ` Jan Schönherr
  0 siblings, 0 replies; 14+ messages in thread
From: Jan Schönherr @ 2011-07-23 18:41 UTC (permalink / raw)
To: paulmck; +Cc: Paul Turner, Ingo Molnar, Peter Zijlstra, linux-kernel

Am 22.07.2011 18:21, schrieb Paul E. McKenney:
> Please see below for some comments on issues you are probably already
> well aware of -- just commenting, the code itself looks OK.

Nice to hear.

> Once this patch is in good shape, it should probably go with the
> following patch (which uses it) rather than up -rcu.

Okay.

Whether I'll put this code into good shape or not depends a bit on the
last patch using this functionality. You are of course right with the
missing comments, but if this way of solving the problem is rejected
(or has some other problems that I might have overlooked), then there
is no point in writing all the comments... (That is also why I did not
put you on CC yet.)

If we decide to do it that way, you will get an updated version,
hopefully with all the comments you want to see in there. :)

>> +static inline void __list_link_rcu(struct list_head *prev,
>> +				    struct list_head *next)
>> +{
>> +	rcu_assign_pointer(list_next_rcu(prev), next);
>> +	next->prev = prev;
>> +}
>
> OK, so this one links a single list element to another single list
> element.  You therefore need two calls to this to splice a pair of
> lists together, right?

Yes (or to the non-rcu version).

> If so, there will be a time when you have a "6" shaped list, so that
> readers from one of the list heads will never come back to that list
> head.  This is probably what you are getting at with your "require
> that there are either no concurrent readers on the list to be spliced
> or that these readers are still from the list where the list is
> spliced into".

Exactly. No reader must get trapped in a circle. They all must be able
to return to their list head.

> However, this is quite subtle and needs to be spelled out carefully.

I will try to do that.

> Really badly need a header comment here!!!  Which of these lists can
> tolerate concurrent readers?  Why are there three arguments?  My
> guess from the code is that "list" is the list being spliced, and it
> cannot tolerate concurrent readers, while "prev" is one end of the
> place to splice into the list and "next" is the other.  I believe
> that both "prev" and "next" can tolerate concurrent readers.
>
> Did I get it right?

Everything. :)

Regarding the function arguments, I tried to mimic those of the
non-rcu variants.

>> +static inline void __list_splice_rcu(struct list_head *list,
>> +				      struct list_head *prev,
>> +				      struct list_head *next)
>> +{
>> +	__list_link(list->prev, next);
>
> At this point, RCU readers traversing the list containing "prev" and
> "next" see no change, which is why it is OK to use __list_link()
> rather than __list_link_rcu().  If there are readers traversing
> "list", they will loop infinitely failing to find the old list
> header.

Either that, or they do something worse as they mistake the new list
header for a normal node and try to access some of the elements of the
surrounding struct, which does not exist in this case.

>> +	__list_link_rcu(prev, list->next);
>
> At this point, new RCU readers traversing the list containing "next"
> and "prev" suddenly see the new elements from "list".  The
> rcu_assign_pointer() in __list_link_rcu() should force proper
> ordering.
>
> OK, so looks reasonable.  Assuming that my guesses about what this is
> supposed to do are correct, anyway.

I should probably say something about the possible use cases of this:

One could of course rewrite the already existing rcu-splice function
to use this low level functionality.

But it gets more interesting if you want to add multiple elements to
an rcu-protected list: Instead of doing that one by one with
list_add_rcu() you can assemble them in a non-rcu list and splice them
(avoiding quite a few memory barriers).

The use case in the 3rd patch is even more special: The elements that
are spliced were removed from the very same list previously and there
might still be readers on them from before the removal. Due to this we
cannot collect this batch in a new list. But we can link one deleted
entry to the next (using the non-rcu(!) __list_link) building up a
chain of deleted entries. If there was a reader, it will find its way
back to the list over the last element in that chain. That chain is
then passed to the splice function.

(That chain-handling is currently in the third patch. I wonder whether
we should move that to one of the header files, too?!)

Regards
Jan

^ permalink raw reply	[flat|nested] 14+ messages in thread
* [PATCH RFC 3/3] sched: Handle on_list ancestor in list_add_leaf_cfs_rq()
  2011-07-21 13:20 ` Jan H. Schönherr
  2011-07-21 13:20 ` [PATCH RFC 1/3] list, treewide: Rename __list_del() to __list_link() Jan H. Schönherr
  2011-07-21 13:20 ` [PATCH RFC 2/3] rcu: More rcu-variants for list manipulation Jan H. Schönherr
@ 2011-07-21 13:20 ` Jan H. Schönherr
  2 siblings, 0 replies; 14+ messages in thread
From: Jan H. Schönherr @ 2011-07-21 13:20 UTC (permalink / raw)
To: Paul Turner
Cc: Ingo Molnar, Peter Zijlstra, linux-kernel, Jan H. Schönherr

From: Jan H. Schönherr <schnhrr@cs.tu-berlin.de>

In the case of an on_list ancestor we may incorrectly place the child
to the right of a great-ancestor on the list. Consider:

        A
       / \
      B  t1A
     /
    C
   /
  t1C

Here, t1A results in A->cfs_rq being on_list, however when we start
enqueuing from C this will not be visible. This is compounded by the
fact that on_list expiration may also be out of order, punching holes
in the tree.

Prevent this by making additions to the leaf_cfs_rq_list position
independent. This is done by collecting additions to this list within
the enqueue_task_fair() path until we reach the top or find an on_list
ancestor. All additions are then spliced into the leaf_cfs_rq_list at
once.

The additions cannot be collected in a list with the normal means as
this violates the RCU guarantee that the next pointer of an element
always leads back to the list as long as there could be concurrent
readers. However, we are still allowed to modify the next pointer of
deleted entries as long as we make sure that they eventually return to
the list. That is, if we have to collect more than one entry in a row,
it is legal to set the next-pointer of the first collected entry to
the next one. (We can do this even without using *_rcu functions, as
there are no new elements on that path.)

Suggested-by: Paul Turner <pjt@google.com>
Signed-off-by: Jan H.
Schönherr <schnhrr@cs.tu-berlin.de>
---
 kernel/sched_fair.c |   59 +++++++++++++++++++++++++++++----------------
 1 files changed, 39 insertions(+), 20 deletions(-)

diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 433491c..f98b424 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -143,26 +143,41 @@ static inline struct cfs_rq *cpu_cfs_rq(struct cfs_rq *cfs_rq, int this_cpu)
 	return cfs_rq->tg->cfs_rq[this_cpu];
 }
 
-static inline void list_add_leaf_cfs_rq(struct cfs_rq *cfs_rq)
+static inline void list_add_leaf_cfs_rq(struct cfs_rq *cfs_rq,
+					struct list_head *leaf_segment)
 {
-	if (!cfs_rq->on_list) {
-		/*
-		 * Ensure we either appear before our parent (if already
-		 * enqueued) or force our parent to appear after us when it is
-		 * enqueued. The fact that we always enqueue bottom-up
-		 * reduces this to two cases.
-		 */
-		if (cfs_rq->tg->parent &&
-		    cfs_rq->tg->parent->cfs_rq[cpu_of(rq_of(cfs_rq))]->on_list) {
-			list_add_rcu(&cfs_rq->leaf_cfs_rq_list,
-				&rq_of(cfs_rq)->leaf_cfs_rq_list);
-		} else {
-			list_add_tail_rcu(&cfs_rq->leaf_cfs_rq_list,
-				&rq_of(cfs_rq)->leaf_cfs_rq_list);
-		}
+	if (cfs_rq->on_list)
+		return;
 
-		cfs_rq->on_list = 1;
+	if (list_empty(leaf_segment)) {
+		leaf_segment->next = &cfs_rq->leaf_cfs_rq_list;
+		leaf_segment->prev = &cfs_rq->leaf_cfs_rq_list;
+	} else {
+		__list_link(leaf_segment->prev, &cfs_rq->leaf_cfs_rq_list);
+		leaf_segment->prev = leaf_segment->prev->next;
 	}
+
+	/*
+	 * If our parent is on_list or if there is no parent, then splice all
+	 * entries collected so far at the correct position into the
+	 * leaf_cfs_rq_list.
+	 *
+	 * If our parent is not on the list, it will be collected during the
+	 * next call to this function.
+	 */
+	if (cfs_rq->tg->parent) {
+		int cpu = cpu_of(rq_of(cfs_rq));
+		struct cfs_rq *parent_cfs_rq = cfs_rq->tg->parent->cfs_rq[cpu];
+		if (parent_cfs_rq->on_list) {
+			list_splice_tail_init_rcu(leaf_segment,
+					&parent_cfs_rq->leaf_cfs_rq_list);
+		}
+	} else {
+		list_splice_tail_init_rcu(leaf_segment,
+					&rq_of(cfs_rq)->leaf_cfs_rq_list);
+	}
+
+	cfs_rq->on_list = 1;
 }
 
 static inline void list_del_leaf_cfs_rq(struct cfs_rq *cfs_rq)
@@ -276,7 +291,8 @@ static inline struct cfs_rq *cpu_cfs_rq(struct cfs_rq *cfs_rq, int this_cpu)
 	return &cpu_rq(this_cpu)->cfs;
 }
 
-static inline void list_add_leaf_cfs_rq(struct cfs_rq *cfs_rq)
+static inline void list_add_leaf_cfs_rq(struct cfs_rq *cfs_rq,
+					struct list_head *leaf_segment)
 {
 }
 
@@ -998,8 +1014,6 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	__enqueue_entity(cfs_rq, se);
 	se->on_rq = 1;
 
-	if (cfs_rq->nr_running == 1)
-		list_add_leaf_cfs_rq(cfs_rq);
 }
 
 static void __clear_buddies_last(struct sched_entity *se)
@@ -1326,12 +1340,17 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 {
 	struct cfs_rq *cfs_rq;
 	struct sched_entity *se = &p->se;
+	struct list_head leaf_segment;
+
+	INIT_LIST_HEAD(&leaf_segment);
 
 	for_each_sched_entity(se) {
 		if (se->on_rq)
 			break;
 		cfs_rq = cfs_rq_of(se);
 		enqueue_entity(cfs_rq, se, flags);
+		if (cfs_rq->nr_running == 1)
+			list_add_leaf_cfs_rq(cfs_rq, &leaf_segment);
 		flags = ENQUEUE_WAKEUP;
 	}
-- 
1.7.6

^ permalink raw reply related	[flat|nested] 14+ messages in thread
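[Putting the pieces together, the collect-then-splice logic of the patched list_add_leaf_cfs_rq() can be modeled in userspace. This sketch uses simplified stand-in types (a hypothetical struct crq instead of struct cfs_rq, a plain store instead of rcu_assign_pointer(), no per-CPU arrays, locking, or RCU) and reproduces the cover-letter scenario: tg3 is already on the list, then a task is enqueued in tg1, and the collected segment (tg1, tg2) is still spliced to the left of tg3.]

```c
#include <assert.h>
#include <stddef.h>

struct list_head {
	struct list_head *next, *prev;
};

static void INIT_LIST_HEAD(struct list_head *h) { h->next = h->prev = h; }
static int list_empty(const struct list_head *h) { return h->next == h; }

static void __list_link(struct list_head *prev, struct list_head *next)
{
	next->prev = prev;
	prev->next = next;
}

static void list_splice_tail_init_rcu(struct list_head *list,
				      struct list_head *head)
{
	if (!list_empty(list)) {
		struct list_head *prev = head->prev;

		__list_link(list->prev, head);
		__list_link(prev, list->next); /* rcu_assign_pointer() in the kernel */
		INIT_LIST_HEAD(list);
	}
}

/* Stand-in for struct cfs_rq; "leaf" plays leaf_cfs_rq_list. */
struct crq {
	int on_list;
	struct list_head leaf;
	struct crq *parent;
};

/* Collect-then-splice logic of the patched list_add_leaf_cfs_rq(). */
static void add_leaf(struct crq *c, struct list_head *seg,
		     struct list_head *top)
{
	if (c->on_list)
		return;

	/* append this runqueue to the open segment */
	if (list_empty(seg)) {
		seg->next = &c->leaf;
		seg->prev = &c->leaf;
	} else {
		__list_link(seg->prev, &c->leaf);
		seg->prev = seg->prev->next;
	}

	/* splice before an on_list parent, or at the front for the root */
	if (c->parent) {
		if (c->parent->on_list)
			list_splice_tail_init_rcu(seg, &c->parent->leaf);
	} else {
		list_splice_tail_init_rcu(seg, top);
	}
	c->on_list = 1;
}
```

[The test replays the cover letter's failing sequence: enqueue in tg3, then enqueue in tg1 (walking bottom-up through tg2). With the segment logic the final order is tg1, tg2, tg3 — children strictly before parents — where the original code produced tg2, tg3, tg1.]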
* [PATCH 2/2] sched: Prevent removal of leaf CFS runqueues with on_list children
  2011-07-18 10:50 [PATCH 0/2] Enforce hierarchical order of leaf CFS runqueues Jan H. Schönherr
  2011-07-18 10:50 ` [PATCH 1/2] sched: Enforce " Jan H. Schönherr
@ 2011-07-18 10:50 ` Jan H. Schönherr
  1 sibling, 0 replies; 14+ messages in thread
From: Jan H. Schönherr @ 2011-07-18 10:50 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra
Cc: Paul Turner, linux-kernel, Jan H. Schönherr

From: Jan H. Schönherr <schnhrr@cs.tu-berlin.de>

Currently there are (at least) two situations where a parent gets
removed from the list of leaf CFS runqueues although some of its
children are still on the list.

This patch adds a counter for children depending on a parent,
preventing the parent from being removed too early.

Consider three task groups with these parent pointers:

	tg1 --> tg2 --> tg3

Situation 1:

1. Enqueue task A in tg1
2. Dequeue task A
3. Enqueue task B in tg2

In step 1 all three runqueues are added as leaves. In step 2 the
primary reason for being on the list vanishes, but all three runqueues
are kept on the list. In step 3 we call enqueue_entity() for tg2.
There, we call update_cfs_load() before increasing nr_running.
Therefore tg2 can be removed from the leaf list. Note that
enqueue_entity() will re-add tg2 at the end, but this will mix up the
order as we do not provide tg1 as prev_cfs_rq.

Situation 2:

There is no guarantee that cfs_rq->load_avg of tg1 drops to zero
before that of tg2. Hence, tg2 can be removed before tg1. If something
is enqueued in tg2 after its removal, it would be added at the wrong
position. Enqueuing something in tg1 before it is removed would
further prevent the situation from sorting out itself.

Signed-off-by: Jan H.
Schönherr <schnhrr@cs.tu-berlin.de>
---
 kernel/sched.c      |    1 +
 kernel/sched_fair.c |   10 +++++++++-
 2 files changed, 10 insertions(+), 1 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 9769c75..88c83b8 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -344,6 +344,7 @@ struct cfs_rq {
 	 * list is used during load balance.
 	 */
 	int on_list;
+	int children_on_list;
 	struct list_head leaf_cfs_rq_list;
 
 	struct task_group *tg;	/* group that "owns" this runqueue */
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index d021c75..1c7dd1d 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -166,14 +166,22 @@ static inline void list_add_leaf_cfs_rq(struct cfs_rq *cfs_rq,
 		list_add_rcu(&cfs_rq->leaf_cfs_rq_list, prev_leaf_cfs_rq);
 
 		cfs_rq->on_list = 1;
+		if (cfs_rq->tg->parent) {
+			int cpu = cpu_of(rq_of(cfs_rq));
+			cfs_rq->tg->parent->cfs_rq[cpu]->children_on_list++;
+		}
 	}
 }
 
 static inline void list_del_leaf_cfs_rq(struct cfs_rq *cfs_rq)
 {
-	if (cfs_rq->on_list) {
+	if (cfs_rq->on_list && !cfs_rq->children_on_list) {
 		list_del_rcu(&cfs_rq->leaf_cfs_rq_list);
 		cfs_rq->on_list = 0;
+		if (cfs_rq->tg->parent) {
+			int cpu = cpu_of(rq_of(cfs_rq));
+			cfs_rq->tg->parent->cfs_rq[cpu]->children_on_list--;
+		}
 	}
 }
-- 
1.7.3.4

^ permalink raw reply related	[flat|nested] 14+ messages in thread
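[The children_on_list bookkeeping reduces to a small invariant that can be checked in isolation. The following is a hedged userspace model (a hypothetical struct crq standing in for struct cfs_rq, single CPU, no locking and no actual list manipulation): a parent may only leave the leaf list once every on_list child has left it first.]

```c
#include <assert.h>
#include <stddef.h>

/* Minimal stand-in: only the fields the counter logic touches. */
struct crq {
	int on_list;
	int children_on_list;
	struct crq *parent;
};

static void leaf_add(struct crq *c)
{
	if (c->on_list)
		return;
	c->on_list = 1;
	if (c->parent)
		c->parent->children_on_list++;
}

static void leaf_del(struct crq *c)
{
	/* A parent with on_list children must stay on the list. */
	if (!c->on_list || c->children_on_list)
		return;
	c->on_list = 0;
	if (c->parent)
		c->parent->children_on_list--;
}
```

[This is exactly the situation-1/situation-2 fix: a premature leaf_del() on the parent becomes a no-op, so patch 1's ordering logic never has to re-insert a runqueue above still-listed children.]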