* [PATCH v3 0/2] sched: Enforce order of leaf CFS runqueues
@ 2011-08-16 14:07 Jan H. Schönherr
2011-08-16 14:07 ` [PATCH v3 1/2] rcu: More rcu-variants for list manipulation Jan H. Schönherr
2011-08-16 14:07 ` [PATCH v3 2/2] sched: Handle on_list ancestor in list_add_leaf_cfs_rq() Jan H. Schönherr
0 siblings, 2 replies; 10+ messages in thread
From: Jan H. Schönherr @ 2011-08-16 14:07 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra
Cc: Paul Turner, Dipankar Sarma, Paul E. McKenney, linux-kernel,
Jan H. Schönherr
Code review showed that the hierarchical order of the leaf CFS runqueues
introduced by commit 67e8625 ("sched: Introduce hierarchal order on shares
update list") is not yet achieved. (See the description of the second patch
for an example.)
During the discussion of possible solutions [1], Paul Turner mentioned an
"ideal approach" to solve that.
This is the third iteration of the patch-set that tries to realize this ideal
approach.
Changes since v2 (http://lkml.org/lkml/2011/7/27/348):
- rebased against v3.1-rc2
- dropped patches that are unrelated to the actual bug fix
- fixed a race where freed memory might be accessed
- addressed Peter's comment about *not* providing this kind of
RCU list manipulation as a separate function
Changes since v1 (http://lkml.org/lkml/2011/7/21/169):
- rebased against v3.0
- included follow-up patches 4 to 8 (demonstrating the purpose of patch 1)
- patch 1 should be complete this time
- moved more functionality to rculist.h (see patch 2+3)
- more comments everywhere
Patch 1: New functions to splice RCU lists.
Patch 2: The actual bugfix.
Regards
Jan
[1] Original discussion: http://lkml.org/lkml/2011/7/18/86
Jan H. Schönherr (2):
rcu: More rcu-variants for list manipulation
sched: Handle on_list ancestor in list_add_leaf_cfs_rq()
include/linux/list.h | 12 +++++++
include/linux/rculist.h | 74 ++++++++++++++++++++++++++++++++++++++++++++
kernel/sched_fair.c | 79 ++++++++++++++++++++++++++++++++++++-----------
3 files changed, 147 insertions(+), 18 deletions(-)
--
1.7.6
^ permalink raw reply	[flat|nested] 10+ messages in thread

* [PATCH v3 1/2] rcu: More rcu-variants for list manipulation
  2011-08-16 14:07 [PATCH v3 0/2] sched: Enforce order of leaf CFS runqueues Jan H. Schönherr
@ 2011-08-16 14:07 ` Jan H. Schönherr
  2011-08-23 15:37   ` Paul E. McKenney
  2011-08-16 14:07 ` [PATCH v3 2/2] sched: Handle on_list ancestor in list_add_leaf_cfs_rq() Jan H. Schönherr
  1 sibling, 1 reply; 10+ messages in thread
From: Jan H. Schönherr @ 2011-08-16 14:07 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra
Cc: Paul Turner, Dipankar Sarma, Paul E. McKenney, linux-kernel,
	Jan H. Schönherr

From: Jan H. Schönherr <schnhrr@cs.tu-berlin.de>

This commit introduces __list_link() and some more *_rcu variants of some
list manipulation functions:

* __list_link()

  Takes two list elements and links them via their next/prev pointers.

* __list_link_rcu()

  RCU-aware counterpart to __list_link().

* __list_splice_rcu(), list_splice_tail_init_rcu()

  RCU-aware non-blocking splicing. The existing RCU splice function allows
  concurrent readers on the list to be spliced that are not from the
  destination list. Therefore, it has to block to get these readers off
  the list before the splicing can be done. The new functions forbid such
  readers and can thus work without blocking.

Signed-off-by: Jan H. Schönherr <schnhrr@cs.tu-berlin.de>
---
Changes since v2:
- merged __list_link() from a dropped patch
- dropped list_add_tail_nobackref() again

Changes since v1:
- added comments
- added list_add_tail_nobackref()
---
 include/linux/list.h    |   12 +++++++
 include/linux/rculist.h |   74 ++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 86 insertions(+), 0 deletions(-)

diff --git a/include/linux/list.h b/include/linux/list.h
index cc6d2aa..0a7b057 100644
--- a/include/linux/list.h
+++ b/include/linux/list.h
@@ -21,6 +21,18 @@
 #define LIST_HEAD(name) \
 	struct list_head name = LIST_HEAD_INIT(name)
 
+/*
+ * Link two list entries by making the prev/next entries
+ * point to each other.
+ *
+ * This is only for internal list manipulation.
+ */
+static inline void __list_link(struct list_head *prev, struct list_head *next)
+{
+	next->prev = prev;
+	prev->next = next;
+}
+
 static inline void INIT_LIST_HEAD(struct list_head *list)
 {
 	list->next = list;
diff --git a/include/linux/rculist.h b/include/linux/rculist.h
index d079290..a9be1e0 100644
--- a/include/linux/rculist.h
+++ b/include/linux/rculist.h
@@ -25,6 +25,29 @@
 #define list_next_rcu(list)	(*((struct list_head __rcu **)(&(list)->next)))
 
 /*
+ * Link two list entries by making the prev/next entries
+ * point to each other.
+ *
+ * The next pointer of @prev is assigned RCU-aware. So, this
+ * function is primarily intended to publish new nodes starting
+ * at @next within the RCU-protected list @prev.
+ *
+ * This is only for internal list manipulation, when everything
+ * else, i.e. a link from the nodes given by @next back to the
+ * list of @prev, is already set up.
+ *
+ * Concurrent readers are basically allowed, concurrent writers
+ * are forbidden.
+ */
+static inline void __list_link_rcu(struct list_head *prev,
+				   struct list_head *next)
+{
+	rcu_assign_pointer(list_next_rcu(prev), next);
+	next->prev = prev;
+}
+
+
+/*
  * Insert a new entry between two known consecutive entries.
  *
  * This is only for internal list manipulation where we know
@@ -158,6 +181,57 @@ static inline void list_replace_rcu(struct list_head *old,
 	old->prev = LIST_POISON2;
 }
 
+/*
+ * Low level function for splicing.
+ *
+ * @prev and @next must point into the same RCU-protected list.
+ *
+ * All elements between (not including) @prev and @next are replaced
+ * by the nodes given by @list.
+ *
+ * It is safe to have concurrent readers of the @prev/@next list
+ * on any of the referenced nodes. There must be no reader not
+ * from @prev/@next within @list.
+ * There must be no concurrent writers on @list or @prev/@next.
+ *
+ * This is only for internal list manipulation. If elements of
+ * @prev/@next are deleted, their prev-pointers are not poisoned.
+ */
+static inline void __list_splice_rcu(struct list_head *list,
+				     struct list_head *prev,
+				     struct list_head *next)
+{
+	/* Prepare link back to @prev/@next */
+	__list_link(list->prev, next);
+
+	/*
+	 * Publish new nodes, implicitly deleting everything between
+	 * @prev and @next.
+	 */
+	__list_link_rcu(prev, list->next);
+}
+
+/**
+ * list_splice_tail_init_rcu - splice a list into an RCU-protected list
+ * @list: the list to be spliced
+ * @head: the RCU-protected list to splice into
+ *
+ * The contents of @list are inserted before @head. After completion
+ * @list is empty.
+ *
+ * It is safe to have concurrent readers of @head. There must be no
+ * concurrent readers not from @head on @list.
+ * There must be no concurrent writers on @list or @head.
+ */
+static inline void list_splice_tail_init_rcu(struct list_head *list,
+					     struct list_head *head)
+{
+	if (!list_empty(list)) {
+		__list_splice_rcu(list, head->prev, head);
+		INIT_LIST_HEAD(list);
+	}
+}
+
 /**
  * list_splice_init_rcu - splice an RCU-protected list into an existing list.
  * @list: the RCU-protected list to splice
--
1.7.6
* Re: [PATCH v3 1/2] rcu: More rcu-variants for list manipulation
  2011-08-16 14:07 ` [PATCH v3 1/2] rcu: More rcu-variants for list manipulation Jan H. Schönherr
@ 2011-08-23 15:37   ` Paul E. McKenney
  0 siblings, 0 replies; 10+ messages in thread
From: Paul E. McKenney @ 2011-08-23 15:37 UTC (permalink / raw)
To: Jan H. Schönherr
Cc: Ingo Molnar, Peter Zijlstra, Paul Turner, Dipankar Sarma, linux-kernel

On Tue, Aug 16, 2011 at 04:07:45PM +0200, Jan H. Schönherr wrote:
> From: Jan H. Schönherr <schnhrr@cs.tu-berlin.de>
>
> This commit introduces __list_link() and some more *_rcu variants of some
> list manipulation functions:
>
> * __list_link()
>
>   Takes two list elements and links them via their next/prev pointers.
>
> * __list_link_rcu()
>
>   RCU-aware counterpart to __list_link().
>
> * __list_splice_rcu(), list_splice_tail_init_rcu()
>
>   RCU-aware non-blocking splicing. The existing RCU splice function
>   allows concurrent readers on the list to be spliced that are not from
>   the destination list. Therefore, it has to block to get these readers
>   off the list before the splicing can be done. The new functions forbid
>   this and can, thus, work without blocking.

The code looks good to me, just a couple of changes needed in comments.
With those changes:

Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

> Signed-off-by: Jan H. Schönherr <schnhrr@cs.tu-berlin.de>
[...]
> +/*
> + * Link two list entries by making the prev/next entries
> + * point to each other.
> + *

Concurrent readers and writers are forbidden, right?

> + * This is only for internal list manipulation.
> + */
> +static inline void __list_link(struct list_head *prev, struct list_head *next)
> +{
> +	next->prev = prev;
> +	prev->next = next;
> +}
[...]
> +/*
> + * Link two list entries by making the prev/next entries
> + * point to each other.
> + *
> + * The next pointer of @prev is assigned RCU-aware. So, this
> + * function is primarily intended to publish new nodes starting
> + * at @next within the RCU-protected list @prev.
> + *
> + * This is only for internal list manipulation, when everything
> + * else, i.e. a link from the nodes given by @next back to the
> + * list of @prev, is already set up.
> + *
> + * Concurrent readers are basically allowed, concurrent writers
> + * are forbidden.

"Concurrent readers are basically allowed" means that readers are
allowed in the list pointed to by prev, but not within the list pointed
to by next, right?  (I could imagine some strange situations where it
might be safe to have concurrent readers in the list pointed to by next,
but they are all rather strange, so probably best to prohibit it.)

> + */
> +static inline void __list_link_rcu(struct list_head *prev,
> +				   struct list_head *next)
> +{
> +	rcu_assign_pointer(list_next_rcu(prev), next);
> +	next->prev = prev;
> +}
[...]
> +/*
> + * Low level function for splicing.
> + *
> + * @prev and @next must point into the same RCU-protected list.
> + *
> + * All elements between (not including) @prev and @next are replaced
> + * by the nodes given by @list.
> + *
> + * It is safe to have concurrent readers of the @prev/@next list
> + * on any of the referenced nodes. There must be no reader not
> + * from @prev/@next within @list.

OK, I did figure out what you meant by this last sentence, but it took
me several times reading it.  How about something like the following:

	"The caller must make sure that there are no readers
	traversing @list."

Yes, there might be readers traversing the nodes that used to be in
@list as soon as the __list_link_rcu() below completes, but there had
better not be any at the time that __list_splice_rcu() is called.  ;-)

> + * There must be no concurrent writers on @list or @prev/@next.
> + *
> + * This is only for internal list manipulation. If elements of
> + * @prev/@next are deleted, their prev-pointers are not poisoned.
> + */
> +static inline void __list_splice_rcu(struct list_head *list,
> +				     struct list_head *prev,
> +				     struct list_head *next)
> +{
> +	/* Prepare link back to @prev/@next */
> +	__list_link(list->prev, next);
> +
> +	/*
> +	 * Publish new nodes, implicitly deleting everything between
> +	 * @prev and @next.
> +	 */
> +	__list_link_rcu(prev, list->next);
> +}
> +
> +/**
> + * list_splice_tail_init_rcu - splice a list into a RCU-protected list
> + * @list: the list to be spliced
> + * @head: the RCU-protected list to splice into
> + *
> + * The contents of @list are inserted before @head. After completion
> + * @list is empty.
> + *
> + * It is safe to have concurrent readers of @head. There must be no
> + * concurrent readers not from @head on @list.

Again, treat this as a precondition: "The caller must ensure that there
are no concurrent readers traversing @list."  This summarizes the
reader's responsibility, and the fact that there might be readers on
that list as soon as __list_splice_rcu() does its work is just muddying
the waters.

> + * There must be no concurrent writers on @list or @head.
> + */
> +static inline void list_splice_tail_init_rcu(struct list_head *list,
> +					     struct list_head *head)
> +{
> +	if (!list_empty(list)) {
> +		__list_splice_rcu(list, head->prev, head);
> +		INIT_LIST_HEAD(list);
> +	}
> +}
[...]
* [PATCH v3 2/2] sched: Handle on_list ancestor in list_add_leaf_cfs_rq()
  2011-08-16 14:07 [PATCH v3 0/2] sched: Enforce order of leaf CFS runqueues Jan H. Schönherr
  2011-08-16 14:07 ` [PATCH v3 1/2] rcu: More rcu-variants for list manipulation Jan H. Schönherr
@ 2011-08-16 14:07 ` Jan H. Schönherr
  2011-08-23 18:53   ` Peter Zijlstra
  2011-08-23 18:56   ` Peter Zijlstra
  1 sibling, 2 replies; 10+ messages in thread
From: Jan H. Schönherr @ 2011-08-16 14:07 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra
Cc: Paul Turner, Dipankar Sarma, Paul E. McKenney, linux-kernel,
	Jan H. Schönherr

From: Jan H. Schönherr <schnhrr@cs.tu-berlin.de>

In the case of an on_list ancestor we may incorrectly place the child to
the right of a great-ancestor on the list.

Consider:

        A
       / \     Here, t1A results in A->cfs_rq being on_list, however when
      B  t1A   we start enqueuing from C this will not be visible. This is
     /         compounded by the fact that on_list expiration may also be out
    C          of order, punching holes in the tree.
   /
  t1C

Prevent this by making additions to the leaf_cfs_rq_list position
independent.

This is done by collecting additions to this list within the
enqueue_task_fair() path until we reach the top or find an on_list
ancestor. All additions are then spliced into the leaf_cfs_rq_list at
once.

The additions cannot be collected in a list with the normal means as
this violates the RCU guarantee that the next pointer of an element
always leads back to the list as long as there could be concurrent
readers. However, we are still allowed to modify the next pointer of
deleted entries as long as we make sure that they eventually return to
the list. That is, if we have to collect more than one entry in a row,
it is legal to set the next pointer of the first collected entry to the
next one. (We only need to make sure not to hit any physically deleted
list entries.)

Suggested-by: Paul Turner <pjt@google.com>
Signed-off-by: Jan H. Schönherr <schnhrr@cs.tu-berlin.de>
---
Changes since v2:
- undo code movement from v1
- reset next pointer of collected entries to avoid traversing physically
  deleted nodes

Changes since v1:
- moved some more code to include/linux/rculist.h
- added comment regarding RCU
---
 kernel/sched_fair.c |   79 +++++++++++++++++++++++++++++++++++++----------
 1 files changed, 61 insertions(+), 18 deletions(-)

diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index bc8ee99..1317791 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -135,26 +135,65 @@ static inline struct cfs_rq *group_cfs_rq(struct sched_entity *grp)
 	return grp->my_q;
 }
 
-static inline void list_add_leaf_cfs_rq(struct cfs_rq *cfs_rq)
+static inline void list_add_leaf_cfs_rq(struct cfs_rq *cfs_rq,
+					struct list_head *leaf_cfs_rqs)
 {
-	if (!cfs_rq->on_list) {
+	if (cfs_rq->on_list)
+		return;
+
+	/*
+	 * Carefully collect leaf-cfs entries:
+	 *
+	 * We might still have concurrent readers on these entries from before
+	 * the previous delete operation. Therefore we cannot collect them in a
+	 * totally independent list. Instead, we reorganize the deleted entries
+	 * within the deleted list, making sure that the next pointers always
+	 * lead back to the list.
+	 */
+	if (list_empty(leaf_cfs_rqs)) {
 		/*
-		 * Ensure we either appear before our parent (if already
-		 * enqueued) or force our parent to appear after us when it is
-		 * enqueued. The fact that we always enqueue bottom-up
-		 * reduces this to two cases.
+		 * Nothing has been collected so far. Make this cfs_rq the
+		 * first entry. There is no need to take care of its next
+		 * pointer.
 		 */
-		if (cfs_rq->tg->parent &&
-		    cfs_rq->tg->parent->cfs_rq[cpu_of(rq_of(cfs_rq))]->on_list) {
-			list_add_rcu(&cfs_rq->leaf_cfs_rq_list,
-				&rq_of(cfs_rq)->leaf_cfs_rq_list);
-		} else {
-			list_add_tail_rcu(&cfs_rq->leaf_cfs_rq_list,
-				&rq_of(cfs_rq)->leaf_cfs_rq_list);
-		}
+		leaf_cfs_rqs->next = &cfs_rq->leaf_cfs_rq_list;
+		leaf_cfs_rqs->prev = &cfs_rq->leaf_cfs_rq_list;
 
-		cfs_rq->on_list = 1;
+	} else {
+		/*
+		 * We already collected at least one entry. Add this cfs_rq
+		 * after the collected ones. Before that, however, we need to
+		 * set its next pointer to a not deleted list entry so that
+		 * concurrent readers of already collected elements cannot run
+		 * into physically deleted elements.
+		 */
+		cfs_rq->leaf_cfs_rq_list.next =
+					&rq_of(cfs_rq)->leaf_cfs_rq_list;
+		__list_link_rcu(leaf_cfs_rqs->prev, &cfs_rq->leaf_cfs_rq_list);
+		leaf_cfs_rqs->prev = &cfs_rq->leaf_cfs_rq_list;
 	}
+
+	/*
+	 * If our parent is on_list or if there is no parent, then splice all
+	 * entries collected so far at the correct position into the
+	 * leaf_cfs_rq_list.
+	 *
+	 * If our parent is not on the list, it will be collected during the
+	 * next call to this function.
+	 */
+	if (cfs_rq->tg->parent) {
+		int cpu = cpu_of(rq_of(cfs_rq));
+		struct cfs_rq *parent_cfs_rq = cfs_rq->tg->parent->cfs_rq[cpu];
+		if (parent_cfs_rq->on_list) {
+			list_splice_tail_init_rcu(leaf_cfs_rqs,
+					&parent_cfs_rq->leaf_cfs_rq_list);
+		}
+	} else {
+		list_splice_tail_init_rcu(leaf_cfs_rqs,
+					&rq_of(cfs_rq)->leaf_cfs_rq_list);
+	}
+
+	cfs_rq->on_list = 1;
 }
 
 static inline void list_del_leaf_cfs_rq(struct cfs_rq *cfs_rq)
@@ -263,7 +302,8 @@ static inline struct cfs_rq *group_cfs_rq(struct sched_entity *grp)
 	return NULL;
 }
 
-static inline void list_add_leaf_cfs_rq(struct cfs_rq *cfs_rq)
+static inline void list_add_leaf_cfs_rq(struct cfs_rq *cfs_rq,
+					struct list_head *leaf_cfs_rqs)
 {
 }
 
@@ -979,8 +1019,6 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	__enqueue_entity(cfs_rq, se);
 	se->on_rq = 1;
-
-	if (cfs_rq->nr_running == 1)
-		list_add_leaf_cfs_rq(cfs_rq);
 }
 
 static void __clear_buddies_last(struct sched_entity *se)
@@ -1307,12 +1345,17 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 {
 	struct cfs_rq *cfs_rq;
 	struct sched_entity *se = &p->se;
+	struct list_head leaf_cfs_rqs;
+
+	INIT_LIST_HEAD(&leaf_cfs_rqs);
 
 	for_each_sched_entity(se) {
 		if (se->on_rq)
 			break;
 		cfs_rq = cfs_rq_of(se);
 		enqueue_entity(cfs_rq, se, flags);
+		if (cfs_rq->nr_running == 1)
+			list_add_leaf_cfs_rq(cfs_rq, &leaf_cfs_rqs);
 		flags = ENQUEUE_WAKEUP;
 	}
--
1.7.6
* Re: [PATCH v3 2/2] sched: Handle on_list ancestor in list_add_leaf_cfs_rq()
  2011-08-16 14:07 ` [PATCH v3 2/2] sched: Handle on_list ancestor in list_add_leaf_cfs_rq() Jan H. Schönherr
@ 2011-08-23 18:53   ` Peter Zijlstra
  2011-08-24 21:27     ` "Jan H. Schönherr"
  1 sibling, 1 reply; 10+ messages in thread
From: Peter Zijlstra @ 2011-08-23 18:53 UTC (permalink / raw)
To: Jan H. Schönherr
Cc: Ingo Molnar, Paul Turner, Dipankar Sarma, Paul E. McKenney, linux-kernel

On Tue, 2011-08-16 at 16:07 +0200, Jan H. Schönherr wrote:
> From: Jan H. Schönherr <schnhrr@cs.tu-berlin.de>
>
> In the case of an on_list ancestor we may incorrectly place the child to
> the right of a great-ancestor on the list.
[...]
> diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
> index bc8ee99..1317791 100644
> --- a/kernel/sched_fair.c
> +++ b/kernel/sched_fair.c
> @@ -135,26 +135,65 @@ static inline struct cfs_rq *group_cfs_rq(struct sched_entity *grp)
> 	return grp->my_q;
> }

It would be good to have a comment here about why we're doing things,
much like the changelog in fact, the comments in the function explain
the how of things, but not really the why of things.

> +static inline void list_add_leaf_cfs_rq(struct cfs_rq *cfs_rq,
> +					struct list_head *leaf_cfs_rqs)
> {
> +	if (cfs_rq->on_list)
> +		return;
> +
> +	/*
> +	 * Carefully collect leaf-cfs entries:
> +	 *
> +	 * We might still have concurrent readers on these entries from before
> +	 * the previous delete operation. Therefore we cannot collect them in a
> +	 * totally independent list. Instead, we reorganize the deleted entries
> +	 * within the deleted list, making sure that the next pointers always
> +	 * lead back to the list.
> +	 */
> +	if (list_empty(leaf_cfs_rqs)) {
> 		/*
> +		 * Nothing has been collected so far. Make this cfs_rq the
> +		 * first entry. There is no need to take care of its next
> +		 * pointer.
> 		 */
> +		leaf_cfs_rqs->next = &cfs_rq->leaf_cfs_rq_list;
> +		leaf_cfs_rqs->prev = &cfs_rq->leaf_cfs_rq_list;
>
> +	} else {
> +		/*
> +		 * We already collected at least one entry. Add this cfs_rq
> +		 * after the collected ones. Before that, however, we need to
> +		 * set its next pointer to a not deleted list entry so that
> +		 * concurrent readers of already collected elements cannot run
> +		 * into physically deleted elements.
> +		 */
> +		cfs_rq->leaf_cfs_rq_list.next =
> +					&rq_of(cfs_rq)->leaf_cfs_rq_list;

So __list_link_rcu() adds a smp_wmb() right here, I somehow can't seem
to make my mind up on how that's important..

> +		__list_link_rcu(leaf_cfs_rqs->prev, &cfs_rq->leaf_cfs_rq_list);
> +		leaf_cfs_rqs->prev = &cfs_rq->leaf_cfs_rq_list;
> 	}

OK, so the above part queues the passed cfs_rq on the private list in
bottom up fashion. Existing (lingering) iterators can get trapped on
this list and go round and round for a while.

> +
> +	/*
> +	 * If our parent is on_list or if there is no parent, then splice all
> +	 * entries collected so far at the correct position into the
> +	 * leaf_cfs_rq_list.
> +	 *
> +	 * If our parent is not on the list, it will be collected during the
> +	 * next call to this function.
> +	 */
> +	if (cfs_rq->tg->parent) {
> +		int cpu = cpu_of(rq_of(cfs_rq));
> +		struct cfs_rq *parent_cfs_rq = cfs_rq->tg->parent->cfs_rq[cpu];
> +		if (parent_cfs_rq->on_list) {
> +			list_splice_tail_init_rcu(leaf_cfs_rqs,
> +					&parent_cfs_rq->leaf_cfs_rq_list);
> +		}
> +	} else {
> +		list_splice_tail_init_rcu(leaf_cfs_rqs,
> +					&rq_of(cfs_rq)->leaf_cfs_rq_list);
> +	}
> +
> +	cfs_rq->on_list = 1;

And this part is where we splice the queue into the bigger list, and can
be simplified (see below), at this point the trapped iterators are
released and will complete their traversal 'up' and reach the
rq->leaf_cfs_rq_list list head for termination.

Since all this is done with IRQs disabled the delay imposed on the
trapped iterators is bounded by our own runtime, which again should be
limited because we're in the middle of the scheduler (bounded by the
cgroup hierarchy depth).

> }
>
> static inline void list_del_leaf_cfs_rq(struct cfs_rq *cfs_rq)

> @@ -1307,12 +1345,17 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> {
> 	struct cfs_rq *cfs_rq;
> 	struct sched_entity *se = &p->se;
> +	struct list_head leaf_cfs_rqs;
> +
> +	INIT_LIST_HEAD(&leaf_cfs_rqs);
>
> 	for_each_sched_entity(se) {
> 		if (se->on_rq)
> 			break;
> 		cfs_rq = cfs_rq_of(se);
> 		enqueue_entity(cfs_rq, se, flags);
> +		if (cfs_rq->nr_running == 1)
> +			list_add_leaf_cfs_rq(cfs_rq, &leaf_cfs_rqs);
> 		flags = ENQUEUE_WAKEUP;
> 	}

I think splitting the function into two parts would make the thing
saner, something like:

	LIST_HEAD(leaf_queue);

	for_each_sched_entity(se) {
		if (se->on_rq)
			break;
		cfs_rq = cfs_rq_of(se);
		enqueue_entity(cfs_rq, se, flags);
		flags = ENQUEUE_WAKEUP;
		if (cfs_rq->nr_running == 1)
			leaf_add_queue(cfs_rq, &leaf_queue);
	}
	/* XXX does ->on_rq imply ->on_list ? */
	if (se->on_list)
		leaf_splice_queue(cfs_rq, &leaf_queue);

that splits the add to queue and add queue to list part and avoids the
parent_cfs_rq peeking.

Now I don't really like the above because its hard to make the code go
away in the !FAIR_GROUP case, but maybe we can come up with something
for that.
* Re: [PATCH v3 2/2] sched: Handle on_list ancestor in list_add_leaf_cfs_rq() 2011-08-23 18:53 ` Peter Zijlstra @ 2011-08-24 21:27 ` "Jan H. Schönherr" 2011-08-24 21:32 ` Paul Turner 0 siblings, 1 reply; 10+ messages in thread From: "Jan H. Schönherr" @ 2011-08-24 21:27 UTC (permalink / raw) To: Peter Zijlstra Cc: Ingo Molnar, Paul Turner, Dipankar Sarma, Paul E. McKenney, linux-kernel Am 23.08.2011 20:53, schrieb Peter Zijlstra: > On Tue, 2011-08-16 at 16:07 +0200, Jan H. Schönherr wrote: [...] > It would be good to have a comment here about why we're doing things, > much like the changelog in fact, the comments in the function explain > the how of things, but not really the why of things. I'll try to do this. >> +static inline void list_add_leaf_cfs_rq(struct cfs_rq *cfs_rq, >> + struct list_head *leaf_cfs_rqs) >> { >> + if (cfs_rq->on_list) >> + return; >> + >> + /* >> + * Carefully collect leaf-cfs entries: >> + * >> + * We might still have concurrent readers on these entries from before >> + * the previous delete operation. Therefore we cannot collect them in a >> + * totally independent list. Instead, we reorganize the deleted entries >> + * within the deleted list, making sure that the next pointers always >> + * lead back to the list. >> + */ >> + if (list_empty(leaf_cfs_rqs)) { >> /* >> + * Nothing has been collected so far. Make this cfs_rq the >> + * first entry. There is no need to take care of its next >> + * pointer. >> */ >> + leaf_cfs_rqs->next = &cfs_rq->leaf_cfs_rq_list; >> + leaf_cfs_rqs->prev = &cfs_rq->leaf_cfs_rq_list; >> >> + } else { >> + /* >> + * We already collected at least one entry. Add this cfs_rq >> + * after the collected ones. Before that, however, we need to >> + * set its next pointer to a not deleted list entry so that >> + * concurrent readers of already collected elements cannot run >> + * into physically deleted elements. 
>> + */ >> + cfs_rq->leaf_cfs_rq_list.next = >> + &rq_of(cfs_rq)->leaf_cfs_rq_list; > > So __list_link_rcu() adds a smp_wmb() right here, I somehow can't seem > to make my mind up on how that's important.. It makes sure that a reader from the private list cannot see the old next pointer of the current element that was overwritten just above. The old next pointer might lead to memory that is freed. >> + __list_link_rcu(leaf_cfs_rqs->prev, &cfs_rq->leaf_cfs_rq_list); >> + leaf_cfs_rqs->prev = &cfs_rq->leaf_cfs_rq_list; >> } > > OK, so the above part queues the passed cfs_rq on the private list in > bottom up fashion. Existing (lingering) iterators can get trapped on > this list and go round and round for a while. They are not trapped: the next-pointer of the last element of the private list does not point to the head of the private list, but to the head of the leaf-list. >> + >> + /* >> + * If our parent is on_list or if there is no parent, then splice all >> + * entries collected so far at the correct position into the >> + * leaf_cfs_rq_list. >> + * >> + * If our parent is not on the list, it will be collected during the >> + * next call to this function. >> + */ >> + if (cfs_rq->tg->parent) { >> + int cpu = cpu_of(rq_of(cfs_rq)); >> + struct cfs_rq *parent_cfs_rq = cfs_rq->tg->parent->cfs_rq[cpu]; >> + if (parent_cfs_rq->on_list) { >> + list_splice_tail_init_rcu(leaf_cfs_rqs, >> + &parent_cfs_rq->leaf_cfs_rq_list); >> + } >> + } else { >> + list_splice_tail_init_rcu(leaf_cfs_rqs, >> + &rq_of(cfs_rq)->leaf_cfs_rq_list); >> + } >> + >> + cfs_rq->on_list = 1; > > And this part is where we splice the queue into the bigger list, and can > be simplified (see below), at this point the trapped iterators are > released and will complete their traversal 'up' and reach the > rq->leaf_cfs_rq_list list head for termination. 
> > Since all this is done with IRQs disabled the delay imposed on the > trapped iterators is bounded by our own runtime, which again should be > limited because we're in the middle of the scheduler (bounded by the > cgroup hierarchy depth). Luckily, there is no trapping, so we don't need to consider the effects. :) >> } >> >> static inline void list_del_leaf_cfs_rq(struct cfs_rq *cfs_rq) > >> @@ -1307,12 +1345,17 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags) >> { >> struct cfs_rq *cfs_rq; >> struct sched_entity *se = &p->se; >> + struct list_head leaf_cfs_rqs; >> + >> + INIT_LIST_HEAD(&leaf_cfs_rqs); > >> for_each_sched_entity(se) { >> if (se->on_rq) >> break; >> cfs_rq = cfs_rq_of(se); >> enqueue_entity(cfs_rq, se, flags); >> + if (cfs_rq->nr_running == 1) >> + list_add_leaf_cfs_rq(cfs_rq, &leaf_cfs_rqs); >> flags = ENQUEUE_WAKEUP; >> } > > > I think splitting the function into two parts would make the thing > saner, something like: > > > LIST_HEAD(leaf_queue); > > for_each_sched_entity(se) { > if (se->on_rq) > break; > cfs_rq = cfs_rq_of(se); > enqueue_entity(cfs_rq, se, flags); > flags = ENQUEUE_WAKEUP; > if (cfs_rq->nr_running == 1) > leaf_add_queue(cfs_rq, &leaf_queue); > } > /* XXX does ->on_rq imply ->on_list ? */ > if (se->on_list) > leaf_splice_queue(cfs_rq, &leaf_queue); > > that splits the add to queue and add queue to list part and avoids the > parent_cfs_rq peeking. Unfortunately that won't work. The problem here is that some of the traversed SEs are on_list and others aren't. And the on_list status of one SE is independent from other SEs. So, if we don't want to remove on_list elements during the traversal, we need to splice collected entries as soon as we find a SE that is on_list. We might get away with collecting all entries always (removing on_list entries temporarily) and splice them after the loop, but that would introduce more work than normally necessary. And we should double check for new concurrency issues... 
> Now I don't really like the above because it's hard to make the code go
> away in the !FAIR_GROUP case, but maybe we can come up with something
> for that.

Hmmm... you might want to reconsider my original approach to solve this:
http://lkml.org/lkml/2011/7/18/86

That might have been the cleanest one in this respect.

Paul Turner did not like the in-order removal it introduces, but the
out-of-order removal is what causes most of the problems.

Regards
Jan (not quite sure which path to follow)

^ permalink raw reply	[flat|nested] 10+ messages in thread
* Re: [PATCH v3 2/2] sched: Handle on_list ancestor in list_add_leaf_cfs_rq()
  2011-08-24 21:27         ` "Jan H. Schönherr"
@ 2011-08-24 21:32           ` Paul Turner
  2011-08-24 22:40             ` "Jan H. Schönherr"
  0 siblings, 1 reply; 10+ messages in thread
From: Paul Turner @ 2011-08-24 21:32 UTC (permalink / raw)
To: Jan H. Schönherr
Cc: Peter Zijlstra, Ingo Molnar, Dipankar Sarma, Paul E. McKenney,
	linux-kernel

>>
>> I think splitting the function into two parts would make the thing
>> saner, something like:
>>
>>
>> LIST_HEAD(leaf_queue);
>>
>> for_each_sched_entity(se) {
>> 	if (se->on_rq)
>> 		break;
>> 	cfs_rq = cfs_rq_of(se);
>> 	enqueue_entity(cfs_rq, se, flags);
>> 	flags = ENQUEUE_WAKEUP;
>> 	if (cfs_rq->nr_running == 1)
>> 		leaf_add_queue(cfs_rq, &leaf_queue);
>> }
>> /* XXX does ->on_rq imply ->on_list ? */
>> if (se->on_list)
>> 	leaf_splice_queue(cfs_rq, &leaf_queue);
>>
>> that splits the add to queue and add queue to list part and avoids the
>> parent_cfs_rq peeking.
>
> Unfortunately, that won't work. The problem here is that some of the
> traversed SEs are on_list and others aren't, and the on_list status of
> one SE is independent of that of other SEs. So, if we don't want to
> remove on_list elements during the traversal, we need to splice the
> collected entries as soon as we find an SE that is on_list.
>
> We might get away with always collecting all entries (removing on_list
> entries temporarily) and splicing them after the loop, but that would
> introduce more work than normally necessary. And we would have to
> double-check for new concurrency issues...
>

Let me consider this part more carefully.

>
>> Now I don't really like the above because it's hard to make the code go
>> away in the !FAIR_GROUP case, but maybe we can come up with something
>> for that.
>
> Hmmm... you might want to reconsider my original approach to solve this:
> http://lkml.org/lkml/2011/7/18/86
>
> That might have been the cleanest one in this respect.
>
> Paul Turner did not like the introduced in-order removal, but the
> out-of-order removal is causing most problems.
>

Sorry for the delayed reply -- I owe you some feedback on the updated
versions but have been buried with other work.

What I didn't like about the original approach was specifically the
positional dependence on enqueue/dequeue.

If we can't do the splicing properly then I think we want something
like: https://lkml.org/lkml/2011/7/18/348 to avoid shooting ourselves
in the foot later.

See: https://lkml.org/lkml/2011/7/19/178 for why this should be cheap.
* Re: [PATCH v3 2/2] sched: Handle on_list ancestor in list_add_leaf_cfs_rq()
  2011-08-24 21:32           ` Paul Turner
@ 2011-08-24 22:40             ` "Jan H. Schönherr"
  2011-08-30 13:35               ` Peter Zijlstra
  0 siblings, 1 reply; 10+ messages in thread
From: "Jan H. Schönherr" @ 2011-08-24 22:40 UTC (permalink / raw)
To: Paul Turner
Cc: Peter Zijlstra, Ingo Molnar, Dipankar Sarma, Paul E. McKenney,
	linux-kernel

On 24.08.2011 23:32, Paul Turner wrote:
>>> Now I don't really like the above because it's hard to make the code go
>>> away in the !FAIR_GROUP case, but maybe we can come up with something
>>> for that.
>>
>> Hmmm... you might want to reconsider my original approach to solve this:
>> http://lkml.org/lkml/2011/7/18/86
>>
>> That might have been the cleanest one in this respect.
>>
>> Paul Turner did not like the introduced in-order removal, but the
>> out-of-order removal is causing most problems.
>>
>
> Sorry for the delayed reply -- I owe you some feedback on the updated
> versions but have been buried with other work.

No problem.

> What I didn't like about the original approach was specifically the
> positional dependence on enqueue/dequeue.

Maybe I misunderstood you, then.

If we can guarantee in-order removal of leaf_cfs_rqs, then there is no
positional dependency. Any SE can be enqueued and dequeued anytime.

OTOH, the RCU splice variant has a positional dependence: calling
enqueue_entity() outside of enqueue_task_fair() can go wrong easily, as
it depends on being called bottom-up and requires its caller to
maintain state.

This is also partly true for the leaf_insertion_point variant: if a
caller maintains state, then the pair
enqueue_entity()/enqueue_leaf_cfs_rq() also depends on being called
bottom-up.

> If we can't do the splicing
> properly then I think we want something like:
> https://lkml.org/lkml/2011/7/18/348 to avoid shooting ourselves in the
> foot later.
>
> See: https://lkml.org/lkml/2011/7/19/178 for why this should be cheap.
As far as I can tell, all three variants proposed so far work.

It is probably a matter of taste in the end. I'll happily help with
whatever version tastes best. :)

Regards
Jan
* Re: [PATCH v3 2/2] sched: Handle on_list ancestor in list_add_leaf_cfs_rq()
  2011-08-24 22:40             ` "Jan H. Schönherr"
@ 2011-08-30 13:35               ` Peter Zijlstra
  0 siblings, 0 replies; 10+ messages in thread
From: Peter Zijlstra @ 2011-08-30 13:35 UTC (permalink / raw)
To: "Jan H. Schönherr"
Cc: Paul Turner, Ingo Molnar, Dipankar Sarma, Paul E. McKenney,
	linux-kernel

On Thu, 2011-08-25 at 00:40 +0200, "Jan H. Schönherr" wrote:
> On 24.08.2011 23:32, Paul Turner wrote:
> >>> Now I don't really like the above because it's hard to make the code go
> >>> away in the !FAIR_GROUP case, but maybe we can come up with something
> >>> for that.
> >>
> >> Hmmm... you might want to reconsider my original approach to solve this:
> >> http://lkml.org/lkml/2011/7/18/86
> >>
> >> That might have been the cleanest one in this respect.
> >>
> >> Paul Turner did not like the introduced in-order removal, but the
> >> out-of-order removal is causing most problems.
> >>
> >
> > Sorry for the delayed reply -- I owe you some feedback on the updated
> > versions but have been buried with other work.
>
> No problem.
>
> > What I didn't like about the original approach was specifically the
> > positional dependence on enqueue/dequeue.
>
> Maybe I misunderstood you, then.
>
> If we can guarantee in-order removal of leaf_cfs_rqs, then there is no
> positional dependency. Any SE can be enqueued and dequeued anytime.
>
> OTOH, the RCU splice variant has a positional dependence: calling
> enqueue_entity() outside of enqueue_task_fair() can go wrong easily, as
> it depends on being called bottom-up and requires its caller to
> maintain state.
>
> This is also partly true for the leaf_insertion_point variant: if a
> caller maintains state, then the pair
> enqueue_entity()/enqueue_leaf_cfs_rq() also depends on being called
> bottom-up.
>
> > If we can't do the splicing
> > properly then I think we want something like:
> > https://lkml.org/lkml/2011/7/18/348 to avoid shooting ourselves in the
> > foot later.
> >
> > See: https://lkml.org/lkml/2011/7/19/178 for why this should be cheap.
>
> As far as I can tell, all three variants proposed so far work.
>
> It is probably a matter of taste in the end. I'll happily help with
> whatever version tastes best. :)

pjt, you said you were rewriting the cgroup stuff yet again, any
preference for which solution would least get in your way?
* Re: [PATCH v3 2/2] sched: Handle on_list ancestor in list_add_leaf_cfs_rq()
  2011-08-16 14:07 ` [PATCH v3 2/2] sched: Handle on_list ancestor in list_add_leaf_cfs_rq() Jan H. Schönherr
  2011-08-23 18:53   ` Peter Zijlstra
@ 2011-08-23 18:56   ` Peter Zijlstra
  1 sibling, 0 replies; 10+ messages in thread
From: Peter Zijlstra @ 2011-08-23 18:56 UTC (permalink / raw)
To: Jan H. Schönherr
Cc: Ingo Molnar, Paul Turner, Dipankar Sarma, Paul E. McKenney,
	linux-kernel

On Tue, 2011-08-23 at 20:53 +0200, Peter Zijlstra wrote:
> LIST_HEAD(leaf_queue);
>
> for_each_sched_entity(se) {
> 	if (se->on_rq)
> 		break;
> 	cfs_rq = cfs_rq_of(se);
> 	enqueue_entity(cfs_rq, se, flags);
> 	flags = ENQUEUE_WAKEUP;
> 	if (cfs_rq->nr_running == 1)
> 		leaf_add_queue(cfs_rq, &leaf_queue);
> }
> /* XXX does ->on_rq imply ->on_list ? */
> if (se->on_list)
> 	leaf_splice_queue(cfs_rq, &leaf_queue);

Bah, se can be NULL here, still needing some extra foo.
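[Editor's note: the NULL case arises because for_each_sched_entity() terminates with se == NULL once the walk runs past the topmost entity, i.e. when everything up to the root was newly enqueued. A hypothetical sketch of the required guard, with simplified stand-in structures rather than the actual scheduler types:]

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for struct sched_entity; layout is hypothetical
 * and only serves to illustrate the guard. */
struct sched_entity {
	struct sched_entity *parent;
	int on_rq;
	int on_list;
};

enum splice_target {
	SPLICE_NONE,		/* stopped at an ancestor not yet on the list */
	SPLICE_AT_ANCESTOR,	/* splice in front of the on_list ancestor */
	SPLICE_AT_RQ_HEAD,	/* se == NULL: splice at rq->leaf_cfs_rq_list */
};

/* Walk up as for_each_sched_entity() would during enqueue and decide
 * where the collected queue must be spliced afterwards. */
static enum splice_target enqueue_and_pick_target(struct sched_entity *se)
{
	for (; se; se = se->parent) {
		if (se->on_rq)
			break;
		se->on_rq = 1;	/* stand-in for enqueue_entity() */
	}
	if (!se)		/* enqueued all the way to the root */
		return SPLICE_AT_RQ_HEAD;
	if (se->on_list)
		return SPLICE_AT_ANCESTOR;
	return SPLICE_NONE;
}
```

The epilogue of Peter's split would then read `if (!se) splice at the rq head; else if (se->on_list) splice at se's position`, which also sidesteps the open "does ->on_rq imply ->on_list?" question by checking both flags explicitly.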
end of thread, other threads: [~2011-08-30 13:36 UTC | newest]

Thread overview: 10+ messages
-- links below jump to the message on this page --
2011-08-16 14:07 [PATCH v3 0/2] sched: Enforce order of leaf CFS runqueues Jan H. Schönherr
2011-08-16 14:07 ` [PATCH v3 1/2] rcu: More rcu-variants for list manipulation Jan H. Schönherr
2011-08-23 15:37   ` Paul E. McKenney
2011-08-16 14:07 ` [PATCH v3 2/2] sched: Handle on_list ancestor in list_add_leaf_cfs_rq() Jan H. Schönherr
2011-08-23 18:53   ` Peter Zijlstra
2011-08-24 21:27     ` "Jan H. Schönherr"
2011-08-24 21:32       ` Paul Turner
2011-08-24 22:40         ` "Jan H. Schönherr"
2011-08-30 13:35           ` Peter Zijlstra
2011-08-23 18:56   ` Peter Zijlstra