* [PATCH v3 0/2] sched: Enforce order of leaf CFS runqueues
@ 2011-08-16 14:07 Jan H. Schönherr
2011-08-16 14:07 ` [PATCH v3 1/2] rcu: More rcu-variants for list manipulation Jan H. Schönherr
2011-08-16 14:07 ` [PATCH v3 2/2] sched: Handle on_list ancestor in list_add_leaf_cfs_rq() Jan H. Schönherr
0 siblings, 2 replies; 10+ messages in thread
From: Jan H. Schönherr @ 2011-08-16 14:07 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra
Cc: Paul Turner, Dipankar Sarma, Paul E. McKenney, linux-kernel,
Jan H. Schönherr
Code review showed that the hierarchical order of the leaf CFS runqueues
introduced by commit 67e8625 ("sched: Introduce hierarchal order on shares
update list") is not yet achieved. (See the description of the second patch
for an example.)
During the discussion of possible solutions [1], Paul Turner mentioned an
"ideal approach" to solve that.
This is the third iteration of the patch-set that tries to realize this ideal
approach.
Changes since v2 (http://lkml.org/lkml/2011/7/27/348):
- rebased against v3.1-rc2
- dropped patches that are unrelated to the actual bug fix
- fixed a race where freed memory might be accessed
- addressed Peter's comment about *not* providing this kind of
RCU list manipulation as a separate function
Changes since v1 (http://lkml.org/lkml/2011/7/21/169):
- rebased against v3.0
- included follow-up patches 4 to 8 (demonstrating the purpose of patch 1)
- patch 1 should be complete this time
- moved more functionality to rculist.h (see patch 2+3)
- more comments everywhere
Patch 1: New functions to splice RCU lists.
Patch 2: The actual bugfix.
Regards
Jan
[1] Original discussion: http://lkml.org/lkml/2011/7/18/86
Jan H. Schönherr (2):
rcu: More rcu-variants for list manipulation
sched: Handle on_list ancestor in list_add_leaf_cfs_rq()
include/linux/list.h | 12 +++++++
include/linux/rculist.h | 74 ++++++++++++++++++++++++++++++++++++++++++++
kernel/sched_fair.c | 79 ++++++++++++++++++++++++++++++++++++-----------
3 files changed, 147 insertions(+), 18 deletions(-)
--
1.7.6
^ permalink raw reply	[flat|nested] 10+ messages in thread

* [PATCH v3 1/2] rcu: More rcu-variants for list manipulation
  2011-08-16 14:07 [PATCH v3 0/2] sched: Enforce order of leaf CFS runqueues Jan H. Schönherr
@ 2011-08-16 14:07 ` Jan H. Schönherr
  2011-08-23 15:37   ` Paul E. McKenney
  2011-08-16 14:07 ` [PATCH v3 2/2] sched: Handle on_list ancestor in list_add_leaf_cfs_rq() Jan H. Schönherr
  1 sibling, 1 reply; 10+ messages in thread
From: Jan H. Schönherr @ 2011-08-16 14:07 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra
Cc: Paul Turner, Dipankar Sarma, Paul E. McKenney, linux-kernel,
	Jan H. Schönherr

From: Jan H. Schönherr <schnhrr@cs.tu-berlin.de>

This commit introduces __list_link() and some more *_rcu variants of some
list manipulation functions:

* __list_link()

  Takes two list elements and links them via their next/prev pointers.

* __list_link_rcu()

  RCU-aware counterpart to __list_link().

* __list_splice_rcu(), list_splice_tail_init_rcu()

  RCU-aware non-blocking splicing. The existing RCU splice function allows
  concurrent readers on the list to be spliced that are not from the
  destination list. Therefore, it has to block to get these readers off
  the list before the splicing can be done. The new functions forbid such
  readers and can thus work without blocking.

Signed-off-by: Jan H. Schönherr <schnhrr@cs.tu-berlin.de>
---
Changes since v2:
- merged __list_link() from a dropped patch
- dropped list_add_tail_nobackref() again

Changes since v1:
- added comments
- added list_add_tail_nobackref()
---
 include/linux/list.h    |   12 +++++++
 include/linux/rculist.h |   74 ++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 86 insertions(+), 0 deletions(-)

diff --git a/include/linux/list.h b/include/linux/list.h
index cc6d2aa..0a7b057 100644
--- a/include/linux/list.h
+++ b/include/linux/list.h
@@ -21,6 +21,18 @@
 #define LIST_HEAD(name) \
 	struct list_head name = LIST_HEAD_INIT(name)
 
+/*
+ * Link two list entries by making the prev/next entries
+ * point to each other.
+ *
+ * This is only for internal list manipulation.
+ */
+static inline void __list_link(struct list_head *prev, struct list_head *next)
+{
+	next->prev = prev;
+	prev->next = next;
+}
+
 static inline void INIT_LIST_HEAD(struct list_head *list)
 {
 	list->next = list;
diff --git a/include/linux/rculist.h b/include/linux/rculist.h
index d079290..a9be1e0 100644
--- a/include/linux/rculist.h
+++ b/include/linux/rculist.h
@@ -25,6 +25,29 @@
 #define list_next_rcu(list)	(*((struct list_head __rcu **)(&(list)->next)))
 
 /*
+ * Link two list entries by making the prev/next entries
+ * point to each other.
+ *
+ * The next pointer of @prev is assigned RCU-aware. So, this
+ * function is primarily intended to publish new nodes starting
+ * at @next within the RCU-protected list @prev.
+ *
+ * This is only for internal list manipulation, when everything
+ * else, i.e. a link from the nodes given by @next back to the
+ * list of @prev, is already set up.
+ *
+ * Concurrent readers are basically allowed, concurrent writers
+ * are forbidden.
+ */
+static inline void __list_link_rcu(struct list_head *prev,
+				   struct list_head *next)
+{
+	rcu_assign_pointer(list_next_rcu(prev), next);
+	next->prev = prev;
+}
+
+
+/*
  * Insert a new entry between two known consecutive entries.
  *
  * This is only for internal list manipulation where we know
@@ -158,6 +181,57 @@ static inline void list_replace_rcu(struct list_head *old,
 	old->prev = LIST_POISON2;
 }
 
+/*
+ * Low level function for splicing.
+ *
+ * @prev and @next must point into the same RCU-protected list.
+ *
+ * All elements between (not including) @prev and @next are replaced
+ * by the nodes given by @list.
+ *
+ * It is safe to have concurrent readers of the @prev/@next list
+ * on any of the referenced nodes. There must be no reader not
+ * from @prev/@next within @list.
+ * There must be no concurrent writers on @list or @prev/@next.
+ *
+ * This is only for internal list manipulation. If elements of
+ * @prev/@next are deleted, their prev-pointers are not poisoned.
+ */
+static inline void __list_splice_rcu(struct list_head *list,
+				     struct list_head *prev,
+				     struct list_head *next)
+{
+	/* Prepare link back to @prev/@next */
+	__list_link(list->prev, next);
+
+	/*
+	 * Publish new nodes, implicitly deleting everything between
+	 * @prev and @next.
+	 */
+	__list_link_rcu(prev, list->next);
+}
+
+/**
+ * list_splice_tail_init_rcu - splice a list into an RCU-protected list
+ * @list: the list to be spliced
+ * @head: the RCU-protected list to splice into
+ *
+ * The contents of @list are inserted before @head. After completion
+ * @list is empty.
+ *
+ * It is safe to have concurrent readers of @head. There must be no
+ * concurrent readers not from @head on @list.
+ * There must be no concurrent writers on @list or @head.
+ */
+static inline void list_splice_tail_init_rcu(struct list_head *list,
+					     struct list_head *head)
+{
+	if (!list_empty(list)) {
+		__list_splice_rcu(list, head->prev, head);
+		INIT_LIST_HEAD(list);
+	}
+}
+
 /**
  * list_splice_init_rcu - splice an RCU-protected list into an existing list.
  * @list: the RCU-protected list to splice
--
1.7.6
* Re: [PATCH v3 1/2] rcu: More rcu-variants for list manipulation
  2011-08-16 14:07 ` [PATCH v3 1/2] rcu: More rcu-variants for list manipulation Jan H. Schönherr
@ 2011-08-23 15:37   ` Paul E. McKenney
  0 siblings, 0 replies; 10+ messages in thread
From: Paul E. McKenney @ 2011-08-23 15:37 UTC (permalink / raw)
To: Jan H. Schönherr
Cc: Ingo Molnar, Peter Zijlstra, Paul Turner, Dipankar Sarma, linux-kernel

On Tue, Aug 16, 2011 at 04:07:45PM +0200, Jan H. Schönherr wrote:
> From: Jan H. Schönherr <schnhrr@cs.tu-berlin.de>
>
> This commit introduces __list_link() and some more *_rcu variants of some
> list manipulation functions:
>
> * __list_link()
>
>   Takes two list elements and links them via their next/prev pointers.
>
> * __list_link_rcu()
>
>   RCU-aware counterpart to __list_link().
>
> * __list_splice_rcu(), list_splice_tail_init_rcu()
>
>   RCU-aware non-blocking splicing. The existing RCU splice function
>   allows concurrent readers on the list to be spliced that are not from
>   the destination list. Therefore, it has to block to get these readers
>   off the list before the splicing can be done. The new functions forbid
>   this and can, thus, work without blocking.

The code looks good to me, just a couple of changes needed in comments.
With those changes:

Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

> Signed-off-by: Jan H. Schönherr <schnhrr@cs.tu-berlin.de>
[...]
> +/*
> + * Link two list entries by making the prev/next entries
> + * point to each other.
> + *

Concurrent readers and writers are forbidden, right?

> + * This is only for internal list manipulation.
> + */
> +static inline void __list_link(struct list_head *prev, struct list_head *next)
> +{
> +	next->prev = prev;
> +	prev->next = next;
> +}
[...]
> +/*
> + * Link two list entries by making the prev/next entries
> + * point to each other.
> + *
> + * The next pointer of @prev is assigned RCU-aware. So, this
> + * function is primarily intended to publish new nodes starting
> + * at @next within the RCU-protected list @prev.
> + *
> + * This is only for internal list manipulation, when everything
> + * else, i.e. a link from the nodes given by @next back to the
> + * list of @prev, is already set up.
> + *
> + * Concurrent readers are basically allowed, concurrent writers
> + * are forbidden.

"Concurrent readers are basically allowed" means that readers are
allowed in the list pointed to by prev, but not within the list pointed
to by next, right?  (I could imagine some strange situations where it
might be safe to have concurrent readers in the list pointed to by next,
but they are all rather strange, so probably best to prohibit it.)

> + */
> +static inline void __list_link_rcu(struct list_head *prev,
> +				   struct list_head *next)
> +{
> +	rcu_assign_pointer(list_next_rcu(prev), next);
> +	next->prev = prev;
> +}
[...]
> +/*
> + * Low level function for splicing.
> + *
> + * @prev and @next must point into the same RCU-protected list.
> + *
> + * All elements between (not including) @prev and @next are replaced
> + * by the nodes given by @list.
> + *
> + * It is safe to have concurrent readers of the @prev/@next list
> + * on any of the referenced nodes. There must be no reader not
> + * from @prev/@next within @list.

OK, I did figure out what you meant by this last sentence, but it took
me several times reading it.  How about something like the following:

	"The caller must make sure that there are no readers
	traversing @list."

Yes, there might be readers traversing the nodes that used to be in
@list as soon as the __list_link_rcu() below completes, but there had
better not be any at the time that __list_splice_rcu() is called.  ;-)

> + * There must be no concurrent writers on @list or @prev/@next.
> + *
> + * This is only for internal list manipulation. If elements of
> + * @prev/@next are deleted, their prev-pointers are not poisoned.
> + */
> +static inline void __list_splice_rcu(struct list_head *list,
> +				     struct list_head *prev,
> +				     struct list_head *next)
> +{
> +	/* Prepare link back to @prev/@next */
> +	__list_link(list->prev, next);
> +
> +	/*
> +	 * Publish new nodes, implicitly deleting everything between
> +	 * @prev and @next.
> +	 */
> +	__list_link_rcu(prev, list->next);
> +}
> +
> +/**
> + * list_splice_tail_init_rcu - splice a list into a RCU-protected list
> + * @list: the list to be spliced
> + * @head: the RCU-protected list to splice into
> + *
> + * The contents of @list are inserted before @head. After completion
> + * @list is empty.
> + *
> + * It is safe to have concurrent readers of @head. There must be no
> + * concurrent readers not from @head on @list.

Again, treat this as a precondition: "The caller must ensure that there
are no concurrent readers traversing @list."  This summarizes the
reader's responsibility, and the fact that there might be readers on
that list as soon as __list_splice_rcu() does its work is just muddying
the waters.

> + * There must be no concurrent writers on @list or @head.
> + */
> +static inline void list_splice_tail_init_rcu(struct list_head *list,
> +					     struct list_head *head)
> +{
> +	if (!list_empty(list)) {
> +		__list_splice_rcu(list, head->prev, head);
> +		INIT_LIST_HEAD(list);
> +	}
> +}
[...]
* [PATCH v3 2/2] sched: Handle on_list ancestor in list_add_leaf_cfs_rq()
  2011-08-16 14:07 [PATCH v3 0/2] sched: Enforce order of leaf CFS runqueues Jan H. Schönherr
  2011-08-16 14:07 ` [PATCH v3 1/2] rcu: More rcu-variants for list manipulation Jan H. Schönherr
@ 2011-08-16 14:07 ` Jan H. Schönherr
  2011-08-23 18:53   ` Peter Zijlstra
  2011-08-23 18:56   ` Peter Zijlstra
  1 sibling, 2 replies; 10+ messages in thread
From: Jan H. Schönherr @ 2011-08-16 14:07 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra
Cc: Paul Turner, Dipankar Sarma, Paul E. McKenney, linux-kernel,
	Jan H. Schönherr

From: Jan H. Schönherr <schnhrr@cs.tu-berlin.de>

In the case of an on_list ancestor we may incorrectly place the child to
the right of a great-ancestor on the list.

Consider:

        A
       / \     Here, t1A results in A->cfs_rq being on_list, however when
      B  t1A   we start enqueuing from C this will not be visible. This is
     /         compounded by the fact that on_list expiration may also be out
    C          of order, punching holes in the tree.
   /
  t1C

Prevent this by making additions to the leaf_cfs_rq_list position
independent.

This is done by collecting additions to this list within the
enqueue_task_fair() path until we reach the top or find an on_list
ancestor. All additions are then spliced into the leaf_cfs_rq_list at
once.

The additions cannot be collected in a list with the normal means as
this violates the RCU guarantee that the next pointer of an element
always leads back to the list as long as there could be concurrent
readers. However, we are still allowed to modify the next pointer of
deleted entries as long as we make sure that they eventually return to
the list. That is, if we have to collect more than one entry in a row,
it is legal to set the next pointer of the first collected entry to the
next one. (We only need to make sure not to hit any physically deleted
list entries.)

Suggested-by: Paul Turner <pjt@google.com>
Signed-off-by: Jan H. Schönherr <schnhrr@cs.tu-berlin.de>
---
Changes since v2:
- undo code movement from v1
- reset next pointer of collected entries to avoid traversing physically
  deleted nodes

Changes since v1:
- moved some more code to include/linux/rculist.h
- added comment regarding RCU
---
 kernel/sched_fair.c |   79 +++++++++++++++++++++++++++++++++++++----------
 1 files changed, 61 insertions(+), 18 deletions(-)

diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index bc8ee99..1317791 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -135,26 +135,65 @@ static inline struct cfs_rq *group_cfs_rq(struct sched_entity *grp)
 	return grp->my_q;
 }
 
-static inline void list_add_leaf_cfs_rq(struct cfs_rq *cfs_rq)
+static inline void list_add_leaf_cfs_rq(struct cfs_rq *cfs_rq,
+					struct list_head *leaf_cfs_rqs)
 {
-	if (!cfs_rq->on_list) {
+	if (cfs_rq->on_list)
+		return;
+
+	/*
+	 * Carefully collect leaf-cfs entries:
+	 *
+	 * We might still have concurrent readers on these entries from before
+	 * the previous delete operation. Therefore we cannot collect them in a
+	 * totally independent list. Instead, we reorganize the deleted entries
+	 * within the deleted list, making sure that the next pointers always
+	 * lead back to the list.
+	 */
+	if (list_empty(leaf_cfs_rqs)) {
 		/*
-		 * Ensure we either appear before our parent (if already
-		 * enqueued) or force our parent to appear after us when it is
-		 * enqueued. The fact that we always enqueue bottom-up
-		 * reduces this to two cases.
+		 * Nothing has been collected so far. Make this cfs_rq the
+		 * first entry. There is no need to take care of its next
+		 * pointer.
 		 */
-		if (cfs_rq->tg->parent &&
-		    cfs_rq->tg->parent->cfs_rq[cpu_of(rq_of(cfs_rq))]->on_list) {
-			list_add_rcu(&cfs_rq->leaf_cfs_rq_list,
-				&rq_of(cfs_rq)->leaf_cfs_rq_list);
-		} else {
-			list_add_tail_rcu(&cfs_rq->leaf_cfs_rq_list,
-				&rq_of(cfs_rq)->leaf_cfs_rq_list);
-		}
+		leaf_cfs_rqs->next = &cfs_rq->leaf_cfs_rq_list;
+		leaf_cfs_rqs->prev = &cfs_rq->leaf_cfs_rq_list;
 
-		cfs_rq->on_list = 1;
+	} else {
+		/*
+		 * We already collected at least one entry. Add this cfs_rq
+		 * after the collected ones. Before that, however, we need to
+		 * set its next pointer to a not deleted list entry so that
+		 * concurrent readers of already collected elements cannot run
+		 * into physically deleted elements.
+		 */
+		cfs_rq->leaf_cfs_rq_list.next =
+					&rq_of(cfs_rq)->leaf_cfs_rq_list;
+		__list_link_rcu(leaf_cfs_rqs->prev, &cfs_rq->leaf_cfs_rq_list);
+		leaf_cfs_rqs->prev = &cfs_rq->leaf_cfs_rq_list;
 	}
+
+	/*
+	 * If our parent is on_list or if there is no parent, then splice all
+	 * entries collected so far at the correct position into the
+	 * leaf_cfs_rq_list.
+	 *
+	 * If our parent is not on the list, it will be collected during the
+	 * next call to this function.
+	 */
+	if (cfs_rq->tg->parent) {
+		int cpu = cpu_of(rq_of(cfs_rq));
+		struct cfs_rq *parent_cfs_rq = cfs_rq->tg->parent->cfs_rq[cpu];
+		if (parent_cfs_rq->on_list) {
+			list_splice_tail_init_rcu(leaf_cfs_rqs,
+					&parent_cfs_rq->leaf_cfs_rq_list);
+		}
+	} else {
+		list_splice_tail_init_rcu(leaf_cfs_rqs,
+					&rq_of(cfs_rq)->leaf_cfs_rq_list);
+	}
+
+	cfs_rq->on_list = 1;
 }
 
 static inline void list_del_leaf_cfs_rq(struct cfs_rq *cfs_rq)
@@ -263,7 +302,8 @@ static inline struct cfs_rq *group_cfs_rq(struct sched_entity *grp)
 	return NULL;
 }
 
-static inline void list_add_leaf_cfs_rq(struct cfs_rq *cfs_rq)
+static inline void list_add_leaf_cfs_rq(struct cfs_rq *cfs_rq,
+					struct list_head *leaf_cfs_rqs)
 {
 }
 
@@ -979,8 +1019,6 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	__enqueue_entity(cfs_rq, se);
 	se->on_rq = 1;
-
-	if (cfs_rq->nr_running == 1)
-		list_add_leaf_cfs_rq(cfs_rq);
 }
 
 static void __clear_buddies_last(struct sched_entity *se)
@@ -1307,12 +1345,17 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 {
 	struct cfs_rq *cfs_rq;
 	struct sched_entity *se = &p->se;
+	struct list_head leaf_cfs_rqs;
+
+	INIT_LIST_HEAD(&leaf_cfs_rqs);
 
 	for_each_sched_entity(se) {
 		if (se->on_rq)
 			break;
 		cfs_rq = cfs_rq_of(se);
 		enqueue_entity(cfs_rq, se, flags);
+		if (cfs_rq->nr_running == 1)
+			list_add_leaf_cfs_rq(cfs_rq, &leaf_cfs_rqs);
 		flags = ENQUEUE_WAKEUP;
 	}
--
1.7.6
* Re: [PATCH v3 2/2] sched: Handle on_list ancestor in list_add_leaf_cfs_rq()
  2011-08-16 14:07 ` [PATCH v3 2/2] sched: Handle on_list ancestor in list_add_leaf_cfs_rq() Jan H. Schönherr
@ 2011-08-23 18:53   ` Peter Zijlstra
  2011-08-24 21:27     ` "Jan H. Schönherr"
  1 sibling, 1 reply; 10+ messages in thread
From: Peter Zijlstra @ 2011-08-23 18:53 UTC (permalink / raw)
To: Jan H. Schönherr
Cc: Ingo Molnar, Paul Turner, Dipankar Sarma, Paul E. McKenney, linux-kernel

On Tue, 2011-08-16 at 16:07 +0200, Jan H. Schönherr wrote:
> From: Jan H. Schönherr <schnhrr@cs.tu-berlin.de>
>
> In the case of an on_list ancestor we may incorrectly place the child to
> the right of a great-ancestor on the list.
[...]
> diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
> index bc8ee99..1317791 100644
> --- a/kernel/sched_fair.c
> +++ b/kernel/sched_fair.c
> @@ -135,26 +135,65 @@ static inline struct cfs_rq *group_cfs_rq(struct sched_entity *grp)
> 	return grp->my_q;
> }

It would be good to have a comment here about why we're doing things,
much like the changelog in fact, the comments in the function explain
the how of things, but not really the why of things.

> +static inline void list_add_leaf_cfs_rq(struct cfs_rq *cfs_rq,
> +					struct list_head *leaf_cfs_rqs)
> {
> +	if (cfs_rq->on_list)
> +		return;
> +
> +	/*
> +	 * Carefully collect leaf-cfs entries:
> +	 *
> +	 * We might still have concurrent readers on these entries from before
> +	 * the previous delete operation. Therefore we cannot collect them in a
> +	 * totally independent list. Instead, we reorganize the deleted entries
> +	 * within the deleted list, making sure that the next pointers always
> +	 * lead back to the list.
> +	 */
> +	if (list_empty(leaf_cfs_rqs)) {
> 		/*
> +		 * Nothing has been collected so far. Make this cfs_rq the
> +		 * first entry. There is no need to take care of its next
> +		 * pointer.
> 		 */
> +		leaf_cfs_rqs->next = &cfs_rq->leaf_cfs_rq_list;
> +		leaf_cfs_rqs->prev = &cfs_rq->leaf_cfs_rq_list;
>
> +	} else {
> +		/*
> +		 * We already collected at least one entry. Add this cfs_rq
> +		 * after the collected ones. Before that, however, we need to
> +		 * set its next pointer to a not deleted list entry so that
> +		 * concurrent readers of already collected elements cannot run
> +		 * into physically deleted elements.
> +		 */
> +		cfs_rq->leaf_cfs_rq_list.next =
> +					&rq_of(cfs_rq)->leaf_cfs_rq_list;

So __list_link_rcu() adds a smp_wmb() right here, I somehow can't seem
to make my mind up on how that's important..

> +		__list_link_rcu(leaf_cfs_rqs->prev, &cfs_rq->leaf_cfs_rq_list);
> +		leaf_cfs_rqs->prev = &cfs_rq->leaf_cfs_rq_list;
> 	}

OK, so the above part queues the passed cfs_rq on the private list in
bottom up fashion. Existing (lingering) iterators can get trapped on
this list and go round and round for a while.

> +
> +	/*
> +	 * If our parent is on_list or if there is no parent, then splice all
> +	 * entries collected so far at the correct position into the
> +	 * leaf_cfs_rq_list.
> +	 *
> +	 * If our parent is not on the list, it will be collected during the
> +	 * next call to this function.
> +	 */
> +	if (cfs_rq->tg->parent) {
> +		int cpu = cpu_of(rq_of(cfs_rq));
> +		struct cfs_rq *parent_cfs_rq = cfs_rq->tg->parent->cfs_rq[cpu];
> +		if (parent_cfs_rq->on_list) {
> +			list_splice_tail_init_rcu(leaf_cfs_rqs,
> +					&parent_cfs_rq->leaf_cfs_rq_list);
> +		}
> +	} else {
> +		list_splice_tail_init_rcu(leaf_cfs_rqs,
> +					&rq_of(cfs_rq)->leaf_cfs_rq_list);
> +	}
> +
> +	cfs_rq->on_list = 1;

And this part is where we splice the queue into the bigger list, and can
be simplified (see below), at this point the trapped iterators are
released and will complete their traversal 'up' and reach the
rq->leaf_cfs_rq_list list head for termination.

Since all this is done with IRQs disabled the delay imposed on the
trapped iterators is bounded by our own runtime, which again should be
limited because we're in the middle of the scheduler (bounded by the
cgroup hierarchy depth).

> }
>
> static inline void list_del_leaf_cfs_rq(struct cfs_rq *cfs_rq)

> @@ -1307,12 +1345,17 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> {
> 	struct cfs_rq *cfs_rq;
> 	struct sched_entity *se = &p->se;
> +	struct list_head leaf_cfs_rqs;
> +
> +	INIT_LIST_HEAD(&leaf_cfs_rqs);
>
> 	for_each_sched_entity(se) {
> 		if (se->on_rq)
> 			break;
> 		cfs_rq = cfs_rq_of(se);
> 		enqueue_entity(cfs_rq, se, flags);
> +		if (cfs_rq->nr_running == 1)
> +			list_add_leaf_cfs_rq(cfs_rq, &leaf_cfs_rqs);
> 		flags = ENQUEUE_WAKEUP;
> 	}

I think splitting the function into two parts would make the thing
saner, something like:

	LIST_HEAD(leaf_queue);

	for_each_sched_entity(se) {
		if (se->on_rq)
			break;
		cfs_rq = cfs_rq_of(se);
		enqueue_entity(cfs_rq, se, flags);
		flags = ENQUEUE_WAKEUP;
		if (cfs_rq->nr_running == 1)
			leaf_add_queue(cfs_rq, &leaf_queue);
	}
	/* XXX does ->on_rq imply ->on_list ? */
	if (se->on_list)
		leaf_splice_queue(cfs_rq, &leaf_queue);

that splits the add to queue and add queue to list part and avoids the
parent_cfs_rq peeking.

Now I don't really like the above because its hard to make the code go
away in the !FAIR_GROUP case, but maybe we can come up with something
for that.
* Re: [PATCH v3 2/2] sched: Handle on_list ancestor in list_add_leaf_cfs_rq() 2011-08-23 18:53 ` Peter Zijlstra @ 2011-08-24 21:27 ` "Jan H. Schönherr" 2011-08-24 21:32 ` Paul Turner 0 siblings, 1 reply; 10+ messages in thread From: "Jan H. Schönherr" @ 2011-08-24 21:27 UTC (permalink / raw) To: Peter Zijlstra Cc: Ingo Molnar, Paul Turner, Dipankar Sarma, Paul E. McKenney, linux-kernel Am 23.08.2011 20:53, schrieb Peter Zijlstra: > On Tue, 2011-08-16 at 16:07 +0200, Jan H. Schönherr wrote: [...] > It would be good to have a comment here about why we're doing things, > much like the changelog in fact, the comments in the function explain > the how of things, but not really the why of things. I'll try to do this. >> +static inline void list_add_leaf_cfs_rq(struct cfs_rq *cfs_rq, >> + struct list_head *leaf_cfs_rqs) >> { >> + if (cfs_rq->on_list) >> + return; >> + >> + /* >> + * Carefully collect leaf-cfs entries: >> + * >> + * We might still have concurrent readers on these entries from before >> + * the previous delete operation. Therefore we cannot collect them in a >> + * totally independent list. Instead, we reorganize the deleted entries >> + * within the deleted list, making sure that the next pointers always >> + * lead back to the list. >> + */ >> + if (list_empty(leaf_cfs_rqs)) { >> /* >> + * Nothing has been collected so far. Make this cfs_rq the >> + * first entry. There is no need to take care of its next >> + * pointer. >> */ >> + leaf_cfs_rqs->next = &cfs_rq->leaf_cfs_rq_list; >> + leaf_cfs_rqs->prev = &cfs_rq->leaf_cfs_rq_list; >> >> + } else { >> + /* >> + * We already collected at least one entry. Add this cfs_rq >> + * after the collected ones. Before that, however, we need to >> + * set its next pointer to a not deleted list entry so that >> + * concurrent readers of already collected elements cannot run >> + * into physically deleted elements. 
>> + */ >> + cfs_rq->leaf_cfs_rq_list.next = >> + &rq_of(cfs_rq)->leaf_cfs_rq_list; > > So __list_link_rcu() adds a smp_wmb() right here, I somehow can't seem > to make my mind up on how that's important.. It makes sure that a reader from the private list cannot see the old next pointer of the current element that was overwritten just above. The old next pointer might lead to memory that is freed. >> + __list_link_rcu(leaf_cfs_rqs->prev, &cfs_rq->leaf_cfs_rq_list); >> + leaf_cfs_rqs->prev = &cfs_rq->leaf_cfs_rq_list; >> } > > OK, so the above part queues the passed cfs_rq on the private list in > bottom up fashion. Existing (lingering) iterators can get trapped on > this list and go round and round for a while. They are not trapped: the next-pointer of the last element of the private list does not point to the head of the private list, but to the head of the leaf-list. >> + >> + /* >> + * If our parent is on_list or if there is no parent, then splice all >> + * entries collected so far at the correct position into the >> + * leaf_cfs_rq_list. >> + * >> + * If our parent is not on the list, it will be collected during the >> + * next call to this function. >> + */ >> + if (cfs_rq->tg->parent) { >> + int cpu = cpu_of(rq_of(cfs_rq)); >> + struct cfs_rq *parent_cfs_rq = cfs_rq->tg->parent->cfs_rq[cpu]; >> + if (parent_cfs_rq->on_list) { >> + list_splice_tail_init_rcu(leaf_cfs_rqs, >> + &parent_cfs_rq->leaf_cfs_rq_list); >> + } >> + } else { >> + list_splice_tail_init_rcu(leaf_cfs_rqs, >> + &rq_of(cfs_rq)->leaf_cfs_rq_list); >> + } >> + >> + cfs_rq->on_list = 1; > > And this part is where we splice the queue into the bigger list, and can > be simplified (see below), at this point the trapped iterators are > released and will complete their traversal 'up' and reach the > rq->leaf_cfs_rq_list list head for termination. 
> > Since all this is done with IRQs disabled the delay imposed on the > trapped iterators is bounded by our own runtime, which again should be > limited because we're in the middle of the scheduler (bounded by the > cgroup hierarchy depth). Luckily, there is no trapping, so we don't need to consider the effects. :) >> } >> >> static inline void list_del_leaf_cfs_rq(struct cfs_rq *cfs_rq) > >> @@ -1307,12 +1345,17 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags) >> { >> struct cfs_rq *cfs_rq; >> struct sched_entity *se = &p->se; >> + struct list_head leaf_cfs_rqs; >> + >> + INIT_LIST_HEAD(&leaf_cfs_rqs); > >> for_each_sched_entity(se) { >> if (se->on_rq) >> break; >> cfs_rq = cfs_rq_of(se); >> enqueue_entity(cfs_rq, se, flags); >> + if (cfs_rq->nr_running == 1) >> + list_add_leaf_cfs_rq(cfs_rq, &leaf_cfs_rqs); >> flags = ENQUEUE_WAKEUP; >> } > > > I think splitting the function into two parts would make the thing > saner, something like: > > > LIST_HEAD(leaf_queue); > > for_each_sched_entity(se) { > if (se->on_rq) > break; > cfs_rq = cfs_rq_of(se); > enqueue_entity(cfs_rq, se, flags); > flags = ENQUEUE_WAKEUP; > if (cfs_rq->nr_running == 1) > leaf_add_queue(cfs_rq, &leaf_queue); > } > /* XXX does ->on_rq imply ->on_list ? */ > if (se->on_list) > leaf_splice_queue(cfs_rq, &leaf_queue); > > that splits the add to queue and add queue to list part and avoids the > parent_cfs_rq peeking. Unfortunately that won't work. The problem here is that some of the traversed SEs are on_list and others aren't. And the on_list status of one SE is independent from other SEs. So, if we don't want to remove on_list elements during the traversal, we need to splice collected entries as soon as we find a SE that is on_list. We might get away with collecting all entries always (removing on_list entries temporarily) and splice them after the loop, but that would introduce more work than normally necessary. And we should double check for new concurrency issues... 
> Now I don't really like the above because it's hard to make the code go
> away in the !FAIR_GROUP case, but maybe we can come up with something
> for that.

Hmmm... you might want to reconsider my original approach to solve this:
http://lkml.org/lkml/2011/7/18/86

That might have been the cleanest one in this respect.

Paul Turner did not like the in-order removal it introduces, but the
out-of-order removal is what causes most of the problems.

Regards
Jan (not quite sure which path to follow)

^ permalink raw reply	[flat|nested] 10+ messages in thread
* Re: [PATCH v3 2/2] sched: Handle on_list ancestor in list_add_leaf_cfs_rq()
  2011-08-24 21:27         ` "Jan H. Schönherr"
@ 2011-08-24 21:32           ` Paul Turner
  2011-08-24 22:40             ` "Jan H. Schönherr"
  0 siblings, 1 reply; 10+ messages in thread
From: Paul Turner @ 2011-08-24 21:32 UTC (permalink / raw)
To: Jan H. Schönherr
Cc: Peter Zijlstra, Ingo Molnar, Dipankar Sarma, Paul E. McKenney,
	linux-kernel

>>
>> I think splitting the function into two parts would make the thing
>> saner, something like:
>>
>>
>> LIST_HEAD(leaf_queue);
>>
>> for_each_sched_entity(se) {
>> 	if (se->on_rq)
>> 		break;
>> 	cfs_rq = cfs_rq_of(se);
>> 	enqueue_entity(cfs_rq, se, flags);
>> 	flags = ENQUEUE_WAKEUP;
>> 	if (cfs_rq->nr_running == 1)
>> 		leaf_add_queue(cfs_rq, &leaf_queue);
>> }
>> /* XXX does ->on_rq imply ->on_list ? */
>> if (se->on_list)
>> 	leaf_splice_queue(cfs_rq, &leaf_queue);
>>
>> that splits the add to queue and add queue to list part and avoids the
>> parent_cfs_rq peeking.
>
> Unfortunately, that won't work. The problem here is that some of the
> traversed SEs are on_list and others aren't, and the on_list status of
> one SE is independent of that of other SEs. So, if we don't want to
> remove on_list elements during the traversal, we need to splice the
> collected entries as soon as we find an SE that is on_list.
>
> We might get away with always collecting all entries (removing on_list
> entries temporarily) and splicing them after the loop, but that would
> introduce more work than normally necessary. And we would have to
> double-check for new concurrency issues...
>

Let me consider this part more carefully.

>
>> Now I don't really like the above because it's hard to make the code go
>> away in the !FAIR_GROUP case, but maybe we can come up with something
>> for that.
>
> Hmmm... you might want to reconsider my original approach to solve this:
> http://lkml.org/lkml/2011/7/18/86
>
> That might have been the cleanest one in this respect.
>
> Paul Turner did not like the introduced in-order removal, but the
> out-of-order removal is causing most problems.
>

Sorry for the delayed reply -- I owe you some feedback on the updated
versions but have been buried with other work.

What I didn't like about the original approach was specifically the
positional dependence on enqueue/dequeue.

If we can't do the splicing properly then I think we want something
like: https://lkml.org/lkml/2011/7/18/348 to avoid shooting ourselves
in the foot later.

See: https://lkml.org/lkml/2011/7/19/178 for why this should be cheap.
* Re: [PATCH v3 2/2] sched: Handle on_list ancestor in list_add_leaf_cfs_rq()
  2011-08-24 21:32           ` Paul Turner
@ 2011-08-24 22:40             ` "Jan H. Schönherr"
  2011-08-30 13:35               ` Peter Zijlstra
  0 siblings, 1 reply; 10+ messages in thread
From: "Jan H. Schönherr" @ 2011-08-24 22:40 UTC (permalink / raw)
To: Paul Turner
Cc: Peter Zijlstra, Ingo Molnar, Dipankar Sarma, Paul E. McKenney,
	linux-kernel

On 24.08.2011 23:32, Paul Turner wrote:
>>> Now I don't really like the above because it's hard to make the code go
>>> away in the !FAIR_GROUP case, but maybe we can come up with something
>>> for that.
>>
>> Hmmm... you might want to reconsider my original approach to solve this:
>> http://lkml.org/lkml/2011/7/18/86
>>
>> That might have been the cleanest one in this respect.
>>
>> Paul Turner did not like the introduced in-order removal, but the
>> out-of-order removal is causing most problems.
>>
>
> Sorry for the delayed reply -- I owe you some feedback on the updated
> versions but have been buried with other work.

No problem.

> What I didn't like about the original approach was specifically the
> positional dependence on enqueue/dequeue.

Maybe I misunderstood you, then.

If we can guarantee in-order removal of leaf_cfs_rqs, then there is no
positional dependency. Any SE can be enqueued and dequeued anytime.

OTOH, the RCU splice variant has a positional dependence: calling
enqueue_entity() outside of enqueue_task_fair() can go wrong easily, as
it depends on being called bottom-up and requires its caller to
maintain state.

This is also partly true for the leaf_insertion_point variant: if a
caller maintains state, then the pair
enqueue_entity()/enqueue_leaf_cfs_rq() also depends on being called
bottom-up.

> If we can't do the splicing
> properly then I think we want something like:
> https://lkml.org/lkml/2011/7/18/348 to avoid shooting ourselves in the
> foot later.
>
> See: https://lkml.org/lkml/2011/7/19/178 for why this should be cheap.
As far as I can tell, all three variants proposed so far work.

It is probably a matter of taste in the end. I'll happily help with
whatever version tastes best. :)

Regards
Jan
* Re: [PATCH v3 2/2] sched: Handle on_list ancestor in list_add_leaf_cfs_rq()
  2011-08-24 22:40             ` "Jan H. Schönherr"
@ 2011-08-30 13:35               ` Peter Zijlstra
  0 siblings, 0 replies; 10+ messages in thread
From: Peter Zijlstra @ 2011-08-30 13:35 UTC (permalink / raw)
To: "Jan H. Schönherr"
Cc: Paul Turner, Ingo Molnar, Dipankar Sarma, Paul E. McKenney,
	linux-kernel

On Thu, 2011-08-25 at 00:40 +0200, "Jan H. Schönherr" wrote:
> On 24.08.2011 23:32, Paul Turner wrote:
> >>> Now I don't really like the above because it's hard to make the code go
> >>> away in the !FAIR_GROUP case, but maybe we can come up with something
> >>> for that.
> >>
> >> Hmmm... you might want to reconsider my original approach to solve this:
> >> http://lkml.org/lkml/2011/7/18/86
> >>
> >> That might have been the cleanest one in this respect.
> >>
> >> Paul Turner did not like the introduced in-order removal, but the
> >> out-of-order removal is causing most problems.
> >>
> >
> > Sorry for the delayed reply -- I owe you some feedback on the updated
> > versions but have been buried with other work.
>
> No problem.
>
> > What I didn't like about the original approach was specifically the
> > positional dependence on enqueue/dequeue.
>
> Maybe I misunderstood you, then.
>
> If we can guarantee in-order removal of leaf_cfs_rqs, then there is no
> positional dependency. Any SE can be enqueued and dequeued anytime.
>
> OTOH, the RCU splice variant has a positional dependence: calling
> enqueue_entity() outside of enqueue_task_fair() can go wrong easily, as
> it depends on being called bottom-up and requires its caller to
> maintain state.
>
> This is also partly true for the leaf_insertion_point variant: if a
> caller maintains state, then the pair
> enqueue_entity()/enqueue_leaf_cfs_rq() also depends on being called
> bottom-up.
>
> > If we can't do the splicing
> > properly then I think we want something like:
> > https://lkml.org/lkml/2011/7/18/348 to avoid shooting ourselves in the
> > foot later.
> >
> > See: https://lkml.org/lkml/2011/7/19/178 for why this should be cheap.
>
> As far as I can tell, all three variants proposed so far work.
>
> It is probably a matter of taste in the end. I'll happily help with
> whatever version tastes best. :)

pjt, you said you were rewriting the cgroup stuff yet again, any
preference for which solution would least get in your way?
* Re: [PATCH v3 2/2] sched: Handle on_list ancestor in list_add_leaf_cfs_rq()
  2011-08-16 14:07 ` [PATCH v3 2/2] sched: Handle on_list ancestor in list_add_leaf_cfs_rq() Jan H. Schönherr
  2011-08-23 18:53   ` Peter Zijlstra
@ 2011-08-23 18:56   ` Peter Zijlstra
  1 sibling, 0 replies; 10+ messages in thread
From: Peter Zijlstra @ 2011-08-23 18:56 UTC (permalink / raw)
To: Jan H. Schönherr
Cc: Ingo Molnar, Paul Turner, Dipankar Sarma, Paul E. McKenney,
	linux-kernel

On Tue, 2011-08-23 at 20:53 +0200, Peter Zijlstra wrote:
> LIST_HEAD(leaf_queue);
>
> for_each_sched_entity(se) {
> 	if (se->on_rq)
> 		break;
> 	cfs_rq = cfs_rq_of(se);
> 	enqueue_entity(cfs_rq, se, flags);
> 	flags = ENQUEUE_WAKEUP;
> 	if (cfs_rq->nr_running == 1)
> 		leaf_add_queue(cfs_rq, &leaf_queue);
> }
> /* XXX does ->on_rq imply ->on_list ? */
> if (se->on_list)
> 	leaf_splice_queue(cfs_rq, &leaf_queue);

Bah, se can be NULL here, still needing some extra foo.
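[Editor's note: the NULL case arises because for_each_sched_entity() terminates with se == NULL once the walk runs past the topmost entity, i.e. when everything up to the root was newly enqueued. A hypothetical sketch of the required guard, with simplified stand-in structures rather than the actual scheduler types:]

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for struct sched_entity; layout is hypothetical
 * and only serves to illustrate the guard. */
struct sched_entity {
	struct sched_entity *parent;
	int on_rq;
	int on_list;
};

enum splice_target {
	SPLICE_NONE,		/* stopped at an ancestor not yet on the list */
	SPLICE_AT_ANCESTOR,	/* splice in front of the on_list ancestor */
	SPLICE_AT_RQ_HEAD,	/* se == NULL: splice at rq->leaf_cfs_rq_list */
};

/* Walk up as for_each_sched_entity() would during enqueue and decide
 * where the collected queue must be spliced afterwards. */
static enum splice_target enqueue_and_pick_target(struct sched_entity *se)
{
	for (; se; se = se->parent) {
		if (se->on_rq)
			break;
		se->on_rq = 1;	/* stand-in for enqueue_entity() */
	}
	if (!se)		/* enqueued all the way to the root */
		return SPLICE_AT_RQ_HEAD;
	if (se->on_list)
		return SPLICE_AT_ANCESTOR;
	return SPLICE_NONE;
}
```

The epilogue of Peter's split would then read `if (!se) splice at the rq head; else if (se->on_list) splice at se's position`, which also sidesteps the open "does ->on_rq imply ->on_list?" question by checking both flags explicitly.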
end of thread, other threads: [~2011-08-30 13:36 UTC | newest]

Thread overview: 10+ messages
-- links below jump to the message on this page --
2011-08-16 14:07 [PATCH v3 0/2] sched: Enforce order of leaf CFS runqueues Jan H. Schönherr
2011-08-16 14:07 ` [PATCH v3 1/2] rcu: More rcu-variants for list manipulation Jan H. Schönherr
2011-08-23 15:37   ` Paul E. McKenney
2011-08-16 14:07 ` [PATCH v3 2/2] sched: Handle on_list ancestor in list_add_leaf_cfs_rq() Jan H. Schönherr
2011-08-23 18:53   ` Peter Zijlstra
2011-08-24 21:27     ` "Jan H. Schönherr"
2011-08-24 21:32       ` Paul Turner
2011-08-24 22:40         ` "Jan H. Schönherr"
2011-08-30 13:35           ` Peter Zijlstra
2011-08-23 18:56   ` Peter Zijlstra