* [PATCH 0/2] Enforce hierarchical order of leaf CFS runqueues
@ 2011-07-18 10:50 Jan H. Schönherr
2011-07-18 10:50 ` [PATCH 1/2] sched: Enforce " Jan H. Schönherr
2011-07-18 10:50 ` [PATCH 2/2] sched: Prevent removal of leaf CFS runqueues with on_list children Jan H. Schönherr
0 siblings, 2 replies; 14+ messages in thread
From: Jan H. Schönherr @ 2011-07-18 10:50 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra
Cc: Paul Turner, linux-kernel, Jan H. Schönherr
Code review showed that the hierarchical order of the leaf
CFS runqueues introduced by commit 67e8625 ("sched: Introduce
hierarchal order on shares update list") is not yet achieved.
(See description of first patch for an example.)
These two patches (against v3.0-rc7) try to fix that.
Patch 1 causes all leaves that are inserted during one
enqueue_task_fair() call to be inserted right next to each other
at the beginning of the list.
Patch 2 prevents the existing code from punching holes
in the leaf-hierarchy as this would cause patch 1 to
re-insert them at possibly wrong positions.
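[Editor's note: to make the reported scenario concrete, here is a userspace
sketch that replays the two enqueues against a miniature list.h. It is not
kernel code; `struct tg`, `old_add_leaf()` and `enqueue()` are illustrative
stand-ins for the cfs_rq/task-group structures, implementing the pre-patch
insertion rule.]

```c
#include <assert.h>
#include <stddef.h>

/* Miniature doubly linked list mimicking the kernel's list.h. */
struct list_head { struct list_head *next, *prev; };

static void init_list(struct list_head *h) { h->next = h->prev = h; }

static void __list_add(struct list_head *new,
                       struct list_head *prev, struct list_head *next)
{
        next->prev = new;
        new->next = next;
        new->prev = prev;
        prev->next = new;
}

static void list_add(struct list_head *new, struct list_head *head)
{ __list_add(new, head, head->next); }

static void list_add_tail(struct list_head *new, struct list_head *head)
{ __list_add(new, head->prev, head); }

/* Hypothetical stand-in for a per-CPU cfs_rq of a task group. */
struct tg {
        const char *name;
        struct tg *parent;
        int on_list;
        struct list_head leaf;
};

/* Pre-patch rule of list_add_leaf_cfs_rq(): insert at the list head if the
 * parent is already listed, otherwise at the tail. */
static void old_add_leaf(struct tg *tg, struct list_head *leaf_list)
{
        if (tg->parent && tg->parent->on_list)
                list_add(&tg->leaf, leaf_list);
        else
                list_add_tail(&tg->leaf, leaf_list);
        tg->on_list = 1;
}

/* Bottom-up enqueue, stopping at the first ancestor already on the list. */
static void enqueue(struct tg *tg, struct list_head *leaf_list)
{
        for (; tg && !tg->on_list; tg = tg->parent)
                old_add_leaf(tg, leaf_list);
}
```

Running the two steps from the first patch's description (enqueue in tg3,
then in tg1) ends with the list ordered tg2, tg3, tg1, i.e. tg1 after its
grandparent tg3, which is exactly the violation the patches address.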
What remains are some multiprocessor aspects: the leaf_cfs_rq_list
is traversed without holding rq->lock. So it can be modified while
a traversal is in progress. That means: while users of the list are
guaranteed to traverse all elements in hierarchical order, they are
not guaranteed to see the parent of an already visited child. But as
far as I currently understand the code, this does not seem to be a
problem, as the code only relies on the former guarantee.
Regards
Jan
Jan H. Schönherr (2):
sched: Enforce order of leaf CFS runqueues
sched: Prevent removal of leaf CFS runqueues with on_list children
kernel/sched.c | 1 +
kernel/sched_fair.c | 45 +++++++++++++++++++++++++++++----------------
2 files changed, 30 insertions(+), 16 deletions(-)
--
1.7.3.4
^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH 1/2] sched: Enforce order of leaf CFS runqueues
  2011-07-18 10:50 [PATCH 0/2] Enforce hierarchical order of leaf CFS runqueues Jan H. Schönherr
@ 2011-07-18 10:50 ` Jan H. Schönherr
  2011-07-18 23:24   ` Paul Turner
  2011-07-18 10:50 ` [PATCH 2/2] sched: Prevent removal of leaf CFS runqueues with on_list children Jan H. Schönherr
  1 sibling, 1 reply; 14+ messages in thread
From: Jan H. Schönherr @ 2011-07-18 10:50 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: Paul Turner, linux-kernel, Jan H. Schönherr

From: Jan H. Schönherr <schnhrr@cs.tu-berlin.de>

Currently, it is still possible that the ordering constraint
is violated. Consider a hierarchy of at least three task groups:

tg1 --> tg2 (parent of tg1) --> tg3 (grandparent of tg1)

And the following actions:
1. Enqueue task in tg3
2. Enqueue task in tg1

Due to step 1, tg3 will be on the list somewhere.

However, in step 2 tg1 will be inserted at the end of the list, because
its parent tg2 is not yet on the list. tg2 will then be inserted at the
beginning. So we get:

tg2 --> tg3 --> tg1

This patch addresses this by carrying the information of the
previously inserted leaf during bottom-up enqueuing.

This allows us to insert everything without a child at the beginning of
the list and all not yet enqueued parents of that runqueue right after it.
This way, every series of enqueues is guaranteed to be inserted before
older series. (Note that this requires that a runqueue is not removed
from the list if it has some children that are still on the list.)

Signed-off-by: Jan H. Schönherr <schnhrr@cs.tu-berlin.de>
---
 kernel/sched_fair.c | 35 ++++++++++++++++++++---------------
 1 files changed, 20 insertions(+), 15 deletions(-)

diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 433491c..d021c75 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -143,23 +143,27 @@ static inline struct cfs_rq *cpu_cfs_rq(struct cfs_rq *cfs_rq, int this_cpu)
 	return cfs_rq->tg->cfs_rq[this_cpu];
 }
 
-static inline void list_add_leaf_cfs_rq(struct cfs_rq *cfs_rq)
+static inline void list_add_leaf_cfs_rq(struct cfs_rq *cfs_rq,
+					struct cfs_rq *prev_cfs_rq)
 {
+	struct list_head *prev_leaf_cfs_rq;
+
 	if (!cfs_rq->on_list) {
 		/*
-		 * Ensure we either appear before our parent (if already
-		 * enqueued) or force our parent to appear after us when it is
-		 * enqueued. The fact that we always enqueue bottom-up
-		 * reduces this to two cases.
+		 * Ensure that children appear before their parent. The fact
+		 * that we always enqueue from some point in the hierarchy
+		 * until the top reduces this to two cases:
+		 * Either we are in the middle of one of these enqueue series,
+		 * then we enqueue directly after the last enqueued runqueue;
+		 * or we are starting such a sequence, in which case we insert
+		 * the runqueue at the beginning of the list.
 		 */
-		if (cfs_rq->tg->parent &&
-		    cfs_rq->tg->parent->cfs_rq[cpu_of(rq_of(cfs_rq))]->on_list) {
-			list_add_rcu(&cfs_rq->leaf_cfs_rq_list,
-				&rq_of(cfs_rq)->leaf_cfs_rq_list);
-		} else {
-			list_add_tail_rcu(&cfs_rq->leaf_cfs_rq_list,
-				&rq_of(cfs_rq)->leaf_cfs_rq_list);
-		}
+		if (prev_cfs_rq)
+			prev_leaf_cfs_rq = &prev_cfs_rq->leaf_cfs_rq_list;
+		else
+			prev_leaf_cfs_rq = &rq_of(cfs_rq)->leaf_cfs_rq_list;
+
+		list_add_rcu(&cfs_rq->leaf_cfs_rq_list, prev_leaf_cfs_rq);
 
 		cfs_rq->on_list = 1;
 	}
@@ -276,7 +280,8 @@ static inline struct cfs_rq *cpu_cfs_rq(struct cfs_rq *cfs_rq, int this_cpu)
 	return &cpu_rq(this_cpu)->cfs;
 }
 
-static inline void list_add_leaf_cfs_rq(struct cfs_rq *cfs_rq)
+static inline void list_add_leaf_cfs_rq(struct cfs_rq *cfs_rq,
+					struct cfs_rq *prev_cfs_rq)
 {
 }
 
@@ -999,7 +1004,7 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	se->on_rq = 1;
 
 	if (cfs_rq->nr_running == 1)
-		list_add_leaf_cfs_rq(cfs_rq);
+		list_add_leaf_cfs_rq(cfs_rq, group_cfs_rq(se));
 }
 
 static void __clear_buddies_last(struct sched_entity *se)
-- 
1.7.3.4

^ permalink raw reply related	[flat|nested] 14+ messages in thread
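[Editor's note: the insertion rule of the patch above can be checked in
isolation with a userspace sketch. This is not the kernel code; `struct tg`,
`add_leaf()` and `enqueue()` are illustrative stand-ins, and the plain
`list_add()` stands in for `list_add_rcu()`.]

```c
#include <assert.h>
#include <stddef.h>

/* Miniature doubly linked list in the style of the kernel's list.h. */
struct list_head { struct list_head *next, *prev; };

static void init_list(struct list_head *h) { h->next = h->prev = h; }

/* Insert new right after head. */
static void list_add(struct list_head *new, struct list_head *head)
{
        new->next = head->next;
        new->prev = head;
        head->next->prev = new;
        head->next = new;
}

/* Hypothetical stand-in for a per-CPU cfs_rq of a task group. */
struct tg {
        const char *name;
        struct tg *parent;
        int on_list;
        struct list_head leaf;
};

/* Analogue of the patched list_add_leaf_cfs_rq(cfs_rq, prev_cfs_rq):
 * insert after the leaf inserted previously in the same enqueue series,
 * or at the list head when a new series starts. */
static void add_leaf(struct tg *tg, struct list_head *leaf_list, struct tg *prev)
{
        if (!tg->on_list) {
                list_add(&tg->leaf, prev ? &prev->leaf : leaf_list);
                tg->on_list = 1;
        }
}

/* Bottom-up enqueue carrying the previously inserted leaf. */
static void enqueue(struct tg *tg, struct list_head *leaf_list)
{
        struct tg *prev = NULL;

        for (; tg && !tg->on_list; prev = tg, tg = tg->parent)
                add_leaf(tg, leaf_list, prev);
}
```

With the tg1/tg2/tg3 scenario from the changelog, the list ends up ordered
tg1, tg2, tg3: every child precedes its parent, and the newer series (tg1,
tg2) sits before the older one (tg3).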
* Re: [PATCH 1/2] sched: Enforce order of leaf CFS runqueues
  2011-07-18 10:50 ` [PATCH 1/2] sched: Enforce " Jan H. Schönherr
@ 2011-07-18 23:24   ` Paul Turner
  2011-07-19 13:08     ` Peter Zijlstra
                        ` (2 more replies)
  0 siblings, 3 replies; 14+ messages in thread
From: Paul Turner @ 2011-07-18 23:24 UTC (permalink / raw)
  To: Jan H. Schönherr; +Cc: Ingo Molnar, Peter Zijlstra, linux-kernel

On Mon, Jul 18, 2011 at 3:50 AM, Jan H. Schönherr
<schnhrr@cs.tu-berlin.de> wrote:
> From: Jan H. Schönherr <schnhrr@cs.tu-berlin.de>
>
> Currently, it is still possible that the ordering constraint
> is violated. Consider a hierarchy of at least three task groups:
>
> tg1 --> tg2 (parent of tg1) --> tg3 (grandparent of tg1)
>
> And the following actions:
> 1. Enqueue task in tg3
> 2. Enqueue task in tg1
>
> Due to step 1, tg3 will be on the list somewhere.

Argh, yeah, nice catch.

What we really want is whether the node at the *highest* level in the
tree, "the root parent", is on queue, as opposed to the immediate
parent, as there may be a discontinuity in this state as your example
presents. Although this would still be subject to deletion punching
holes in things..

>
> However, in step 2 tg1 will be inserted at the end of the list, because
> its parent tg2 is not yet on the list. tg2 will then be inserted at the
> beginning. So we get:
>
> tg2 --> tg3 --> tg1
>
> This patch addresses this by carrying the information of the
> previously inserted leaf during bottom-up enqueuing.
>
> This allows us to insert everything without a child at the beginning of
> the list and all not yet enqueued parents of that runqueue right after it.
> This way, every series of enqueues is guaranteed to be inserted before
> older series. (Note that this requires that a runqueue is not removed
> from the list if it has some children that are still on the list.)
>

One thing that bothers me about this is that it adds an implicit
positional dependency in enqueue_entity. It's no longer safe to
enqueue/dequeue an arbitrary entity; this leads to situation 1 in the
follow-up patch requiring in-order removal. If the enqueue is always
safe then this is not required.

I wonder if we could move this logic to the enqueue_task_fair() path
as that is the gating condition to become a leaf for load tracking.
Ideally we could build up a list of the untracked entities and then
splice it, but alas we have to use an rculist to track leaf cfs_rq's
so that we can walk it without locks in update_shares.

hmmm, what about something like the below (only boot tested), it
should make the insert case always safe meaning we don't need to do
anything funky around delete:

---
Subject: [PATCH] sched: handle on_list ancestor in leaf_add_cfs_rq()
From: Paul Turner <pjt@google.com>
Date: Mon, 18 Jul 2011 16:08:10 -0700

Jan H. Schönherr found that in the case of an on_list ancestor we may
incorrectly place the child to the right of a great-ancestor on the list.

Consider:

      A
     / \     Here, t1A results in A->cfs_rq being on_list, however when
    B  t1A   we start enqueuing from C this will not be visible. This is
   /         compounded by the fact that on_list expiration may also be out
  C          of order, punching holes in the tree.
 /
t1C

Prevent this by making additions to the leaf_cfs_rq_list position
independent. This is done by maintaining additions to this list within
the enqueue_task_fair() path, which allows us to always enqueue against
the current entity's first on_list ancestor.

Reported-by: Jan H. Schönherr <schnhrr@cs.tu-berlin.de>
Signed-off-by: Paul Turner <pjt@google.com>
---
 kernel/sched_fair.c | 55 +++++++++++++++++++++++++++++++------------------
 1 files changed, 34 insertions(+), 21 deletions(-)

diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index eb98f77..a7e0966 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -143,26 +143,39 @@ static inline struct cfs_rq *cpu_cfs_rq(struct cfs_rq *cfs_rq, int this_cpu)
 	return cfs_rq->tg->cfs_rq[this_cpu];
 }
 
-static inline void list_add_leaf_cfs_rq(struct cfs_rq *cfs_rq)
+/*
+ * rq->leaf_cfs_rq_list has an order constraint that specifies children must
+ * appear before parents. For the (!on_list) chain starting at cfs_rq this
+ * finds a satisfactory insertion point. If no ancestor is yet on_list, this
+ * choice is arbitrary.
+ */
+static inline struct list_head *find_leaf_cfs_rq_insertion(struct cfs_rq *cfs_rq)
 {
-	if (!cfs_rq->on_list) {
-		/*
-		 * Ensure we either appear before our parent (if already
-		 * enqueued) or force our parent to appear after us when it is
-		 * enqueued. The fact that we always enqueue bottom-up
-		 * reduces this to two cases.
-		 */
-		if (cfs_rq->tg->parent &&
-		    cfs_rq->tg->parent->cfs_rq[cpu_of(rq_of(cfs_rq))]->on_list) {
-			list_add_rcu(&cfs_rq->leaf_cfs_rq_list,
-				&rq_of(cfs_rq)->leaf_cfs_rq_list);
-		} else {
-			list_add_tail_rcu(&cfs_rq->leaf_cfs_rq_list,
-				&rq_of(cfs_rq)->leaf_cfs_rq_list);
-		}
+	struct sched_entity *se;
+
+	se = cfs_rq->tg->se[cpu_of(rq_of(cfs_rq))];
+	for_each_sched_entity(se)
+		if (cfs_rq->on_list)
+			return &cfs_rq->leaf_cfs_rq_list;
 
-		cfs_rq->on_list = 1;
+	return &rq_of(cfs_rq)->leaf_cfs_rq_list;
+}
+
+static inline void enqueue_leaf_cfs_rq(struct cfs_rq *cfs_rq,
+				       struct list_head **leaf_insertion)
+{
+	if (cfs_rq->on_list) {
+		/* make sure we search again when !on_list follows on_list */
+		*leaf_insertion = NULL;
+		return;
 	}
+
+	if (!*leaf_insertion)
+		*leaf_insertion = find_leaf_cfs_rq_insertion(cfs_rq);
+
+	/* use insertion point as a queue head => children appear first */
+	list_add_tail_rcu(&cfs_rq->leaf_cfs_rq_list, *leaf_insertion);
+	cfs_rq->on_list = 1;
 }
 
 static inline void list_del_leaf_cfs_rq(struct cfs_rq *cfs_rq)
@@ -276,7 +289,8 @@ static inline struct cfs_rq *cpu_cfs_rq(struct cfs_rq *cfs_rq, int this_cpu)
 	return &cpu_rq(this_cpu)->cfs;
 }
 
-static inline void list_add_leaf_cfs_rq(struct cfs_rq *cfs_rq)
+static inline void enqueue_leaf_cfs_rq(struct cfs_rq *cfs_rq,
+				       struct list_head **leaf_insertion)
 {
 }
 
@@ -997,9 +1011,6 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	if (se != cfs_rq->curr)
 		__enqueue_entity(cfs_rq, se);
 	se->on_rq = 1;
-
-	if (cfs_rq->nr_running == 1)
-		list_add_leaf_cfs_rq(cfs_rq);
 }
 
 static void __clear_buddies_last(struct sched_entity *se)
@@ -1326,12 +1337,14 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 {
 	struct cfs_rq *cfs_rq;
 	struct sched_entity *se = &p->se;
+	struct list_head *leaf_insertion_point = NULL;
 
 	for_each_sched_entity(se) {
 		if (se->on_rq)
 			break;
 		cfs_rq = cfs_rq_of(se);
 		enqueue_entity(cfs_rq, se, flags);
+		enqueue_leaf_cfs_rq(cfs_rq, &leaf_insertion_point);
 
 		flags = ENQUEUE_WAKEUP;
 	}
-- 
1.7.3.1

^ permalink raw reply related	[flat|nested] 14+ messages in thread
* Re: [PATCH 1/2] sched: Enforce order of leaf CFS runqueues
  2011-07-18 23:24 ` Paul Turner
@ 2011-07-19 13:08   ` Peter Zijlstra
  2011-07-19 17:48     ` Paul Turner
  2011-07-19 15:17 ` Jan Schönherr
  2011-07-21 13:20 ` Jan H. Schönherr
  2 siblings, 1 reply; 14+ messages in thread
From: Peter Zijlstra @ 2011-07-19 13:08 UTC (permalink / raw)
  To: Paul Turner; +Cc: Jan H. Schönherr, Ingo Molnar, linux-kernel

On Mon, 2011-07-18 at 16:24 -0700, Paul Turner wrote:

> Subject: [PATCH] sched: handle on_list ancestor in leaf_add_cfs_rq()
> From: Paul Turner <pjt@google.com>
> Date: Mon, 18 Jul 2011 16:08:10 -0700
>
> Jan H. Schönherr found that in the case of an on_list ancestor we may
> incorrectly place the child to the right of a great-ancestor on the list.
>
> Consider:
>
>       A
>      / \     Here, t1A results in A->cfs_rq being on_list, however when
>     B  t1A   we start enqueuing from C this will not be visible. This is
>    /         compounded by the fact that on_list expiration may also be out
>   C          of order, punching holes in the tree.
>  /
> t1C
>
> Prevent this by making additions to the leaf_cfs_rq_list position independent.
> This is done by maintaining additions to this list within the
> enqueue_task_fair() path, which allows us to always enqueue against the
> current entity's first on_list ancestor.

The problem I have with this is that it makes the enqueue more
expensive. We're now optimizing a relatively slow path (load-balance) at
the cost of the hottest path in the kernel (enqueue/dequeue).

^ permalink raw reply	[flat|nested] 14+ messages in thread
* Re: [PATCH 1/2] sched: Enforce order of leaf CFS runqueues
  2011-07-19 13:08 ` Peter Zijlstra
@ 2011-07-19 17:48   ` Paul Turner
  0 siblings, 0 replies; 14+ messages in thread
From: Paul Turner @ 2011-07-19 17:48 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Jan H., Ingo Molnar, linux-kernel

On Tue, Jul 19, 2011 at 6:08 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Mon, 2011-07-18 at 16:24 -0700, Paul Turner wrote:
>
>> Subject: [PATCH] sched: handle on_list ancestor in leaf_add_cfs_rq()
>> From: Paul Turner <pjt@google.com>
>> Date: Mon, 18 Jul 2011 16:08:10 -0700
>>
>> Jan H. Schönherr found that in the case of an on_list ancestor we may
>> incorrectly place the child to the right of a great-ancestor on the list.
>>
>> Consider:
>>
>>       A
>>      / \     Here, t1A results in A->cfs_rq being on_list, however when
>>     B  t1A   we start enqueuing from C this will not be visible. This is
>>    /         compounded by the fact that on_list expiration may also be out
>>   C          of order, punching holes in the tree.
>>  /
>> t1C
>>
>> Prevent this by making additions to the leaf_cfs_rq_list position independent.
>> This is done by maintaining additions to this list within the
>> enqueue_task_fair() path, which allows us to always enqueue against the
>> current entity's first on_list ancestor.
>
> The problem I have with this is that it makes the enqueue more
> expensive. We're now optimizing a relatively slow path (load-balance) at
> the cost of the hottest path in the kernel (enqueue/dequeue).
>

So this is a concern that I kept in mind. I don't think this should
actually add any measurable overhead:

- on_rq always implies on_list, so we are guaranteed never to go past
  where we have to go for enqueuing; moreover, we're touching the same
  lines that enqueue will use for continuing its walk
- it takes on_list ~40ms+ to expire, so the extra walk above can only
  happen on this order

I can't think of any benchmark that's going to measure anything for
this. pipe-bench et al are just going to always hit the single on_list
branch, which should be the same cost as the previous nr_running == 1
check.

>
>

^ permalink raw reply	[flat|nested] 14+ messages in thread
* Re: [PATCH 1/2] sched: Enforce order of leaf CFS runqueues
  2011-07-18 23:24 ` Paul Turner
  2011-07-19 13:08 ` Peter Zijlstra
@ 2011-07-19 15:17   ` Jan Schönherr
  2011-07-19 17:53     ` Paul Turner
  2011-07-21 13:20 ` Jan H. Schönherr
  2 siblings, 1 reply; 14+ messages in thread
From: Jan Schönherr @ 2011-07-19 15:17 UTC (permalink / raw)
  To: Paul Turner; +Cc: Ingo Molnar, Peter Zijlstra, linux-kernel

On 19.07.2011 01:24, Paul Turner wrote:
> hmmm, what about something like the below (only boot tested), it
> should make the insert case always safe meaning we don't need to do
> anything funky around delete:

Seems to work, too, with two modifications...

> diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
> index eb98f77..a7e0966 100644
> --- a/kernel/sched_fair.c
> +++ b/kernel/sched_fair.c
> @@ -143,26 +143,39 @@ static inline struct cfs_rq *cpu_cfs_rq(struct cfs_rq *cfs_rq, int this_cpu)
>  	return cfs_rq->tg->cfs_rq[this_cpu];
>  }
> 
> -static inline void list_add_leaf_cfs_rq(struct cfs_rq *cfs_rq)
> +/*
> + * rq->leaf_cfs_rq_list has an order constraint that specifies children must
> + * appear before parents. For the (!on_list) chain starting at cfs_rq this
> + * finds a satisfactory insertion point. If no ancestor is yet on_list, this
> + * choice is arbitrary.
> + */
> +static inline struct list_head *find_leaf_cfs_rq_insertion(struct cfs_rq *cfs_rq)
>  {
> -	if (!cfs_rq->on_list) {
> -		/*
> -		 * Ensure we either appear before our parent (if already
> -		 * enqueued) or force our parent to appear after us when it is
> -		 * enqueued. The fact that we always enqueue bottom-up
> -		 * reduces this to two cases.
> -		 */
> -		if (cfs_rq->tg->parent &&
> -		    cfs_rq->tg->parent->cfs_rq[cpu_of(rq_of(cfs_rq))]->on_list) {
> -			list_add_rcu(&cfs_rq->leaf_cfs_rq_list,
> -				&rq_of(cfs_rq)->leaf_cfs_rq_list);
> -		} else {
> -			list_add_tail_rcu(&cfs_rq->leaf_cfs_rq_list,
> -				&rq_of(cfs_rq)->leaf_cfs_rq_list);
> -		}
> +	struct sched_entity *se;
> +
> +	se = cfs_rq->tg->se[cpu_of(rq_of(cfs_rq))];
> +	for_each_sched_entity(se)
> +		if (cfs_rq->on_list)
> +			return &cfs_rq->leaf_cfs_rq_list;

Need to use cfs_rq corresponding to current se:

-	for_each_sched_entity(se)
-		if (cfs_rq->on_list)
-			return &cfs_rq->leaf_cfs_rq_list;
+	for_each_sched_entity(se) {
+		struct cfs_rq *se_cfs_rq = cfs_rq_of(se);
+		if (se_cfs_rq->on_list)
+			return &se_cfs_rq->leaf_cfs_rq_list;
+	}

> 
> -		cfs_rq->on_list = 1;
> +	return &rq_of(cfs_rq)->leaf_cfs_rq_list;
> +}

And something like the following hack to prevent the removal
of the leaf_insertion_point itself during
	enqueue_entity()
	  update_cfs_load()

(Obviously not for production:)

diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 2df33d4..947257d 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -832,7 +832,7 @@ static void update_cfs_load(struct cfs_rq *cfs_rq, int global_update)
 	}
 
 	/* consider updating load contribution on each fold or truncate */
-	if (global_update || cfs_rq->load_period > period
+	if (global_update==1 || cfs_rq->load_period > period
 	    || !cfs_rq->load_period)
 		update_cfs_rq_load_contribution(cfs_rq, global_update);
 
@@ -847,7 +847,7 @@ static void update_cfs_load(struct cfs_rq *cfs_rq, int global_update)
 		cfs_rq->load_avg /= 2;
 	}
 
-	if (!cfs_rq->curr && !cfs_rq->nr_running && !cfs_rq->load_avg)
+	if (!cfs_rq->curr && !cfs_rq->nr_running && !cfs_rq->load_avg && global_update!=2)
 		list_del_leaf_cfs_rq(cfs_rq);
 }
 
@@ -1063,7 +1063,7 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	 * Update run-time statistics of the 'current'.
 	 */
 	update_curr(cfs_rq);
-	update_cfs_load(cfs_rq, 0);
+	update_cfs_load(cfs_rq, 2);
 	account_entity_enqueue(cfs_rq, se);
 	update_cfs_shares(cfs_rq);

^ permalink raw reply related	[flat|nested] 14+ messages in thread
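[Editor's note: the corrected search-and-insert scheme discussed above can be
modelled in userspace. This sketch is not the kernel code; `struct tg`,
`find_insertion()`, `enqueue_leaf()` and `enqueue()` are illustrative
stand-ins, with plain `list_add_tail()` in place of `list_add_tail_rcu()`,
and it incorporates Jan's fix of walking the cfs_rq of each ancestor.]

```c
#include <assert.h>
#include <stddef.h>

struct list_head { struct list_head *next, *prev; };

static void init_list(struct list_head *h) { h->next = h->prev = h; }

/* Insert new right before head (i.e. head acts as a queue head). */
static void list_add_tail(struct list_head *new, struct list_head *head)
{
        new->next = head;
        new->prev = head->prev;
        head->prev->next = new;
        head->prev = new;
}

/* Hypothetical stand-in for a per-CPU cfs_rq of a task group. */
struct tg {
        const char *name;
        struct tg *parent;
        int on_list;
        struct list_head leaf;
};

/* Walk the ancestor chain and return the insertion point next to the
 * first ancestor already on the list; fall back to the list head. */
static struct list_head *find_insertion(struct tg *tg, struct list_head *leaf_list)
{
        for (; tg; tg = tg->parent)
                if (tg->on_list)
                        return &tg->leaf;
        return leaf_list;
}

static void enqueue_leaf(struct tg *tg, struct list_head *leaf_list,
                         struct list_head **ins)
{
        if (tg->on_list) {
                *ins = NULL;    /* search again if !on_list follows on_list */
                return;
        }
        if (!*ins)
                *ins = find_insertion(tg, leaf_list);
        /* insertion point is used as a queue head => children appear first */
        list_add_tail(&tg->leaf, *ins);
        tg->on_list = 1;
}

/* Analogue of the enqueue_task_fair() walk carrying the insertion point. */
static void enqueue(struct tg *tg, struct list_head *leaf_list)
{
        struct list_head *ins = NULL;

        for (; tg; tg = tg->parent)
                enqueue_leaf(tg, leaf_list, &ins);
}
```

Replaying Paul's A/B/C example (t1A first, then t1C) yields the order
C, B, A: both C and B land before their already-listed ancestor A.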
* Re: [PATCH 1/2] sched: Enforce order of leaf CFS runqueues
  2011-07-19 15:17 ` Jan Schönherr
@ 2011-07-19 17:53   ` Paul Turner
  0 siblings, 0 replies; 14+ messages in thread
From: Paul Turner @ 2011-07-19 17:53 UTC (permalink / raw)
  To: Jan Schönherr; +Cc: Ingo Molnar, Peter Zijlstra, linux-kernel

On Tue, Jul 19, 2011 at 8:17 AM, Jan Schönherr <schnhrr@cs.tu-berlin.de> wrote:
> On 19.07.2011 01:24, Paul Turner wrote:
>> hmmm, what about something like the below (only boot tested), it
>> should make the insert case always safe meaning we don't need to do
>> anything funky around delete:
>
> Seems to work, too, with two modifications...
>
>
>> diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
>> index eb98f77..a7e0966 100644
>> --- a/kernel/sched_fair.c
>> +++ b/kernel/sched_fair.c
>> @@ -143,26 +143,39 @@ static inline struct cfs_rq *cpu_cfs_rq(struct cfs_rq *cfs_rq, int this_cpu)
>>  	return cfs_rq->tg->cfs_rq[this_cpu];
>>  }
>>
>> -static inline void list_add_leaf_cfs_rq(struct cfs_rq *cfs_rq)
>> +/*
>> + * rq->leaf_cfs_rq_list has an order constraint that specifies children must
>> + * appear before parents. For the (!on_list) chain starting at cfs_rq this
>> + * finds a satisfactory insertion point. If no ancestor is yet on_list, this
>> + * choice is arbitrary.
>> + */
>> +static inline struct list_head *find_leaf_cfs_rq_insertion(struct cfs_rq *cfs_rq)
>>  {
>> -	if (!cfs_rq->on_list) {
>> -		/*
>> -		 * Ensure we either appear before our parent (if already
>> -		 * enqueued) or force our parent to appear after us when it is
>> -		 * enqueued. The fact that we always enqueue bottom-up
>> -		 * reduces this to two cases.
>> -		 */
>> -		if (cfs_rq->tg->parent &&
>> -		    cfs_rq->tg->parent->cfs_rq[cpu_of(rq_of(cfs_rq))]->on_list) {
>> -			list_add_rcu(&cfs_rq->leaf_cfs_rq_list,
>> -				&rq_of(cfs_rq)->leaf_cfs_rq_list);
>> -		} else {
>> -			list_add_tail_rcu(&cfs_rq->leaf_cfs_rq_list,
>> -				&rq_of(cfs_rq)->leaf_cfs_rq_list);
>> -		}
>> +	struct sched_entity *se;
>> +
>> +	se = cfs_rq->tg->se[cpu_of(rq_of(cfs_rq))];
>> +	for_each_sched_entity(se)
>> +		if (cfs_rq->on_list)
>> +			return &cfs_rq->leaf_cfs_rq_list;
>
> Need to use cfs_rq corresponding to current se:
>
> -	for_each_sched_entity(se)
> -		if (cfs_rq->on_list)
> -			return &cfs_rq->leaf_cfs_rq_list;
> +	for_each_sched_entity(se) {
> +		struct cfs_rq *se_cfs_rq = cfs_rq_of(se);
> +		if (se_cfs_rq->on_list)
> +			return &se_cfs_rq->leaf_cfs_rq_list;
> +	}

Yeah... that WOULD be a good idea wouldn't it :)

>
>>
>> -		cfs_rq->on_list = 1;
>> +	return &rq_of(cfs_rq)->leaf_cfs_rq_list;
>> +}
>
>
> And something like the following hack to prevent the removal
> of the leaf_insertion_point itself during
> 	enqueue_entity()
> 	  update_cfs_load()
>
> (Obviously not for production:)

Oh.. yeah.. ew.. I don't think this needs a new global_update state to
track. We should just change the call out to list_del_leaf_cfs_rq to
only occur on global updates (e.g. == 1 case) and let them fall out
via update_shares.

>
> diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
> index 2df33d4..947257d 100644
> --- a/kernel/sched_fair.c
> +++ b/kernel/sched_fair.c
> @@ -832,7 +832,7 @@ static void update_cfs_load(struct cfs_rq *cfs_rq, int global_update)
>  	}
>
>  	/* consider updating load contribution on each fold or truncate */
> -	if (global_update || cfs_rq->load_period > period
> +	if (global_update==1 || cfs_rq->load_period > period
>  	    || !cfs_rq->load_period)
>  		update_cfs_rq_load_contribution(cfs_rq, global_update);
>
> @@ -847,7 +847,7 @@ static void update_cfs_load(struct cfs_rq *cfs_rq, int global_update)
>  		cfs_rq->load_avg /= 2;
>  	}
>
> -	if (!cfs_rq->curr && !cfs_rq->nr_running && !cfs_rq->load_avg)
> +	if (!cfs_rq->curr && !cfs_rq->nr_running && !cfs_rq->load_avg && global_update!=2)
>  		list_del_leaf_cfs_rq(cfs_rq);
>  }
>
> @@ -1063,7 +1063,7 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
>  	 * Update run-time statistics of the 'current'.
>  	 */
>  	update_curr(cfs_rq);
> -	update_cfs_load(cfs_rq, 0);
> +	update_cfs_load(cfs_rq, 2);
>  	account_entity_enqueue(cfs_rq, se);
>  	update_cfs_shares(cfs_rq);
>
>

^ permalink raw reply	[flat|nested] 14+ messages in thread
* Re: [PATCH 1/2] sched: Enforce order of leaf CFS runqueues
  2011-07-18 23:24 ` Paul Turner
  2011-07-19 13:08 ` Peter Zijlstra
  2011-07-19 15:17 ` Jan Schönherr
@ 2011-07-21 13:20   ` Jan H. Schönherr
  2011-07-21 13:20     ` [PATCH RFC 1/3] list, treewide: Rename __list_del() to __list_link() Jan H. Schönherr
                        ` (2 more replies)
  2 siblings, 3 replies; 14+ messages in thread
From: Jan H. Schönherr @ 2011-07-21 13:20 UTC (permalink / raw)
  To: Paul Turner
  Cc: Ingo Molnar, Peter Zijlstra, linux-kernel, Jan H. Schönherr

On 19.07.2011 01:24, Paul Turner wrote:
> I wonder if we could move this logic to the enqueue_task_fair() path
> as that is the gating condition to become a leaf for load tracking.
> Ideally we could build up a list of the untracked entities and then
> splice it, but alas we have to use an rculist to track leaf cfs_rq's
> so that we can walk it without locks in update_shares.

Out of curiosity I took a closer look at your "ideal" approach.
If I am not mistaken, this should actually work.

The results come as a reply to this mail.

The problem with the rcu-list is that we cannot put individual leaves
on a different list, as this would confuse concurrent readers. But what
we can do without problems is to assemble the leaves into something like
a "detached list". That is: while we modify the next-pointers of deleted
leaves, we only let them point to other (possibly deleted) leaves of the
same list, so that concurrent readers will still find their way back.

As long as there is no error in my RCU reasoning, we get your ideal
approach, which is also not subject to Peter's "we do more work than
before" objection.

What do you think of this?

Patch 1: After inventing __list_link(), I realized that this function
         already existed, but with a different name. This patch just does
         some renaming. Not really needed, but if I use the old naming in
         patches 2+3, it is really hard to understand what is actually
         going on.

Patch 2: This introduces new list functions to splice RCU lists.

Patch 3: The actual change to the leaf handling.

Regards
Jan

Jan H. Schönherr (3):
  list, treewide: Rename __list_del() to __list_link()
  rcu: More rcu-variants for list manipulation
  sched: Handle on_list ancestor in list_add_leaf_cfs_rq()

 drivers/firewire/core-topology.c     |  2 +-
 drivers/gpu/drm/ttm/ttm_page_alloc.c |  4 +-
 include/linux/list.h                 |  6 ++--
 include/linux/rculist.h              | 28 +++++++++++++++-
 kernel/sched_fair.c                  | 59 ++++++++++++++++++++----------------
 lib/list_debug.c                     |  2 +-
 6 files changed, 73 insertions(+), 28 deletions(-)

-- 
1.7.6

^ permalink raw reply	[flat|nested] 14+ messages in thread
* [PATCH RFC 1/3] list, treewide: Rename __list_del() to __list_link()
  2011-07-21 13:20 ` Jan H. Schönherr
@ 2011-07-21 13:20   ` Jan H. Schönherr
  2011-07-21 13:20 ` [PATCH RFC 2/3] rcu: More rcu-variants for list manipulation Jan H. Schönherr
  2011-07-21 13:20 ` [PATCH RFC 3/3] sched: Handle on_list ancestor in list_add_leaf_cfs_rq() Jan H. Schönherr
  2 siblings, 0 replies; 14+ messages in thread
From: Jan H. Schönherr @ 2011-07-21 13:20 UTC (permalink / raw)
  To: Paul Turner
  Cc: Ingo Molnar, Peter Zijlstra, linux-kernel, Jan H. Schönherr

From: Jan H. Schönherr <schnhrr@cs.tu-berlin.de>

The purpose of __list_del() is to let two list entries point
to each other. It does not actually remove anything. It can be
used to delete something, as is done from within __list_del_entry().

This commit renames __list_del() to __list_link(), as this new name
better reflects what the function actually does.

Signed-off-by: Jan H. Schönherr <schnhrr@cs.tu-berlin.de>

---
We could now rephrase some of the existing list-functions with
__list_link(). Might be more readable.

Also, there are quite a few other follow-up patches when looking
longer at this.
---
 drivers/firewire/core-topology.c     | 2 +-
 drivers/gpu/drm/ttm/ttm_page_alloc.c | 4 ++--
 include/linux/list.h                 | 6 +++---
 include/linux/rculist.h              | 2 +-
 lib/list_debug.c                     | 2 +-
 5 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/drivers/firewire/core-topology.c b/drivers/firewire/core-topology.c
index 193ed92..b7cd1db 100644
--- a/drivers/firewire/core-topology.c
+++ b/drivers/firewire/core-topology.c
@@ -290,7 +290,7 @@ static struct fw_node *build_tree(struct fw_card *card,
 		}
 
 		/* Pop the child nodes off the stack and push the new node. */
-		__list_del(h->prev, &stack);
+		__list_link(h->prev, &stack);
 		list_add_tail(&node->link, &stack);
 
 		stack_depth += 1 - child_port_count;
diff --git a/drivers/gpu/drm/ttm/ttm_page_alloc.c b/drivers/gpu/drm/ttm/ttm_page_alloc.c
index d948575..fd4443a 100644
--- a/drivers/gpu/drm/ttm/ttm_page_alloc.c
+++ b/drivers/gpu/drm/ttm/ttm_page_alloc.c
@@ -331,7 +331,7 @@ restart:
 		/* We can only remove NUM_PAGES_TO_ALLOC at a time. */
 		if (freed_pages >= NUM_PAGES_TO_ALLOC) {
 			/* remove range of pages from the pool */
-			__list_del(p->lru.prev, &pool->list);
+			__list_link(p->lru.prev, &pool->list);
 
 			ttm_pool_update_free_locked(pool, freed_pages);
 			/**
@@ -366,7 +366,7 @@ restart:
 
 	/* remove range of pages from the pool */
 	if (freed_pages) {
-		__list_del(&p->lru, &pool->list);
+		__list_link(&p->lru, &pool->list);
 
 		ttm_pool_update_free_locked(pool, freed_pages);
 		nr_free -= freed_pages;
diff --git a/include/linux/list.h b/include/linux/list.h
index cc6d2aa..025b883 100644
--- a/include/linux/list.h
+++ b/include/linux/list.h
@@ -77,13 +77,13 @@ static inline void list_add_tail(struct list_head *new, struct list_head *head)
 }
 
 /*
- * Delete a list entry by making the prev/next entries
+ * Link two list entries by making the prev/next entries
  * point to each other.
  *
  * This is only for internal list manipulation where we know
  * the prev/next entries already!
  */
-static inline void __list_del(struct list_head * prev, struct list_head * next)
+static inline void __list_link(struct list_head *prev, struct list_head *next)
 {
 	next->prev = prev;
 	prev->next = next;
@@ -98,7 +98,7 @@ static inline void __list_del(struct list_head * prev, struct list_head * next)
 #ifndef CONFIG_DEBUG_LIST
 static inline void __list_del_entry(struct list_head *entry)
 {
-	__list_del(entry->prev, entry->next);
+	__list_link(entry->prev, entry->next);
 }
 
 static inline void list_del(struct list_head *entry)
diff --git a/include/linux/rculist.h b/include/linux/rculist.h
index e3beb31..748401b 100644
--- a/include/linux/rculist.h
+++ b/include/linux/rculist.h
@@ -108,7 +108,7 @@ static inline void list_add_tail_rcu(struct list_head *new,
  */
 static inline void list_del_rcu(struct list_head *entry)
 {
-	__list_del(entry->prev, entry->next);
+	__list_link(entry->prev, entry->next);
 	entry->prev = LIST_POISON2;
 }
 
diff --git a/lib/list_debug.c b/lib/list_debug.c
index b8029a5..7e37695 100644
--- a/lib/list_debug.c
+++ b/lib/list_debug.c
@@ -56,7 +56,7 @@ void __list_del_entry(struct list_head *entry)
 		"but was %p\n", entry, next->prev))
 		return;
 
-	__list_del(prev, next);
+	__list_link(prev, next);
 }
 EXPORT_SYMBOL(__list_del_entry);
 
-- 
1.7.6

^ permalink raw reply related	[flat|nested] 14+ messages in thread
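[Editor's note: the rename is purely semantic, which a small userspace sketch
makes plain. The two `list_add_tail()`/`list_del_entry()` wrappers below are
illustrative, in the spirit of the "rephrase existing list-functions with
__list_link()" remark above; they are not the kernel implementations.]

```c
#include <assert.h>

/* Userspace rendition of the renamed helper: __list_link() just makes two
 * entries point at each other. "Deleting" a middle entry is nothing more
 * than linking its two neighbours. */
struct list_head { struct list_head *next, *prev; };

static void init_list(struct list_head *h) { h->next = h->prev = h; }

static void __list_link(struct list_head *prev, struct list_head *next)
{
        next->prev = prev;
        prev->next = next;
}

/* Both addition and deletion expressed via __list_link(). */
static void list_add_tail(struct list_head *new, struct list_head *head)
{
        __list_link(head->prev, new);
        __list_link(new, head);
}

static void list_del_entry(struct list_head *entry)
{
        __list_link(entry->prev, entry->next);
}
```

Building head, a, b, c and then calling `list_del_entry(&b)` leaves a and c
linked directly to each other, showing why "link" describes the primitive
better than "del".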
* [PATCH RFC 2/3] rcu: More rcu-variants for list manipulation
  2011-07-21 13:20 ` Jan H. Schönherr
  2011-07-21 13:20 ` [PATCH RFC 1/3] list, treewide: Rename __list_del() to __list_link() Jan H. Schönherr
@ 2011-07-21 13:20 ` Jan H. Schönherr
  2011-07-22 16:21   ` Paul E. McKenney
  2011-07-21 13:20 ` [PATCH RFC 3/3] sched: Handle on_list ancestor in list_add_leaf_cfs_rq() Jan H. Schönherr
  2 siblings, 1 reply; 14+ messages in thread
From: Jan H. Schönherr @ 2011-07-21 13:20 UTC (permalink / raw)
To: Paul Turner
Cc: Ingo Molnar, Peter Zijlstra, linux-kernel, Jan H. Schönherr

From: Jan H. Schönherr <schnhrr@cs.tu-berlin.de>

This commit introduces some more *_rcu variants of some list
manipulation functions:

	__list_link_rcu()
	__list_splice_rcu()
	list_splice_tail_init_rcu()

An important difference of the new rcu-splice functions compared to
the existing function is that they require that there are either no
concurrent readers on the list to be spliced or that these readers
are still from the list where the list is spliced into.

Signed-off-by: Jan H. Schönherr <schnhrr@cs.tu-berlin.de>

---
TODO:
- add comments
- rephrase existing splice function using __list_splice_rcu()
  and probably add "_sync" to the name
---
 include/linux/rculist.h |   26 ++++++++++++++++++++++++++
 1 files changed, 26 insertions(+), 0 deletions(-)

diff --git a/include/linux/rculist.h b/include/linux/rculist.h
index 748401b..56385c8 100644
--- a/include/linux/rculist.h
+++ b/include/linux/rculist.h
@@ -24,6 +24,15 @@
  */
 #define list_next_rcu(list)	(*((struct list_head __rcu **)(&(list)->next)))
 
+
+static inline void __list_link_rcu(struct list_head *prev,
+				   struct list_head *next)
+{
+	rcu_assign_pointer(list_next_rcu(prev), next);
+	next->prev = prev;
+}
+
+
 /*
  * Insert a new entry between two known consecutive entries.
  *
@@ -158,6 +167,23 @@ static inline void list_replace_rcu(struct list_head *old,
 	old->prev = LIST_POISON2;
 }
 
+static inline void __list_splice_rcu(struct list_head *list,
+				     struct list_head *prev,
+				     struct list_head *next)
+{
+	__list_link(list->prev, next);
+	__list_link_rcu(prev, list->next);
+}
+
+static inline void list_splice_tail_init_rcu(struct list_head *list,
+					     struct list_head *head)
+{
+	if (!list_empty(list)) {
+		__list_splice_rcu(list, head->prev, head);
+		INIT_LIST_HEAD(list);
+	}
+}
+
 /**
  * list_splice_init_rcu - splice an RCU-protected list into an existing list.
  * @list:	the RCU-protected list to splice
-- 
1.7.6

^ permalink raw reply related	[flat|nested] 14+ messages in thread
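[The splice primitives above are easiest to convince yourself of outside the kernel. The following is a minimal userspace sketch, not the kernel code: the list type and helpers are redefined locally, and a plain store stands in for rcu_assign_pointer(), so it demonstrates only the pointer manipulation, not the memory-ordering guarantees. It shows that list_splice_tail_init_rcu() places the collected entries, in order, immediately before the destination head.]

```c
#include <assert.h>
#include <stddef.h>

/* Local stand-in for the kernel's struct list_head. */
struct list_head {
	struct list_head *next, *prev;
};

static void INIT_LIST_HEAD(struct list_head *h)
{
	h->next = h;
	h->prev = h;
}

static int list_empty(const struct list_head *h)
{
	return h->next == h;
}

/* Non-RCU link (the renamed __list_del() from patch 1/3). */
static void __list_link(struct list_head *prev, struct list_head *next)
{
	next->prev = prev;
	prev->next = next;
}

/* In the kernel the ->next store would be rcu_assign_pointer(). */
static void __list_link_rcu(struct list_head *prev, struct list_head *next)
{
	prev->next = next;
	next->prev = prev;
}

static void __list_splice_rcu(struct list_head *list,
			      struct list_head *prev,
			      struct list_head *next)
{
	__list_link(list->prev, next);     /* readers of the target list see no change */
	__list_link_rcu(prev, list->next); /* now the segment becomes visible */
}

static void list_splice_tail_init_rcu(struct list_head *list,
				      struct list_head *head)
{
	if (!list_empty(list)) {
		__list_splice_rcu(list, head->prev, head);
		INIT_LIST_HEAD(list);
	}
}
```

[Splicing a segment (x, y) into a list (a, b) yields (a, b, x, y): the segment lands at the tail, right before the head node, which is exactly the position the sched patch relies on.]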
* Re: [PATCH RFC 2/3] rcu: More rcu-variants for list manipulation
  2011-07-21 13:20 ` [PATCH RFC 2/3] rcu: More rcu-variants for list manipulation Jan H. Schönherr
@ 2011-07-22 16:21   ` Paul E. McKenney
  2011-07-23 18:41     ` Jan Schönherr
  0 siblings, 1 reply; 14+ messages in thread
From: Paul E. McKenney @ 2011-07-22 16:21 UTC (permalink / raw)
To: Jan H. Schönherr
Cc: Paul Turner, Ingo Molnar, Peter Zijlstra, linux-kernel

On Thu, Jul 21, 2011 at 03:20:22PM +0200, Jan H. Schönherr wrote:
> From: Jan H. Schönherr <schnhrr@cs.tu-berlin.de>
>
> This commit introduces some more *_rcu variants of some list
> manipulation functions:
>
> 	__list_link_rcu()
> 	__list_splice_rcu()
> 	list_splice_tail_init_rcu()
>
> An important difference of the new rcu-splice functions
> compared to the existing function is that they require
> that there are either no concurrent readers on the list to
> be spliced or that these readers are still from the list
> where the list is spliced into.

Please see below for some comments on issues you are probably already
well aware of -- just commenting, the code itself looks OK.  Once this
patch is in good shape, it should probably go with the following patch
(which uses it) rather than up -rcu.

							Thanx, Paul

> Signed-off-by: Jan H. Schönherr <schnhrr@cs.tu-berlin.de>
>
> ---
> TODO:
> - add comments
> - rephrase existing splice function using __list_splice_rcu()
>   and probably add "_sync" to the name
> ---
>  include/linux/rculist.h |   26 ++++++++++++++++++++++++++
>  1 files changed, 26 insertions(+), 0 deletions(-)
>
> diff --git a/include/linux/rculist.h b/include/linux/rculist.h
> index 748401b..56385c8 100644
> --- a/include/linux/rculist.h
> +++ b/include/linux/rculist.h
> @@ -24,6 +24,15 @@
>   */
>  #define list_next_rcu(list)	(*((struct list_head __rcu **)(&(list)->next)))
>

Badly need a header comment here!!!  This needs to include locking
constraints and RCU reader constraints, including why you have "prev"
then "next" rather than the other way around.  (Which I believe I
figured out below.)

> +static inline void __list_link_rcu(struct list_head *prev,
> +				   struct list_head *next)
> +{
> +	rcu_assign_pointer(list_next_rcu(prev), next);
> +	next->prev = prev;
> +}

OK, so this one links a single list element to another single list
element.  You therefore need two calls to this to splice a pair of
lists together, right?  If so, there will be a time when you have a
"6" shaped list, so that readers from one of the list heads will never
come back to that list head.  This is probably what you are getting at
with your "require that there are either no concurrent readers on the
list to be spliced or that these readers are still from the list where
the list is spliced into".

However, this is quite subtle and needs to be spelled out carefully.
As I understand it, when you splice together a list A and a list B,
then list A cannot have any concurrent readers, but list B can.  And
then you have to do the __list_link_rcu() operations (along maybe with
__list_link() operations) in a particular order.

These constraints need to be carefully documented, and uses need to be
commented.

> +
> +
>  /*
>   * Insert a new entry between two known consecutive entries.
>   *
> @@ -158,6 +167,23 @@ static inline void list_replace_rcu(struct list_head *old,
>  	old->prev = LIST_POISON2;
>  }

Really badly need a header comment here!!!  Which of these lists can
tolerate concurrent readers?  Why are there three arguments?  My guess
from the code is that "list" is the list being spliced, and it cannot
tolerate concurrent readers, while "prev" is one end of the place to
splice into the list and "next" is the other.  I believe that both
"prev" and "next" can tolerate concurrent readers.

Did I get it right?  Of course, if I got it wrong, that just proves
how important the documentation is!  ;-)

And I still don't understand why the argument order isn't list, next,
then prev -- after all, isn't that the order that the elements will
appear on the spliced list?  Ah, you are worried about the local order
at the splice point...  OK, then please include this in the comment.

Yeah, I know, the include/linux/list.h comments aren't all that
detailed.  On the other hand, the include/linux/list.h primitives
don't have to worry about concurrent readers.  So please add detailed
comments.  ;-)

> +static inline void __list_splice_rcu(struct list_head *list,
> +				     struct list_head *prev,
> +				     struct list_head *next)
> +{
> +	__list_link(list->prev, next);

At this point, RCU readers traversing the list containing "prev" and
"next" see no change, which is why it is OK to use __list_link()
rather than __list_link_rcu().  If there are readers traversing
"list", they will loop infinitely failing to find the old list header.

> +	__list_link_rcu(prev, list->next);

At this point, new RCU readers traversing the list containing "next"
and "prev" suddenly see the new elements from "list".  The
rcu_assign_pointer() in __list_link_rcu() should force proper
ordering.

OK, so looks reasonable.  Assuming that my guesses about what this is
supposed to do are correct, anyway.

> +}

And badly need a comment here, especially regarding which list is
being spliced into which.  The usual convention would be that list is
being spliced into head.

> +static inline void list_splice_tail_init_rcu(struct list_head *list,
> +					     struct list_head *head)
> +{
> +	if (!list_empty(list)) {
> +		__list_splice_rcu(list, head->prev, head);
> +		INIT_LIST_HEAD(list);
> +	}
> +}

OK, so if "list" is non-empty, we splice its elements to precede
"head".

							Thanx, Paul

> /**
>  * list_splice_init_rcu - splice an RCU-protected list into an existing list.
>  * @list:	the RCU-protected list to splice
> --
> 1.7.6
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 14+ messages in thread
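[Paul's "6"-shaped-list observation can be made concrete. This hedged userspace sketch (local stand-ins for the kernel types, no actual RCU or concurrency) performs only the first half of __list_splice_rcu(); a walker that starts inside the spliced list then drains into the destination list and never returns to its own head, which is why readers on the list being spliced must be forbidden.]

```c
#include <assert.h>

struct list_head {
	struct list_head *next, *prev;
};

static void __list_link(struct list_head *prev, struct list_head *next)
{
	next->prev = prev;
	prev->next = next;
}

/*
 * Walk forward from "start"; return the number of steps needed to
 * reach "target", or -1 if it is not reached within "limit" steps.
 * Bounding the walk is what lets us observe the would-be infinite
 * loop safely.
 */
static int steps_until(struct list_head *start, struct list_head *target,
		       int limit)
{
	struct list_head *p = start;
	int n;

	for (n = 0; n < limit; n++) {
		p = p->next;
		if (p == target)
			return n + 1;
	}
	return -1;
}
```

[The test below builds the "6" shape by hand: after `__list_link(seg.prev, &head)` the tail of the spliced list feeds into the destination list, so a reader standing inside the old list can reach the destination head but can no longer reach its old head.]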
* Re: [PATCH RFC 2/3] rcu: More rcu-variants for list manipulation
  2011-07-22 16:21 ` Paul E. McKenney
@ 2011-07-23 18:41   ` Jan Schönherr
  0 siblings, 0 replies; 14+ messages in thread
From: Jan Schönherr @ 2011-07-23 18:41 UTC (permalink / raw)
To: paulmck; +Cc: Paul Turner, Ingo Molnar, Peter Zijlstra, linux-kernel

Am 22.07.2011 18:21, schrieb Paul E. McKenney:
> Please see below for some comments on issues you are probably already
> well aware of -- just commenting, the code itself looks OK.

Nice to hear.

> Once this patch is in good shape, it should probably go with the
> following patch (which uses it) rather than up -rcu.

Okay.

Whether I'll put this code into good shape or not depends a bit on the
last patch using this functionality. You are of course right with the
missing comments, but if this way of solving the problem is rejected
(or has some other problems that I might have overlooked), then there
is no point in writing all the comments... (That is also why I did not
put you on CC yet.)

If we decide to do it that way, you will get an updated version,
hopefully with all the comments you want to see in there. :)

>> +static inline void __list_link_rcu(struct list_head *prev,
>> +				    struct list_head *next)
>> +{
>> +	rcu_assign_pointer(list_next_rcu(prev), next);
>> +	next->prev = prev;
>> +}
>
> OK, so this one links a single list element to another single list
> element.  You therefore need two calls to this to splice a pair of
> lists together, right?

Yes (or to the non-rcu version).

> If so, there will be a time when you have a "6" shaped list, so that
> readers from one of the list heads will never come back to that list
> head.  This is probably what you are getting at with your "require
> that there are either no concurrent readers on the list to be spliced
> or that these readers are still from the list where the list is
> spliced into".

Exactly. No reader must get trapped in a circle. They all must be able
to return to their list head.

> However, this is quite subtle and needs to be spelled out carefully.

I will try to do that.

> Really badly need a header comment here!!!  Which of these lists can
> tolerate concurrent readers?  Why are there three arguments?  My
> guess from the code is that "list" is the list being spliced, and it
> cannot tolerate concurrent readers, while "prev" is one end of the
> place to splice into the list and "next" is the other.  I believe
> that both "prev" and "next" can tolerate concurrent readers.
>
> Did I get it right?

Everything. :)

Regarding the function arguments, I tried to mimic those of the
non-rcu variants.

>> +static inline void __list_splice_rcu(struct list_head *list,
>> +				      struct list_head *prev,
>> +				      struct list_head *next)
>> +{
>> +	__list_link(list->prev, next);
>
> At this point, RCU readers traversing the list containing "prev" and
> "next" see no change, which is why it is OK to use __list_link()
> rather than __list_link_rcu().  If there are readers traversing
> "list", they will loop infinitely failing to find the old list
> header.

Either that, or they do something worse as they mistake the new list
header for a normal node and try to access some of the elements of the
surrounding struct, which does not exist in this case.

>> +	__list_link_rcu(prev, list->next);
>
> At this point, new RCU readers traversing the list containing "next"
> and "prev" suddenly see the new elements from "list".  The
> rcu_assign_pointer() in __list_link_rcu() should force proper
> ordering.
>
> OK, so looks reasonable.  Assuming that my guesses about what this is
> supposed to do are correct, anyway.

I should probably say something about the possible use cases of this:

One could of course rewrite the already existing rcu-splice function
to use this low level functionality.

But it gets more interesting if you want to add multiple elements to
an rcu-protected list: Instead of doing that one by one with
list_add_rcu() you can assemble them in a non-rcu list and splice them
(avoiding quite a few memory barriers).

The use case in the 3rd patch is even more special: The elements that
are spliced were removed from the very same list previously and there
might still be readers on them from before the removal. Due to this we
cannot collect this batch in a new list. But we can link one deleted
entry to the next (using the non-rcu(!) __list_link) building up a
chain of deleted entries. If there was a reader, it will find its way
back to the list over the last element in that chain. That chain is
then passed to the splice function.

(That chain-handling is currently in the third patch. I wonder whether
we should move that to one of the header files, too?!)

Regards
Jan

^ permalink raw reply	[flat|nested] 14+ messages in thread
* [PATCH RFC 3/3] sched: Handle on_list ancestor in list_add_leaf_cfs_rq()
  2011-07-21 13:20 ` Jan H. Schönherr
  2011-07-21 13:20 ` [PATCH RFC 1/3] list, treewide: Rename __list_del() to __list_link() Jan H. Schönherr
  2011-07-21 13:20 ` [PATCH RFC 2/3] rcu: More rcu-variants for list manipulation Jan H. Schönherr
@ 2011-07-21 13:20 ` Jan H. Schönherr
  2 siblings, 0 replies; 14+ messages in thread
From: Jan H. Schönherr @ 2011-07-21 13:20 UTC (permalink / raw)
To: Paul Turner
Cc: Ingo Molnar, Peter Zijlstra, linux-kernel, Jan H. Schönherr

From: Jan H. Schönherr <schnhrr@cs.tu-berlin.de>

In the case of an on_list ancestor we may incorrectly place the child
to the right of a great-ancestor on the list. Consider:

        A
       / \
      B  t1A
     /
    C
   /
  t1C

Here, t1A results in A->cfs_rq being on_list, however when we start
enqueuing from C this will not be visible. This is compounded by the
fact that on_list expiration may also be out of order, punching holes
in the tree.

Prevent this by making additions to the leaf_cfs_rq_list position
independent. This is done by collecting additions to this list within
the enqueue_task_fair() path until we reach the top or find an on_list
ancestor. All additions are then spliced into the leaf_cfs_rq_list at
once.

The additions cannot be collected in a list with the normal means as
this violates the RCU guarantee that the next pointer of an element
always leads back to the list as long as there could be concurrent
readers. However, we are still allowed to modify the next pointer of
deleted entries as long as we make sure that they eventually return to
the list. That is, if we have to collect more than one entry in a row,
it is legal to set the next-pointer of the first collected entry to
the next one. (We can do this even without using *_rcu functions, as
there are no new elements on that path.)

Suggested-by: Paul Turner <pjt@google.com>
Signed-off-by: Jan H.
Schönherr <schnhrr@cs.tu-berlin.de>
---
 kernel/sched_fair.c |   59 +++++++++++++++++++++++++++++----------------
 1 files changed, 39 insertions(+), 20 deletions(-)

diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 433491c..f98b424 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -143,26 +143,41 @@ static inline struct cfs_rq *cpu_cfs_rq(struct cfs_rq *cfs_rq, int this_cpu)
 	return cfs_rq->tg->cfs_rq[this_cpu];
 }
 
-static inline void list_add_leaf_cfs_rq(struct cfs_rq *cfs_rq)
+static inline void list_add_leaf_cfs_rq(struct cfs_rq *cfs_rq,
+					struct list_head *leaf_segment)
 {
-	if (!cfs_rq->on_list) {
-		/*
-		 * Ensure we either appear before our parent (if already
-		 * enqueued) or force our parent to appear after us when it is
-		 * enqueued. The fact that we always enqueue bottom-up
-		 * reduces this to two cases.
-		 */
-		if (cfs_rq->tg->parent &&
-		    cfs_rq->tg->parent->cfs_rq[cpu_of(rq_of(cfs_rq))]->on_list) {
-			list_add_rcu(&cfs_rq->leaf_cfs_rq_list,
-				&rq_of(cfs_rq)->leaf_cfs_rq_list);
-		} else {
-			list_add_tail_rcu(&cfs_rq->leaf_cfs_rq_list,
-				&rq_of(cfs_rq)->leaf_cfs_rq_list);
-		}
+	if (cfs_rq->on_list)
+		return;
 
-		cfs_rq->on_list = 1;
+	if (list_empty(leaf_segment)) {
+		leaf_segment->next = &cfs_rq->leaf_cfs_rq_list;
+		leaf_segment->prev = &cfs_rq->leaf_cfs_rq_list;
+	} else {
+		__list_link(leaf_segment->prev, &cfs_rq->leaf_cfs_rq_list);
+		leaf_segment->prev = leaf_segment->prev->next;
 	}
+
+	/*
+	 * If our parent is on_list or if there is no parent, then splice all
+	 * entries collected so far at the correct position into the
+	 * leaf_cfs_rq_list.
+	 *
+	 * If our parent is not on the list, it will be collected during the
+	 * next call to this function.
+	 */
+	if (cfs_rq->tg->parent) {
+		int cpu = cpu_of(rq_of(cfs_rq));
+		struct cfs_rq *parent_cfs_rq = cfs_rq->tg->parent->cfs_rq[cpu];
+		if (parent_cfs_rq->on_list) {
+			list_splice_tail_init_rcu(leaf_segment,
+					&parent_cfs_rq->leaf_cfs_rq_list);
+		}
+	} else {
+		list_splice_tail_init_rcu(leaf_segment,
+					&rq_of(cfs_rq)->leaf_cfs_rq_list);
+	}
+
+	cfs_rq->on_list = 1;
 }
 
 static inline void list_del_leaf_cfs_rq(struct cfs_rq *cfs_rq)
@@ -276,7 +291,8 @@ static inline struct cfs_rq *cpu_cfs_rq(struct cfs_rq *cfs_rq, int this_cpu)
 	return &cpu_rq(this_cpu)->cfs;
 }
 
-static inline void list_add_leaf_cfs_rq(struct cfs_rq *cfs_rq)
+static inline void list_add_leaf_cfs_rq(struct cfs_rq *cfs_rq,
+					struct list_head *leaf_segment)
 {
 }
 
@@ -998,8 +1014,6 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	__enqueue_entity(cfs_rq, se);
 	se->on_rq = 1;
 
-	if (cfs_rq->nr_running == 1)
-		list_add_leaf_cfs_rq(cfs_rq);
 }
 
 static void __clear_buddies_last(struct sched_entity *se)
@@ -1326,12 +1340,17 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 {
 	struct cfs_rq *cfs_rq;
 	struct sched_entity *se = &p->se;
+	struct list_head leaf_segment;
+
+	INIT_LIST_HEAD(&leaf_segment);
 
 	for_each_sched_entity(se) {
 		if (se->on_rq)
 			break;
 		cfs_rq = cfs_rq_of(se);
 		enqueue_entity(cfs_rq, se, flags);
+		if (cfs_rq->nr_running == 1)
+			list_add_leaf_cfs_rq(cfs_rq, &leaf_segment);
 		flags = ENQUEUE_WAKEUP;
 	}
-- 
1.7.6

^ permalink raw reply related	[flat|nested] 14+ messages in thread
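[Putting the pieces together, the collect-then-splice logic of the patched list_add_leaf_cfs_rq() can be modeled in userspace. This sketch uses simplified stand-in types (a hypothetical struct crq instead of struct cfs_rq, a plain store instead of rcu_assign_pointer(), no per-CPU arrays, locking, or RCU) and reproduces the cover-letter scenario: tg3 is already on the list, then a task is enqueued in tg1, and the collected segment (tg1, tg2) is still spliced to the left of tg3.]

```c
#include <assert.h>
#include <stddef.h>

struct list_head {
	struct list_head *next, *prev;
};

static void INIT_LIST_HEAD(struct list_head *h) { h->next = h->prev = h; }
static int list_empty(const struct list_head *h) { return h->next == h; }

static void __list_link(struct list_head *prev, struct list_head *next)
{
	next->prev = prev;
	prev->next = next;
}

static void list_splice_tail_init_rcu(struct list_head *list,
				      struct list_head *head)
{
	if (!list_empty(list)) {
		struct list_head *prev = head->prev;

		__list_link(list->prev, head);
		__list_link(prev, list->next); /* rcu_assign_pointer() in the kernel */
		INIT_LIST_HEAD(list);
	}
}

/* Stand-in for struct cfs_rq; "leaf" plays leaf_cfs_rq_list. */
struct crq {
	int on_list;
	struct list_head leaf;
	struct crq *parent;
};

/* Collect-then-splice logic of the patched list_add_leaf_cfs_rq(). */
static void add_leaf(struct crq *c, struct list_head *seg,
		     struct list_head *top)
{
	if (c->on_list)
		return;

	/* append this runqueue to the open segment */
	if (list_empty(seg)) {
		seg->next = &c->leaf;
		seg->prev = &c->leaf;
	} else {
		__list_link(seg->prev, &c->leaf);
		seg->prev = seg->prev->next;
	}

	/* splice before an on_list parent, or at the front for the root */
	if (c->parent) {
		if (c->parent->on_list)
			list_splice_tail_init_rcu(seg, &c->parent->leaf);
	} else {
		list_splice_tail_init_rcu(seg, top);
	}
	c->on_list = 1;
}
```

[The test replays the cover letter's failing sequence: enqueue in tg3, then enqueue in tg1 (walking bottom-up through tg2). With the segment logic the final order is tg1, tg2, tg3 — children strictly before parents — where the original code produced tg2, tg3, tg1.]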
* [PATCH 2/2] sched: Prevent removal of leaf CFS runqueues with on_list children
  2011-07-18 10:50 [PATCH 0/2] Enforce hierarchical order of leaf CFS runqueues Jan H. Schönherr
  2011-07-18 10:50 ` [PATCH 1/2] sched: Enforce " Jan H. Schönherr
@ 2011-07-18 10:50 ` Jan H. Schönherr
  1 sibling, 0 replies; 14+ messages in thread
From: Jan H. Schönherr @ 2011-07-18 10:50 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra
Cc: Paul Turner, linux-kernel, Jan H. Schönherr

From: Jan H. Schönherr <schnhrr@cs.tu-berlin.de>

Currently there are (at least) two situations where a parent gets
removed from the list of leaf CFS runqueues although some of its
children are still on the list.

This patch adds a counter for children depending on a parent,
preventing the parent from being removed too early.

Consider three task groups with these parent pointers:

	tg1 --> tg2 --> tg3

Situation 1:

1. Enqueue task A in tg1
2. Dequeue task A
3. Enqueue task B in tg2

In step 1 all three runqueues are added as leaves. In step 2 the
primary reason for being on the list vanishes, but all three runqueues
are kept on the list. In step 3 we call enqueue_entity() for tg2.
There, we call update_cfs_load() before increasing nr_running.
Therefore tg2 can be removed from the leaf list. Note that
enqueue_entity() will re-add tg2 at the end, but this will mix up the
order as we do not provide tg1 as prev_cfs_rq.

Situation 2:

There is no guarantee that cfs_rq->load_avg of tg1 drops to zero
before that of tg2. Hence, tg2 can be removed before tg1. If something
is enqueued in tg2 after its removal, it would be added at the wrong
position. Enqueuing something in tg1 before it is removed would
further prevent the situation from sorting out itself.

Signed-off-by: Jan H.
Schönherr <schnhrr@cs.tu-berlin.de>
---
 kernel/sched.c      |    1 +
 kernel/sched_fair.c |   10 +++++++++-
 2 files changed, 10 insertions(+), 1 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 9769c75..88c83b8 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -344,6 +344,7 @@ struct cfs_rq {
 	 * list is used during load balance.
 	 */
 	int on_list;
+	int children_on_list;
 	struct list_head leaf_cfs_rq_list;
 
 	struct task_group *tg;	/* group that "owns" this runqueue */
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index d021c75..1c7dd1d 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -166,14 +166,22 @@ static inline void list_add_leaf_cfs_rq(struct cfs_rq *cfs_rq,
 		list_add_rcu(&cfs_rq->leaf_cfs_rq_list, prev_leaf_cfs_rq);
 
 		cfs_rq->on_list = 1;
+		if (cfs_rq->tg->parent) {
+			int cpu = cpu_of(rq_of(cfs_rq));
+			cfs_rq->tg->parent->cfs_rq[cpu]->children_on_list++;
+		}
 	}
 }
 
 static inline void list_del_leaf_cfs_rq(struct cfs_rq *cfs_rq)
 {
-	if (cfs_rq->on_list) {
+	if (cfs_rq->on_list && !cfs_rq->children_on_list) {
 		list_del_rcu(&cfs_rq->leaf_cfs_rq_list);
 		cfs_rq->on_list = 0;
+		if (cfs_rq->tg->parent) {
+			int cpu = cpu_of(rq_of(cfs_rq));
+			cfs_rq->tg->parent->cfs_rq[cpu]->children_on_list--;
+		}
 	}
 }
-- 
1.7.3.4

^ permalink raw reply related	[flat|nested] 14+ messages in thread
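[The children_on_list bookkeeping reduces to a small invariant that can be checked in isolation. The following is a hedged userspace model (a hypothetical struct crq standing in for struct cfs_rq, single CPU, no locking and no actual list manipulation): a parent may only leave the leaf list once every on_list child has left it first.]

```c
#include <assert.h>
#include <stddef.h>

/* Minimal stand-in: only the fields the counter logic touches. */
struct crq {
	int on_list;
	int children_on_list;
	struct crq *parent;
};

static void leaf_add(struct crq *c)
{
	if (c->on_list)
		return;
	c->on_list = 1;
	if (c->parent)
		c->parent->children_on_list++;
}

static void leaf_del(struct crq *c)
{
	/* A parent with on_list children must stay on the list. */
	if (!c->on_list || c->children_on_list)
		return;
	c->on_list = 0;
	if (c->parent)
		c->parent->children_on_list--;
}
```

[This is exactly the situation-1/situation-2 fix: a premature leaf_del() on the parent becomes a no-op, so patch 1's ordering logic never has to re-insert a runqueue above still-listed children.]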