* [PATCHSET] cgroup: allow dropping RCU read lock while iterating @ 2013-05-21 1:50 Tejun Heo [not found] ` <1369101025-28335-1-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> 0 siblings, 1 reply; 21+ messages in thread From: Tejun Heo @ 2013-05-21 1:50 UTC (permalink / raw) To: lizefan-hv44wF8Li93QT0dZR+AlfA Cc: serue-r/Jw6+rmf7HQT0dZR+AlfA, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, mhocko-AlSwsSmVLrQ, hannes-druUgvl0LCNAfugRpC6u6w, cgroups-u79uwXL29TY76Z2rM5mHXA Currently all cgroup iterators require the whole traversal to be contained in a single RCU read critical section, which can be too restrictive as there are times when blocking operations are necessary during traversal. This forces controllers to implement specific workarounds in those cases - building separate iteration list, punting actual operations to work items and so on. This patchset updates cgroup iterators so that they allow dropping RCU read lock while iteration is in progress so that controllers which require sleeping during iteration don't need to implement their own mechanisms. Dropping RCU read lock during iteration is unsafe because cgroup->sibling.next can't be trusted once RCU read lock is dropped. The sibling list is a RCU list and when a cgroup is removed the next pointer is retained to keep RCU traversal working. If the next sibling is removed while RCU read lock is dropped, the removed current cgroup's next won't be updated and the next sibling may complete its grace period and get freed leaving the next pointer dangling. Working around the problem is relatiely simple. Whether ->sibling.next can be trusted can be trusted can be decided by looking at CGRP_REMOVED - as cgroup removals are fully serialized, the flag is guaranteed to be visible before the next sibling finishes its grace period. For those cases, each cgroup is assigned a monotonically increasing serial number. Because new cgroups are always appeneded to the children list, it's guaranteed that all children list are sorted in the ascending order of the serial numbers. When the next pointer can't be trusted, the next sibling can be located by walking the parent's children list from the beginning looking for the first cgroup with higher serial number. The above is implemented in cgroup_next_sibling() and all iterators are updated to use it to find out the next sibling thus allowing droppping RCU read lock while iteration is in progress. This patchset replaces separate iteration list in device_cgroup with direct descendant walk and there will be further patches making use of this update. This patchset contains the following five patches. 0001-cgroup-fix-a-subtle-bug-in-descendant-pre-order-walk.patch 0002-cgroup-make-cgroup_is_removed-static.patch 0003-cgroup-add-cgroup-serial_nr-and-implement-cgroup_nex.patch 0004-cgroup-update-iterators-to-use-cgroup_next_sibling.patch 0005-device_cgroup-simplify-cgroup-tree-walk-in-propagate.patch 0001 fixes a subtle iteration bug. Will be applied to for-3.10-fixes. 0002 is a trivial prep patch. 0003 implements cgroup_next_sibling() which can find out the next sibling regardless of the state of the current cgroup. 0004 updates all iterators to use cgroup_next_sibling(). 0005 replaces iteration list work around in device_cgroup with direct iteration. This patchset is on top of cgroup/for-3.11 23958e729e ("cgroup.h: remove some functions that are now gone") and available in the following git branch. git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git review-interruptible-iter diffstat follows. include/linux/cgroup.h | 31 +++++++++++--- kernel/cgroup.c | 98 ++++++++++++++++++++++++++++++++++++++++------- security/device_cgroup.c | 56 ++++++++------------------ 3 files changed, 128 insertions(+), 57 deletions(-) Thanks. -- tejun ^ permalink raw reply [flat|nested] 21+ messages in thread
[parent not found: <1369101025-28335-1-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>]
* [PATCH 1/5] cgroup: fix a subtle bug in descendant pre-order walk [not found] ` <1369101025-28335-1-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> @ 2013-05-21 1:50 ` Tejun Heo [not found] ` <1369101025-28335-2-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> 2013-05-21 1:50 ` [PATCH 2/5] cgroup: make cgroup_is_removed() static Tejun Heo ` (5 subsequent siblings) 6 siblings, 1 reply; 21+ messages in thread From: Tejun Heo @ 2013-05-21 1:50 UTC (permalink / raw) To: lizefan-hv44wF8Li93QT0dZR+AlfA Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, cgroups-u79uwXL29TY76Z2rM5mHXA, aris-H+wXaHxf7aLQT0dZR+AlfA, serue-r/Jw6+rmf7HQT0dZR+AlfA, mhocko-AlSwsSmVLrQ, hannes-druUgvl0LCNAfugRpC6u6w, bsingharora-Re5JQEeQqe8AvxtiuMwx3w, kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A, Tejun Heo, stable-u79uwXL29TY76Z2rM5mHXA When cgroup_next_descendant_pre() initiates a walk, it checks whether the subtree root doesn't have any children and if not returns NULL. Later code assumes that the subtree isn't empty. This is broken because the subtree may become empty inbetween, which can lead to the traversal escaping the subtree by walking to the sibling of the subtree root. There's no reason to have the early exit path. Remove it along with the later assumption that the subtree isn't empty. This simplifies the code a bit and fixes the subtle bug. While at it, fix the comment of cgroup_for_each_descendant_pre() which was incorrectly referring to ->css_offline() instead of ->css_online(). Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> Cc: stable-u79uwXL29TY76Z2rM5mHXA@public.gmane.org --- include/linux/cgroup.h | 2 +- kernel/cgroup.c | 9 +++------ 2 files changed, 4 insertions(+), 7 deletions(-) diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h index 4f6f513..1df5f69 100644 --- a/include/linux/cgroup.h +++ b/include/linux/cgroup.h @@ -709,7 +709,7 @@ struct cgroup *cgroup_rightmost_descendant(struct cgroup *pos); * * If a subsystem synchronizes against the parent in its ->css_online() and * before starting iterating, and synchronizes against @pos on each - * iteration, any descendant cgroup which finished ->css_offline() is + * iteration, any descendant cgroup which finished ->css_online() is * guaranteed to be visible in the future iterations. * * In other words, the following guarantees that a descendant can't escape diff --git a/kernel/cgroup.c b/kernel/cgroup.c index 6b2b1d9..f20f80c 100644 --- a/kernel/cgroup.c +++ b/kernel/cgroup.c @@ -2990,11 +2990,8 @@ struct cgroup *cgroup_next_descendant_pre(struct cgroup *pos, WARN_ON_ONCE(!rcu_read_lock_held()); /* if first iteration, pretend we just visited @cgroup */ - if (!pos) { - if (list_empty(&cgroup->children)) - return NULL; + if (!pos) pos = cgroup; - } /* visit the first child if exists */ next = list_first_or_null_rcu(&pos->children, struct cgroup, sibling); @@ -3002,14 +2999,14 @@ struct cgroup *cgroup_next_descendant_pre(struct cgroup *pos, return next; /* no child, visit my or the closest ancestor's next sibling */ - do { + while (pos != cgroup) { next = list_entry_rcu(pos->sibling.next, struct cgroup, sibling); if (&next->sibling != &pos->parent->children) return next; pos = pos->parent; - } while (pos != cgroup); + } return NULL; } -- 1.8.1.4 ^ permalink raw reply related [flat|nested] 21+ messages in thread
[parent not found: <1369101025-28335-2-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>]
* Re: [PATCH 1/5] cgroup: fix a subtle bug in descendant pre-order walk [not found] ` <1369101025-28335-2-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> @ 2013-05-22 18:22 ` Michal Hocko 2013-05-24 1:51 ` Tejun Heo 1 sibling, 0 replies; 21+ messages in thread From: Michal Hocko @ 2013-05-22 18:22 UTC (permalink / raw) To: Tejun Heo Cc: lizefan-hv44wF8Li93QT0dZR+AlfA, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, cgroups-u79uwXL29TY76Z2rM5mHXA, aris-H+wXaHxf7aLQT0dZR+AlfA, serue-r/Jw6+rmf7HQT0dZR+AlfA, hannes-druUgvl0LCNAfugRpC6u6w, bsingharora-Re5JQEeQqe8AvxtiuMwx3w, kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A, stable-u79uwXL29TY76Z2rM5mHXA On Tue 21-05-13 10:50:21, Tejun Heo wrote: > When cgroup_next_descendant_pre() initiates a walk, it checks whether > the subtree root doesn't have any children and if not returns NULL. > Later code assumes that the subtree isn't empty. This is broken > because the subtree may become empty inbetween, which can lead to the > traversal escaping the subtree by walking to the sibling of the > subtree root. > > There's no reason to have the early exit path. Remove it along with > the later assumption that the subtree isn't empty. This simplifies > the code a bit and fixes the subtle bug. > > While at it, fix the comment of cgroup_for_each_descendant_pre() which > was incorrectly referring to ->css_offline() instead of > ->css_online(). > > Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> > Cc: stable-u79uwXL29TY76Z2rM5mHXA@public.gmane.org Well spotted and looks good to me Reviewed-by: Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org> > --- > include/linux/cgroup.h | 2 +- > kernel/cgroup.c | 9 +++------ > 2 files changed, 4 insertions(+), 7 deletions(-) > > diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h > index 4f6f513..1df5f69 100644 > --- a/include/linux/cgroup.h > +++ b/include/linux/cgroup.h > @@ -709,7 +709,7 @@ struct cgroup *cgroup_rightmost_descendant(struct cgroup *pos); > * > * If a subsystem synchronizes against the parent in its ->css_online() and > * before starting iterating, and synchronizes against @pos on each > - * iteration, any descendant cgroup which finished ->css_offline() is > + * iteration, any descendant cgroup which finished ->css_online() is > * guaranteed to be visible in the future iterations. > * > * In other words, the following guarantees that a descendant can't escape > diff --git a/kernel/cgroup.c b/kernel/cgroup.c > index 6b2b1d9..f20f80c 100644 > --- a/kernel/cgroup.c > +++ b/kernel/cgroup.c > @@ -2990,11 +2990,8 @@ struct cgroup *cgroup_next_descendant_pre(struct cgroup *pos, > WARN_ON_ONCE(!rcu_read_lock_held()); > > /* if first iteration, pretend we just visited @cgroup */ > - if (!pos) { > - if (list_empty(&cgroup->children)) > - return NULL; > + if (!pos) > pos = cgroup; > - } > > /* visit the first child if exists */ > next = list_first_or_null_rcu(&pos->children, struct cgroup, sibling); > @@ -3002,14 +2999,14 @@ struct cgroup *cgroup_next_descendant_pre(struct cgroup *pos, > return next; > > /* no child, visit my or the closest ancestor's next sibling */ > - do { > + while (pos != cgroup) { > next = list_entry_rcu(pos->sibling.next, struct cgroup, > sibling); > if (&next->sibling != &pos->parent->children) > return next; > > pos = pos->parent; > - } while (pos != cgroup); > + } > > return NULL; > } > -- > 1.8.1.4 > -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 1/5] cgroup: fix a subtle bug in descendant pre-order walk [not found] ` <1369101025-28335-2-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> 2013-05-22 18:22 ` Michal Hocko @ 2013-05-24 1:51 ` Tejun Heo 1 sibling, 0 replies; 21+ messages in thread From: Tejun Heo @ 2013-05-24 1:51 UTC (permalink / raw) To: lizefan-hv44wF8Li93QT0dZR+AlfA Cc: serue-r/Jw6+rmf7HQT0dZR+AlfA, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, stable-u79uwXL29TY76Z2rM5mHXA, mhocko-AlSwsSmVLrQ, hannes-druUgvl0LCNAfugRpC6u6w, cgroups-u79uwXL29TY76Z2rM5mHXA On Tue, May 21, 2013 at 10:50:21AM +0900, Tejun Heo wrote: > When cgroup_next_descendant_pre() initiates a walk, it checks whether > the subtree root doesn't have any children and if not returns NULL. > Later code assumes that the subtree isn't empty. This is broken > because the subtree may become empty inbetween, which can lead to the > traversal escaping the subtree by walking to the sibling of the > subtree root. > > There's no reason to have the early exit path. Remove it along with > the later assumption that the subtree isn't empty. This simplifies > the code a bit and fixes the subtle bug. > > While at it, fix the comment of cgroup_for_each_descendant_pre() which > was incorrectly referring to ->css_offline() instead of > ->css_online(). > > Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> > Cc: stable-u79uwXL29TY76Z2rM5mHXA@public.gmane.org Applied to cgroup/for-3.10-fixes. Thanks. -- tejun ^ permalink raw reply [flat|nested] 21+ messages in thread
* [PATCH 2/5] cgroup: make cgroup_is_removed() static [not found] ` <1369101025-28335-1-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> 2013-05-21 1:50 ` [PATCH 1/5] cgroup: fix a subtle bug in descendant pre-order walk Tejun Heo @ 2013-05-21 1:50 ` Tejun Heo [not found] ` <1369101025-28335-3-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> 2013-05-21 1:50 ` [PATCH 3/5] cgroup: add cgroup->serial_nr and implement cgroup_next_sibling() Tejun Heo ` (4 subsequent siblings) 6 siblings, 1 reply; 21+ messages in thread From: Tejun Heo @ 2013-05-21 1:50 UTC (permalink / raw) To: lizefan-hv44wF8Li93QT0dZR+AlfA Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, cgroups-u79uwXL29TY76Z2rM5mHXA, aris-H+wXaHxf7aLQT0dZR+AlfA, serue-r/Jw6+rmf7HQT0dZR+AlfA, mhocko-AlSwsSmVLrQ, hannes-druUgvl0LCNAfugRpC6u6w, bsingharora-Re5JQEeQqe8AvxtiuMwx3w, kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A, Tejun Heo cgroup_is_removed() no longer has external users and it shouldn't grow any - controllers should deal with cgroup_subsys_state on/offline state instead of cgroup removal state. Make it static. While at it, make it return bool. Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> --- include/linux/cgroup.h | 1 - kernel/cgroup.c | 2 +- 2 files changed, 1 insertion(+), 2 deletions(-) diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h index 1df5f69..8d9f3c9 100644 --- a/include/linux/cgroup.h +++ b/include/linux/cgroup.h @@ -538,7 +538,6 @@ static inline const char *cgroup_name(const struct cgroup *cgrp) int cgroup_add_cftypes(struct cgroup_subsys *ss, struct cftype *cfts); int cgroup_rm_cftypes(struct cgroup_subsys *ss, struct cftype *cfts); -int cgroup_is_removed(const struct cgroup *cgrp); bool cgroup_is_descendant(struct cgroup *cgrp, struct cgroup *ancestor); int cgroup_path(const struct cgroup *cgrp, char *buf, int buflen); diff --git a/kernel/cgroup.c b/kernel/cgroup.c index f20f80c..3222f93 100644 --- a/kernel/cgroup.c +++ b/kernel/cgroup.c @@ -226,7 +226,7 @@ static int css_refcnt(struct cgroup_subsys_state *css) } /* convenient tests for these bits */ -inline int cgroup_is_removed(const struct cgroup *cgrp) +static inline bool cgroup_is_removed(const struct cgroup *cgrp) { return test_bit(CGRP_REMOVED, &cgrp->flags); } -- 1.8.1.4 ^ permalink raw reply related [flat|nested] 21+ messages in thread
[parent not found: <1369101025-28335-3-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>]
* Re: [PATCH 2/5] cgroup: make cgroup_is_removed() static [not found] ` <1369101025-28335-3-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> @ 2013-05-24 1:56 ` Tejun Heo [not found] ` <20130524015613.GB19755-9pTldWuhBndy/B6EtB590w@public.gmane.org> 0 siblings, 1 reply; 21+ messages in thread From: Tejun Heo @ 2013-05-24 1:56 UTC (permalink / raw) To: lizefan-hv44wF8Li93QT0dZR+AlfA Cc: serue-r/Jw6+rmf7HQT0dZR+AlfA, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, mhocko-AlSwsSmVLrQ, hannes-druUgvl0LCNAfugRpC6u6w, cgroups-u79uwXL29TY76Z2rM5mHXA On Tue, May 21, 2013 at 10:50:22AM +0900, Tejun Heo wrote: > cgroup_is_removed() no longer has external users and it shouldn't grow > any - controllers should deal with cgroup_subsys_state on/offline > state instead of cgroup removal state. Make it static. > > While at it, make it return bool. > > Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> Applied 2-5 to cgroup/for-3.11. Thanks. -- tejun ^ permalink raw reply [flat|nested] 21+ messages in thread
[parent not found: <20130524015613.GB19755-9pTldWuhBndy/B6EtB590w@public.gmane.org>]
* Re: [PATCH 2/5] cgroup: make cgroup_is_removed() static [not found] ` <20130524015613.GB19755-9pTldWuhBndy/B6EtB590w@public.gmane.org> @ 2013-05-24 3:32 ` Li Zefan 0 siblings, 0 replies; 21+ messages in thread From: Li Zefan @ 2013-05-24 3:32 UTC (permalink / raw) To: Tejun Heo Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, cgroups-u79uwXL29TY76Z2rM5mHXA, aris-H+wXaHxf7aLQT0dZR+AlfA, serue-r/Jw6+rmf7HQT0dZR+AlfA, mhocko-AlSwsSmVLrQ, hannes-druUgvl0LCNAfugRpC6u6w, bsingharora-Re5JQEeQqe8AvxtiuMwx3w, kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A On 2013/5/24 9:56, Tejun Heo wrote: > On Tue, May 21, 2013 at 10:50:22AM +0900, Tejun Heo wrote: >> cgroup_is_removed() no longer has external users and it shouldn't grow >> any - controllers should deal with cgroup_subsys_state on/offline >> state instead of cgroup removal state. Make it static. >> >> While at it, make it return bool. >> >> Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> > > Applied 2-5 to cgroup/for-3.11. > Acked-by: Li Zefan <lizefan-hv44wF8Li93QT0dZR+AlfA@public.gmane.org> Has been kind of busy with internal work. I'll update the cpuset patchset to use this. ^ permalink raw reply [flat|nested] 21+ messages in thread
* [PATCH 3/5] cgroup: add cgroup->serial_nr and implement cgroup_next_sibling() [not found] ` <1369101025-28335-1-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> 2013-05-21 1:50 ` [PATCH 1/5] cgroup: fix a subtle bug in descendant pre-order walk Tejun Heo 2013-05-21 1:50 ` [PATCH 2/5] cgroup: make cgroup_is_removed() static Tejun Heo @ 2013-05-21 1:50 ` Tejun Heo [not found] ` <1369101025-28335-4-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> 2013-05-21 1:50 ` [PATCH 4/5] cgroup: update iterators to use cgroup_next_sibling() Tejun Heo ` (3 subsequent siblings) 6 siblings, 1 reply; 21+ messages in thread From: Tejun Heo @ 2013-05-21 1:50 UTC (permalink / raw) To: lizefan-hv44wF8Li93QT0dZR+AlfA Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, cgroups-u79uwXL29TY76Z2rM5mHXA, aris-H+wXaHxf7aLQT0dZR+AlfA, serue-r/Jw6+rmf7HQT0dZR+AlfA, mhocko-AlSwsSmVLrQ, hannes-druUgvl0LCNAfugRpC6u6w, bsingharora-Re5JQEeQqe8AvxtiuMwx3w, kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A, Tejun Heo Currently, there's no easy way to find out the next sibling cgroup unless it's known that the current cgroup is accessed from the parent's children list in a single RCU critical section. This in turn forces all iterators to require whole iteration to be enclosed in a single RCU critical section, which sometimes is too restrictive. This patch implements cgroup_next_sibling() which can reliably determine the next sibling regardless of the state of the current cgroup as long as it's accessible. It currently is impossible to determine the next sibling after dropping RCU read lock because the cgroup being iterated could be removed anytime and if RCU read lock is dropped, nothing guarantess its ->sibling.next pointer is accessible. A removed cgroup would continue to point to its next sibling for RCU accesses but stop receiving updates from the sibling. IOW, the next sibling could be removed and then complete its grace period while RCU read lock is dropped, making it unsafe to dereference ->sibling.next after dropping and re-acquiring RCU read lock. This can be solved by adding a way to traverse to the next sibling without dereferencing ->sibling.next. This patch adds a monotonically increasing cgroup serial number, cgroup->serial_nr, which guarantees that all cgroup->children lists are kept in increasing serial_nr order. A new function, cgroup_next_sibling(), is implemented, which, if CGRP_REMOVED is not set on the current cgroup, follows ->sibling.next; otherwise, traverses the parent's ->children list until it sees a sibling with higher ->serial_nr. This allows the function to always return the next sibling regardless of the state of the current cgroup without adding overhead in the fast path. Further patches will update the iterators to use cgroup_next_sibling() so that they allow dropping RCU read lock and blocking while iteration is in progress which in turn will be used to simplify controllers. Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> --- include/linux/cgroup.h | 10 ++++++++ kernel/cgroup.c | 62 ++++++++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 72 insertions(+) diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h index 8d9f3c9..ee041a0 100644 --- a/include/linux/cgroup.h +++ b/include/linux/cgroup.h @@ -189,6 +189,14 @@ struct cgroup { struct dentry *dentry; /* cgroup fs entry, RCU protected */ /* + * Monotonically increasing unique serial number which defines a + * uniform order among all cgroups. It's guaranteed that all + * ->children lists are in the ascending order of ->serial_nr. + * It's used to allow interrupting and resuming iterations. + */ + u64 serial_nr; + + /* * This is a copy of dentry->d_name, and it's needed because * we can't use dentry->d_name in cgroup_path(). * @@ -675,6 +683,8 @@ static inline struct cgroup* task_cgroup(struct task_struct *task, return task_subsys_state(task, subsys_id)->cgroup; } +struct cgroup *cgroup_next_sibling(struct cgroup *pos); + /** * cgroup_for_each_child - iterate through children of a cgroup * @pos: the cgroup * to use as the loop cursor diff --git a/kernel/cgroup.c b/kernel/cgroup.c index 3222f93..bc757d7 100644 --- a/kernel/cgroup.c +++ b/kernel/cgroup.c @@ -2975,6 +2975,55 @@ static void cgroup_enable_task_cg_lists(void) } /** + * cgroup_next_sibling - find the next sibling of a given cgroup + * @pos: the current cgroup + * + * This function returns the next sibling of @pos and should be called + * under RCU read lock. The only requirement is that @pos is accessible. + * The next sibling is guaranteed to be returned regardless of @pos's + * state. + */ +struct cgroup *cgroup_next_sibling(struct cgroup *pos) +{ + struct cgroup *next; + + WARN_ON_ONCE(!rcu_read_lock_held()); + + /* + * @pos could already have been removed. Once a cgroup is removed, + * its ->sibling.next is no longer updated when its next sibling + * changes. As CGRP_REMOVED is set on removal which is fully + * serialized, if we see it unasserted, it's guaranteed that the + * next sibling hasn't finished its grace period even if it's + * already removed, and thus safe to dereference from this RCU + * critical section. If ->sibling.next is inaccessible, + * cgroup_is_removed() is guaranteed to be visible as %true here. + */ + if (likely(!cgroup_is_removed(pos))) { + next = list_entry_rcu(pos->sibling.next, struct cgroup, sibling); + if (&next->sibling != &pos->parent->children) + return next; + return NULL; + } + + /* + * Can't dereference the next pointer. Each cgroup is given a + * monotonically increasing unique serial number and always + * appended to the sibling list, so the next one can be found by + * walking the parent's children until we see a cgroup with higher + * serial number than @pos's. + * + * While this path can be slow, it's taken only when either the + * current cgroup is removed or iteration and reomval race. + */ + list_for_each_entry_rcu(next, &pos->parent->children, sibling) + if (next->serial_nr > pos->serial_nr) + return next; + return NULL; +} +EXPORT_SYMBOL_GPL(cgroup_next_sibling); + +/** * cgroup_next_descendant_pre - find the next descendant for pre-order walk * @pos: the current position (%NULL to initiate traversal) * @cgroup: cgroup whose descendants to walk @@ -4136,6 +4185,7 @@ static void offline_css(struct cgroup_subsys *ss, struct cgroup *cgrp) static long cgroup_create(struct cgroup *parent, struct dentry *dentry, umode_t mode) { + static atomic64_t serial_nr_cursor = ATOMIC64_INIT(0); struct cgroup *cgrp; struct cgroup_name *name; struct cgroupfs_root *root = parent->root; @@ -4216,6 +4266,14 @@ static long cgroup_create(struct cgroup *parent, struct dentry *dentry, goto err_free_all; lockdep_assert_held(&dentry->d_inode->i_mutex); + /* + * Assign a monotonically increasing serial number. With the list + * appending below, it guarantees that sibling cgroups are always + * sorted in the ascending serial number order on the parent's + * ->children. + */ + cgrp->serial_nr = atomic64_inc_return(&serial_nr_cursor); + /* allocation complete, commit to creation */ list_add_tail(&cgrp->allcg_node, &root->allcg_list); list_add_tail_rcu(&cgrp->sibling, &cgrp->parent->children); @@ -4303,6 +4361,10 @@ static int cgroup_destroy_locked(struct cgroup *cgrp) * removed. This makes future css_tryget() and child creation * attempts fail thus maintaining the removal conditions verified * above. + * + * Note that CGRP_REMVOED clearing is depended upon by + * cgroup_next_sibling() to resume iteration after dropping RCU + * read lock. See cgroup_next_sibling() for details. */ for_each_subsys(cgrp->root, ss) { struct cgroup_subsys_state *css = cgrp->subsys[ss->subsys_id]; -- 1.8.1.4 ^ permalink raw reply related [flat|nested] 21+ messages in thread
[parent not found: <1369101025-28335-4-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>]
* Re: [PATCH 3/5] cgroup: add cgroup->serial_nr and implement cgroup_next_sibling() [not found] ` <1369101025-28335-4-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> @ 2013-05-21 14:33 ` Serge Hallyn 2013-05-22 14:36 ` Aristeu Rozanski 2013-05-22 18:41 ` Michal Hocko 2 siblings, 0 replies; 21+ messages in thread From: Serge Hallyn @ 2013-05-21 14:33 UTC (permalink / raw) To: Tejun Heo Cc: lizefan-hv44wF8Li93QT0dZR+AlfA, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, mhocko-AlSwsSmVLrQ, hannes-druUgvl0LCNAfugRpC6u6w, cgroups-u79uwXL29TY76Z2rM5mHXA Quoting Tejun Heo (tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org): > Currently, there's no easy way to find out the next sibling cgroup > unless it's known that the current cgroup is accessed from the > parent's children list in a single RCU critical section. This in turn > forces all iterators to require whole iteration to be enclosed in a > single RCU critical section, which sometimes is too restrictive. This > patch implements cgroup_next_sibling() which can reliably determine > the next sibling regardless of the state of the current cgroup as long > as it's accessible. > > It currently is impossible to determine the next sibling after > dropping RCU read lock because the cgroup being iterated could be > removed anytime and if RCU read lock is dropped, nothing guarantess > its ->sibling.next pointer is accessible. A removed cgroup would > continue to point to its next sibling for RCU accesses but stop > receiving updates from the sibling. IOW, the next sibling could be > removed and then complete its grace period while RCU read lock is > dropped, making it unsafe to dereference ->sibling.next after dropping > and re-acquiring RCU read lock. > > This can be solved by adding a way to traverse to the next sibling > without dereferencing ->sibling.next. This patch adds a monotonically > increasing cgroup serial number, cgroup->serial_nr, which guarantees > that all cgroup->children lists are kept in increasing serial_nr > order. A new function, cgroup_next_sibling(), is implemented, which, > if CGRP_REMOVED is not set on the current cgroup, follows > ->sibling.next; otherwise, traverses the parent's ->children list > until it sees a sibling with higher ->serial_nr. > > This allows the function to always return the next sibling regardless > of the state of the current cgroup without adding overhead in the fast > path. > > Further patches will update the iterators to use cgroup_next_sibling() > so that they allow dropping RCU read lock and blocking while iteration > is in progress which in turn will be used to simplify controllers. > > Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> Looks sensible, thanks. (ultra-nit: typo @ "reomval") Acked-by: Serge E. Hallyn <serge.hallyn-GeWIH/nMZzLQT0dZR+AlfA@public.gmane.org> > --- > include/linux/cgroup.h | 10 ++++++++ > kernel/cgroup.c | 62 ++++++++++++++++++++++++++++++++++++++++++++++++++ > 2 files changed, 72 insertions(+) > > diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h > index 8d9f3c9..ee041a0 100644 > --- a/include/linux/cgroup.h > +++ b/include/linux/cgroup.h > @@ -189,6 +189,14 @@ struct cgroup { > struct dentry *dentry; /* cgroup fs entry, RCU protected */ > > /* > + * Monotonically increasing unique serial number which defines a > + * uniform order among all cgroups. It's guaranteed that all > + * ->children lists are in the ascending order of ->serial_nr. > + * It's used to allow interrupting and resuming iterations. > + */ > + u64 serial_nr; > + > + /* > * This is a copy of dentry->d_name, and it's needed because > * we can't use dentry->d_name in cgroup_path(). > * > @@ -675,6 +683,8 @@ static inline struct cgroup* task_cgroup(struct task_struct *task, > return task_subsys_state(task, subsys_id)->cgroup; > } > > +struct cgroup *cgroup_next_sibling(struct cgroup *pos); > + > /** > * cgroup_for_each_child - iterate through children of a cgroup > * @pos: the cgroup * to use as the loop cursor > diff --git a/kernel/cgroup.c b/kernel/cgroup.c > index 3222f93..bc757d7 100644 > --- a/kernel/cgroup.c > +++ b/kernel/cgroup.c > @@ -2975,6 +2975,55 @@ static void cgroup_enable_task_cg_lists(void) > } > > /** > + * cgroup_next_sibling - find the next sibling of a given cgroup > + * @pos: the current cgroup > + * > + * This function returns the next sibling of @pos and should be called > + * under RCU read lock. The only requirement is that @pos is accessible. > + * The next sibling is guaranteed to be returned regardless of @pos's > + * state. > + */ > +struct cgroup *cgroup_next_sibling(struct cgroup *pos) > +{ > + struct cgroup *next; > + > + WARN_ON_ONCE(!rcu_read_lock_held()); > + > + /* > + * @pos could already have been removed. Once a cgroup is removed, > + * its ->sibling.next is no longer updated when its next sibling > + * changes. As CGRP_REMOVED is set on removal which is fully > + * serialized, if we see it unasserted, it's guaranteed that the > + * next sibling hasn't finished its grace period even if it's > + * already removed, and thus safe to dereference from this RCU > + * critical section. If ->sibling.next is inaccessible, > + * cgroup_is_removed() is guaranteed to be visible as %true here. > + */ > + if (likely(!cgroup_is_removed(pos))) { > + next = list_entry_rcu(pos->sibling.next, struct cgroup, sibling); > + if (&next->sibling != &pos->parent->children) > + return next; > + return NULL; > + } > + > + /* > + * Can't dereference the next pointer. Each cgroup is given a > + * monotonically increasing unique serial number and always > + * appended to the sibling list, so the next one can be found by > + * walking the parent's children until we see a cgroup with higher > + * serial number than @pos's. > + * > + * While this path can be slow, it's taken only when either the > + * current cgroup is removed or iteration and reomval race. > + */ > + list_for_each_entry_rcu(next, &pos->parent->children, sibling) > + if (next->serial_nr > pos->serial_nr) > + return next; > + return NULL; > +} > +EXPORT_SYMBOL_GPL(cgroup_next_sibling); > + > +/** > * cgroup_next_descendant_pre - find the next descendant for pre-order walk > * @pos: the current position (%NULL to initiate traversal) > * @cgroup: cgroup whose descendants to walk > @@ -4136,6 +4185,7 @@ static void offline_css(struct cgroup_subsys *ss, struct cgroup *cgrp) > static long cgroup_create(struct cgroup *parent, struct dentry *dentry, > umode_t mode) > { > + static atomic64_t serial_nr_cursor = ATOMIC64_INIT(0); > struct cgroup *cgrp; > struct cgroup_name *name; > struct cgroupfs_root *root = parent->root; > @@ -4216,6 +4266,14 @@ static long cgroup_create(struct cgroup *parent, struct dentry *dentry, > goto err_free_all; > lockdep_assert_held(&dentry->d_inode->i_mutex); > > + /* > + * Assign a monotonically increasing serial number. With the list > + * appending below, it guarantees that sibling cgroups are always > + * sorted in the ascending serial number order on the parent's > + * ->children. > + */ > + cgrp->serial_nr = atomic64_inc_return(&serial_nr_cursor); > + > /* allocation complete, commit to creation */ > list_add_tail(&cgrp->allcg_node, &root->allcg_list); > list_add_tail_rcu(&cgrp->sibling, &cgrp->parent->children); > @@ -4303,6 +4361,10 @@ static int cgroup_destroy_locked(struct cgroup *cgrp) > * removed. This makes future css_tryget() and child creation > * attempts fail thus maintaining the removal conditions verified > * above. > + * > + * Note that CGRP_REMVOED clearing is depended upon by > + * cgroup_next_sibling() to resume iteration after dropping RCU > + * read lock. See cgroup_next_sibling() for details. > */ > for_each_subsys(cgrp->root, ss) { > struct cgroup_subsys_state *css = cgrp->subsys[ss->subsys_id]; > -- > 1.8.1.4 > > _______________________________________________ > Containers mailing list > Containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org > https://lists.linuxfoundation.org/mailman/listinfo/containers ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 3/5] cgroup: add cgroup->serial_nr and implement cgroup_next_sibling() [not found] ` <1369101025-28335-4-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> 2013-05-21 14:33 ` Serge Hallyn @ 2013-05-22 14:36 ` Aristeu Rozanski [not found] ` <20130522143636.GC16739-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> 2013-05-22 18:41 ` Michal Hocko 2 siblings, 1 reply; 21+ messages in thread From: Aristeu Rozanski @ 2013-05-22 14:36 UTC (permalink / raw) To: Tejun Heo Cc: serue-r/Jw6+rmf7HQT0dZR+AlfA, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, mhocko-AlSwsSmVLrQ, hannes-druUgvl0LCNAfugRpC6u6w, cgroups-u79uwXL29TY76Z2rM5mHXA Hi Tejun, On Tue, May 21, 2013 at 10:50:23AM +0900, Tejun Heo wrote: > + * current cgroup is removed or iteration and reomval race. ^^^^^^^ small typo -- Aristeu ^ permalink raw reply [flat|nested] 21+ messages in thread
[parent not found: <20130522143636.GC16739-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>]
* Re: [PATCH 3/5] cgroup: add cgroup->serial_nr and implement cgroup_next_sibling() [not found] ` <20130522143636.GC16739-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> @ 2013-05-22 14:38 ` Tejun Heo 0 siblings, 0 replies; 21+ messages in thread From: Tejun Heo @ 2013-05-22 14:38 UTC (permalink / raw) To: Aristeu Rozanski Cc: serue-r/Jw6+rmf7HQT0dZR+AlfA, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, mhocko-AlSwsSmVLrQ, hannes-druUgvl0LCNAfugRpC6u6w, cgroups-u79uwXL29TY76Z2rM5mHXA On Wed, May 22, 2013 at 10:36:36AM -0400, Aristeu Rozanski wrote: > Hi Tejun, > On Tue, May 21, 2013 at 10:50:23AM +0900, Tejun Heo wrote: > > + * current cgroup is removed or iteration and reomval race. > ^^^^^^^ > small typo Yeap, Serge spotted it too. Updated. Thanks. -- tejun ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 3/5] cgroup: add cgroup->serial_nr and implement cgroup_next_sibling() [not found] ` <1369101025-28335-4-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> 2013-05-21 14:33 ` Serge Hallyn 2013-05-22 14:36 ` Aristeu Rozanski @ 2013-05-22 18:41 ` Michal Hocko 2 siblings, 0 replies; 21+ messages in thread From: Michal Hocko @ 2013-05-22 18:41 UTC (permalink / raw) To: Tejun Heo Cc: serue-r/Jw6+rmf7HQT0dZR+AlfA, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, hannes-druUgvl0LCNAfugRpC6u6w, cgroups-u79uwXL29TY76Z2rM5mHXA On Tue 21-05-13 10:50:23, Tejun Heo wrote: > Currently, there's no easy way to find out the next sibling cgroup > unless it's known that the current cgroup is accessed from the > parent's children list in a single RCU critical section. This in turn > forces all iterators to require whole iteration to be enclosed in a > single RCU critical section, which sometimes is too restrictive. This > patch implements cgroup_next_sibling() which can reliably determine > the next sibling regardless of the state of the current cgroup as long > as it's accessible. > > It currently is impossible to determine the next sibling after > dropping RCU read lock because the cgroup being iterated could be > removed anytime and if RCU read lock is dropped, nothing guarantess > its ->sibling.next pointer is accessible. A removed cgroup would > continue to point to its next sibling for RCU accesses but stop > receiving updates from the sibling. IOW, the next sibling could be > removed and then complete its grace period while RCU read lock is > dropped, making it unsafe to dereference ->sibling.next after dropping > and re-acquiring RCU read lock. > > This can be solved by adding a way to traverse to the next sibling > without dereferencing ->sibling.next. This patch adds a monotonically > increasing cgroup serial number, cgroup->serial_nr, which guarantees > that all cgroup->children lists are kept in increasing serial_nr > order. A new function, cgroup_next_sibling(), is implemented, which, > if CGRP_REMOVED is not set on the current cgroup, follows > ->sibling.next; otherwise, traverses the parent's ->children list > until it sees a sibling with higher ->serial_nr. > > This allows the function to always return the next sibling regardless > of the state of the current cgroup without adding overhead in the fast > path. > > Further patches will update the iterators to use cgroup_next_sibling() > so that they allow dropping RCU read lock and blocking while iteration > is in progress which in turn will be used to simplify controllers. > > Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> OK, I was about to object that the given pos could be freed already but I can see that the next patch documents that the caller has to make sure that pos will not go away (by elevating css count) before rcu is dropped. Reviewed-by: Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org> > --- > include/linux/cgroup.h | 10 ++++++++ > kernel/cgroup.c | 62 ++++++++++++++++++++++++++++++++++++++++++++++++++ > 2 files changed, 72 insertions(+) > > diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h > index 8d9f3c9..ee041a0 100644 > --- a/include/linux/cgroup.h > +++ b/include/linux/cgroup.h > @@ -189,6 +189,14 @@ struct cgroup { > struct dentry *dentry; /* cgroup fs entry, RCU protected */ > > /* > + * Monotonically increasing unique serial number which defines a > + * uniform order among all cgroups. It's guaranteed that all > + * ->children lists are in the ascending order of ->serial_nr. > + * It's used to allow interrupting and resuming iterations. > + */ > + u64 serial_nr; > + > + /* > * This is a copy of dentry->d_name, and it's needed because > * we can't use dentry->d_name in cgroup_path(). > * > @@ -675,6 +683,8 @@ static inline struct cgroup* task_cgroup(struct task_struct *task, > return task_subsys_state(task, subsys_id)->cgroup; > } > > +struct cgroup *cgroup_next_sibling(struct cgroup *pos); > + > /** > * cgroup_for_each_child - iterate through children of a cgroup > * @pos: the cgroup * to use as the loop cursor > diff --git a/kernel/cgroup.c b/kernel/cgroup.c > index 3222f93..bc757d7 100644 > --- a/kernel/cgroup.c > +++ b/kernel/cgroup.c > @@ -2975,6 +2975,55 @@ static void cgroup_enable_task_cg_lists(void) > } > > /** > + * cgroup_next_sibling - find the next sibling of a given cgroup > + * @pos: the current cgroup > + * > + * This function returns the next sibling of @pos and should be called > + * under RCU read lock. The only requirement is that @pos is accessible. > + * The next sibling is guaranteed to be returned regardless of @pos's > + * state. > + */ > +struct cgroup *cgroup_next_sibling(struct cgroup *pos) > +{ > + struct cgroup *next; > + > + WARN_ON_ONCE(!rcu_read_lock_held()); > + > + /* > + * @pos could already have been removed. Once a cgroup is removed, > + * its ->sibling.next is no longer updated when its next sibling > + * changes. As CGRP_REMOVED is set on removal which is fully > + * serialized, if we see it unasserted, it's guaranteed that the > + * next sibling hasn't finished its grace period even if it's > + * already removed, and thus safe to dereference from this RCU > + * critical section. If ->sibling.next is inaccessible, > + * cgroup_is_removed() is guaranteed to be visible as %true here. > + */ > + if (likely(!cgroup_is_removed(pos))) { > + next = list_entry_rcu(pos->sibling.next, struct cgroup, sibling); > + if (&next->sibling != &pos->parent->children) > + return next; > + return NULL; > + } > + > + /* > + * Can't dereference the next pointer. Each cgroup is given a > + * monotonically increasing unique serial number and always > + * appended to the sibling list, so the next one can be found by > + * walking the parent's children until we see a cgroup with higher > + * serial number than @pos's. > + * > + * While this path can be slow, it's taken only when either the > + * current cgroup is removed or iteration and reomval race. > + */ > + list_for_each_entry_rcu(next, &pos->parent->children, sibling) > + if (next->serial_nr > pos->serial_nr) > + return next; > + return NULL; > +} > +EXPORT_SYMBOL_GPL(cgroup_next_sibling); > + > +/** > * cgroup_next_descendant_pre - find the next descendant for pre-order walk > * @pos: the current position (%NULL to initiate traversal) > * @cgroup: cgroup whose descendants to walk > @@ -4136,6 +4185,7 @@ static void offline_css(struct cgroup_subsys *ss, struct cgroup *cgrp) > static long cgroup_create(struct cgroup *parent, struct dentry *dentry, > umode_t mode) > { > + static atomic64_t serial_nr_cursor = ATOMIC64_INIT(0); > struct cgroup *cgrp; > struct cgroup_name *name; > struct cgroupfs_root *root = parent->root; > @@ -4216,6 +4266,14 @@ static long cgroup_create(struct cgroup *parent, struct dentry *dentry, > goto err_free_all; > lockdep_assert_held(&dentry->d_inode->i_mutex); > > + /* > + * Assign a monotonically increasing serial number. With the list > + * appending below, it guarantees that sibling cgroups are always > + * sorted in the ascending serial number order on the parent's > + * ->children. > + */ > + cgrp->serial_nr = atomic64_inc_return(&serial_nr_cursor); > + > /* allocation complete, commit to creation */ > list_add_tail(&cgrp->allcg_node, &root->allcg_list); > list_add_tail_rcu(&cgrp->sibling, &cgrp->parent->children); > @@ -4303,6 +4361,10 @@ static int cgroup_destroy_locked(struct cgroup *cgrp) > * removed. This makes future css_tryget() and child creation > * attempts fail thus maintaining the removal conditions verified > * above. > + * > + * Note that CGRP_REMVOED clearing is depended upon by > + * cgroup_next_sibling() to resume iteration after dropping RCU > + * read lock. See cgroup_next_sibling() for details. > */ > for_each_subsys(cgrp->root, ss) { > struct cgroup_subsys_state *css = cgrp->subsys[ss->subsys_id]; > -- > 1.8.1.4 > -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 21+ messages in thread
* [PATCH 4/5] cgroup: update iterators to use cgroup_next_sibling() [not found] ` <1369101025-28335-1-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> ` (2 preceding siblings ...) 2013-05-21 1:50 ` [PATCH 3/5] cgroup: add cgroup->serial_nr and implement cgroup_next_sibling() Tejun Heo @ 2013-05-21 1:50 ` Tejun Heo [not found] ` <1369101025-28335-5-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> 2013-05-21 1:50 ` [PATCH 5/5] device_cgroup: simplify cgroup tree walk in propagate_exception() Tejun Heo ` (2 subsequent siblings) 6 siblings, 1 reply; 21+ messages in thread From: Tejun Heo @ 2013-05-21 1:50 UTC (permalink / raw) To: lizefan-hv44wF8Li93QT0dZR+AlfA Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, cgroups-u79uwXL29TY76Z2rM5mHXA, aris-H+wXaHxf7aLQT0dZR+AlfA, serue-r/Jw6+rmf7HQT0dZR+AlfA, mhocko-AlSwsSmVLrQ, hannes-druUgvl0LCNAfugRpC6u6w, bsingharora-Re5JQEeQqe8AvxtiuMwx3w, kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A, Tejun Heo This patch converts cgroup_for_each_child(), cgroup_next_descendant_pre/post() and thus cgroup_for_each_descendant_pre/post() to use cgroup_next_sibling() instead of manually dereferencing ->sibling.next. The only reason the iterators couldn't allow dropping RCU read lock while iteration is in progress was because they couldn't determine the next sibling safely once RCU read lock is dropped. Using cgroup_next_sibling() removes that problem and enables all iterators to allow dropping RCU read lock in the middle. Comments are updated accordingly. This makes the iterators easier to use and will simplify controllers. Note that @cgroup argument is renamed to @cgrp in cgroup_for_each_child() because it conflicts with "struct cgroup" used in the new macro body. Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> --- include/linux/cgroup.h | 18 ++++++++++++++---- kernel/cgroup.c | 25 +++++++++++++++++++------ 2 files changed, 33 insertions(+), 10 deletions(-) diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h index ee041a0..d0ad379 100644 --- a/include/linux/cgroup.h +++ b/include/linux/cgroup.h @@ -688,9 +688,9 @@ struct cgroup *cgroup_next_sibling(struct cgroup *pos); /** * cgroup_for_each_child - iterate through children of a cgroup * @pos: the cgroup * to use as the loop cursor - * @cgroup: cgroup whose children to walk + * @cgrp: cgroup whose children to walk * - * Walk @cgroup's children. Must be called under rcu_read_lock(). A child + * Walk @cgrp's children. Must be called under rcu_read_lock(). A child * cgroup which hasn't finished ->css_online() or already has finished * ->css_offline() may show up during traversal and it's each subsystem's * responsibility to verify that each @pos is alive. @@ -698,9 +698,15 @@ struct cgroup *cgroup_next_sibling(struct cgroup *pos); * If a subsystem synchronizes against the parent in its ->css_online() and * before starting iterating, a cgroup which finished ->css_online() is * guaranteed to be visible in the future iterations. + * + * It is allowed to temporarily drop RCU read lock during iteration. The + * caller is responsible for ensuring that @pos remains accessible until + * the start of the next iteration by, for example, bumping the css refcnt. */ -#define cgroup_for_each_child(pos, cgroup) \ - list_for_each_entry_rcu(pos, &(cgroup)->children, sibling) +#define cgroup_for_each_child(pos, cgrp) \ + for ((pos) = list_first_or_null_rcu(&(cgrp)->children, \ + struct cgroup, sibling); \ + (pos); (pos) = cgroup_next_sibling((pos))) struct cgroup *cgroup_next_descendant_pre(struct cgroup *pos, struct cgroup *cgroup); @@ -759,6 +765,10 @@ struct cgroup *cgroup_rightmost_descendant(struct cgroup *pos); * Alternatively, a subsystem may choose to use a single global lock to * synchronize ->css_online() and ->css_offline() against tree-walking * operations. + * + * It is allowed to temporarily drop RCU read lock during iteration. The + * caller is responsible for ensuring that @pos remains accessible until + * the start of the next iteration by, for example, bumping the css refcnt. */ #define cgroup_for_each_descendant_pre(pos, cgroup) \ for (pos = cgroup_next_descendant_pre(NULL, (cgroup)); (pos); \ diff --git a/kernel/cgroup.c b/kernel/cgroup.c index bc757d7..21b1ee4 100644 --- a/kernel/cgroup.c +++ b/kernel/cgroup.c @@ -3030,6 +3030,11 @@ EXPORT_SYMBOL_GPL(cgroup_next_sibling); * * To be used by cgroup_for_each_descendant_pre(). Find the next * descendant to visit for pre-order traversal of @cgroup's descendants. + * + * While this function requires RCU read locking, it doesn't require the + * whole traversal to be contained in a single RCU critical section. This + * function will return the correct next descendant as long as both @pos + * and @cgroup are accessible and @pos is a descendant of @cgroup. */ struct cgroup *cgroup_next_descendant_pre(struct cgroup *pos, struct cgroup *cgroup) @@ -3049,11 +3054,9 @@ struct cgroup *cgroup_next_descendant_pre(struct cgroup *pos, /* no child, visit my or the closest ancestor's next sibling */ while (pos != cgroup) { - next = list_entry_rcu(pos->sibling.next, struct cgroup, - sibling); - if (&next->sibling != &pos->parent->children) + next = cgroup_next_sibling(pos); + if (next) return next; - pos = pos->parent; } @@ -3068,6 +3071,11 @@ EXPORT_SYMBOL_GPL(cgroup_next_descendant_pre); * Return the rightmost descendant of @pos. If there's no descendant, * @pos is returned. This can be used during pre-order traversal to skip * subtree of @pos. + * + * While this function requires RCU read locking, it doesn't require the + * whole traversal to be contained in a single RCU critical section. This + * function will return the correct rightmost descendant as long as @pos is + * accessible. */ struct cgroup *cgroup_rightmost_descendant(struct cgroup *pos) { @@ -3107,6 +3115,11 @@ static struct cgroup *cgroup_leftmost_descendant(struct cgroup *pos) * * To be used by cgroup_for_each_descendant_post(). Find the next * descendant to visit for post-order traversal of @cgroup's descendants. + * + * While this function requires RCU read locking, it doesn't require the + * whole traversal to be contained in a single RCU critical section. This + * function will return the correct next descendant as long as both @pos + * and @cgroup are accessible and @pos is a descendant of @cgroup. */ struct cgroup *cgroup_next_descendant_post(struct cgroup *pos, struct cgroup *cgroup) @@ -3122,8 +3135,8 @@ struct cgroup *cgroup_next_descendant_post(struct cgroup *pos, } /* if there's an unvisited sibling, visit its leftmost descendant */ - next = list_entry_rcu(pos->sibling.next, struct cgroup, sibling); - if (&next->sibling != &pos->parent->children) + next = cgroup_next_sibling(pos); + if (next) return cgroup_leftmost_descendant(next); /* no sibling left, visit parent */ -- 1.8.1.4 ^ permalink raw reply related [flat|nested] 21+ messages in thread
[parent not found: <1369101025-28335-5-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>]
* Re: [PATCH 4/5] cgroup: update iterators to use cgroup_next_sibling() [not found] ` <1369101025-28335-5-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> @ 2013-05-21 22:31 ` Serge Hallyn 2013-05-22 9:09 ` Li Zefan 2013-05-22 18:46 ` Michal Hocko 2 siblings, 0 replies; 21+ messages in thread From: Serge Hallyn @ 2013-05-21 22:31 UTC (permalink / raw) To: Tejun Heo Cc: lizefan-hv44wF8Li93QT0dZR+AlfA, serue-r/Jw6+rmf7HQT0dZR+AlfA, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, mhocko-AlSwsSmVLrQ, hannes-druUgvl0LCNAfugRpC6u6w, cgroups-u79uwXL29TY76Z2rM5mHXA Quoting Tejun Heo (tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org): > This patch converts cgroup_for_each_child(), > cgroup_next_descendant_pre/post() and thus > cgroup_for_each_descendant_pre/post() to use cgroup_next_sibling() > instead of manually dereferencing ->sibling.next. > > The only reason the iterators couldn't allow dropping RCU read lock > while iteration is in progress was because they couldn't determine the > next sibling safely once RCU read lock is dropped. Using > cgroup_next_sibling() removes that problem and enables all iterators > to allow dropping RCU read lock in the middle. Comments are updated > accordingly. > > This makes the iterators easier to use and will simplify controllers. > > Note that @cgroup argument is renamed to @cgrp in > cgroup_for_each_child() because it conflicts with "struct cgroup" used > in the new macro body. > > Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> I didn't test, but it looks correct, thanks. Acked-by: Serge E. Hallyn <serge.hallyn-GeWIH/nMZzLQT0dZR+AlfA@public.gmane.org> > --- > include/linux/cgroup.h | 18 ++++++++++++++---- > kernel/cgroup.c | 25 +++++++++++++++++++------ > 2 files changed, 33 insertions(+), 10 deletions(-) > > diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h > index ee041a0..d0ad379 100644 > --- a/include/linux/cgroup.h > +++ b/include/linux/cgroup.h > @@ -688,9 +688,9 @@ struct cgroup *cgroup_next_sibling(struct cgroup *pos); > /** > * cgroup_for_each_child - iterate through children of a cgroup > * @pos: the cgroup * to use as the loop cursor > - * @cgroup: cgroup whose children to walk > + * @cgrp: cgroup whose children to walk > * > - * Walk @cgroup's children. Must be called under rcu_read_lock(). A child > + * Walk @cgrp's children. Must be called under rcu_read_lock(). A child > * cgroup which hasn't finished ->css_online() or already has finished > * ->css_offline() may show up during traversal and it's each subsystem's > * responsibility to verify that each @pos is alive. > @@ -698,9 +698,15 @@ struct cgroup *cgroup_next_sibling(struct cgroup *pos); > * If a subsystem synchronizes against the parent in its ->css_online() and > * before starting iterating, a cgroup which finished ->css_online() is > * guaranteed to be visible in the future iterations. > + * > + * It is allowed to temporarily drop RCU read lock during iteration. The > + * caller is responsible for ensuring that @pos remains accessible until > + * the start of the next iteration by, for example, bumping the css refcnt. > */ > -#define cgroup_for_each_child(pos, cgroup) \ > - list_for_each_entry_rcu(pos, &(cgroup)->children, sibling) > +#define cgroup_for_each_child(pos, cgrp) \ > + for ((pos) = list_first_or_null_rcu(&(cgrp)->children, \ > + struct cgroup, sibling); \ > + (pos); (pos) = cgroup_next_sibling((pos))) > > struct cgroup *cgroup_next_descendant_pre(struct cgroup *pos, > struct cgroup *cgroup); > @@ -759,6 +765,10 @@ struct cgroup *cgroup_rightmost_descendant(struct cgroup *pos); > * Alternatively, a subsystem may choose to use a single global lock to > * synchronize ->css_online() and ->css_offline() against tree-walking > * operations. > + * > + * It is allowed to temporarily drop RCU read lock during iteration. The > + * caller is responsible for ensuring that @pos remains accessible until > + * the start of the next iteration by, for example, bumping the css refcnt. > */ > #define cgroup_for_each_descendant_pre(pos, cgroup) \ > for (pos = cgroup_next_descendant_pre(NULL, (cgroup)); (pos); \ > diff --git a/kernel/cgroup.c b/kernel/cgroup.c > index bc757d7..21b1ee4 100644 > --- a/kernel/cgroup.c > +++ b/kernel/cgroup.c > @@ -3030,6 +3030,11 @@ EXPORT_SYMBOL_GPL(cgroup_next_sibling); > * > * To be used by cgroup_for_each_descendant_pre(). Find the next > * descendant to visit for pre-order traversal of @cgroup's descendants. > + * > + * While this function requires RCU read locking, it doesn't require the > + * whole traversal to be contained in a single RCU critical section. This > + * function will return the correct next descendant as long as both @pos > + * and @cgroup are accessible and @pos is a descendant of @cgroup. > */ > struct cgroup *cgroup_next_descendant_pre(struct cgroup *pos, > struct cgroup *cgroup) > @@ -3049,11 +3054,9 @@ struct cgroup *cgroup_next_descendant_pre(struct cgroup *pos, > > /* no child, visit my or the closest ancestor's next sibling */ > while (pos != cgroup) { > - next = list_entry_rcu(pos->sibling.next, struct cgroup, > - sibling); > - if (&next->sibling != &pos->parent->children) > + next = cgroup_next_sibling(pos); > + if (next) > return next; > - > pos = pos->parent; > } > > @@ -3068,6 +3071,11 @@ EXPORT_SYMBOL_GPL(cgroup_next_descendant_pre); > * Return the rightmost descendant of @pos. If there's no descendant, > * @pos is returned. This can be used during pre-order traversal to skip > * subtree of @pos. > + * > + * While this function requires RCU read locking, it doesn't require the > + * whole traversal to be contained in a single RCU critical section. This > + * function will return the correct rightmost descendant as long as @pos is > + * accessible. > */ > struct cgroup *cgroup_rightmost_descendant(struct cgroup *pos) > { > @@ -3107,6 +3115,11 @@ static struct cgroup *cgroup_leftmost_descendant(struct cgroup *pos) > * > * To be used by cgroup_for_each_descendant_post(). Find the next > * descendant to visit for post-order traversal of @cgroup's descendants. > + * > + * While this function requires RCU read locking, it doesn't require the > + * whole traversal to be contained in a single RCU critical section. This > + * function will return the correct next descendant as long as both @pos > + * and @cgroup are accessible and @pos is a descendant of @cgroup. > */ > struct cgroup *cgroup_next_descendant_post(struct cgroup *pos, > struct cgroup *cgroup) > @@ -3122,8 +3135,8 @@ struct cgroup *cgroup_next_descendant_post(struct cgroup *pos, > } > > /* if there's an unvisited sibling, visit its leftmost descendant */ > - next = list_entry_rcu(pos->sibling.next, struct cgroup, sibling); > - if (&next->sibling != &pos->parent->children) > + next = cgroup_next_sibling(pos); > + if (next) > return cgroup_leftmost_descendant(next); > > /* no sibling left, visit parent */ > -- > 1.8.1.4 > > _______________________________________________ > Containers mailing list > Containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org > https://lists.linuxfoundation.org/mailman/listinfo/containers ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 4/5] cgroup: update iterators to use cgroup_next_sibling() [not found] ` <1369101025-28335-5-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> 2013-05-21 22:31 ` Serge Hallyn @ 2013-05-22 9:09 ` Li Zefan [not found] ` <519C8B2E.5040606-hv44wF8Li93QT0dZR+AlfA@public.gmane.org> 2013-05-22 18:46 ` Michal Hocko 2 siblings, 1 reply; 21+ messages in thread From: Li Zefan @ 2013-05-22 9:09 UTC (permalink / raw) To: Tejun Heo Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, cgroups-u79uwXL29TY76Z2rM5mHXA, aris-H+wXaHxf7aLQT0dZR+AlfA, serue-r/Jw6+rmf7HQT0dZR+AlfA, mhocko-AlSwsSmVLrQ, hannes-druUgvl0LCNAfugRpC6u6w, bsingharora-Re5JQEeQqe8AvxtiuMwx3w, kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A > @@ -3068,6 +3071,11 @@ EXPORT_SYMBOL_GPL(cgroup_next_descendant_pre); > * Return the rightmost descendant of @pos. If there's no descendant, > * @pos is returned. This can be used during pre-order traversal to skip > * subtree of @pos. > + * > + * While this function requires RCU read locking, it doesn't require the > + * whole traversal to be contained in a single RCU critical section. This > + * function will return the correct rightmost descendant as long as @pos is > + * accessible. > */ > struct cgroup *cgroup_rightmost_descendant(struct cgroup *pos) > { Forgot to convert cgroup_rightmost_descendat() to use cgroup_next_sibling()? do { last = pos; /* ->prev isn't RCU safe, walk ->next till the end */ pos = NULL; list_for_each_entry_rcu(tmp, &last->children, sibling) pos = tmp; } while (pos); ^ permalink raw reply [flat|nested] 21+ messages in thread
[parent not found: <519C8B2E.5040606-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>]
* Re: [PATCH 4/5] cgroup: update iterators to use cgroup_next_sibling() [not found] ` <519C8B2E.5040606-hv44wF8Li93QT0dZR+AlfA@public.gmane.org> @ 2013-05-22 9:17 ` Tejun Heo 0 siblings, 0 replies; 21+ messages in thread From: Tejun Heo @ 2013-05-22 9:17 UTC (permalink / raw) To: Li Zefan Cc: serue-r/Jw6+rmf7HQT0dZR+AlfA, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, mhocko-AlSwsSmVLrQ, hannes-druUgvl0LCNAfugRpC6u6w, cgroups-u79uwXL29TY76Z2rM5mHXA On Wed, May 22, 2013 at 05:09:02PM +0800, Li Zefan wrote: > > @@ -3068,6 +3071,11 @@ EXPORT_SYMBOL_GPL(cgroup_next_descendant_pre); > > * Return the rightmost descendant of @pos. If there's no descendant, > > * @pos is returned. This can be used during pre-order traversal to skip > > * subtree of @pos. > > + * > > + * While this function requires RCU read locking, it doesn't require the > > + * whole traversal to be contained in a single RCU critical section. This > > + * function will return the correct rightmost descendant as long as @pos is > > + * accessible. > > */ > > struct cgroup *cgroup_rightmost_descendant(struct cgroup *pos) > > { > > Forgot to convert cgroup_rightmost_descendat() to use cgroup_next_sibling()? > > do { > last = pos; > /* ->prev isn't RCU safe, walk ->next till the end */ > pos = NULL; > list_for_each_entry_rcu(tmp, &last->children, sibling) > pos = tmp; > } while (pos); It's always walking from the parent->children with RCU read locked, so it doesn't need to be converted. We can still convert it for consistency and I had that in the patch originally but dropped it as the conversion seemed a bit misleading as it covers a case which can never happen. Hmmm.... Thanks. -- tejun ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 4/5] cgroup: update iterators to use cgroup_next_sibling() [not found] ` <1369101025-28335-5-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> 2013-05-21 22:31 ` Serge Hallyn 2013-05-22 9:09 ` Li Zefan @ 2013-05-22 18:46 ` Michal Hocko 2 siblings, 0 replies; 21+ messages in thread From: Michal Hocko @ 2013-05-22 18:46 UTC (permalink / raw) To: Tejun Heo Cc: serue-r/Jw6+rmf7HQT0dZR+AlfA, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, hannes-druUgvl0LCNAfugRpC6u6w, cgroups-u79uwXL29TY76Z2rM5mHXA On Tue 21-05-13 10:50:24, Tejun Heo wrote: > This patch converts cgroup_for_each_child(), > cgroup_next_descendant_pre/post() and thus > cgroup_for_each_descendant_pre/post() to use cgroup_next_sibling() > instead of manually dereferencing ->sibling.next. > > The only reason the iterators couldn't allow dropping RCU read lock > while iteration is in progress was because they couldn't determine the > next sibling safely once RCU read lock is dropped. Using > cgroup_next_sibling() removes that problem and enables all iterators > to allow dropping RCU read lock in the middle. Comments are updated > accordingly. > > This makes the iterators easier to use and will simplify controllers. > > Note that @cgroup argument is renamed to @cgrp in > cgroup_for_each_child() because it conflicts with "struct cgroup" used > in the new macro body. > > Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> Looks good to me Reviewed-by: Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org> > --- > include/linux/cgroup.h | 18 ++++++++++++++---- > kernel/cgroup.c | 25 +++++++++++++++++++------ > 2 files changed, 33 insertions(+), 10 deletions(-) > > diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h > index ee041a0..d0ad379 100644 > --- a/include/linux/cgroup.h > +++ b/include/linux/cgroup.h > @@ -688,9 +688,9 @@ struct cgroup *cgroup_next_sibling(struct cgroup *pos); > /** > * cgroup_for_each_child - iterate through children of a cgroup > * @pos: the cgroup * to use as the loop cursor > - * @cgroup: cgroup whose children to walk > + * @cgrp: cgroup whose children to walk > * > - * Walk @cgroup's children. Must be called under rcu_read_lock(). A child > + * Walk @cgrp's children. Must be called under rcu_read_lock(). A child > * cgroup which hasn't finished ->css_online() or already has finished > * ->css_offline() may show up during traversal and it's each subsystem's > * responsibility to verify that each @pos is alive. > @@ -698,9 +698,15 @@ struct cgroup *cgroup_next_sibling(struct cgroup *pos); > * If a subsystem synchronizes against the parent in its ->css_online() and > * before starting iterating, a cgroup which finished ->css_online() is > * guaranteed to be visible in the future iterations. > + * > + * It is allowed to temporarily drop RCU read lock during iteration. The > + * caller is responsible for ensuring that @pos remains accessible until > + * the start of the next iteration by, for example, bumping the css refcnt. > */ > -#define cgroup_for_each_child(pos, cgroup) \ > - list_for_each_entry_rcu(pos, &(cgroup)->children, sibling) > +#define cgroup_for_each_child(pos, cgrp) \ > + for ((pos) = list_first_or_null_rcu(&(cgrp)->children, \ > + struct cgroup, sibling); \ > + (pos); (pos) = cgroup_next_sibling((pos))) > > struct cgroup *cgroup_next_descendant_pre(struct cgroup *pos, > struct cgroup *cgroup); > @@ -759,6 +765,10 @@ struct cgroup *cgroup_rightmost_descendant(struct cgroup *pos); > * Alternatively, a subsystem may choose to use a single global lock to > * synchronize ->css_online() and ->css_offline() against tree-walking > * operations. > + * > + * It is allowed to temporarily drop RCU read lock during iteration. The > + * caller is responsible for ensuring that @pos remains accessible until > + * the start of the next iteration by, for example, bumping the css refcnt. > */ > #define cgroup_for_each_descendant_pre(pos, cgroup) \ > for (pos = cgroup_next_descendant_pre(NULL, (cgroup)); (pos); \ > diff --git a/kernel/cgroup.c b/kernel/cgroup.c > index bc757d7..21b1ee4 100644 > --- a/kernel/cgroup.c > +++ b/kernel/cgroup.c > @@ -3030,6 +3030,11 @@ EXPORT_SYMBOL_GPL(cgroup_next_sibling); > * > * To be used by cgroup_for_each_descendant_pre(). Find the next > * descendant to visit for pre-order traversal of @cgroup's descendants. > + * > + * While this function requires RCU read locking, it doesn't require the > + * whole traversal to be contained in a single RCU critical section. This > + * function will return the correct next descendant as long as both @pos > + * and @cgroup are accessible and @pos is a descendant of @cgroup. > */ > struct cgroup *cgroup_next_descendant_pre(struct cgroup *pos, > struct cgroup *cgroup) > @@ -3049,11 +3054,9 @@ struct cgroup *cgroup_next_descendant_pre(struct cgroup *pos, > > /* no child, visit my or the closest ancestor's next sibling */ > while (pos != cgroup) { > - next = list_entry_rcu(pos->sibling.next, struct cgroup, > - sibling); > - if (&next->sibling != &pos->parent->children) > + next = cgroup_next_sibling(pos); > + if (next) > return next; > - > pos = pos->parent; > } > > @@ -3068,6 +3071,11 @@ EXPORT_SYMBOL_GPL(cgroup_next_descendant_pre); > * Return the rightmost descendant of @pos. If there's no descendant, > * @pos is returned. This can be used during pre-order traversal to skip > * subtree of @pos. > + * > + * While this function requires RCU read locking, it doesn't require the > + * whole traversal to be contained in a single RCU critical section. This > + * function will return the correct rightmost descendant as long as @pos is > + * accessible. > */ > struct cgroup *cgroup_rightmost_descendant(struct cgroup *pos) > { > @@ -3107,6 +3115,11 @@ static struct cgroup *cgroup_leftmost_descendant(struct cgroup *pos) > * > * To be used by cgroup_for_each_descendant_post(). Find the next > * descendant to visit for post-order traversal of @cgroup's descendants. > + * > + * While this function requires RCU read locking, it doesn't require the > + * whole traversal to be contained in a single RCU critical section. This > + * function will return the correct next descendant as long as both @pos > + * and @cgroup are accessible and @pos is a descendant of @cgroup. > */ > struct cgroup *cgroup_next_descendant_post(struct cgroup *pos, > struct cgroup *cgroup) > @@ -3122,8 +3135,8 @@ struct cgroup *cgroup_next_descendant_post(struct cgroup *pos, > } > > /* if there's an unvisited sibling, visit its leftmost descendant */ > - next = list_entry_rcu(pos->sibling.next, struct cgroup, sibling); > - if (&next->sibling != &pos->parent->children) > + next = cgroup_next_sibling(pos); > + if (next) > return cgroup_leftmost_descendant(next); > > /* no sibling left, visit parent */ > -- > 1.8.1.4 > > -- > To unsubscribe from this list: send the line "unsubscribe cgroups" in > the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 21+ messages in thread
* [PATCH 5/5] device_cgroup: simplify cgroup tree walk in propagate_exception() [not found] ` <1369101025-28335-1-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> ` (3 preceding siblings ...) 2013-05-21 1:50 ` [PATCH 4/5] cgroup: update iterators to use cgroup_next_sibling() Tejun Heo @ 2013-05-21 1:50 ` Tejun Heo [not found] ` <1369101025-28335-6-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> 2013-05-21 3:20 ` [PATCHSET] cgroup: allow dropping RCU read lock while iterating Tejun Heo 2013-05-22 14:53 ` Aristeu Rozanski 6 siblings, 1 reply; 21+ messages in thread From: Tejun Heo @ 2013-05-21 1:50 UTC (permalink / raw) To: lizefan-hv44wF8Li93QT0dZR+AlfA Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, cgroups-u79uwXL29TY76Z2rM5mHXA, aris-H+wXaHxf7aLQT0dZR+AlfA, serue-r/Jw6+rmf7HQT0dZR+AlfA, mhocko-AlSwsSmVLrQ, hannes-druUgvl0LCNAfugRpC6u6w, bsingharora-Re5JQEeQqe8AvxtiuMwx3w, kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A, Tejun Heo During a config change, propagate_exception() needs to traverse the subtree to update config on the subtree. Because such config updates need to allocate memory, it couldn't directly use cgroup_for_each_descendant_pre() which required the whole iteration to be contained in a single RCU read critical section. To work around the limitation, propagate_exception() built a linked list of descendant cgroups while read-locking RCU and then walked the list afterwards, which is safe as the whole iteration is protected by devcgroup_mutex. This works but is cumbersome. With the recent updates, cgroup iterators now allow dropping RCU read lock while iteration is in progress making this workaround no longer necessary. This patch replaces dev_cgroup->propagate_pending list and get_online_devcg() with direct cgroup_for_each_descendant_pre() walk. Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> Cc: Aristeu Rozanski <aris-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> Cc: Serge E. Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> --- security/device_cgroup.c | 56 ++++++++++++++++-------------------------------- 1 file changed, 18 insertions(+), 38 deletions(-) diff --git a/security/device_cgroup.c b/security/device_cgroup.c index dd0dc57..e8aad69 100644 --- a/security/device_cgroup.c +++ b/security/device_cgroup.c @@ -49,8 +49,6 @@ struct dev_cgroup { struct cgroup_subsys_state css; struct list_head exceptions; enum devcg_behavior behavior; - /* temporary list for pending propagation operations */ - struct list_head propagate_pending; }; static inline struct dev_cgroup *css_to_devcgroup(struct cgroup_subsys_state *s) @@ -241,7 +239,6 @@ static struct cgroup_subsys_state *devcgroup_css_alloc(struct cgroup *cgroup) if (!dev_cgroup) return ERR_PTR(-ENOMEM); INIT_LIST_HEAD(&dev_cgroup->exceptions); - INIT_LIST_HEAD(&dev_cgroup->propagate_pending); dev_cgroup->behavior = DEVCG_DEFAULT_NONE; return &dev_cgroup->css; @@ -445,34 +442,6 @@ static void revalidate_active_exceptions(struct dev_cgroup *devcg) } /** - * get_online_devcg - walks the cgroup tree and fills a list with the online - * groups - * @root: cgroup used as starting point - * @online: list that will be filled with online groups - * - * Must be called with devcgroup_mutex held. Grabs RCU lock. - * Because devcgroup_mutex is held, no devcg will become online or offline - * during the tree walk (see devcgroup_online, devcgroup_offline) - * A separated list is needed because propagate_behavior() and - * propagate_exception() need to allocate memory and can block. - */ -static void get_online_devcg(struct cgroup *root, struct list_head *online) -{ - struct cgroup *pos; - struct dev_cgroup *devcg; - - lockdep_assert_held(&devcgroup_mutex); - - rcu_read_lock(); - cgroup_for_each_descendant_pre(pos, root) { - devcg = cgroup_to_devcgroup(pos); - if (is_devcg_online(devcg)) - list_add_tail(&devcg->propagate_pending, online); - } - rcu_read_unlock(); -} - -/** * propagate_exception - propagates a new exception to the children * @devcg_root: device cgroup that added a new exception * @ex: new exception to be propagated @@ -482,15 +451,24 @@ static void get_online_devcg(struct cgroup *root, struct list_head *online) static int propagate_exception(struct dev_cgroup *devcg_root, struct dev_exception_item *ex) { - struct cgroup *root = devcg_root->css.cgroup; - struct dev_cgroup *devcg, *parent, *tmp; + struct cgroup *root = devcg_root->css.cgroup, *pos; int rc = 0; - LIST_HEAD(pending); - get_online_devcg(root, &pending); + rcu_read_lock(); - list_for_each_entry_safe(devcg, tmp, &pending, propagate_pending) { - parent = cgroup_to_devcgroup(devcg->css.cgroup->parent); + cgroup_for_each_descendant_pre(pos, root) { + struct dev_cgroup *devcg = cgroup_to_devcgroup(pos); + + /* + * Because devcgroup_mutex is held, no devcg will become + * online or offline during the tree walk (see on/offline + * methods), and online ones are safe to access outside RCU + * read lock without bumping refcnt. + */ + if (!is_devcg_online(devcg)) + continue; + + rcu_read_unlock(); /* * in case both root's behavior and devcg is allow, a new @@ -512,8 +490,10 @@ static int propagate_exception(struct dev_cgroup *devcg_root, } revalidate_active_exceptions(devcg); - list_del_init(&devcg->propagate_pending); + rcu_read_lock(); } + + rcu_read_unlock(); return rc; } -- 1.8.1.4 ^ permalink raw reply related [flat|nested] 21+ messages in thread
[parent not found: <1369101025-28335-6-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>]
* Re: [PATCH 5/5] device_cgroup: simplify cgroup tree walk in propagate_exception() [not found] ` <1369101025-28335-6-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> @ 2013-05-21 22:35 ` Serge Hallyn 0 siblings, 0 replies; 21+ messages in thread From: Serge Hallyn @ 2013-05-21 22:35 UTC (permalink / raw) To: Tejun Heo Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, hannes-druUgvl0LCNAfugRpC6u6w, mhocko-AlSwsSmVLrQ, cgroups-u79uwXL29TY76Z2rM5mHXA, serue-r/Jw6+rmf7HQT0dZR+AlfA Quoting Tejun Heo (tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org): > During a config change, propagate_exception() needs to traverse the > subtree to update config on the subtree. Because such config updates > need to allocate memory, it couldn't directly use > cgroup_for_each_descendant_pre() which required the whole iteration to > be contained in a single RCU read critical section. To work around > the limitation, propagate_exception() built a linked list of > descendant cgroups while read-locking RCU and then walked the list > afterwards, which is safe as the whole iteration is protected by > devcgroup_mutex. This works but is cumbersome. > > With the recent updates, cgroup iterators now allow dropping RCU read > lock while iteration is in progress making this workaround no longer > necessary. This patch replaces dev_cgroup->propagate_pending list and > get_online_devcg() with direct cgroup_for_each_descendant_pre() walk. > > Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> > Cc: Aristeu Rozanski <aris-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> > Cc: Serge E. Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> Acked-by: Serge E. Hallyn <serge.hallyn-GeWIH/nMZzLQT0dZR+AlfA@public.gmane.org> > --- > security/device_cgroup.c | 56 ++++++++++++++++-------------------------------- > 1 file changed, 18 insertions(+), 38 deletions(-) > > diff --git a/security/device_cgroup.c b/security/device_cgroup.c > index dd0dc57..e8aad69 100644 > --- a/security/device_cgroup.c > +++ b/security/device_cgroup.c > @@ -49,8 +49,6 @@ struct dev_cgroup { > struct cgroup_subsys_state css; > struct list_head exceptions; > enum devcg_behavior behavior; > - /* temporary list for pending propagation operations */ > - struct list_head propagate_pending; > }; > > static inline struct dev_cgroup *css_to_devcgroup(struct cgroup_subsys_state *s) > @@ -241,7 +239,6 @@ static struct cgroup_subsys_state *devcgroup_css_alloc(struct cgroup *cgroup) > if (!dev_cgroup) > return ERR_PTR(-ENOMEM); > INIT_LIST_HEAD(&dev_cgroup->exceptions); > - INIT_LIST_HEAD(&dev_cgroup->propagate_pending); > dev_cgroup->behavior = DEVCG_DEFAULT_NONE; > > return &dev_cgroup->css; > @@ -445,34 +442,6 @@ static void revalidate_active_exceptions(struct dev_cgroup *devcg) > } > > /** > - * get_online_devcg - walks the cgroup tree and fills a list with the online > - * groups > - * @root: cgroup used as starting point > - * @online: list that will be filled with online groups > - * > - * Must be called with devcgroup_mutex held. Grabs RCU lock. > - * Because devcgroup_mutex is held, no devcg will become online or offline > - * during the tree walk (see devcgroup_online, devcgroup_offline) > - * A separated list is needed because propagate_behavior() and > - * propagate_exception() need to allocate memory and can block. > - */ > -static void get_online_devcg(struct cgroup *root, struct list_head *online) > -{ > - struct cgroup *pos; > - struct dev_cgroup *devcg; > - > - lockdep_assert_held(&devcgroup_mutex); > - > - rcu_read_lock(); > - cgroup_for_each_descendant_pre(pos, root) { > - devcg = cgroup_to_devcgroup(pos); > - if (is_devcg_online(devcg)) > - list_add_tail(&devcg->propagate_pending, online); > - } > - rcu_read_unlock(); > -} > - > -/** > * propagate_exception - propagates a new exception to the children > * @devcg_root: device cgroup that added a new exception > * @ex: new exception to be propagated > @@ -482,15 +451,24 @@ static void get_online_devcg(struct cgroup *root, struct list_head *online) > static int propagate_exception(struct dev_cgroup *devcg_root, > struct dev_exception_item *ex) > { > - struct cgroup *root = devcg_root->css.cgroup; > - struct dev_cgroup *devcg, *parent, *tmp; > + struct cgroup *root = devcg_root->css.cgroup, *pos; > int rc = 0; > - LIST_HEAD(pending); > > - get_online_devcg(root, &pending); > + rcu_read_lock(); > > - list_for_each_entry_safe(devcg, tmp, &pending, propagate_pending) { > - parent = cgroup_to_devcgroup(devcg->css.cgroup->parent); > + cgroup_for_each_descendant_pre(pos, root) { > + struct dev_cgroup *devcg = cgroup_to_devcgroup(pos); > + > + /* > + * Because devcgroup_mutex is held, no devcg will become > + * online or offline during the tree walk (see on/offline > + * methods), and online ones are safe to access outside RCU > + * read lock without bumping refcnt. > + */ > + if (!is_devcg_online(devcg)) > + continue; > + > + rcu_read_unlock(); > > /* > * in case both root's behavior and devcg is allow, a new > @@ -512,8 +490,10 @@ static int propagate_exception(struct dev_cgroup *devcg_root, > } > revalidate_active_exceptions(devcg); > > - list_del_init(&devcg->propagate_pending); > + rcu_read_lock(); > } > + > + rcu_read_unlock(); > return rc; > } > > -- > 1.8.1.4 > > _______________________________________________ > Containers mailing list > Containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org > https://lists.linuxfoundation.org/mailman/listinfo/containers ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCHSET] cgroup: allow dropping RCU read lock while iterating [not found] ` <1369101025-28335-1-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> ` (4 preceding siblings ...) 2013-05-21 1:50 ` [PATCH 5/5] device_cgroup: simplify cgroup tree walk in propagate_exception() Tejun Heo @ 2013-05-21 3:20 ` Tejun Heo 2013-05-22 14:53 ` Aristeu Rozanski 6 siblings, 0 replies; 21+ messages in thread From: Tejun Heo @ 2013-05-21 3:20 UTC (permalink / raw) To: lizefan-hv44wF8Li93QT0dZR+AlfA Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, cgroups-u79uwXL29TY76Z2rM5mHXA, aris-H+wXaHxf7aLQT0dZR+AlfA, serge.hallyn-Z7WLFzj8eWMS+FvcfC7Uqw, mhocko-AlSwsSmVLrQ, hannes-druUgvl0LCNAfugRpC6u6w, bsingharora-Re5JQEeQqe8AvxtiuMwx3w, kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A (oops, used Serge's old address, using the new canonical one) Serge, the original thread can be read from http://thread.gmane.org/gmane.linux.kernel.containers/26037 Li, the cpuset patchset you posted a couple weeks ago should be able to make use of this, right? Thanks. -- tejun ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCHSET] cgroup: allow dropping RCU read lock while iterating [not found] ` <1369101025-28335-1-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> ` (5 preceding siblings ...) 2013-05-21 3:20 ` [PATCHSET] cgroup: allow dropping RCU read lock while iterating Tejun Heo @ 2013-05-22 14:53 ` Aristeu Rozanski 6 siblings, 0 replies; 21+ messages in thread From: Aristeu Rozanski @ 2013-05-22 14:53 UTC (permalink / raw) To: Tejun Heo Cc: serue-r/Jw6+rmf7HQT0dZR+AlfA, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, mhocko-AlSwsSmVLrQ, hannes-druUgvl0LCNAfugRpC6u6w, cgroups-u79uwXL29TY76Z2rM5mHXA Hi Tejun, On Tue, May 21, 2013 at 10:50:20AM +0900, Tejun Heo wrote: > Currently all cgroup iterators require the whole traversal to be > contained in a single RCU read critical section, which can be too > restrictive as there are times when blocking operations are necessary > during traversal. This forces controllers to implement specific > workarounds in those cases - building separate iteration list, punting > actual operations to work items and so on. > > This patchset updates cgroup iterators so that they allow dropping RCU > read lock while iteration is in progress so that controllers which > require sleeping during iteration don't need to implement their own > mechanisms. > > Dropping RCU read lock during iteration is unsafe because > cgroup->sibling.next can't be trusted once RCU read lock is dropped. > The sibling list is a RCU list and when a cgroup is removed the next > pointer is retained to keep RCU traversal working. If the next > sibling is removed while RCU read lock is dropped, the removed current > cgroup's next won't be updated and the next sibling may complete its > grace period and get freed leaving the next pointer dangling. > > Working around the problem is relatiely simple. Whether > ->sibling.next can be trusted can be trusted can be decided by looking > at CGRP_REMOVED - as cgroup removals are fully serialized, the flag is > guaranteed to be visible before the next sibling finishes its grace > period. For those cases, each cgroup is assigned a monotonically > increasing serial number. Because new cgroups are always appeneded to > the children list, it's guaranteed that all children list are sorted > in the ascending order of the serial numbers. When the next pointer > can't be trusted, the next sibling can be located by walking the > parent's children list from the beginning looking for the first cgroup > with higher serial number. > > The above is implemented in cgroup_next_sibling() and all iterators > are updated to use it to find out the next sibling thus allowing > droppping RCU read lock while iteration is in progress. This patchset > replaces separate iteration list in device_cgroup with direct > descendant walk and there will be further patches making use of this > update. > > This patchset contains the following five patches. > > 0001-cgroup-fix-a-subtle-bug-in-descendant-pre-order-walk.patch > 0002-cgroup-make-cgroup_is_removed-static.patch > 0003-cgroup-add-cgroup-serial_nr-and-implement-cgroup_nex.patch > 0004-cgroup-update-iterators-to-use-cgroup_next_sibling.patch > 0005-device_cgroup-simplify-cgroup-tree-walk-in-propagate.patch > > 0001 fixes a subtle iteration bug. Will be applied to for-3.10-fixes. > > 0002 is a trivial prep patch. > > 0003 implements cgroup_next_sibling() which can find out the next > sibling regardless of the state of the current cgroup. > > 0004 updates all iterators to use cgroup_next_sibling(). > > 0005 replaces iteration list work around in device_cgroup with direct > iteration. > > This patchset is on top of cgroup/for-3.11 23958e729e ("cgroup.h: > remove some functions that are now gone") and available in the > following git branch. > > git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git review-interruptible-iter patchset looks good to me. ran some tests in a kernel with it without problems. Acked-by: Aristeu Rozanski <aris-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> -- Aristeu ^ permalink raw reply [flat|nested] 21+ messages in thread
end of thread, other threads:[~2013-05-24 3:32 UTC | newest] Thread overview: 21+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2013-05-21 1:50 [PATCHSET] cgroup: allow dropping RCU read lock while iterating Tejun Heo [not found] ` <1369101025-28335-1-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> 2013-05-21 1:50 ` [PATCH 1/5] cgroup: fix a subtle bug in descendant pre-order walk Tejun Heo [not found] ` <1369101025-28335-2-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> 2013-05-22 18:22 ` Michal Hocko 2013-05-24 1:51 ` Tejun Heo 2013-05-21 1:50 ` [PATCH 2/5] cgroup: make cgroup_is_removed() static Tejun Heo [not found] ` <1369101025-28335-3-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> 2013-05-24 1:56 ` Tejun Heo [not found] ` <20130524015613.GB19755-9pTldWuhBndy/B6EtB590w@public.gmane.org> 2013-05-24 3:32 ` Li Zefan 2013-05-21 1:50 ` [PATCH 3/5] cgroup: add cgroup->serial_nr and implement cgroup_next_sibling() Tejun Heo [not found] ` <1369101025-28335-4-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> 2013-05-21 14:33 ` Serge Hallyn 2013-05-22 14:36 ` Aristeu Rozanski [not found] ` <20130522143636.GC16739-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> 2013-05-22 14:38 ` Tejun Heo 2013-05-22 18:41 ` Michal Hocko 2013-05-21 1:50 ` [PATCH 4/5] cgroup: update iterators to use cgroup_next_sibling() Tejun Heo [not found] ` <1369101025-28335-5-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> 2013-05-21 22:31 ` Serge Hallyn 2013-05-22 9:09 ` Li Zefan [not found] ` <519C8B2E.5040606-hv44wF8Li93QT0dZR+AlfA@public.gmane.org> 2013-05-22 9:17 ` Tejun Heo 2013-05-22 18:46 ` Michal Hocko 2013-05-21 1:50 ` [PATCH 5/5] device_cgroup: simplify cgroup tree walk in propagate_exception() Tejun Heo [not found] ` <1369101025-28335-6-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> 2013-05-21 22:35 ` Serge Hallyn 2013-05-21 3:20 ` [PATCHSET] cgroup: allow dropping RCU read lock while iterating Tejun Heo 2013-05-22 14:53 ` Aristeu Rozanski
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).