From: Serge Hallyn <serge.hallyn-GeWIH/nMZzLQT0dZR+AlfA@public.gmane.org>
To: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
Cc: lizefan-hv44wF8Li93QT0dZR+AlfA@public.gmane.org,
containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org,
mhocko-AlSwsSmVLrQ@public.gmane.org,
hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org,
cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
Subject: Re: [PATCH 3/5] cgroup: add cgroup->serial_nr and implement cgroup_next_sibling()
Date: Tue, 21 May 2013 09:33:14 -0500 [thread overview]
Message-ID: <20130521143314.GC20263@sergelap> (raw)
In-Reply-To: <1369101025-28335-4-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
Quoting Tejun Heo (tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org):
> Currently, there's no easy way to find out the next sibling cgroup
> unless it's known that the current cgroup is accessed from the
> parent's children list in a single RCU critical section. This in turn
> forces all iterators to require whole iteration to be enclosed in a
> single RCU critical section, which sometimes is too restrictive. This
> patch implements cgroup_next_sibling() which can reliably determine
> the next sibling regardless of the state of the current cgroup as long
> as it's accessible.
>
> It currently is impossible to determine the next sibling after
> dropping RCU read lock because the cgroup being iterated could be
> removed anytime and if RCU read lock is dropped, nothing guarantess
> its ->sibling.next pointer is accessible. A removed cgroup would
> continue to point to its next sibling for RCU accesses but stop
> receiving updates from the sibling. IOW, the next sibling could be
> removed and then complete its grace period while RCU read lock is
> dropped, making it unsafe to dereference ->sibling.next after dropping
> and re-acquiring RCU read lock.
>
> This can be solved by adding a way to traverse to the next sibling
> without dereferencing ->sibling.next. This patch adds a monotonically
> increasing cgroup serial number, cgroup->serial_nr, which guarantees
> that all cgroup->children lists are kept in increasing serial_nr
> order. A new function, cgroup_next_sibling(), is implemented, which,
> if CGRP_REMOVED is not set on the current cgroup, follows
> ->sibling.next; otherwise, traverses the parent's ->children list
> until it sees a sibling with higher ->serial_nr.
>
> This allows the function to always return the next sibling regardless
> of the state of the current cgroup without adding overhead in the fast
> path.
>
> Further patches will update the iterators to use cgroup_next_sibling()
> so that they allow dropping RCU read lock and blocking while iteration
> is in progress which in turn will be used to simplify controllers.
>
> Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
Looks sensible, thanks.
(ultra-nit: typo @ "reomval")
Acked-by: Serge E. Hallyn <serge.hallyn-GeWIH/nMZzLQT0dZR+AlfA@public.gmane.org>
> ---
> include/linux/cgroup.h | 10 ++++++++
> kernel/cgroup.c | 62 ++++++++++++++++++++++++++++++++++++++++++++++++++
> 2 files changed, 72 insertions(+)
>
> diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
> index 8d9f3c9..ee041a0 100644
> --- a/include/linux/cgroup.h
> +++ b/include/linux/cgroup.h
> @@ -189,6 +189,14 @@ struct cgroup {
> struct dentry *dentry; /* cgroup fs entry, RCU protected */
>
> /*
> + * Monotonically increasing unique serial number which defines a
> + * uniform order among all cgroups. It's guaranteed that all
> + * ->children lists are in the ascending order of ->serial_nr.
> + * It's used to allow interrupting and resuming iterations.
> + */
> + u64 serial_nr;
> +
> + /*
> * This is a copy of dentry->d_name, and it's needed because
> * we can't use dentry->d_name in cgroup_path().
> *
> @@ -675,6 +683,8 @@ static inline struct cgroup* task_cgroup(struct task_struct *task,
> return task_subsys_state(task, subsys_id)->cgroup;
> }
>
> +struct cgroup *cgroup_next_sibling(struct cgroup *pos);
> +
> /**
> * cgroup_for_each_child - iterate through children of a cgroup
> * @pos: the cgroup * to use as the loop cursor
> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
> index 3222f93..bc757d7 100644
> --- a/kernel/cgroup.c
> +++ b/kernel/cgroup.c
> @@ -2975,6 +2975,55 @@ static void cgroup_enable_task_cg_lists(void)
> }
>
> /**
> + * cgroup_next_sibling - find the next sibling of a given cgroup
> + * @pos: the current cgroup
> + *
> + * This function returns the next sibling of @pos and should be called
> + * under RCU read lock. The only requirement is that @pos is accessible.
> + * The next sibling is guaranteed to be returned regardless of @pos's
> + * state.
> + */
> +struct cgroup *cgroup_next_sibling(struct cgroup *pos)
> +{
> + struct cgroup *next;
> +
> + WARN_ON_ONCE(!rcu_read_lock_held());
> +
> + /*
> + * @pos could already have been removed. Once a cgroup is removed,
> + * its ->sibling.next is no longer updated when its next sibling
> + * changes. As CGRP_REMOVED is set on removal which is fully
> + * serialized, if we see it unasserted, it's guaranteed that the
> + * next sibling hasn't finished its grace period even if it's
> + * already removed, and thus safe to dereference from this RCU
> + * critical section. If ->sibling.next is inaccessible,
> + * cgroup_is_removed() is guaranteed to be visible as %true here.
> + */
> + if (likely(!cgroup_is_removed(pos))) {
> + next = list_entry_rcu(pos->sibling.next, struct cgroup, sibling);
> + if (&next->sibling != &pos->parent->children)
> + return next;
> + return NULL;
> + }
> +
> + /*
> + * Can't dereference the next pointer. Each cgroup is given a
> + * monotonically increasing unique serial number and always
> + * appended to the sibling list, so the next one can be found by
> + * walking the parent's children until we see a cgroup with higher
> + * serial number than @pos's.
> + *
> + * While this path can be slow, it's taken only when either the
> + * current cgroup is removed or iteration and reomval race.
> + */
> + list_for_each_entry_rcu(next, &pos->parent->children, sibling)
> + if (next->serial_nr > pos->serial_nr)
> + return next;
> + return NULL;
> +}
> +EXPORT_SYMBOL_GPL(cgroup_next_sibling);
> +
> +/**
> * cgroup_next_descendant_pre - find the next descendant for pre-order walk
> * @pos: the current position (%NULL to initiate traversal)
> * @cgroup: cgroup whose descendants to walk
> @@ -4136,6 +4185,7 @@ static void offline_css(struct cgroup_subsys *ss, struct cgroup *cgrp)
> static long cgroup_create(struct cgroup *parent, struct dentry *dentry,
> umode_t mode)
> {
> + static atomic64_t serial_nr_cursor = ATOMIC64_INIT(0);
> struct cgroup *cgrp;
> struct cgroup_name *name;
> struct cgroupfs_root *root = parent->root;
> @@ -4216,6 +4266,14 @@ static long cgroup_create(struct cgroup *parent, struct dentry *dentry,
> goto err_free_all;
> lockdep_assert_held(&dentry->d_inode->i_mutex);
>
> + /*
> + * Assign a monotonically increasing serial number. With the list
> + * appending below, it guarantees that sibling cgroups are always
> + * sorted in the ascending serial number order on the parent's
> + * ->children.
> + */
> + cgrp->serial_nr = atomic64_inc_return(&serial_nr_cursor);
> +
> /* allocation complete, commit to creation */
> list_add_tail(&cgrp->allcg_node, &root->allcg_list);
> list_add_tail_rcu(&cgrp->sibling, &cgrp->parent->children);
> @@ -4303,6 +4361,10 @@ static int cgroup_destroy_locked(struct cgroup *cgrp)
> * removed. This makes future css_tryget() and child creation
> * attempts fail thus maintaining the removal conditions verified
> * above.
> + *
> + * Note that CGRP_REMVOED clearing is depended upon by
> + * cgroup_next_sibling() to resume iteration after dropping RCU
> + * read lock. See cgroup_next_sibling() for details.
> */
> for_each_subsys(cgrp->root, ss) {
> struct cgroup_subsys_state *css = cgrp->subsys[ss->subsys_id];
> --
> 1.8.1.4
>
> _______________________________________________
> Containers mailing list
> Containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org
> https://lists.linuxfoundation.org/mailman/listinfo/containers
next prev parent reply other threads:[~2013-05-21 14:33 UTC|newest]
Thread overview: 32+ messages / expand[flat|nested] mbox.gz Atom feed top
2013-05-21 1:50 [PATCHSET] cgroup: allow dropping RCU read lock while iterating Tejun Heo
[not found] ` <1369101025-28335-1-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
2013-05-21 1:50 ` [PATCH 1/5] cgroup: fix a subtle bug in descendant pre-order walk Tejun Heo
2013-05-21 1:50 ` Tejun Heo
[not found] ` <1369101025-28335-2-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
2013-05-22 18:22 ` Michal Hocko
2013-05-22 18:22 ` Michal Hocko
2013-05-24 1:51 ` Tejun Heo
2013-05-21 1:50 ` [PATCH 2/5] cgroup: make cgroup_is_removed() static Tejun Heo
[not found] ` <1369101025-28335-3-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
2013-05-24 1:56 ` Tejun Heo
[not found] ` <20130524015613.GB19755-9pTldWuhBndy/B6EtB590w@public.gmane.org>
2013-05-24 3:32 ` Li Zefan
2013-05-24 3:32 ` Li Zefan
2013-05-21 1:50 ` Tejun Heo
2013-05-21 1:50 ` [PATCH 3/5] cgroup: add cgroup->serial_nr and implement cgroup_next_sibling() Tejun Heo
[not found] ` <1369101025-28335-4-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
2013-05-21 14:33 ` Serge Hallyn
2013-05-21 14:33 ` Serge Hallyn [this message]
2013-05-22 14:36 ` Aristeu Rozanski
[not found] ` <20130522143636.GC16739-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2013-05-22 14:38 ` Tejun Heo
2013-05-22 18:41 ` Michal Hocko
2013-05-21 1:50 ` Tejun Heo
2013-05-21 1:50 ` [PATCH 4/5] cgroup: update iterators to use cgroup_next_sibling() Tejun Heo
2013-05-21 1:50 ` Tejun Heo
[not found] ` <1369101025-28335-5-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
2013-05-21 22:31 ` Serge Hallyn
2013-05-21 22:31 ` Serge Hallyn
2013-05-22 9:09 ` Li Zefan
2013-05-22 9:09 ` Li Zefan
[not found] ` <519C8B2E.5040606-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
2013-05-22 9:17 ` Tejun Heo
2013-05-22 18:46 ` Michal Hocko
2013-05-21 1:50 ` [PATCH 5/5] device_cgroup: simplify cgroup tree walk in propagate_exception() Tejun Heo
2013-05-21 1:50 ` Tejun Heo
[not found] ` <1369101025-28335-6-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
2013-05-21 22:35 ` Serge Hallyn
2013-05-21 3:20 ` [PATCHSET] cgroup: allow dropping RCU read lock while iterating Tejun Heo
2013-05-21 3:20 ` Tejun Heo
2013-05-22 14:53 ` Aristeu Rozanski
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20130521143314.GC20263@sergelap \
--to=serge.hallyn-gewih/nmzzlqt0dzr+alfa@public.gmane.org \
--cc=cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org \
--cc=containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org \
--cc=hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org \
--cc=lizefan-hv44wF8Li93QT0dZR+AlfA@public.gmane.org \
--cc=mhocko-AlSwsSmVLrQ@public.gmane.org \
--cc=tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.