* [PATCHSET v2 cgroup/for-3.11] cgroup: convert cgroup_subsys_state refcnt to percpu_ref
@ 2013-06-13  4:04 Tejun Heo
  2013-06-13  4:04 ` [PATCH 05/11] cgroup: clean up css_[try]get() and css_put() Tejun Heo
                   ` (4 more replies)
  0 siblings, 5 replies; 27+ messages in thread
From: Tejun Heo @ 2013-06-13  4:04 UTC (permalink / raw)
  To: lizefan-hv44wF8Li93QT0dZR+AlfA
  Cc: cgroups-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	cl-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	koverstreet-hpIqsD4AKlfQT0dZR+AlfA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA
Hello,
The changes from the last take[L] are,
* Rebased on top of further percpu-refcount updates.
* 0003: Broken patch description updated.
* 0004: Stupid list_del_init() conversions from the last decade
        dropped.
* 0005: Typo fix.
* 0011: percpu_ref_init() error handling fixed.  Premature and
        duplicate base css ref puts fixed.
This patchset does a lot of cleanup and then updates css
(cgroup_subsys_state) refcnt to use the new percpu reference counter.
A css (cgroup_subsys_state) is how each cgroup is represented to a
controller.  As such, it can be used in hot paths across the various
subsystems different controllers are associated with.
One of the common operations is reference counting, which up until now
has been implemented using a global atomic counter and can have
significant adverse impact on scalability.  For example, css refcnt
can be gotten and put multiple times by blkcg for each IO request.
For highops configurations which try to do as much per-cpu as
possible, the global frequent refcnting can be very expensive.
In general, given the various hugely diverse paths css's end up being
used from, we need to make it cheap and highly scalable.  In its
usage, css refcnting isn't very different from module refcnting.
This patchset contains the following 11 patches.
 0001-cgroup-remove-now-unused-css_depth.patch
 0002-cgroup-consistently-use-cset-for-struct-css_set-vari.patch
 0003-cgroup-bring-some-sanity-to-naming-around-cg_cgroup_.patch
 0004-cgroup-use-kzalloc-instead-of-kmalloc.patch
 0005-cgroup-clean-up-css_-try-get-and-css_put.patch
 0006-cgroup-rename-CGRP_REMOVED-to-CGRP_DEAD.patch
 0007-cgroup-drop-unnecessary-RCU-dancing-from-__put_css_s.patch
 0008-cgroup-remove-cgroup-count-and-use.patch
 0009-cgroup-reorder-the-operations-in-cgroup_destroy_lock.patch
 0010-cgroup-split-cgroup-destruction-into-two-steps.patch
 0011-cgroup-use-percpu-refcnt-for-cgroup_subsys_states.patch
0001-0007 are cleanups, many of them cosmetic.  They clean up the code
paths which are related with the refcnting changes.  If you're only
interested in the precpu usage, you can probably skip these.
0008-0010 updates the cgroup destruction path so that percpu refcnting
can be used.
0011 updates css refcnting to use percpu_ref.
This patchset is on top of
  cgroup/for-3.11 d5c56ced77 ("cgroup: clean up the cftype array for the base cgroup files")
+ [1] percpu/review-percpu-ref-tryget
and available in the following git branch.
 git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git review-css-percpu-ref
diffstat follows.  Thanks.
 include/linux/cgroup.h |   87 ++---
 kernel/cgroup.c        |  723 ++++++++++++++++++++++++++-----------------------
 2 files changed, 420 insertions(+), 390 deletions(-)
--
tejun
[L] http://thread.gmane.org/gmane.linux.kernel.containers/26167
[1] http://thread.gmane.org/gmane.linux.kernel/1507258
    http://thread.gmane.org/gmane.linux.kernel/1507437
^ permalink raw reply	[flat|nested] 27+ messages in thread* [PATCH 05/11] cgroup: clean up css_[try]get() and css_put() 2013-06-13 4:04 [PATCHSET v2 cgroup/for-3.11] cgroup: convert cgroup_subsys_state refcnt to percpu_ref Tejun Heo @ 2013-06-13 4:04 ` Tejun Heo [not found] ` <1371096298-24402-1-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> ` (3 subsequent siblings) 4 siblings, 0 replies; 27+ messages in thread From: Tejun Heo @ 2013-06-13 4:04 UTC (permalink / raw) To: lizefan; +Cc: containers, cgroups, koverstreet, linux-kernel, cl, Tejun Heo * __css_get() isn't used by anyone. Fold it into css_get(). * Add proper function comments to all css reference functions. This patch is purely cosmetic. v2: Typo fix as per Li. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Li Zefan <lizefan@huawei.com> --- include/linux/cgroup.h | 48 ++++++++++++++++++++++++------------------------ 1 file changed, 24 insertions(+), 24 deletions(-) diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h index 0e32855..a494636 100644 --- a/include/linux/cgroup.h +++ b/include/linux/cgroup.h @@ -94,33 +94,31 @@ enum { CSS_ONLINE = (1 << 1), /* between ->css_online() and ->css_offline() */ }; -/* Caller must verify that the css is not for root cgroup */ -static inline void __css_get(struct cgroup_subsys_state *css, int count) -{ - atomic_add(count, &css->refcnt); -} - -/* - * Call css_get() to hold a reference on the css; it can be used - * for a reference obtained via: - * - an existing ref-counted reference to the css - * - task->cgroups for a locked task +/** + * css_get - obtain a reference on the specified css + * @css: target css + * + * The caller must already have a reference. */ - static inline void css_get(struct cgroup_subsys_state *css) { /* We don't need to reference count the root state */ if (!(css->flags & CSS_ROOT)) - __css_get(css, 1); + atomic_inc(&css->refcnt); } -/* - * Call css_tryget() to take a reference on a css if your existing - * (known-valid) reference isn't already ref-counted. Returns false if - * the css has been destroyed. - */ - extern bool __css_tryget(struct cgroup_subsys_state *css); + +/** + * css_tryget - try to obtain a reference on the specified css + * @css: target css + * + * Obtain a reference on @css if it's alive. The caller naturally needs to + * ensure that @css is accessible but doesn't have to be holding a + * reference on it - IOW, RCU protected access is good enough for this + * function. Returns %true if a reference count was successfully obtained; + * %false otherwise. + */ static inline bool css_tryget(struct cgroup_subsys_state *css) { if (css->flags & CSS_ROOT) @@ -128,12 +126,14 @@ static inline bool css_tryget(struct cgroup_subsys_state *css) return __css_tryget(css); } -/* - * css_put() should be called to release a reference taken by - * css_get() or css_tryget() - */ - extern void __css_put(struct cgroup_subsys_state *css); + +/** + * css_put - put a css reference + * @css: target css + * + * Put a reference obtained via css_get() and css_tryget(). + */ static inline void css_put(struct cgroup_subsys_state *css) { if (!(css->flags & CSS_ROOT)) -- 1.8.2.1 ^ permalink raw reply related [flat|nested] 27+ messages in thread
[parent not found: <1371096298-24402-1-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>]
* [PATCH 01/11] cgroup: remove now unused css_depth() [not found] ` <1371096298-24402-1-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> @ 2013-06-13 4:04 ` Tejun Heo 2013-06-13 4:04 ` [PATCH 02/11] cgroup: consistently use @cset for struct css_set variables Tejun Heo ` (6 subsequent siblings) 7 siblings, 0 replies; 27+ messages in thread From: Tejun Heo @ 2013-06-13 4:04 UTC (permalink / raw) To: lizefan-hv44wF8Li93QT0dZR+AlfA Cc: cl-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Tejun Heo, cgroups-u79uwXL29TY76Z2rM5mHXA, koverstreet-hpIqsD4AKlfQT0dZR+AlfA Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> --- include/linux/cgroup.h | 1 - kernel/cgroup.c | 12 ------------ 2 files changed, 13 deletions(-) diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h index d0ad379..5830592 100644 --- a/include/linux/cgroup.h +++ b/include/linux/cgroup.h @@ -848,7 +848,6 @@ bool css_is_ancestor(struct cgroup_subsys_state *cg, /* Get id and depth of css */ unsigned short css_id(struct cgroup_subsys_state *css); -unsigned short css_depth(struct cgroup_subsys_state *css); struct cgroup_subsys_state *cgroup_css_from_dir(struct file *f, int id); #else /* !CONFIG_CGROUPS */ diff --git a/kernel/cgroup.c b/kernel/cgroup.c index 3ad2a9c..dc8997a 100644 --- a/kernel/cgroup.c +++ b/kernel/cgroup.c @@ -5230,18 +5230,6 @@ unsigned short css_id(struct cgroup_subsys_state *css) } EXPORT_SYMBOL_GPL(css_id); -unsigned short css_depth(struct cgroup_subsys_state *css) -{ - struct css_id *cssid; - - cssid = rcu_dereference_check(css->id, css_refcnt(css)); - - if (cssid) - return cssid->depth; - return 0; -} -EXPORT_SYMBOL_GPL(css_depth); - /** * css_is_ancestor - test "root" css is an ancestor of "child" * @child: the css to be tested. -- 1.8.2.1 ^ permalink raw reply related [flat|nested] 27+ messages in thread
* [PATCH 02/11] cgroup: consistently use @cset for struct css_set variables [not found] ` <1371096298-24402-1-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> 2013-06-13 4:04 ` [PATCH 01/11] cgroup: remove now unused css_depth() Tejun Heo @ 2013-06-13 4:04 ` Tejun Heo 2013-06-13 4:04 ` [PATCH 03/11] cgroup: bring some sanity to naming around cg_cgroup_link Tejun Heo ` (5 subsequent siblings) 7 siblings, 0 replies; 27+ messages in thread From: Tejun Heo @ 2013-06-13 4:04 UTC (permalink / raw) To: lizefan-hv44wF8Li93QT0dZR+AlfA Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, cgroups-u79uwXL29TY76Z2rM5mHXA, koverstreet-hpIqsD4AKlfQT0dZR+AlfA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, cl-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, Tejun Heo cgroup.c uses @cg for most struct css_set variables, which in itself could be a bit confusing, but made much worse by the fact that there are places which use @cg for struct cgroup variables. compare_css_sets() epitomizes this confusion - @[old_]cg are struct css_set while @cg[12] are struct cgroup. It's not like the whole deal with cgroup, css_set and cg_cgroup_link isn't already confusing enough. Let's give it some sanity by uniformly using @cset for all struct css_set variables. * s/cg/cset/ for all css_set variables. * s/oldcg/old_cset/ s/oldcgrp/old_cgrp/. The same for the ones prefixed with "new". * s/cg/cgrp/ for cgroup variables in compare_css_sets(). * s/css/cset/ for the cgroup variable in task_cgroup_from_root(). * Whiteline adjustments. This patch is purely cosmetic. Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> --- kernel/cgroup.c | 216 ++++++++++++++++++++++++++++---------------------------- 1 file changed, 109 insertions(+), 107 deletions(-) diff --git a/kernel/cgroup.c b/kernel/cgroup.c index dc8997a..e7f6845 100644 --- a/kernel/cgroup.c +++ b/kernel/cgroup.c @@ -376,30 +376,32 @@ static unsigned long css_set_hash(struct cgroup_subsys_state *css[]) * compiled into their kernel but not actually in use */ static int use_task_css_set_links __read_mostly; -static void __put_css_set(struct css_set *cg, int taskexit) +static void __put_css_set(struct css_set *cset, int taskexit) { struct cg_cgroup_link *link; struct cg_cgroup_link *saved_link; + /* * Ensure that the refcount doesn't hit zero while any readers * can see it. Similar to atomic_dec_and_lock(), but for an * rwlock */ - if (atomic_add_unless(&cg->refcount, -1, 1)) + if (atomic_add_unless(&cset->refcount, -1, 1)) return; write_lock(&css_set_lock); - if (!atomic_dec_and_test(&cg->refcount)) { + if (!atomic_dec_and_test(&cset->refcount)) { write_unlock(&css_set_lock); return; } /* This css_set is dead. unlink it and release cgroup refcounts */ - hash_del(&cg->hlist); + hash_del(&cset->hlist); css_set_count--; - list_for_each_entry_safe(link, saved_link, &cg->cg_links, + list_for_each_entry_safe(link, saved_link, &cset->cg_links, cg_link_list) { struct cgroup *cgrp = link->cgrp; + list_del(&link->cg_link_list); list_del(&link->cgrp_link_list); @@ -421,45 +423,45 @@ static void __put_css_set(struct css_set *cg, int taskexit) } write_unlock(&css_set_lock); - kfree_rcu(cg, rcu_head); + kfree_rcu(cset, rcu_head); } /* * refcounted get/put for css_set objects */ -static inline void get_css_set(struct css_set *cg) +static inline void get_css_set(struct css_set *cset) { - atomic_inc(&cg->refcount); + atomic_inc(&cset->refcount); } -static inline void put_css_set(struct css_set *cg) +static inline void put_css_set(struct css_set *cset) { - __put_css_set(cg, 0); + __put_css_set(cset, 0); } -static inline void put_css_set_taskexit(struct css_set *cg) +static inline void put_css_set_taskexit(struct css_set *cset) { - __put_css_set(cg, 1); + __put_css_set(cset, 1); } /* * compare_css_sets - helper function for find_existing_css_set(). - * @cg: candidate css_set being tested - * @old_cg: existing css_set for a task + * @cset: candidate css_set being tested + * @old_cset: existing css_set for a task * @new_cgrp: cgroup that's being entered by the task * @template: desired set of css pointers in css_set (pre-calculated) * * Returns true if "cg" matches "old_cg" except for the hierarchy * which "new_cgrp" belongs to, for which it should match "new_cgrp". */ -static bool compare_css_sets(struct css_set *cg, - struct css_set *old_cg, +static bool compare_css_sets(struct css_set *cset, + struct css_set *old_cset, struct cgroup *new_cgrp, struct cgroup_subsys_state *template[]) { struct list_head *l1, *l2; - if (memcmp(template, cg->subsys, sizeof(cg->subsys))) { + if (memcmp(template, cset->subsys, sizeof(cset->subsys))) { /* Not all subsystems matched */ return false; } @@ -473,28 +475,28 @@ static bool compare_css_sets(struct css_set *cg, * candidates. */ - l1 = &cg->cg_links; - l2 = &old_cg->cg_links; + l1 = &cset->cg_links; + l2 = &old_cset->cg_links; while (1) { struct cg_cgroup_link *cgl1, *cgl2; - struct cgroup *cg1, *cg2; + struct cgroup *cgrp1, *cgrp2; l1 = l1->next; l2 = l2->next; /* See if we reached the end - both lists are equal length. */ - if (l1 == &cg->cg_links) { - BUG_ON(l2 != &old_cg->cg_links); + if (l1 == &cset->cg_links) { + BUG_ON(l2 != &old_cset->cg_links); break; } else { - BUG_ON(l2 == &old_cg->cg_links); + BUG_ON(l2 == &old_cset->cg_links); } /* Locate the cgroups associated with these links. */ cgl1 = list_entry(l1, struct cg_cgroup_link, cg_link_list); cgl2 = list_entry(l2, struct cg_cgroup_link, cg_link_list); - cg1 = cgl1->cgrp; - cg2 = cgl2->cgrp; + cgrp1 = cgl1->cgrp; + cgrp2 = cgl2->cgrp; /* Hierarchies should be linked in the same order. */ - BUG_ON(cg1->root != cg2->root); + BUG_ON(cgrp1->root != cgrp2->root); /* * If this hierarchy is the hierarchy of the cgroup @@ -503,11 +505,11 @@ static bool compare_css_sets(struct css_set *cg, * hierarchy, then this css_set should point to the * same cgroup as the old css_set. */ - if (cg1->root == new_cgrp->root) { - if (cg1 != new_cgrp) + if (cgrp1->root == new_cgrp->root) { + if (cgrp1 != new_cgrp) return false; } else { - if (cg1 != cg2) + if (cgrp1 != cgrp2) return false; } } @@ -527,14 +529,13 @@ static bool compare_css_sets(struct css_set *cg, * template: location in which to build the desired set of subsystem * state objects for the new cgroup group */ -static struct css_set *find_existing_css_set( - struct css_set *oldcg, - struct cgroup *cgrp, - struct cgroup_subsys_state *template[]) +static struct css_set *find_existing_css_set(struct css_set *old_cset, + struct cgroup *cgrp, + struct cgroup_subsys_state *template[]) { int i; struct cgroupfs_root *root = cgrp->root; - struct css_set *cg; + struct css_set *cset; unsigned long key; /* @@ -551,17 +552,17 @@ static struct css_set *find_existing_css_set( } else { /* Subsystem is not in this hierarchy, so we * don't want to change the subsystem state */ - template[i] = oldcg->subsys[i]; + template[i] = old_cset->subsys[i]; } } key = css_set_hash(template); - hash_for_each_possible(css_set_table, cg, hlist, key) { - if (!compare_css_sets(cg, oldcg, cgrp, template)) + hash_for_each_possible(css_set_table, cset, hlist, key) { + if (!compare_css_sets(cset, old_cset, cgrp, template)) continue; /* This css_set matches what we need */ - return cg; + return cset; } /* No existing cgroup group matched */ @@ -603,18 +604,18 @@ static int allocate_cg_links(int count, struct list_head *tmp) /** * link_css_set - a helper function to link a css_set to a cgroup * @tmp_cg_links: cg_cgroup_link objects allocated by allocate_cg_links() - * @cg: the css_set to be linked + * @cset: the css_set to be linked * @cgrp: the destination cgroup */ static void link_css_set(struct list_head *tmp_cg_links, - struct css_set *cg, struct cgroup *cgrp) + struct css_set *cset, struct cgroup *cgrp) { struct cg_cgroup_link *link; BUG_ON(list_empty(tmp_cg_links)); link = list_first_entry(tmp_cg_links, struct cg_cgroup_link, cgrp_link_list); - link->cg = cg; + link->cg = cset; link->cgrp = cgrp; atomic_inc(&cgrp->count); list_move(&link->cgrp_link_list, &cgrp->css_sets); @@ -622,7 +623,7 @@ static void link_css_set(struct list_head *tmp_cg_links, * Always add links to the tail of the list so that the list * is sorted by order of hierarchy creation */ - list_add_tail(&link->cg_link_list, &cg->cg_links); + list_add_tail(&link->cg_link_list, &cset->cg_links); } /* @@ -632,10 +633,10 @@ static void link_css_set(struct list_head *tmp_cg_links, * substituted into the appropriate hierarchy. Must be called with * cgroup_mutex held */ -static struct css_set *find_css_set( - struct css_set *oldcg, struct cgroup *cgrp) +static struct css_set *find_css_set(struct css_set *old_cset, + struct cgroup *cgrp) { - struct css_set *res; + struct css_set *cset; struct cgroup_subsys_state *template[CGROUP_SUBSYS_COUNT]; struct list_head tmp_cg_links; @@ -646,40 +647,40 @@ static struct css_set *find_css_set( /* First see if we already have a cgroup group that matches * the desired set */ read_lock(&css_set_lock); - res = find_existing_css_set(oldcg, cgrp, template); - if (res) - get_css_set(res); + cset = find_existing_css_set(old_cset, cgrp, template); + if (cset) + get_css_set(cset); read_unlock(&css_set_lock); - if (res) - return res; + if (cset) + return cset; - res = kmalloc(sizeof(*res), GFP_KERNEL); - if (!res) + cset = kmalloc(sizeof(*cset), GFP_KERNEL); + if (!cset) return NULL; /* Allocate all the cg_cgroup_link objects that we'll need */ if (allocate_cg_links(root_count, &tmp_cg_links) < 0) { - kfree(res); + kfree(cset); return NULL; } - atomic_set(&res->refcount, 1); - INIT_LIST_HEAD(&res->cg_links); - INIT_LIST_HEAD(&res->tasks); - INIT_HLIST_NODE(&res->hlist); + atomic_set(&cset->refcount, 1); + INIT_LIST_HEAD(&cset->cg_links); + INIT_LIST_HEAD(&cset->tasks); + INIT_HLIST_NODE(&cset->hlist); /* Copy the set of subsystem state objects generated in * find_existing_css_set() */ - memcpy(res->subsys, template, sizeof(res->subsys)); + memcpy(cset->subsys, template, sizeof(cset->subsys)); write_lock(&css_set_lock); /* Add reference counts and links from the new css_set. */ - list_for_each_entry(link, &oldcg->cg_links, cg_link_list) { + list_for_each_entry(link, &old_cset->cg_links, cg_link_list) { struct cgroup *c = link->cgrp; if (c->root == cgrp->root) c = cgrp; - link_css_set(&tmp_cg_links, res, c); + link_css_set(&tmp_cg_links, cset, c); } BUG_ON(!list_empty(&tmp_cg_links)); @@ -687,12 +688,12 @@ static struct css_set *find_css_set( css_set_count++; /* Add this cgroup group to the hash table */ - key = css_set_hash(res->subsys); - hash_add(css_set_table, &res->hlist, key); + key = css_set_hash(cset->subsys); + hash_add(css_set_table, &cset->hlist, key); write_unlock(&css_set_lock); - return res; + return cset; } /* @@ -702,7 +703,7 @@ static struct css_set *find_css_set( static struct cgroup *task_cgroup_from_root(struct task_struct *task, struct cgroupfs_root *root) { - struct css_set *css; + struct css_set *cset; struct cgroup *res = NULL; BUG_ON(!mutex_is_locked(&cgroup_mutex)); @@ -712,12 +713,12 @@ static struct cgroup *task_cgroup_from_root(struct task_struct *task, * task can't change groups, so the only thing that can happen * is that it exits and its css is set back to init_css_set. */ - css = task->cgroups; - if (css == &init_css_set) { + cset = task->cgroups; + if (cset == &init_css_set) { res = &root->top_cgroup; } else { struct cg_cgroup_link *link; - list_for_each_entry(link, &css->cg_links, cg_link_list) { + list_for_each_entry(link, &cset->cg_links, cg_link_list) { struct cgroup *c = link->cgrp; if (c->root == root) { res = c; @@ -1608,7 +1609,7 @@ static struct dentry *cgroup_mount(struct file_system_type *fs_type, struct cgroupfs_root *existing_root; const struct cred *cred; int i; - struct css_set *cg; + struct css_set *cset; BUG_ON(sb->s_root != NULL); @@ -1666,8 +1667,8 @@ static struct dentry *cgroup_mount(struct file_system_type *fs_type, /* Link the top cgroup in this hierarchy into all * the css_set objects */ write_lock(&css_set_lock); - hash_for_each(css_set_table, i, cg, hlist) - link_css_set(&tmp_cg_links, cg, root_cgrp); + hash_for_each(css_set_table, i, cset, hlist) + link_css_set(&tmp_cg_links, cset, root_cgrp); write_unlock(&css_set_lock); free_cg_links(&tmp_cg_links); @@ -1947,10 +1948,11 @@ EXPORT_SYMBOL_GPL(cgroup_taskset_size); * * Must be called with cgroup_mutex and threadgroup locked. */ -static void cgroup_task_migrate(struct cgroup *oldcgrp, - struct task_struct *tsk, struct css_set *newcg) +static void cgroup_task_migrate(struct cgroup *old_cgrp, + struct task_struct *tsk, + struct css_set *new_cset) { - struct css_set *oldcg; + struct css_set *old_cset; /* * We are synchronized through threadgroup_lock() against PF_EXITING @@ -1958,25 +1960,25 @@ static void cgroup_task_migrate(struct cgroup *oldcgrp, * css_set to init_css_set and dropping the old one. */ WARN_ON_ONCE(tsk->flags & PF_EXITING); - oldcg = tsk->cgroups; + old_cset = tsk->cgroups; task_lock(tsk); - rcu_assign_pointer(tsk->cgroups, newcg); + rcu_assign_pointer(tsk->cgroups, new_cset); task_unlock(tsk); /* Update the css_set linked lists if we're using them */ write_lock(&css_set_lock); if (!list_empty(&tsk->cg_list)) - list_move(&tsk->cg_list, &newcg->tasks); + list_move(&tsk->cg_list, &new_cset->tasks); write_unlock(&css_set_lock); /* - * We just gained a reference on oldcg by taking it from the task. As - * trading it for newcg is protected by cgroup_mutex, we're safe to drop - * it here; it will be freed under RCU. + * We just gained a reference on old_cset by taking it from the + * task. As trading it for new_cset is protected by cgroup_mutex, + * we're safe to drop it here; it will be freed under RCU. */ - set_bit(CGRP_RELEASABLE, &oldcgrp->flags); - put_css_set(oldcg); + set_bit(CGRP_RELEASABLE, &old_cgrp->flags); + put_css_set(old_cset); } /** @@ -2928,7 +2930,7 @@ static void cgroup_advance_iter(struct cgroup *cgrp, { struct list_head *l = it->cg_link; struct cg_cgroup_link *link; - struct css_set *cg; + struct css_set *cset; /* Advance to the next non-empty css_set */ do { @@ -2938,10 +2940,10 @@ static void cgroup_advance_iter(struct cgroup *cgrp, return; } link = list_entry(l, struct cg_cgroup_link, cgrp_link_list); - cg = link->cg; - } while (list_empty(&cg->tasks)); + cset = link->cg; + } while (list_empty(&cset->tasks)); it->cg_link = l; - it->task = cg->tasks.next; + it->task = cset->tasks.next; } /* @@ -4519,7 +4521,7 @@ int __init_or_module cgroup_load_subsys(struct cgroup_subsys *ss) struct cgroup_subsys_state *css; int i, ret; struct hlist_node *tmp; - struct css_set *cg; + struct css_set *cset; unsigned long key; /* check name and function validity */ @@ -4586,17 +4588,17 @@ int __init_or_module cgroup_load_subsys(struct cgroup_subsys *ss) * this is all done under the css_set_lock. */ write_lock(&css_set_lock); - hash_for_each_safe(css_set_table, i, tmp, cg, hlist) { + hash_for_each_safe(css_set_table, i, tmp, cset, hlist) { /* skip entries that we already rehashed */ - if (cg->subsys[ss->subsys_id]) + if (cset->subsys[ss->subsys_id]) continue; /* remove existing entry */ - hash_del(&cg->hlist); + hash_del(&cset->hlist); /* set new value */ - cg->subsys[ss->subsys_id] = css; + cset->subsys[ss->subsys_id] = css; /* recompute hash and restore entry */ - key = css_set_hash(cg->subsys); - hash_add(css_set_table, &cg->hlist, key); + key = css_set_hash(cset->subsys); + hash_add(css_set_table, &cset->hlist, key); } write_unlock(&css_set_lock); @@ -4656,13 +4658,13 @@ void cgroup_unload_subsys(struct cgroup_subsys *ss) */ write_lock(&css_set_lock); list_for_each_entry(link, &dummytop->css_sets, cgrp_link_list) { - struct css_set *cg = link->cg; + struct css_set *cset = link->cg; unsigned long key; - hash_del(&cg->hlist); - cg->subsys[ss->subsys_id] = NULL; - key = css_set_hash(cg->subsys); - hash_add(css_set_table, &cg->hlist, key); + hash_del(&cset->hlist); + cset->subsys[ss->subsys_id] = NULL; + key = css_set_hash(cset->subsys); + hash_add(css_set_table, &cset->hlist, key); } write_unlock(&css_set_lock); @@ -5009,7 +5011,7 @@ void cgroup_post_fork(struct task_struct *child) */ void cgroup_exit(struct task_struct *tsk, int run_callbacks) { - struct css_set *cg; + struct css_set *cset; int i; /* @@ -5026,7 +5028,7 @@ void cgroup_exit(struct task_struct *tsk, int run_callbacks) /* Reassign the task to the init_css_set. */ task_lock(tsk); - cg = tsk->cgroups; + cset = tsk->cgroups; tsk->cgroups = &init_css_set; if (run_callbacks && need_forkexit_callback) { @@ -5039,7 +5041,7 @@ void cgroup_exit(struct task_struct *tsk, int run_callbacks) if (ss->exit) { struct cgroup *old_cgrp = - rcu_dereference_raw(cg->subsys[i])->cgroup; + rcu_dereference_raw(cset->subsys[i])->cgroup; struct cgroup *cgrp = task_cgroup(tsk, i); ss->exit(cgrp, old_cgrp, tsk); } @@ -5047,7 +5049,7 @@ void cgroup_exit(struct task_struct *tsk, int run_callbacks) } task_unlock(tsk); - put_css_set_taskexit(cg); + put_css_set_taskexit(cset); } static void check_for_release(struct cgroup *cgrp) @@ -5456,12 +5458,12 @@ static int current_css_set_cg_links_read(struct cgroup *cont, struct seq_file *seq) { struct cg_cgroup_link *link; - struct css_set *cg; + struct css_set *cset; read_lock(&css_set_lock); rcu_read_lock(); - cg = rcu_dereference(current->cgroups); - list_for_each_entry(link, &cg->cg_links, cg_link_list) { + cset = rcu_dereference(current->cgroups); + list_for_each_entry(link, &cset->cg_links, cg_link_list) { struct cgroup *c = link->cgrp; const char *name; @@ -5486,11 +5488,11 @@ static int cgroup_css_links_read(struct cgroup *cont, read_lock(&css_set_lock); list_for_each_entry(link, &cont->css_sets, cgrp_link_list) { - struct css_set *cg = link->cg; + struct css_set *cset = link->cg; struct task_struct *task; int count = 0; - seq_printf(seq, "css_set %p\n", cg); - list_for_each_entry(task, &cg->tasks, cg_list) { + seq_printf(seq, "css_set %p\n", cset); + list_for_each_entry(task, &cset->tasks, cg_list) { if (count++ > MAX_TASKS_SHOWN_PER_CSS) { seq_puts(seq, " ...\n"); break; -- 1.8.2.1 ^ permalink raw reply related [flat|nested] 27+ messages in thread
* [PATCH 03/11] cgroup: bring some sanity to naming around cg_cgroup_link [not found] ` <1371096298-24402-1-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> 2013-06-13 4:04 ` [PATCH 01/11] cgroup: remove now unused css_depth() Tejun Heo 2013-06-13 4:04 ` [PATCH 02/11] cgroup: consistently use @cset for struct css_set variables Tejun Heo @ 2013-06-13 4:04 ` Tejun Heo 2013-06-13 4:04 ` [PATCH 04/11] cgroup: use kzalloc() instead of kmalloc() Tejun Heo ` (4 subsequent siblings) 7 siblings, 0 replies; 27+ messages in thread From: Tejun Heo @ 2013-06-13 4:04 UTC (permalink / raw) To: lizefan-hv44wF8Li93QT0dZR+AlfA Cc: cl-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Tejun Heo, cgroups-u79uwXL29TY76Z2rM5mHXA, koverstreet-hpIqsD4AKlfQT0dZR+AlfA cgroups and css_sets are mapped M:N and this M:N mapping is represented by struct cg_cgroup_link which forms linked lists on both sides. The naming around this mapping is already confusing and struct cg_cgroup_link exacerbates the situation quite a bit. From cgroup side, it starts off ->css_sets and runs through ->cgrp_link_list. From css_set side, it starts off ->cg_links and runs through ->cg_link_list. This is rather reversed as cgrp_link_list is used to iterate css_sets and cg_link_list cgroups. Also, this is the only place which is still using the confusing "cg" for css_sets. This patch cleans it up a bit. * s/cgroup->css_sets/cgroup->cset_links/ s/css_set->cg_links/css_set->cgrp_links/ s/cgroup_iter->cg_link/cgroup_iter->cset_link/ * s/cg_cgroup_link/cgrp_cset_link/ * s/cgrp_cset_link->cg/cgrp_cset_link->cset/ s/cgrp_cset_link->cgrp_link_list/cgrp_cset_link->cset_link/ s/cgrp_cset_link->cg_link_list/cgrp_cset_link->cgrp_link/ * s/init_css_set_link/init_cgrp_cset_link/ s/free_cg_links/free_cgrp_cset_links/ s/allocate_cg_links/allocate_cgrp_cset_links/ * s/cgl[12]/link[12]/ in compare_css_sets() * s/saved_link/tmp_link/ s/tmp/tmp_links/ and a couple similar adustments. * Comment and whiteline adjustments. After the changes, we have list_for_each_entry(link, &cont->cset_links, cset_link) { struct css_set *cset = link->cset; instead of list_for_each_entry(link, &cont->css_sets, cgrp_link_list) { struct css_set *cset = link->cg; This patch is purely cosmetic. v2: Fix broken sentences in the patch description. Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> --- include/linux/cgroup.h | 15 ++-- kernel/cgroup.c | 226 ++++++++++++++++++++++++------------------------- 2 files changed, 120 insertions(+), 121 deletions(-) diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h index 5830592..0e32855 100644 --- a/include/linux/cgroup.h +++ b/include/linux/cgroup.h @@ -215,10 +215,10 @@ struct cgroup { struct cgroupfs_root *root; /* - * List of cg_cgroup_links pointing at css_sets with - * tasks in this cgroup. Protected by css_set_lock + * List of cgrp_cset_links pointing at css_sets with tasks in this + * cgroup. Protected by css_set_lock. */ - struct list_head css_sets; + struct list_head cset_links; struct list_head allcg_node; /* cgroupfs_root->allcg_list */ struct list_head cft_q_node; /* used during cftype add/rm */ @@ -365,11 +365,10 @@ struct css_set { struct list_head tasks; /* - * List of cg_cgroup_link objects on link chains from - * cgroups referenced from this css_set. Protected by - * css_set_lock + * List of cgrp_cset_links pointing at cgroups referenced from this + * css_set. Protected by css_set_lock. */ - struct list_head cg_links; + struct list_head cgrp_links; /* * Set of subsystem states, one for each subsystem. This array @@ -792,7 +791,7 @@ struct cgroup *cgroup_next_descendant_post(struct cgroup *pos, /* A cgroup_iter should be treated as an opaque object */ struct cgroup_iter { - struct list_head *cg_link; + struct list_head *cset_link; struct list_head *task; }; diff --git a/kernel/cgroup.c b/kernel/cgroup.c index e7f6845..0b5cd0e 100644 --- a/kernel/cgroup.c +++ b/kernel/cgroup.c @@ -315,20 +315,24 @@ static void cgroup_release_agent(struct work_struct *work); static DECLARE_WORK(release_agent_work, cgroup_release_agent); static void check_for_release(struct cgroup *cgrp); -/* Link structure for associating css_set objects with cgroups */ -struct cg_cgroup_link { - /* - * List running through cg_cgroup_links associated with a - * cgroup, anchored on cgroup->css_sets - */ - struct list_head cgrp_link_list; - struct cgroup *cgrp; - /* - * List running through cg_cgroup_links pointing at a - * single css_set object, anchored on css_set->cg_links - */ - struct list_head cg_link_list; - struct css_set *cg; +/* + * A cgroup can be associated with multiple css_sets as different tasks may + * belong to different cgroups on different hierarchies. In the other + * direction, a css_set is naturally associated with multiple cgroups. + * This M:N relationship is represented by the following link structure + * which exists for each association and allows traversing the associations + * from both sides. + */ +struct cgrp_cset_link { + /* the cgroup and css_set this link associates */ + struct cgroup *cgrp; + struct css_set *cset; + + /* list of cgrp_cset_links anchored at cgrp->cset_links */ + struct list_head cset_link; + + /* list of cgrp_cset_links anchored at css_set->cgrp_links */ + struct list_head cgrp_link; }; /* The default css_set - used by init and its children prior to any @@ -339,7 +343,7 @@ struct cg_cgroup_link { */ static struct css_set init_css_set; -static struct cg_cgroup_link init_css_set_link; +static struct cgrp_cset_link init_cgrp_cset_link; static int cgroup_init_idr(struct cgroup_subsys *ss, struct cgroup_subsys_state *css); @@ -378,8 +382,7 @@ static int use_task_css_set_links __read_mostly; static void __put_css_set(struct css_set *cset, int taskexit) { - struct cg_cgroup_link *link; - struct cg_cgroup_link *saved_link; + struct cgrp_cset_link *link, *tmp_link; /* * Ensure that the refcount doesn't hit zero while any readers @@ -398,12 +401,11 @@ static void __put_css_set(struct css_set *cset, int taskexit) hash_del(&cset->hlist); css_set_count--; - list_for_each_entry_safe(link, saved_link, &cset->cg_links, - cg_link_list) { + list_for_each_entry_safe(link, tmp_link, &cset->cgrp_links, cgrp_link) { struct cgroup *cgrp = link->cgrp; - list_del(&link->cg_link_list); - list_del(&link->cgrp_link_list); + list_del(&link->cset_link); + list_del(&link->cgrp_link); /* * We may not be holding cgroup_mutex, and if cgrp->count is @@ -475,26 +477,26 @@ static bool compare_css_sets(struct css_set *cset, * candidates. */ - l1 = &cset->cg_links; - l2 = &old_cset->cg_links; + l1 = &cset->cgrp_links; + l2 = &old_cset->cgrp_links; while (1) { - struct cg_cgroup_link *cgl1, *cgl2; + struct cgrp_cset_link *link1, *link2; struct cgroup *cgrp1, *cgrp2; l1 = l1->next; l2 = l2->next; /* See if we reached the end - both lists are equal length. */ - if (l1 == &cset->cg_links) { - BUG_ON(l2 != &old_cset->cg_links); + if (l1 == &cset->cgrp_links) { + BUG_ON(l2 != &old_cset->cgrp_links); break; } else { - BUG_ON(l2 == &old_cset->cg_links); + BUG_ON(l2 == &old_cset->cgrp_links); } /* Locate the cgroups associated with these links. */ - cgl1 = list_entry(l1, struct cg_cgroup_link, cg_link_list); - cgl2 = list_entry(l2, struct cg_cgroup_link, cg_link_list); - cgrp1 = cgl1->cgrp; - cgrp2 = cgl2->cgrp; + link1 = list_entry(l1, struct cgrp_cset_link, cgrp_link); + link2 = list_entry(l2, struct cgrp_cset_link, cgrp_link); + cgrp1 = link1->cgrp; + cgrp2 = link2->cgrp; /* Hierarchies should be linked in the same order. */ BUG_ON(cgrp1->root != cgrp2->root); @@ -569,61 +571,64 @@ static struct css_set *find_existing_css_set(struct css_set *old_cset, return NULL; } -static void free_cg_links(struct list_head *tmp) +static void free_cgrp_cset_links(struct list_head *links_to_free) { - struct cg_cgroup_link *link; - struct cg_cgroup_link *saved_link; + struct cgrp_cset_link *link, *tmp_link; - list_for_each_entry_safe(link, saved_link, tmp, cgrp_link_list) { - list_del(&link->cgrp_link_list); + list_for_each_entry_safe(link, tmp_link, links_to_free, cset_link) { + list_del(&link->cset_link); kfree(link); } } -/* - * allocate_cg_links() allocates "count" cg_cgroup_link structures - * and chains them on tmp through their cgrp_link_list fields. Returns 0 on - * success or a negative error +/** + * allocate_cgrp_cset_links - allocate cgrp_cset_links + * @count: the number of links to allocate + * @tmp_links: list_head the allocated links are put on + * + * Allocate @count cgrp_cset_link structures and chain them on @tmp_links + * through ->cset_link. Returns 0 on success or -errno. */ -static int allocate_cg_links(int count, struct list_head *tmp) +static int allocate_cgrp_cset_links(int count, struct list_head *tmp_links) { - struct cg_cgroup_link *link; + struct cgrp_cset_link *link; int i; - INIT_LIST_HEAD(tmp); + + INIT_LIST_HEAD(tmp_links); + for (i = 0; i < count; i++) { link = kmalloc(sizeof(*link), GFP_KERNEL); if (!link) { - free_cg_links(tmp); + free_cgrp_cset_links(tmp_links); return -ENOMEM; } - list_add(&link->cgrp_link_list, tmp); + list_add(&link->cset_link, tmp_links); } return 0; } /** * link_css_set - a helper function to link a css_set to a cgroup - * @tmp_cg_links: cg_cgroup_link objects allocated by allocate_cg_links() + * @tmp_links: cgrp_cset_link objects allocated by allocate_cgrp_cset_links() * @cset: the css_set to be linked * @cgrp: the destination cgroup */ -static void link_css_set(struct list_head *tmp_cg_links, - struct css_set *cset, struct cgroup *cgrp) +static void link_css_set(struct list_head *tmp_links, struct css_set *cset, + struct cgroup *cgrp) { - struct cg_cgroup_link *link; + struct cgrp_cset_link *link; - BUG_ON(list_empty(tmp_cg_links)); - link = list_first_entry(tmp_cg_links, struct cg_cgroup_link, - cgrp_link_list); - link->cg = cset; + BUG_ON(list_empty(tmp_links)); + link = list_first_entry(tmp_links, struct cgrp_cset_link, cset_link); + link->cset = cset; link->cgrp = cgrp; atomic_inc(&cgrp->count); - list_move(&link->cgrp_link_list, &cgrp->css_sets); + list_move(&link->cset_link, &cgrp->cset_links); /* * Always add links to the tail of the list so that the list * is sorted by order of hierarchy creation */ - list_add_tail(&link->cg_link_list, &cset->cg_links); + list_add_tail(&link->cgrp_link, &cset->cgrp_links); } /* @@ -638,10 +643,8 @@ static struct css_set *find_css_set(struct css_set *old_cset, { struct css_set *cset; struct cgroup_subsys_state *template[CGROUP_SUBSYS_COUNT]; - - struct list_head tmp_cg_links; - - struct cg_cgroup_link *link; + struct list_head tmp_links; + struct cgrp_cset_link *link; unsigned long key; /* First see if we already have a cgroup group that matches @@ -659,14 +662,14 @@ static struct css_set *find_css_set(struct css_set *old_cset, if (!cset) return NULL; - /* Allocate all the cg_cgroup_link objects that we'll need */ - if (allocate_cg_links(root_count, &tmp_cg_links) < 0) { + /* Allocate all the cgrp_cset_link objects that we'll need */ + if (allocate_cgrp_cset_links(root_count, &tmp_links) < 0) { kfree(cset); return NULL; } atomic_set(&cset->refcount, 1); - INIT_LIST_HEAD(&cset->cg_links); + INIT_LIST_HEAD(&cset->cgrp_links); INIT_LIST_HEAD(&cset->tasks); INIT_HLIST_NODE(&cset->hlist); @@ -676,14 +679,15 @@ static struct css_set *find_css_set(struct css_set *old_cset, write_lock(&css_set_lock); /* Add reference counts and links from the new css_set. */ - list_for_each_entry(link, &old_cset->cg_links, cg_link_list) { + list_for_each_entry(link, &old_cset->cgrp_links, cgrp_link) { struct cgroup *c = link->cgrp; + if (c->root == cgrp->root) c = cgrp; - link_css_set(&tmp_cg_links, cset, c); + link_css_set(&tmp_links, cset, c); } - BUG_ON(!list_empty(&tmp_cg_links)); + BUG_ON(!list_empty(&tmp_links)); css_set_count++; @@ -717,9 +721,11 @@ static struct cgroup *task_cgroup_from_root(struct task_struct *task, if (cset == &init_css_set) { res = &root->top_cgroup; } else { - struct cg_cgroup_link *link; - list_for_each_entry(link, &cset->cg_links, cg_link_list) { + struct cgrp_cset_link *link; + + list_for_each_entry(link, &cset->cgrp_links, cgrp_link) { struct cgroup *c = link->cgrp; + if (c->root == root) { res = c; break; @@ -1405,7 +1411,7 @@ static void init_cgroup_housekeeping(struct cgroup *cgrp) INIT_LIST_HEAD(&cgrp->sibling); INIT_LIST_HEAD(&cgrp->children); INIT_LIST_HEAD(&cgrp->files); - INIT_LIST_HEAD(&cgrp->css_sets); + INIT_LIST_HEAD(&cgrp->cset_links); INIT_LIST_HEAD(&cgrp->allcg_node); INIT_LIST_HEAD(&cgrp->release_list); INIT_LIST_HEAD(&cgrp->pidlists); @@ -1604,7 +1610,7 @@ static struct dentry *cgroup_mount(struct file_system_type *fs_type, BUG_ON(!root); if (root == opts.new_root) { /* We used the new root structure, so this is a new hierarchy */ - struct list_head tmp_cg_links; + struct list_head tmp_links; struct cgroup *root_cgrp = &root->top_cgroup; struct cgroupfs_root *existing_root; const struct cred *cred; @@ -1636,7 +1642,7 @@ static struct dentry *cgroup_mount(struct file_system_type *fs_type, * that's us. The worst that can happen is that we * have some link structures left over */ - ret = allocate_cg_links(css_set_count, &tmp_cg_links); + ret = allocate_cgrp_cset_links(css_set_count, &tmp_links); if (ret) goto unlock_drop; @@ -1646,7 +1652,7 @@ static struct dentry *cgroup_mount(struct file_system_type *fs_type, ret = rebind_subsystems(root, root->subsys_mask); if (ret == -EBUSY) { - free_cg_links(&tmp_cg_links); + free_cgrp_cset_links(&tmp_links); goto unlock_drop; } /* @@ -1668,10 +1674,10 @@ static struct dentry *cgroup_mount(struct file_system_type *fs_type, * the css_set objects */ write_lock(&css_set_lock); hash_for_each(css_set_table, i, cset, hlist) - link_css_set(&tmp_cg_links, cset, root_cgrp); + link_css_set(&tmp_links, cset, root_cgrp); write_unlock(&css_set_lock); - free_cg_links(&tmp_cg_links); + free_cgrp_cset_links(&tmp_links); BUG_ON(!list_empty(&root_cgrp->children)); BUG_ON(root->number_of_cgroups != 1); @@ -1725,9 +1731,8 @@ static struct dentry *cgroup_mount(struct file_system_type *fs_type, static void cgroup_kill_sb(struct super_block *sb) { struct cgroupfs_root *root = sb->s_fs_info; struct cgroup *cgrp = &root->top_cgroup; + struct cgrp_cset_link *link, *tmp_link; int ret; - struct cg_cgroup_link *link; - struct cg_cgroup_link *saved_link; BUG_ON(!root); @@ -1743,15 +1748,14 @@ static void cgroup_kill_sb(struct super_block *sb) { BUG_ON(ret); /* - * Release all the links from css_sets to this hierarchy's + * Release all the links from cset_links to this hierarchy's * root cgroup */ write_lock(&css_set_lock); - list_for_each_entry_safe(link, saved_link, &cgrp->css_sets, - cgrp_link_list) { - list_del(&link->cg_link_list); - list_del(&link->cgrp_link_list); + list_for_each_entry_safe(link, tmp_link, &cgrp->cset_links, cset_link) { + list_del(&link->cset_link); + list_del(&link->cgrp_link); kfree(link); } write_unlock(&css_set_lock); @@ -2911,12 +2915,11 @@ int cgroup_rm_cftypes(struct cgroup_subsys *ss, struct cftype *cfts) int cgroup_task_count(const struct cgroup *cgrp) { int count = 0; - struct cg_cgroup_link *link; + struct cgrp_cset_link *link; read_lock(&css_set_lock); - list_for_each_entry(link, &cgrp->css_sets, cgrp_link_list) { - count += atomic_read(&link->cg->refcount); - } + list_for_each_entry(link, &cgrp->cset_links, cset_link) + count += atomic_read(&link->cset->refcount); read_unlock(&css_set_lock); return count; } @@ -2925,24 +2928,23 @@ int cgroup_task_count(const struct cgroup *cgrp) * Advance a list_head iterator. The iterator should be positioned at * the start of a css_set */ -static void cgroup_advance_iter(struct cgroup *cgrp, - struct cgroup_iter *it) +static void cgroup_advance_iter(struct cgroup *cgrp, struct cgroup_iter *it) { - struct list_head *l = it->cg_link; - struct cg_cgroup_link *link; + struct list_head *l = it->cset_link; + struct cgrp_cset_link *link; struct css_set *cset; /* Advance to the next non-empty css_set */ do { l = l->next; - if (l == &cgrp->css_sets) { - it->cg_link = NULL; + if (l == &cgrp->cset_links) { + it->cset_link = NULL; return; } - link = list_entry(l, struct cg_cgroup_link, cgrp_link_list); - cset = link->cg; + link = list_entry(l, struct cgrp_cset_link, cset_link); + cset = link->cset; } while (list_empty(&cset->tasks)); - it->cg_link = l; + it->cset_link = l; it->task = cset->tasks.next; } @@ -3163,7 +3165,7 @@ void cgroup_iter_start(struct cgroup *cgrp, struct cgroup_iter *it) cgroup_enable_task_cg_lists(); read_lock(&css_set_lock); - it->cg_link = &cgrp->css_sets; + it->cset_link = &cgrp->cset_links; cgroup_advance_iter(cgrp, it); } @@ -3172,16 +3174,16 @@ struct task_struct *cgroup_iter_next(struct cgroup *cgrp, { struct task_struct *res; struct list_head *l = it->task; - struct cg_cgroup_link *link; + struct cgrp_cset_link *link; /* If the iterator cg is NULL, we have no tasks */ - if (!it->cg_link) + if (!it->cset_link) return NULL; res = list_entry(l, struct task_struct, cg_list); /* Advance iterator to find next entry */ l = l->next; - link = list_entry(it->cg_link, struct cg_cgroup_link, cgrp_link_list); - if (l == &link->cg->tasks) { + link = list_entry(it->cset_link, struct cgrp_cset_link, cset_link); + if (l == &link->cset->tasks) { /* We reached the end of this task list - move on to * the next cg_cgroup_link */ cgroup_advance_iter(cgrp, it); @@ -4628,7 +4630,7 @@ EXPORT_SYMBOL_GPL(cgroup_load_subsys); */ void cgroup_unload_subsys(struct cgroup_subsys *ss) { - struct cg_cgroup_link *link; + struct cgrp_cset_link *link; BUG_ON(ss->module == NULL); @@ -4657,8 +4659,8 @@ void cgroup_unload_subsys(struct cgroup_subsys *ss) * in loading, we need to pay our respects to the hashtable gods. */ write_lock(&css_set_lock); - list_for_each_entry(link, &dummytop->css_sets, cgrp_link_list) { - struct css_set *cset = link->cg; + list_for_each_entry(link, &dummytop->cset_links, cset_link) { + struct css_set *cset = link->cset; unsigned long key; hash_del(&cset->hlist); @@ -4691,7 +4693,7 @@ int __init cgroup_init_early(void) { int i; atomic_set(&init_css_set.refcount, 1); - INIT_LIST_HEAD(&init_css_set.cg_links); + INIT_LIST_HEAD(&init_css_set.cgrp_links); INIT_LIST_HEAD(&init_css_set.tasks); INIT_HLIST_NODE(&init_css_set.hlist); css_set_count = 1; @@ -4699,12 +4701,10 @@ int __init cgroup_init_early(void) root_count = 1; init_task.cgroups = &init_css_set; - init_css_set_link.cg = &init_css_set; - init_css_set_link.cgrp = dummytop; - list_add(&init_css_set_link.cgrp_link_list, - &rootnode.top_cgroup.css_sets); - list_add(&init_css_set_link.cg_link_list, - &init_css_set.cg_links); + init_cgrp_cset_link.cset = &init_css_set; + init_cgrp_cset_link.cgrp = dummytop; + list_add(&init_cgrp_cset_link.cset_link, &rootnode.top_cgroup.cset_links); + list_add(&init_cgrp_cset_link.cgrp_link, &init_css_set.cgrp_links); for (i = 0; i < CGROUP_SUBSYS_COUNT; i++) { struct cgroup_subsys *ss = subsys[i]; @@ -5457,13 +5457,13 @@ static int current_css_set_cg_links_read(struct cgroup *cont, struct cftype *cft, struct seq_file *seq) { - struct cg_cgroup_link *link; + struct cgrp_cset_link *link; struct css_set *cset; read_lock(&css_set_lock); rcu_read_lock(); cset = rcu_dereference(current->cgroups); - list_for_each_entry(link, &cset->cg_links, cg_link_list) { + list_for_each_entry(link, &cset->cgrp_links, cgrp_link) { struct cgroup *c = link->cgrp; const char *name; @@ -5484,11 +5484,11 @@ static int cgroup_css_links_read(struct cgroup *cont, struct cftype *cft, struct seq_file *seq) { - struct cg_cgroup_link *link; + struct cgrp_cset_link *link; read_lock(&css_set_lock); - list_for_each_entry(link, &cont->css_sets, cgrp_link_list) { - struct css_set *cset = link->cg; + list_for_each_entry(link, &cont->cset_links, cset_link) { + struct css_set *cset = link->cset; struct task_struct *task; int count = 0; seq_printf(seq, "css_set %p\n", cset); -- 1.8.2.1 ^ permalink raw reply related [flat|nested] 27+ messages in thread
* [PATCH 04/11] cgroup: use kzalloc() instead of kmalloc() [not found] ` <1371096298-24402-1-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> ` (2 preceding siblings ...) 2013-06-13 4:04 ` [PATCH 03/11] cgroup: bring some sanity to naming around cg_cgroup_link Tejun Heo @ 2013-06-13 4:04 ` Tejun Heo 2013-06-13 4:04 ` [PATCH 06/11] cgroup: rename CGRP_REMOVED to CGRP_DEAD Tejun Heo ` (3 subsequent siblings) 7 siblings, 0 replies; 27+ messages in thread From: Tejun Heo @ 2013-06-13 4:04 UTC (permalink / raw) To: lizefan-hv44wF8Li93QT0dZR+AlfA Cc: cl-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Tejun Heo, cgroups-u79uwXL29TY76Z2rM5mHXA, koverstreet-hpIqsD4AKlfQT0dZR+AlfA There's no point in using kmalloc() instead of the clearing variant for trivial stuff. We can live dangerously elsewhere. Use kzalloc() instead and drop 0 inits. While at it, do trivial code reorganization in cgroup_file_open(). This patch doesn't introduce any functional changes. v2: I was caught in the very distant past where list_del() didn't poison and the initial version converted list_del()s to list_del_init()s too. Li and Kent took me out of the stasis chamber. Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> Cc: Li Zefan <lizefan-hv44wF8Li93QT0dZR+AlfA@public.gmane.org> Cc: Kent Overstreet <koverstreet-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> --- kernel/cgroup.c | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/kernel/cgroup.c b/kernel/cgroup.c index 0b5cd0e..e308ee7 100644 --- a/kernel/cgroup.c +++ b/kernel/cgroup.c @@ -597,7 +597,7 @@ static int allocate_cgrp_cset_links(int count, struct list_head *tmp_links) INIT_LIST_HEAD(tmp_links); for (i = 0; i < count; i++) { - link = kmalloc(sizeof(*link), GFP_KERNEL); + link = kzalloc(sizeof(*link), GFP_KERNEL); if (!link) { free_cgrp_cset_links(tmp_links); return -ENOMEM; @@ -658,7 +658,7 @@ static struct css_set *find_css_set(struct css_set *old_cset, if (cset) return cset; - cset = kmalloc(sizeof(*cset), GFP_KERNEL); + cset = kzalloc(sizeof(*cset), GFP_KERNEL); if (!cset) return NULL; @@ -2478,10 +2478,12 @@ static int cgroup_file_open(struct inode *inode, struct file *file) cft = __d_cft(file->f_dentry); if (cft->read_map || cft->read_seq_string) { - struct cgroup_seqfile_state *state = - kzalloc(sizeof(*state), GFP_USER); + struct cgroup_seqfile_state *state; + + state = kzalloc(sizeof(*state), GFP_USER); if (!state) return -ENOMEM; + state->cft = cft; state->cgroup = __d_cgrp(file->f_dentry->d_parent); file->f_op = &cgroup_seqfile_operations; @@ -3514,7 +3516,7 @@ static struct cgroup_pidlist *cgroup_pidlist_find(struct cgroup *cgrp, } } /* entry not found; create a new one */ - l = kmalloc(sizeof(struct cgroup_pidlist), GFP_KERNEL); + l = kzalloc(sizeof(struct cgroup_pidlist), GFP_KERNEL); if (!l) { mutex_unlock(&cgrp->pidlist_mutex); return l; @@ -3523,8 +3525,6 @@ static struct cgroup_pidlist *cgroup_pidlist_find(struct cgroup *cgrp, down_write(&l->mutex); l->key.type = type; l->key.ns = get_pid_ns(ns); - l->use_count = 0; /* don't increment here */ - l->list = NULL; l->owner = cgrp; list_add(&l->links, &cgrp->pidlists); mutex_unlock(&cgrp->pidlist_mutex); -- 1.8.2.1 ^ permalink raw reply related [flat|nested] 27+ messages in thread
* [PATCH 06/11] cgroup: rename CGRP_REMOVED to CGRP_DEAD [not found] ` <1371096298-24402-1-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> ` (3 preceding siblings ...) 2013-06-13 4:04 ` [PATCH 04/11] cgroup: use kzalloc() instead of kmalloc() Tejun Heo @ 2013-06-13 4:04 ` Tejun Heo 2013-06-13 4:04 ` [PATCH 07/11] cgroup: drop unnecessary RCU dancing from __put_css_set() Tejun Heo ` (2 subsequent siblings) 7 siblings, 0 replies; 27+ messages in thread From: Tejun Heo @ 2013-06-13 4:04 UTC (permalink / raw) To: lizefan-hv44wF8Li93QT0dZR+AlfA Cc: cl-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Tejun Heo, cgroups-u79uwXL29TY76Z2rM5mHXA, koverstreet-hpIqsD4AKlfQT0dZR+AlfA We will add another flag indicating that the cgroup is in the process of being killed. REMOVING / REMOVED is more difficult to distinguish and cgroup_is_removing()/cgroup_is_removed() are a bit awkward. Also, later percpu_ref usage will involve "kill"ing the refcnt. s/CGRP_REMOVED/CGRP_DEAD/ s/cgroup_is_removed()/cgroup_is_dead() This patch is purely cosmetic. Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> --- include/linux/cgroup.h | 2 +- kernel/cgroup.c | 30 ++++++++++++++---------------- 2 files changed, 15 insertions(+), 17 deletions(-) diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h index a494636..c86a93a 100644 --- a/include/linux/cgroup.h +++ b/include/linux/cgroup.h @@ -143,7 +143,7 @@ static inline void css_put(struct cgroup_subsys_state *css) /* bits in struct cgroup flags field */ enum { /* Control Group is dead */ - CGRP_REMOVED, + CGRP_DEAD, /* * Control Group has previously had a child cgroup or a task, * but no longer (only if CGRP_NOTIFY_ON_RELEASE is set) diff --git a/kernel/cgroup.c b/kernel/cgroup.c index e308ee7..8f5fbcb 100644 --- a/kernel/cgroup.c +++ b/kernel/cgroup.c @@ -226,9 +226,9 @@ static int css_refcnt(struct cgroup_subsys_state *css) } /* convenient tests for these bits */ -static inline bool cgroup_is_removed(const struct cgroup *cgrp) +static inline bool cgroup_is_dead(const struct cgroup *cgrp) { - return test_bit(CGRP_REMOVED, &cgrp->flags); + return test_bit(CGRP_DEAD, &cgrp->flags); } /** @@ -300,7 +300,7 @@ static inline struct cftype *__d_cft(struct dentry *dentry) static bool cgroup_lock_live_group(struct cgroup *cgrp) { mutex_lock(&cgroup_mutex); - if (cgroup_is_removed(cgrp)) { + if (cgroup_is_dead(cgrp)) { mutex_unlock(&cgroup_mutex); return false; } @@ -892,7 +892,7 @@ static void cgroup_diput(struct dentry *dentry, struct inode *inode) if (S_ISDIR(inode->i_mode)) { struct cgroup *cgrp = dentry->d_fsdata; - BUG_ON(!(cgroup_is_removed(cgrp))); + BUG_ON(!(cgroup_is_dead(cgrp))); call_rcu(&cgrp->rcu_head, cgroup_free_rcu); } else { struct cfent *cfe = __d_cfe(dentry); @@ -2366,7 +2366,7 @@ static ssize_t cgroup_file_write(struct file *file, const char __user *buf, struct cftype *cft = __d_cft(file->f_dentry); struct cgroup *cgrp = __d_cgrp(file->f_dentry->d_parent); - if (cgroup_is_removed(cgrp)) + if (cgroup_is_dead(cgrp)) return -ENODEV; if (cft->write) return cft->write(cgrp, cft, file, buf, nbytes, ppos); @@ -2411,7 +2411,7 @@ static ssize_t cgroup_file_read(struct file *file, char __user *buf, struct cftype *cft = __d_cft(file->f_dentry); struct cgroup *cgrp = __d_cgrp(file->f_dentry->d_parent); - if (cgroup_is_removed(cgrp)) + if (cgroup_is_dead(cgrp)) return -ENODEV; if (cft->read) @@ -2834,7 +2834,7 @@ static void cgroup_cfts_commit(struct cgroup_subsys *ss, mutex_lock(&inode->i_mutex); mutex_lock(&cgroup_mutex); - if (!cgroup_is_removed(cgrp)) + if (!cgroup_is_dead(cgrp)) cgroup_addrm_files(cgrp, ss, cfts, is_add); mutex_unlock(&cgroup_mutex); mutex_unlock(&inode->i_mutex); @@ -3002,14 +3002,14 @@ struct cgroup *cgroup_next_sibling(struct cgroup *pos) /* * @pos could already have been removed. Once a cgroup is removed, * its ->sibling.next is no longer updated when its next sibling - * changes. As CGRP_REMOVED is set on removal which is fully + * changes. As CGRP_DEAD is set on removal which is fully * serialized, if we see it unasserted, it's guaranteed that the * next sibling hasn't finished its grace period even if it's * already removed, and thus safe to dereference from this RCU * critical section. If ->sibling.next is inaccessible, - * cgroup_is_removed() is guaranteed to be visible as %true here. + * cgroup_is_dead() is guaranteed to be visible as %true here. */ - if (likely(!cgroup_is_removed(pos))) { + if (likely(!cgroup_is_dead(pos))) { next = list_entry_rcu(pos->sibling.next, struct cgroup, sibling); if (&next->sibling != &pos->parent->children) return next; @@ -4386,7 +4386,7 @@ static int cgroup_destroy_locked(struct cgroup *cgrp) * attempts fail thus maintaining the removal conditions verified * above. * - * Note that CGRP_REMVOED clearing is depended upon by + * Note that CGRP_DEAD assertion is depended upon by * cgroup_next_sibling() to resume iteration after dropping RCU * read lock. See cgroup_next_sibling() for details. */ @@ -4396,7 +4396,7 @@ static int cgroup_destroy_locked(struct cgroup *cgrp) WARN_ON(atomic_read(&css->refcnt) < 0); atomic_add(CSS_DEACT_BIAS, &css->refcnt); } - set_bit(CGRP_REMOVED, &cgrp->flags); + set_bit(CGRP_DEAD, &cgrp->flags); /* tell subsystems to initate destruction */ for_each_subsys(cgrp->root, ss) @@ -5066,7 +5066,7 @@ static void check_for_release(struct cgroup *cgrp) int need_schedule_work = 0; raw_spin_lock(&release_list_lock); - if (!cgroup_is_removed(cgrp) && + if (!cgroup_is_dead(cgrp) && list_empty(&cgrp->release_list)) { list_add(&cgrp->release_list, &release_list); need_schedule_work = 1; @@ -5212,9 +5212,7 @@ __setup("cgroup_disable=", cgroup_disable); * Functons for CSS ID. */ -/* - *To get ID other than 0, this should be called when !cgroup_is_removed(). - */ +/* to get ID other than 0, this should be called when !cgroup_is_dead() */ unsigned short css_id(struct cgroup_subsys_state *css) { struct css_id *cssid; -- 1.8.2.1 ^ permalink raw reply related [flat|nested] 27+ messages in thread
* [PATCH 07/11] cgroup: drop unnecessary RCU dancing from __put_css_set() [not found] ` <1371096298-24402-1-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> ` (4 preceding siblings ...) 2013-06-13 4:04 ` [PATCH 06/11] cgroup: rename CGRP_REMOVED to CGRP_DEAD Tejun Heo @ 2013-06-13 4:04 ` Tejun Heo 2013-06-13 4:04 ` [PATCH 09/11] cgroup: reorder the operations in cgroup_destroy_locked() Tejun Heo 2013-06-13 6:04 ` [PATCHSET v2 cgroup/for-3.11] cgroup: convert cgroup_subsys_state refcnt to percpu_ref Li Zefan 7 siblings, 0 replies; 27+ messages in thread From: Tejun Heo @ 2013-06-13 4:04 UTC (permalink / raw) To: lizefan-hv44wF8Li93QT0dZR+AlfA Cc: cl-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Tejun Heo, cgroups-u79uwXL29TY76Z2rM5mHXA, koverstreet-hpIqsD4AKlfQT0dZR+AlfA __put_css_set() does RCU read access on @cgrp across dropping @cgrp->count so that it can continue accessing @cgrp even if the count reached zero and destruction of the cgroup commenced. Given that both sides - __css_put() and cgroup_destroy_locked() - are cold paths, this is unnecessary. Just making cgroup_destroy_locked() grab css_set_lock while checking @cgrp->count is enough. Remove the RCU read locking from __put_css_set() and make cgroup_destroy_locked() read-lock css_set_lock when checking @cgrp->count. This will also allow removing @cgrp->count. Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> --- kernel/cgroup.c | 20 ++++++++++---------- 1 file changed, 10 insertions(+), 10 deletions(-) diff --git a/kernel/cgroup.c b/kernel/cgroup.c index 8f5fbcb..22f9729 100644 --- a/kernel/cgroup.c +++ b/kernel/cgroup.c @@ -407,19 +407,13 @@ static void __put_css_set(struct css_set *cset, int taskexit) list_del(&link->cset_link); list_del(&link->cgrp_link); - /* - * We may not be holding cgroup_mutex, and if cgrp->count is - * dropped to 0 the cgroup can be destroyed at any time, hence - * rcu_read_lock is used to keep it alive. - */ - rcu_read_lock(); + /* @cgrp can't go away while we're holding css_set_lock */ if (atomic_dec_and_test(&cgrp->count) && notify_on_release(cgrp)) { if (taskexit) set_bit(CGRP_RELEASABLE, &cgrp->flags); check_for_release(cgrp); } - rcu_read_unlock(); kfree(link); } @@ -4373,11 +4367,19 @@ static int cgroup_destroy_locked(struct cgroup *cgrp) struct cgroup *parent = cgrp->parent; struct cgroup_event *event, *tmp; struct cgroup_subsys *ss; + bool empty; lockdep_assert_held(&d->d_inode->i_mutex); lockdep_assert_held(&cgroup_mutex); - if (atomic_read(&cgrp->count) || !list_empty(&cgrp->children)) + /* + * css_set_lock prevents @cgrp from being removed while + * __put_css_set() is in progress. + */ + read_lock(&css_set_lock); + empty = !atomic_read(&cgrp->count) && list_empty(&cgrp->children); + read_unlock(&css_set_lock); + if (!empty) return -EBUSY; /* @@ -5054,8 +5056,6 @@ void cgroup_exit(struct task_struct *tsk, int run_callbacks) static void check_for_release(struct cgroup *cgrp) { - /* All of these checks rely on RCU to keep the cgroup - * structure alive */ if (cgroup_is_releasable(cgrp) && !atomic_read(&cgrp->count) && list_empty(&cgrp->children)) { /* -- 1.8.2.1 ^ permalink raw reply related [flat|nested] 27+ messages in thread
* [PATCH 09/11] cgroup: reorder the operations in cgroup_destroy_locked() [not found] ` <1371096298-24402-1-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> ` (5 preceding siblings ...) 2013-06-13 4:04 ` [PATCH 07/11] cgroup: drop unnecessary RCU dancing from __put_css_set() Tejun Heo @ 2013-06-13 4:04 ` Tejun Heo 2013-06-13 6:04 ` [PATCHSET v2 cgroup/for-3.11] cgroup: convert cgroup_subsys_state refcnt to percpu_ref Li Zefan 7 siblings, 0 replies; 27+ messages in thread From: Tejun Heo @ 2013-06-13 4:04 UTC (permalink / raw) To: lizefan-hv44wF8Li93QT0dZR+AlfA Cc: cl-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Tejun Heo, cgroups-u79uwXL29TY76Z2rM5mHXA, koverstreet-hpIqsD4AKlfQT0dZR+AlfA This patch reorders the operations in cgroup_destroy_locked() such that the userland visible parts happen before css offlining and removal from the ->sibling list. This will be used to make css use percpu refcnt. While at it, split out CGRP_DEAD related comment from the refcnt deactivation one and correct / clarify how different guarantees are met. While this patch changes the specific order of operations, it shouldn't cause any noticeable behavior difference. Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> --- kernel/cgroup.c | 61 +++++++++++++++++++++++++++++++++------------------------ 1 file changed, 35 insertions(+), 26 deletions(-) diff --git a/kernel/cgroup.c b/kernel/cgroup.c index 062653f..9261acc 100644 --- a/kernel/cgroup.c +++ b/kernel/cgroup.c @@ -4382,13 +4382,8 @@ static int cgroup_destroy_locked(struct cgroup *cgrp) /* * Block new css_tryget() by deactivating refcnt and mark @cgrp - * removed. This makes future css_tryget() and child creation - * attempts fail thus maintaining the removal conditions verified - * above. - * - * Note that CGRP_DEAD assertion is depended upon by - * cgroup_next_sibling() to resume iteration after dropping RCU - * read lock. See cgroup_next_sibling() for details. + * removed. This makes future css_tryget() attempts fail which we + * guarantee to ->css_offline() callbacks. */ for_each_subsys(cgrp->root, ss) { struct cgroup_subsys_state *css = cgrp->subsys[ss->subsys_id]; @@ -4396,8 +4391,41 @@ static int cgroup_destroy_locked(struct cgroup *cgrp) WARN_ON(atomic_read(&css->refcnt) < 0); atomic_add(CSS_DEACT_BIAS, &css->refcnt); } + + /* + * Mark @cgrp dead. This prevents further task migration and child + * creation by disabling cgroup_lock_live_group(). Note that + * CGRP_DEAD assertion is depended upon by cgroup_next_sibling() to + * resume iteration after dropping RCU read lock. See + * cgroup_next_sibling() for details. + */ set_bit(CGRP_DEAD, &cgrp->flags); + /* CGRP_DEAD is set, remove from ->release_list for the last time */ + raw_spin_lock(&release_list_lock); + if (!list_empty(&cgrp->release_list)) + list_del_init(&cgrp->release_list); + raw_spin_unlock(&release_list_lock); + + /* + * Remove @cgrp directory. The removal puts the base ref but we + * aren't quite done with @cgrp yet, so hold onto it. + */ + dget(d); + cgroup_d_remove_dir(d); + + /* + * Unregister events and notify userspace. + * Notify userspace about cgroup removing only after rmdir of cgroup + * directory to avoid race between userspace and kernelspace. + */ + spin_lock(&cgrp->event_list_lock); + list_for_each_entry_safe(event, tmp, &cgrp->event_list, list) { + list_del_init(&event->list); + schedule_work(&event->remove); + } + spin_unlock(&cgrp->event_list_lock); + /* tell subsystems to initate destruction */ for_each_subsys(cgrp->root, ss) offline_css(ss, cgrp); @@ -4412,34 +4440,15 @@ static int cgroup_destroy_locked(struct cgroup *cgrp) for_each_subsys(cgrp->root, ss) css_put(cgrp->subsys[ss->subsys_id]); - raw_spin_lock(&release_list_lock); - if (!list_empty(&cgrp->release_list)) - list_del_init(&cgrp->release_list); - raw_spin_unlock(&release_list_lock); - /* delete this cgroup from parent->children */ list_del_rcu(&cgrp->sibling); list_del_init(&cgrp->allcg_node); - dget(d); - cgroup_d_remove_dir(d); dput(d); set_bit(CGRP_RELEASABLE, &parent->flags); check_for_release(parent); - /* - * Unregister events and notify userspace. - * Notify userspace about cgroup removing only after rmdir of cgroup - * directory to avoid race between userspace and kernelspace. - */ - spin_lock(&cgrp->event_list_lock); - list_for_each_entry_safe(event, tmp, &cgrp->event_list, list) { - list_del_init(&event->list); - schedule_work(&event->remove); - } - spin_unlock(&cgrp->event_list_lock); - return 0; } -- 1.8.2.1 ^ permalink raw reply related [flat|nested] 27+ messages in thread
* Re: [PATCHSET v2 cgroup/for-3.11] cgroup: convert cgroup_subsys_state refcnt to percpu_ref [not found] ` <1371096298-24402-1-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> ` (6 preceding siblings ...) 2013-06-13 4:04 ` [PATCH 09/11] cgroup: reorder the operations in cgroup_destroy_locked() Tejun Heo @ 2013-06-13 6:04 ` Li Zefan [not found] ` <51B960FF.7070604-hv44wF8Li93QT0dZR+AlfA@public.gmane.org> 7 siblings, 1 reply; 27+ messages in thread From: Li Zefan @ 2013-06-13 6:04 UTC (permalink / raw) To: Tejun Heo Cc: cgroups-u79uwXL29TY76Z2rM5mHXA, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, cl-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, koverstreet-hpIqsD4AKlfQT0dZR+AlfA, linux-kernel-u79uwXL29TY76Z2rM5mHXA On 2013/6/13 12:04, Tejun Heo wrote: > Hello, > > The changes from the last take[L] are, > > * Rebased on top of further percpu-refcount updates. > > * 0003: Broken patch description updated. > > * 0004: Stupid list_del_init() conversions from the last decade > dropped. > > * 0005: Typo fix. > > * 0011: percpu_ref_init() error handling fixed. Premature and > duplicate base css ref puts fixed. > > This patchset does a lot of cleanup and then updates css > (cgroup_subsys_state) refcnt to use the new percpu reference counter. > A css (cgroup_subsys_state) is how each cgroup is represented to a > controller. As such, it can be used in hot paths across the various > subsystems different controllers are associated with. > > One of the common operations is reference counting, which up until now > has been implemented using a global atomic counter and can have > significant adverse impact on scalability. For example, css refcnt > can be gotten and put multiple times by blkcg for each IO request. > For highops configurations which try to do as much per-cpu as > possible, the global frequent refcnting can be very expensive. > > In general, given the various hugely diverse paths css's end up being > used from, we need to make it cheap and highly scalable. In its > usage, css refcnting isn't very different from module refcnting. > > This patchset contains the following 11 patches. > > 0001-cgroup-remove-now-unused-css_depth.patch > 0002-cgroup-consistently-use-cset-for-struct-css_set-vari.patch > 0003-cgroup-bring-some-sanity-to-naming-around-cg_cgroup_.patch > 0004-cgroup-use-kzalloc-instead-of-kmalloc.patch > 0005-cgroup-clean-up-css_-try-get-and-css_put.patch > 0006-cgroup-rename-CGRP_REMOVED-to-CGRP_DEAD.patch > 0007-cgroup-drop-unnecessary-RCU-dancing-from-__put_css_s.patch > 0008-cgroup-remove-cgroup-count-and-use.patch > 0009-cgroup-reorder-the-operations-in-cgroup_destroy_lock.patch > 0010-cgroup-split-cgroup-destruction-into-two-steps.patch > 0011-cgroup-use-percpu-refcnt-for-cgroup_subsys_states.patch > > 0001-0007 are cleanups, many of them cosmetic. They clean up the code > paths which are related with the refcnting changes. If you're only > interested in the precpu usage, you can probably skip these. > > 0008-0010 updates the cgroup destruction path so that percpu refcnting > can be used. > > 0011 updates css refcnting to use percpu_ref. > > This patchset is on top of > > cgroup/for-3.11 d5c56ced77 ("cgroup: clean up the cftype array for the base cgroup files") > + [1] percpu/review-percpu-ref-tryget > > and available in the following git branch. > > git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git review-css-percpu-ref > > diffstat follows. Thanks. > > include/linux/cgroup.h | 87 ++--- > kernel/cgroup.c | 723 ++++++++++++++++++++++++++----------------------- > 2 files changed, 420 insertions(+), 390 deletions(-) > Acked-by: Li Zefan <lizefan-hv44wF8Li93QT0dZR+AlfA@public.gmane.org> ^ permalink raw reply [flat|nested] 27+ messages in thread
[parent not found: <51B960FF.7070604-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>]
* Re: [PATCHSET v2 cgroup/for-3.11] cgroup: convert cgroup_subsys_state refcnt to percpu_ref [not found] ` <51B960FF.7070604-hv44wF8Li93QT0dZR+AlfA@public.gmane.org> @ 2013-06-13 17:56 ` Tejun Heo [not found] ` <20130613175627.GC10799-9pTldWuhBndy/B6EtB590w@public.gmane.org> 0 siblings, 1 reply; 27+ messages in thread From: Tejun Heo @ 2013-06-13 17:56 UTC (permalink / raw) To: Li Zefan Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, cgroups-u79uwXL29TY76Z2rM5mHXA, koverstreet-hpIqsD4AKlfQT0dZR+AlfA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, cl-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b On Thu, Jun 13, 2013 at 02:04:47PM +0800, Li Zefan wrote: > > 0001-cgroup-remove-now-unused-css_depth.patch > > 0002-cgroup-consistently-use-cset-for-struct-css_set-vari.patch > > 0003-cgroup-bring-some-sanity-to-naming-around-cg_cgroup_.patch > > 0004-cgroup-use-kzalloc-instead-of-kmalloc.patch > > 0005-cgroup-clean-up-css_-try-get-and-css_put.patch > > 0006-cgroup-rename-CGRP_REMOVED-to-CGRP_DEAD.patch > > 0007-cgroup-drop-unnecessary-RCU-dancing-from-__put_css_s.patch > > 0008-cgroup-remove-cgroup-count-and-use.patch > > 0009-cgroup-reorder-the-operations-in-cgroup_destroy_lock.patch > > 0010-cgroup-split-cgroup-destruction-into-two-steps.patch > > 0011-cgroup-use-percpu-refcnt-for-cgroup_subsys_states.patch > > Acked-by: Li Zefan <lizefan-hv44wF8Li93QT0dZR+AlfA@public.gmane.org> Applied 0001-0008 to cgroup/for-3.11. Will apply the reset once the percpu-refcount side is settled. Thanks! -- tejun ^ permalink raw reply [flat|nested] 27+ messages in thread
[parent not found: <20130613175627.GC10799-9pTldWuhBndy/B6EtB590w@public.gmane.org>]
* Re: [PATCHSET v2 cgroup/for-3.11] cgroup: convert cgroup_subsys_state refcnt to percpu_ref [not found] ` <20130613175627.GC10799-9pTldWuhBndy/B6EtB590w@public.gmane.org> @ 2013-06-14 2:41 ` Tejun Heo 0 siblings, 0 replies; 27+ messages in thread From: Tejun Heo @ 2013-06-14 2:41 UTC (permalink / raw) To: Li Zefan Cc: cgroups-u79uwXL29TY76Z2rM5mHXA, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, cl-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, koverstreet-hpIqsD4AKlfQT0dZR+AlfA, linux-kernel-u79uwXL29TY76Z2rM5mHXA On Thu, Jun 13, 2013 at 10:56:27AM -0700, Tejun Heo wrote: > On Thu, Jun 13, 2013 at 02:04:47PM +0800, Li Zefan wrote: > > > 0001-cgroup-remove-now-unused-css_depth.patch > > > 0002-cgroup-consistently-use-cset-for-struct-css_set-vari.patch > > > 0003-cgroup-bring-some-sanity-to-naming-around-cg_cgroup_.patch > > > 0004-cgroup-use-kzalloc-instead-of-kmalloc.patch > > > 0005-cgroup-clean-up-css_-try-get-and-css_put.patch > > > 0006-cgroup-rename-CGRP_REMOVED-to-CGRP_DEAD.patch > > > 0007-cgroup-drop-unnecessary-RCU-dancing-from-__put_css_s.patch > > > 0008-cgroup-remove-cgroup-count-and-use.patch > > > 0009-cgroup-reorder-the-operations-in-cgroup_destroy_lock.patch > > > 0010-cgroup-split-cgroup-destruction-into-two-steps.patch > > > 0011-cgroup-use-percpu-refcnt-for-cgroup_subsys_states.patch > > > > Acked-by: Li Zefan <lizefan-hv44wF8Li93QT0dZR+AlfA@public.gmane.org> > > Applied 0001-0008 to cgroup/for-3.11. Will apply the reset once the > percpu-refcount side is settled. Applied 0009-0010 to cgroup/for-3.11; then pulled in percpu/for-3.11 for percpu_ref and applied 0011. Thanks! -- tejun ^ permalink raw reply [flat|nested] 27+ messages in thread
* [PATCH 08/11] cgroup: remove cgroup->count and use 2013-06-13 4:04 [PATCHSET v2 cgroup/for-3.11] cgroup: convert cgroup_subsys_state refcnt to percpu_ref Tejun Heo 2013-06-13 4:04 ` [PATCH 05/11] cgroup: clean up css_[try]get() and css_put() Tejun Heo [not found] ` <1371096298-24402-1-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> @ 2013-06-13 4:04 ` Tejun Heo 2013-06-13 4:04 ` [PATCH 10/11] cgroup: split cgroup destruction into two steps Tejun Heo 2013-06-13 4:04 ` [PATCH 11/11] cgroup: use percpu refcnt for cgroup_subsys_states Tejun Heo 4 siblings, 0 replies; 27+ messages in thread From: Tejun Heo @ 2013-06-13 4:04 UTC (permalink / raw) To: lizefan; +Cc: containers, cgroups, koverstreet, linux-kernel, cl, Tejun Heo cgroup->count tracks the number of css_sets associated with the cgroup and used only to verify that no css_set is associated when the cgroup is being destroyed. It's superflous as the destruction path can simply check whether cgroup->cset_links is empty instead. Drop cgroup->count and check ->cset_links directly from cgroup_destroy_locked(). Signed-off-by: Tejun Heo <tj@kernel.org> --- include/linux/cgroup.h | 6 ------ kernel/cgroup.c | 21 +++++---------------- 2 files changed, 5 insertions(+), 22 deletions(-) diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h index c86a93a..81bfd02 100644 --- a/include/linux/cgroup.h +++ b/include/linux/cgroup.h @@ -169,12 +169,6 @@ struct cgroup_name { struct cgroup { unsigned long flags; /* "unsigned long" so bitops work */ - /* - * count users of this cgroup. >0 means busy, but doesn't - * necessarily indicate the number of tasks in the cgroup - */ - atomic_t count; - int id; /* ida allocated in-hierarchy ID */ /* diff --git a/kernel/cgroup.c b/kernel/cgroup.c index 22f9729..062653f 100644 --- a/kernel/cgroup.c +++ b/kernel/cgroup.c @@ -408,8 +408,7 @@ static void __put_css_set(struct css_set *cset, int taskexit) list_del(&link->cgrp_link); /* @cgrp can't go away while we're holding css_set_lock */ - if (atomic_dec_and_test(&cgrp->count) && - notify_on_release(cgrp)) { + if (list_empty(&cgrp->cset_links) && notify_on_release(cgrp)) { if (taskexit) set_bit(CGRP_RELEASABLE, &cgrp->flags); check_for_release(cgrp); @@ -616,7 +615,6 @@ static void link_css_set(struct list_head *tmp_links, struct css_set *cset, link = list_first_entry(tmp_links, struct cgrp_cset_link, cset_link); link->cset = cset; link->cgrp = cgrp; - atomic_inc(&cgrp->count); list_move(&link->cset_link, &cgrp->cset_links); /* * Always add links to the tail of the list so that the list @@ -4373,11 +4371,11 @@ static int cgroup_destroy_locked(struct cgroup *cgrp) lockdep_assert_held(&cgroup_mutex); /* - * css_set_lock prevents @cgrp from being removed while - * __put_css_set() is in progress. + * css_set_lock synchronizes access to ->cset_links and prevents + * @cgrp from being removed while __put_css_set() is in progress. */ read_lock(&css_set_lock); - empty = !atomic_read(&cgrp->count) && list_empty(&cgrp->children); + empty = list_empty(&cgrp->cset_links) && list_empty(&cgrp->children); read_unlock(&css_set_lock); if (!empty) return -EBUSY; @@ -5057,7 +5055,7 @@ void cgroup_exit(struct task_struct *tsk, int run_callbacks) static void check_for_release(struct cgroup *cgrp) { if (cgroup_is_releasable(cgrp) && - !atomic_read(&cgrp->count) && list_empty(&cgrp->children)) { + list_empty(&cgrp->cset_links) && list_empty(&cgrp->children)) { /* * Control Group is currently removeable. If it's not * already queued for a userspace notification, queue @@ -5425,11 +5423,6 @@ static void debug_css_free(struct cgroup *cont) kfree(cont->subsys[debug_subsys_id]); } -static u64 cgroup_refcount_read(struct cgroup *cont, struct cftype *cft) -{ - return atomic_read(&cont->count); -} - static u64 debug_taskcount_read(struct cgroup *cont, struct cftype *cft) { return cgroup_task_count(cont); @@ -5511,10 +5504,6 @@ static u64 releasable_read(struct cgroup *cgrp, struct cftype *cft) static struct cftype debug_files[] = { { - .name = "cgroup_refcount", - .read_u64 = cgroup_refcount_read, - }, - { .name = "taskcount", .read_u64 = debug_taskcount_read, }, -- 1.8.2.1 ^ permalink raw reply related [flat|nested] 27+ messages in thread
* [PATCH 10/11] cgroup: split cgroup destruction into two steps 2013-06-13 4:04 [PATCHSET v2 cgroup/for-3.11] cgroup: convert cgroup_subsys_state refcnt to percpu_ref Tejun Heo ` (2 preceding siblings ...) 2013-06-13 4:04 ` [PATCH 08/11] cgroup: remove cgroup->count and use Tejun Heo @ 2013-06-13 4:04 ` Tejun Heo 2013-06-13 4:04 ` [PATCH 11/11] cgroup: use percpu refcnt for cgroup_subsys_states Tejun Heo 4 siblings, 0 replies; 27+ messages in thread From: Tejun Heo @ 2013-06-13 4:04 UTC (permalink / raw) To: lizefan; +Cc: containers, cgroups, koverstreet, linux-kernel, cl, Tejun Heo Split cgroup_destroy_locked() into two steps and put the latter half into cgroup_offline_fn() which is executed from a work item. The latter half is responsible for offlining the css's, removing the cgroup from internal lists, and propagating release notification to the parent. The separation is to allow using percpu refcnt for css. Note that this allows for other cgroup operations to happen between the first and second halves of destruction, including creating a new cgroup with the same name. As the target cgroup is marked DEAD in the first half and cgroup internals don't care about the names of cgroups, this should be fine. A comment explaining this will be added by the next patch which implements the actual percpu refcnting. As RCU freeing is guaranteed to happen after the second step of destruction, we can use the same work item for both. This patch renames cgroup->free_work to ->destroy_work and uses it for both purposes. INIT_WORK() is now performed right before queueing the work item. Signed-off-by: Tejun Heo <tj@kernel.org> --- include/linux/cgroup.h | 2 +- kernel/cgroup.c | 38 +++++++++++++++++++++++++++----------- 2 files changed, 28 insertions(+), 12 deletions(-) diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h index 81bfd02..e345d8b 100644 --- a/include/linux/cgroup.h +++ b/include/linux/cgroup.h @@ -233,7 +233,7 @@ struct cgroup { /* For RCU-protected deletion */ struct rcu_head rcu_head; - struct work_struct free_work; + struct work_struct destroy_work; /* List of events which userspace want to receive */ struct list_head event_list; diff --git a/kernel/cgroup.c b/kernel/cgroup.c index 9261acc..ebbfc04 100644 --- a/kernel/cgroup.c +++ b/kernel/cgroup.c @@ -208,6 +208,7 @@ static struct cgroup_name root_cgroup_name = { .name = "/" }; */ static int need_forkexit_callback __read_mostly; +static void cgroup_offline_fn(struct work_struct *work); static int cgroup_destroy_locked(struct cgroup *cgrp); static int cgroup_addrm_files(struct cgroup *cgrp, struct cgroup_subsys *subsys, struct cftype cfts[], bool is_add); @@ -830,7 +831,7 @@ static struct cgroup_name *cgroup_alloc_name(struct dentry *dentry) static void cgroup_free_fn(struct work_struct *work) { - struct cgroup *cgrp = container_of(work, struct cgroup, free_work); + struct cgroup *cgrp = container_of(work, struct cgroup, destroy_work); struct cgroup_subsys *ss; mutex_lock(&cgroup_mutex); @@ -875,7 +876,8 @@ static void cgroup_free_rcu(struct rcu_head *head) { struct cgroup *cgrp = container_of(head, struct cgroup, rcu_head); - schedule_work(&cgrp->free_work); + INIT_WORK(&cgrp->destroy_work, cgroup_free_fn); + schedule_work(&cgrp->destroy_work); } static void cgroup_diput(struct dentry *dentry, struct inode *inode) @@ -1407,7 +1409,6 @@ static void init_cgroup_housekeeping(struct cgroup *cgrp) INIT_LIST_HEAD(&cgrp->allcg_node); INIT_LIST_HEAD(&cgrp->release_list); INIT_LIST_HEAD(&cgrp->pidlists); - INIT_WORK(&cgrp->free_work, cgroup_free_fn); mutex_init(&cgrp->pidlist_mutex); INIT_LIST_HEAD(&cgrp->event_list); spin_lock_init(&cgrp->event_list_lock); @@ -2994,12 +2995,13 @@ struct cgroup *cgroup_next_sibling(struct cgroup *pos) /* * @pos could already have been removed. Once a cgroup is removed, * its ->sibling.next is no longer updated when its next sibling - * changes. As CGRP_DEAD is set on removal which is fully - * serialized, if we see it unasserted, it's guaranteed that the - * next sibling hasn't finished its grace period even if it's - * already removed, and thus safe to dereference from this RCU - * critical section. If ->sibling.next is inaccessible, - * cgroup_is_dead() is guaranteed to be visible as %true here. + * changes. As CGRP_DEAD assertion is serialized and happens + * before the cgroup is taken off the ->sibling list, if we see it + * unasserted, it's guaranteed that the next sibling hasn't + * finished its grace period even if it's already removed, and thus + * safe to dereference from this RCU critical section. If + * ->sibling.next is inaccessible, cgroup_is_dead() is guaranteed + * to be visible as %true here. */ if (likely(!cgroup_is_dead(pos))) { next = list_entry_rcu(pos->sibling.next, struct cgroup, sibling); @@ -4362,7 +4364,6 @@ static int cgroup_destroy_locked(struct cgroup *cgrp) __releases(&cgroup_mutex) __acquires(&cgroup_mutex) { struct dentry *d = cgrp->dentry; - struct cgroup *parent = cgrp->parent; struct cgroup_event *event, *tmp; struct cgroup_subsys *ss; bool empty; @@ -4426,6 +4427,21 @@ static int cgroup_destroy_locked(struct cgroup *cgrp) } spin_unlock(&cgrp->event_list_lock); + INIT_WORK(&cgrp->destroy_work, cgroup_offline_fn); + schedule_work(&cgrp->destroy_work); + + return 0; +}; + +static void cgroup_offline_fn(struct work_struct *work) +{ + struct cgroup *cgrp = container_of(work, struct cgroup, destroy_work); + struct cgroup *parent = cgrp->parent; + struct dentry *d = cgrp->dentry; + struct cgroup_subsys *ss; + + mutex_lock(&cgroup_mutex); + /* tell subsystems to initate destruction */ for_each_subsys(cgrp->root, ss) offline_css(ss, cgrp); @@ -4449,7 +4465,7 @@ static int cgroup_destroy_locked(struct cgroup *cgrp) set_bit(CGRP_RELEASABLE, &parent->flags); check_for_release(parent); - return 0; + mutex_unlock(&cgroup_mutex); } static int cgroup_rmdir(struct inode *unused_dir, struct dentry *dentry) -- 1.8.2.1 ^ permalink raw reply related [flat|nested] 27+ messages in thread
* [PATCH 11/11] cgroup: use percpu refcnt for cgroup_subsys_states 2013-06-13 4:04 [PATCHSET v2 cgroup/for-3.11] cgroup: convert cgroup_subsys_state refcnt to percpu_ref Tejun Heo ` (3 preceding siblings ...) 2013-06-13 4:04 ` [PATCH 10/11] cgroup: split cgroup destruction into two steps Tejun Heo @ 2013-06-13 4:04 ` Tejun Heo [not found] ` <1371096298-24402-12-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> 4 siblings, 1 reply; 27+ messages in thread From: Tejun Heo @ 2013-06-13 4:04 UTC (permalink / raw) To: lizefan Cc: containers, cgroups, koverstreet, linux-kernel, cl, Tejun Heo, Michal Hocko, Mike Snitzer, Vivek Goyal, Alasdair G. Kergon, Jens Axboe, Mikulas Patocka, Glauber Costa A css (cgroup_subsys_state) is how each cgroup is represented to a controller. As such, it can be used in hot paths across the various subsystems different controllers are associated with. One of the common operations is reference counting, which up until now has been implemented using a global atomic counter and can have significant adverse impact on scalability. For example, css refcnt can be gotten and put multiple times by blkcg for each IO request. For highops configurations which try to do as much per-cpu as possible, the global frequent refcnting can be very expensive. In general, given the various and hugely diverse paths css's end up being used from, we need to make it cheap and highly scalable. In its usage, css refcnting isn't very different from module refcnting. This patch converts css refcnting to use the recently added percpu_ref. css_get/tryget/put() directly maps to the matching percpu_ref operations and the deactivation logic is no longer necessary as percpu_ref already has refcnt killing. The only complication is that as the refcnt is per-cpu, percpu_ref_kill() in itself doesn't ensure that further tryget operations will fail, which we need to guarantee before invoking ->css_offline()'s. This is resolved collecting kill confirmation using percpu_ref_kill_and_confirm() and initiating the offline phase of destruction after all css refcnt's are confirmed to be seen as killed on all CPUs. The previous patches already splitted destruction into two phases, so percpu_ref_kill_and_confirm() can be hooked up easily. This patch removes css_refcnt() which is used for rcu dereference sanity check in css_id(). While we can add a percpu refcnt API to ask the same question, css_id() itself is scheduled to be removed fairly soon, so let's not bother with it. Just drop the sanity check and use rcu_dereference_raw() instead. v2: - init_cgroup_css() was calling percpu_ref_init() without checking the return value. This causes two problems - the obvious lack of error handling and percpu_ref_init() being called from cgroup_init_subsys() before the allocators are up, which triggers warnings but doesn't cause actual problems as the refcnt isn't used for roots anyway. Fix both by moving percpu_ref_init() to cgroup_create(). - The base references were put too early by percpu_ref_kill_and_confirm() and cgroup_offline_fn() put the refs one extra time. This wasn't noticeable because css's go through another RCU grace period before being freed. Update cgroup_destroy_locked() to grab an extra reference before killing the refcnts. This problem was noticed by Kent. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Mike Snitzer <snitzer@redhat.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: "Alasdair G. Kergon" <agk@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Mikulas Patocka <mpatocka@redhat.com> Cc: Glauber Costa <glommer@gmail.com> Cc: Kent Overstreet <koverstreet@google.com> --- include/linux/cgroup.h | 23 +++---- kernel/cgroup.c | 165 +++++++++++++++++++++++++++++++------------------ 2 files changed, 112 insertions(+), 76 deletions(-) diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h index e345d8b..b7bd4be 100644 --- a/include/linux/cgroup.h +++ b/include/linux/cgroup.h @@ -20,6 +20,7 @@ #include <linux/workqueue.h> #include <linux/xattr.h> #include <linux/fs.h> +#include <linux/percpu-refcount.h> #ifdef CONFIG_CGROUPS @@ -72,13 +73,8 @@ struct cgroup_subsys_state { */ struct cgroup *cgroup; - /* - * State maintained by the cgroup system to allow subsystems - * to be "busy". Should be accessed via css_get(), - * css_tryget() and css_put(). - */ - - atomic_t refcnt; + /* reference count - access via css_[try]get() and css_put() */ + struct percpu_ref refcnt; unsigned long flags; /* ID for this css, if possible */ @@ -104,11 +100,9 @@ static inline void css_get(struct cgroup_subsys_state *css) { /* We don't need to reference count the root state */ if (!(css->flags & CSS_ROOT)) - atomic_inc(&css->refcnt); + percpu_ref_get(&css->refcnt); } -extern bool __css_tryget(struct cgroup_subsys_state *css); - /** * css_tryget - try to obtain a reference on the specified css * @css: target css @@ -123,11 +117,9 @@ static inline bool css_tryget(struct cgroup_subsys_state *css) { if (css->flags & CSS_ROOT) return true; - return __css_tryget(css); + return percpu_ref_tryget(&css->refcnt); } -extern void __css_put(struct cgroup_subsys_state *css); - /** * css_put - put a css reference * @css: target css @@ -137,7 +129,7 @@ extern void __css_put(struct cgroup_subsys_state *css); static inline void css_put(struct cgroup_subsys_state *css) { if (!(css->flags & CSS_ROOT)) - __css_put(css); + percpu_ref_put(&css->refcnt); } /* bits in struct cgroup flags field */ @@ -231,9 +223,10 @@ struct cgroup { struct list_head pidlists; struct mutex pidlist_mutex; - /* For RCU-protected deletion */ + /* For css percpu_ref killing and RCU-protected deletion */ struct rcu_head rcu_head; struct work_struct destroy_work; + atomic_t css_kill_cnt; /* List of events which userspace want to receive */ struct list_head event_list; diff --git a/kernel/cgroup.c b/kernel/cgroup.c index ebbfc04..2e9da7b 100644 --- a/kernel/cgroup.c +++ b/kernel/cgroup.c @@ -63,9 +63,6 @@ #include <linux/atomic.h> -/* css deactivation bias, makes css->refcnt negative to deny new trygets */ -#define CSS_DEACT_BIAS INT_MIN - /* * cgroup_mutex is the master lock. Any modification to cgroup or its * hierarchy must be performed while holding it. @@ -213,19 +210,6 @@ static int cgroup_destroy_locked(struct cgroup *cgrp); static int cgroup_addrm_files(struct cgroup *cgrp, struct cgroup_subsys *subsys, struct cftype cfts[], bool is_add); -static int css_unbias_refcnt(int refcnt) -{ - return refcnt >= 0 ? refcnt : refcnt - CSS_DEACT_BIAS; -} - -/* the current nr of refs, always >= 0 whether @css is deactivated or not */ -static int css_refcnt(struct cgroup_subsys_state *css) -{ - int v = atomic_read(&css->refcnt); - - return css_unbias_refcnt(v); -} - /* convenient tests for these bits */ static inline bool cgroup_is_dead(const struct cgroup *cgrp) { @@ -4139,12 +4123,19 @@ static void css_dput_fn(struct work_struct *work) deactivate_super(sb); } +static void css_release(struct percpu_ref *ref) +{ + struct cgroup_subsys_state *css = + container_of(ref, struct cgroup_subsys_state, refcnt); + + schedule_work(&css->dput_work); +} + static void init_cgroup_css(struct cgroup_subsys_state *css, struct cgroup_subsys *ss, struct cgroup *cgrp) { css->cgroup = cgrp; - atomic_set(&css->refcnt, 1); css->flags = 0; css->id = NULL; if (cgrp == dummytop) @@ -4266,7 +4257,13 @@ static long cgroup_create(struct cgroup *parent, struct dentry *dentry, err = PTR_ERR(css); goto err_free_all; } + + err = percpu_ref_init(&css->refcnt, css_release); + if (err) + goto err_free_all; + init_cgroup_css(css, ss, cgrp); + if (ss->use_id) { err = alloc_css_id(ss, parent, cgrp); if (err) @@ -4331,8 +4328,12 @@ static long cgroup_create(struct cgroup *parent, struct dentry *dentry, err_free_all: for_each_subsys(root, ss) { - if (cgrp->subsys[ss->subsys_id]) + struct cgroup_subsys_state *css = cgrp->subsys[ss->subsys_id]; + + if (css) { + percpu_ref_cancel_init(&css->refcnt); ss->css_free(cgrp); + } } mutex_unlock(&cgroup_mutex); /* Release the reference count that we took on the superblock */ @@ -4360,6 +4361,48 @@ static int cgroup_mkdir(struct inode *dir, struct dentry *dentry, umode_t mode) return cgroup_create(c_parent, dentry, mode | S_IFDIR); } +static void cgroup_css_killed(struct cgroup *cgrp) +{ + if (!atomic_dec_and_test(&cgrp->css_kill_cnt)) + return; + + /* percpu ref's of all css's are killed, kick off the next step */ + INIT_WORK(&cgrp->destroy_work, cgroup_offline_fn); + schedule_work(&cgrp->destroy_work); +} + +static void css_ref_killed_fn(struct percpu_ref *ref) +{ + struct cgroup_subsys_state *css = + container_of(ref, struct cgroup_subsys_state, refcnt); + + cgroup_css_killed(css->cgroup); +} + +/** + * cgroup_destroy_locked - the first stage of cgroup destruction + * @cgrp: cgroup to be destroyed + * + * css's make use of percpu refcnts whose killing latency shouldn't be + * exposed to userland and are RCU protected. Also, cgroup core needs to + * guarantee that css_tryget() won't succeed by the time ->css_offline() is + * invoked. To satisfy all the requirements, destruction is implemented in + * the following two steps. + * + * s1. Verify @cgrp can be destroyed and mark it dying. Remove all + * userland visible parts and start killing the percpu refcnts of + * css's. Set up so that the next stage will be kicked off once all + * the percpu refcnts are confirmed to be killed. + * + * s2. Invoke ->css_offline(), mark the cgroup dead and proceed with the + * rest of destruction. Once all cgroup references are gone, the + * cgroup is RCU-freed. + * + * This function implements s1. After this step, @cgrp is gone as far as + * the userland is concerned and a new cgroup with the same name may be + * created. As cgroup doesn't care about the names internally, this + * doesn't cause any problem. + */ static int cgroup_destroy_locked(struct cgroup *cgrp) __releases(&cgroup_mutex) __acquires(&cgroup_mutex) { @@ -4382,16 +4425,34 @@ static int cgroup_destroy_locked(struct cgroup *cgrp) return -EBUSY; /* - * Block new css_tryget() by deactivating refcnt and mark @cgrp - * removed. This makes future css_tryget() attempts fail which we - * guarantee to ->css_offline() callbacks. + * Block new css_tryget() by killing css refcnts. cgroup core + * guarantees that, by the time ->css_offline() is invoked, no new + * css reference will be given out via css_tryget(). We can't + * simply call percpu_ref_kill() and proceed to offlining css's + * because percpu_ref_kill() doesn't guarantee that the ref is seen + * as killed on all CPUs on return. + * + * Use percpu_ref_kill_and_confirm() to get notifications as each + * css is confirmed to be seen as killed on all CPUs. The + * notification callback keeps track of the number of css's to be + * killed and schedules cgroup_offline_fn() to perform the rest of + * destruction once the percpu refs of all css's are confirmed to + * be killed. */ + atomic_set(&cgrp->css_kill_cnt, 1); for_each_subsys(cgrp->root, ss) { struct cgroup_subsys_state *css = cgrp->subsys[ss->subsys_id]; - WARN_ON(atomic_read(&css->refcnt) < 0); - atomic_add(CSS_DEACT_BIAS, &css->refcnt); + /* + * Killing would put the base ref, but we need to keep it + * alive until after ->css_offline. + */ + percpu_ref_get(&css->refcnt); + + atomic_inc(&cgrp->css_kill_cnt); + percpu_ref_kill_and_confirm(&css->refcnt, css_ref_killed_fn); } + cgroup_css_killed(cgrp); /* * Mark @cgrp dead. This prevents further task migration and child @@ -4427,12 +4488,19 @@ static int cgroup_destroy_locked(struct cgroup *cgrp) } spin_unlock(&cgrp->event_list_lock); - INIT_WORK(&cgrp->destroy_work, cgroup_offline_fn); - schedule_work(&cgrp->destroy_work); - return 0; }; +/** + * cgroup_offline_fn - the second step of cgroup destruction + * @work: cgroup->destroy_free_work + * + * This function is invoked from a work item for a cgroup which is being + * destroyed after the percpu refcnts of all css's are guaranteed to be + * seen as killed on all CPUs, and performs the rest of destruction. This + * is the second step of destruction described in the comment above + * cgroup_destroy_locked(). + */ static void cgroup_offline_fn(struct work_struct *work) { struct cgroup *cgrp = container_of(work, struct cgroup, destroy_work); @@ -4442,16 +4510,19 @@ static void cgroup_offline_fn(struct work_struct *work) mutex_lock(&cgroup_mutex); - /* tell subsystems to initate destruction */ + /* + * css_tryget() is guaranteed to fail now. Tell subsystems to + * initate destruction. + */ for_each_subsys(cgrp->root, ss) offline_css(ss, cgrp); /* - * Put all the base refs. Each css holds an extra reference to the - * cgroup's dentry and cgroup removal proceeds regardless of css - * refs. On the last put of each css, whenever that may be, the - * extra dentry ref is put so that dentry destruction happens only - * after all css's are released. + * Put the css refs from cgroup_destroy_locked(). Each css holds + * an extra reference to the cgroup's dentry and cgroup removal + * proceeds regardless of css refs. On the last put of each css, + * whenever that may be, the extra dentry ref is put so that dentry + * destruction happens only after all css's are released. */ for_each_subsys(cgrp->root, ss) css_put(cgrp->subsys[ss->subsys_id]); @@ -5100,34 +5171,6 @@ static void check_for_release(struct cgroup *cgrp) } } -/* Caller must verify that the css is not for root cgroup */ -bool __css_tryget(struct cgroup_subsys_state *css) -{ - while (true) { - int t, v; - - v = css_refcnt(css); - t = atomic_cmpxchg(&css->refcnt, v, v + 1); - if (likely(t == v)) - return true; - else if (t < 0) - return false; - cpu_relax(); - } -} -EXPORT_SYMBOL_GPL(__css_tryget); - -/* Caller must verify that the css is not for root cgroup */ -void __css_put(struct cgroup_subsys_state *css) -{ - int v; - - v = css_unbias_refcnt(atomic_dec_return(&css->refcnt)); - if (v == 0) - schedule_work(&css->dput_work); -} -EXPORT_SYMBOL_GPL(__css_put); - /* * Notify userspace when a cgroup is released, by running the * configured release agent with the name of the cgroup (path @@ -5245,7 +5288,7 @@ unsigned short css_id(struct cgroup_subsys_state *css) * on this or this is under rcu_read_lock(). Once css->id is allocated, * it's unchanged until freed. */ - cssid = rcu_dereference_check(css->id, css_refcnt(css)); + cssid = rcu_dereference_raw(css->id); if (cssid) return cssid->id; -- 1.8.2.1 ^ permalink raw reply related [flat|nested] 27+ messages in thread
[parent not found: <1371096298-24402-12-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>]
* Re: [PATCH 11/11] cgroup: use percpu refcnt for cgroup_subsys_states [not found] ` <1371096298-24402-12-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> @ 2013-06-13 23:16 ` Kent Overstreet 2013-06-14 12:55 ` Michal Hocko 2013-06-14 13:20 ` Michal Hocko 2 siblings, 0 replies; 27+ messages in thread From: Kent Overstreet @ 2013-06-13 23:16 UTC (permalink / raw) To: Tejun Heo Cc: Jens Axboe, cl-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, Mike Snitzer, Glauber Costa, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Michal Hocko, Mikulas Patocka, Alasdair G. Kergon, cgroups-u79uwXL29TY76Z2rM5mHXA, Vivek Goyal On Wed, Jun 12, 2013 at 09:04:58PM -0700, Tejun Heo wrote: > A css (cgroup_subsys_state) is how each cgroup is represented to a > controller. As such, it can be used in hot paths across the various > subsystems different controllers are associated with. > > One of the common operations is reference counting, which up until now > has been implemented using a global atomic counter and can have > significant adverse impact on scalability. For example, css refcnt > can be gotten and put multiple times by blkcg for each IO request. > For highops configurations which try to do as much per-cpu as > possible, the global frequent refcnting can be very expensive. > > In general, given the various and hugely diverse paths css's end up > being used from, we need to make it cheap and highly scalable. In its > usage, css refcnting isn't very different from module refcnting. > > This patch converts css refcnting to use the recently added > percpu_ref. css_get/tryget/put() directly maps to the matching > percpu_ref operations and the deactivation logic is no longer > necessary as percpu_ref already has refcnt killing. > > The only complication is that as the refcnt is per-cpu, > percpu_ref_kill() in itself doesn't ensure that further tryget > operations will fail, which we need to guarantee before invoking > ->css_offline()'s. This is resolved collecting kill confirmation > using percpu_ref_kill_and_confirm() and initiating the offline phase > of destruction after all css refcnt's are confirmed to be seen as > killed on all CPUs. The previous patches already splitted destruction > into two phases, so percpu_ref_kill_and_confirm() can be hooked up > easily. > > This patch removes css_refcnt() which is used for rcu dereference > sanity check in css_id(). While we can add a percpu refcnt API to ask > the same question, css_id() itself is scheduled to be removed fairly > soon, so let's not bother with it. Just drop the sanity check and use > rcu_dereference_raw() instead. > > v2: - init_cgroup_css() was calling percpu_ref_init() without checking > the return value. This causes two problems - the obvious lack > of error handling and percpu_ref_init() being called from > cgroup_init_subsys() before the allocators are up, which > triggers warnings but doesn't cause actual problems as the > refcnt isn't used for roots anyway. Fix both by moving > percpu_ref_init() to cgroup_create(). > > - The base references were put too early by > percpu_ref_kill_and_confirm() and cgroup_offline_fn() put the > refs one extra time. This wasn't noticeable because css's go > through another RCU grace period before being freed. Update > cgroup_destroy_locked() to grab an extra reference before > killing the refcnts. This problem was noticed by Kent. Reviewed-by: Kent Overstreet <koverstreet-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH 11/11] cgroup: use percpu refcnt for cgroup_subsys_states [not found] ` <1371096298-24402-12-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> 2013-06-13 23:16 ` Kent Overstreet @ 2013-06-14 12:55 ` Michal Hocko [not found] ` <20130614125539.GC10084-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> 2013-06-14 13:20 ` Michal Hocko 2 siblings, 1 reply; 27+ messages in thread From: Michal Hocko @ 2013-06-14 12:55 UTC (permalink / raw) To: Glauber Costa Cc: Jens Axboe, cl-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, Mike Snitzer, Glauber Costa, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Mikulas Patocka, Alasdair G. Kergon, Tejun Heo, cgroups-u79uwXL29TY76Z2rM5mHXA, koverstreet-hpIqsD4AKlfQT0dZR+AlfA, Vivek Goyal On Wed 12-06-13 21:04:58, Tejun Heo wrote: [...] > +/** > + * cgroup_destroy_locked - the first stage of cgroup destruction > + * @cgrp: cgroup to be destroyed > + * > + * css's make use of percpu refcnts whose killing latency shouldn't be > + * exposed to userland and are RCU protected. Also, cgroup core needs to > + * guarantee that css_tryget() won't succeed by the time ->css_offline() is > + * invoked. To satisfy all the requirements, destruction is implemented in > + * the following two steps. > + * > + * s1. Verify @cgrp can be destroyed and mark it dying. Remove all > + * userland visible parts and start killing the percpu refcnts of > + * css's. Set up so that the next stage will be kicked off once all > + * the percpu refcnts are confirmed to be killed. > + * > + * s2. Invoke ->css_offline(), mark the cgroup dead and proceed with the > + * rest of destruction. Once all cgroup references are gone, the > + * cgroup is RCU-freed. > + * > + * This function implements s1. After this step, @cgrp is gone as far as > + * the userland is concerned and a new cgroup with the same name may be > + * created. As cgroup doesn't care about the names internally, this > + * doesn't cause any problem. Glauber is this asumption correct for kmem caches naming scheme? I guess it should, but I would rather be sure this won't blow up later specially when caches might live longer than css_offline. > + */ > static int cgroup_destroy_locked(struct cgroup *cgrp) > __releases(&cgroup_mutex) __acquires(&cgroup_mutex) > { -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 27+ messages in thread
[parent not found: <20130614125539.GC10084-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>]
* Re: [PATCH 11/11] cgroup: use percpu refcnt for cgroup_subsys_states [not found] ` <20130614125539.GC10084-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> @ 2013-06-14 14:15 ` Glauber Costa [not found] ` <20130614141510.GA19021-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org> 0 siblings, 1 reply; 27+ messages in thread From: Glauber Costa @ 2013-06-14 14:15 UTC (permalink / raw) To: Michal Hocko Cc: lizefan-hv44wF8Li93QT0dZR+AlfA, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, cgroups-u79uwXL29TY76Z2rM5mHXA, koverstreet-hpIqsD4AKlfQT0dZR+AlfA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, cl-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, Mike Snitzer, Vivek Goyal, Alasdair G. Kergon, Jens Axboe, Mikulas Patocka, Tejun Heo On Fri, Jun 14, 2013 at 02:55:39PM +0200, Michal Hocko wrote: > On Wed 12-06-13 21:04:58, Tejun Heo wrote: > [...] > > +/** > > + * cgroup_destroy_locked - the first stage of cgroup destruction > > + * @cgrp: cgroup to be destroyed > > + * > > + * css's make use of percpu refcnts whose killing latency shouldn't be > > + * exposed to userland and are RCU protected. Also, cgroup core needs to > > + * guarantee that css_tryget() won't succeed by the time ->css_offline() is > > + * invoked. To satisfy all the requirements, destruction is implemented in > > + * the following two steps. > > + * > > + * s1. Verify @cgrp can be destroyed and mark it dying. Remove all > > + * userland visible parts and start killing the percpu refcnts of > > + * css's. Set up so that the next stage will be kicked off once all > > + * the percpu refcnts are confirmed to be killed. > > + * > > + * s2. Invoke ->css_offline(), mark the cgroup dead and proceed with the > > + * rest of destruction. Once all cgroup references are gone, the > > + * cgroup is RCU-freed. > > + * > > + * This function implements s1. After this step, @cgrp is gone as far as > > + * the userland is concerned and a new cgroup with the same name may be > > + * created. As cgroup doesn't care about the names internally, this > > + * doesn't cause any problem. > > Glauber is this asumption correct for kmem caches naming scheme? > I guess it should, but I would rather be sure this won't blow up later > specially when caches might live longer than css_offline. > We append names to caches, but never the name alone, always (kmemcg_id:name) So you can reuse the same name as many times as you want (from the kmemcg perspective), provided you use different kmemcg_ids. Because kmemcg_ids are dissociated from css_ids (partly because I was already seeing people talking about wanting to free the css_ids earlier), we should not have any problems with this. ^ permalink raw reply [flat|nested] 27+ messages in thread
[parent not found: <20130614141510.GA19021-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>]
* Re: [PATCH 11/11] cgroup: use percpu refcnt for cgroup_subsys_states [not found] ` <20130614141510.GA19021-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org> @ 2013-06-14 14:22 ` Michal Hocko 0 siblings, 0 replies; 27+ messages in thread From: Michal Hocko @ 2013-06-14 14:22 UTC (permalink / raw) To: Glauber Costa Cc: Jens Axboe, cl-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, Mike Snitzer, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Mikulas Patocka, Alasdair G. Kergon, Tejun Heo, cgroups-u79uwXL29TY76Z2rM5mHXA, koverstreet-hpIqsD4AKlfQT0dZR+AlfA, Vivek Goyal On Fri 14-06-13 18:15:12, Glauber Costa wrote: > On Fri, Jun 14, 2013 at 02:55:39PM +0200, Michal Hocko wrote: > > On Wed 12-06-13 21:04:58, Tejun Heo wrote: > > [...] > > > +/** > > > + * cgroup_destroy_locked - the first stage of cgroup destruction > > > + * @cgrp: cgroup to be destroyed > > > + * > > > + * css's make use of percpu refcnts whose killing latency shouldn't be > > > + * exposed to userland and are RCU protected. Also, cgroup core needs to > > > + * guarantee that css_tryget() won't succeed by the time ->css_offline() is > > > + * invoked. To satisfy all the requirements, destruction is implemented in > > > + * the following two steps. > > > + * > > > + * s1. Verify @cgrp can be destroyed and mark it dying. Remove all > > > + * userland visible parts and start killing the percpu refcnts of > > > + * css's. Set up so that the next stage will be kicked off once all > > > + * the percpu refcnts are confirmed to be killed. > > > + * > > > + * s2. Invoke ->css_offline(), mark the cgroup dead and proceed with the > > > + * rest of destruction. Once all cgroup references are gone, the > > > + * cgroup is RCU-freed. > > > + * > > > + * This function implements s1. After this step, @cgrp is gone as far as > > > + * the userland is concerned and a new cgroup with the same name may be > > > + * created. As cgroup doesn't care about the names internally, this > > > + * doesn't cause any problem. > > > > Glauber is this asumption correct for kmem caches naming scheme? > > I guess it should, but I would rather be sure this won't blow up later > > specially when caches might live longer than css_offline. > > > > We append names to caches, but never the name alone, always (kmemcg_id:name) > So you can reuse the same name as many times as you want (from the kmemcg > perspective), provided you use different kmemcg_ids. Because kmemcg_ids are > dissociated from css_ids (partly because I was already seeing people talking > about wanting to free the css_ids earlier), we should not have any problems > with this. Thanks for double checking, Glauber! -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH 11/11] cgroup: use percpu refcnt for cgroup_subsys_states [not found] ` <1371096298-24402-12-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> 2013-06-13 23:16 ` Kent Overstreet 2013-06-14 12:55 ` Michal Hocko @ 2013-06-14 13:20 ` Michal Hocko [not found] ` <20130614132026.GD10084-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> 2 siblings, 1 reply; 27+ messages in thread From: Michal Hocko @ 2013-06-14 13:20 UTC (permalink / raw) To: Tejun Heo Cc: lizefan-hv44wF8Li93QT0dZR+AlfA, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, cgroups-u79uwXL29TY76Z2rM5mHXA, koverstreet-hpIqsD4AKlfQT0dZR+AlfA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, cl-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, Mike Snitzer, Vivek Goyal, Alasdair G. Kergon, Jens Axboe, Mikulas Patocka, Glauber Costa On Wed 12-06-13 21:04:58, Tejun Heo wrote: > A css (cgroup_subsys_state) is how each cgroup is represented to a > controller. As such, it can be used in hot paths across the various > subsystems different controllers are associated with. > > One of the common operations is reference counting, which up until now > has been implemented using a global atomic counter and can have > significant adverse impact on scalability. For example, css refcnt > can be gotten and put multiple times by blkcg for each IO request. > For highops configurations which try to do as much per-cpu as > possible, the global frequent refcnting can be very expensive. > > In general, given the various and hugely diverse paths css's end up > being used from, we need to make it cheap and highly scalable. In its > usage, css refcnting isn't very different from module refcnting. > > This patch converts css refcnting to use the recently added > percpu_ref. I have no objections to change css reference counting scheme if the guarantees we used to have are still valid. I am just missing some comparisons. Do you have any numbers that would show benefits clearly? You are mentioning that especially controllers that are strongly per-cpu oriented will see the biggest improvements. What about others? A single atomic_add resp. atomic_dec_return is much less heavy than the new ref counting. Is it possible that those could regress? Or the differences would be only within the noise? I do realize that it is not possible to test all controllers but I would be interested to see at least that those for which it really matters get a nice boost. Memcg uses css with caution although there are places which are in really hot paths (e.g. charging) so any improvement would be really welcome. Sorry, if this information has been posted along with the series. I was CCed only on this one and didn't get to look at the rest yet (apart from "percpu: implement generic percpu refcounting" in your review-css-percpu-ref branch). [...] -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 27+ messages in thread
[parent not found: <20130614132026.GD10084-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>]
* Re: [PATCH 11/11] cgroup: use percpu refcnt for cgroup_subsys_states [not found] ` <20130614132026.GD10084-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> @ 2013-06-14 22:31 ` Tejun Heo [not found] ` <20130614223125.GD6593-9pTldWuhBndy/B6EtB590w@public.gmane.org> 0 siblings, 1 reply; 27+ messages in thread From: Tejun Heo @ 2013-06-14 22:31 UTC (permalink / raw) To: Michal Hocko Cc: lizefan-hv44wF8Li93QT0dZR+AlfA, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, cgroups-u79uwXL29TY76Z2rM5mHXA, koverstreet-hpIqsD4AKlfQT0dZR+AlfA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, cl-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, Mike Snitzer, Vivek Goyal, Alasdair G. Kergon, Jens Axboe, Mikulas Patocka, Glauber Costa [-- Attachment #1: Type: text/plain, Size: 3244 bytes --] Hello, Michal. On Fri, Jun 14, 2013 at 03:20:26PM +0200, Michal Hocko wrote: > I have no objections to change css reference counting scheme if the > guarantees we used to have are still valid. I am just missing some > comparisons. Do you have any numbers that would show benefits clearly? Mikulas' high scalability dm test case on top of ramdisk was affected severely when css refcnting was added to track the original issuer's cgroup context. That probably is one of the more severe cases. > You are mentioning that especially controllers that are strongly per-cpu > oriented will see the biggest improvements. What about others? > A single atomic_add resp. atomic_dec_return is much less heavy than the Even with preemption enabled, the percpu ref get/put will be under ten instructions which touch two memory areas - the preemption counter which is usually very hot anyway and the percpu refcnt itself. It shouldn't be much slower than the atomic ops. If the kernel has preemption disabled, percpu_ref is actually likely to be cheaper even on single CPU. So, here are some numbers from the attached test program. The test is very simple - inc ref, copy N bytes into per-cpu buf, dec ref - and see how many times it can do that in given amount of time - 15s. Both single CPU and all CPUs scenarios are tested. The test is run inside qemu on my laptop - mobile i7 2 core / 4 threads. Yeah, I know. I'll run it on a proper test machine later today. Single CPU case. Preemption enabled. This is the best scenario for atomic_t. No cacheline bouncing at all. copy size atomic_t percpu_ref diff 0 1198217077 1747505555 +45.84% 32 505504457 465714742 -7.87% 64 511521639 470741887 -7.97% 128 485319952 434920137 -10.38% 256 421809359 384871731 -8.76% 512 330527059 307587622 -6.94% For some reason, percpu_ref wins if copy_size is zero. I don't know why that is. The body isn't optimized out so it's still doing all the refcnting. Maybe the CPU doesn't have enough work to mask pipeline bubbles from atomic ops? In other cases, it's slower by around or under 10% which isn't exactly noise but this is the worst possible scenario. Unless this is the only thing a pinned CPU is doing, it's unlikely to be noticeable. Now doing the same thing on multiple CPUs. Note that while this is the best scenario for percpu_ref, the hardware the test is run on is very favorable to atomic_t - it's just two cores on the same package sharing the L3 cache, so cacheline ping-poinging is relatively cheap. copy size atomic_t percpu_ref diff 0 342659959 3794775739 +1007.45% 32 401933493 1337286466 +232.71% 64 385459752 1353978982 +251.26% 128 401379622 1234780968 +207.63% 256 401170676 1052511682 +162.36% 512 387796881 794101245 +104.77% Even on this machine, the difference is huge. If the refcnt is used from different CPUs in any frequency, percpu_ref will destroy atomic_t. Also note that percpu_ref will scale perfectly as the number of CPUs increases while atomic_t will get worse. I'll play with it a bit more on an actual machine and post more results. Test program attached. Thanks. -- tejun [-- Attachment #2: test-pcpuref.c --] [-- Type: text/plain, Size: 2951 bytes --] #include <linux/module.h> #include <linux/moduleparam.h> #include <asm/atomic.h> #include <linux/percpu-refcount.h> #include <linux/workqueue.h> #include <linux/completion.h> static struct workqueue_struct *my_wq; static int test_all_cpus = 0; module_param_named(all_cpus, test_all_cpus, int, 0444); static int test_duration = 15; module_param_named(duration, test_duration, int, 0444); static int test_copy_bytes = 32; module_param_named(copy_bytes, test_copy_bytes, int, 0444); static char test_src_buf[1024]; static DEFINE_PER_CPU(char [1024], test_dst_buf); static DEFINE_PER_CPU(struct work_struct, test_work); static atomic_t test_atomic_ref = ATOMIC_INIT(1); static void test_atomic_workfn(struct work_struct *work) { unsigned long end = jiffies + test_duration * HZ; unsigned long cnt = 0; do { int bytes = test_copy_bytes; atomic_inc(&test_atomic_ref); while (bytes) { int todo = min_t(int, bytes, 1024); memcpy(*this_cpu_ptr(&test_dst_buf), test_src_buf, todo); bytes -= todo; } cnt++; atomic_dec(&test_atomic_ref); } while (time_before(jiffies, end)); printk("XXX atomic on CPU %d completed %lu loops\n", smp_processor_id(), cnt); } static struct percpu_ref test_pcpu_ref; static DECLARE_COMPLETION(test_cpu_ref_done); static void test_pcpu_release(struct percpu_ref *ref) { complete(&test_cpu_ref_done); } static void test_pcpu_workfn(struct work_struct *work) { unsigned long end = jiffies + test_duration * HZ; unsigned long cnt = 0; do { int bytes = test_copy_bytes; percpu_ref_get(&test_pcpu_ref); while (bytes) { int todo = min_t(int, bytes, 1024); memcpy(*this_cpu_ptr(&test_dst_buf), test_src_buf, todo); bytes -= todo; } cnt++; percpu_ref_put(&test_pcpu_ref); } while (time_before(jiffies, end)); printk("XXX percpu on CPU %d completed %lu loops\n", smp_processor_id(), cnt); } static int test_pcpuref_init(void) { int cpu; if (percpu_ref_init(&test_pcpu_ref, test_pcpu_release)) return -ENOMEM; my_wq = alloc_workqueue("test-pcpuref", WQ_CPU_INTENSIVE, 0); if (!my_wq) { percpu_ref_cancel_init(&test_pcpu_ref); return -ENOMEM; } printk("XXX testing atomic_t for %d secs\n", test_duration); for_each_online_cpu(cpu) { INIT_WORK(per_cpu_ptr(&test_work, cpu), test_atomic_workfn); queue_work_on(cpu, my_wq, per_cpu_ptr(&test_work, cpu)); if (!test_all_cpus) break; } flush_workqueue(my_wq); printk("XXX testing percpu_ref for %d secs\n", test_duration); for_each_online_cpu(cpu) { INIT_WORK(per_cpu_ptr(&test_work, cpu), test_pcpu_workfn); queue_work_on(cpu, my_wq, per_cpu_ptr(&test_work, cpu)); if (!test_all_cpus) break; } flush_workqueue(my_wq); percpu_ref_kill(&test_pcpu_ref); wait_for_completion(&test_cpu_ref_done); return 0; } module_init(test_pcpuref_init); static void test_pcpuref_exit(void) { destroy_workqueue(my_wq); } module_exit(test_pcpuref_exit); MODULE_LICENSE("GPL v2"); ^ permalink raw reply [flat|nested] 27+ messages in thread
[parent not found: <20130614223125.GD6593-9pTldWuhBndy/B6EtB590w@public.gmane.org>]
* Re: [PATCH 11/11] cgroup: use percpu refcnt for cgroup_subsys_states [not found] ` <20130614223125.GD6593-9pTldWuhBndy/B6EtB590w@public.gmane.org> @ 2013-06-15 5:35 ` Tejun Heo [not found] ` <20130615053522.GA7017-Gd/HAXX7CRxy/B6EtB590w@public.gmane.org> 0 siblings, 1 reply; 27+ messages in thread From: Tejun Heo @ 2013-06-15 5:35 UTC (permalink / raw) To: Michal Hocko Cc: lizefan-hv44wF8Li93QT0dZR+AlfA, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, cgroups-u79uwXL29TY76Z2rM5mHXA, koverstreet-hpIqsD4AKlfQT0dZR+AlfA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, cl-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, Mike Snitzer, Vivek Goyal, Alasdair G. Kergon, Jens Axboe, Mikulas Patocka, Glauber Costa On Fri, Jun 14, 2013 at 03:31:25PM -0700, Tejun Heo wrote: > I'll play with it a bit more on an actual machine and post more > results. Test program attached. So, here are the results from the same test on a dual-socket 2-way NUMA opteron 8 core machine. Running on one CPU. copy size atomic percpu diff in pct 0 535964443 616756827 +15.07% 32 399988186 378678713 -5.33% 64 389067476 355073979 -8.74% 128 342192631 315615300 -7.77% 256 281208005 260598931 -7.33% 512 188070912 193225269 +2.74% Running on all eight cores. copy size atomic percpu diff in pct 0 121324328 4889425511 +3,930.05% 32 96170193 2999613380 +3,019.07% 64 98139061 2813894184 +2,767.25% 128 112610025 2503229487 +2,122.92% 256 96828114 2069865752 +2,037.67% 512 95858297 1537726109 +1,504.17% Ration of all cores / single core. copy size atomic percpu 0 0.23 7.93 32 0.24 7.92 64 0.25 7.92 128 0.33 7.93 256 0.34 7.94 512 0.51 7.96 Note exactly 8 - the cores do share quite a bit of resources after all - but pretty close. Thanks. -- tejun ^ permalink raw reply [flat|nested] 27+ messages in thread
[parent not found: <20130615053522.GA7017-Gd/HAXX7CRxy/B6EtB590w@public.gmane.org>]
* Re: [PATCH 11/11] cgroup: use percpu refcnt for cgroup_subsys_states [not found] ` <20130615053522.GA7017-Gd/HAXX7CRxy/B6EtB590w@public.gmane.org> @ 2013-06-15 5:39 ` Tejun Heo 2013-06-15 6:31 ` Tejun Heo 2013-06-17 13:27 ` Michal Hocko 2 siblings, 0 replies; 27+ messages in thread From: Tejun Heo @ 2013-06-15 5:39 UTC (permalink / raw) To: Michal Hocko Cc: Jens Axboe, cl-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, Mike Snitzer, Glauber Costa, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Mikulas Patocka, Alasdair G. Kergon, cgroups-u79uwXL29TY76Z2rM5mHXA, koverstreet-hpIqsD4AKlfQT0dZR+AlfA, Vivek Goyal On Fri, Jun 14, 2013 at 10:35:22PM -0700, Tejun Heo wrote: > On Fri, Jun 14, 2013 at 03:31:25PM -0700, Tejun Heo wrote: > > I'll play with it a bit more on an actual machine and post more > > results. Test program attached. > > So, here are the results from the same test on a dual-socket 2-way > NUMA opteron 8 core machine. > > Running on one CPU. > > copy size atomic percpu diff in pct > 0 535964443 616756827 +15.07% > 32 399988186 378678713 -5.33% > 64 389067476 355073979 -8.74% > 128 342192631 315615300 -7.77% > 256 281208005 260598931 -7.33% > 512 188070912 193225269 +2.74% > > Running on all eight cores. > > copy size atomic percpu diff in pct > 0 121324328 4889425511 +3,930.05% > 32 96170193 2999613380 +3,019.07% > 64 98139061 2813894184 +2,767.25% > 128 112610025 2503229487 +2,122.92% > 256 96828114 2069865752 +2,037.67% > 512 95858297 1537726109 +1,504.17% A bit of addition, this of course is completely synthetic and exaggerates the differences both ways, but it's pretty clear that this is gonna be a clear gain in any kind of workload which would generate some amount of cross-CPU refcnting, which would be the norm anyway. Thanks. -- tejun ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH 11/11] cgroup: use percpu refcnt for cgroup_subsys_states [not found] ` <20130615053522.GA7017-Gd/HAXX7CRxy/B6EtB590w@public.gmane.org> 2013-06-15 5:39 ` Tejun Heo @ 2013-06-15 6:31 ` Tejun Heo 2013-06-17 13:27 ` Michal Hocko 2 siblings, 0 replies; 27+ messages in thread From: Tejun Heo @ 2013-06-15 6:31 UTC (permalink / raw) To: Michal Hocko Cc: Jens Axboe, cl-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, Mike Snitzer, Glauber Costa, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Mikulas Patocka, Alasdair G. Kergon, cgroups-u79uwXL29TY76Z2rM5mHXA, koverstreet-hpIqsD4AKlfQT0dZR+AlfA, Vivek Goyal On Fri, Jun 14, 2013 at 10:35:22PM -0700, Tejun Heo wrote: > On Fri, Jun 14, 2013 at 03:31:25PM -0700, Tejun Heo wrote: > > I'll play with it a bit more on an actual machine and post more > > results. Test program attached. > > So, here are the results from the same test on a dual-socket 2-way > NUMA opteron 8 core machine. > > Running on one CPU. > > copy size atomic percpu diff in pct > 0 535964443 616756827 +15.07% > 32 399988186 378678713 -5.33% > 64 389067476 355073979 -8.74% > 128 342192631 315615300 -7.77% > 256 281208005 260598931 -7.33% > 512 188070912 193225269 +2.74% > > Running on all eight cores. > > copy size atomic percpu diff in pct > 0 121324328 4889425511 +3,930.05% > 32 96170193 2999613380 +3,019.07% > 64 98139061 2813894184 +2,767.25% > 128 112610025 2503229487 +2,122.92% > 256 96828114 2069865752 +2,037.67% > 512 95858297 1537726109 +1,504.17% > > Ration of all cores / single core. > > copy size atomic percpu > 0 0.23 7.93 > 32 0.24 7.92 > 64 0.25 7.92 > 128 0.33 7.93 > 256 0.34 7.94 > 512 0.51 7.96 I was testing with CONFIG_PREEMPT, which makes rcu_read_[un]lock()s quite a bit more expensive. The following is the same test results with CONFIG_PREEMPT_VOLUNTARY which would the most preemptive server distros would get anyway. One CPU. copy size atomic percpu diff in pct 0 534583387 1521561724 +184.63% 32 399098138 615962137 + 54.34% 64 388128431 555599274 + 43.15% 128 341336474 464502792 + 36.08% 256 280471681 354186740 + 26.28% 512 203784802 240067596 + 17.80% All eight CPUs. copy size atomic percpu diff in pct 0 117213982 12488998111 +10,554.87% 32 103545751 4940695158 + 4,671.51% 64 98135094 4456370409 + 4,441.06% 128 117729659 3725434154 + 3,064.40% 256 95916768 2840992396 + 2,861.94% 512 95795993 1926044518 + 1,910.57% Ration of all cores / single core. copy size atomic percpu 0 0.22 8.21 32 0.26 8.02 64 0.25 8.02 128 0.34 8.02 256 0.34 8.02 512 0.47 8.02 So, it's faster even with only one CPU. Thanks. -- tejun ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH 11/11] cgroup: use percpu refcnt for cgroup_subsys_states [not found] ` <20130615053522.GA7017-Gd/HAXX7CRxy/B6EtB590w@public.gmane.org> 2013-06-15 5:39 ` Tejun Heo 2013-06-15 6:31 ` Tejun Heo @ 2013-06-17 13:27 ` Michal Hocko [not found] ` <20130617132744.GB5018-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> 2 siblings, 1 reply; 27+ messages in thread From: Michal Hocko @ 2013-06-17 13:27 UTC (permalink / raw) To: Tejun Heo Cc: lizefan-hv44wF8Li93QT0dZR+AlfA, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, cgroups-u79uwXL29TY76Z2rM5mHXA, koverstreet-hpIqsD4AKlfQT0dZR+AlfA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, cl-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, Mike Snitzer, Vivek Goyal, Alasdair G. Kergon, Jens Axboe, Mikulas Patocka, Glauber Costa On Fri 14-06-13 22:35:22, Tejun Heo wrote: > On Fri, Jun 14, 2013 at 03:31:25PM -0700, Tejun Heo wrote: > > I'll play with it a bit more on an actual machine and post more > > results. Test program attached. > > So, here are the results from the same test on a dual-socket 2-way > NUMA opteron 8 core machine. > > Running on one CPU. > > copy size atomic percpu diff in pct > 0 535964443 616756827 +15.07% > 32 399988186 378678713 -5.33% > 64 389067476 355073979 -8.74% > 128 342192631 315615300 -7.77% > 256 281208005 260598931 -7.33% > 512 188070912 193225269 +2.74% > > Running on all eight cores. > > copy size atomic percpu diff in pct > 0 121324328 4889425511 +3,930.05% > 32 96170193 2999613380 +3,019.07% > 64 98139061 2813894184 +2,767.25% > 128 112610025 2503229487 +2,122.92% > 256 96828114 2069865752 +2,037.67% > 512 95858297 1537726109 +1,504.17% > > Ration of all cores / single core. > > copy size atomic percpu > 0 0.23 7.93 > 32 0.24 7.92 > 64 0.25 7.92 > 128 0.33 7.93 > 256 0.34 7.94 > 512 0.51 7.96 > > Note exactly 8 - the cores do share quite a bit of resources after all > - but pretty close. > Thanks for your results! Yes the boost can be really high. I will try to measure some memcg workloads as soon as I have some spare cycles. I do not expect a big win but I also do not think this would regress. Thanks. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 27+ messages in thread
[parent not found: <20130617132744.GB5018-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>]
* Re: [PATCH 11/11] cgroup: use percpu refcnt for cgroup_subsys_states [not found] ` <20130617132744.GB5018-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> @ 2013-06-17 17:16 ` Tejun Heo 0 siblings, 0 replies; 27+ messages in thread From: Tejun Heo @ 2013-06-17 17:16 UTC (permalink / raw) To: Michal Hocko Cc: lizefan-hv44wF8Li93QT0dZR+AlfA, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, cgroups-u79uwXL29TY76Z2rM5mHXA, koverstreet-hpIqsD4AKlfQT0dZR+AlfA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, cl-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, Mike Snitzer, Vivek Goyal, Alasdair G. Kergon, Jens Axboe, Mikulas Patocka, Glauber Costa Hello, Michal. On Mon, Jun 17, 2013 at 03:27:44PM +0200, Michal Hocko wrote: > Thanks for your results! Yes the boost can be really high. I will try to > measure some memcg workloads as soon as I have some spare cycles. I do > not expect a big win but I also do not think this would regress. With the just committed RCU-sched conversion, the slowdown even on single CPU configuration w/ CONFIG_PREEMPT should be in the range of a few percent points in the worst cases and will mostly be in the noise in any realistic scenarios. So, yeah, looks pretty good to me. :) Thanks. -- tejun ^ permalink raw reply [flat|nested] 27+ messages in thread
* [PATCHSET cgroup/for-3.11] cgroup: convert cgroup_subsys_state refcnt to percpu_ref
@ 2013-06-12 21:03 Tejun Heo
  2013-06-12 21:03 ` [PATCH 08/11] cgroup: remove cgroup->count and use Tejun Heo
  0 siblings, 1 reply; 27+ messages in thread
From: Tejun Heo @ 2013-06-12 21:03 UTC (permalink / raw)
  To: lizefan-hv44wF8Li93QT0dZR+AlfA
  Cc: cgroups-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	cl-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	koverstreet-hpIqsD4AKlfQT0dZR+AlfA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA
Hello,
This patchset does a lot of cleanup and then updates css
(cgroup_subsys_state) refcnt to use the new percpu reference counter.
A css (cgroup_subsys_state) is how each cgroup is represented to a
controller.  As such, it can be used in hot paths across the various
subsystems different controllers are associated with.
One of the common operations is reference counting, which up until now
has been implemented using a global atomic counter and can have
significant adverse impact on scalability.  For example, css refcnt
can be gotten and put multiple times by blkcg for each IO request.
For highops configurations which try to do as much per-cpu as
possible, the global frequent refcnting can be very expensive.
In general, given the various hugely diverse paths css's end up being
used from, we need to make it cheap and highly scalable.  In its
usage, css refcnting isn't very different from module refcnting.
This patchset contains the following 11 patches.
 0001-cgroup-remove-now-unused-css_depth.patch
 0002-cgroup-consistently-use-cset-for-struct-css_set-vari.patch
 0003-cgroup-bring-some-sanity-to-naming-around-cg_cgroup_.patch
 0004-cgroup-use-kzalloc-and-list_del_init.patch
 0005-cgroup-clean-up-css_-try-get-and-css_put.patch
 0006-cgroup-rename-CGRP_REMOVED-to-CGRP_DEAD.patch
 0007-cgroup-drop-unnecessary-RCU-dancing-from-__put_css_s.patch
 0008-cgroup-remove-cgroup-count-and-use.patch
 0009-cgroup-reorder-the-operations-in-cgroup_destroy_lock.patch
 0010-cgroup-split-cgroup-destruction-into-two-steps.patch
 0011-cgroup-use-percpu-refcnt-for-cgroup_subsys_states.patch
0001-0007 are cleanups, many of them cosmetic.  They clean up the code
paths which are related with the refcnting changes.  If you're only
interested in the precpu usage, you can probably skip these.
0008-0010 updates the cgroup destruction path so that percpu refcnting
can be used.
0011 updates css refcnting to use percpu_ref.
This patchset is on top of
  cgroup/for-3.11 d5c56ced77 ("cgroup: clean up the cftype array for the base cgroup files")
+ [1] percpu/review-percpu-ref-tryget
and available in the following git branch.
 git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git review-css-percpu-ref
diffstat follows.  Thanks.
 include/linux/cgroup.h |   87 ++----
 kernel/cgroup.c        |  710 +++++++++++++++++++++++++------------------------
 2 files changed, 406 insertions(+), 391 deletions(-)
--
tejun
[1] http://thread.gmane.org/gmane.linux.kernel/1507258
^ permalink raw reply	[flat|nested] 27+ messages in thread* [PATCH 08/11] cgroup: remove cgroup->count and use 2013-06-12 21:03 [PATCHSET cgroup/for-3.11] cgroup: convert cgroup_subsys_state refcnt to percpu_ref Tejun Heo @ 2013-06-12 21:03 ` Tejun Heo 0 siblings, 0 replies; 27+ messages in thread From: Tejun Heo @ 2013-06-12 21:03 UTC (permalink / raw) To: lizefan; +Cc: containers, cgroups, koverstreet, linux-kernel, cl, Tejun Heo cgroup->count tracks the number of css_sets associated with the cgroup and used only to verify that no css_set is associated when the cgroup is being destroyed. It's superflous as the destruction path can simply check whether cgroup->cset_links is empty instead. Drop cgroup->count and check ->cset_links directly from cgroup_destroy_locked(). Signed-off-by: Tejun Heo <tj@kernel.org> --- include/linux/cgroup.h | 6 ------ kernel/cgroup.c | 21 +++++---------------- 2 files changed, 5 insertions(+), 22 deletions(-) diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h index 75a6733..5428738 100644 --- a/include/linux/cgroup.h +++ b/include/linux/cgroup.h @@ -169,12 +169,6 @@ struct cgroup_name { struct cgroup { unsigned long flags; /* "unsigned long" so bitops work */ - /* - * count users of this cgroup. >0 means busy, but doesn't - * necessarily indicate the number of tasks in the cgroup - */ - atomic_t count; - int id; /* ida allocated in-hierarchy ID */ /* diff --git a/kernel/cgroup.c b/kernel/cgroup.c index efe500b..aefda90 100644 --- a/kernel/cgroup.c +++ b/kernel/cgroup.c @@ -408,8 +408,7 @@ static void __put_css_set(struct css_set *cset, int taskexit) list_del_init(&link->cgrp_link); /* @cgrp can't go away while we're holding css_set_lock */ - if (atomic_dec_and_test(&cgrp->count) && - notify_on_release(cgrp)) { + if (list_empty(&cgrp->cset_links) && notify_on_release(cgrp)) { if (taskexit) set_bit(CGRP_RELEASABLE, &cgrp->flags); check_for_release(cgrp); @@ -616,7 +615,6 @@ static void link_css_set(struct list_head *tmp_links, struct css_set *cset, link = list_first_entry(tmp_links, struct cgrp_cset_link, cset_link); link->cset = cset; link->cgrp = cgrp; - atomic_inc(&cgrp->count); list_move(&link->cset_link, &cgrp->cset_links); /* * Always add links to the tail of the list so that the list @@ -4373,11 +4371,11 @@ static int cgroup_destroy_locked(struct cgroup *cgrp) lockdep_assert_held(&cgroup_mutex); /* - * css_set_lock prevents @cgrp from being removed while - * __put_css_set() is in progress. + * css_set_lock synchronizes access to ->cset_links and prevents + * @cgrp from being removed while __put_css_set() is in progress. */ read_lock(&css_set_lock); - empty = !atomic_read(&cgrp->count) && list_empty(&cgrp->children); + empty = list_empty(&cgrp->cset_links) && list_empty(&cgrp->children); read_unlock(&css_set_lock); if (!empty) return -EBUSY; @@ -5057,7 +5055,7 @@ void cgroup_exit(struct task_struct *tsk, int run_callbacks) static void check_for_release(struct cgroup *cgrp) { if (cgroup_is_releasable(cgrp) && - !atomic_read(&cgrp->count) && list_empty(&cgrp->children)) { + list_empty(&cgrp->cset_links) && list_empty(&cgrp->children)) { /* * Control Group is currently removeable. If it's not * already queued for a userspace notification, queue @@ -5425,11 +5423,6 @@ static void debug_css_free(struct cgroup *cont) kfree(cont->subsys[debug_subsys_id]); } -static u64 cgroup_refcount_read(struct cgroup *cont, struct cftype *cft) -{ - return atomic_read(&cont->count); -} - static u64 debug_taskcount_read(struct cgroup *cont, struct cftype *cft) { return cgroup_task_count(cont); @@ -5511,10 +5504,6 @@ static u64 releasable_read(struct cgroup *cgrp, struct cftype *cft) static struct cftype debug_files[] = { { - .name = "cgroup_refcount", - .read_u64 = cgroup_refcount_read, - }, - { .name = "taskcount", .read_u64 = debug_taskcount_read, }, -- 1.8.2.1 ^ permalink raw reply related [flat|nested] 27+ messages in thread
end of thread, other threads:[~2013-06-17 17:16 UTC | newest]
Thread overview: 27+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2013-06-13  4:04 [PATCHSET v2 cgroup/for-3.11] cgroup: convert cgroup_subsys_state refcnt to percpu_ref Tejun Heo
2013-06-13  4:04 ` [PATCH 05/11] cgroup: clean up css_[try]get() and css_put() Tejun Heo
     [not found] ` <1371096298-24402-1-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
2013-06-13  4:04   ` [PATCH 01/11] cgroup: remove now unused css_depth() Tejun Heo
2013-06-13  4:04   ` [PATCH 02/11] cgroup: consistently use @cset for struct css_set variables Tejun Heo
2013-06-13  4:04   ` [PATCH 03/11] cgroup: bring some sanity to naming around cg_cgroup_link Tejun Heo
2013-06-13  4:04   ` [PATCH 04/11] cgroup: use kzalloc() instead of kmalloc() Tejun Heo
2013-06-13  4:04   ` [PATCH 06/11] cgroup: rename CGRP_REMOVED to CGRP_DEAD Tejun Heo
2013-06-13  4:04   ` [PATCH 07/11] cgroup: drop unnecessary RCU dancing from __put_css_set() Tejun Heo
2013-06-13  4:04   ` [PATCH 09/11] cgroup: reorder the operations in cgroup_destroy_locked() Tejun Heo
2013-06-13  6:04   ` [PATCHSET v2 cgroup/for-3.11] cgroup: convert cgroup_subsys_state refcnt to percpu_ref Li Zefan
     [not found]     ` <51B960FF.7070604-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
2013-06-13 17:56       ` Tejun Heo
     [not found]         ` <20130613175627.GC10799-9pTldWuhBndy/B6EtB590w@public.gmane.org>
2013-06-14  2:41           ` Tejun Heo
2013-06-13  4:04 ` [PATCH 08/11] cgroup: remove cgroup->count and use Tejun Heo
2013-06-13  4:04 ` [PATCH 10/11] cgroup: split cgroup destruction into two steps Tejun Heo
2013-06-13  4:04 ` [PATCH 11/11] cgroup: use percpu refcnt for cgroup_subsys_states Tejun Heo
     [not found]   ` <1371096298-24402-12-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
2013-06-13 23:16     ` Kent Overstreet
2013-06-14 12:55     ` Michal Hocko
     [not found]       ` <20130614125539.GC10084-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
2013-06-14 14:15         ` Glauber Costa
     [not found]           ` <20130614141510.GA19021-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
2013-06-14 14:22             ` Michal Hocko
2013-06-14 13:20     ` Michal Hocko
     [not found]       ` <20130614132026.GD10084-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
2013-06-14 22:31         ` Tejun Heo
     [not found]           ` <20130614223125.GD6593-9pTldWuhBndy/B6EtB590w@public.gmane.org>
2013-06-15  5:35             ` Tejun Heo
     [not found]               ` <20130615053522.GA7017-Gd/HAXX7CRxy/B6EtB590w@public.gmane.org>
2013-06-15  5:39                 ` Tejun Heo
2013-06-15  6:31                 ` Tejun Heo
2013-06-17 13:27                 ` Michal Hocko
     [not found]                   ` <20130617132744.GB5018-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
2013-06-17 17:16                     ` Tejun Heo
  -- strict thread matches above, loose matches on Subject: below --
2013-06-12 21:03 [PATCHSET cgroup/for-3.11] cgroup: convert cgroup_subsys_state refcnt to percpu_ref Tejun Heo
2013-06-12 21:03 ` [PATCH 08/11] cgroup: remove cgroup->count and use Tejun Heo
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).