* [PATCHSET cgroup/for-3.16] cgroup: iterate cgroup_subsys_states directly
@ 2014-05-09 21:31 Tejun Heo
2014-05-09 21:31 ` [PATCH 01/14] cgroup: remove css_parent() Tejun Heo
` (10 more replies)
0 siblings, 11 replies; 48+ messages in thread
From: Tejun Heo @ 2014-05-09 21:31 UTC (permalink / raw)
To: lizefan-hv44wF8Li93QT0dZR+AlfA
Cc: cgroups-u79uwXL29TY76Z2rM5mHXA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA,
hannes-druUgvl0LCNAfugRpC6u6w
Hello,
Currently, while csses (cgroup_subsys_states) have ->parent linkage
too, only cgroups form full tree through their ->children and
->sibling fields and css iterations naturally is implemented by
iterating cgroups and then dereferencing the css for the specified
subsystem.
There are now use cases where controllers need to iterate through
csses regardless of their online state as long as they have positive
reference. This can't easily be achieved by iterating cgroups because
its css pointer array needs to be cleared on offline and there may be
multiple dying csses for a cgroup for the same subsystem and there's
only one pointer per cgroup-subsystem pair.
This patchset moves ->children and ->sibling from cgroup to css and
link all csses in proper trees and then make css iterators walk csses
directly instead of going through cgroups. This achieves iteration of
all non-released csses while also simplifying the iteration
implementation. This is also in line with the general direction of
using csses as the primary structural component.
This patchset contains the following fourteen patches.
0001-cgroup-remove-css_parent.patch
0002-cgroup-remove-pointless-has-tasks-children-test-from.patch
0003-memcg-update-memcg_has_children-to-use-css_next_chil.patch
0004-device_cgroup-remove-direct-access-to-cgroup-childre.patch
0005-cgroup-remove-cgroup-parent.patch
0006-cgroup-move-cgroup-sibling-and-children-into-cgroup_.patch
0007-cgroup-link-all-cgroup_subsys_states-in-their-siblin.patch
0008-cgroup-move-cgroup-serial_nr-into-cgroup_subsys_stat.patch
0009-cgroup-introduce-CSS_RELEASED-and-reduce-css-iterati.patch
0010-cgroup-iterate-cgroup_subsys_states-directly.patch
0011-cgroup-use-CSS_ONLINE-instead-of-CGRP_DEAD.patch
0012-cgroup-convert-cgroup_has_live_children-into-css_has.patch
0013-device_cgroup-use-css_has_online_children-instead-of.patch
0014-cgroup-implement-css_tryget.patch
0001-0004 are prep patches.
0005-0008 move fields from cgroup to css and link csses in tree
structure instead of cgroups.
0009-0010 implement direct css iteration.
0011-0013 convert a cgroup based interface to a css one, which is now
possible as both are the same in terms of the tree structure, and fix
devcg brekage using it.
0014 implements css_tryget() which is to be used to gain access to
offline but not-yet-released csses.
This pachset is on top of
b9a63d0116e8 ("Merge branch 'for-3.16' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu into for-3.16")
+ [1] [PATCHSET v2 cgroup/for-3.16] cgroup: post unified hierarchy fixes and updates
+ [2] (REFRESHED) [PATCHSET cgroup/for-3.16] cgroup: implement cftype->write()
+ [3] (REFRESHED) [PATCHSET cgroup/for-3.16] cgroup: remove cgroup_tree_mutex
+ [4] [PATCHSET cgroup/for-3.16] cgroup: use css->refcnt for cgroup reference counting
and available in the following git branch.
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git review-direct-css-iteration
diffstat follows. Thanks.
block/blk-cgroup.h | 2
include/linux/cgroup.h | 122 +++++++++++---------
kernel/cgroup.c | 257 ++++++++++++++++++++++++-------------------
kernel/cgroup_freezer.c | 2
kernel/cpuset.c | 2
kernel/sched/core.c | 2
kernel/sched/cpuacct.c | 2
mm/hugetlb_cgroup.c | 2
mm/memcontrol.c | 45 +++----
net/core/netclassid_cgroup.c | 2
net/core/netprio_cgroup.c | 2
security/device_cgroup.c | 17 --
12 files changed, 251 insertions(+), 206 deletions(-)
--
tejun
[1] http://lkml.kernel.org/g/1399663975-315-1-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org
[2] http://lkml.kernel.org/g/20140509195059.GE4486-Gd/HAXX7CRxy/B6EtB590w@public.gmane.org
[3] http://lkml.kernel.org/g/20140509195827.GG4486-Gd/HAXX7CRxy/B6EtB590w@public.gmane.org
[4] http://lkml.kernel.org/g/1399670015-23463-1-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org
^ permalink raw reply [flat|nested] 48+ messages in thread
* [PATCH 01/14] cgroup: remove css_parent()
2014-05-09 21:31 [PATCHSET cgroup/for-3.16] cgroup: iterate cgroup_subsys_states directly Tejun Heo
@ 2014-05-09 21:31 ` Tejun Heo
[not found] ` <1399671091-23867-2-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
2014-05-12 13:16 ` [PATCH " Michal Hocko
[not found] ` <1399671091-23867-1-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
` (9 subsequent siblings)
10 siblings, 2 replies; 48+ messages in thread
From: Tejun Heo @ 2014-05-09 21:31 UTC (permalink / raw)
To: lizefan
Cc: cgroups, linux-kernel, hannes, Tejun Heo, Vivek Goyal, Jens Axboe,
Peter Zijlstra, Michal Hocko, Neil Horman, David S. Miller
cgroup in general is moving towards using cgroup_subsys_state as the
fundamental structural component and css_parent() was introduced to
convert from using cgroup->parent to css->parent. It was quite some
time ago and we're moving forward with making css more prominent.
This patch drops the trivial wrapper css_parent() and let the users
dereference css->parent. While at it, explicitly mark fields of css
which are public and immutable.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Neil Horman <nhorman@tuxdriver.com>
Cc: "David S. Miller" <davem@davemloft.net>
---
block/blk-cgroup.h | 2 +-
include/linux/cgroup.h | 29 +++++++++++------------------
kernel/cgroup.c | 8 ++++----
kernel/cgroup_freezer.c | 2 +-
kernel/cpuset.c | 2 +-
kernel/sched/core.c | 2 +-
kernel/sched/cpuacct.c | 2 +-
mm/hugetlb_cgroup.c | 2 +-
mm/memcontrol.c | 14 +++++++-------
net/core/netclassid_cgroup.c | 2 +-
net/core/netprio_cgroup.c | 2 +-
security/device_cgroup.c | 6 +++---
12 files changed, 33 insertions(+), 40 deletions(-)
diff --git a/block/blk-cgroup.h b/block/blk-cgroup.h
index 371fe8e..d692b29 100644
--- a/block/blk-cgroup.h
+++ b/block/blk-cgroup.h
@@ -204,7 +204,7 @@ static inline struct blkcg *bio_blkcg(struct bio *bio)
*/
static inline struct blkcg *blkcg_parent(struct blkcg *blkcg)
{
- return css_to_blkcg(css_parent(&blkcg->css));
+ return css_to_blkcg(blkcg->css.parent);
}
/**
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 76dadd77..88c4d03 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -48,22 +48,28 @@ enum cgroup_subsys_id {
};
#undef SUBSYS
-/* Per-subsystem/per-cgroup state maintained by the system. */
+/*
+ * Per-subsystem/per-cgroup state maintained by the system. This is the
+ * fundamental structural building block that controllers deal with.
+ *
+ * Fields marked with "PI:" are public and immutable and may be accessed
+ * directly without synchronization.
+ */
struct cgroup_subsys_state {
- /* the cgroup that this css is attached to */
+ /* PI: the cgroup that this css is attached to */
struct cgroup *cgroup;
- /* the cgroup subsystem that this css is attached to */
+ /* PI: the cgroup subsystem that this css is attached to */
struct cgroup_subsys *ss;
/* reference count - access via css_[try]get() and css_put() */
struct percpu_ref refcnt;
- /* the parent css */
+ /* PI: the parent css */
struct cgroup_subsys_state *parent;
/*
- * Subsys-unique ID. 0 is unused and root is always 1. The
+ * PI: Subsys-unique ID. 0 is unused and root is always 1. The
* matching css can be looked up using css_from_id().
*/
int id;
@@ -665,19 +671,6 @@ struct cgroup_subsys {
#undef SUBSYS
/**
- * css_parent - find the parent css
- * @css: the target cgroup_subsys_state
- *
- * Return the parent css of @css. This function is guaranteed to return
- * non-NULL parent as long as @css isn't the root.
- */
-static inline
-struct cgroup_subsys_state *css_parent(struct cgroup_subsys_state *css)
-{
- return css->parent;
-}
-
-/**
* task_css_set_check - obtain a task's css_set with extra access conditions
* @task: the task to obtain css_set for
* @__c: extra condition expression to be passed to rcu_dereference_check()
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 64ff413..dab8f35 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -3175,10 +3175,10 @@ css_next_descendant_pre(struct cgroup_subsys_state *pos,
/* no child, visit my or the closest ancestor's next sibling */
while (pos != root) {
- next = css_next_child(pos, css_parent(pos));
+ next = css_next_child(pos, pos->parent);
if (next)
return next;
- pos = css_parent(pos);
+ pos = pos->parent;
}
return NULL;
@@ -3260,12 +3260,12 @@ css_next_descendant_post(struct cgroup_subsys_state *pos,
return NULL;
/* if there's an unvisited sibling, visit its leftmost descendant */
- next = css_next_child(pos, css_parent(pos));
+ next = css_next_child(pos, pos->parent);
if (next)
return css_leftmost_descendant(next);
/* no sibling left, visit parent */
- return css_parent(pos);
+ return pos->parent;
}
static bool cgroup_has_live_children(struct cgroup *cgrp)
diff --git a/kernel/cgroup_freezer.c b/kernel/cgroup_freezer.c
index f52c443..fee1ae63 100644
--- a/kernel/cgroup_freezer.c
+++ b/kernel/cgroup_freezer.c
@@ -59,7 +59,7 @@ static inline struct freezer *task_freezer(struct task_struct *task)
static struct freezer *parent_freezer(struct freezer *freezer)
{
- return css_freezer(css_parent(&freezer->css));
+ return css_freezer(freezer->css.parent);
}
bool cgroup_freezing(struct task_struct *task)
diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index 2f4b08b..5b2a310 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -124,7 +124,7 @@ static inline struct cpuset *task_cs(struct task_struct *task)
static inline struct cpuset *parent_cs(struct cpuset *cs)
{
- return css_cs(css_parent(&cs->css));
+ return css_cs(cs->css.parent);
}
#ifdef CONFIG_NUMA
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 268a45e..ac61ad1 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7586,7 +7586,7 @@ cpu_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
static int cpu_cgroup_css_online(struct cgroup_subsys_state *css)
{
struct task_group *tg = css_tg(css);
- struct task_group *parent = css_tg(css_parent(css));
+ struct task_group *parent = css_tg(css->parent);
if (parent)
sched_online_group(tg, parent);
diff --git a/kernel/sched/cpuacct.c b/kernel/sched/cpuacct.c
index c143ee3..9cf350c 100644
--- a/kernel/sched/cpuacct.c
+++ b/kernel/sched/cpuacct.c
@@ -46,7 +46,7 @@ static inline struct cpuacct *task_ca(struct task_struct *tsk)
static inline struct cpuacct *parent_ca(struct cpuacct *ca)
{
- return css_ca(css_parent(&ca->css));
+ return css_ca(ca->css.parent);
}
static DEFINE_PER_CPU(u64, root_cpuacct_cpuusage);
diff --git a/mm/hugetlb_cgroup.c b/mm/hugetlb_cgroup.c
index a380681..493f758 100644
--- a/mm/hugetlb_cgroup.c
+++ b/mm/hugetlb_cgroup.c
@@ -52,7 +52,7 @@ static inline bool hugetlb_cgroup_is_root(struct hugetlb_cgroup *h_cg)
static inline struct hugetlb_cgroup *
parent_hugetlb_cgroup(struct hugetlb_cgroup *h_cg)
{
- return hugetlb_cgroup_from_css(css_parent(&h_cg->css));
+ return hugetlb_cgroup_from_css(h_cg->css.parent);
}
static inline bool hugetlb_cgroup_have_usage(struct hugetlb_cgroup *h_cg)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index b638a79..a5e0417 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1540,7 +1540,7 @@ static unsigned long mem_cgroup_margin(struct mem_cgroup *memcg)
int mem_cgroup_swappiness(struct mem_cgroup *memcg)
{
/* root ? */
- if (!css_parent(&memcg->css))
+ if (!memcg->css.parent)
return vm_swappiness;
return memcg->swappiness;
@@ -4909,7 +4909,7 @@ static int mem_cgroup_hierarchy_write(struct cgroup_subsys_state *css,
{
int retval = 0;
struct mem_cgroup *memcg = mem_cgroup_from_css(css);
- struct mem_cgroup *parent_memcg = mem_cgroup_from_css(css_parent(&memcg->css));
+ struct mem_cgroup *parent_memcg = mem_cgroup_from_css(memcg->css.parent);
mutex_lock(&memcg_create_mutex);
@@ -5207,8 +5207,8 @@ static void memcg_get_hierarchical_limit(struct mem_cgroup *memcg,
if (!memcg->use_hierarchy)
goto out;
- while (css_parent(&memcg->css)) {
- memcg = mem_cgroup_from_css(css_parent(&memcg->css));
+ while (memcg->css.parent) {
+ memcg = mem_cgroup_from_css(memcg->css.parent);
if (!memcg->use_hierarchy)
break;
tmp = res_counter_read_u64(&memcg->res, RES_LIMIT);
@@ -5443,7 +5443,7 @@ static int mem_cgroup_swappiness_write(struct cgroup_subsys_state *css,
struct cftype *cft, u64 val)
{
struct mem_cgroup *memcg = mem_cgroup_from_css(css);
- struct mem_cgroup *parent = mem_cgroup_from_css(css_parent(&memcg->css));
+ struct mem_cgroup *parent = mem_cgroup_from_css(memcg->css.parent);
if (val > 100 || !parent)
return -EINVAL;
@@ -5790,7 +5790,7 @@ static int mem_cgroup_oom_control_write(struct cgroup_subsys_state *css,
struct cftype *cft, u64 val)
{
struct mem_cgroup *memcg = mem_cgroup_from_css(css);
- struct mem_cgroup *parent = mem_cgroup_from_css(css_parent(&memcg->css));
+ struct mem_cgroup *parent = mem_cgroup_from_css(memcg->css.parent);
/* cannot set to root cgroup and only 0 and 1 are allowed */
if (!parent || !((val == 0) || (val == 1)))
@@ -6407,7 +6407,7 @@ static int
mem_cgroup_css_online(struct cgroup_subsys_state *css)
{
struct mem_cgroup *memcg = mem_cgroup_from_css(css);
- struct mem_cgroup *parent = mem_cgroup_from_css(css_parent(css));
+ struct mem_cgroup *parent = mem_cgroup_from_css(css->parent);
if (css->id > MEM_CGROUP_ID_MAX)
return -ENOSPC;
diff --git a/net/core/netclassid_cgroup.c b/net/core/netclassid_cgroup.c
index 22931e1..30d903b 100644
--- a/net/core/netclassid_cgroup.c
+++ b/net/core/netclassid_cgroup.c
@@ -42,7 +42,7 @@ cgrp_css_alloc(struct cgroup_subsys_state *parent_css)
static int cgrp_css_online(struct cgroup_subsys_state *css)
{
struct cgroup_cls_state *cs = css_cls_state(css);
- struct cgroup_cls_state *parent = css_cls_state(css_parent(css));
+ struct cgroup_cls_state *parent = css_cls_state(css->parent);
if (parent)
cs->classid = parent->classid;
diff --git a/net/core/netprio_cgroup.c b/net/core/netprio_cgroup.c
index b990cef..2f385b9 100644
--- a/net/core/netprio_cgroup.c
+++ b/net/core/netprio_cgroup.c
@@ -140,7 +140,7 @@ cgrp_css_alloc(struct cgroup_subsys_state *parent_css)
static int cgrp_css_online(struct cgroup_subsys_state *css)
{
- struct cgroup_subsys_state *parent_css = css_parent(css);
+ struct cgroup_subsys_state *parent_css = css->parent;
struct net_device *dev;
int ret = 0;
diff --git a/security/device_cgroup.c b/security/device_cgroup.c
index 82d6b4f..3116015 100644
--- a/security/device_cgroup.c
+++ b/security/device_cgroup.c
@@ -182,7 +182,7 @@ static inline bool is_devcg_online(const struct dev_cgroup *devcg)
static int devcgroup_online(struct cgroup_subsys_state *css)
{
struct dev_cgroup *dev_cgroup = css_to_devcgroup(css);
- struct dev_cgroup *parent_dev_cgroup = css_to_devcgroup(css_parent(css));
+ struct dev_cgroup *parent_dev_cgroup = css_to_devcgroup(css->parent);
int ret = 0;
mutex_lock(&devcgroup_mutex);
@@ -374,7 +374,7 @@ static bool may_access(struct dev_cgroup *dev_cgroup,
static int parent_has_perm(struct dev_cgroup *childcg,
struct dev_exception_item *ex)
{
- struct dev_cgroup *parent = css_to_devcgroup(css_parent(&childcg->css));
+ struct dev_cgroup *parent = css_to_devcgroup(childcg->css.parent);
if (!parent)
return 1;
@@ -502,7 +502,7 @@ static int devcgroup_update_access(struct dev_cgroup *devcgroup,
char temp[12]; /* 11 + 1 characters needed for a u32 */
int count, rc = 0;
struct dev_exception_item ex;
- struct dev_cgroup *parent = css_to_devcgroup(css_parent(&devcgroup->css));
+ struct dev_cgroup *parent = css_to_devcgroup(devcgroup->css.parent);
if (!capable(CAP_SYS_ADMIN))
return -EPERM;
--
1.9.0
^ permalink raw reply related [flat|nested] 48+ messages in thread
* [PATCH 02/14] cgroup: remove pointless has tasks/children test from mem_cgroup_force_empty()
[not found] ` <1399671091-23867-1-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
@ 2014-05-09 21:31 ` Tejun Heo
[not found] ` <1399671091-23867-3-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
2014-05-09 21:31 ` [PATCH 03/14] memcg: update memcg_has_children() to use css_next_child() Tejun Heo
` (6 subsequent siblings)
7 siblings, 1 reply; 48+ messages in thread
From: Tejun Heo @ 2014-05-09 21:31 UTC (permalink / raw)
To: lizefan-hv44wF8Li93QT0dZR+AlfA
Cc: cgroups-u79uwXL29TY76Z2rM5mHXA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA,
hannes-druUgvl0LCNAfugRpC6u6w, Tejun Heo, Michal Hocko
mem_cgroup_force_empty() is used only from
mem_cgroup_force_empty_write() and tests whether the target memcg has
any tasks or children without any synchronization and then returns
-EBUSY if so.
This is just weird. The tests don't really mean anything as tasks and
children may be added after the tests and it also makes the behavior
of the knob arbitrary because there may be lingering offline and
removed children on the children list for extended period of time -
writes to the knob can return -EBUSY for reasons completely invisible
to userland.
The knob is best-effort anyway and the broken business test doesn't
affect its operation. Remove it.
Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
Cc: Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
Cc: Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org>
---
mm/memcontrol.c | 5 -----
1 file changed, 5 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index a5e0417..036453a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4857,11 +4857,6 @@ static inline bool memcg_has_children(struct mem_cgroup *memcg)
static int mem_cgroup_force_empty(struct mem_cgroup *memcg)
{
int nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
- struct cgroup *cgrp = memcg->css.cgroup;
-
- /* returns EBUSY if there is a task or if we come here twice. */
- if (cgroup_has_tasks(cgrp) || !list_empty(&cgrp->children))
- return -EBUSY;
/* we call try-to-free pages for make this cgroup empty */
lru_add_drain_all();
--
1.9.0
^ permalink raw reply related [flat|nested] 48+ messages in thread
* [PATCH 03/14] memcg: update memcg_has_children() to use css_next_child()
[not found] ` <1399671091-23867-1-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
2014-05-09 21:31 ` [PATCH 02/14] cgroup: remove pointless has tasks/children test from mem_cgroup_force_empty() Tejun Heo
@ 2014-05-09 21:31 ` Tejun Heo
[not found] ` <1399671091-23867-4-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
2014-05-13 16:53 ` [PATCH v2 " Tejun Heo
2014-05-09 21:31 ` [PATCH 06/14] cgroup: move cgroup->sibling and ->children into cgroup_subsys_state Tejun Heo
` (5 subsequent siblings)
7 siblings, 2 replies; 48+ messages in thread
From: Tejun Heo @ 2014-05-09 21:31 UTC (permalink / raw)
To: lizefan-hv44wF8Li93QT0dZR+AlfA
Cc: cgroups-u79uwXL29TY76Z2rM5mHXA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA,
hannes-druUgvl0LCNAfugRpC6u6w, Tejun Heo, Michal Hocko
Currently, memcg_has_children() and mem_cgroup_hierarchy_write()
directly test cgroup->children for list emptiness. It's semantically
correct in traditional hierarchies as it actually wants to test for
any children dead or alive; however, cgroup->children is not a
published field and scheduled to go away.
This patch moves out .use_hierarchy test out of memcg_has_children()
and updates it to use css_next_child() to test whether there exists
any children. With .use_hierarchy test moved out, it can also be used
by mem_cgroup_hierarchy_write().
A side note: As .use_hierarchy is going away, it doesn't really matter
but I'm not sure about how it's used in __memcg_activate_kmem(). The
condition tested by memcg_has_children() is mushy when seen from
userland as its result is affected by dead csses which aren't visible
from userland. I think the rule would be a lot clearer if we have a
dedicated "freshly minted" flag which gets cleared when the first task
is migrated into it or the first child is created and then gate
activation with that.
Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
Cc: Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
Cc: Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org>
---
mm/memcontrol.c | 26 ++++++++++++++++----------
1 file changed, 16 insertions(+), 10 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 036453a..eb6e1b5 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4834,18 +4834,23 @@ static void mem_cgroup_reparent_charges(struct mem_cgroup *memcg)
} while (usage > 0);
}
+/* test whether @memcg has children, dead or alive */
static inline bool memcg_has_children(struct mem_cgroup *memcg)
{
- lockdep_assert_held(&memcg_create_mutex);
+ bool ret;
+
/*
- * The lock does not prevent addition or deletion to the list
- * of children, but it prevents a new child from being
- * initialized based on this parent in css_online(), so it's
- * enough to decide whether hierarchically inherited
- * attributes can still be changed or not.
+ * The lock does not prevent addition or deletion of children, but
+ * it prevents a new child from being initialized based on this
+ * parent in css_online(), so it's enough to decide whether
+ * hierarchically inherited attributes can still be changed or not.
*/
- return memcg->use_hierarchy &&
- !list_empty(&memcg->css.cgroup->children);
+ lockdep_assert_held(&memcg_create_mutex);
+
+ rcu_read_lock();
+ ret = css_next_child(NULL, &memcg->css);
+ rcu_read_unlock();
+ return ret;
}
/*
@@ -4921,7 +4926,7 @@ static int mem_cgroup_hierarchy_write(struct cgroup_subsys_state *css,
*/
if ((!parent_memcg || !parent_memcg->use_hierarchy) &&
(val == 1 || val == 0)) {
- if (list_empty(&memcg->css.cgroup->children))
+ if (!memcg_has_children(memcg))
memcg->use_hierarchy = val;
else
retval = -EBUSY;
@@ -5038,7 +5043,8 @@ static int __memcg_activate_kmem(struct mem_cgroup *memcg,
* of course permitted.
*/
mutex_lock(&memcg_create_mutex);
- if (cgroup_has_tasks(memcg->css.cgroup) || memcg_has_children(memcg))
+ if (cgroup_has_tasks(memcg->css.cgroup) ||
+ (memcg->use_hierarchy && memcg_has_children(memcg)))
err = -EBUSY;
mutex_unlock(&memcg_create_mutex);
if (err)
--
1.9.0
^ permalink raw reply related [flat|nested] 48+ messages in thread
* [PATCH 04/14] device_cgroup: remove direct access to cgroup->children
2014-05-09 21:31 [PATCHSET cgroup/for-3.16] cgroup: iterate cgroup_subsys_states directly Tejun Heo
2014-05-09 21:31 ` [PATCH 01/14] cgroup: remove css_parent() Tejun Heo
[not found] ` <1399671091-23867-1-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
@ 2014-05-09 21:31 ` Tejun Heo
[not found] ` <1399671091-23867-5-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
2014-05-09 21:31 ` [PATCH 05/14] cgroup: remove cgroup->parent Tejun Heo
` (7 subsequent siblings)
10 siblings, 1 reply; 48+ messages in thread
From: Tejun Heo @ 2014-05-09 21:31 UTC (permalink / raw)
To: lizefan
Cc: cgroups, linux-kernel, hannes, Tejun Heo, Aristeu Rozanski,
Serge Hallyn
Currently, devcg::has_children() directly tests cgroup->children for
list emptiness. The field is not a published field and scheduled to
go away. In addition, the test isn't strictly correct as devcg should
only care about children which are visible to userland.
This patch converts has_children() to use css_next_child() instead.
The subtle incorrectness is noted and will be dealt with later.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Aristeu Rozanski <aris@redhat.com>
Cc: Serge Hallyn <serge.hallyn@ubuntu.com>
---
security/device_cgroup.c | 12 ++++++++++--
1 file changed, 10 insertions(+), 2 deletions(-)
diff --git a/security/device_cgroup.c b/security/device_cgroup.c
index 3116015..75b4b18 100644
--- a/security/device_cgroup.c
+++ b/security/device_cgroup.c
@@ -477,9 +477,17 @@ static int propagate_exception(struct dev_cgroup *devcg_root,
static inline bool has_children(struct dev_cgroup *devcgroup)
{
- struct cgroup *cgrp = devcgroup->css.cgroup;
+ bool ret;
- return !list_empty(&cgrp->children);
+ /*
+ * FIXME: There may be lingering offline csses and this function
+ * may return %true when there isn't any userland-visible child
+ * which is incorrect for our purposes.
+ */
+ rcu_read_lock();
+ ret = css_next_child(NULL, &devcgroup->css);
+ rcu_read_unlock();
+ return ret;
}
/*
--
1.9.0
^ permalink raw reply related [flat|nested] 48+ messages in thread
* [PATCH 05/14] cgroup: remove cgroup->parent
2014-05-09 21:31 [PATCHSET cgroup/for-3.16] cgroup: iterate cgroup_subsys_states directly Tejun Heo
` (2 preceding siblings ...)
2014-05-09 21:31 ` [PATCH 04/14] device_cgroup: remove direct access to cgroup->children Tejun Heo
@ 2014-05-09 21:31 ` Tejun Heo
2014-05-09 21:31 ` [PATCH 08/14] cgroup: move cgroup->serial_nr into cgroup_subsys_state Tejun Heo
` (6 subsequent siblings)
10 siblings, 0 replies; 48+ messages in thread
From: Tejun Heo @ 2014-05-09 21:31 UTC (permalink / raw)
To: lizefan; +Cc: cgroups, linux-kernel, hannes, Tejun Heo
cgroup->parent is redundant as cgroup->self.parent can also be used to
determine the parent cgroup and we're moving towards using
cgroup_subsys_states as the fundamental structural blocks. This patch
introduces cgroup_parent() which follows cgroup->self.parent and
removes cgroup->parent.
Signed-off-by: Tejun Heo <tj@kernel.org>
---
include/linux/cgroup.h | 1 -
kernel/cgroup.c | 52 +++++++++++++++++++++++++++++---------------------
2 files changed, 30 insertions(+), 23 deletions(-)
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 88c4d03..5fda088 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -173,7 +173,6 @@ struct cgroup {
struct list_head sibling; /* my parent's children */
struct list_head children; /* my children */
- struct cgroup *parent; /* my parent */
struct kernfs_node *kn; /* cgroup kernfs entry */
struct kernfs_node *populated_kn; /* kn for "cgroup.subtree_populated" */
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index dab8f35..d61c716 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -218,6 +218,15 @@ static void cgroup_idr_remove(struct idr *idr, int id)
spin_unlock_bh(&cgroup_idr_lock);
}
+static struct cgroup *cgroup_parent(struct cgroup *cgrp)
+{
+ struct cgroup_subsys_state *parent_css = cgrp->self.parent;
+
+ if (parent_css)
+ return container_of(parent_css, struct cgroup, self);
+ return NULL;
+}
+
/**
* cgroup_css - obtain a cgroup's css for the specified subsystem
* @cgrp: the cgroup of interest
@@ -260,9 +269,9 @@ static struct cgroup_subsys_state *cgroup_e_css(struct cgroup *cgrp,
if (!(cgrp->root->subsys_mask & (1 << ss->id)))
return NULL;
- while (cgrp->parent &&
- !(cgrp->parent->child_subsys_mask & (1 << ss->id)))
- cgrp = cgrp->parent;
+ while (cgroup_parent(cgrp) &&
+ !(cgroup_parent(cgrp)->child_subsys_mask & (1 << ss->id)))
+ cgrp = cgroup_parent(cgrp);
return cgroup_css(cgrp, ss);
}
@@ -307,7 +316,7 @@ bool cgroup_is_descendant(struct cgroup *cgrp, struct cgroup *ancestor)
while (cgrp) {
if (cgrp == ancestor)
return true;
- cgrp = cgrp->parent;
+ cgrp = cgroup_parent(cgrp);
}
return false;
}
@@ -454,7 +463,7 @@ static void cgroup_update_populated(struct cgroup *cgrp, bool populated)
if (cgrp->populated_kn)
kernfs_notify(cgrp->populated_kn);
- cgrp = cgrp->parent;
+ cgrp = cgroup_parent(cgrp);
} while (cgrp);
}
@@ -2017,7 +2026,7 @@ static int cgroup_migrate_prepare_dst(struct cgroup *dst_cgrp,
* Except for the root, child_subsys_mask must be zero for a cgroup
* with tasks so that child cgroups don't compete against tasks.
*/
- if (dst_cgrp && cgroup_on_dfl(dst_cgrp) && dst_cgrp->parent &&
+ if (dst_cgrp && cgroup_on_dfl(dst_cgrp) && cgroup_parent(dst_cgrp) &&
dst_cgrp->child_subsys_mask)
return -EBUSY;
@@ -2426,7 +2435,7 @@ static int cgroup_controllers_show(struct seq_file *seq, void *v)
{
struct cgroup *cgrp = seq_css(seq)->cgroup;
- cgroup_print_ss_mask(seq, cgrp->parent->child_subsys_mask);
+ cgroup_print_ss_mask(seq, cgroup_parent(cgrp)->child_subsys_mask);
return 0;
}
@@ -2609,8 +2618,8 @@ static ssize_t cgroup_subtree_control_write(struct kernfs_open_file *of,
/* unavailable or not enabled on the parent? */
if (!(cgrp_dfl_root.subsys_mask & (1 << ssid)) ||
- (cgrp->parent &&
- !(cgrp->parent->child_subsys_mask & (1 << ssid)))) {
+ (cgroup_parent(cgrp) &&
+ !(cgroup_parent(cgrp)->child_subsys_mask & (1 << ssid)))) {
ret = -ENOENT;
goto out_unlock;
}
@@ -2639,7 +2648,7 @@ static ssize_t cgroup_subtree_control_write(struct kernfs_open_file *of,
* Except for the root, child_subsys_mask must be zero for a cgroup
* with tasks so that child cgroups don't compete against tasks.
*/
- if (enable && cgrp->parent && !list_empty(&cgrp->cset_links)) {
+ if (enable && cgroup_parent(cgrp) && !list_empty(&cgrp->cset_links)) {
ret = -EBUSY;
goto out_unlock;
}
@@ -2897,9 +2906,9 @@ static int cgroup_addrm_files(struct cgroup *cgrp, struct cftype cfts[],
continue;
if ((cft->flags & CFTYPE_INSANE) && cgroup_sane_behavior(cgrp))
continue;
- if ((cft->flags & CFTYPE_NOT_ON_ROOT) && !cgrp->parent)
+ if ((cft->flags & CFTYPE_NOT_ON_ROOT) && !cgroup_parent(cgrp))
continue;
- if ((cft->flags & CFTYPE_ONLY_ON_ROOT) && cgrp->parent)
+ if ((cft->flags & CFTYPE_ONLY_ON_ROOT) && cgroup_parent(cgrp))
continue;
if (is_add) {
@@ -4091,14 +4100,14 @@ static void css_free_work_fn(struct work_struct *work)
atomic_dec(&cgrp->root->nr_cgrps);
cgroup_pidlist_destroy_all(cgrp);
- if (cgrp->parent) {
+ if (cgroup_parent(cgrp)) {
/*
* We get a ref to the parent, and put the ref when
* this cgroup is being freed, so it's guaranteed
* that the parent won't be destroyed before its
* children.
*/
- cgroup_put(cgrp->parent);
+ cgroup_put(cgroup_parent(cgrp));
kernfs_put(cgrp->kn);
kfree(cgrp);
} else {
@@ -4162,8 +4171,8 @@ static void init_and_link_css(struct cgroup_subsys_state *css,
css->ss = ss;
css->flags = 0;
- if (cgrp->parent) {
- css->parent = cgroup_css(cgrp->parent, ss);
+ if (cgroup_parent(cgrp)) {
+ css->parent = cgroup_css(cgroup_parent(cgrp), ss);
css_get(css->parent);
}
@@ -4217,7 +4226,7 @@ static void offline_css(struct cgroup_subsys_state *css)
*/
static int create_css(struct cgroup *cgrp, struct cgroup_subsys *ss)
{
- struct cgroup *parent = cgrp->parent;
+ struct cgroup *parent = cgroup_parent(cgrp);
struct cgroup_subsys_state *css;
int err;
@@ -4250,7 +4259,7 @@ static int create_css(struct cgroup *cgrp, struct cgroup_subsys *ss)
goto err_clear_dir;
if (ss->broken_hierarchy && !ss->warned_broken_hierarchy &&
- parent->parent) {
+ cgroup_parent(parent)) {
pr_warn("%s (%d) created nested cgroup for controller \"%s\" which has incomplete hierarchy support. Nested cgroups may change behavior in the future.\n",
current->comm, current->pid, ss->name);
if (!strcmp(ss->name, "memory"))
@@ -4308,7 +4317,6 @@ static int cgroup_mkdir(struct kernfs_node *parent_kn, const char *name,
init_cgroup_housekeeping(cgrp);
- cgrp->parent = parent;
cgrp->self.parent = &parent->self;
cgrp->root = root;
@@ -4335,7 +4343,7 @@ static int cgroup_mkdir(struct kernfs_node *parent_kn, const char *name,
cgrp->serial_nr = cgroup_serial_nr_next++;
/* allocation complete, commit to creation */
- list_add_tail_rcu(&cgrp->sibling, &cgrp->parent->children);
+ list_add_tail_rcu(&cgrp->sibling, &cgroup_parent(cgrp)->children);
atomic_inc(&root->nr_cgrps);
cgroup_get(parent);
@@ -4530,8 +4538,8 @@ static int cgroup_destroy_locked(struct cgroup *cgrp)
*/
kernfs_remove(cgrp->kn);
- set_bit(CGRP_RELEASABLE, &cgrp->parent->flags);
- check_for_release(cgrp->parent);
+ set_bit(CGRP_RELEASABLE, &cgroup_parent(cgrp)->flags);
+ check_for_release(cgroup_parent(cgrp));
/* put the base reference */
percpu_ref_kill(&cgrp->self.refcnt);
--
1.9.0
^ permalink raw reply related [flat|nested] 48+ messages in thread
* [PATCH 06/14] cgroup: move cgroup->sibling and ->children into cgroup_subsys_state
[not found] ` <1399671091-23867-1-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
2014-05-09 21:31 ` [PATCH 02/14] cgroup: remove pointless has tasks/children test from mem_cgroup_force_empty() Tejun Heo
2014-05-09 21:31 ` [PATCH 03/14] memcg: update memcg_has_children() to use css_next_child() Tejun Heo
@ 2014-05-09 21:31 ` Tejun Heo
2014-05-09 21:31 ` [PATCH 07/14] cgroup: link all cgroup_subsys_states in their sibling lists Tejun Heo
` (4 subsequent siblings)
7 siblings, 0 replies; 48+ messages in thread
From: Tejun Heo @ 2014-05-09 21:31 UTC (permalink / raw)
To: lizefan-hv44wF8Li93QT0dZR+AlfA
Cc: cgroups-u79uwXL29TY76Z2rM5mHXA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA,
hannes-druUgvl0LCNAfugRpC6u6w, Tejun Heo
We're moving towards using cgroup_subsys_states as the fundamental
structural blocks. Let's move cgroup->sibling and ->children into
cgroup_subsys_state. This is pure move without functional change and
only cgroup->self's fields are actually used. Other csses will make
use of the fields later.
While at it, update init_and_link_css() so that it zeroes the whole
css before initializing it and remove explicit zeroing of ->flags.
Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
---
include/linux/cgroup.h | 11 ++++-------
kernel/cgroup.c | 38 ++++++++++++++++++++------------------
2 files changed, 24 insertions(+), 25 deletions(-)
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 5fda088..db8d8c2 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -68,6 +68,10 @@ struct cgroup_subsys_state {
/* PI: the parent css */
struct cgroup_subsys_state *parent;
+ /* siblings list anchored at the parent's ->children */
+ struct list_head sibling;
+ struct list_head children;
+
/*
* PI: Subsys-unique ID. 0 is unused and root is always 1. The
* matching css can be looked up using css_from_id().
@@ -166,13 +170,6 @@ struct cgroup {
*/
int populated_cnt;
- /*
- * We link our 'sibling' struct into our parent's 'children'.
- * Our children link their 'sibling' into our 'children'.
- */
- struct list_head sibling; /* my parent's children */
- struct list_head children; /* my children */
-
struct kernfs_node *kn; /* cgroup kernfs entry */
struct kernfs_node *populated_kn; /* kn for "cgroup.subtree_populated" */
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index d61c716..3d6c0aa 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -378,7 +378,7 @@ static int notify_on_release(const struct cgroup *cgrp)
/* iterate over child cgrps, lock should be held throughout iteration */
#define cgroup_for_each_live_child(child, cgrp) \
- list_for_each_entry((child), &(cgrp)->children, sibling) \
+ list_for_each_entry((child), &(cgrp)->self.children, self.sibling) \
if (({ lockdep_assert_held(&cgroup_mutex); \
cgroup_is_dead(child); })) \
; \
@@ -870,7 +870,7 @@ static void cgroup_destroy_root(struct cgroup_root *root)
mutex_lock(&cgroup_mutex);
BUG_ON(atomic_read(&root->nr_cgrps));
- BUG_ON(!list_empty(&cgrp->children));
+ BUG_ON(!list_empty(&cgrp->self.children));
/* Rebind all subsystems back to the default hierarchy */
rebind_subsystems(&cgrp_dfl_root, root->subsys_mask);
@@ -1432,7 +1432,7 @@ static int cgroup_remount(struct kernfs_root *kf_root, int *flags, char *data)
}
/* remounting is not allowed for populated hierarchies */
- if (!list_empty(&root->cgrp.children)) {
+ if (!list_empty(&root->cgrp.self.children)) {
ret = -EBUSY;
goto out_unlock;
}
@@ -1512,8 +1512,8 @@ static void init_cgroup_housekeeping(struct cgroup *cgrp)
struct cgroup_subsys *ss;
int ssid;
- INIT_LIST_HEAD(&cgrp->sibling);
- INIT_LIST_HEAD(&cgrp->children);
+ INIT_LIST_HEAD(&cgrp->self.sibling);
+ INIT_LIST_HEAD(&cgrp->self.children);
INIT_LIST_HEAD(&cgrp->cset_links);
INIT_LIST_HEAD(&cgrp->release_list);
INIT_LIST_HEAD(&cgrp->pidlists);
@@ -1612,7 +1612,7 @@ static int cgroup_setup_root(struct cgroup_root *root, unsigned int ss_mask)
link_css_set(&tmp_links, cset, root_cgrp);
up_write(&css_set_rwsem);
- BUG_ON(!list_empty(&root_cgrp->children));
+ BUG_ON(!list_empty(&root_cgrp->self.children));
BUG_ON(atomic_read(&root->nr_cgrps) != 1);
kernfs_activate(root_cgrp->kn);
@@ -3127,11 +3127,11 @@ css_next_child(struct cgroup_subsys_state *pos_css,
* cgroup is removed or iteration and removal race.
*/
if (!pos) {
- next = list_entry_rcu(cgrp->children.next, struct cgroup, sibling);
+ next = list_entry_rcu(cgrp->self.children.next, struct cgroup, self.sibling);
} else if (likely(!cgroup_is_dead(pos))) {
- next = list_entry_rcu(pos->sibling.next, struct cgroup, sibling);
+ next = list_entry_rcu(pos->self.sibling.next, struct cgroup, self.sibling);
} else {
- list_for_each_entry_rcu(next, &cgrp->children, sibling)
+ list_for_each_entry_rcu(next, &cgrp->self.children, self.sibling)
if (next->serial_nr > pos->serial_nr)
break;
}
@@ -3141,12 +3141,12 @@ css_next_child(struct cgroup_subsys_state *pos_css,
* the next sibling; however, it might have @ss disabled. If so,
* fast-forward to the next enabled one.
*/
- while (&next->sibling != &cgrp->children) {
+ while (&next->self.sibling != &cgrp->self.children) {
struct cgroup_subsys_state *next_css = cgroup_css(next, parent_css->ss);
if (next_css)
return next_css;
- next = list_entry_rcu(next->sibling.next, struct cgroup, sibling);
+ next = list_entry_rcu(next->self.sibling.next, struct cgroup, self.sibling);
}
return NULL;
}
@@ -3282,7 +3282,7 @@ static bool cgroup_has_live_children(struct cgroup *cgrp)
struct cgroup *child;
rcu_read_lock();
- list_for_each_entry_rcu(child, &cgrp->children, sibling) {
+ list_for_each_entry_rcu(child, &cgrp->self.children, self.sibling) {
if (!cgroup_is_dead(child)) {
rcu_read_unlock();
return true;
@@ -4143,7 +4143,7 @@ static void css_release_work_fn(struct work_struct *work)
} else {
/* cgroup release path */
mutex_lock(&cgroup_mutex);
- list_del_rcu(&cgrp->sibling);
+ list_del_rcu(&cgrp->self.sibling);
mutex_unlock(&cgroup_mutex);
cgroup_idr_remove(&cgrp->root->cgroup_idr, cgrp->id);
@@ -4167,9 +4167,11 @@ static void init_and_link_css(struct cgroup_subsys_state *css,
{
cgroup_get(cgrp);
+ memset(css, 0, sizeof(*css));
css->cgroup = cgrp;
css->ss = ss;
- css->flags = 0;
+ INIT_LIST_HEAD(&css->sibling);
+ INIT_LIST_HEAD(&css->children);
if (cgroup_parent(cgrp)) {
css->parent = cgroup_css(cgroup_parent(cgrp), ss);
@@ -4343,7 +4345,7 @@ static int cgroup_mkdir(struct kernfs_node *parent_kn, const char *name,
cgrp->serial_nr = cgroup_serial_nr_next++;
/* allocation complete, commit to creation */
- list_add_tail_rcu(&cgrp->sibling, &cgroup_parent(cgrp)->children);
+ list_add_tail_rcu(&cgrp->self.sibling, &cgroup_parent(cgrp)->self.children);
atomic_inc(&root->nr_cgrps);
cgroup_get(parent);
@@ -4506,9 +4508,9 @@ static int cgroup_destroy_locked(struct cgroup *cgrp)
return -EBUSY;
/*
- * Make sure there's no live children. We can't test ->children
- * emptiness as dead children linger on it while being destroyed;
- * otherwise, "rmdir parent/child parent" may fail with -EBUSY.
+ * Make sure there's no live children. We can't test emptiness of
+ * ->self.children as dead children linger on it while being
+ * drained; otherwise, "rmdir parent/child parent" may fail.
*/
if (cgroup_has_live_children(cgrp))
return -EBUSY;
--
1.9.0
^ permalink raw reply related [flat|nested] 48+ messages in thread
* [PATCH 07/14] cgroup: link all cgroup_subsys_states in their sibling lists
[not found] ` <1399671091-23867-1-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
` (2 preceding siblings ...)
2014-05-09 21:31 ` [PATCH 06/14] cgroup: move cgroup->sibling and ->children into cgroup_subsys_state Tejun Heo
@ 2014-05-09 21:31 ` Tejun Heo
2014-05-13 16:59 ` [PATCHSET cgroup/for-3.16] cgroup: iterate cgroup_subsys_states directly Tejun Heo
` (3 subsequent siblings)
7 siblings, 0 replies; 48+ messages in thread
From: Tejun Heo @ 2014-05-09 21:31 UTC (permalink / raw)
To: lizefan-hv44wF8Li93QT0dZR+AlfA
Cc: cgroups-u79uwXL29TY76Z2rM5mHXA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA,
hannes-druUgvl0LCNAfugRpC6u6w, Tejun Heo
Currently, while all csses have ->children and ->sibling, only the
self csses of cgroups make use of them. This patch makes all other
csses to link themselves on the sibling lists too. This will be used
to update css iteration.
Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
---
kernel/cgroup.c | 19 ++++++++++++-------
1 file changed, 12 insertions(+), 7 deletions(-)
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 3d6c0aa..6930b78 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -4137,19 +4137,21 @@ static void css_release_work_fn(struct work_struct *work)
struct cgroup_subsys *ss = css->ss;
struct cgroup *cgrp = css->cgroup;
+ mutex_lock(&cgroup_mutex);
+
+ list_del_rcu(&css->sibling);
+
if (ss) {
/* css release path */
cgroup_idr_remove(&ss->css_idr, css->id);
} else {
/* cgroup release path */
- mutex_lock(&cgroup_mutex);
- list_del_rcu(&cgrp->self.sibling);
- mutex_unlock(&cgroup_mutex);
-
cgroup_idr_remove(&cgrp->root->cgroup_idr, cgrp->id);
cgrp->id = -1;
}
+ mutex_unlock(&cgroup_mutex);
+
call_rcu(&css->rcu_head, css_free_rcu_fn);
}
@@ -4229,12 +4231,13 @@ static void offline_css(struct cgroup_subsys_state *css)
static int create_css(struct cgroup *cgrp, struct cgroup_subsys *ss)
{
struct cgroup *parent = cgroup_parent(cgrp);
+ struct cgroup_subsys_state *parent_css = cgroup_css(parent, ss);
struct cgroup_subsys_state *css;
int err;
lockdep_assert_held(&cgroup_mutex);
- css = ss->css_alloc(cgroup_css(parent, ss));
+ css = ss->css_alloc(parent_css);
if (IS_ERR(css))
return PTR_ERR(css);
@@ -4254,11 +4257,12 @@ static int create_css(struct cgroup *cgrp, struct cgroup_subsys *ss)
goto err_free_id;
/* @css is ready to be brought online now, make it visible */
+ list_add_tail_rcu(&css->sibling, &parent_css->children);
cgroup_idr_replace(&ss->css_idr, css, css->id);
err = online_css(css);
if (err)
- goto err_clear_dir;
+ goto err_list_del;
if (ss->broken_hierarchy && !ss->warned_broken_hierarchy &&
cgroup_parent(parent)) {
@@ -4271,7 +4275,8 @@ static int create_css(struct cgroup *cgrp, struct cgroup_subsys *ss)
return 0;
-err_clear_dir:
+err_list_del:
+ list_del_rcu(&css->sibling);
cgroup_clear_dir(css->cgroup, 1 << css->ss->id);
err_free_id:
cgroup_idr_remove(&ss->css_idr, css->id);
--
1.9.0
^ permalink raw reply related [flat|nested] 48+ messages in thread
* [PATCH 08/14] cgroup: move cgroup->serial_nr into cgroup_subsys_state
2014-05-09 21:31 [PATCHSET cgroup/for-3.16] cgroup: iterate cgroup_subsys_states directly Tejun Heo
` (3 preceding siblings ...)
2014-05-09 21:31 ` [PATCH 05/14] cgroup: remove cgroup->parent Tejun Heo
@ 2014-05-09 21:31 ` Tejun Heo
2014-05-09 21:31 ` [PATCH 09/14] cgroup: introduce CSS_RELEASED and reduce css iteration fallback window Tejun Heo
` (5 subsequent siblings)
10 siblings, 0 replies; 48+ messages in thread
From: Tejun Heo @ 2014-05-09 21:31 UTC (permalink / raw)
To: lizefan; +Cc: cgroups, linux-kernel, hannes, Tejun Heo
We're moving towards using cgroup_subsys_states as the fundamental
structural blocks. All csses including the cgroup->self and actual
ones now form trees through css->children and ->sibling which follow
the same rules as what cgroup->children and ->sibling followed. This
patch moves cgroup->serial_nr which is used to implement css iteration
into css.
Note that all csses, regardless of their types, allocate their serial
numbers from the same monotonically increasing counter. This doesn't
affect the ordering needed by css iteration or cause any other
material behavior changes. This will be used to update css iteration.
Signed-off-by: Tejun Heo <tj@kernel.org>
---
include/linux/cgroup.h | 16 ++++++++--------
kernel/cgroup.c | 20 +++++++++++---------
2 files changed, 19 insertions(+), 17 deletions(-)
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index db8d8c2..e65cc0f 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -80,6 +80,14 @@ struct cgroup_subsys_state {
unsigned int flags;
+ /*
+ * Monotonically increasing unique serial number which defines a
+ * uniform order among all csses. It's guaranteed that all
+ * ->children lists are in the ascending order of ->serial_nr and
+ * used to allow interrupting and resuming iterations.
+ */
+ u64 serial_nr;
+
/* percpu_ref killing and RCU release */
struct rcu_head rcu_head;
struct work_struct destroy_work;
@@ -173,14 +181,6 @@ struct cgroup {
struct kernfs_node *kn; /* cgroup kernfs entry */
struct kernfs_node *populated_kn; /* kn for "cgroup.subtree_populated" */
- /*
- * Monotonically increasing unique serial number which defines a
- * uniform order among all cgroups. It's guaranteed that all
- * ->children lists are in the ascending order of ->serial_nr.
- * It's used to allow interrupting and resuming iterations.
- */
- u64 serial_nr;
-
/* the bitmask of subsystems enabled on the child cgroups */
unsigned int child_subsys_mask;
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 6930b78..3eda323 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -157,14 +157,13 @@ static int cgroup_root_count;
static DEFINE_IDR(cgroup_hierarchy_idr);
/*
- * Assign a monotonically increasing serial number to cgroups. It
- * guarantees cgroups with bigger numbers are newer than those with smaller
- * numbers. Also, as cgroups are always appended to the parent's
- * ->children list, it guarantees that sibling cgroups are always sorted in
- * the ascending serial number order on the list. Protected by
- * cgroup_mutex.
+ * Assign a monotonically increasing serial number to csses. It guarantees
+ * cgroups with bigger numbers are newer than those with smaller numbers.
+ * Also, as csses are always appended to the parent's ->children list, it
+ * guarantees that sibling csses are always sorted in the ascending serial
+ * number order on the list. Protected by cgroup_mutex.
*/
-static u64 cgroup_serial_nr_next = 1;
+static u64 css_serial_nr_next = 1;
/* This flag indicates whether tasks in the fork and exit paths should
* check for fork/exit handlers to call. This avoids us having to do
@@ -3132,7 +3131,7 @@ css_next_child(struct cgroup_subsys_state *pos_css,
next = list_entry_rcu(pos->self.sibling.next, struct cgroup, self.sibling);
} else {
list_for_each_entry_rcu(next, &cgrp->self.children, self.sibling)
- if (next->serial_nr > pos->serial_nr)
+ if (next->self.serial_nr > pos->self.serial_nr)
break;
}
@@ -4167,6 +4166,8 @@ static void css_release(struct percpu_ref *ref)
static void init_and_link_css(struct cgroup_subsys_state *css,
struct cgroup_subsys *ss, struct cgroup *cgrp)
{
+ lockdep_assert_held(&cgroup_mutex);
+
cgroup_get(cgrp);
memset(css, 0, sizeof(*css));
@@ -4174,6 +4175,7 @@ static void init_and_link_css(struct cgroup_subsys_state *css,
css->ss = ss;
INIT_LIST_HEAD(&css->sibling);
INIT_LIST_HEAD(&css->children);
+ css->serial_nr = css_serial_nr_next++;
if (cgroup_parent(cgrp)) {
css->parent = cgroup_css(cgroup_parent(cgrp), ss);
@@ -4347,7 +4349,7 @@ static int cgroup_mkdir(struct kernfs_node *parent_kn, const char *name,
*/
kernfs_get(kn);
- cgrp->serial_nr = cgroup_serial_nr_next++;
+ cgrp->self.serial_nr = css_serial_nr_next++;
/* allocation complete, commit to creation */
list_add_tail_rcu(&cgrp->self.sibling, &cgroup_parent(cgrp)->self.children);
--
1.9.0
^ permalink raw reply related [flat|nested] 48+ messages in thread
* [PATCH 09/14] cgroup: introduce CSS_RELEASED and reduce css iteration fallback window
2014-05-09 21:31 [PATCHSET cgroup/for-3.16] cgroup: iterate cgroup_subsys_states directly Tejun Heo
` (4 preceding siblings ...)
2014-05-09 21:31 ` [PATCH 08/14] cgroup: move cgroup->serial_nr into cgroup_subsys_state Tejun Heo
@ 2014-05-09 21:31 ` Tejun Heo
[not found] ` <1399671091-23867-10-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
2014-05-09 21:31 ` [PATCH 10/14] cgroup: iterate cgroup_subsys_states directly Tejun Heo
` (4 subsequent siblings)
10 siblings, 1 reply; 48+ messages in thread
From: Tejun Heo @ 2014-05-09 21:31 UTC (permalink / raw)
To: lizefan; +Cc: cgroups, linux-kernel, hannes, Tejun Heo
css iterations allow the caller to drop RCU read lock. As long as the
caller keeps the current position accessible, it can simply re-grab
RCU read lock later and continue iteration. This is achieved by using
CGRP_DEAD to detect whether the current positions next pointer is safe
to dereference and if not re-iterate from the beginning to the next
position using ->serial_nr.
CGRP_DEAD is used as the marker to invalidate the next pointer and the
only requirement is that the marker is set before the next sibling
starts its RCU grace period. Because CGRP_DEAD is set at the end of
cgroup_destroy_locked() but the cgroup is unlinked when the reference
count reaches zero, we currently have a rather large window where this
fallback re-iteration logic can be triggered.
This patch introduces CSS_RELEASED which is set when a css is unlinked
from its sibling list. This still keeps the re-iteration logic
working while drastically reducing the window of its activation.
While at it, rewrite the comment in css_next_child() to reflect the
new flag and better explain the synchronization.
This will also enable iterating csses directly instead of through
cgroups.
Signed-off-by: Tejun Heo <tj@kernel.org>
---
include/linux/cgroup.h | 1 +
kernel/cgroup.c | 41 ++++++++++++++++++++---------------------
2 files changed, 21 insertions(+), 21 deletions(-)
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index e65cc0f..634ecc1 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -95,6 +95,7 @@ struct cgroup_subsys_state {
/* bits in struct cgroup_subsys_state flags field */
enum {
+ CSS_RELEASED = (1 << 0), /* refcnt reached zero, released */
CSS_ONLINE = (1 << 1), /* between ->css_online() and ->css_offline() */
};
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 3eda323..eaea062 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -3107,27 +3107,28 @@ css_next_child(struct cgroup_subsys_state *pos_css,
cgroup_assert_mutex_or_rcu_locked();
/*
- * @pos could already have been removed. Once a cgroup is removed,
- * its ->sibling.next is no longer updated when its next sibling
- * changes. As CGRP_DEAD assertion is serialized and happens
- * before the cgroup is taken off the ->sibling list, if we see it
- * unasserted, it's guaranteed that the next sibling hasn't
- * finished its grace period even if it's already removed, and thus
- * safe to dereference from this RCU critical section. If
- * ->sibling.next is inaccessible, cgroup_is_dead() is guaranteed
- * to be visible as %true here.
+ * @pos could already have been unlinked from the sibling list.
+ * Once a cgroup is removed, its ->sibling.next is no longer
+ * updated when its next sibling changes. CSS_RELEASED is set when
+ * @pos is taken off list, at which time its next pointer is valid,
+ * and, as releases are serialized, the one pointed to by the next
+ * pointer is guaranteed to not have started release yet. This
+ * implies that if we observe !CSS_RELEASED on @pos in this RCU
+ * critical section, the one pointed to by its next pointer is
+ * guaranteed to not have finished its RCU grace period even if we
+ * have dropped rcu_read_lock() inbetween iterations.
*
- * If @pos is dead, its next pointer can't be dereferenced;
- * however, as each cgroup is given a monotonically increasing
- * unique serial number and always appended to the sibling list,
- * the next one can be found by walking the parent's children until
- * we see a cgroup with higher serial number than @pos's. While
- * this path can be slower, it's taken only when either the current
- * cgroup is removed or iteration and removal race.
+ * If @pos has CSS_RELEASED set, its next pointer can't be
+ * dereferenced; however, as each css is given a monotonically
+ * increasing unique serial number and always appended to the
+ * sibling list, the next one can be found by walking the parent's
+ * children until the first css with higher serial number than
+ * @pos's. While this path can be slower, it happens iff iteration
+ * races against release and the race window is very small.
*/
if (!pos) {
next = list_entry_rcu(cgrp->self.children.next, struct cgroup, self.sibling);
- } else if (likely(!cgroup_is_dead(pos))) {
+ } else if (likely(!(pos->self.flags & CSS_RELEASED))) {
next = list_entry_rcu(pos->self.sibling.next, struct cgroup, self.sibling);
} else {
list_for_each_entry_rcu(next, &cgrp->self.children, self.sibling)
@@ -4138,6 +4139,7 @@ static void css_release_work_fn(struct work_struct *work)
mutex_lock(&cgroup_mutex);
+ css->flags |= CSS_RELEASED;
list_del_rcu(&css->sibling);
if (ss) {
@@ -4524,10 +4526,7 @@ static int cgroup_destroy_locked(struct cgroup *cgrp)
/*
* Mark @cgrp dead. This prevents further task migration and child
- * creation by disabling cgroup_lock_live_group(). Note that
- * CGRP_DEAD assertion is depended upon by css_next_child() to
- * resume iteration after dropping RCU read lock. See
- * css_next_child() for details.
+ * creation by disabling cgroup_lock_live_group().
*/
set_bit(CGRP_DEAD, &cgrp->flags);
--
1.9.0
^ permalink raw reply related [flat|nested] 48+ messages in thread
* [PATCH 10/14] cgroup: iterate cgroup_subsys_states directly
2014-05-09 21:31 [PATCHSET cgroup/for-3.16] cgroup: iterate cgroup_subsys_states directly Tejun Heo
` (5 preceding siblings ...)
2014-05-09 21:31 ` [PATCH 09/14] cgroup: introduce CSS_RELEASED and reduce css iteration fallback window Tejun Heo
@ 2014-05-09 21:31 ` Tejun Heo
2014-05-09 21:31 ` [PATCH 11/14] cgroup: use CSS_ONLINE instead of CGRP_DEAD Tejun Heo
` (3 subsequent siblings)
10 siblings, 0 replies; 48+ messages in thread
From: Tejun Heo @ 2014-05-09 21:31 UTC (permalink / raw)
To: lizefan; +Cc: cgroups, linux-kernel, hannes, Tejun Heo
Currently, css_next_child() is implemented as finding the next child
cgroup which has the css enabled, which used to be the only way to do
it as only cgroups participated in sibling lists and thus could be
iteratd. This works as long as what's required during iteration is
not missing online csses; however, it turns out that there are use
cases where offlined but not yet released csses need to be iterated.
This is difficult to implement through cgroup iteration the unified
hierarchy as there may be multiple dying csses for the same subsystem
associated with single cgroup.
After the recent changes, the cgroup self and regular csses behave
identically in how they're linked and unlinked from the sibling lists
including assertion of CSS_RELEASED and css_next_child() can simply
switch to iterating csses directly. This both simplifies the logic
and ensures that all visible non-released csses are included in the
iteration whether there are multiple dying csses for a subsystem or
not.
As all other iterators depend on css_next_child() for sibling
iteration, this changes behaviors of all css iterators. Add and
update explanations on the css states which are included in traversal
to all iterators.
As css iteration could always contain offlined csses, this shouldn't
break any of the current users and new usages which need iteration of
all on and offline csses can make use of the new semantics.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
---
include/linux/cgroup.h | 44 ++++++++++++++++++++---------------
kernel/cgroup.c | 62 ++++++++++++++++++++++++++++++--------------------
2 files changed, 63 insertions(+), 43 deletions(-)
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 634ecc1..58867ed 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -759,14 +759,14 @@ struct cgroup_subsys_state *css_from_id(int id, struct cgroup_subsys *ss);
* @pos: the css * to use as the loop cursor
* @parent: css whose children to walk
*
- * Walk @parent's children. Must be called under rcu_read_lock(). A child
- * css which hasn't finished ->css_online() or already has finished
- * ->css_offline() may show up during traversal and it's each subsystem's
- * responsibility to verify that each @pos is alive.
+ * Walk @parent's children. Must be called under rcu_read_lock().
*
- * If a subsystem synchronizes against the parent in its ->css_online() and
- * before starting iterating, a css which finished ->css_online() is
- * guaranteed to be visible in the future iterations.
+ * If a subsystem synchronizes ->css_online() and the start of iteration, a
+ * css which finished ->css_online() is guaranteed to be visible in the
+ * future iterations and will stay visible until the last reference is put.
+ * A css which hasn't finished ->css_online() or already finished
+ * ->css_offline() may show up during traversal. It's each subsystem's
+ * responsibility to synchronize against on/offlining.
*
* It is allowed to temporarily drop RCU read lock during iteration. The
* caller is responsible for ensuring that @pos remains accessible until
@@ -789,17 +789,16 @@ css_rightmost_descendant(struct cgroup_subsys_state *pos);
* @root: css whose descendants to walk
*
* Walk @root's descendants. @root is included in the iteration and the
- * first node to be visited. Must be called under rcu_read_lock(). A
- * descendant css which hasn't finished ->css_online() or already has
- * finished ->css_offline() may show up during traversal and it's each
- * subsystem's responsibility to verify that each @pos is alive.
+ * first node to be visited. Must be called under rcu_read_lock().
*
- * If a subsystem synchronizes against the parent in its ->css_online() and
- * before starting iterating, and synchronizes against @pos on each
- * iteration, any descendant css which finished ->css_online() is
- * guaranteed to be visible in the future iterations.
+ * If a subsystem synchronizes ->css_online() and the start of iteration, a
+ * css which finished ->css_online() is guaranteed to be visible in the
+ * future iterations and will stay visible until the last reference is put.
+ * A css which hasn't finished ->css_online() or already finished
+ * ->css_offline() may show up during traversal. It's each subsystem's
+ * responsibility to synchronize against on/offlining.
*
- * In other words, the following guarantees that a descendant can't escape
+ * For example, the following guarantees that a descendant can't escape
* state updates of its ancestors.
*
* my_online(@css)
@@ -855,8 +854,17 @@ css_next_descendant_post(struct cgroup_subsys_state *pos,
*
* Similar to css_for_each_descendant_pre() but performs post-order
* traversal instead. @root is included in the iteration and the last
- * node to be visited. Note that the walk visibility guarantee described
- * in pre-order walk doesn't apply the same to post-order walks.
+ * node to be visited.
+ *
+ * If a subsystem synchronizes ->css_online() and the start of iteration, a
+ * css which finished ->css_online() is guaranteed to be visible in the
+ * future iterations and will stay visible until the last reference is put.
+ * A css which hasn't finished ->css_online() or already finished
+ * ->css_offline() may show up during traversal. It's each subsystem's
+ * responsibility to synchronize against on/offlining.
+ *
+ * Note that the walk visibility guarantee example described in pre-order
+ * walk doesn't apply the same to post-order walks.
*/
#define css_for_each_descendant_post(pos, css) \
for ((pos) = css_next_descendant_post(NULL, (css)); (pos); \
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index eaea062..e0953f7 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -3088,21 +3088,25 @@ static int cgroup_task_count(const struct cgroup *cgrp)
/**
* css_next_child - find the next child of a given css
- * @pos_css: the current position (%NULL to initiate traversal)
- * @parent_css: css whose children to walk
+ * @pos: the current position (%NULL to initiate traversal)
+ * @parent: css whose children to walk
*
- * This function returns the next child of @parent_css and should be called
+ * This function returns the next child of @parent and should be called
* under either cgroup_mutex or RCU read lock. The only requirement is
- * that @parent_css and @pos_css are accessible. The next sibling is
- * guaranteed to be returned regardless of their states.
+ * that @parent and @pos are accessible. The next sibling is guaranteed to
+ * be returned regardless of their states.
+ *
+ * If a subsystem synchronizes ->css_online() and the start of iteration, a
+ * css which finished ->css_online() is guaranteed to be visible in the
+ * future iterations and will stay visible until the last reference is put.
+ * A css which hasn't finished ->css_online() or already finished
+ * ->css_offline() may show up during traversal. It's each subsystem's
+ * responsibility to synchronize against on/offlining.
*/
-struct cgroup_subsys_state *
-css_next_child(struct cgroup_subsys_state *pos_css,
- struct cgroup_subsys_state *parent_css)
+struct cgroup_subsys_state *css_next_child(struct cgroup_subsys_state *pos,
+ struct cgroup_subsys_state *parent)
{
- struct cgroup *pos = pos_css ? pos_css->cgroup : NULL;
- struct cgroup *cgrp = parent_css->cgroup;
- struct cgroup *next;
+ struct cgroup_subsys_state *next;
cgroup_assert_mutex_or_rcu_locked();
@@ -3127,27 +3131,21 @@ css_next_child(struct cgroup_subsys_state *pos_css,
* races against release and the race window is very small.
*/
if (!pos) {
- next = list_entry_rcu(cgrp->self.children.next, struct cgroup, self.sibling);
- } else if (likely(!(pos->self.flags & CSS_RELEASED))) {
- next = list_entry_rcu(pos->self.sibling.next, struct cgroup, self.sibling);
+ next = list_entry_rcu(parent->children.next, struct cgroup_subsys_state, sibling);
+ } else if (likely(!(pos->flags & CSS_RELEASED))) {
+ next = list_entry_rcu(pos->sibling.next, struct cgroup_subsys_state, sibling);
} else {
- list_for_each_entry_rcu(next, &cgrp->self.children, self.sibling)
- if (next->self.serial_nr > pos->self.serial_nr)
+ list_for_each_entry_rcu(next, &parent->children, sibling)
+ if (next->serial_nr > pos->serial_nr)
break;
}
/*
* @next, if not pointing to the head, can be dereferenced and is
- * the next sibling; however, it might have @ss disabled. If so,
- * fast-forward to the next enabled one.
+ * the next sibling.
*/
- while (&next->self.sibling != &cgrp->self.children) {
- struct cgroup_subsys_state *next_css = cgroup_css(next, parent_css->ss);
-
- if (next_css)
- return next_css;
- next = list_entry_rcu(next->self.sibling.next, struct cgroup, self.sibling);
- }
+ if (&next->sibling != &parent->children)
+ return next;
return NULL;
}
@@ -3164,6 +3162,13 @@ css_next_child(struct cgroup_subsys_state *pos_css,
* doesn't require the whole traversal to be contained in a single critical
* section. This function will return the correct next descendant as long
* as both @pos and @root are accessible and @pos is a descendant of @root.
+ *
+ * If a subsystem synchronizes ->css_online() and the start of iteration, a
+ * css which finished ->css_online() is guaranteed to be visible in the
+ * future iterations and will stay visible until the last reference is put.
+ * A css which hasn't finished ->css_online() or already finished
+ * ->css_offline() may show up during traversal. It's each subsystem's
+ * responsibility to synchronize against on/offlining.
*/
struct cgroup_subsys_state *
css_next_descendant_pre(struct cgroup_subsys_state *pos,
@@ -3251,6 +3256,13 @@ css_leftmost_descendant(struct cgroup_subsys_state *pos)
* section. This function will return the correct next descendant as long
* as both @pos and @cgroup are accessible and @pos is a descendant of
* @cgroup.
+ *
+ * If a subsystem synchronizes ->css_online() and the start of iteration, a
+ * css which finished ->css_online() is guaranteed to be visible in the
+ * future iterations and will stay visible until the last reference is put.
+ * A css which hasn't finished ->css_online() or already finished
+ * ->css_offline() may show up during traversal. It's each subsystem's
+ * responsibility to synchronize against on/offlining.
*/
struct cgroup_subsys_state *
css_next_descendant_post(struct cgroup_subsys_state *pos,
--
1.9.0
^ permalink raw reply related [flat|nested] 48+ messages in thread
* [PATCH 11/14] cgroup: use CSS_ONLINE instead of CGRP_DEAD
2014-05-09 21:31 [PATCHSET cgroup/for-3.16] cgroup: iterate cgroup_subsys_states directly Tejun Heo
` (6 preceding siblings ...)
2014-05-09 21:31 ` [PATCH 10/14] cgroup: iterate cgroup_subsys_states directly Tejun Heo
@ 2014-05-09 21:31 ` Tejun Heo
2014-05-09 21:31 ` [PATCH 12/14] cgroup: convert cgroup_has_live_children() into css_has_online_children() Tejun Heo
` (2 subsequent siblings)
10 siblings, 0 replies; 48+ messages in thread
From: Tejun Heo @ 2014-05-09 21:31 UTC (permalink / raw)
To: lizefan; +Cc: cgroups, linux-kernel, hannes, Tejun Heo
Use CSS_ONLINE on the self css to indicate whether a cgroup has been
killed instead of CGRP_DEAD. This will allow re-using css online test
for cgroup liveliness test. This doesn't introduce any functional
change.
Signed-off-by: Tejun Heo <tj@kernel.org>
---
include/linux/cgroup.h | 2 --
kernel/cgroup.c | 7 ++++---
2 files changed, 4 insertions(+), 5 deletions(-)
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 58867ed..2cf2063 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -138,8 +138,6 @@ static inline void css_put(struct cgroup_subsys_state *css)
/* bits in struct cgroup flags field */
enum {
- /* Control Group is dead */
- CGRP_DEAD,
/*
* Control Group has previously had a child cgroup or a task,
* but no longer (only if CGRP_NOTIFY_ON_RELEASE is set)
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index e0953f7..a0cfe36 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -278,7 +278,7 @@ static struct cgroup_subsys_state *cgroup_e_css(struct cgroup *cgrp,
/* convenient tests for these bits */
static inline bool cgroup_is_dead(const struct cgroup *cgrp)
{
- return test_bit(CGRP_DEAD, &cgrp->flags);
+ return !(cgrp->self.flags & CSS_ONLINE);
}
struct cgroup_subsys_state *of_css(struct kernfs_open_file *of)
@@ -1518,6 +1518,7 @@ static void init_cgroup_housekeeping(struct cgroup *cgrp)
INIT_LIST_HEAD(&cgrp->pidlists);
mutex_init(&cgrp->pidlist_mutex);
cgrp->self.cgroup = cgrp;
+ cgrp->self.flags |= CSS_ONLINE;
for_each_subsys(ss, ssid)
INIT_LIST_HEAD(&cgrp->e_csets[ssid]);
@@ -4540,13 +4541,13 @@ static int cgroup_destroy_locked(struct cgroup *cgrp)
* Mark @cgrp dead. This prevents further task migration and child
* creation by disabling cgroup_lock_live_group().
*/
- set_bit(CGRP_DEAD, &cgrp->flags);
+ cgrp->self.flags &= ~CSS_ONLINE;
/* initiate massacre of all css's */
for_each_css(css, ssid, cgrp)
kill_css(css);
- /* CGRP_DEAD is set, remove from ->release_list for the last time */
+ /* CSS_ONLINE is clear, remove from ->release_list for the last time */
raw_spin_lock(&release_list_lock);
if (!list_empty(&cgrp->release_list))
list_del_init(&cgrp->release_list);
--
1.9.0
^ permalink raw reply related [flat|nested] 48+ messages in thread
* [PATCH 12/14] cgroup: convert cgroup_has_live_children() into css_has_online_children()
2014-05-09 21:31 [PATCHSET cgroup/for-3.16] cgroup: iterate cgroup_subsys_states directly Tejun Heo
` (7 preceding siblings ...)
2014-05-09 21:31 ` [PATCH 11/14] cgroup: use CSS_ONLINE instead of CGRP_DEAD Tejun Heo
@ 2014-05-09 21:31 ` Tejun Heo
2014-05-09 21:31 ` [PATCH 13/14] device_cgroup: use css_has_online_children() instead of has_children() Tejun Heo
2014-05-09 21:31 ` [PATCH 14/14] cgroup: implement css_tryget() Tejun Heo
10 siblings, 0 replies; 48+ messages in thread
From: Tejun Heo @ 2014-05-09 21:31 UTC (permalink / raw)
To: lizefan; +Cc: cgroups, linux-kernel, hannes, Tejun Heo
Now that cgroup liveliness and css onliness are the same state,
convert cgroup_has_live_children() into css_has_online_children() so
that it can be used for actual csses too. The function now uses
css_for_each_child() for iteration and is published.
Signed-off-by: Tejun Heo <tj@kernel.org>
---
include/linux/cgroup.h | 2 ++
kernel/cgroup.c | 32 ++++++++++++++++++++------------
2 files changed, 22 insertions(+), 12 deletions(-)
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 2cf2063..c8eb66e 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -868,6 +868,8 @@ css_next_descendant_post(struct cgroup_subsys_state *pos,
for ((pos) = css_next_descendant_post(NULL, (css)); (pos); \
(pos) = css_next_descendant_post((pos), (css)))
+bool css_has_online_children(struct cgroup_subsys_state *css);
+
/* A css_task_iter should be treated as an opaque object */
struct css_task_iter {
struct cgroup_subsys *ss;
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index a0cfe36..0b16631 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -175,7 +175,6 @@ static int need_forkexit_callback __read_mostly;
static struct cftype cgroup_base_files[];
static void cgroup_put(struct cgroup *cgrp);
-static bool cgroup_has_live_children(struct cgroup *cgrp);
static int rebind_subsystems(struct cgroup_root *dst_root,
unsigned int ss_mask);
static int cgroup_destroy_locked(struct cgroup *cgrp);
@@ -1768,7 +1767,7 @@ static void cgroup_kill_sb(struct super_block *sb)
* This prevents new mounts by disabling percpu_ref_tryget_live().
* cgroup_mount() may wait for @root's release.
*/
- if (cgroup_has_live_children(&root->cgrp))
+ if (css_has_online_children(&root->cgrp.self))
cgroup_put(&root->cgrp);
else
percpu_ref_kill(&root->cgrp.self.refcnt);
@@ -3290,19 +3289,28 @@ css_next_descendant_post(struct cgroup_subsys_state *pos,
return pos->parent;
}
-static bool cgroup_has_live_children(struct cgroup *cgrp)
+/**
+ * css_has_online_children - does a css have online children
+ * @css: the target css
+ *
+ * Returns %true if @css has any online children; otherwise, %false. This
+ * function can be called from any context but the caller is responsible
+ * for synchronizing against on/offlining as necessary.
+ */
+bool css_has_online_children(struct cgroup_subsys_state *css)
{
- struct cgroup *child;
+ struct cgroup_subsys_state *child;
+ bool ret = false;
rcu_read_lock();
- list_for_each_entry_rcu(child, &cgrp->self.children, self.sibling) {
- if (!cgroup_is_dead(child)) {
- rcu_read_unlock();
- return true;
+ css_for_each_child(child, css) {
+ if (css->flags & CSS_ONLINE) {
+ ret = true;
+ break;
}
}
rcu_read_unlock();
- return false;
+ return ret;
}
/**
@@ -4534,7 +4542,7 @@ static int cgroup_destroy_locked(struct cgroup *cgrp)
* ->self.children as dead children linger on it while being
* drained; otherwise, "rmdir parent/child parent" may fail.
*/
- if (cgroup_has_live_children(cgrp))
+ if (css_has_online_children(&cgrp->self))
return -EBUSY;
/*
@@ -5006,8 +5014,8 @@ void cgroup_exit(struct task_struct *tsk)
static void check_for_release(struct cgroup *cgrp)
{
- if (cgroup_is_releasable(cgrp) &&
- list_empty(&cgrp->cset_links) && !cgroup_has_live_children(cgrp)) {
+ if (cgroup_is_releasable(cgrp) && list_empty(&cgrp->cset_links) &&
+ !css_has_online_children(&cgrp->self)) {
/*
* Control Group is currently removeable. If it's not
* already queued for a userspace notification, queue
--
1.9.0
^ permalink raw reply related [flat|nested] 48+ messages in thread
* [PATCH 13/14] device_cgroup: use css_has_online_children() instead of has_children()
2014-05-09 21:31 [PATCHSET cgroup/for-3.16] cgroup: iterate cgroup_subsys_states directly Tejun Heo
` (8 preceding siblings ...)
2014-05-09 21:31 ` [PATCH 12/14] cgroup: convert cgroup_has_live_children() into css_has_online_children() Tejun Heo
@ 2014-05-09 21:31 ` Tejun Heo
[not found] ` <1399671091-23867-14-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
2014-05-14 12:53 ` Serge E. Hallyn
2014-05-09 21:31 ` [PATCH 14/14] cgroup: implement css_tryget() Tejun Heo
10 siblings, 2 replies; 48+ messages in thread
From: Tejun Heo @ 2014-05-09 21:31 UTC (permalink / raw)
To: lizefan
Cc: cgroups, linux-kernel, hannes, Tejun Heo, Aristeu Rozanski,
Serge Hallyn
devcgroup_update_access() wants to know whether there are child
cgroups which are online and visible to userland and has_children()
may return false positive. Replace it with css_has_online_children().
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Aristeu Rozanski <aris@redhat.com>
Cc: Serge Hallyn <serge.hallyn@ubuntu.com>
---
security/device_cgroup.c | 19 ++-----------------
1 file changed, 2 insertions(+), 17 deletions(-)
diff --git a/security/device_cgroup.c b/security/device_cgroup.c
index 75b4b18..22de334 100644
--- a/security/device_cgroup.c
+++ b/security/device_cgroup.c
@@ -475,21 +475,6 @@ static int propagate_exception(struct dev_cgroup *devcg_root,
return rc;
}
-static inline bool has_children(struct dev_cgroup *devcgroup)
-{
- bool ret;
-
- /*
- * FIXME: There may be lingering offline csses and this function
- * may return %true when there isn't any userland-visible child
- * which is incorrect for our purposes.
- */
- rcu_read_lock();
- ret = css_next_child(NULL, &devcgroup->css);
- rcu_read_unlock();
- return ret;
-}
-
/*
* Modify the exception list using allow/deny rules.
* CAP_SYS_ADMIN is needed for this. It's at least separate from CAP_MKNOD
@@ -522,7 +507,7 @@ static int devcgroup_update_access(struct dev_cgroup *devcgroup,
case 'a':
switch (filetype) {
case DEVCG_ALLOW:
- if (has_children(devcgroup))
+ if (css_has_online_children(&devcgroup->css))
return -EINVAL;
if (!may_allow_all(parent))
@@ -538,7 +523,7 @@ static int devcgroup_update_access(struct dev_cgroup *devcgroup,
return rc;
break;
case DEVCG_DENY:
- if (has_children(devcgroup))
+ if (css_has_online_children(&devcgroup->css))
return -EINVAL;
dev_exception_clean(devcgroup);
--
1.9.0
^ permalink raw reply related [flat|nested] 48+ messages in thread
* [PATCH 14/14] cgroup: implement css_tryget()
2014-05-09 21:31 [PATCHSET cgroup/for-3.16] cgroup: iterate cgroup_subsys_states directly Tejun Heo
` (9 preceding siblings ...)
2014-05-09 21:31 ` [PATCH 13/14] device_cgroup: use css_has_online_children() instead of has_children() Tejun Heo
@ 2014-05-09 21:31 ` Tejun Heo
[not found] ` <1399671091-23867-15-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
10 siblings, 1 reply; 48+ messages in thread
From: Tejun Heo @ 2014-05-09 21:31 UTC (permalink / raw)
To: lizefan; +Cc: cgroups, linux-kernel, hannes, Tejun Heo
Implement css_tryget() which tries to grab a cgroup_subsys_state's
reference as long as it already hasn't reached zero. Combined with
the recent css iterator changes to include offline && !released csses
during traversal, this can be used to access csses regardless of its
online state.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
---
include/linux/cgroup.h | 16 ++++++++++++++++
1 file changed, 16 insertions(+)
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index c8eb66e..6dd3867 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -111,6 +111,22 @@ static inline void css_get(struct cgroup_subsys_state *css)
}
/**
+ * css_tryget - try to obtain a reference on the specified css
+ * @css: target css
+ *
+ * Obtain a reference on @css unless it already has reached zero and is
+ * being released. This function doesn't care whether @css is on or
+ * offline. The caller naturally needs to ensure that @css is accessible
+ * but doesn't have to be holding a reference on it - IOW, RCU protected
+ * access is good enough for this function. Returns %true if a reference
+ * count was successfully obtained; %false otherwise.
+ */
+static inline bool css_tryget(struct cgroup_subsys_state *css)
+{
+ return percpu_ref_tryget(&css->refcnt);
+}
+
+/**
* css_tryget_online - try to obtain a reference on the specified css if online
* @css: target css
*
--
1.9.0
^ permalink raw reply related [flat|nested] 48+ messages in thread
* Re: [PATCH 01/14] cgroup: remove css_parent()
[not found] ` <1399671091-23867-2-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
@ 2014-05-11 1:47 ` David Miller
2014-05-11 13:02 ` Neil Horman
2014-05-13 18:50 ` [PATCH v2 " Tejun Heo
2 siblings, 0 replies; 48+ messages in thread
From: David Miller @ 2014-05-11 1:47 UTC (permalink / raw)
To: tj-DgEjT+Ai2ygdnm+yROfE0A
Cc: lizefan-hv44wF8Li93QT0dZR+AlfA, cgroups-u79uwXL29TY76Z2rM5mHXA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA,
hannes-druUgvl0LCNAfugRpC6u6w, vgoyal-H+wXaHxf7aLQT0dZR+AlfA,
axboe-tSWWG44O7X1aa/9Udqfwiw, a.p.zijlstra-/NLkJaSkS4VmR6Xm/wNWPw,
mhocko-AlSwsSmVLrQ, nhorman-2XuSBdqkA4R54TAoqtyWWQ
From: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
Date: Fri, 9 May 2014 17:31:18 -0400
> cgroup in general is moving towards using cgroup_subsys_state as the
> fundamental structural component and css_parent() was introduced to
> convert from using cgroup->parent to css->parent. It was quite some
> time ago and we're moving forward with making css more prominent.
>
> This patch drops the trivial wrapper css_parent() and let the users
> dereference css->parent. While at it, explicitly mark fields of css
> which are public and immutable.
>
> Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
For networking bits:
Acked-by: David S. Miller <davem-fT/PcQaiUtIeIZ0/mPfg9Q@public.gmane.org>
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH 14/14] cgroup: implement css_tryget()
[not found] ` <1399671091-23867-15-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
@ 2014-05-11 4:54 ` Johannes Weiner
[not found] ` <20140511045459.GA25009-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
2014-05-16 16:07 ` [PATCH v2 " Tejun Heo
1 sibling, 1 reply; 48+ messages in thread
From: Johannes Weiner @ 2014-05-11 4:54 UTC (permalink / raw)
To: Tejun Heo
Cc: lizefan-hv44wF8Li93QT0dZR+AlfA, cgroups-u79uwXL29TY76Z2rM5mHXA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA
On Fri, May 09, 2014 at 05:31:31PM -0400, Tejun Heo wrote:
> Implement css_tryget() which tries to grab a cgroup_subsys_state's
> reference as long as it already hasn't reached zero. Combined with
> the recent css iterator changes to include offline && !released csses
> during traversal, this can be used to access csses regardless of its
> online state.
>
> Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
> Cc: Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
> ---
> include/linux/cgroup.h | 16 ++++++++++++++++
> 1 file changed, 16 insertions(+)
>
> diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
> index c8eb66e..6dd3867 100644
> --- a/include/linux/cgroup.h
> +++ b/include/linux/cgroup.h
> @@ -111,6 +111,22 @@ static inline void css_get(struct cgroup_subsys_state *css)
> }
>
> /**
> + * css_tryget - try to obtain a reference on the specified css
> + * @css: target css
> + *
> + * Obtain a reference on @css unless it already has reached zero and is
> + * being released. This function doesn't care whether @css is on or
> + * offline. The caller naturally needs to ensure that @css is accessible
> + * but doesn't have to be holding a reference on it - IOW, RCU protected
> + * access is good enough for this function. Returns %true if a reference
> + * count was successfully obtained; %false otherwise.
> + */
> +static inline bool css_tryget(struct cgroup_subsys_state *css)
> +{
> + return percpu_ref_tryget(&css->refcnt);
> +}
percpu_ref_tryget() fails once killed (transitioned from per-cpu to
atomic mode), but exactly this happens during offlining and so this
would actually be equivalent to css_tryget_online(), no?
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH 14/14] cgroup: implement css_tryget()
[not found] ` <20140511045459.GA25009-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
@ 2014-05-11 12:38 ` Tejun Heo
0 siblings, 0 replies; 48+ messages in thread
From: Tejun Heo @ 2014-05-11 12:38 UTC (permalink / raw)
To: Johannes Weiner
Cc: lizefan-hv44wF8Li93QT0dZR+AlfA, cgroups-u79uwXL29TY76Z2rM5mHXA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA
On Sun, May 11, 2014 at 12:54:59AM -0400, Johannes Weiner wrote:
> > /**
> > + * css_tryget - try to obtain a reference on the specified css
> > + * @css: target css
> > + *
> > + * Obtain a reference on @css unless it already has reached zero and is
> > + * being released. This function doesn't care whether @css is on or
> > + * offline. The caller naturally needs to ensure that @css is accessible
> > + * but doesn't have to be holding a reference on it - IOW, RCU protected
> > + * access is good enough for this function. Returns %true if a reference
> > + * count was successfully obtained; %false otherwise.
> > + */
> > +static inline bool css_tryget(struct cgroup_subsys_state *css)
> > +{
> > + return percpu_ref_tryget(&css->refcnt);
> > +}
>
> percpu_ref_tryget() fails once killed (transitioned from per-cpu to
> atomic mode), but exactly this happens during offlining and so this
> would actually be equivalent to css_tryget_online(), no?
Not any more. percpu/for-3.16 already contains the updates to
percpu_ref_tryget[_live](). percpu_ref_tryget() succeeds as long as
the refcnt is above zero.
Thanks.
--
tejun
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH 01/14] cgroup: remove css_parent()
[not found] ` <1399671091-23867-2-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
2014-05-11 1:47 ` David Miller
@ 2014-05-11 13:02 ` Neil Horman
2014-05-13 18:50 ` [PATCH v2 " Tejun Heo
2 siblings, 0 replies; 48+ messages in thread
From: Neil Horman @ 2014-05-11 13:02 UTC (permalink / raw)
To: Tejun Heo
Cc: lizefan-hv44wF8Li93QT0dZR+AlfA, cgroups-u79uwXL29TY76Z2rM5mHXA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA,
hannes-druUgvl0LCNAfugRpC6u6w, Vivek Goyal, Jens Axboe,
Peter Zijlstra, Michal Hocko, David S. Miller
On Fri, May 09, 2014 at 05:31:18PM -0400, Tejun Heo wrote:
> cgroup in general is moving towards using cgroup_subsys_state as the
> fundamental structural component and css_parent() was introduced to
> convert from using cgroup->parent to css->parent. It was quite some
> time ago and we're moving forward with making css more prominent.
>
> This patch drops the trivial wrapper css_parent() and let the users
> dereference css->parent. While at it, explicitly mark fields of css
> which are public and immutable.
>
> Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
> Cc: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> Cc: Jens Axboe <axboe-tSWWG44O7X1aa/9Udqfwiw@public.gmane.org>
> Cc: Li Zefan <lizefan-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
> Cc: Peter Zijlstra <a.p.zijlstra-/NLkJaSkS4VmR6Xm/wNWPw@public.gmane.org>
> Cc: Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
> Cc: Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org>
> Cc: Neil Horman <nhorman-2XuSBdqkA4R54TAoqtyWWQ@public.gmane.org>
> Cc: "David S. Miller" <davem-fT/PcQaiUtIeIZ0/mPfg9Q@public.gmane.org>
> ---
> block/blk-cgroup.h | 2 +-
> include/linux/cgroup.h | 29 +++++++++++------------------
> kernel/cgroup.c | 8 ++++----
> kernel/cgroup_freezer.c | 2 +-
> kernel/cpuset.c | 2 +-
> kernel/sched/core.c | 2 +-
> kernel/sched/cpuacct.c | 2 +-
> mm/hugetlb_cgroup.c | 2 +-
> mm/memcontrol.c | 14 +++++++-------
> net/core/netclassid_cgroup.c | 2 +-
> net/core/netprio_cgroup.c | 2 +-
> security/device_cgroup.c | 6 +++---
> 12 files changed, 33 insertions(+), 40 deletions(-)
>
> diff --git a/block/blk-cgroup.h b/block/blk-cgroup.h
> index 371fe8e..d692b29 100644
> --- a/block/blk-cgroup.h
> +++ b/block/blk-cgroup.h
> @@ -204,7 +204,7 @@ static inline struct blkcg *bio_blkcg(struct bio *bio)
> */
> static inline struct blkcg *blkcg_parent(struct blkcg *blkcg)
> {
> - return css_to_blkcg(css_parent(&blkcg->css));
> + return css_to_blkcg(blkcg->css.parent);
> }
>
> /**
> diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
> index 76dadd77..88c4d03 100644
> --- a/include/linux/cgroup.h
> +++ b/include/linux/cgroup.h
> @@ -48,22 +48,28 @@ enum cgroup_subsys_id {
> };
> #undef SUBSYS
>
> -/* Per-subsystem/per-cgroup state maintained by the system. */
> +/*
> + * Per-subsystem/per-cgroup state maintained by the system. This is the
> + * fundamental structural building block that controllers deal with.
> + *
> + * Fields marked with "PI:" are public and immutable and may be accessed
> + * directly without synchronization.
> + */
> struct cgroup_subsys_state {
> - /* the cgroup that this css is attached to */
> + /* PI: the cgroup that this css is attached to */
> struct cgroup *cgroup;
>
> - /* the cgroup subsystem that this css is attached to */
> + /* PI: the cgroup subsystem that this css is attached to */
> struct cgroup_subsys *ss;
>
> /* reference count - access via css_[try]get() and css_put() */
> struct percpu_ref refcnt;
>
> - /* the parent css */
> + /* PI: the parent css */
> struct cgroup_subsys_state *parent;
>
> /*
> - * Subsys-unique ID. 0 is unused and root is always 1. The
> + * PI: Subsys-unique ID. 0 is unused and root is always 1. The
> * matching css can be looked up using css_from_id().
> */
> int id;
> @@ -665,19 +671,6 @@ struct cgroup_subsys {
> #undef SUBSYS
>
> /**
> - * css_parent - find the parent css
> - * @css: the target cgroup_subsys_state
> - *
> - * Return the parent css of @css. This function is guaranteed to return
> - * non-NULL parent as long as @css isn't the root.
> - */
> -static inline
> -struct cgroup_subsys_state *css_parent(struct cgroup_subsys_state *css)
> -{
> - return css->parent;
> -}
> -
> -/**
> * task_css_set_check - obtain a task's css_set with extra access conditions
> * @task: the task to obtain css_set for
> * @__c: extra condition expression to be passed to rcu_dereference_check()
> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
> index 64ff413..dab8f35 100644
> --- a/kernel/cgroup.c
> +++ b/kernel/cgroup.c
> @@ -3175,10 +3175,10 @@ css_next_descendant_pre(struct cgroup_subsys_state *pos,
>
> /* no child, visit my or the closest ancestor's next sibling */
> while (pos != root) {
> - next = css_next_child(pos, css_parent(pos));
> + next = css_next_child(pos, pos->parent);
> if (next)
> return next;
> - pos = css_parent(pos);
> + pos = pos->parent;
> }
>
> return NULL;
> @@ -3260,12 +3260,12 @@ css_next_descendant_post(struct cgroup_subsys_state *pos,
> return NULL;
>
> /* if there's an unvisited sibling, visit its leftmost descendant */
> - next = css_next_child(pos, css_parent(pos));
> + next = css_next_child(pos, pos->parent);
> if (next)
> return css_leftmost_descendant(next);
>
> /* no sibling left, visit parent */
> - return css_parent(pos);
> + return pos->parent;
> }
>
> static bool cgroup_has_live_children(struct cgroup *cgrp)
> diff --git a/kernel/cgroup_freezer.c b/kernel/cgroup_freezer.c
> index f52c443..fee1ae63 100644
> --- a/kernel/cgroup_freezer.c
> +++ b/kernel/cgroup_freezer.c
> @@ -59,7 +59,7 @@ static inline struct freezer *task_freezer(struct task_struct *task)
>
> static struct freezer *parent_freezer(struct freezer *freezer)
> {
> - return css_freezer(css_parent(&freezer->css));
> + return css_freezer(freezer->css.parent);
> }
>
> bool cgroup_freezing(struct task_struct *task)
> diff --git a/kernel/cpuset.c b/kernel/cpuset.c
> index 2f4b08b..5b2a310 100644
> --- a/kernel/cpuset.c
> +++ b/kernel/cpuset.c
> @@ -124,7 +124,7 @@ static inline struct cpuset *task_cs(struct task_struct *task)
>
> static inline struct cpuset *parent_cs(struct cpuset *cs)
> {
> - return css_cs(css_parent(&cs->css));
> + return css_cs(cs->css.parent);
> }
>
> #ifdef CONFIG_NUMA
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 268a45e..ac61ad1 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -7586,7 +7586,7 @@ cpu_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
> static int cpu_cgroup_css_online(struct cgroup_subsys_state *css)
> {
> struct task_group *tg = css_tg(css);
> - struct task_group *parent = css_tg(css_parent(css));
> + struct task_group *parent = css_tg(css->parent);
>
> if (parent)
> sched_online_group(tg, parent);
> diff --git a/kernel/sched/cpuacct.c b/kernel/sched/cpuacct.c
> index c143ee3..9cf350c 100644
> --- a/kernel/sched/cpuacct.c
> +++ b/kernel/sched/cpuacct.c
> @@ -46,7 +46,7 @@ static inline struct cpuacct *task_ca(struct task_struct *tsk)
>
> static inline struct cpuacct *parent_ca(struct cpuacct *ca)
> {
> - return css_ca(css_parent(&ca->css));
> + return css_ca(ca->css.parent);
> }
>
> static DEFINE_PER_CPU(u64, root_cpuacct_cpuusage);
> diff --git a/mm/hugetlb_cgroup.c b/mm/hugetlb_cgroup.c
> index a380681..493f758 100644
> --- a/mm/hugetlb_cgroup.c
> +++ b/mm/hugetlb_cgroup.c
> @@ -52,7 +52,7 @@ static inline bool hugetlb_cgroup_is_root(struct hugetlb_cgroup *h_cg)
> static inline struct hugetlb_cgroup *
> parent_hugetlb_cgroup(struct hugetlb_cgroup *h_cg)
> {
> - return hugetlb_cgroup_from_css(css_parent(&h_cg->css));
> + return hugetlb_cgroup_from_css(h_cg->css.parent);
> }
>
> static inline bool hugetlb_cgroup_have_usage(struct hugetlb_cgroup *h_cg)
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index b638a79..a5e0417 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1540,7 +1540,7 @@ static unsigned long mem_cgroup_margin(struct mem_cgroup *memcg)
> int mem_cgroup_swappiness(struct mem_cgroup *memcg)
> {
> /* root ? */
> - if (!css_parent(&memcg->css))
> + if (!memcg->css.parent)
> return vm_swappiness;
>
> return memcg->swappiness;
> @@ -4909,7 +4909,7 @@ static int mem_cgroup_hierarchy_write(struct cgroup_subsys_state *css,
> {
> int retval = 0;
> struct mem_cgroup *memcg = mem_cgroup_from_css(css);
> - struct mem_cgroup *parent_memcg = mem_cgroup_from_css(css_parent(&memcg->css));
> + struct mem_cgroup *parent_memcg = mem_cgroup_from_css(memcg->css.parent);
>
> mutex_lock(&memcg_create_mutex);
>
> @@ -5207,8 +5207,8 @@ static void memcg_get_hierarchical_limit(struct mem_cgroup *memcg,
> if (!memcg->use_hierarchy)
> goto out;
>
> - while (css_parent(&memcg->css)) {
> - memcg = mem_cgroup_from_css(css_parent(&memcg->css));
> + while (memcg->css.parent) {
> + memcg = mem_cgroup_from_css(memcg->css.parent);
> if (!memcg->use_hierarchy)
> break;
> tmp = res_counter_read_u64(&memcg->res, RES_LIMIT);
> @@ -5443,7 +5443,7 @@ static int mem_cgroup_swappiness_write(struct cgroup_subsys_state *css,
> struct cftype *cft, u64 val)
> {
> struct mem_cgroup *memcg = mem_cgroup_from_css(css);
> - struct mem_cgroup *parent = mem_cgroup_from_css(css_parent(&memcg->css));
> + struct mem_cgroup *parent = mem_cgroup_from_css(memcg->css.parent);
>
> if (val > 100 || !parent)
> return -EINVAL;
> @@ -5790,7 +5790,7 @@ static int mem_cgroup_oom_control_write(struct cgroup_subsys_state *css,
> struct cftype *cft, u64 val)
> {
> struct mem_cgroup *memcg = mem_cgroup_from_css(css);
> - struct mem_cgroup *parent = mem_cgroup_from_css(css_parent(&memcg->css));
> + struct mem_cgroup *parent = mem_cgroup_from_css(memcg->css.parent);
>
> /* cannot set to root cgroup and only 0 and 1 are allowed */
> if (!parent || !((val == 0) || (val == 1)))
> @@ -6407,7 +6407,7 @@ static int
> mem_cgroup_css_online(struct cgroup_subsys_state *css)
> {
> struct mem_cgroup *memcg = mem_cgroup_from_css(css);
> - struct mem_cgroup *parent = mem_cgroup_from_css(css_parent(css));
> + struct mem_cgroup *parent = mem_cgroup_from_css(css->parent);
>
> if (css->id > MEM_CGROUP_ID_MAX)
> return -ENOSPC;
> diff --git a/net/core/netclassid_cgroup.c b/net/core/netclassid_cgroup.c
> index 22931e1..30d903b 100644
> --- a/net/core/netclassid_cgroup.c
> +++ b/net/core/netclassid_cgroup.c
> @@ -42,7 +42,7 @@ cgrp_css_alloc(struct cgroup_subsys_state *parent_css)
> static int cgrp_css_online(struct cgroup_subsys_state *css)
> {
> struct cgroup_cls_state *cs = css_cls_state(css);
> - struct cgroup_cls_state *parent = css_cls_state(css_parent(css));
> + struct cgroup_cls_state *parent = css_cls_state(css->parent);
>
> if (parent)
> cs->classid = parent->classid;
> diff --git a/net/core/netprio_cgroup.c b/net/core/netprio_cgroup.c
> index b990cef..2f385b9 100644
> --- a/net/core/netprio_cgroup.c
> +++ b/net/core/netprio_cgroup.c
> @@ -140,7 +140,7 @@ cgrp_css_alloc(struct cgroup_subsys_state *parent_css)
>
> static int cgrp_css_online(struct cgroup_subsys_state *css)
> {
> - struct cgroup_subsys_state *parent_css = css_parent(css);
> + struct cgroup_subsys_state *parent_css = css->parent;
> struct net_device *dev;
> int ret = 0;
>
> diff --git a/security/device_cgroup.c b/security/device_cgroup.c
> index 82d6b4f..3116015 100644
> --- a/security/device_cgroup.c
> +++ b/security/device_cgroup.c
> @@ -182,7 +182,7 @@ static inline bool is_devcg_online(const struct dev_cgroup *devcg)
> static int devcgroup_online(struct cgroup_subsys_state *css)
> {
> struct dev_cgroup *dev_cgroup = css_to_devcgroup(css);
> - struct dev_cgroup *parent_dev_cgroup = css_to_devcgroup(css_parent(css));
> + struct dev_cgroup *parent_dev_cgroup = css_to_devcgroup(css->parent);
> int ret = 0;
>
> mutex_lock(&devcgroup_mutex);
> @@ -374,7 +374,7 @@ static bool may_access(struct dev_cgroup *dev_cgroup,
> static int parent_has_perm(struct dev_cgroup *childcg,
> struct dev_exception_item *ex)
> {
> - struct dev_cgroup *parent = css_to_devcgroup(css_parent(&childcg->css));
> + struct dev_cgroup *parent = css_to_devcgroup(childcg->css.parent);
>
> if (!parent)
> return 1;
> @@ -502,7 +502,7 @@ static int devcgroup_update_access(struct dev_cgroup *devcgroup,
> char temp[12]; /* 11 + 1 characters needed for a u32 */
> int count, rc = 0;
> struct dev_exception_item ex;
> - struct dev_cgroup *parent = css_to_devcgroup(css_parent(&devcgroup->css));
> + struct dev_cgroup *parent = css_to_devcgroup(devcgroup->css.parent);
>
> if (!capable(CAP_SYS_ADMIN))
> return -EPERM;
> --
> 1.9.0
>
>
For netprio and netclassid
Acked-by: Neil Horman <nhorman-2XuSBdqkA4R54TAoqtyWWQ@public.gmane.org>
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH 01/14] cgroup: remove css_parent()
2014-05-09 21:31 ` [PATCH 01/14] cgroup: remove css_parent() Tejun Heo
[not found] ` <1399671091-23867-2-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
@ 2014-05-12 13:16 ` Michal Hocko
1 sibling, 0 replies; 48+ messages in thread
From: Michal Hocko @ 2014-05-12 13:16 UTC (permalink / raw)
To: Tejun Heo
Cc: lizefan, cgroups, linux-kernel, hannes, Vivek Goyal, Jens Axboe,
Peter Zijlstra, Neil Horman, David S. Miller
On Fri 09-05-14 17:31:18, Tejun Heo wrote:
> cgroup in general is moving towards using cgroup_subsys_state as the
> fundamental structural component and css_parent() was introduced to
> convert from using cgroup->parent to css->parent. It was quite some
> time ago and we're moving forward with making css more prominent.
>
> This patch drops the trivial wrapper css_parent() and let the users
> dereference css->parent. While at it, explicitly mark fields of css
> which are public and immutable.
>
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Cc: Vivek Goyal <vgoyal@redhat.com>
> Cc: Jens Axboe <axboe@kernel.dk>
> Cc: Li Zefan <lizefan@huawei.com>
> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Michal Hocko <mhocko@suse.cz>
> Cc: Neil Horman <nhorman@tuxdriver.com>
> Cc: "David S. Miller" <davem@davemloft.net>
For memcg part
Acked-by: Michal Hocko <mhocko@suse.cz>
> ---
> block/blk-cgroup.h | 2 +-
> include/linux/cgroup.h | 29 +++++++++++------------------
> kernel/cgroup.c | 8 ++++----
> kernel/cgroup_freezer.c | 2 +-
> kernel/cpuset.c | 2 +-
> kernel/sched/core.c | 2 +-
> kernel/sched/cpuacct.c | 2 +-
> mm/hugetlb_cgroup.c | 2 +-
> mm/memcontrol.c | 14 +++++++-------
> net/core/netclassid_cgroup.c | 2 +-
> net/core/netprio_cgroup.c | 2 +-
> security/device_cgroup.c | 6 +++---
> 12 files changed, 33 insertions(+), 40 deletions(-)
>
> diff --git a/block/blk-cgroup.h b/block/blk-cgroup.h
> index 371fe8e..d692b29 100644
> --- a/block/blk-cgroup.h
> +++ b/block/blk-cgroup.h
> @@ -204,7 +204,7 @@ static inline struct blkcg *bio_blkcg(struct bio *bio)
> */
> static inline struct blkcg *blkcg_parent(struct blkcg *blkcg)
> {
> - return css_to_blkcg(css_parent(&blkcg->css));
> + return css_to_blkcg(blkcg->css.parent);
> }
>
> /**
> diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
> index 76dadd77..88c4d03 100644
> --- a/include/linux/cgroup.h
> +++ b/include/linux/cgroup.h
> @@ -48,22 +48,28 @@ enum cgroup_subsys_id {
> };
> #undef SUBSYS
>
> -/* Per-subsystem/per-cgroup state maintained by the system. */
> +/*
> + * Per-subsystem/per-cgroup state maintained by the system. This is the
> + * fundamental structural building block that controllers deal with.
> + *
> + * Fields marked with "PI:" are public and immutable and may be accessed
> + * directly without synchronization.
> + */
> struct cgroup_subsys_state {
> - /* the cgroup that this css is attached to */
> + /* PI: the cgroup that this css is attached to */
> struct cgroup *cgroup;
>
> - /* the cgroup subsystem that this css is attached to */
> + /* PI: the cgroup subsystem that this css is attached to */
> struct cgroup_subsys *ss;
>
> /* reference count - access via css_[try]get() and css_put() */
> struct percpu_ref refcnt;
>
> - /* the parent css */
> + /* PI: the parent css */
> struct cgroup_subsys_state *parent;
>
> /*
> - * Subsys-unique ID. 0 is unused and root is always 1. The
> + * PI: Subsys-unique ID. 0 is unused and root is always 1. The
> * matching css can be looked up using css_from_id().
> */
> int id;
> @@ -665,19 +671,6 @@ struct cgroup_subsys {
> #undef SUBSYS
>
> /**
> - * css_parent - find the parent css
> - * @css: the target cgroup_subsys_state
> - *
> - * Return the parent css of @css. This function is guaranteed to return
> - * non-NULL parent as long as @css isn't the root.
> - */
> -static inline
> -struct cgroup_subsys_state *css_parent(struct cgroup_subsys_state *css)
> -{
> - return css->parent;
> -}
> -
> -/**
> * task_css_set_check - obtain a task's css_set with extra access conditions
> * @task: the task to obtain css_set for
> * @__c: extra condition expression to be passed to rcu_dereference_check()
> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
> index 64ff413..dab8f35 100644
> --- a/kernel/cgroup.c
> +++ b/kernel/cgroup.c
> @@ -3175,10 +3175,10 @@ css_next_descendant_pre(struct cgroup_subsys_state *pos,
>
> /* no child, visit my or the closest ancestor's next sibling */
> while (pos != root) {
> - next = css_next_child(pos, css_parent(pos));
> + next = css_next_child(pos, pos->parent);
> if (next)
> return next;
> - pos = css_parent(pos);
> + pos = pos->parent;
> }
>
> return NULL;
> @@ -3260,12 +3260,12 @@ css_next_descendant_post(struct cgroup_subsys_state *pos,
> return NULL;
>
> /* if there's an unvisited sibling, visit its leftmost descendant */
> - next = css_next_child(pos, css_parent(pos));
> + next = css_next_child(pos, pos->parent);
> if (next)
> return css_leftmost_descendant(next);
>
> /* no sibling left, visit parent */
> - return css_parent(pos);
> + return pos->parent;
> }
>
> static bool cgroup_has_live_children(struct cgroup *cgrp)
> diff --git a/kernel/cgroup_freezer.c b/kernel/cgroup_freezer.c
> index f52c443..fee1ae63 100644
> --- a/kernel/cgroup_freezer.c
> +++ b/kernel/cgroup_freezer.c
> @@ -59,7 +59,7 @@ static inline struct freezer *task_freezer(struct task_struct *task)
>
> static struct freezer *parent_freezer(struct freezer *freezer)
> {
> - return css_freezer(css_parent(&freezer->css));
> + return css_freezer(freezer->css.parent);
> }
>
> bool cgroup_freezing(struct task_struct *task)
> diff --git a/kernel/cpuset.c b/kernel/cpuset.c
> index 2f4b08b..5b2a310 100644
> --- a/kernel/cpuset.c
> +++ b/kernel/cpuset.c
> @@ -124,7 +124,7 @@ static inline struct cpuset *task_cs(struct task_struct *task)
>
> static inline struct cpuset *parent_cs(struct cpuset *cs)
> {
> - return css_cs(css_parent(&cs->css));
> + return css_cs(cs->css.parent);
> }
>
> #ifdef CONFIG_NUMA
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 268a45e..ac61ad1 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -7586,7 +7586,7 @@ cpu_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
> static int cpu_cgroup_css_online(struct cgroup_subsys_state *css)
> {
> struct task_group *tg = css_tg(css);
> - struct task_group *parent = css_tg(css_parent(css));
> + struct task_group *parent = css_tg(css->parent);
>
> if (parent)
> sched_online_group(tg, parent);
> diff --git a/kernel/sched/cpuacct.c b/kernel/sched/cpuacct.c
> index c143ee3..9cf350c 100644
> --- a/kernel/sched/cpuacct.c
> +++ b/kernel/sched/cpuacct.c
> @@ -46,7 +46,7 @@ static inline struct cpuacct *task_ca(struct task_struct *tsk)
>
> static inline struct cpuacct *parent_ca(struct cpuacct *ca)
> {
> - return css_ca(css_parent(&ca->css));
> + return css_ca(ca->css.parent);
> }
>
> static DEFINE_PER_CPU(u64, root_cpuacct_cpuusage);
> diff --git a/mm/hugetlb_cgroup.c b/mm/hugetlb_cgroup.c
> index a380681..493f758 100644
> --- a/mm/hugetlb_cgroup.c
> +++ b/mm/hugetlb_cgroup.c
> @@ -52,7 +52,7 @@ static inline bool hugetlb_cgroup_is_root(struct hugetlb_cgroup *h_cg)
> static inline struct hugetlb_cgroup *
> parent_hugetlb_cgroup(struct hugetlb_cgroup *h_cg)
> {
> - return hugetlb_cgroup_from_css(css_parent(&h_cg->css));
> + return hugetlb_cgroup_from_css(h_cg->css.parent);
> }
>
> static inline bool hugetlb_cgroup_have_usage(struct hugetlb_cgroup *h_cg)
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index b638a79..a5e0417 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1540,7 +1540,7 @@ static unsigned long mem_cgroup_margin(struct mem_cgroup *memcg)
> int mem_cgroup_swappiness(struct mem_cgroup *memcg)
> {
> /* root ? */
> - if (!css_parent(&memcg->css))
> + if (!memcg->css.parent)
> return vm_swappiness;
>
> return memcg->swappiness;
> @@ -4909,7 +4909,7 @@ static int mem_cgroup_hierarchy_write(struct cgroup_subsys_state *css,
> {
> int retval = 0;
> struct mem_cgroup *memcg = mem_cgroup_from_css(css);
> - struct mem_cgroup *parent_memcg = mem_cgroup_from_css(css_parent(&memcg->css));
> + struct mem_cgroup *parent_memcg = mem_cgroup_from_css(memcg->css.parent);
>
> mutex_lock(&memcg_create_mutex);
>
> @@ -5207,8 +5207,8 @@ static void memcg_get_hierarchical_limit(struct mem_cgroup *memcg,
> if (!memcg->use_hierarchy)
> goto out;
>
> - while (css_parent(&memcg->css)) {
> - memcg = mem_cgroup_from_css(css_parent(&memcg->css));
> + while (memcg->css.parent) {
> + memcg = mem_cgroup_from_css(memcg->css.parent);
> if (!memcg->use_hierarchy)
> break;
> tmp = res_counter_read_u64(&memcg->res, RES_LIMIT);
> @@ -5443,7 +5443,7 @@ static int mem_cgroup_swappiness_write(struct cgroup_subsys_state *css,
> struct cftype *cft, u64 val)
> {
> struct mem_cgroup *memcg = mem_cgroup_from_css(css);
> - struct mem_cgroup *parent = mem_cgroup_from_css(css_parent(&memcg->css));
> + struct mem_cgroup *parent = mem_cgroup_from_css(memcg->css.parent);
>
> if (val > 100 || !parent)
> return -EINVAL;
> @@ -5790,7 +5790,7 @@ static int mem_cgroup_oom_control_write(struct cgroup_subsys_state *css,
> struct cftype *cft, u64 val)
> {
> struct mem_cgroup *memcg = mem_cgroup_from_css(css);
> - struct mem_cgroup *parent = mem_cgroup_from_css(css_parent(&memcg->css));
> + struct mem_cgroup *parent = mem_cgroup_from_css(memcg->css.parent);
>
> /* cannot set to root cgroup and only 0 and 1 are allowed */
> if (!parent || !((val == 0) || (val == 1)))
> @@ -6407,7 +6407,7 @@ static int
> mem_cgroup_css_online(struct cgroup_subsys_state *css)
> {
> struct mem_cgroup *memcg = mem_cgroup_from_css(css);
> - struct mem_cgroup *parent = mem_cgroup_from_css(css_parent(css));
> + struct mem_cgroup *parent = mem_cgroup_from_css(css->parent);
>
> if (css->id > MEM_CGROUP_ID_MAX)
> return -ENOSPC;
> diff --git a/net/core/netclassid_cgroup.c b/net/core/netclassid_cgroup.c
> index 22931e1..30d903b 100644
> --- a/net/core/netclassid_cgroup.c
> +++ b/net/core/netclassid_cgroup.c
> @@ -42,7 +42,7 @@ cgrp_css_alloc(struct cgroup_subsys_state *parent_css)
> static int cgrp_css_online(struct cgroup_subsys_state *css)
> {
> struct cgroup_cls_state *cs = css_cls_state(css);
> - struct cgroup_cls_state *parent = css_cls_state(css_parent(css));
> + struct cgroup_cls_state *parent = css_cls_state(css->parent);
>
> if (parent)
> cs->classid = parent->classid;
> diff --git a/net/core/netprio_cgroup.c b/net/core/netprio_cgroup.c
> index b990cef..2f385b9 100644
> --- a/net/core/netprio_cgroup.c
> +++ b/net/core/netprio_cgroup.c
> @@ -140,7 +140,7 @@ cgrp_css_alloc(struct cgroup_subsys_state *parent_css)
>
> static int cgrp_css_online(struct cgroup_subsys_state *css)
> {
> - struct cgroup_subsys_state *parent_css = css_parent(css);
> + struct cgroup_subsys_state *parent_css = css->parent;
> struct net_device *dev;
> int ret = 0;
>
> diff --git a/security/device_cgroup.c b/security/device_cgroup.c
> index 82d6b4f..3116015 100644
> --- a/security/device_cgroup.c
> +++ b/security/device_cgroup.c
> @@ -182,7 +182,7 @@ static inline bool is_devcg_online(const struct dev_cgroup *devcg)
> static int devcgroup_online(struct cgroup_subsys_state *css)
> {
> struct dev_cgroup *dev_cgroup = css_to_devcgroup(css);
> - struct dev_cgroup *parent_dev_cgroup = css_to_devcgroup(css_parent(css));
> + struct dev_cgroup *parent_dev_cgroup = css_to_devcgroup(css->parent);
> int ret = 0;
>
> mutex_lock(&devcgroup_mutex);
> @@ -374,7 +374,7 @@ static bool may_access(struct dev_cgroup *dev_cgroup,
> static int parent_has_perm(struct dev_cgroup *childcg,
> struct dev_exception_item *ex)
> {
> - struct dev_cgroup *parent = css_to_devcgroup(css_parent(&childcg->css));
> + struct dev_cgroup *parent = css_to_devcgroup(childcg->css.parent);
>
> if (!parent)
> return 1;
> @@ -502,7 +502,7 @@ static int devcgroup_update_access(struct dev_cgroup *devcgroup,
> char temp[12]; /* 11 + 1 characters needed for a u32 */
> int count, rc = 0;
> struct dev_exception_item ex;
> - struct dev_cgroup *parent = css_to_devcgroup(css_parent(&devcgroup->css));
> + struct dev_cgroup *parent = css_to_devcgroup(devcgroup->css.parent);
>
> if (!capable(CAP_SYS_ADMIN))
> return -EPERM;
> --
> 1.9.0
>
--
Michal Hocko
SUSE Labs
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH 02/14] cgroup: remove pointless has tasks/children test from mem_cgroup_force_empty()
[not found] ` <1399671091-23867-3-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
@ 2014-05-12 14:53 ` Michal Hocko
[not found] ` <20140512145324.GE9564-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
0 siblings, 1 reply; 48+ messages in thread
From: Michal Hocko @ 2014-05-12 14:53 UTC (permalink / raw)
To: Tejun Heo
Cc: lizefan-hv44wF8Li93QT0dZR+AlfA, cgroups-u79uwXL29TY76Z2rM5mHXA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA,
hannes-druUgvl0LCNAfugRpC6u6w
On Fri 09-05-14 17:31:19, Tejun Heo wrote:
> mem_cgroup_force_empty() is used only from
> mem_cgroup_force_empty_write() and tests whether the target memcg has
> any tasks or children without any synchronization and then returns
> -EBUSY if so.
>
> This is just weird. The tests don't really mean anything as tasks and
> children may be added after the tests and it also makes the behavior
> of the knob arbitrary because there may be lingering offline and
> removed children on the children list for extended period of time -
> writes to the knob can return -EBUSY for reasons completely invisible
> to userland.
Agreed.
> The knob is best-effort anyway and the broken business test doesn't
> affect its operation. Remove it.
But I do not think that removing just the test is the right way to go.
It is mem_cgroup_reparent_charges which bothers me because it loops
until the current counter falls down to 0 and it also feels strange that
a group can hand over pages up the hierarchy (or even to an unlimitted
root if this is a top of a hierarchy).
So I think that we want to get rid of reparenting as well. The main use
case as described by the documentation is:
"
The typical use case for this interface is before calling rmdir().
Because rmdir() moves all pages to parent, some out-of-use page caches can be
moved to the parent. If you want to avoid that, force_empty will be useful.
"
rmdir will reparent pages implicitly so the reclaim part should be
sufficient and it would be OK even with existing tasks. Subgroups would
be little bit more tricky because the user doesn't have any control over
which group of the hierarchy will get reclaimed.
Anyway, the knob sucks - for the similar reasons why drop_cache sucks -
especially when the check would be gone and it would be another way how
to trigger reclaim anytime. Having the check doesn't prevent from races
but it at least prevents abuse.
So I would be rather for dropping the knob altogether, but that seems to
be a long term thing. So let's start with deprecating it + remove the
check with dropping reparenting part.
What do you think about the following patch instead:
---
From 03f8cb2e1fd2636d859c54df9b58719fe96e0e54 Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org>
Date: Mon, 12 May 2014 16:34:17 +0200
Subject: [PATCH] memcg: remove tasks/children test from from
mem_cgroup_force_empty
Tejun has correctly pointed out that tasks/children test in
mem_cgroup_force_empty is not correct because there is no other locking
which preserves this state throughout the rest of the function so both
new tasks can join the group or new children groups can be added while
somebody is writing to memory.force_empty. A new task would break
mem_cgroup_reparent_charges expectation that all failures as described
by mem_cgroup_force_empty_list are temporal and there is no way out.
The main use case for the knob as described by
Documentation/cgroups/memory.txt is to:
"
The typical use case for this interface is before calling rmdir().
Because rmdir() moves all pages to parent, some out-of-use page caches can be
moved to the parent. If you want to avoid that, force_empty will be useful.
"
This means that reparenting is not really required as rmdir will
reparent pages implicitly from the safe context. If we remove it from
mem_cgroup_force_empty then we are safe even with existing tasks because
the number of reclaim attempts is bounded. Moreover the knob still does
what the documentation claims (modulo reparenting which doesn't make any
difference) and users might expect. Longterm we want to deprecate the
whole knob and put the reparented pages to the tail of parent LRU during
cgroup removal.
Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
Signed-off-by: Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org>
---
Documentation/cgroups/memory.txt | 6 +-----
mm/memcontrol.c | 6 ------
2 files changed, 1 insertion(+), 11 deletions(-)
diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
index cfcf14de2598..fc9fad984bfb 100644
--- a/Documentation/cgroups/memory.txt
+++ b/Documentation/cgroups/memory.txt
@@ -462,15 +462,11 @@ About use_hierarchy, see Section 6.
5.1 force_empty
memory.force_empty interface is provided to make cgroup's memory usage empty.
- You can use this interface only when the cgroup has no tasks.
When writing anything to this
# echo 0 > memory.force_empty
- Almost all pages tracked by this memory cgroup will be unmapped and freed.
- Some pages cannot be freed because they are locked or in-use. Such pages are
- moved to parent (if use_hierarchy==1) or root (if use_hierarchy==0) and this
- cgroup will be empty.
+ the cgroup will be reclaimed and as many pages reclaimed as possible.
The typical use case for this interface is before calling rmdir().
Because rmdir() moves all pages to parent, some out-of-use page caches can be
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index f9b84fc2e9fe..912104d6d2a9 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4764,10 +4764,6 @@ static int mem_cgroup_force_empty(struct mem_cgroup *memcg)
int nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
struct cgroup *cgrp = memcg->css.cgroup;
- /* returns EBUSY if there is a task or if we come here twice. */
- if (cgroup_has_tasks(cgrp) || !list_empty(&cgrp->children))
- return -EBUSY;
-
/* we call try-to-free pages for make this cgroup empty */
lru_add_drain_all();
/* try to free all pages in this cgroup */
@@ -4786,8 +4782,6 @@ static int mem_cgroup_force_empty(struct mem_cgroup *memcg)
}
}
- lru_add_drain();
- mem_cgroup_reparent_charges(memcg);
return 0;
}
--
2.0.0.rc0
--
Michal Hocko
SUSE Labs
^ permalink raw reply related [flat|nested] 48+ messages in thread
* [PATCH] memcg: deprecate memory.force_empty knob
[not found] ` <20140512145324.GE9564-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
@ 2014-05-12 14:58 ` Michal Hocko
[not found] ` <20140512145803.GF9564-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
2014-05-12 14:59 ` [PATCH 02/14] cgroup: remove pointless has tasks/children test from mem_cgroup_force_empty() Tejun Heo
` (3 subsequent siblings)
4 siblings, 1 reply; 48+ messages in thread
From: Michal Hocko @ 2014-05-12 14:58 UTC (permalink / raw)
To: Tejun Heo
Cc: lizefan-hv44wF8Li93QT0dZR+AlfA, cgroups-u79uwXL29TY76Z2rM5mHXA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA,
hannes-druUgvl0LCNAfugRpC6u6w
And this one for deprecating force_empty.
---
From 9bb3119900baa07b92fac932991cf94dd930f907 Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org>
Date: Mon, 12 May 2014 16:20:46 +0200
Subject: [PATCH] memcg: deprecate memory.force_empty knob
force_empty has been introduced primarily to drop memory before it gets
reparented on the group removal. This alone doesn't sound fully
justified because reparented pages which are not in use can reclaimed
also later when there is a memory pressure on the parent level.
Signed-off-by: Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org>
---
Documentation/cgroups/memory.txt | 3 +++
mm/memcontrol.c | 4 ++++
2 files changed, 7 insertions(+)
diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
index f0f67b44ea07..fc9fad984bfb 100644
--- a/Documentation/cgroups/memory.txt
+++ b/Documentation/cgroups/memory.txt
@@ -477,6 +477,9 @@ About use_hierarchy, see Section 6.
write will still return success. In this case, it is expected that
memory.kmem.usage_in_bytes == memory.usage_in_bytes.
+ Please note that this knob is considered deprecated and will be removed
+ in future.
+
About use_hierarchy, see Section 6.
5.2 stat file
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index b030b15b626a..912104d6d2a9 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4793,6 +4793,10 @@ static int mem_cgroup_force_empty_write(struct cgroup_subsys_state *css,
if (mem_cgroup_is_root(memcg))
return -EINVAL;
+ pr_info("%s (%d): memory.force_empty is deprecated and will be removed.",
+ current->comm, task_pid_nr(current));
+ pr_cont(" Let us know if you know if it needed in your usecase at");
+ pr_cont(" linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org\n");
return mem_cgroup_force_empty(memcg);
}
--
2.0.0.rc0
--
Michal Hocko
SUSE Labs
^ permalink raw reply related [flat|nested] 48+ messages in thread
* Re: [PATCH 02/14] cgroup: remove pointless has tasks/children test from mem_cgroup_force_empty()
[not found] ` <20140512145324.GE9564-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
2014-05-12 14:58 ` [PATCH] memcg: deprecate memory.force_empty knob Michal Hocko
@ 2014-05-12 14:59 ` Tejun Heo
[not found] ` <20140512145927.GA1421-Gd/HAXX7CRxy/B6EtB590w@public.gmane.org>
2014-05-13 13:10 ` Johannes Weiner
` (2 subsequent siblings)
4 siblings, 1 reply; 48+ messages in thread
From: Tejun Heo @ 2014-05-12 14:59 UTC (permalink / raw)
To: Michal Hocko
Cc: lizefan-hv44wF8Li93QT0dZR+AlfA, cgroups-u79uwXL29TY76Z2rM5mHXA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA,
hannes-druUgvl0LCNAfugRpC6u6w
Hello,
On Mon, May 12, 2014 at 04:53:24PM +0200, Michal Hocko wrote:
> What do you think about the following patch instead:
As long as the direct ->children dereference is gone, I have no
objection and yes the knob's purpose seems weird at best.
> ---
> From 03f8cb2e1fd2636d859c54df9b58719fe96e0e54 Mon Sep 17 00:00:00 2001
> From: Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org>
> Date: Mon, 12 May 2014 16:34:17 +0200
> Subject: [PATCH] memcg: remove tasks/children test from from
> mem_cgroup_force_empty
>
> Tejun has correctly pointed out that tasks/children test in
> mem_cgroup_force_empty is not correct because there is no other locking
> which preserves this state throughout the rest of the function so both
> new tasks can join the group or new children groups can be added while
> somebody is writing to memory.force_empty. A new task would break
> mem_cgroup_reparent_charges expectation that all failures as described
> by mem_cgroup_force_empty_list are temporal and there is no way out.
>
> The main use case for the knob as described by
> Documentation/cgroups/memory.txt is to:
> "
> The typical use case for this interface is before calling rmdir().
> Because rmdir() moves all pages to parent, some out-of-use page caches can be
> moved to the parent. If you want to avoid that, force_empty will be useful.
> "
>
> This means that reparenting is not really required as rmdir will
> reparent pages implicitly from the safe context. If we remove it from
> mem_cgroup_force_empty then we are safe even with existing tasks because
> the number of reclaim attempts is bounded. Moreover the knob still does
> what the documentation claims (modulo reparenting which doesn't make any
> difference) and users might expect. Longterm we want to deprecate the
> whole knob and put the reparented pages to the tail of parent LRU during
> cgroup removal.
>
> Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
> Signed-off-by: Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org>
Can I roll this into my series so that I can put this before changes
which depend on direct ->children usages being removed?
Thanks.
--
tejun
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH] memcg: deprecate memory.force_empty knob
[not found] ` <20140512145803.GF9564-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
@ 2014-05-12 15:00 ` Tejun Heo
[not found] ` <20140512150014.GB1421-Gd/HAXX7CRxy/B6EtB590w@public.gmane.org>
0 siblings, 1 reply; 48+ messages in thread
From: Tejun Heo @ 2014-05-12 15:00 UTC (permalink / raw)
To: Michal Hocko
Cc: lizefan-hv44wF8Li93QT0dZR+AlfA, cgroups-u79uwXL29TY76Z2rM5mHXA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA,
hannes-druUgvl0LCNAfugRpC6u6w
On Mon, May 12, 2014 at 04:58:03PM +0200, Michal Hocko wrote:
> @@ -4793,6 +4793,10 @@ static int mem_cgroup_force_empty_write(struct cgroup_subsys_state *css,
>
> if (mem_cgroup_is_root(memcg))
> return -EINVAL;
> + pr_info("%s (%d): memory.force_empty is deprecated and will be removed.",
> + current->comm, task_pid_nr(current));
> + pr_cont(" Let us know if you know if it needed in your usecase at");
> + pr_cont(" linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org\n");
> return mem_cgroup_force_empty(memcg);
It probably would be way easier to just mark the knob with
CFTYPE_INSANE.
Thanks.
--
tejun
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH 03/14] memcg: update memcg_has_children() to use css_next_child()
[not found] ` <1399671091-23867-4-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
@ 2014-05-12 15:18 ` Michal Hocko
0 siblings, 0 replies; 48+ messages in thread
From: Michal Hocko @ 2014-05-12 15:18 UTC (permalink / raw)
To: Tejun Heo
Cc: lizefan-hv44wF8Li93QT0dZR+AlfA, cgroups-u79uwXL29TY76Z2rM5mHXA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA,
hannes-druUgvl0LCNAfugRpC6u6w
On Fri 09-05-14 17:31:20, Tejun Heo wrote:
> Currently, memcg_has_children() and mem_cgroup_hierarchy_write()
> directly test cgroup->children for list emptiness. It's semantically
> correct in traditional hierarchies as it actually wants to test for
> any children dead or alive; however, cgroup->children is not a
> published field and scheduled to go away.
>
> This patch moves out .use_hierarchy test out of memcg_has_children()
> and updates it to use css_next_child() to test whether there exists
> any children. With .use_hierarchy test moved out, it can also be used
> by mem_cgroup_hierarchy_write().
I hope we will not grow more users of memcg_has_children but it would be
good to at least add a comment that this has to be checked by the
caller because we consider parent/children always when use_hierarchy is
used.
> A side note: As .use_hierarchy is going away, it doesn't really matter
> but I'm not sure about how it's used in __memcg_activate_kmem(). The
> condition tested by memcg_has_children() is mushy when seen from
> userland as its result is affected by dead csses which aren't visible
> from userland. I think the rule would be a lot clearer if we have a
> dedicated "freshly minted" flag which gets cleared when the first task
> is migrated into it or the first child is created and then gate
> activation with that.
This would work. KMEM_ACCOUNTED_ALLOWED in kmem_account_flags would be a
candidate.
> Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
> Cc: Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
> Cc: Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org>
Acked-by: Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org>
> ---
> mm/memcontrol.c | 26 ++++++++++++++++----------
> 1 file changed, 16 insertions(+), 10 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 036453a..eb6e1b5 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -4834,18 +4834,23 @@ static void mem_cgroup_reparent_charges(struct mem_cgroup *memcg)
> } while (usage > 0);
> }
>
> +/* test whether @memcg has children, dead or alive */
> static inline bool memcg_has_children(struct mem_cgroup *memcg)
> {
> - lockdep_assert_held(&memcg_create_mutex);
> + bool ret;
> +
> /*
> - * The lock does not prevent addition or deletion to the list
> - * of children, but it prevents a new child from being
> - * initialized based on this parent in css_online(), so it's
> - * enough to decide whether hierarchically inherited
> - * attributes can still be changed or not.
> + * The lock does not prevent addition or deletion of children, but
> + * it prevents a new child from being initialized based on this
> + * parent in css_online(), so it's enough to decide whether
> + * hierarchically inherited attributes can still be changed or not.
> */
> - return memcg->use_hierarchy &&
> - !list_empty(&memcg->css.cgroup->children);
> + lockdep_assert_held(&memcg_create_mutex);
> +
> + rcu_read_lock();
> + ret = css_next_child(NULL, &memcg->css);
> + rcu_read_unlock();
> + return ret;
> }
>
> /*
> @@ -4921,7 +4926,7 @@ static int mem_cgroup_hierarchy_write(struct cgroup_subsys_state *css,
> */
> if ((!parent_memcg || !parent_memcg->use_hierarchy) &&
> (val == 1 || val == 0)) {
> - if (list_empty(&memcg->css.cgroup->children))
> + if (!memcg_has_children(memcg))
> memcg->use_hierarchy = val;
> else
> retval = -EBUSY;
> @@ -5038,7 +5043,8 @@ static int __memcg_activate_kmem(struct mem_cgroup *memcg,
> * of course permitted.
> */
> mutex_lock(&memcg_create_mutex);
> - if (cgroup_has_tasks(memcg->css.cgroup) || memcg_has_children(memcg))
> + if (cgroup_has_tasks(memcg->css.cgroup) ||
> + (memcg->use_hierarchy && memcg_has_children(memcg)))
> err = -EBUSY;
> mutex_unlock(&memcg_create_mutex);
> if (err)
> --
> 1.9.0
>
--
Michal Hocko
SUSE Labs
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH] memcg: deprecate memory.force_empty knob
[not found] ` <20140512150014.GB1421-Gd/HAXX7CRxy/B6EtB590w@public.gmane.org>
@ 2014-05-12 15:20 ` Michal Hocko
2014-05-12 15:25 ` Tejun Heo
0 siblings, 1 reply; 48+ messages in thread
From: Michal Hocko @ 2014-05-12 15:20 UTC (permalink / raw)
To: Tejun Heo
Cc: lizefan-hv44wF8Li93QT0dZR+AlfA, cgroups-u79uwXL29TY76Z2rM5mHXA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA,
hannes-druUgvl0LCNAfugRpC6u6w
On Mon 12-05-14 11:00:14, Tejun Heo wrote:
> On Mon, May 12, 2014 at 04:58:03PM +0200, Michal Hocko wrote:
> > @@ -4793,6 +4793,10 @@ static int mem_cgroup_force_empty_write(struct cgroup_subsys_state *css,
> >
> > if (mem_cgroup_is_root(memcg))
> > return -EINVAL;
> > + pr_info("%s (%d): memory.force_empty is deprecated and will be removed.",
> > + current->comm, task_pid_nr(current));
> > + pr_cont(" Let us know if you know if it needed in your usecase at");
> > + pr_cont(" linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org\n");
> > return mem_cgroup_force_empty(memcg);
>
> It probably would be way easier to just mark the knob with
> CFTYPE_INSANE.
That would prevent from creating the file, right? I do not mind that but
I would like to see people complaining before.
--
Michal Hocko
SUSE Labs
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH 02/14] cgroup: remove pointless has tasks/children test from mem_cgroup_force_empty()
[not found] ` <20140512145927.GA1421-Gd/HAXX7CRxy/B6EtB590w@public.gmane.org>
@ 2014-05-12 15:21 ` Michal Hocko
0 siblings, 0 replies; 48+ messages in thread
From: Michal Hocko @ 2014-05-12 15:21 UTC (permalink / raw)
To: Tejun Heo
Cc: lizefan-hv44wF8Li93QT0dZR+AlfA, cgroups-u79uwXL29TY76Z2rM5mHXA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA,
hannes-druUgvl0LCNAfugRpC6u6w
On Mon 12-05-14 10:59:27, Tejun Heo wrote:
[...]
> Can I roll this into my series so that I can put this before changes
> which depend on direct ->children usages being removed?
Sure
--
Michal Hocko
SUSE Labs
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH] memcg: deprecate memory.force_empty knob
2014-05-12 15:20 ` Michal Hocko
@ 2014-05-12 15:25 ` Tejun Heo
[not found] ` <20140512152507.GD1421-Gd/HAXX7CRxy/B6EtB590w@public.gmane.org>
0 siblings, 1 reply; 48+ messages in thread
From: Tejun Heo @ 2014-05-12 15:25 UTC (permalink / raw)
To: Michal Hocko; +Cc: lizefan, cgroups, linux-kernel, hannes
On Mon, May 12, 2014 at 05:20:15PM +0200, Michal Hocko wrote:
> On Mon 12-05-14 11:00:14, Tejun Heo wrote:
> > On Mon, May 12, 2014 at 04:58:03PM +0200, Michal Hocko wrote:
> > > @@ -4793,6 +4793,10 @@ static int mem_cgroup_force_empty_write(struct cgroup_subsys_state *css,
> > >
> > > if (mem_cgroup_is_root(memcg))
> > > return -EINVAL;
> > > + pr_info("%s (%d): memory.force_empty is deprecated and will be removed.",
> > > + current->comm, task_pid_nr(current));
> > > + pr_cont(" Let us know if you know if it needed in your usecase at");
> > > + pr_cont(" linux-mm@kvack.org\n");
> > > return mem_cgroup_force_empty(memcg);
> >
> > It probably would be way easier to just mark the knob with
> > CFTYPE_INSANE.
>
> That would prevent from creating the file, right? I do not mind that but
> I would like to see people complaining before.
Oh sure, if you wanna see people complaining before the roll out of
unified hierarchy, but let's make sure it's also marked with
CFTYPE_INSANE. It's easy to remove the flag afterwards. The other
way isn't, so...
Thanks.
--
tejun
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH] memcg: deprecate memory.force_empty knob
[not found] ` <20140512152507.GD1421-Gd/HAXX7CRxy/B6EtB590w@public.gmane.org>
@ 2014-05-12 15:34 ` Michal Hocko
[not found] ` <20140512153458.GJ9564-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
0 siblings, 1 reply; 48+ messages in thread
From: Michal Hocko @ 2014-05-12 15:34 UTC (permalink / raw)
To: Tejun Heo
Cc: lizefan-hv44wF8Li93QT0dZR+AlfA, cgroups-u79uwXL29TY76Z2rM5mHXA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA,
hannes-druUgvl0LCNAfugRpC6u6w
On Mon 12-05-14 11:25:07, Tejun Heo wrote:
> On Mon, May 12, 2014 at 05:20:15PM +0200, Michal Hocko wrote:
> > On Mon 12-05-14 11:00:14, Tejun Heo wrote:
> > > On Mon, May 12, 2014 at 04:58:03PM +0200, Michal Hocko wrote:
> > > > @@ -4793,6 +4793,10 @@ static int mem_cgroup_force_empty_write(struct cgroup_subsys_state *css,
> > > >
> > > > if (mem_cgroup_is_root(memcg))
> > > > return -EINVAL;
> > > > + pr_info("%s (%d): memory.force_empty is deprecated and will be removed.",
> > > > + current->comm, task_pid_nr(current));
> > > > + pr_cont(" Let us know if you know if it needed in your usecase at");
> > > > + pr_cont(" linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org\n");
> > > > return mem_cgroup_force_empty(memcg);
> > >
> > > It probably would be way easier to just mark the knob with
> > > CFTYPE_INSANE.
> >
> > That would prevent from creating the file, right? I do not mind that but
> > I would like to see people complaining before.
>
> Oh sure, if you wanna see people complaining before the roll out of
> unified hierarchy, but let's make sure it's also marked with
> CFTYPE_INSANE. It's easy to remove the flag afterwards. The other
> way isn't, so...
---
From 6f2a33df7750f0794b03f7a85aba02a4e631f2a0 Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org>
Date: Mon, 12 May 2014 16:20:46 +0200
Subject: [PATCH] memcg: deprecate memory.force_empty knob
force_empty has been introduced primarily to drop memory before it gets
reparented on the group removal. This alone doesn't sound fully
justified because reparented pages which are not in use can be reclaimed
also later when there is a memory pressure on the parent level.
Mark the knob CFTYPE_INSANE which tells the cgroup core that it
shouldn't create the knob with the experimental sane_behavior. Other
users will get informed about the deprecation and asked to tell us more.
But I expect that most users will be simply cgroup remove handlers
which do that since ever without having any good reason for it.
If somebody really cares and the reparented pages, which would be dropped
otherwise, push out more important ones then we should fix the
reparenting code and put pages to the tail.
Signed-off-by: Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org>
---
Documentation/cgroups/memory.txt | 3 +++
mm/memcontrol.c | 5 +++++
2 files changed, 8 insertions(+)
diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
index f0f67b44ea07..fc9fad984bfb 100644
--- a/Documentation/cgroups/memory.txt
+++ b/Documentation/cgroups/memory.txt
@@ -477,6 +477,9 @@ About use_hierarchy, see Section 6.
write will still return success. In this case, it is expected that
memory.kmem.usage_in_bytes == memory.usage_in_bytes.
+ Please note that this knob is considered deprecated and will be removed
+ in future.
+
About use_hierarchy, see Section 6.
5.2 stat file
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index b030b15b626a..ee123f3d40d5 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4793,6 +4793,10 @@ static int mem_cgroup_force_empty_write(struct cgroup_subsys_state *css,
if (mem_cgroup_is_root(memcg))
return -EINVAL;
+ pr_info("%s (%d): memory.force_empty is deprecated and will be removed.",
+ current->comm, task_pid_nr(current));
+ pr_cont(" Let us know if you know if it needed in your usecase at");
+ pr_cont(" linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org\n");
return mem_cgroup_force_empty(memcg);
}
@@ -6037,6 +6041,7 @@ static struct cftype mem_cgroup_files[] = {
},
{
.name = "force_empty",
+ .flags = CFTYPE_INSANE,
.trigger = mem_cgroup_force_empty_write,
},
{
--
2.0.0.rc0
--
Michal Hocko
SUSE Labs
^ permalink raw reply related [flat|nested] 48+ messages in thread
* Re: [PATCH 04/14] device_cgroup: remove direct access to cgroup->children
[not found] ` <1399671091-23867-5-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
@ 2014-05-13 12:56 ` Aristeu Rozanski
2014-05-14 12:52 ` Serge E. Hallyn
1 sibling, 0 replies; 48+ messages in thread
From: Aristeu Rozanski @ 2014-05-13 12:56 UTC (permalink / raw)
To: Tejun Heo
Cc: lizefan-hv44wF8Li93QT0dZR+AlfA, cgroups-u79uwXL29TY76Z2rM5mHXA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA,
hannes-druUgvl0LCNAfugRpC6u6w, Serge Hallyn
On Fri, May 09, 2014 at 05:31:21PM -0400, Tejun Heo wrote:
> Currently, devcg::has_children() directly tests cgroup->children for
> list emptiness. The field is not a published field and scheduled to
> go away. In addition, the test isn't strictly correct as devcg should
> only care about children which are visible to userland.
>
> This patch converts has_children() to use css_next_child() instead.
> The subtle incorrectness is noted and will be dealt with later.
>
> Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
> Cc: Aristeu Rozanski <aris-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> Cc: Serge Hallyn <serge.hallyn-GeWIH/nMZzLQT0dZR+AlfA@public.gmane.org>
> ---
> security/device_cgroup.c | 12 ++++++++++--
> 1 file changed, 10 insertions(+), 2 deletions(-)
>
> diff --git a/security/device_cgroup.c b/security/device_cgroup.c
> index 3116015..75b4b18 100644
> --- a/security/device_cgroup.c
> +++ b/security/device_cgroup.c
> @@ -477,9 +477,17 @@ static int propagate_exception(struct dev_cgroup *devcg_root,
>
> static inline bool has_children(struct dev_cgroup *devcgroup)
> {
> - struct cgroup *cgrp = devcgroup->css.cgroup;
> + bool ret;
>
> - return !list_empty(&cgrp->children);
> + /*
> + * FIXME: There may be lingering offline csses and this function
> + * may return %true when there isn't any userland-visible child
> + * which is incorrect for our purposes.
> + */
> + rcu_read_lock();
> + ret = css_next_child(NULL, &devcgroup->css);
> + rcu_read_unlock();
> + return ret;
> }
>
> /*
Acked-by: Aristeu Rozanski <aris-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
--
Aristeu
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH 13/14] device_cgroup: use css_has_online_children() instead of has_children()
[not found] ` <1399671091-23867-14-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
@ 2014-05-13 12:56 ` Aristeu Rozanski
0 siblings, 0 replies; 48+ messages in thread
From: Aristeu Rozanski @ 2014-05-13 12:56 UTC (permalink / raw)
To: Tejun Heo
Cc: lizefan-hv44wF8Li93QT0dZR+AlfA, cgroups-u79uwXL29TY76Z2rM5mHXA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA,
hannes-druUgvl0LCNAfugRpC6u6w, Serge Hallyn
On Fri, May 09, 2014 at 05:31:30PM -0400, Tejun Heo wrote:
> devcgroup_update_access() wants to know whether there are child
> cgroups which are online and visible to userland and has_children()
> may return false positive. Replace it with css_has_online_children().
>
> Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
> Cc: Aristeu Rozanski <aris-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> Cc: Serge Hallyn <serge.hallyn-GeWIH/nMZzLQT0dZR+AlfA@public.gmane.org>
> ---
> security/device_cgroup.c | 19 ++-----------------
> 1 file changed, 2 insertions(+), 17 deletions(-)
>
> diff --git a/security/device_cgroup.c b/security/device_cgroup.c
> index 75b4b18..22de334 100644
> --- a/security/device_cgroup.c
> +++ b/security/device_cgroup.c
> @@ -475,21 +475,6 @@ static int propagate_exception(struct dev_cgroup *devcg_root,
> return rc;
> }
>
> -static inline bool has_children(struct dev_cgroup *devcgroup)
> -{
> - bool ret;
> -
> - /*
> - * FIXME: There may be lingering offline csses and this function
> - * may return %true when there isn't any userland-visible child
> - * which is incorrect for our purposes.
> - */
> - rcu_read_lock();
> - ret = css_next_child(NULL, &devcgroup->css);
> - rcu_read_unlock();
> - return ret;
> -}
> -
> /*
> * Modify the exception list using allow/deny rules.
> * CAP_SYS_ADMIN is needed for this. It's at least separate from CAP_MKNOD
> @@ -522,7 +507,7 @@ static int devcgroup_update_access(struct dev_cgroup *devcgroup,
> case 'a':
> switch (filetype) {
> case DEVCG_ALLOW:
> - if (has_children(devcgroup))
> + if (css_has_online_children(&devcgroup->css))
> return -EINVAL;
>
> if (!may_allow_all(parent))
> @@ -538,7 +523,7 @@ static int devcgroup_update_access(struct dev_cgroup *devcgroup,
> return rc;
> break;
> case DEVCG_DENY:
> - if (has_children(devcgroup))
> + if (css_has_online_children(&devcgroup->css))
> return -EINVAL;
>
> dev_exception_clean(devcgroup);
Acked-by: Aristeu Rozanski <aris-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
--
Aristeu
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH 02/14] cgroup: remove pointless has tasks/children test from mem_cgroup_force_empty()
[not found] ` <20140512145324.GE9564-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
2014-05-12 14:58 ` [PATCH] memcg: deprecate memory.force_empty knob Michal Hocko
2014-05-12 14:59 ` [PATCH 02/14] cgroup: remove pointless has tasks/children test from mem_cgroup_force_empty() Tejun Heo
@ 2014-05-13 13:10 ` Johannes Weiner
2014-05-13 16:46 ` Tejun Heo
2014-05-13 18:51 ` [PATCH UPDATED 02/14] memcg: remove " Tejun Heo
4 siblings, 0 replies; 48+ messages in thread
From: Johannes Weiner @ 2014-05-13 13:10 UTC (permalink / raw)
To: Michal Hocko
Cc: Tejun Heo, lizefan-hv44wF8Li93QT0dZR+AlfA,
cgroups-u79uwXL29TY76Z2rM5mHXA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA
On Mon, May 12, 2014 at 04:53:24PM +0200, Michal Hocko wrote:
> On Fri 09-05-14 17:31:19, Tejun Heo wrote:
> > mem_cgroup_force_empty() is used only from
> > mem_cgroup_force_empty_write() and tests whether the target memcg has
> > any tasks or children without any synchronization and then returns
> > -EBUSY if so.
> >
> > This is just weird. The tests don't really mean anything as tasks and
> > children may be added after the tests and it also makes the behavior
> > of the knob arbitrary because there may be lingering offline and
> > removed children on the children list for extended period of time -
> > writes to the knob can return -EBUSY for reasons completely invisible
> > to userland.
>
> Agreed.
>
> > The knob is best-effort anyway and the broken business test doesn't
> > affect its operation. Remove it.
>
> But I do not think that removing just the test is the right way to go.
> It is mem_cgroup_reparent_charges which bothers me because it loops
> until the current counter falls down to 0 and it also feels strange that
> a group can hand over pages up the hierarchy (or even to an unlimitted
> root if this is a top of a hierarchy).
>
> So I think that we want to get rid of reparenting as well. The main use
> case as described by the documentation is:
> "
> The typical use case for this interface is before calling rmdir().
> Because rmdir() moves all pages to parent, some out-of-use page caches can be
> moved to the parent. If you want to avoid that, force_empty will be useful.
> "
>
> rmdir will reparent pages implicitly so the reclaim part should be
> sufficient and it would be OK even with existing tasks. Subgroups would
> be little bit more tricky because the user doesn't have any control over
> which group of the hierarchy will get reclaimed.
>
> Anyway, the knob sucks - for the similar reasons why drop_cache sucks -
> especially when the check would be gone and it would be another way how
> to trigger reclaim anytime. Having the check doesn't prevent from races
> but it at least prevents abuse.
>
> So I would be rather for dropping the knob altogether, but that seems to
> be a long term thing. So let's start with deprecating it + remove the
> check with dropping reparenting part.
>
> What do you think about the following patch instead:
> ---
> >From 03f8cb2e1fd2636d859c54df9b58719fe96e0e54 Mon Sep 17 00:00:00 2001
> From: Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org>
> Date: Mon, 12 May 2014 16:34:17 +0200
> Subject: [PATCH] memcg: remove tasks/children test from from
> mem_cgroup_force_empty
>
> Tejun has correctly pointed out that tasks/children test in
> mem_cgroup_force_empty is not correct because there is no other locking
> which preserves this state throughout the rest of the function so both
> new tasks can join the group or new children groups can be added while
> somebody is writing to memory.force_empty. A new task would break
> mem_cgroup_reparent_charges expectation that all failures as described
> by mem_cgroup_force_empty_list are temporal and there is no way out.
>
> The main use case for the knob as described by
> Documentation/cgroups/memory.txt is to:
> "
> The typical use case for this interface is before calling rmdir().
> Because rmdir() moves all pages to parent, some out-of-use page caches can be
> moved to the parent. If you want to avoid that, force_empty will be useful.
> "
>
> This means that reparenting is not really required as rmdir will
> reparent pages implicitly from the safe context. If we remove it from
> mem_cgroup_force_empty then we are safe even with existing tasks because
> the number of reclaim attempts is bounded. Moreover the knob still does
> what the documentation claims (modulo reparenting which doesn't make any
> difference) and users might expect. Longterm we want to deprecate the
> whole knob and put the reparented pages to the tail of parent LRU during
> cgroup removal.
Yeah, let's deprecate the knob, but please change this to "deal with
left-over pages during cgroup removal". Reparenting is an
implementation detail.
In fact, now that Tejun made offlined csss iterable, we can make
charges pin the css beyond cgroup lifetime and just reclaim leftovers
from their original css. Then get rid of exit-reparenting entirely.
Should also make kmemcg handling during cgroup removal much easier...
[ I was going to send patches to do all this already but couldn't get
them ready in time before my vacation. ]
> Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
> Signed-off-by: Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org>
Acked-by: Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH] memcg: deprecate memory.force_empty knob
[not found] ` <20140512153458.GJ9564-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
@ 2014-05-13 13:16 ` Johannes Weiner
[not found] ` <20140513131655.GC18849-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
0 siblings, 1 reply; 48+ messages in thread
From: Johannes Weiner @ 2014-05-13 13:16 UTC (permalink / raw)
To: Michal Hocko
Cc: Tejun Heo, lizefan-hv44wF8Li93QT0dZR+AlfA,
cgroups-u79uwXL29TY76Z2rM5mHXA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA
On Mon, May 12, 2014 at 05:34:58PM +0200, Michal Hocko wrote:
> On Mon 12-05-14 11:25:07, Tejun Heo wrote:
> > On Mon, May 12, 2014 at 05:20:15PM +0200, Michal Hocko wrote:
> > > On Mon 12-05-14 11:00:14, Tejun Heo wrote:
> > > > On Mon, May 12, 2014 at 04:58:03PM +0200, Michal Hocko wrote:
> > > > > @@ -4793,6 +4793,10 @@ static int mem_cgroup_force_empty_write(struct cgroup_subsys_state *css,
> > > > >
> > > > > if (mem_cgroup_is_root(memcg))
> > > > > return -EINVAL;
> > > > > + pr_info("%s (%d): memory.force_empty is deprecated and will be removed.",
> > > > > + current->comm, task_pid_nr(current));
> > > > > + pr_cont(" Let us know if you know if it needed in your usecase at");
> > > > > + pr_cont(" linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org\n");
> > > > > return mem_cgroup_force_empty(memcg);
> > > >
> > > > It probably would be way easier to just mark the knob with
> > > > CFTYPE_INSANE.
> > >
> > > That would prevent from creating the file, right? I do not mind that but
> > > I would like to see people complaining before.
> >
> > Oh sure, if you wanna see people complaining before the roll out of
> > unified hierarchy, but let's make sure it's also marked with
> > CFTYPE_INSANE. It's easy to remove the flag afterwards. The other
> > way isn't, so...
> ---
> >From 6f2a33df7750f0794b03f7a85aba02a4e631f2a0 Mon Sep 17 00:00:00 2001
> From: Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org>
> Date: Mon, 12 May 2014 16:20:46 +0200
> Subject: [PATCH] memcg: deprecate memory.force_empty knob
>
> force_empty has been introduced primarily to drop memory before it gets
> reparented on the group removal. This alone doesn't sound fully
> justified because reparented pages which are not in use can be reclaimed
> also later when there is a memory pressure on the parent level.
>
> Mark the knob CFTYPE_INSANE which tells the cgroup core that it
> shouldn't create the knob with the experimental sane_behavior. Other
> users will get informed about the deprecation and asked to tell us more.
> But I expect that most users will be simply cgroup remove handlers
> which do that since ever without having any good reason for it.
>
> If somebody really cares and the reparented pages, which would be dropped
> otherwise, push out more important ones then we should fix the
> reparenting code and put pages to the tail.
>
> Signed-off-by: Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org>
I'm skeptical the printk will do anything useful, but you marked the
knob insane and that's the most important change.
Acked-by: Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH] memcg: deprecate memory.force_empty knob
[not found] ` <20140513131655.GC18849-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
@ 2014-05-13 15:09 ` Michal Hocko
0 siblings, 0 replies; 48+ messages in thread
From: Michal Hocko @ 2014-05-13 15:09 UTC (permalink / raw)
To: Johannes Weiner
Cc: Tejun Heo, lizefan-hv44wF8Li93QT0dZR+AlfA,
cgroups-u79uwXL29TY76Z2rM5mHXA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA
On Tue 13-05-14 09:16:56, Johannes Weiner wrote:
> On Mon, May 12, 2014 at 05:34:58PM +0200, Michal Hocko wrote:
[...]
> > >From 6f2a33df7750f0794b03f7a85aba02a4e631f2a0 Mon Sep 17 00:00:00 2001
> > From: Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org>
> > Date: Mon, 12 May 2014 16:20:46 +0200
> > Subject: [PATCH] memcg: deprecate memory.force_empty knob
> >
> > force_empty has been introduced primarily to drop memory before it gets
> > reparented on the group removal. This alone doesn't sound fully
> > justified because reparented pages which are not in use can be reclaimed
> > also later when there is a memory pressure on the parent level.
> >
> > Mark the knob CFTYPE_INSANE which tells the cgroup core that it
> > shouldn't create the knob with the experimental sane_behavior. Other
> > users will get informed about the deprecation and asked to tell us more.
> > But I expect that most users will be simply cgroup remove handlers
> > which do that since ever without having any good reason for it.
> >
> > If somebody really cares and the reparented pages, which would be dropped
> > otherwise, push out more important ones then we should fix the
> > reparenting code and put pages to the tail.
> >
> > Signed-off-by: Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org>
>
> I'm skeptical the printk will do anything useful, but you marked the
> knob insane and that's the most important change.
Well, I suspect that most users will try the new semantic at the latest
possible moment and then it can come up as a surprise. I would prefer to
catch those as soon as possible. I am even thinking to push this to SLES
to catch possible enterprise users.
> Acked-by: Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
Thanks. OK, I will post it to Andrew. I guess he will want to have some
rate-limiting or print-once semantic...
--
Michal Hocko
SUSE Labs
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH 02/14] cgroup: remove pointless has tasks/children test from mem_cgroup_force_empty()
[not found] ` <20140512145324.GE9564-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
` (2 preceding siblings ...)
2014-05-13 13:10 ` Johannes Weiner
@ 2014-05-13 16:46 ` Tejun Heo
2014-05-13 18:51 ` [PATCH UPDATED 02/14] memcg: remove " Tejun Heo
4 siblings, 0 replies; 48+ messages in thread
From: Tejun Heo @ 2014-05-13 16:46 UTC (permalink / raw)
To: Michal Hocko
Cc: lizefan-hv44wF8Li93QT0dZR+AlfA, cgroups-u79uwXL29TY76Z2rM5mHXA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA,
hannes-druUgvl0LCNAfugRpC6u6w
On Mon, May 12, 2014 at 04:53:24PM +0200, Michal Hocko wrote:
> From 03f8cb2e1fd2636d859c54df9b58719fe96e0e54 Mon Sep 17 00:00:00 2001
> From: Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org>
> Date: Mon, 12 May 2014 16:34:17 +0200
> Subject: [PATCH] memcg: remove tasks/children test from from
> mem_cgroup_force_empty
>
> Tejun has correctly pointed out that tasks/children test in
> mem_cgroup_force_empty is not correct because there is no other locking
> which preserves this state throughout the rest of the function so both
> new tasks can join the group or new children groups can be added while
> somebody is writing to memory.force_empty. A new task would break
> mem_cgroup_reparent_charges expectation that all failures as described
> by mem_cgroup_force_empty_list are temporal and there is no way out.
>
> The main use case for the knob as described by
> Documentation/cgroups/memory.txt is to:
> "
> The typical use case for this interface is before calling rmdir().
> Because rmdir() moves all pages to parent, some out-of-use page caches can be
> moved to the parent. If you want to avoid that, force_empty will be useful.
> "
>
> This means that reparenting is not really required as rmdir will
> reparent pages implicitly from the safe context. If we remove it from
> mem_cgroup_force_empty then we are safe even with existing tasks because
> the number of reclaim attempts is bounded. Moreover the knob still does
> what the documentation claims (modulo reparenting which doesn't make any
> difference) and users might expect. Longterm we want to deprecate the
> whole knob and put the reparented pages to the tail of parent LRU during
> cgroup removal.
Replaced the original patch with this one with Johannes' ack added.
Thanks!
--
tejun
^ permalink raw reply [flat|nested] 48+ messages in thread
* [PATCH v2 03/14] memcg: update memcg_has_children() to use css_next_child()
2014-05-09 21:31 ` [PATCH 03/14] memcg: update memcg_has_children() to use css_next_child() Tejun Heo
[not found] ` <1399671091-23867-4-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
@ 2014-05-13 16:53 ` Tejun Heo
1 sibling, 0 replies; 48+ messages in thread
From: Tejun Heo @ 2014-05-13 16:53 UTC (permalink / raw)
To: lizefan; +Cc: cgroups, linux-kernel, hannes, Michal Hocko
Currently, memcg_has_children() and mem_cgroup_hierarchy_write()
directly test cgroup->children for list emptiness. It's semantically
correct in traditional hierarchies as it actually wants to test for
any children dead or alive; however, cgroup->children is not a
published field and scheduled to go away.
This patch moves out .use_hierarchy test out of memcg_has_children()
and updates it to use css_next_child() to test whether there exists
any children. With .use_hierarchy test moved out, it can also be used
by mem_cgroup_hierarchy_write().
A side note: As .use_hierarchy is going away, it doesn't really matter
but I'm not sure about how it's used in __memcg_activate_kmem(). The
condition tested by memcg_has_children() is mushy when seen from
userland as its result is affected by dead csses which aren't visible
from userland. I think the rule would be a lot clearer if we have a
dedicated "freshly minted" flag which gets cleared when the first task
is migrated into it or the first child is created and then gate
activation with that.
v2: Added comment noting that testing use_hierarchy is the
responsibility of the callers of memcg_has_children() as suggested
by Michal Hocko.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
---
mm/memcontrol.c | 31 +++++++++++++++++++++----------
1 file changed, 21 insertions(+), 10 deletions(-)
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4834,18 +4834,28 @@ static void mem_cgroup_reparent_charges(
} while (usage > 0);
}
+/*
+ * Test whether @memcg has children, dead or alive. Note that this
+ * function doesn't care whether @memcg has use_hierarchy enabled and
+ * returns %true if there are child csses according to the cgroup
+ * hierarchy. Testing use_hierarchy is the caller's responsiblity.
+ */
static inline bool memcg_has_children(struct mem_cgroup *memcg)
{
- lockdep_assert_held(&memcg_create_mutex);
+ bool ret;
+
/*
- * The lock does not prevent addition or deletion to the list
- * of children, but it prevents a new child from being
- * initialized based on this parent in css_online(), so it's
- * enough to decide whether hierarchically inherited
- * attributes can still be changed or not.
+ * The lock does not prevent addition or deletion of children, but
+ * it prevents a new child from being initialized based on this
+ * parent in css_online(), so it's enough to decide whether
+ * hierarchically inherited attributes can still be changed or not.
*/
- return memcg->use_hierarchy &&
- !list_empty(&memcg->css.cgroup->children);
+ lockdep_assert_held(&memcg_create_mutex);
+
+ rcu_read_lock();
+ ret = css_next_child(NULL, &memcg->css);
+ rcu_read_unlock();
+ return ret;
}
/*
@@ -4920,7 +4930,7 @@ static int mem_cgroup_hierarchy_write(st
*/
if ((!parent_memcg || !parent_memcg->use_hierarchy) &&
(val == 1 || val == 0)) {
- if (list_empty(&memcg->css.cgroup->children))
+ if (!memcg_has_children(memcg))
memcg->use_hierarchy = val;
else
retval = -EBUSY;
@@ -5037,7 +5047,8 @@ static int __memcg_activate_kmem(struct
* of course permitted.
*/
mutex_lock(&memcg_create_mutex);
- if (cgroup_has_tasks(memcg->css.cgroup) || memcg_has_children(memcg))
+ if (cgroup_has_tasks(memcg->css.cgroup) ||
+ (memcg->use_hierarchy && memcg_has_children(memcg)))
err = -EBUSY;
mutex_unlock(&memcg_create_mutex);
if (err)
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCHSET cgroup/for-3.16] cgroup: iterate cgroup_subsys_states directly
[not found] ` <1399671091-23867-1-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
` (3 preceding siblings ...)
2014-05-09 21:31 ` [PATCH 07/14] cgroup: link all cgroup_subsys_states in their sibling lists Tejun Heo
@ 2014-05-13 16:59 ` Tejun Heo
2014-05-14 4:21 ` Li Zefan
` (2 subsequent siblings)
7 siblings, 0 replies; 48+ messages in thread
From: Tejun Heo @ 2014-05-13 16:59 UTC (permalink / raw)
To: lizefan-hv44wF8Li93QT0dZR+AlfA
Cc: cgroups-u79uwXL29TY76Z2rM5mHXA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA,
hannes-druUgvl0LCNAfugRpC6u6w
On Fri, May 09, 2014 at 05:31:17PM -0400, Tejun Heo wrote:
> and available in the following git branch.
>
> git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git review-direct-css-iteration
Refreshed on top of cgroup/for-3.16 with the updated patches.
Thanks.
--
tejun
^ permalink raw reply [flat|nested] 48+ messages in thread
* [PATCH v2 01/14] cgroup: remove css_parent()
[not found] ` <1399671091-23867-2-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
2014-05-11 1:47 ` David Miller
2014-05-11 13:02 ` Neil Horman
@ 2014-05-13 18:50 ` Tejun Heo
2 siblings, 0 replies; 48+ messages in thread
From: Tejun Heo @ 2014-05-13 18:50 UTC (permalink / raw)
To: lizefan-hv44wF8Li93QT0dZR+AlfA
Cc: cgroups-u79uwXL29TY76Z2rM5mHXA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA,
hannes-druUgvl0LCNAfugRpC6u6w, Vivek Goyal, Jens Axboe,
Peter Zijlstra, Michal Hocko, Neil Horman, David S. Miller
From 55d3077482c2b9d05deefa92cf28c4057538f83c Mon Sep 17 00:00:00 2001
From: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
Date: Tue, 13 May 2014 14:47:01 -0400
cgroup in general is moving towards using cgroup_subsys_state as the
fundamental structural component and css_parent() was introduced to
convert from using cgroup->parent to css->parent. It was quite some
time ago and we're moving forward with making css more prominent.
This patch drops the trivial wrapper css_parent() and let the users
dereference css->parent. While at it, explicitly mark fields of css
which are public and immutable.
v2: New usage from device_cgroup.c converted.
Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
Acked-by: Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org>
Acked-by: Neil Horman <nhorman-2XuSBdqkA4R54TAoqtyWWQ@public.gmane.org>
Acked-by: "David S. Miller" <davem-fT/PcQaiUtIeIZ0/mPfg9Q@public.gmane.org>
Cc: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Cc: Jens Axboe <axboe-tSWWG44O7X1aa/9Udqfwiw@public.gmane.org>
Cc: Li Zefan <lizefan-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
Cc: Peter Zijlstra <a.p.zijlstra-/NLkJaSkS4VmR6Xm/wNWPw@public.gmane.org>
Cc: Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
---
block/blk-cgroup.h | 2 +-
include/linux/cgroup.h | 29 +++++++++++------------------
kernel/cgroup.c | 8 ++++----
kernel/cgroup_freezer.c | 2 +-
kernel/cpuset.c | 2 +-
kernel/sched/core.c | 2 +-
kernel/sched/cpuacct.c | 2 +-
mm/hugetlb_cgroup.c | 2 +-
mm/memcontrol.c | 14 +++++++-------
net/core/netclassid_cgroup.c | 2 +-
net/core/netprio_cgroup.c | 2 +-
security/device_cgroup.c | 8 ++++----
12 files changed, 34 insertions(+), 41 deletions(-)
diff --git a/block/blk-cgroup.h b/block/blk-cgroup.h
index 371fe8e..d692b29 100644
--- a/block/blk-cgroup.h
+++ b/block/blk-cgroup.h
@@ -204,7 +204,7 @@ static inline struct blkcg *bio_blkcg(struct bio *bio)
*/
static inline struct blkcg *blkcg_parent(struct blkcg *blkcg)
{
- return css_to_blkcg(css_parent(&blkcg->css));
+ return css_to_blkcg(blkcg->css.parent);
}
/**
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 76dadd77..88c4d03 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -48,22 +48,28 @@ enum cgroup_subsys_id {
};
#undef SUBSYS
-/* Per-subsystem/per-cgroup state maintained by the system. */
+/*
+ * Per-subsystem/per-cgroup state maintained by the system. This is the
+ * fundamental structural building block that controllers deal with.
+ *
+ * Fields marked with "PI:" are public and immutable and may be accessed
+ * directly without synchronization.
+ */
struct cgroup_subsys_state {
- /* the cgroup that this css is attached to */
+ /* PI: the cgroup that this css is attached to */
struct cgroup *cgroup;
- /* the cgroup subsystem that this css is attached to */
+ /* PI: the cgroup subsystem that this css is attached to */
struct cgroup_subsys *ss;
/* reference count - access via css_[try]get() and css_put() */
struct percpu_ref refcnt;
- /* the parent css */
+ /* PI: the parent css */
struct cgroup_subsys_state *parent;
/*
- * Subsys-unique ID. 0 is unused and root is always 1. The
+ * PI: Subsys-unique ID. 0 is unused and root is always 1. The
* matching css can be looked up using css_from_id().
*/
int id;
@@ -665,19 +671,6 @@ struct cgroup_subsys {
#undef SUBSYS
/**
- * css_parent - find the parent css
- * @css: the target cgroup_subsys_state
- *
- * Return the parent css of @css. This function is guaranteed to return
- * non-NULL parent as long as @css isn't the root.
- */
-static inline
-struct cgroup_subsys_state *css_parent(struct cgroup_subsys_state *css)
-{
- return css->parent;
-}
-
-/**
* task_css_set_check - obtain a task's css_set with extra access conditions
* @task: the task to obtain css_set for
* @__c: extra condition expression to be passed to rcu_dereference_check()
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 365787f..321163a 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -3175,10 +3175,10 @@ css_next_descendant_pre(struct cgroup_subsys_state *pos,
/* no child, visit my or the closest ancestor's next sibling */
while (pos != root) {
- next = css_next_child(pos, css_parent(pos));
+ next = css_next_child(pos, pos->parent);
if (next)
return next;
- pos = css_parent(pos);
+ pos = pos->parent;
}
return NULL;
@@ -3260,12 +3260,12 @@ css_next_descendant_post(struct cgroup_subsys_state *pos,
return NULL;
/* if there's an unvisited sibling, visit its leftmost descendant */
- next = css_next_child(pos, css_parent(pos));
+ next = css_next_child(pos, pos->parent);
if (next)
return css_leftmost_descendant(next);
/* no sibling left, visit parent */
- return css_parent(pos);
+ return pos->parent;
}
static bool cgroup_has_live_children(struct cgroup *cgrp)
diff --git a/kernel/cgroup_freezer.c b/kernel/cgroup_freezer.c
index 6b4e60e..a79e40f 100644
--- a/kernel/cgroup_freezer.c
+++ b/kernel/cgroup_freezer.c
@@ -59,7 +59,7 @@ static inline struct freezer *task_freezer(struct task_struct *task)
static struct freezer *parent_freezer(struct freezer *freezer)
{
- return css_freezer(css_parent(&freezer->css));
+ return css_freezer(freezer->css.parent);
}
bool cgroup_freezing(struct task_struct *task)
diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index 2f4b08b..5b2a310 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -124,7 +124,7 @@ static inline struct cpuset *task_cs(struct task_struct *task)
static inline struct cpuset *parent_cs(struct cpuset *cs)
{
- return css_cs(css_parent(&cs->css));
+ return css_cs(cs->css.parent);
}
#ifdef CONFIG_NUMA
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 268a45e..ac61ad1 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7586,7 +7586,7 @@ cpu_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
static int cpu_cgroup_css_online(struct cgroup_subsys_state *css)
{
struct task_group *tg = css_tg(css);
- struct task_group *parent = css_tg(css_parent(css));
+ struct task_group *parent = css_tg(css->parent);
if (parent)
sched_online_group(tg, parent);
diff --git a/kernel/sched/cpuacct.c b/kernel/sched/cpuacct.c
index c143ee3..9cf350c 100644
--- a/kernel/sched/cpuacct.c
+++ b/kernel/sched/cpuacct.c
@@ -46,7 +46,7 @@ static inline struct cpuacct *task_ca(struct task_struct *tsk)
static inline struct cpuacct *parent_ca(struct cpuacct *ca)
{
- return css_ca(css_parent(&ca->css));
+ return css_ca(ca->css.parent);
}
static DEFINE_PER_CPU(u64, root_cpuacct_cpuusage);
diff --git a/mm/hugetlb_cgroup.c b/mm/hugetlb_cgroup.c
index a380681..493f758 100644
--- a/mm/hugetlb_cgroup.c
+++ b/mm/hugetlb_cgroup.c
@@ -52,7 +52,7 @@ static inline bool hugetlb_cgroup_is_root(struct hugetlb_cgroup *h_cg)
static inline struct hugetlb_cgroup *
parent_hugetlb_cgroup(struct hugetlb_cgroup *h_cg)
{
- return hugetlb_cgroup_from_css(css_parent(&h_cg->css));
+ return hugetlb_cgroup_from_css(h_cg->css.parent);
}
static inline bool hugetlb_cgroup_have_usage(struct hugetlb_cgroup *h_cg)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index b638a79..a5e0417 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1540,7 +1540,7 @@ static unsigned long mem_cgroup_margin(struct mem_cgroup *memcg)
int mem_cgroup_swappiness(struct mem_cgroup *memcg)
{
/* root ? */
- if (!css_parent(&memcg->css))
+ if (!memcg->css.parent)
return vm_swappiness;
return memcg->swappiness;
@@ -4909,7 +4909,7 @@ static int mem_cgroup_hierarchy_write(struct cgroup_subsys_state *css,
{
int retval = 0;
struct mem_cgroup *memcg = mem_cgroup_from_css(css);
- struct mem_cgroup *parent_memcg = mem_cgroup_from_css(css_parent(&memcg->css));
+ struct mem_cgroup *parent_memcg = mem_cgroup_from_css(memcg->css.parent);
mutex_lock(&memcg_create_mutex);
@@ -5207,8 +5207,8 @@ static void memcg_get_hierarchical_limit(struct mem_cgroup *memcg,
if (!memcg->use_hierarchy)
goto out;
- while (css_parent(&memcg->css)) {
- memcg = mem_cgroup_from_css(css_parent(&memcg->css));
+ while (memcg->css.parent) {
+ memcg = mem_cgroup_from_css(memcg->css.parent);
if (!memcg->use_hierarchy)
break;
tmp = res_counter_read_u64(&memcg->res, RES_LIMIT);
@@ -5443,7 +5443,7 @@ static int mem_cgroup_swappiness_write(struct cgroup_subsys_state *css,
struct cftype *cft, u64 val)
{
struct mem_cgroup *memcg = mem_cgroup_from_css(css);
- struct mem_cgroup *parent = mem_cgroup_from_css(css_parent(&memcg->css));
+ struct mem_cgroup *parent = mem_cgroup_from_css(memcg->css.parent);
if (val > 100 || !parent)
return -EINVAL;
@@ -5790,7 +5790,7 @@ static int mem_cgroup_oom_control_write(struct cgroup_subsys_state *css,
struct cftype *cft, u64 val)
{
struct mem_cgroup *memcg = mem_cgroup_from_css(css);
- struct mem_cgroup *parent = mem_cgroup_from_css(css_parent(&memcg->css));
+ struct mem_cgroup *parent = mem_cgroup_from_css(memcg->css.parent);
/* cannot set to root cgroup and only 0 and 1 are allowed */
if (!parent || !((val == 0) || (val == 1)))
@@ -6407,7 +6407,7 @@ static int
mem_cgroup_css_online(struct cgroup_subsys_state *css)
{
struct mem_cgroup *memcg = mem_cgroup_from_css(css);
- struct mem_cgroup *parent = mem_cgroup_from_css(css_parent(css));
+ struct mem_cgroup *parent = mem_cgroup_from_css(css->parent);
if (css->id > MEM_CGROUP_ID_MAX)
return -ENOSPC;
diff --git a/net/core/netclassid_cgroup.c b/net/core/netclassid_cgroup.c
index 22931e1..30d903b 100644
--- a/net/core/netclassid_cgroup.c
+++ b/net/core/netclassid_cgroup.c
@@ -42,7 +42,7 @@ cgrp_css_alloc(struct cgroup_subsys_state *parent_css)
static int cgrp_css_online(struct cgroup_subsys_state *css)
{
struct cgroup_cls_state *cs = css_cls_state(css);
- struct cgroup_cls_state *parent = css_cls_state(css_parent(css));
+ struct cgroup_cls_state *parent = css_cls_state(css->parent);
if (parent)
cs->classid = parent->classid;
diff --git a/net/core/netprio_cgroup.c b/net/core/netprio_cgroup.c
index b990cef..2f385b9 100644
--- a/net/core/netprio_cgroup.c
+++ b/net/core/netprio_cgroup.c
@@ -140,7 +140,7 @@ cgrp_css_alloc(struct cgroup_subsys_state *parent_css)
static int cgrp_css_online(struct cgroup_subsys_state *css)
{
- struct cgroup_subsys_state *parent_css = css_parent(css);
+ struct cgroup_subsys_state *parent_css = css->parent;
struct net_device *dev;
int ret = 0;
diff --git a/security/device_cgroup.c b/security/device_cgroup.c
index 7dbac40..ce14a31 100644
--- a/security/device_cgroup.c
+++ b/security/device_cgroup.c
@@ -182,7 +182,7 @@ static inline bool is_devcg_online(const struct dev_cgroup *devcg)
static int devcgroup_online(struct cgroup_subsys_state *css)
{
struct dev_cgroup *dev_cgroup = css_to_devcgroup(css);
- struct dev_cgroup *parent_dev_cgroup = css_to_devcgroup(css_parent(css));
+ struct dev_cgroup *parent_dev_cgroup = css_to_devcgroup(css->parent);
int ret = 0;
mutex_lock(&devcgroup_mutex);
@@ -455,7 +455,7 @@ static bool verify_new_ex(struct dev_cgroup *dev_cgroup,
static int parent_has_perm(struct dev_cgroup *childcg,
struct dev_exception_item *ex)
{
- struct dev_cgroup *parent = css_to_devcgroup(css_parent(&childcg->css));
+ struct dev_cgroup *parent = css_to_devcgroup(childcg->css.parent);
if (!parent)
return 1;
@@ -476,7 +476,7 @@ static int parent_has_perm(struct dev_cgroup *childcg,
static bool parent_allows_removal(struct dev_cgroup *childcg,
struct dev_exception_item *ex)
{
- struct dev_cgroup *parent = css_to_devcgroup(css_parent(&childcg->css));
+ struct dev_cgroup *parent = css_to_devcgroup(childcg->css.parent);
if (!parent)
return true;
@@ -614,7 +614,7 @@ static int devcgroup_update_access(struct dev_cgroup *devcgroup,
char temp[12]; /* 11 + 1 characters needed for a u32 */
int count, rc = 0;
struct dev_exception_item ex;
- struct dev_cgroup *parent = css_to_devcgroup(css_parent(&devcgroup->css));
+ struct dev_cgroup *parent = css_to_devcgroup(devcgroup->css.parent);
if (!capable(CAP_SYS_ADMIN))
return -EPERM;
--
1.9.0
^ permalink raw reply related [flat|nested] 48+ messages in thread
* [PATCH UPDATED 02/14] memcg: remove tasks/children test from mem_cgroup_force_empty()
[not found] ` <20140512145324.GE9564-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
` (3 preceding siblings ...)
2014-05-13 16:46 ` Tejun Heo
@ 2014-05-13 18:51 ` Tejun Heo
4 siblings, 0 replies; 48+ messages in thread
From: Tejun Heo @ 2014-05-13 18:51 UTC (permalink / raw)
To: Michal Hocko
Cc: lizefan-hv44wF8Li93QT0dZR+AlfA, cgroups-u79uwXL29TY76Z2rM5mHXA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA,
hannes-druUgvl0LCNAfugRpC6u6w
From ff584991bd7a6beea591aa10bce1712eb6cef3d6 Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org>
Date: Mon, 12 May 2014 16:34:17 +0200
Tejun has correctly pointed out that tasks/children test in
mem_cgroup_force_empty is not correct because there is no other locking
which preserves this state throughout the rest of the function so both
new tasks can join the group or new children groups can be added while
somebody is writing to memory.force_empty. A new task would break
mem_cgroup_reparent_charges expectation that all failures as described
by mem_cgroup_force_empty_list are temporal and there is no way out.
The main use case for the knob as described by
Documentation/cgroups/memory.txt is to:
"
The typical use case for this interface is before calling rmdir().
Because rmdir() moves all pages to parent, some out-of-use page caches can be
moved to the parent. If you want to avoid that, force_empty will be useful.
"
This means that reparenting is not really required as rmdir will
reparent pages implicitly from the safe context. If we remove it from
mem_cgroup_force_empty then we are safe even with existing tasks because
the number of reclaim attempts is bounded. Moreover the knob still does
what the documentation claims (modulo reparenting which doesn't make any
difference) and users might expect. Longterm we want to deprecate the
whole knob and put the reparented pages to the tail of parent LRU during
cgroup removal.
tj: Removed unused variable @cgrp from mem_cgroup_force_empty()
Signed-off-by: Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org>
Acked-by: Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
---
Documentation/cgroups/memory.txt | 6 +-----
mm/memcontrol.c | 7 -------
2 files changed, 1 insertion(+), 12 deletions(-)
diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
index 2622115..8cb87e3 100644
--- a/Documentation/cgroups/memory.txt
+++ b/Documentation/cgroups/memory.txt
@@ -453,15 +453,11 @@ About use_hierarchy, see Section 6.
5.1 force_empty
memory.force_empty interface is provided to make cgroup's memory usage empty.
- You can use this interface only when the cgroup has no tasks.
When writing anything to this
# echo 0 > memory.force_empty
- Almost all pages tracked by this memory cgroup will be unmapped and freed.
- Some pages cannot be freed because they are locked or in-use. Such pages are
- moved to parent (if use_hierarchy==1) or root (if use_hierarchy==0) and this
- cgroup will be empty.
+ the cgroup will be reclaimed and as many pages reclaimed as possible.
The typical use case for this interface is before calling rmdir().
Because rmdir() moves all pages to parent, some out-of-use page caches can be
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index a5e0417..6144a8e 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4857,11 +4857,6 @@ static inline bool memcg_has_children(struct mem_cgroup *memcg)
static int mem_cgroup_force_empty(struct mem_cgroup *memcg)
{
int nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
- struct cgroup *cgrp = memcg->css.cgroup;
-
- /* returns EBUSY if there is a task or if we come here twice. */
- if (cgroup_has_tasks(cgrp) || !list_empty(&cgrp->children))
- return -EBUSY;
/* we call try-to-free pages for make this cgroup empty */
lru_add_drain_all();
@@ -4881,8 +4876,6 @@ static int mem_cgroup_force_empty(struct mem_cgroup *memcg)
}
}
- lru_add_drain();
- mem_cgroup_reparent_charges(memcg);
return 0;
}
--
1.9.0
^ permalink raw reply related [flat|nested] 48+ messages in thread
* Re: [PATCHSET cgroup/for-3.16] cgroup: iterate cgroup_subsys_states directly
[not found] ` <1399671091-23867-1-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
` (4 preceding siblings ...)
2014-05-13 16:59 ` [PATCHSET cgroup/for-3.16] cgroup: iterate cgroup_subsys_states directly Tejun Heo
@ 2014-05-14 4:21 ` Li Zefan
[not found] ` <5372EF45.8060701-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
2014-05-16 1:29 ` Li Zefan
2014-05-16 16:08 ` Tejun Heo
7 siblings, 1 reply; 48+ messages in thread
From: Li Zefan @ 2014-05-14 4:21 UTC (permalink / raw)
To: Tejun Heo
Cc: cgroups-u79uwXL29TY76Z2rM5mHXA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA,
hannes-druUgvl0LCNAfugRpC6u6w
Hi Tejun,
On 2014/5/10 5:31, Tejun Heo wrote:
> Hello,
>
> Currently, while csses (cgroup_subsys_states) have ->parent linkage
> too, only cgroups form full tree through their ->children and
> ->sibling fields and css iterations naturally is implemented by
> iterating cgroups and then dereferencing the css for the specified
> subsystem.
>
> There are now use cases where controllers need to iterate through
> csses regardless of their online state as long as they have positive
What use cases are we talking about here?
> reference. This can't easily be achieved by iterating cgroups because
> its css pointer array needs to be cleared on offline and there may be
> multiple dying csses for a cgroup for the same subsystem and there's
> only one pointer per cgroup-subsystem pair.
>
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH 04/14] device_cgroup: remove direct access to cgroup->children
[not found] ` <1399671091-23867-5-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
2014-05-13 12:56 ` Aristeu Rozanski
@ 2014-05-14 12:52 ` Serge E. Hallyn
1 sibling, 0 replies; 48+ messages in thread
From: Serge E. Hallyn @ 2014-05-14 12:52 UTC (permalink / raw)
To: Tejun Heo
Cc: lizefan-hv44wF8Li93QT0dZR+AlfA, cgroups-u79uwXL29TY76Z2rM5mHXA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA,
hannes-druUgvl0LCNAfugRpC6u6w, Aristeu Rozanski, Serge Hallyn
Quoting Tejun Heo (tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org):
> Currently, devcg::has_children() directly tests cgroup->children for
> list emptiness. The field is not a published field and scheduled to
> go away. In addition, the test isn't strictly correct as devcg should
> only care about children which are visible to userland.
>
> This patch converts has_children() to use css_next_child() instead.
> The subtle incorrectness is noted and will be dealt with later.
>
> Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
> Cc: Aristeu Rozanski <aris-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> Cc: Serge Hallyn <serge.hallyn-GeWIH/nMZzLQT0dZR+AlfA@public.gmane.org>
Acked-by: Serge Hallyn <serge.hallyn-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>
> ---
> security/device_cgroup.c | 12 ++++++++++--
> 1 file changed, 10 insertions(+), 2 deletions(-)
>
> diff --git a/security/device_cgroup.c b/security/device_cgroup.c
> index 3116015..75b4b18 100644
> --- a/security/device_cgroup.c
> +++ b/security/device_cgroup.c
> @@ -477,9 +477,17 @@ static int propagate_exception(struct dev_cgroup *devcg_root,
>
> static inline bool has_children(struct dev_cgroup *devcgroup)
> {
> - struct cgroup *cgrp = devcgroup->css.cgroup;
> + bool ret;
>
> - return !list_empty(&cgrp->children);
> + /*
> + * FIXME: There may be lingering offline csses and this function
> + * may return %true when there isn't any userland-visible child
> + * which is incorrect for our purposes.
> + */
> + rcu_read_lock();
> + ret = css_next_child(NULL, &devcgroup->css);
> + rcu_read_unlock();
> + return ret;
> }
>
> /*
> --
> 1.9.0
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH 13/14] device_cgroup: use css_has_online_children() instead of has_children()
2014-05-09 21:31 ` [PATCH 13/14] device_cgroup: use css_has_online_children() instead of has_children() Tejun Heo
[not found] ` <1399671091-23867-14-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
@ 2014-05-14 12:53 ` Serge E. Hallyn
1 sibling, 0 replies; 48+ messages in thread
From: Serge E. Hallyn @ 2014-05-14 12:53 UTC (permalink / raw)
To: Tejun Heo
Cc: lizefan, cgroups, linux-kernel, hannes, Aristeu Rozanski,
Serge Hallyn
Quoting Tejun Heo (tj@kernel.org):
> devcgroup_update_access() wants to know whether there are child
> cgroups which are online and visible to userland and has_children()
> may return false positive. Replace it with css_has_online_children().
>
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Cc: Aristeu Rozanski <aris@redhat.com>
> Cc: Serge Hallyn <serge.hallyn@ubuntu.com>
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
> ---
> security/device_cgroup.c | 19 ++-----------------
> 1 file changed, 2 insertions(+), 17 deletions(-)
>
> diff --git a/security/device_cgroup.c b/security/device_cgroup.c
> index 75b4b18..22de334 100644
> --- a/security/device_cgroup.c
> +++ b/security/device_cgroup.c
> @@ -475,21 +475,6 @@ static int propagate_exception(struct dev_cgroup *devcg_root,
> return rc;
> }
>
> -static inline bool has_children(struct dev_cgroup *devcgroup)
> -{
> - bool ret;
> -
> - /*
> - * FIXME: There may be lingering offline csses and this function
> - * may return %true when there isn't any userland-visible child
> - * which is incorrect for our purposes.
> - */
> - rcu_read_lock();
> - ret = css_next_child(NULL, &devcgroup->css);
> - rcu_read_unlock();
> - return ret;
> -}
> -
> /*
> * Modify the exception list using allow/deny rules.
> * CAP_SYS_ADMIN is needed for this. It's at least separate from CAP_MKNOD
> @@ -522,7 +507,7 @@ static int devcgroup_update_access(struct dev_cgroup *devcgroup,
> case 'a':
> switch (filetype) {
> case DEVCG_ALLOW:
> - if (has_children(devcgroup))
> + if (css_has_online_children(&devcgroup->css))
> return -EINVAL;
>
> if (!may_allow_all(parent))
> @@ -538,7 +523,7 @@ static int devcgroup_update_access(struct dev_cgroup *devcgroup,
> return rc;
> break;
> case DEVCG_DENY:
> - if (has_children(devcgroup))
> + if (css_has_online_children(&devcgroup->css))
> return -EINVAL;
>
> dev_exception_clean(devcgroup);
> --
> 1.9.0
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCHSET cgroup/for-3.16] cgroup: iterate cgroup_subsys_states directly
[not found] ` <5372EF45.8060701-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
@ 2014-05-14 13:07 ` Tejun Heo
[not found] ` <20140514130722.GE28815-Gd/HAXX7CRxy/B6EtB590w@public.gmane.org>
0 siblings, 1 reply; 48+ messages in thread
From: Tejun Heo @ 2014-05-14 13:07 UTC (permalink / raw)
To: Li Zefan
Cc: cgroups-u79uwXL29TY76Z2rM5mHXA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA,
hannes-druUgvl0LCNAfugRpC6u6w
Hello, Li.
On Wed, May 14, 2014 at 12:21:25PM +0800, Li Zefan wrote:
> > There are now use cases where controllers need to iterate through
> > csses regardless of their online state as long as they have positive
>
> What use cases are we talking about here?
memcg wants to be able to iterate all csses whose refcnts haven't
reached zero yet so that it can treat offline csses the same way as
online ones in terms of memory reclaim. They don't contain new tasks
so new charges won't be created but offlining won't try to transfer
all charges to the parent but just leave the offline child attached
until all charges are eventually reclaimed from the pressure from the
parent.
I'm not too familiar with the details but this makes sense in generic
sense too. Offline marks an object starting its draining phase and
release marks the actual destruction point. For controllers with
persistent states like memcg, it's a lot more natural to deal offlined
csses as "active but draining following the usual hierarchical
operation" rather than trying to explicitly update the states from
offline to move them to the parent especially as the effort there
essentially is a waste as most of those moved charges aren't gonna be
used in the parent and will be released eventually.
Guaranteeing iteration of offline but not-released csses allow
controllers to treat the draining stage between offline and release
more or less identically to online state which in turn can make
->css_offline() significantly simpler and lighter.
Thanks.
--
tejun
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCHSET cgroup/for-3.16] cgroup: iterate cgroup_subsys_states directly
[not found] ` <20140514130722.GE28815-Gd/HAXX7CRxy/B6EtB590w@public.gmane.org>
@ 2014-05-16 1:28 ` Li Zefan
0 siblings, 0 replies; 48+ messages in thread
From: Li Zefan @ 2014-05-16 1:28 UTC (permalink / raw)
To: Tejun Heo
Cc: cgroups-u79uwXL29TY76Z2rM5mHXA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA,
hannes-druUgvl0LCNAfugRpC6u6w
On2014/5/14 21:07, Tejun Heo wrote:
> Hello, Li.
>
> On Wed, May 14, 2014 at 12:21:25PM +0800, Li Zefan wrote:
>>> There are now use cases where controllers need to iterate through
>>> csses regardless of their online state as long as they have positive
>>
>> What use cases are we talking about here?
>
> memcg wants to be able to iterate all csses whose refcnts haven't
> reached zero yet so that it can treat offline csses the same way as
> online ones in terms of memory reclaim. They don't contain new tasks
> so new charges won't be created but offlining won't try to transfer
> all charges to the parent but just leave the offline child attached
> until all charges are eventually reclaimed from the pressure from the
> parent.
>
> I'm not too familiar with the details but this makes sense in generic
> sense too. Offline marks an object starting its draining phase and
> release marks the actual destruction point. For controllers with
> persistent states like memcg, it's a lot more natural to deal offlined
> csses as "active but draining following the usual hierarchical
> operation" rather than trying to explicitly update the states from
> offline to move them to the parent especially as the effort there
> essentially is a waste as most of those moved charges aren't gonna be
> used in the parent and will be released eventually.
>
> Guaranteeing iteration of offline but not-released csses allow
> controllers to treat the draining stage between offline and release
> more or less identically to online state which in turn can make
> ->css_offline() significantly simpler and lighter.
>
yeah, fair enough.
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCHSET cgroup/for-3.16] cgroup: iterate cgroup_subsys_states directly
[not found] ` <1399671091-23867-1-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
` (5 preceding siblings ...)
2014-05-14 4:21 ` Li Zefan
@ 2014-05-16 1:29 ` Li Zefan
2014-05-16 16:08 ` Tejun Heo
7 siblings, 0 replies; 48+ messages in thread
From: Li Zefan @ 2014-05-16 1:29 UTC (permalink / raw)
To: Tejun Heo
Cc: cgroups-u79uwXL29TY76Z2rM5mHXA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA,
hannes-druUgvl0LCNAfugRpC6u6w
On 2014/5/10 5:31, Tejun Heo wrote:
> Hello,
>
> Currently, while csses (cgroup_subsys_states) have ->parent linkage
> too, only cgroups form full tree through their ->children and
> ->sibling fields and css iterations naturally is implemented by
> iterating cgroups and then dereferencing the css for the specified
> subsystem.
>
> There are now use cases where controllers need to iterate through
> csses regardless of their online state as long as they have positive
> reference. This can't easily be achieved by iterating cgroups because
> its css pointer array needs to be cleared on offline and there may be
> multiple dying csses for a cgroup for the same subsystem and there's
> only one pointer per cgroup-subsystem pair.
>
> This patchset moves ->children and ->sibling from cgroup to css and
> link all csses in proper trees and then make css iterators walk csses
> directly instead of going through cgroups. This achieves iteration of
> all non-released csses while also simplifying the iteration
> implementation. This is also in line with the general direction of
> using csses as the primary structural component.
>
> This patchset contains the following fourteen patches.
>
> 0001-cgroup-remove-css_parent.patch
> 0002-cgroup-remove-pointless-has-tasks-children-test-from.patch
> 0003-memcg-update-memcg_has_children-to-use-css_next_chil.patch
> 0004-device_cgroup-remove-direct-access-to-cgroup-childre.patch
> 0005-cgroup-remove-cgroup-parent.patch
> 0006-cgroup-move-cgroup-sibling-and-children-into-cgroup_.patch
> 0007-cgroup-link-all-cgroup_subsys_states-in-their-siblin.patch
> 0008-cgroup-move-cgroup-serial_nr-into-cgroup_subsys_stat.patch
> 0009-cgroup-introduce-CSS_RELEASED-and-reduce-css-iterati.patch
> 0010-cgroup-iterate-cgroup_subsys_states-directly.patch
> 0011-cgroup-use-CSS_ONLINE-instead-of-CGRP_DEAD.patch
> 0012-cgroup-convert-cgroup_has_live_children-into-css_has.patch
> 0013-device_cgroup-use-css_has_online_children-instead-of.patch
> 0014-cgroup-implement-css_tryget.patch
>
> 0001-0004 are prep patches.
>
> 0005-0008 move fields from cgroup to css and link csses in tree
> structure instead of cgroups.
>
> 0009-0010 implement direct css iteration.
>
> 0011-0013 convert a cgroup based interface to a css one, which is now
> possible as both are the same in terms of the tree structure, and fix
> devcg brekage using it.
>
> 0014 implements css_tryget() which is to be used to gain access to
> offline but not-yet-released csses.
>
> This pachset is on top of
>
> b9a63d0116e8 ("Merge branch 'for-3.16' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu into for-3.16")
> + [1] [PATCHSET v2 cgroup/for-3.16] cgroup: post unified hierarchy fixes and updates
> + [2] (REFRESHED) [PATCHSET cgroup/for-3.16] cgroup: implement cftype->write()
> + [3] (REFRESHED) [PATCHSET cgroup/for-3.16] cgroup: remove cgroup_tree_mutex
> + [4] [PATCHSET cgroup/for-3.16] cgroup: use css->refcnt for cgroup reference counting
>
> and available in the following git branch.
>
> git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git review-direct-css-iteration
>
> diffstat follows. Thanks.
>
> block/blk-cgroup.h | 2
> include/linux/cgroup.h | 122 +++++++++++---------
> kernel/cgroup.c | 257 ++++++++++++++++++++++++-------------------
> kernel/cgroup_freezer.c | 2
> kernel/cpuset.c | 2
> kernel/sched/core.c | 2
> kernel/sched/cpuacct.c | 2
> mm/hugetlb_cgroup.c | 2
> mm/memcontrol.c | 45 +++----
> net/core/netclassid_cgroup.c | 2
> net/core/netprio_cgroup.c | 2
> security/device_cgroup.c | 17 --
> 12 files changed, 251 insertions(+), 206 deletions(-)
>
Acked-by: Li Zefan <lizefan-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
^ permalink raw reply [flat|nested] 48+ messages in thread
* [PATCH v2 09/14] cgroup: introduce CSS_RELEASED and reduce css iteration fallback window
[not found] ` <1399671091-23867-10-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
@ 2014-05-16 16:07 ` Tejun Heo
0 siblings, 0 replies; 48+ messages in thread
From: Tejun Heo @ 2014-05-16 16:07 UTC (permalink / raw)
To: lizefan-hv44wF8Li93QT0dZR+AlfA
Cc: cgroups-u79uwXL29TY76Z2rM5mHXA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA,
hannes-druUgvl0LCNAfugRpC6u6w
From 55e2f1d4b69516f899712c04df6635309bf06983 Mon Sep 17 00:00:00 2001
From: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
Date: Fri, 16 May 2014 11:40:35 -0400
css iterations allow the caller to drop RCU read lock. As long as the
caller keeps the current position accessible, it can simply re-grab
RCU read lock later and continue iteration. This is achieved by using
CGRP_DEAD to detect whether the current positions next pointer is safe
to dereference and if not re-iterate from the beginning to the next
position using ->serial_nr.
CGRP_DEAD is used as the marker to invalidate the next pointer and the
only requirement is that the marker is set before the next sibling
starts its RCU grace period. Because CGRP_DEAD is set at the end of
cgroup_destroy_locked() but the cgroup is unlinked when the reference
count reaches zero, we currently have a rather large window where this
fallback re-iteration logic can be triggered.
This patch introduces CSS_RELEASED which is set when a css is unlinked
from its sibling list. This still keeps the re-iteration logic
working while drastically reducing the window of its activation.
While at it, rewrite the comment in css_next_child() to reflect the
new flag and better explain the synchronization.
This will also enable iterating csses directly instead of through
cgroups.
v2: CSS_RELEASED now assigned to 1 << 2 as 1 << 0 is used by
CSS_NO_REF.
Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
Acked-by: Li Zefan <lizefan-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
---
include/linux/cgroup.h | 1 +
kernel/cgroup.c | 41 ++++++++++++++++++++---------------------
2 files changed, 21 insertions(+), 21 deletions(-)
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index ebe7ce4..5375582e 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -97,6 +97,7 @@ struct cgroup_subsys_state {
enum {
CSS_NO_REF = (1 << 0), /* no reference counting for this css */
CSS_ONLINE = (1 << 1), /* between ->css_online() and ->css_offline() */
+ CSS_RELEASED = (1 << 2), /* refcnt reached zero, released */
};
/**
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index eebae99..7df818b 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -3108,27 +3108,28 @@ css_next_child(struct cgroup_subsys_state *pos_css,
cgroup_assert_mutex_or_rcu_locked();
/*
- * @pos could already have been removed. Once a cgroup is removed,
- * its ->sibling.next is no longer updated when its next sibling
- * changes. As CGRP_DEAD assertion is serialized and happens
- * before the cgroup is taken off the ->sibling list, if we see it
- * unasserted, it's guaranteed that the next sibling hasn't
- * finished its grace period even if it's already removed, and thus
- * safe to dereference from this RCU critical section. If
- * ->sibling.next is inaccessible, cgroup_is_dead() is guaranteed
- * to be visible as %true here.
+ * @pos could already have been unlinked from the sibling list.
+ * Once a cgroup is removed, its ->sibling.next is no longer
+ * updated when its next sibling changes. CSS_RELEASED is set when
+ * @pos is taken off list, at which time its next pointer is valid,
+ * and, as releases are serialized, the one pointed to by the next
+ * pointer is guaranteed to not have started release yet. This
+ * implies that if we observe !CSS_RELEASED on @pos in this RCU
+ * critical section, the one pointed to by its next pointer is
+ * guaranteed to not have finished its RCU grace period even if we
+ * have dropped rcu_read_lock() inbetween iterations.
*
- * If @pos is dead, its next pointer can't be dereferenced;
- * however, as each cgroup is given a monotonically increasing
- * unique serial number and always appended to the sibling list,
- * the next one can be found by walking the parent's children until
- * we see a cgroup with higher serial number than @pos's. While
- * this path can be slower, it's taken only when either the current
- * cgroup is removed or iteration and removal race.
+ * If @pos has CSS_RELEASED set, its next pointer can't be
+ * dereferenced; however, as each css is given a monotonically
+ * increasing unique serial number and always appended to the
+ * sibling list, the next one can be found by walking the parent's
+ * children until the first css with higher serial number than
+ * @pos's. While this path can be slower, it happens iff iteration
+ * races against release and the race window is very small.
*/
if (!pos) {
next = list_entry_rcu(cgrp->self.children.next, struct cgroup, self.sibling);
- } else if (likely(!cgroup_is_dead(pos))) {
+ } else if (likely(!(pos->self.flags & CSS_RELEASED))) {
next = list_entry_rcu(pos->self.sibling.next, struct cgroup, self.sibling);
} else {
list_for_each_entry_rcu(next, &cgrp->self.children, self.sibling)
@@ -4139,6 +4140,7 @@ static void css_release_work_fn(struct work_struct *work)
mutex_lock(&cgroup_mutex);
+ css->flags |= CSS_RELEASED;
list_del_rcu(&css->sibling);
if (ss) {
@@ -4525,10 +4527,7 @@ static int cgroup_destroy_locked(struct cgroup *cgrp)
/*
* Mark @cgrp dead. This prevents further task migration and child
- * creation by disabling cgroup_lock_live_group(). Note that
- * CGRP_DEAD assertion is depended upon by css_next_child() to
- * resume iteration after dropping RCU read lock. See
- * css_next_child() for details.
+ * creation by disabling cgroup_lock_live_group().
*/
set_bit(CGRP_DEAD, &cgrp->flags);
--
1.9.0
^ permalink raw reply related [flat|nested] 48+ messages in thread
* [PATCH v2 14/14] cgroup: implement css_tryget()
[not found] ` <1399671091-23867-15-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
2014-05-11 4:54 ` Johannes Weiner
@ 2014-05-16 16:07 ` Tejun Heo
1 sibling, 0 replies; 48+ messages in thread
From: Tejun Heo @ 2014-05-16 16:07 UTC (permalink / raw)
To: lizefan-hv44wF8Li93QT0dZR+AlfA
Cc: cgroups-u79uwXL29TY76Z2rM5mHXA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA,
hannes-druUgvl0LCNAfugRpC6u6w
From 141fdfadf18c10fed33624bbd414fcc0cb74d7ca Mon Sep 17 00:00:00 2001
From: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
Date: Fri, 16 May 2014 11:40:38 -0400
Implement css_tryget() which tries to grab a cgroup_subsys_state's
reference as long as it already hasn't reached zero. Combined with
the recent css iterator changes to include offline && !released csses
during traversal, this can be used to access csses regardless of its
online state.
v2: Take the new flag CSS_NO_REF into account.
Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
Acked-by: Li Zefan <lizefan-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
Cc: Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
---
include/linux/cgroup.h | 18 ++++++++++++++++++
1 file changed, 18 insertions(+)
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index b769999..4afe544 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -113,6 +113,24 @@ static inline void css_get(struct cgroup_subsys_state *css)
}
/**
+ * css_tryget - try to obtain a reference on the specified css
+ * @css: target css
+ *
+ * Obtain a reference on @css unless it already has reached zero and is
+ * being released. This function doesn't care whether @css is on or
+ * offline. The caller naturally needs to ensure that @css is accessible
+ * but doesn't have to be holding a reference on it - IOW, RCU protected
+ * access is good enough for this function. Returns %true if a reference
+ * count was successfully obtained; %false otherwise.
+ */
+static inline bool css_tryget(struct cgroup_subsys_state *css)
+{
+ if (!(css->flags & CSS_NO_REF))
+ return percpu_ref_tryget(&css->refcnt);
+ return true;
+}
+
+/**
* css_tryget_online - try to obtain a reference on the specified css if online
* @css: target css
*
--
1.9.0
^ permalink raw reply related [flat|nested] 48+ messages in thread
* Re: [PATCHSET cgroup/for-3.16] cgroup: iterate cgroup_subsys_states directly
[not found] ` <1399671091-23867-1-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
` (6 preceding siblings ...)
2014-05-16 1:29 ` Li Zefan
@ 2014-05-16 16:08 ` Tejun Heo
7 siblings, 0 replies; 48+ messages in thread
From: Tejun Heo @ 2014-05-16 16:08 UTC (permalink / raw)
To: lizefan-hv44wF8Li93QT0dZR+AlfA
Cc: cgroups-u79uwXL29TY76Z2rM5mHXA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA,
hannes-druUgvl0LCNAfugRpC6u6w
On Fri, May 09, 2014 at 05:31:17PM -0400, Tejun Heo wrote:
> Hello,
>
> Currently, while csses (cgroup_subsys_states) have ->parent linkage
> too, only cgroups form full tree through their ->children and
> ->sibling fields and css iterations naturally is implemented by
> iterating cgroups and then dereferencing the css for the specified
> subsystem.
>
> There are now use cases where controllers need to iterate through
> csses regardless of their online state as long as they have positive
> reference. This can't easily be achieved by iterating cgroups because
> its css pointer array needs to be cleared on offline and there may be
> multiple dying csses for a cgroup for the same subsystem and there's
> only one pointer per cgroup-subsystem pair.
>
> This patchset moves ->children and ->sibling from cgroup to css and
> link all csses in proper trees and then make css iterators walk csses
> directly instead of going through cgroups. This achieves iteration of
> all non-released csses while also simplifying the iteration
> implementation. This is also in line with the general direction of
> using csses as the primary structural component.
>
> This patchset contains the following fourteen patches.
>
> 0001-cgroup-remove-css_parent.patch
> 0002-cgroup-remove-pointless-has-tasks-children-test-from.patch
> 0003-memcg-update-memcg_has_children-to-use-css_next_chil.patch
> 0004-device_cgroup-remove-direct-access-to-cgroup-childre.patch
> 0005-cgroup-remove-cgroup-parent.patch
> 0006-cgroup-move-cgroup-sibling-and-children-into-cgroup_.patch
> 0007-cgroup-link-all-cgroup_subsys_states-in-their-siblin.patch
> 0008-cgroup-move-cgroup-serial_nr-into-cgroup_subsys_stat.patch
> 0009-cgroup-introduce-CSS_RELEASED-and-reduce-css-iterati.patch
> 0010-cgroup-iterate-cgroup_subsys_states-directly.patch
> 0011-cgroup-use-CSS_ONLINE-instead-of-CGRP_DEAD.patch
> 0012-cgroup-convert-cgroup_has_live_children-into-css_has.patch
> 0013-device_cgroup-use-css_has_online_children-instead-of.patch
> 0014-cgroup-implement-css_tryget.patch
Applied to cgroup/for-3.16 with the minor updates to accomodate
CSS_NO_REF.
Thanks.
--
tejun
^ permalink raw reply [flat|nested] 48+ messages in thread
end of thread, other threads:[~2014-05-16 16:08 UTC | newest]
Thread overview: 48+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2014-05-09 21:31 [PATCHSET cgroup/for-3.16] cgroup: iterate cgroup_subsys_states directly Tejun Heo
2014-05-09 21:31 ` [PATCH 01/14] cgroup: remove css_parent() Tejun Heo
[not found] ` <1399671091-23867-2-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
2014-05-11 1:47 ` David Miller
2014-05-11 13:02 ` Neil Horman
2014-05-13 18:50 ` [PATCH v2 " Tejun Heo
2014-05-12 13:16 ` [PATCH " Michal Hocko
[not found] ` <1399671091-23867-1-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
2014-05-09 21:31 ` [PATCH 02/14] cgroup: remove pointless has tasks/children test from mem_cgroup_force_empty() Tejun Heo
[not found] ` <1399671091-23867-3-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
2014-05-12 14:53 ` Michal Hocko
[not found] ` <20140512145324.GE9564-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
2014-05-12 14:58 ` [PATCH] memcg: deprecate memory.force_empty knob Michal Hocko
[not found] ` <20140512145803.GF9564-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
2014-05-12 15:00 ` Tejun Heo
[not found] ` <20140512150014.GB1421-Gd/HAXX7CRxy/B6EtB590w@public.gmane.org>
2014-05-12 15:20 ` Michal Hocko
2014-05-12 15:25 ` Tejun Heo
[not found] ` <20140512152507.GD1421-Gd/HAXX7CRxy/B6EtB590w@public.gmane.org>
2014-05-12 15:34 ` Michal Hocko
[not found] ` <20140512153458.GJ9564-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
2014-05-13 13:16 ` Johannes Weiner
[not found] ` <20140513131655.GC18849-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
2014-05-13 15:09 ` Michal Hocko
2014-05-12 14:59 ` [PATCH 02/14] cgroup: remove pointless has tasks/children test from mem_cgroup_force_empty() Tejun Heo
[not found] ` <20140512145927.GA1421-Gd/HAXX7CRxy/B6EtB590w@public.gmane.org>
2014-05-12 15:21 ` Michal Hocko
2014-05-13 13:10 ` Johannes Weiner
2014-05-13 16:46 ` Tejun Heo
2014-05-13 18:51 ` [PATCH UPDATED 02/14] memcg: remove " Tejun Heo
2014-05-09 21:31 ` [PATCH 03/14] memcg: update memcg_has_children() to use css_next_child() Tejun Heo
[not found] ` <1399671091-23867-4-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
2014-05-12 15:18 ` Michal Hocko
2014-05-13 16:53 ` [PATCH v2 " Tejun Heo
2014-05-09 21:31 ` [PATCH 06/14] cgroup: move cgroup->sibling and ->children into cgroup_subsys_state Tejun Heo
2014-05-09 21:31 ` [PATCH 07/14] cgroup: link all cgroup_subsys_states in their sibling lists Tejun Heo
2014-05-13 16:59 ` [PATCHSET cgroup/for-3.16] cgroup: iterate cgroup_subsys_states directly Tejun Heo
2014-05-14 4:21 ` Li Zefan
[not found] ` <5372EF45.8060701-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
2014-05-14 13:07 ` Tejun Heo
[not found] ` <20140514130722.GE28815-Gd/HAXX7CRxy/B6EtB590w@public.gmane.org>
2014-05-16 1:28 ` Li Zefan
2014-05-16 1:29 ` Li Zefan
2014-05-16 16:08 ` Tejun Heo
2014-05-09 21:31 ` [PATCH 04/14] device_cgroup: remove direct access to cgroup->children Tejun Heo
[not found] ` <1399671091-23867-5-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
2014-05-13 12:56 ` Aristeu Rozanski
2014-05-14 12:52 ` Serge E. Hallyn
2014-05-09 21:31 ` [PATCH 05/14] cgroup: remove cgroup->parent Tejun Heo
2014-05-09 21:31 ` [PATCH 08/14] cgroup: move cgroup->serial_nr into cgroup_subsys_state Tejun Heo
2014-05-09 21:31 ` [PATCH 09/14] cgroup: introduce CSS_RELEASED and reduce css iteration fallback window Tejun Heo
[not found] ` <1399671091-23867-10-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
2014-05-16 16:07 ` [PATCH v2 " Tejun Heo
2014-05-09 21:31 ` [PATCH 10/14] cgroup: iterate cgroup_subsys_states directly Tejun Heo
2014-05-09 21:31 ` [PATCH 11/14] cgroup: use CSS_ONLINE instead of CGRP_DEAD Tejun Heo
2014-05-09 21:31 ` [PATCH 12/14] cgroup: convert cgroup_has_live_children() into css_has_online_children() Tejun Heo
2014-05-09 21:31 ` [PATCH 13/14] device_cgroup: use css_has_online_children() instead of has_children() Tejun Heo
[not found] ` <1399671091-23867-14-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
2014-05-13 12:56 ` Aristeu Rozanski
2014-05-14 12:53 ` Serge E. Hallyn
2014-05-09 21:31 ` [PATCH 14/14] cgroup: implement css_tryget() Tejun Heo
[not found] ` <1399671091-23867-15-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
2014-05-11 4:54 ` Johannes Weiner
[not found] ` <20140511045459.GA25009-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
2014-05-11 12:38 ` Tejun Heo
2014-05-16 16:07 ` [PATCH v2 " Tejun Heo
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).