* [PATCH v3 0/6] replace cgroup_lock with memcg specific locking
@ 2013-01-21 11:13 Glauber Costa
2013-01-21 11:13 ` [PATCH v3 1/6] memcg: prevent changes to move_charge_at_immigrate during task attach Glauber Costa
` (5 more replies)
0 siblings, 6 replies; 21+ messages in thread
From: Glauber Costa @ 2013-01-21 11:13 UTC (permalink / raw)
To: cgroups; +Cc: linux-mm, Tejun Heo, Michal Hocko, Johannes Weiner,
kamezawa.hiroyu
Hi,
In memcg, we use the cgroup_lock basically to synchronize against
attaching new children to a cgroup. We do this because we rely on cgroup core to
provide us with this information.
We need to guarantee that upon child creation, our tunables are consistent.
For those, the calls to cgroup_lock() all live in handlers like
mem_cgroup_hierarchy_write(), where we change a tunable in the group that is
hierarchy-related. For instance, the use_hierarchy flag cannot be changed if
the cgroup already has children.
Furthermore, those values are propagated from the parent to the child when a
new child is created. So if we don't lock like this, we can end up with the
following situation:
            A                                  B
     memcg_css_alloc()
       copy use_hierarchy from parent
                                        mem_cgroup_hierarchy_write()
                                          change use_hierarchy in parent
       finish creation.
This is mainly because during creation, we are still not fully connected to the
css tree. So all iterators and such that we could use will fail to show
that the group has children.
My observation is that all of creation can proceed in parallel with those
tasks, except value assignment. So what this patch series does is first move
all value assignment that is dependent on parent values from css_alloc to
css_online, where the iterators all work, and then lock only the value
assignment. This guarantees that parent and children always have consistent
values. Together with an online test, which can be derived from the observation
that the refcount of an online memcg can be made to be always positive, we
should be able to synchronize our side without the cgroup lock.
*v3:
- simplified test for presence of children, and no longer using refcnt for
online testing
- some cleanups as suggested by Michal
*v2:
- sanitize kmemcg assignment in the light of the current locking change.
- don't grab locks on immigrate charges by caching the value during can_attach
Glauber Costa (6):
memcg: prevent changes to move_charge_at_immigrate during task attach
memcg: split part of memcg creation to css_online
memcg: fast hierarchy-aware child test.
memcg: replace cgroup_lock with memcg specific memcg_lock
memcg: increment static branch right after limit set.
memcg: avoid dangling reference count in creation failure.
mm/memcontrol.c | 210 +++++++++++++++++++++++++++++++++-----------------------
1 file changed, 123 insertions(+), 87 deletions(-)
--
1.8.1
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org
^ permalink raw reply [flat|nested] 21+ messages in thread
* [PATCH v3 1/6] memcg: prevent changes to move_charge_at_immigrate during task attach
2013-01-21 11:13 [PATCH v3 0/6] replace cgroup_lock with memcg specific locking Glauber Costa
@ 2013-01-21 11:13 ` Glauber Costa
2013-01-21 11:13 ` [PATCH v3 2/6] memcg: split part of memcg creation to css_online Glauber Costa
` (4 subsequent siblings)
5 siblings, 0 replies; 21+ messages in thread
From: Glauber Costa @ 2013-01-21 11:13 UTC (permalink / raw)
To: cgroups
Cc: linux-mm, Tejun Heo, Michal Hocko, Johannes Weiner,
kamezawa.hiroyu, Glauber Costa
Currently, we rely on the cgroup_lock() to prevent changes to
move_charge_at_immigrate during task migration. However, this is only
needed because the current strategy keeps checking this value throughout
the whole process. Since all we need is serialization, we need only
guarantee that whatever decision we made at the beginning of a specific
migration is respected throughout the process.
We can achieve this by just saving it in mc. By doing this, no kind of
locking is needed.
[ v2: change flag name to avoid confusion ]
Signed-off-by: Glauber Costa <glommer@parallels.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
---
mm/memcontrol.c | 32 +++++++++++++++++++-------------
1 file changed, 19 insertions(+), 13 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 09255ec..91d90a0 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -398,8 +398,8 @@ static bool memcg_kmem_test_and_clear_dead(struct mem_cgroup *memcg)
/* Stuffs for move charges at task migration. */
/*
- * Types of charges to be moved. "move_charge_at_immitgrate" is treated as a
- * left-shifted bitmap of these types.
+ * Types of charges to be moved. "move_charge_at_immigrate" and
+ * "immigrate_flags" are treated as a left-shifted bitmap of these types.
*/
enum move_type {
MOVE_CHARGE_TYPE_ANON, /* private anonymous page and swap of it */
@@ -412,6 +412,7 @@ static struct move_charge_struct {
spinlock_t lock; /* for from, to */
struct mem_cgroup *from;
struct mem_cgroup *to;
+ unsigned long immigrate_flags;
unsigned long precharge;
unsigned long moved_charge;
unsigned long moved_swap;
@@ -424,14 +425,12 @@ static struct move_charge_struct {
static bool move_anon(void)
{
- return test_bit(MOVE_CHARGE_TYPE_ANON,
- &mc.to->move_charge_at_immigrate);
+ return test_bit(MOVE_CHARGE_TYPE_ANON, &mc.immigrate_flags);
}
static bool move_file(void)
{
- return test_bit(MOVE_CHARGE_TYPE_FILE,
- &mc.to->move_charge_at_immigrate);
+ return test_bit(MOVE_CHARGE_TYPE_FILE, &mc.immigrate_flags);
}
/*
@@ -5146,15 +5145,14 @@ static int mem_cgroup_move_charge_write(struct cgroup *cgrp,
if (val >= (1 << NR_MOVE_TYPE))
return -EINVAL;
+
/*
- * We check this value several times in both in can_attach() and
- * attach(), so we need cgroup lock to prevent this value from being
- * inconsistent.
+ * No kind of locking is needed in here, because ->can_attach() will
+ * check this value once in the beginning of the process, and then carry
+ * on with stale data. This means that changes to this value will only
+ * affect task migrations starting after the change.
*/
- cgroup_lock();
memcg->move_charge_at_immigrate = val;
- cgroup_unlock();
-
return 0;
}
#else
@@ -6530,8 +6528,15 @@ static int mem_cgroup_can_attach(struct cgroup *cgroup,
struct task_struct *p = cgroup_taskset_first(tset);
int ret = 0;
struct mem_cgroup *memcg = mem_cgroup_from_cont(cgroup);
+ unsigned long move_charge_at_immigrate;
- if (memcg->move_charge_at_immigrate) {
+ /*
+ * We are now committed to this value, whatever it is. Changes in this
+ * tunable will only affect upcoming migrations, not the current one.
+ * So we need to save it, and keep it going.
+ */
+ move_charge_at_immigrate = memcg->move_charge_at_immigrate;
+ if (move_charge_at_immigrate) {
struct mm_struct *mm;
struct mem_cgroup *from = mem_cgroup_from_task(p);
@@ -6551,6 +6556,7 @@ static int mem_cgroup_can_attach(struct cgroup *cgroup,
spin_lock(&mc.lock);
mc.from = from;
mc.to = memcg;
+ mc.immigrate_flags = move_charge_at_immigrate;
spin_unlock(&mc.lock);
/* We set mc.moving_task later */
--
1.8.1
* [PATCH v3 2/6] memcg: split part of memcg creation to css_online
2013-01-21 11:13 [PATCH v3 0/6] replace cgroup_lock with memcg specific locking Glauber Costa
2013-01-21 11:13 ` [PATCH v3 1/6] memcg: prevent changes to move_charge_at_immigrate during task attach Glauber Costa
@ 2013-01-21 11:13 ` Glauber Costa
2013-01-21 13:56 ` Michal Hocko
2013-01-21 11:13 ` [PATCH v3 3/6] memcg: fast hierarchy-aware child test Glauber Costa
` (3 subsequent siblings)
5 siblings, 1 reply; 21+ messages in thread
From: Glauber Costa @ 2013-01-21 11:13 UTC (permalink / raw)
To: cgroups
Cc: linux-mm, Tejun Heo, Michal Hocko, Johannes Weiner,
kamezawa.hiroyu, Glauber Costa
This patch is preparatory work for a later locking rework that gets rid of
the big cgroup lock in the memory controller code.
The memory controller uses some tunables to adjust its operation. Those
tunables are inherited from parent to child upon child
initialization. For most of them, the value cannot be changed after the
parent has children.
cgroup core splits initialization in two phases: css_alloc and css_online.
After css_alloc, the memory allocation and basic initialization are
done. But the new group is not yet visible anywhere, not even to cgroup
core code. It is only somewhere between css_alloc and css_online that it
is inserted into the internal children lists. Copying tunable values in
css_alloc will lead to inconsistent values: the child will copy the
old parent values, which can change between the copy and the moment in
which the group is linked to any data structure that can indicate the
presence of children.
Signed-off-by: Glauber Costa <glommer@parallels.com>
---
mm/memcontrol.c | 61 ++++++++++++++++++++++++++++++++++++---------------------
1 file changed, 39 insertions(+), 22 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 91d90a0..6c72204 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6063,7 +6063,7 @@ err_cleanup:
static struct cgroup_subsys_state * __ref
mem_cgroup_css_alloc(struct cgroup *cont)
{
- struct mem_cgroup *memcg, *parent;
+ struct mem_cgroup *memcg;
long error = -ENOMEM;
int node;
@@ -6079,7 +6079,6 @@ mem_cgroup_css_alloc(struct cgroup *cont)
if (cont->parent == NULL) {
int cpu;
enable_swap_cgroup();
- parent = NULL;
if (mem_cgroup_soft_limit_tree_init())
goto free_out;
root_mem_cgroup = memcg;
@@ -6088,13 +6087,43 @@ mem_cgroup_css_alloc(struct cgroup *cont)
&per_cpu(memcg_stock, cpu);
INIT_WORK(&stock->work, drain_local_stock);
}
- } else {
- parent = mem_cgroup_from_cont(cont->parent);
- memcg->use_hierarchy = parent->use_hierarchy;
- memcg->oom_kill_disable = parent->oom_kill_disable;
+
+ res_counter_init(&memcg->res, NULL);
+ res_counter_init(&memcg->memsw, NULL);
+ res_counter_init(&memcg->kmem, NULL);
}
- if (parent && parent->use_hierarchy) {
+ memcg->last_scanned_node = MAX_NUMNODES;
+ INIT_LIST_HEAD(&memcg->oom_notify);
+ atomic_set(&memcg->refcnt, 1);
+ memcg->move_charge_at_immigrate = 0;
+ mutex_init(&memcg->thresholds_lock);
+ spin_lock_init(&memcg->move_lock);
+
+ return &memcg->css;
+
+free_out:
+ __mem_cgroup_free(memcg);
+ return ERR_PTR(error);
+}
+
+static int
+mem_cgroup_css_online(struct cgroup *cont)
+{
+ struct mem_cgroup *memcg, *parent;
+ int error = 0;
+
+ if (!cont->parent)
+ return 0;
+
+ memcg = mem_cgroup_from_cont(cont);
+ parent = mem_cgroup_from_cont(cont->parent);
+
+ memcg->use_hierarchy = parent->use_hierarchy;
+ memcg->oom_kill_disable = parent->oom_kill_disable;
+ memcg->swappiness = mem_cgroup_swappiness(parent);
+
+ if (parent->use_hierarchy) {
res_counter_init(&memcg->res, &parent->res);
res_counter_init(&memcg->memsw, &parent->memsw);
res_counter_init(&memcg->kmem, &parent->kmem);
@@ -6115,18 +6144,9 @@ mem_cgroup_css_alloc(struct cgroup *cont)
* much sense so let cgroup subsystem know about this
* unfortunate state in our controller.
*/
- if (parent && parent != root_mem_cgroup)
+ if (parent != root_mem_cgroup)
mem_cgroup_subsys.broken_hierarchy = true;
}
- memcg->last_scanned_node = MAX_NUMNODES;
- INIT_LIST_HEAD(&memcg->oom_notify);
-
- if (parent)
- memcg->swappiness = mem_cgroup_swappiness(parent);
- atomic_set(&memcg->refcnt, 1);
- memcg->move_charge_at_immigrate = 0;
- mutex_init(&memcg->thresholds_lock);
- spin_lock_init(&memcg->move_lock);
error = memcg_init_kmem(memcg, &mem_cgroup_subsys);
if (error) {
@@ -6136,12 +6156,8 @@ mem_cgroup_css_alloc(struct cgroup *cont)
* call __mem_cgroup_free, so return directly
*/
mem_cgroup_put(memcg);
- return ERR_PTR(error);
}
- return &memcg->css;
-free_out:
- __mem_cgroup_free(memcg);
- return ERR_PTR(error);
+ return error;
}
static void mem_cgroup_css_offline(struct cgroup *cont)
@@ -6751,6 +6767,7 @@ struct cgroup_subsys mem_cgroup_subsys = {
.name = "memory",
.subsys_id = mem_cgroup_subsys_id,
.css_alloc = mem_cgroup_css_alloc,
+ .css_online = mem_cgroup_css_online,
.css_offline = mem_cgroup_css_offline,
.css_free = mem_cgroup_css_free,
.can_attach = mem_cgroup_can_attach,
--
1.8.1
* [PATCH v3 3/6] memcg: fast hierarchy-aware child test.
2013-01-21 11:13 [PATCH v3 0/6] replace cgroup_lock with memcg specific locking Glauber Costa
2013-01-21 11:13 ` [PATCH v3 1/6] memcg: prevent changes to move_charge_at_immigrate during task attach Glauber Costa
2013-01-21 11:13 ` [PATCH v3 2/6] memcg: split part of memcg creation to css_online Glauber Costa
@ 2013-01-21 11:13 ` Glauber Costa
2013-01-21 14:10 ` Michal Hocko
2013-01-21 11:13 ` [PATCH v3 4/6] memcg: replace cgroup_lock with memcg specific memcg_lock Glauber Costa
` (2 subsequent siblings)
5 siblings, 1 reply; 21+ messages in thread
From: Glauber Costa @ 2013-01-21 11:13 UTC (permalink / raw)
To: cgroups
Cc: linux-mm, Tejun Heo, Michal Hocko, Johannes Weiner,
kamezawa.hiroyu, Glauber Costa
Currently, we use cgroups' provided list of children to verify if it is
safe to proceed with any value change that is dependent on the cgroup
being empty.
This is less than ideal, because it enforces a dependency on cgroup
core that we would be better off without. The solution proposed here is
to iterate over the child cgroups and, if any is found that is already
online, bounce and return: we don't really care how many children we
have, only whether we have any.
The test is also made hierarchy aware. In other words, cgroups with hierarchy
disabled, while they still exist, will be considered for the purposes of
this interface as having no children.
Signed-off-by: Glauber Costa <glommer@parallels.com>
---
mm/memcontrol.c | 34 +++++++++++++++++++++++++++++-----
1 file changed, 29 insertions(+), 5 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 6c72204..6d3ad21 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4716,6 +4716,32 @@ static void mem_cgroup_reparent_charges(struct mem_cgroup *memcg)
}
/*
+ * This mainly exists for tests during the setting of use_hierarchy. Since this
+ * is the very setting we are changing, the current hierarchy value is meaningless
+ */
+static inline bool __memcg_has_children(struct mem_cgroup *memcg)
+{
+ struct cgroup *pos;
+
+ /* bounce at first found */
+ cgroup_for_each_child(pos, memcg->css.cgroup)
+ return true;
+ return false;
+}
+
+/*
+ * must be called with cgroup_lock held, unless the cgroup is guaranteed to be
+ * already dead (like in mem_cgroup_force_empty, for instance). This is
+ * different than mem_cgroup_count_children, in the sense that we don't really
+ * care how many children we have, we only need to know if we have any. It is
+ * also count any memcg without hierarchy as infertile for that matter.
+ */
+static inline bool memcg_has_children(struct mem_cgroup *memcg)
+{
+ return memcg->use_hierarchy && __memcg_has_children(memcg);
+}
+
+/*
* Reclaims as many pages from the given memcg as possible and moves
* the rest to the parent.
*
@@ -4800,7 +4826,7 @@ static int mem_cgroup_hierarchy_write(struct cgroup *cont, struct cftype *cft,
*/
if ((!parent_memcg || !parent_memcg->use_hierarchy) &&
(val == 1 || val == 0)) {
- if (list_empty(&cont->children))
+ if (!__memcg_has_children(memcg))
memcg->use_hierarchy = val;
else
retval = -EBUSY;
@@ -4917,8 +4943,7 @@ static int memcg_update_kmem_limit(struct cgroup *cont, u64 val)
cgroup_lock();
mutex_lock(&set_limit_mutex);
if (!memcg->kmem_account_flags && val != RESOURCE_MAX) {
- if (cgroup_task_count(cont) || (memcg->use_hierarchy &&
- !list_empty(&cont->children))) {
+ if (cgroup_task_count(cont) || memcg_has_children(memcg)) {
ret = -EBUSY;
goto out;
}
@@ -5334,8 +5359,7 @@ static int mem_cgroup_swappiness_write(struct cgroup *cgrp, struct cftype *cft,
cgroup_lock();
/* If under hierarchy, only empty-root can set this value */
- if ((parent->use_hierarchy) ||
- (memcg->use_hierarchy && !list_empty(&cgrp->children))) {
+ if ((parent->use_hierarchy) || memcg_has_children(memcg)) {
cgroup_unlock();
return -EINVAL;
}
--
1.8.1
* [PATCH v3 4/6] memcg: replace cgroup_lock with memcg specific memcg_lock
2013-01-21 11:13 [PATCH v3 0/6] replace cgroup_lock with memcg specific locking Glauber Costa
` (2 preceding siblings ...)
2013-01-21 11:13 ` [PATCH v3 3/6] memcg: fast hierarchy-aware child test Glauber Costa
@ 2013-01-21 11:13 ` Glauber Costa
2013-01-21 14:49 ` Michal Hocko
2013-01-21 11:13 ` [PATCH v3 5/6] memcg: increment static branch right after limit set Glauber Costa
2013-01-21 11:13 ` [PATCH v3 6/6] memcg: avoid dangling reference count in creation failure Glauber Costa
5 siblings, 1 reply; 21+ messages in thread
From: Glauber Costa @ 2013-01-21 11:13 UTC (permalink / raw)
To: cgroups
Cc: linux-mm, Tejun Heo, Michal Hocko, Johannes Weiner,
kamezawa.hiroyu, Glauber Costa
After the preparation work done in earlier patches, the cgroup_lock can
be trivially replaced with a memcg-specific lock. This is an automatic
translation in every site where the values involved were queried.
The sites where values are written, however, used to be naturally called
under the cgroup_lock. This is the case, for instance, of the css_online
callback. For those, we now need to explicitly take the memcg_mutex.
Also, now that the memcg_mutex is available, there is no need to abuse
the set_limit_mutex in kmemcg value setting. The memcg_mutex will do a
better job, and we now resort to it.
With this, all the calls to cgroup_lock outside cgroup core are gone.
Signed-off-by: Glauber Costa <glommer@parallels.com>
---
mm/memcontrol.c | 52 ++++++++++++++++++++++++++++------------------------
1 file changed, 28 insertions(+), 24 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 6d3ad21..d3b78b9 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -470,6 +470,13 @@ enum res_type {
#define MEM_CGROUP_RECLAIM_SHRINK_BIT 0x1
#define MEM_CGROUP_RECLAIM_SHRINK (1 << MEM_CGROUP_RECLAIM_SHRINK_BIT)
+/*
+ * The memcg mutex needs to be held for any globally visible cgroup change.
+ * Group creation and tunable propagation, as well as any change that depends
+ * on the tunables being in a consistent state.
+ */
+static DEFINE_MUTEX(memcg_mutex);
+
static void mem_cgroup_get(struct mem_cgroup *memcg);
static void mem_cgroup_put(struct mem_cgroup *memcg);
@@ -2902,7 +2909,7 @@ int memcg_cache_id(struct mem_cgroup *memcg)
* operation, because that is its main call site.
*
* But when we create a new cache, we can call this as well if its parent
- * is kmem-limited. That will have to hold set_limit_mutex as well.
+ * is kmem-limited. That will have to hold memcg_mutex as well.
*/
int memcg_update_cache_sizes(struct mem_cgroup *memcg)
{
@@ -2917,7 +2924,7 @@ int memcg_update_cache_sizes(struct mem_cgroup *memcg)
* the beginning of this conditional), is no longer 0. This
* guarantees only one process will set the following boolean
* to true. We don't need test_and_set because we're protected
- * by the set_limit_mutex anyway.
+ * by the memcg_mutex anyway.
*/
memcg_kmem_set_activated(memcg);
@@ -3258,9 +3265,9 @@ void kmem_cache_destroy_memcg_children(struct kmem_cache *s)
*
* Still, we don't want anyone else freeing memcg_caches under our
* noses, which can happen if a new memcg comes to life. As usual,
- * we'll take the set_limit_mutex to protect ourselves against this.
+ * we'll take the memcg_mutex to protect ourselves against this.
*/
- mutex_lock(&set_limit_mutex);
+ mutex_lock(&memcg_mutex);
for (i = 0; i < memcg_limited_groups_array_size; i++) {
c = s->memcg_params->memcg_caches[i];
if (!c)
@@ -3283,7 +3290,7 @@ void kmem_cache_destroy_memcg_children(struct kmem_cache *s)
cancel_work_sync(&c->memcg_params->destroy);
kmem_cache_destroy(c);
}
- mutex_unlock(&set_limit_mutex);
+ mutex_unlock(&memcg_mutex);
}
struct create_work {
@@ -4730,7 +4737,7 @@ static inline bool __memcg_has_children(struct mem_cgroup *memcg)
}
/*
- * must be called with cgroup_lock held, unless the cgroup is guaranteed to be
+ * must be called with memcg_mutex held, unless the cgroup is guaranteed to be
* already dead (like in mem_cgroup_force_empty, for instance). This is
* different than mem_cgroup_count_children, in the sense that we don't really
* care how many children we have, we only need to know if we have any. It is
@@ -4811,7 +4818,7 @@ static int mem_cgroup_hierarchy_write(struct cgroup *cont, struct cftype *cft,
if (parent)
parent_memcg = mem_cgroup_from_cont(parent);
- cgroup_lock();
+ mutex_lock(&memcg_mutex);
if (memcg->use_hierarchy == val)
goto out;
@@ -4834,7 +4841,7 @@ static int mem_cgroup_hierarchy_write(struct cgroup *cont, struct cftype *cft,
retval = -EINVAL;
out:
- cgroup_unlock();
+ mutex_unlock(&memcg_mutex);
return retval;
}
@@ -4934,14 +4941,10 @@ static int memcg_update_kmem_limit(struct cgroup *cont, u64 val)
* After it first became limited, changes in the value of the limit are
* of course permitted.
*
- * Taking the cgroup_lock is really offensive, but it is so far the only
- * way to guarantee that no children will appear. There are plenty of
- * other offenders, and they should all go away. Fine grained locking
- * is probably the way to go here. When we are fully hierarchical, we
- * can also get rid of the use_hierarchy check.
+ * We are protected by the memcg_mutex, so no other cgroups can appear
+ * in the mean time.
*/
- cgroup_lock();
- mutex_lock(&set_limit_mutex);
+ mutex_lock(&memcg_mutex);
if (!memcg->kmem_account_flags && val != RESOURCE_MAX) {
if (cgroup_task_count(cont) || memcg_has_children(memcg)) {
ret = -EBUSY;
@@ -4966,8 +4969,7 @@ static int memcg_update_kmem_limit(struct cgroup *cont, u64 val)
} else
ret = res_counter_set_limit(&memcg->kmem, val);
out:
- mutex_unlock(&set_limit_mutex);
- cgroup_unlock();
+ mutex_unlock(&memcg_mutex);
/*
* We are by now familiar with the fact that we can't inc the static
@@ -5024,9 +5026,9 @@ static int memcg_propagate_kmem(struct mem_cgroup *memcg)
mem_cgroup_get(memcg);
static_key_slow_inc(&memcg_kmem_enabled_key);
- mutex_lock(&set_limit_mutex);
+ mutex_lock(&memcg_mutex);
ret = memcg_update_cache_sizes(memcg);
- mutex_unlock(&set_limit_mutex);
+ mutex_unlock(&memcg_mutex);
#endif
out:
return ret;
@@ -5356,17 +5358,17 @@ static int mem_cgroup_swappiness_write(struct cgroup *cgrp, struct cftype *cft,
parent = mem_cgroup_from_cont(cgrp->parent);
- cgroup_lock();
+ mutex_lock(&memcg_mutex);
/* If under hierarchy, only empty-root can set this value */
if ((parent->use_hierarchy) || memcg_has_children(memcg)) {
- cgroup_unlock();
+ mutex_unlock(&memcg_mutex);
return -EINVAL;
}
memcg->swappiness = val;
- cgroup_unlock();
+ mutex_unlock(&memcg_mutex);
return 0;
}
@@ -5692,7 +5694,7 @@ static int mem_cgroup_oom_control_write(struct cgroup *cgrp,
parent = mem_cgroup_from_cont(cgrp->parent);
- cgroup_lock();
+ mutex_lock(&memcg_mutex);
/* oom-kill-disable is a flag for subhierarchy. */
if ((parent->use_hierarchy) ||
(memcg->use_hierarchy && !list_empty(&cgrp->children))) {
@@ -5702,7 +5704,7 @@ static int mem_cgroup_oom_control_write(struct cgroup *cgrp,
memcg->oom_kill_disable = val;
if (!val)
memcg_oom_recover(memcg);
- cgroup_unlock();
+ mutex_unlock(&memcg_mutex);
return 0;
}
@@ -6140,6 +6142,7 @@ mem_cgroup_css_online(struct cgroup *cont)
if (!cont->parent)
return 0;
+ mutex_lock(&memcg_mutex);
memcg = mem_cgroup_from_cont(cont);
parent = mem_cgroup_from_cont(cont->parent);
@@ -6173,6 +6176,7 @@ mem_cgroup_css_online(struct cgroup *cont)
}
error = memcg_init_kmem(memcg, &mem_cgroup_subsys);
+ mutex_unlock(&memcg_mutex);
if (error) {
/*
* We call put now because our (and parent's) refcnts
--
1.8.1
* [PATCH v3 5/6] memcg: increment static branch right after limit set.
2013-01-21 11:13 [PATCH v3 0/6] replace cgroup_lock with memcg specific locking Glauber Costa
` (3 preceding siblings ...)
2013-01-21 11:13 ` [PATCH v3 4/6] memcg: replace cgroup_lock with memcg specific memcg_lock Glauber Costa
@ 2013-01-21 11:13 ` Glauber Costa
2013-01-21 11:13 ` [PATCH v3 6/6] memcg: avoid dangling reference count in creation failure Glauber Costa
5 siblings, 0 replies; 21+ messages in thread
From: Glauber Costa @ 2013-01-21 11:13 UTC (permalink / raw)
To: cgroups
Cc: linux-mm, Tejun Heo, Michal Hocko, Johannes Weiner,
kamezawa.hiroyu, Glauber Costa
We were deferring the kmemcg static branch increment to a later time,
due to a nasty dependency between the cpu_hotplug lock, taken by the
jump label update, and the cgroup_lock.
Now we no longer take the cgroup lock, and we can save ourselves the
trouble.
Signed-off-by: Glauber Costa <glommer@parallels.com>
Acked-by: Michal Hocko <mhocko@suse>
---
mm/memcontrol.c | 31 +++++++------------------------
1 file changed, 7 insertions(+), 24 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index d3b78b9..5a247de 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4926,8 +4926,6 @@ static int memcg_update_kmem_limit(struct cgroup *cont, u64 val)
{
int ret = -EINVAL;
#ifdef CONFIG_MEMCG_KMEM
- bool must_inc_static_branch = false;
-
struct mem_cgroup *memcg = mem_cgroup_from_cont(cont);
/*
* For simplicity, we won't allow this to be disabled. It also can't
@@ -4958,7 +4956,13 @@ static int memcg_update_kmem_limit(struct cgroup *cont, u64 val)
res_counter_set_limit(&memcg->kmem, RESOURCE_MAX);
goto out;
}
- must_inc_static_branch = true;
+ static_key_slow_inc(&memcg_kmem_enabled_key);
+ /*
+ * setting the active bit after the inc will guarantee no one
+ * starts accounting before all call sites are patched
+ */
+ memcg_kmem_set_active(memcg);
+
/*
* kmem charges can outlive the cgroup. In the case of slab
* pages, for instance, a page contain objects from various
@@ -4970,27 +4974,6 @@ static int memcg_update_kmem_limit(struct cgroup *cont, u64 val)
ret = res_counter_set_limit(&memcg->kmem, val);
out:
mutex_unlock(&memcg_mutex);
-
- /*
- * We are by now familiar with the fact that we can't inc the static
- * branch inside cgroup_lock. See disarm functions for details. A
- * worker here is overkill, but also wrong: After the limit is set, we
- * must start accounting right away. Since this operation can't fail,
- * we can safely defer it to here - no rollback will be needed.
- *
- * The boolean used to control this is also safe, because
- * KMEM_ACCOUNTED_ACTIVATED guarantees that only one process will be
- * able to set it to true;
- */
- if (must_inc_static_branch) {
- static_key_slow_inc(&memcg_kmem_enabled_key);
- /*
- * setting the active bit after the inc will guarantee no one
- * starts accounting before all call sites are patched
- */
- memcg_kmem_set_active(memcg);
- }
-
#endif
return ret;
}
--
1.8.1
* [PATCH v3 6/6] memcg: avoid dangling reference count in creation failure.
2013-01-21 11:13 [PATCH v3 0/6] replace cgroup_lock with memcg specific locking Glauber Costa
` (4 preceding siblings ...)
2013-01-21 11:13 ` [PATCH v3 5/6] memcg: increment static branch right after limit set Glauber Costa
@ 2013-01-21 11:13 ` Glauber Costa
2013-01-21 12:30 ` Michal Hocko
5 siblings, 1 reply; 21+ messages in thread
From: Glauber Costa @ 2013-01-21 11:13 UTC (permalink / raw)
To: cgroups
Cc: linux-mm, Tejun Heo, Michal Hocko, Johannes Weiner,
kamezawa.hiroyu, Glauber Costa
When use_hierarchy is enabled, we acquire an extra reference count
on our parent during cgroup creation. We don't release it, though,
if any failure occurs in the creation process.
Signed-off-by: Glauber Costa <glommer@parallels.com>
Reported-by: Michal Hocko <mhocko@suse>
---
mm/memcontrol.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 5a247de..3949123 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6167,6 +6167,8 @@ mem_cgroup_css_online(struct cgroup *cont)
* call __mem_cgroup_free, so return directly
*/
mem_cgroup_put(memcg);
+ if (parent->use_hierarchy)
+ mem_cgroup_put(parent);
}
return error;
}
--
1.8.1
* Re: [PATCH v3 6/6] memcg: avoid dangling reference count in creation failure.
2013-01-21 11:13 ` [PATCH v3 6/6] memcg: avoid dangling reference count in creation failure Glauber Costa
@ 2013-01-21 12:30 ` Michal Hocko
2013-01-21 13:08 ` Glauber Costa
0 siblings, 1 reply; 21+ messages in thread
From: Michal Hocko @ 2013-01-21 12:30 UTC (permalink / raw)
To: Glauber Costa
Cc: cgroups, linux-mm, Tejun Heo, Johannes Weiner, kamezawa.hiroyu
On Mon 21-01-13 15:13:33, Glauber Costa wrote:
> When use_hierarchy is enabled, we acquire an extra reference count
> in our parent during cgroup creation. We don't release it, though,
> if any failure exist in the creation process.
>
> Signed-off-by: Glauber Costa <glommer@parallels.com>
> Reported-by: Michal Hocko <mhocko@suse>
If you put this one at the head of the series we can backport it to
stable, which is preferred, although nobody has seen this as a problem.
> ---
> mm/memcontrol.c | 2 ++
> 1 file changed, 2 insertions(+)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 5a247de..3949123 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -6167,6 +6167,8 @@ mem_cgroup_css_online(struct cgroup *cont)
> * call __mem_cgroup_free, so return directly
> */
> mem_cgroup_put(memcg);
> + if (parent->use_hierarchy)
> + mem_cgroup_put(parent);
> }
> return error;
> }
> --
> 1.8.1
>
--
Michal Hocko
SUSE Labs
* Re: [PATCH v3 6/6] memcg: avoid dangling reference count in creation failure.
2013-01-21 12:30 ` Michal Hocko
@ 2013-01-21 13:08 ` Glauber Costa
2013-01-21 13:19 ` Michal Hocko
0 siblings, 1 reply; 21+ messages in thread
From: Glauber Costa @ 2013-01-21 13:08 UTC (permalink / raw)
To: Michal Hocko
Cc: cgroups, linux-mm, Tejun Heo, Johannes Weiner, kamezawa.hiroyu
On 01/21/2013 04:30 PM, Michal Hocko wrote:
> On Mon 21-01-13 15:13:33, Glauber Costa wrote:
>> When use_hierarchy is enabled, we acquire an extra reference count
>> in our parent during cgroup creation. We don't release it, though,
>> if any failure exists in the creation process.
>>
>> Signed-off-by: Glauber Costa <glommer@parallels.com>
>> Reported-by: Michal Hocko <mhocko@suse>
>
> If you put this one at the head of the series we can backport it to
> stable, which is preferred, although nobody has seen this as a problem.
>
If I have to send again, I might. But I see no reason to do so otherwise.
* Re: [PATCH v3 6/6] memcg: avoid dangling reference count in creation failure.
2013-01-21 13:08 ` Glauber Costa
@ 2013-01-21 13:19 ` Michal Hocko
2013-01-21 13:26 ` Glauber Costa
0 siblings, 1 reply; 21+ messages in thread
From: Michal Hocko @ 2013-01-21 13:19 UTC (permalink / raw)
To: Glauber Costa
Cc: cgroups, linux-mm, Tejun Heo, Johannes Weiner, kamezawa.hiroyu
On Mon 21-01-13 17:08:36, Glauber Costa wrote:
> On 01/21/2013 04:30 PM, Michal Hocko wrote:
> > On Mon 21-01-13 15:13:33, Glauber Costa wrote:
> >> When use_hierarchy is enabled, we acquire an extra reference count
> >> in our parent during cgroup creation. We don't release it, though,
> >> if any failure exists in the creation process.
> >>
> >> Signed-off-by: Glauber Costa <glommer@parallels.com>
> >> Reported-by: Michal Hocko <mhocko@suse>
> >
> > If you put this one at the head of the series we can backport it to
> > stable, which is preferred, although nobody has seen this as a problem.
> >
> If I have to send again, I might. But I see no reason to do so otherwise.
The question is whether this is worth backporting to stable. If yes then
it makes sense to move it up the series. Keep it here otherwise. I think the
failure is quite improbable and nobody has complained so far. On the other
hand this is an obvious bug fix so it should qualify for stable.
I would wait to hear what others think and do the shuffling after
all the other patches are settled. I would rather be safe and push the fix
pro-actively.
Thanks
--
Michal Hocko
SUSE Labs
* Re: [PATCH v3 6/6] memcg: avoid dangling reference count in creation failure.
2013-01-21 13:19 ` Michal Hocko
@ 2013-01-21 13:26 ` Glauber Costa
0 siblings, 0 replies; 21+ messages in thread
From: Glauber Costa @ 2013-01-21 13:26 UTC (permalink / raw)
To: Michal Hocko
Cc: cgroups, linux-mm, Tejun Heo, Johannes Weiner, kamezawa.hiroyu
On 01/21/2013 05:19 PM, Michal Hocko wrote:
> On Mon 21-01-13 17:08:36, Glauber Costa wrote:
>> On 01/21/2013 04:30 PM, Michal Hocko wrote:
>>> On Mon 21-01-13 15:13:33, Glauber Costa wrote:
>>>> When use_hierarchy is enabled, we acquire an extra reference count
>>>> in our parent during cgroup creation. We don't release it, though,
>>>> if any failure exists in the creation process.
>>>>
>>>> Signed-off-by: Glauber Costa <glommer@parallels.com>
>>>> Reported-by: Michal Hocko <mhocko@suse>
>>>
>>> If you put this one at the head of the series we can backport it to
>>> stable, which is preferred, although nobody has seen this as a problem.
>>>
>> If I have to send again, I might. But I see no reason to do so otherwise.
>
> The question is whether this is worth backporting to stable. If yes then
> it makes sense to move it up the series. Keep it here otherwise. I think the
> failure is quite improbable and nobody has complained so far. On the other
> hand this is an obvious bug fix so it should qualify for stable.
>
> I would wait to hear what others think and do the shuffling after
> all the other patches are settled. I would rather be safe and push the fix
> pro-actively.
>
As improbable as it is, what if we have one of those
bugs-turned-features that end up working by accident just because the
refcnt is never flushed? We should fix it, of course, but who knows how
hard that could be?
Of course it is all handwaving, but given that the trigger for this bug
is an unlikely condition, and the effect is a couple of wasted kbs -
even directory removal can proceed all right - and in the most common
use case of children-at-parent-level-only the increased reference will
be in the root memcg anyway... I wouldn't backport it.
* Re: [PATCH v3 2/6] memcg: split part of memcg creation to css_online
2013-01-21 11:13 ` [PATCH v3 2/6] memcg: split part of memcg creation to css_online Glauber Costa
@ 2013-01-21 13:56 ` Michal Hocko
0 siblings, 0 replies; 21+ messages in thread
From: Michal Hocko @ 2013-01-21 13:56 UTC (permalink / raw)
To: Glauber Costa
Cc: cgroups, linux-mm, Tejun Heo, Johannes Weiner, kamezawa.hiroyu
On Mon 21-01-13 15:13:29, Glauber Costa wrote:
> This patch is preparatory work for the later locking rework to get rid of
> the big cgroup lock from the memory controller code.
>
> The memory controller uses some tunables to adjust its operation. Those
> tunables are inherited from parent to children upon children
> initialization. For most of them, the value cannot be changed after the
> parent has children.
>
> cgroup core splits initialization into two phases: css_alloc and css_online.
> After css_alloc, the memory allocation and basic initialization are
> done. But the new group is not yet visible anywhere, not even for cgroup
> core code. It is only somewhere between css_alloc and css_online that it
> is inserted into the internal children lists. Copying tunable values in
> css_alloc will lead to inconsistent values: the children will copy the
> old parent values, which can change between the copy and the moment at
> which the group is linked to any data structure that can indicate the
> presence of children.
>
> Signed-off-by: Glauber Costa <glommer@parallels.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Thanks!
> ---
> mm/memcontrol.c | 61 ++++++++++++++++++++++++++++++++++++---------------------
> 1 file changed, 39 insertions(+), 22 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 91d90a0..6c72204 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -6063,7 +6063,7 @@ err_cleanup:
> static struct cgroup_subsys_state * __ref
> mem_cgroup_css_alloc(struct cgroup *cont)
> {
> - struct mem_cgroup *memcg, *parent;
> + struct mem_cgroup *memcg;
> long error = -ENOMEM;
> int node;
>
> @@ -6079,7 +6079,6 @@ mem_cgroup_css_alloc(struct cgroup *cont)
> if (cont->parent == NULL) {
> int cpu;
> enable_swap_cgroup();
> - parent = NULL;
> if (mem_cgroup_soft_limit_tree_init())
> goto free_out;
> root_mem_cgroup = memcg;
> @@ -6088,13 +6087,43 @@ mem_cgroup_css_alloc(struct cgroup *cont)
> &per_cpu(memcg_stock, cpu);
> INIT_WORK(&stock->work, drain_local_stock);
> }
> - } else {
> - parent = mem_cgroup_from_cont(cont->parent);
> - memcg->use_hierarchy = parent->use_hierarchy;
> - memcg->oom_kill_disable = parent->oom_kill_disable;
> +
> + res_counter_init(&memcg->res, NULL);
> + res_counter_init(&memcg->memsw, NULL);
> + res_counter_init(&memcg->kmem, NULL);
> }
>
> - if (parent && parent->use_hierarchy) {
> + memcg->last_scanned_node = MAX_NUMNODES;
> + INIT_LIST_HEAD(&memcg->oom_notify);
> + atomic_set(&memcg->refcnt, 1);
> + memcg->move_charge_at_immigrate = 0;
> + mutex_init(&memcg->thresholds_lock);
> + spin_lock_init(&memcg->move_lock);
> +
> + return &memcg->css;
> +
> +free_out:
> + __mem_cgroup_free(memcg);
> + return ERR_PTR(error);
> +}
> +
> +static int
> +mem_cgroup_css_online(struct cgroup *cont)
> +{
> + struct mem_cgroup *memcg, *parent;
> + int error = 0;
> +
> + if (!cont->parent)
> + return 0;
> +
> + memcg = mem_cgroup_from_cont(cont);
> + parent = mem_cgroup_from_cont(cont->parent);
> +
> + memcg->use_hierarchy = parent->use_hierarchy;
> + memcg->oom_kill_disable = parent->oom_kill_disable;
> + memcg->swappiness = mem_cgroup_swappiness(parent);
> +
> + if (parent->use_hierarchy) {
> res_counter_init(&memcg->res, &parent->res);
> res_counter_init(&memcg->memsw, &parent->memsw);
> res_counter_init(&memcg->kmem, &parent->kmem);
> @@ -6115,18 +6144,9 @@ mem_cgroup_css_alloc(struct cgroup *cont)
> * much sense so let cgroup subsystem know about this
> * unfortunate state in our controller.
> */
> - if (parent && parent != root_mem_cgroup)
> + if (parent != root_mem_cgroup)
> mem_cgroup_subsys.broken_hierarchy = true;
> }
> - memcg->last_scanned_node = MAX_NUMNODES;
> - INIT_LIST_HEAD(&memcg->oom_notify);
> -
> - if (parent)
> - memcg->swappiness = mem_cgroup_swappiness(parent);
> - atomic_set(&memcg->refcnt, 1);
> - memcg->move_charge_at_immigrate = 0;
> - mutex_init(&memcg->thresholds_lock);
> - spin_lock_init(&memcg->move_lock);
>
> error = memcg_init_kmem(memcg, &mem_cgroup_subsys);
> if (error) {
> @@ -6136,12 +6156,8 @@ mem_cgroup_css_alloc(struct cgroup *cont)
> * call __mem_cgroup_free, so return directly
> */
> mem_cgroup_put(memcg);
> - return ERR_PTR(error);
> }
> - return &memcg->css;
> -free_out:
> - __mem_cgroup_free(memcg);
> - return ERR_PTR(error);
> + return error;
> }
>
> static void mem_cgroup_css_offline(struct cgroup *cont)
> @@ -6751,6 +6767,7 @@ struct cgroup_subsys mem_cgroup_subsys = {
> .name = "memory",
> .subsys_id = mem_cgroup_subsys_id,
> .css_alloc = mem_cgroup_css_alloc,
> + .css_online = mem_cgroup_css_online,
> .css_offline = mem_cgroup_css_offline,
> .css_free = mem_cgroup_css_free,
> .can_attach = mem_cgroup_can_attach,
> --
> 1.8.1
>
--
Michal Hocko
SUSE Labs
* Re: [PATCH v3 3/6] memcg: fast hierarchy-aware child test.
2013-01-21 11:13 ` [PATCH v3 3/6] memcg: fast hierarchy-aware child test Glauber Costa
@ 2013-01-21 14:10 ` Michal Hocko
0 siblings, 0 replies; 21+ messages in thread
From: Michal Hocko @ 2013-01-21 14:10 UTC (permalink / raw)
To: Glauber Costa
Cc: cgroups, linux-mm, Tejun Heo, Johannes Weiner, kamezawa.hiroyu
On Mon 21-01-13 15:13:30, Glauber Costa wrote:
> Currently, we use cgroups' provided list of children to verify if it is
> safe to proceed with any value change that is dependent on the cgroup
> being empty.
>
> This is less than ideal, because it enforces a dependency on cgroup
> core that we would be better off without. The solution proposed here is
> to iterate over the child cgroups and if any is found that is already
> online, we bounce and return: we don't really care how many children we
> have, only if we have any.
>
> This is also made to be hierarchy aware. IOW, cgroups with hierarchy
> disabled, while they still exist, will be considered for the purpose of
> this interface as having no children.
>
> Signed-off-by: Glauber Costa <glommer@parallels.com>
OK, as I said I can live with this.
Acked-by: Michal Hocko <mhocko@suse.cz>
But
> ---
> mm/memcontrol.c | 34 +++++++++++++++++++++++++++++-----
> 1 file changed, 29 insertions(+), 5 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 6c72204..6d3ad21 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -4716,6 +4716,32 @@ static void mem_cgroup_reparent_charges(struct mem_cgroup *memcg)
> }
>
> /*
> + * this mainly exists for tests during the setting of use_hierarchy. Since this is
> + * the very setting we are changing, the current hierarchy value is meaningless
> + */
> +static inline bool __memcg_has_children(struct mem_cgroup *memcg)
> +{
> + struct cgroup *pos;
> +
> + /* bounce at first found */
> + cgroup_for_each_child(pos, memcg->css.cgroup)
> + return true;
> + return false;
> +}
This needs rcu_read_{un}lock.
> +
> +/*
> + * must be called with cgroup_lock held, unless the cgroup is guaranteed to be
> + * already dead (like in mem_cgroup_force_empty, for instance). This is
> + * different than mem_cgroup_count_children, in the sense that we don't really
> + * care how many children we have, we only need to know if we have any. It
> + * also counts any memcg without hierarchy as infertile, for that matter.
> + */
> +static inline bool memcg_has_children(struct mem_cgroup *memcg)
> +{
> + return memcg->use_hierarchy && __memcg_has_children(memcg);
> +}
> +
> +/*
> * Reclaims as many pages from the given memcg as possible and moves
> * the rest to the parent.
> *
> @@ -4800,7 +4826,7 @@ static int mem_cgroup_hierarchy_write(struct cgroup *cont, struct cftype *cft,
> */
> if ((!parent_memcg || !parent_memcg->use_hierarchy) &&
> (val == 1 || val == 0)) {
> - if (list_empty(&cont->children))
> + if (!__memcg_has_children(memcg))
> memcg->use_hierarchy = val;
> else
> retval = -EBUSY;
> @@ -4917,8 +4943,7 @@ static int memcg_update_kmem_limit(struct cgroup *cont, u64 val)
> cgroup_lock();
> mutex_lock(&set_limit_mutex);
> if (!memcg->kmem_account_flags && val != RESOURCE_MAX) {
> - if (cgroup_task_count(cont) || (memcg->use_hierarchy &&
> - !list_empty(&cont->children))) {
> + if (cgroup_task_count(cont) || memcg_has_children(memcg)) {
> ret = -EBUSY;
> goto out;
> }
> @@ -5334,8 +5359,7 @@ static int mem_cgroup_swappiness_write(struct cgroup *cgrp, struct cftype *cft,
> cgroup_lock();
>
> /* If under hierarchy, only empty-root can set this value */
> - if ((parent->use_hierarchy) ||
> - (memcg->use_hierarchy && !list_empty(&cgrp->children))) {
> + if ((parent->use_hierarchy) || memcg_has_children(memcg)) {
> cgroup_unlock();
> return -EINVAL;
> }
> --
> 1.8.1
>
--
Michal Hocko
SUSE Labs
* Re: [PATCH v3 4/6] memcg: replace cgroup_lock with memcg specific memcg_lock
2013-01-21 11:13 ` [PATCH v3 4/6] memcg: replace cgroup_lock with memcg specific memcg_lock Glauber Costa
@ 2013-01-21 14:49 ` Michal Hocko
2013-01-21 15:12 ` Glauber Costa
0 siblings, 1 reply; 21+ messages in thread
From: Michal Hocko @ 2013-01-21 14:49 UTC (permalink / raw)
To: Glauber Costa
Cc: cgroups, linux-mm, Tejun Heo, Johannes Weiner, kamezawa.hiroyu
On Mon 21-01-13 15:13:31, Glauber Costa wrote:
> After the preparation work done in earlier patches, the cgroup_lock can
> be trivially replaced with a memcg-specific lock. This is an automatic
> translation in every site where the values involved were queried.
>
> The sites where values are written, however, used to be naturally called
> under cgroup_lock. This is the case for instance of the css_online
> callback. For those, we now need to explicitly add the memcg_lock.
>
> Also, now that the memcg_mutex is available, there is no need to abuse
> the set_limit mutex in kmemcg value setting. The memcg_mutex will do a
> better job, and we now resort to it.
You will hate me for this because I should have said that in the
previous round already (but I will use "I showed mercy on you and
that blinded me" for my defense).
I am not so sure it will do a better job (it is only kmem that uses both
locks). I thought that memcg_mutex is just a first step and that we would move
to finer grained locking later (the overly general documentation of
the lock even asks for it). So I would keep the limit mutex and figure out
whether memcg_mutex could be split up even further.
Other than that the patch looks good to me
> With this, all the calls to cgroup_lock outside cgroup core are gone.
OK, Tejun will be happy ;)
> Signed-off-by: Glauber Costa <glommer@parallels.com>
> ---
> mm/memcontrol.c | 52 ++++++++++++++++++++++++++++------------------------
> 1 file changed, 28 insertions(+), 24 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 6d3ad21..d3b78b9 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -470,6 +470,13 @@ enum res_type {
> #define MEM_CGROUP_RECLAIM_SHRINK_BIT 0x1
> #define MEM_CGROUP_RECLAIM_SHRINK (1 << MEM_CGROUP_RECLAIM_SHRINK_BIT)
>
> +/*
> + * The memcg mutex needs to be held for any globally visible cgroup change.
> + * Group creation and tunable propagation, as well as any change that depends
> + * on the tunables being in a consistent state.
> + */
> +static DEFINE_MUTEX(memcg_mutex);
> +
> static void mem_cgroup_get(struct mem_cgroup *memcg);
> static void mem_cgroup_put(struct mem_cgroup *memcg);
>
> @@ -2902,7 +2909,7 @@ int memcg_cache_id(struct mem_cgroup *memcg)
> * operation, because that is its main call site.
> *
> * But when we create a new cache, we can call this as well if its parent
> - * is kmem-limited. That will have to hold set_limit_mutex as well.
> + * is kmem-limited. That will have to hold memcg_mutex as well.
> */
> int memcg_update_cache_sizes(struct mem_cgroup *memcg)
> {
> @@ -2917,7 +2924,7 @@ int memcg_update_cache_sizes(struct mem_cgroup *memcg)
> * the beginning of this conditional), is no longer 0. This
> * guarantees only one process will set the following boolean
> * to true. We don't need test_and_set because we're protected
> - * by the set_limit_mutex anyway.
> + * by the memcg_mutex anyway.
> */
> memcg_kmem_set_activated(memcg);
>
> @@ -3258,9 +3265,9 @@ void kmem_cache_destroy_memcg_children(struct kmem_cache *s)
> *
> * Still, we don't want anyone else freeing memcg_caches under our
> * noses, which can happen if a new memcg comes to life. As usual,
> - * we'll take the set_limit_mutex to protect ourselves against this.
> + * we'll take the memcg_mutex to protect ourselves against this.
> */
> - mutex_lock(&set_limit_mutex);
> + mutex_lock(&memcg_mutex);
> for (i = 0; i < memcg_limited_groups_array_size; i++) {
> c = s->memcg_params->memcg_caches[i];
> if (!c)
> @@ -3283,7 +3290,7 @@ void kmem_cache_destroy_memcg_children(struct kmem_cache *s)
> cancel_work_sync(&c->memcg_params->destroy);
> kmem_cache_destroy(c);
> }
> - mutex_unlock(&set_limit_mutex);
> + mutex_unlock(&memcg_mutex);
> }
>
> struct create_work {
> @@ -4730,7 +4737,7 @@ static inline bool __memcg_has_children(struct mem_cgroup *memcg)
> }
>
> /*
> - * must be called with cgroup_lock held, unless the cgroup is guaranteed to be
> + * must be called with memcg_mutex held, unless the cgroup is guaranteed to be
> * already dead (like in mem_cgroup_force_empty, for instance). This is
> * different than mem_cgroup_count_children, in the sense that we don't really
> * care how many children we have, we only need to know if we have any. It is
> @@ -4811,7 +4818,7 @@ static int mem_cgroup_hierarchy_write(struct cgroup *cont, struct cftype *cft,
> if (parent)
> parent_memcg = mem_cgroup_from_cont(parent);
>
> - cgroup_lock();
> + mutex_lock(&memcg_mutex);
>
> if (memcg->use_hierarchy == val)
> goto out;
> @@ -4834,7 +4841,7 @@ static int mem_cgroup_hierarchy_write(struct cgroup *cont, struct cftype *cft,
> retval = -EINVAL;
>
> out:
> - cgroup_unlock();
> + mutex_unlock(&memcg_mutex);
>
> return retval;
> }
> @@ -4934,14 +4941,10 @@ static int memcg_update_kmem_limit(struct cgroup *cont, u64 val)
> * After it first became limited, changes in the value of the limit are
> * of course permitted.
> *
> - * Taking the cgroup_lock is really offensive, but it is so far the only
> - * way to guarantee that no children will appear. There are plenty of
> - * other offenders, and they should all go away. Fine grained locking
> - * is probably the way to go here. When we are fully hierarchical, we
> - * can also get rid of the use_hierarchy check.
> + * We are protected by the memcg_mutex, so no other cgroups can appear
> + * in the mean time.
> */
> - cgroup_lock();
> - mutex_lock(&set_limit_mutex);
> + mutex_lock(&memcg_mutex);
> if (!memcg->kmem_account_flags && val != RESOURCE_MAX) {
> if (cgroup_task_count(cont) || memcg_has_children(memcg)) {
> ret = -EBUSY;
> @@ -4966,8 +4969,7 @@ static int memcg_update_kmem_limit(struct cgroup *cont, u64 val)
> } else
> ret = res_counter_set_limit(&memcg->kmem, val);
> out:
> - mutex_unlock(&set_limit_mutex);
> - cgroup_unlock();
> + mutex_unlock(&memcg_mutex);
>
> /*
> * We are by now familiar with the fact that we can't inc the static
> @@ -5024,9 +5026,9 @@ static int memcg_propagate_kmem(struct mem_cgroup *memcg)
> mem_cgroup_get(memcg);
> static_key_slow_inc(&memcg_kmem_enabled_key);
>
> - mutex_lock(&set_limit_mutex);
> + mutex_lock(&memcg_mutex);
> ret = memcg_update_cache_sizes(memcg);
> - mutex_unlock(&set_limit_mutex);
> + mutex_unlock(&memcg_mutex);
> #endif
> out:
> return ret;
> @@ -5356,17 +5358,17 @@ static int mem_cgroup_swappiness_write(struct cgroup *cgrp, struct cftype *cft,
>
> parent = mem_cgroup_from_cont(cgrp->parent);
>
> - cgroup_lock();
> + mutex_lock(&memcg_mutex);
>
> /* If under hierarchy, only empty-root can set this value */
> if ((parent->use_hierarchy) || memcg_has_children(memcg)) {
> - cgroup_unlock();
> + mutex_unlock(&memcg_mutex);
> return -EINVAL;
> }
>
> memcg->swappiness = val;
>
> - cgroup_unlock();
> + mutex_unlock(&memcg_mutex);
>
> return 0;
> }
> @@ -5692,7 +5694,7 @@ static int mem_cgroup_oom_control_write(struct cgroup *cgrp,
>
> parent = mem_cgroup_from_cont(cgrp->parent);
>
> - cgroup_lock();
> + mutex_lock(&memcg_mutex);
> /* oom-kill-disable is a flag for subhierarchy. */
> if ((parent->use_hierarchy) ||
> (memcg->use_hierarchy && !list_empty(&cgrp->children))) {
> @@ -5702,7 +5704,7 @@ static int mem_cgroup_oom_control_write(struct cgroup *cgrp,
> memcg->oom_kill_disable = val;
> if (!val)
> memcg_oom_recover(memcg);
> - cgroup_unlock();
> + mutex_unlock(&memcg_mutex);
> return 0;
> }
>
> @@ -6140,6 +6142,7 @@ mem_cgroup_css_online(struct cgroup *cont)
> if (!cont->parent)
> return 0;
>
> + mutex_lock(&memcg_mutex);
> memcg = mem_cgroup_from_cont(cont);
> parent = mem_cgroup_from_cont(cont->parent);
>
> @@ -6173,6 +6176,7 @@ mem_cgroup_css_online(struct cgroup *cont)
> }
>
> error = memcg_init_kmem(memcg, &mem_cgroup_subsys);
> + mutex_unlock(&memcg_mutex);
> if (error) {
> /*
> * We call put now because our (and parent's) refcnts
> --
> 1.8.1
>
--
Michal Hocko
SUSE Labs
* Re: [PATCH v3 4/6] memcg: replace cgroup_lock with memcg specific memcg_lock
2013-01-21 14:49 ` Michal Hocko
@ 2013-01-21 15:12 ` Glauber Costa
2013-01-21 15:20 ` Michal Hocko
0 siblings, 1 reply; 21+ messages in thread
From: Glauber Costa @ 2013-01-21 15:12 UTC (permalink / raw)
To: Michal Hocko
Cc: cgroups, linux-mm, Tejun Heo, Johannes Weiner, kamezawa.hiroyu
On 01/21/2013 06:49 PM, Michal Hocko wrote:
> On Mon 21-01-13 15:13:31, Glauber Costa wrote:
>> After the preparation work done in earlier patches, the cgroup_lock can
>> be trivially replaced with a memcg-specific lock. This is an automatic
>> translation in every site where the values involved were queried.
>>
>> The sites where values are written, however, used to be naturally called
>> under cgroup_lock. This is the case for instance of the css_online
>> callback. For those, we now need to explicitly add the memcg_lock.
>>
>> Also, now that the memcg_mutex is available, there is no need to abuse
>> the set_limit mutex in kmemcg value setting. The memcg_mutex will do a
>> better job, and we now resort to it.
>
> You will hate me for this because I should have said that in the
> previous round already (but I will use "I showed mercy on you and
> that blinded me" for my defense).
> I am not so sure it will do a better job (it is only kmem that uses both
> locks). I thought that memcg_mutex is just a first step and that we would move
> to finer grained locking later (the overly general documentation of
> the lock even asks for it). So I would keep the limit mutex and figure out
> whether memcg_mutex could be split up even further.
>
> Other than that the patch looks good to me
>
By now I have more than enough reasons to hate you, so this one won't
add much. Even then, don't worry. Beer resets it all.
That said, I disagree with you.
As you noted yourself, kmem needs both locks:
1) cgroup_lock, because we need to prevent creation of sub-groups.
2) set_limit lock, because we need one - any one - memcg global lock to be
held while we are manipulating the kmem-specific data structures, and we
would like to spread cgroup_lock all around for that.
I now regret not having created the memcg_mutex for that: I'd be now
just extending it to other users, instead of trying a replacement.
So first of all, if the limit mutex is kept, we would *still* need to
hold the memcg mutex to avoid children appearing. If we *ever* switch to
a finer-grained lock(*), we will have to hold that lock anyway. So why
hold set_limit_mutex??
(*) None of the operations protected by this mutex are fast paths...
>> With this, all the calls to cgroup_lock outside cgroup core are gone.
>
> OK, Tejun will be happy ;)
>
He paid me ice cream.
* Re: [PATCH v3 4/6] memcg: replace cgroup_lock with memcg specific memcg_lock
2013-01-21 15:12 ` Glauber Costa
@ 2013-01-21 15:20 ` Michal Hocko
2013-01-21 15:34 ` Glauber Costa
0 siblings, 1 reply; 21+ messages in thread
From: Michal Hocko @ 2013-01-21 15:20 UTC (permalink / raw)
To: Glauber Costa
Cc: cgroups, linux-mm, Tejun Heo, Johannes Weiner, kamezawa.hiroyu
On Mon 21-01-13 19:12:00, Glauber Costa wrote:
> On 01/21/2013 06:49 PM, Michal Hocko wrote:
> > On Mon 21-01-13 15:13:31, Glauber Costa wrote:
> >> After the preparation work done in earlier patches, the cgroup_lock can
> >> be trivially replaced with a memcg-specific lock. This is an automatic
> >> translation in every site where the values involved were queried.
> >>
> >> The sites where values are written, however, used to be naturally called
> >> under cgroup_lock. This is the case for instance of the css_online
> >> callback. For those, we now need to explicitly add the memcg_lock.
> >>
> >> Also, now that the memcg_mutex is available, there is no need to abuse
> >> the set_limit mutex in kmemcg value setting. The memcg_mutex will do a
> >> better job, and we now resort to it.
> >
> > You will hate me for this because I should have said that in the
> > previous round already (but I will use "I showed mercy on you and
> > that blinded me" for my defense).
> > I am not so sure it will do a better job (it is only kmem that uses both
> > locks). I thought that memcg_mutex is just a first step and that we would move
> > to finer grained locking later (the overly general documentation of
> > the lock even asks for it). So I would keep the limit mutex and figure out
> > whether memcg_mutex could be split up even further.
> >
> > Other than that the patch looks good to me
> >
> By now I have more than enough reasons to hate you, so this one won't
> add much. Even then, don't worry. Beer resets it all.
>
> That said, I disagree with you.
>
> As you noted yourself, kmem needs both locks:
> 1) cgroup_lock, because we need to prevent creation of sub-groups.
> 2) set_limit lock, because we need one - any one - memcg global lock to be
> held while we are manipulating the kmem-specific data structures, and we
> would like to spread cgroup_lock all around for that.
>
> I now regret not having created the memcg_mutex for that: I'd be now
> just extending it to other users, instead of trying a replacement.
>
> So first of all, if the limit mutex is kept, we would *still* need to
> hold the memcg mutex to avoid children appearing. If we *ever* switch to
> a finer-grained lock(*), we will have to hold that lock anyway. So why
> hold set_limit_mutex??
Yeah but memcg is not just kmem, is it? See mem_cgroup_resize_limit for
example. Why should it be linearized with, say, a new group creation?
Same thing with memsw. Besides that you know what those two locks are
intended for. memcg_mutex to prevent from races with a new group
creation and the limit lock for races with whatever limit setting.
This sounds much more specific than
"
The memcg mutex needs to be held for any globally visible cgroup change.
"
> (*) None of the operations protected by this mutex are fast paths...
[...]
--
Michal Hocko
SUSE Labs
* Re: [PATCH v3 4/6] memcg: replace cgroup_lock with memcg specific memcg_lock
2013-01-21 15:20 ` Michal Hocko
@ 2013-01-21 15:34 ` Glauber Costa
2013-01-21 16:07 ` Michal Hocko
0 siblings, 1 reply; 21+ messages in thread
From: Glauber Costa @ 2013-01-21 15:34 UTC (permalink / raw)
To: Michal Hocko
Cc: cgroups, linux-mm, Tejun Heo, Johannes Weiner, kamezawa.hiroyu
On 01/21/2013 07:20 PM, Michal Hocko wrote:
> On Mon 21-01-13 19:12:00, Glauber Costa wrote:
>> On 01/21/2013 06:49 PM, Michal Hocko wrote:
>>> On Mon 21-01-13 15:13:31, Glauber Costa wrote:
>>>> After the preparation work done in earlier patches, the cgroup_lock can
>>>> be trivially replaced with a memcg-specific lock. This is an automatic
>>>> translation in every site where the values involved were queried.
>>>>
>>>> The sites where values are written, however, used to be naturally called
>>>> under cgroup_lock. This is the case for instance of the css_online
>>>> callback. For those, we now need to explicitly add the memcg_lock.
>>>>
>>>> Also, now that the memcg_mutex is available, there is no need to abuse
>>>> the set_limit mutex in kmemcg value setting. The memcg_mutex will do a
>>>> better job, and we now resort to it.
>>>
>>> You will hate me for this because I should have said that in the
>>> previous round already (but I will use "I showed mercy on you and
>>> that blinded me" for my defense).
>>> I am not so sure it will do a better job (it is only kmem that uses both
>>> locks). I thought that memcg_mutex is just a first step and that we would move
>>> to finer grained locking later (the overly general documentation of
>>> the lock even asks for it). So I would keep the limit mutex and figure out
>>> whether memcg_mutex could be split up even further.
>>>
>>> Other than that the patch looks good to me
>>>
>> By now I have more than enough reasons to hate you, so this one won't
>> add much. Even then, don't worry. Beer resets it all.
>>
>> That said, I disagree with you.
>>
>> As you noted yourself, kmem needs both locks:
>> 1) cgroup_lock, because we need to prevent creation of sub-groups.
>> 2) set_limit lock, because we need one - any one - memcg global lock to be
>> held while we are manipulating the kmem-specific data structures, and we
>> would like to spread cgroup_lock all around for that.
>>
>> I now regret not having created the memcg_mutex for that: I'd be now
>> just extending it to other users, instead of trying a replacement.
>>
>> So first of all, if the limit mutex is kept, we would *still* need to
>> hold the memcg mutex to avoid children appearing. If we *ever* switch to
>> a finer-grained lock(*), we will have to hold that lock anyway. So why
>> hold set_limit_mutex??
>
> Yeah but memcg is not just kmem, is it?
No, it belongs to all of us. It is usually called collaboration, but in
the memcg context, we can say we are accomplices.
> See mem_cgroup_resize_limit for
> example. Why should it be linearized with, say, a new group creation?
Because it is simpler to use the same lock, and all those operations are
not exactly frequent.
> Same thing with memsw.
See, I'm not the only culprit!
> Besides that you know what those two locks are
> intended for. memcg_mutex to prevent from races with a new group
> creation and the limit lock for races with what-ever limit setting.
> This sounds much more specific than
Again: Can I keep holding the set_limit_mutex? Sure I can. But we still
need to hold both, because kmemcg is also forbidden for groups that
already have tasks. And the reason why kmemcg holds the set_limit mutex
is just to protect against itself, so there is no *need* to hold any
extra lock (and we'll never be able to stop holding the creation lock,
whatever it is). So my main point here is not memcg_mutex vs
set_limit_mutex, but rather, memcg_mutex is needed anyway, and once it
is taken, the set_limit_mutex *can* be held, but doesn't need to.
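For illustration only, the argument above can be sketched in userspace with a single pthread mutex standing in for memcg_mutex (names and fields are borrowed from the discussion and are hypothetical; this is not the kernel code): once one outer lock serializes both group creation and kmem limit writes, a second lock on the same paths adds no further protection.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdlib.h>

/* Stand-in for memcg_mutex: one global lock serializing both group
 * creation and kmem limit writes. */
static pthread_mutex_t memcg_mutex = PTHREAD_MUTEX_INITIALIZER;

struct memcg {
    long kmem_limit;
    bool kmem_active;
};

/* Limit setting: memcg_mutex alone already excludes concurrent
 * creation and concurrent writers, so no inner lock is taken. */
static void memcg_update_kmem_limit(struct memcg *m, long val)
{
    pthread_mutex_lock(&memcg_mutex);
    m->kmem_limit = val;
    m->kmem_active = true;
    pthread_mutex_unlock(&memcg_mutex);
}

/* Group creation: takes the same lock, so a child can never observe
 * a half-updated kmem state in its parent. */
static struct memcg *memcg_create(const struct memcg *parent)
{
    struct memcg *child = calloc(1, sizeof(*child));
    pthread_mutex_lock(&memcg_mutex);
    child->kmem_limit = parent->kmem_limit;   /* consistent snapshot */
    child->kmem_active = parent->kmem_active;
    pthread_mutex_unlock(&memcg_mutex);
    return child;
}
```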
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org
* Re: [PATCH v3 4/6] memcg: replace cgroup_lock with memcg specific memcg_lock
2013-01-21 15:34 ` Glauber Costa
@ 2013-01-21 16:07 ` Michal Hocko
2013-01-21 16:12 ` Glauber Costa
0 siblings, 1 reply; 21+ messages in thread
From: Michal Hocko @ 2013-01-21 16:07 UTC (permalink / raw)
To: Glauber Costa
Cc: cgroups, linux-mm, Tejun Heo, Johannes Weiner, kamezawa.hiroyu
On Mon 21-01-13 19:34:27, Glauber Costa wrote:
> On 01/21/2013 07:20 PM, Michal Hocko wrote:
> > On Mon 21-01-13 19:12:00, Glauber Costa wrote:
> >> On 01/21/2013 06:49 PM, Michal Hocko wrote:
> >>> On Mon 21-01-13 15:13:31, Glauber Costa wrote:
> >>>> After the preparation work done in earlier patches, the cgroup_lock can
> >>>> be trivially replaced with a memcg-specific lock. This is a mechanical
> >>>> translation at every site where the values involved were queried.
> >>>>
> >>>> The sites where values are written, however, used to be naturally called
> >>>> under cgroup_lock. This is the case, for instance, of the css_online
> >>>> callback. For those, we now need to explicitly take the memcg_lock.
> >>>>
> >>>> Also, now that the memcg_mutex is available, there is no need to abuse
> >>>> the set_limit mutex in kmemcg value setting. The memcg_mutex will do a
> >>>> better job, so we now use it instead.
> >>>
> >>> You will hate me for this because I should have said that in the
> >>> previous round already (but I will use "I showed mercy on you and
> >>> that blinded me" in my defense).
> >>> I am not so sure it will do a better job (it is only kmem that uses both
> >>> locks). I thought that memcg_mutex was just a first step and that we would
> >>> move to finer-grained locking later (the rather general documentation of
> >>> the lock even asks for it). So I would keep the limit mutex and figure out
> >>> whether memcg_mutex could be split up even further.
> >>>
> >>> Other than that the patch looks good to me
> >>>
> >> By now I have more than enough reasons to hate you, so this one won't
> >> add much. Even then, don't worry. Beer resets it all.
> >>
> >> That said, I disagree with you.
> >>
> >> As you noted yourself, kmem needs both locks:
> >> 1) cgroup_lock, because we need to prevent creation of sub-groups.
> >> 2) set_limit lock, because we need one - any one - memcg global lock to be
> >> held while we are manipulating the kmem-specific data structures, and we
> >> would not like to spread cgroup_lock all around for that.
> >>
> >> I now regret not having created the memcg_mutex for that: I'd be now
> >> just extending it to other users, instead of trying a replacement.
> >>
> >> So first of all, if the limit mutex is kept, we would *still* need to
> >> hold the memcg mutex to avoid children appearing. If we *ever* switch to
> >> a finer-grained lock(*), we will have to hold that lock anyway. So why
> >> hold set_limit_mutex??
> >
> > Yeah but memcg is not just kmem, is it?
>
> No, it belongs to all of us. It is usually called collaboration, but in
> the memcg context, we can say we are accomplices.
>
> > See mem_cgroup_resize_limit for
> > example. Why should it be linearized with, say, a new group creation?
>
> Because it is simpler to use the same lock, and all those operations are
> not exactly frequent.
>
> > Same thing with memsw.
>
> See, I'm not the only culprit!
FWIW, that one doesn't need the other lock either.
> > Besides that you know what those two locks are
> > intended for. memcg_mutex to prevent from races with a new group
> > creation and the limit lock for races with what-ever limit setting.
> > This sounds much more specific than
>
> Again: Can I keep holding the set_limit_mutex? Sure I can. But we still
> need to hold both, because kmemcg is also forbidden for groups that
> already have tasks.
Holding both is not a big deal if they are necessary - provided the
ordering is correct, which it is, since it doesn't change with the move
from cgroup_mutex. But please do not add new dependencies (like
linearizing regular limit setting with memcg creation).
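The ordering caveat above can be shown with a hypothetical userspace sketch (pthread mutexes, illustrative names; not the kernel code): as long as every path that needs both locks takes memcg_mutex before set_limit_mutex, holding both is safe, and a plain limit write never touches the creation lock at all.

```c
#include <assert.h>
#include <pthread.h>

/* Fixed acquisition order: memcg_mutex (outer) before
 * set_limit_mutex (inner). A path taking them in the reverse
 * order could deadlock against this one. */
static pthread_mutex_t memcg_mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t set_limit_mutex = PTHREAD_MUTEX_INITIALIZER;

static long kmem_limit;
static long mem_limit;

/* kmem limit setting: needs both locks, always outer -> inner. */
static void kmem_limit_write(long val)
{
    pthread_mutex_lock(&memcg_mutex);
    pthread_mutex_lock(&set_limit_mutex);
    kmem_limit = val;
    pthread_mutex_unlock(&set_limit_mutex);
    pthread_mutex_unlock(&memcg_mutex);
}

/* Plain limit setting: only the limit lock, so it is never
 * linearized with group creation (the "new dependency" to avoid). */
static void mem_limit_write(long val)
{
    pthread_mutex_lock(&set_limit_mutex);
    mem_limit = val;
    pthread_mutex_unlock(&set_limit_mutex);
}
```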
> And the reason why kmemcg holds the set_limit mutex
> is just to protect against itself, so there is no *need* to hold any
> extra lock (and we'll never be able to stop holding the creation lock,
> whatever it is). So my main point here is not memcg_mutex vs
> set_limit_mutex, but rather, memcg_mutex is needed anyway, and once it
> is taken, the set_limit_mutex *can* be held, but doesn't need to.
So you can update kmem specific usage of set_limit_mutex.
--
Michal Hocko
SUSE Labs
* Re: [PATCH v3 4/6] memcg: replace cgroup_lock with memcg specific memcg_lock
2013-01-21 16:07 ` Michal Hocko
@ 2013-01-21 16:12 ` Glauber Costa
2013-01-21 16:33 ` Michal Hocko
0 siblings, 1 reply; 21+ messages in thread
From: Glauber Costa @ 2013-01-21 16:12 UTC (permalink / raw)
To: Michal Hocko
Cc: cgroups, linux-mm, Tejun Heo, Johannes Weiner, kamezawa.hiroyu
On 01/21/2013 08:07 PM, Michal Hocko wrote:
>> > And the reason why kmemcg holds the set_limit mutex
>> > is just to protect against itself, so there is no *need* to hold any
>> > extra lock (and we'll never be able to stop holding the creation lock,
>> > whatever it is). So my main point here is not memcg_mutex vs
>> > set_limit_mutex, but rather, memcg_mutex is needed anyway, and once it
>> > is taken, the set_limit_mutex *can* be held, but doesn't need to.
> So you can update kmem specific usage of set_limit_mutex.
Meaning ?
* Re: [PATCH v3 4/6] memcg: replace cgroup_lock with memcg specific memcg_lock
2013-01-21 16:12 ` Glauber Costa
@ 2013-01-21 16:33 ` Michal Hocko
2013-01-21 17:37 ` Michal Hocko
0 siblings, 1 reply; 21+ messages in thread
From: Michal Hocko @ 2013-01-21 16:33 UTC (permalink / raw)
To: Glauber Costa
Cc: cgroups, linux-mm, Tejun Heo, Johannes Weiner, kamezawa.hiroyu
On Mon 21-01-13 20:12:17, Glauber Costa wrote:
> On 01/21/2013 08:07 PM, Michal Hocko wrote:
> >> > And the reason why kmemcg holds the set_limit mutex
> >> > is just to protect against itself, so there is no *need* to hold any
> >> > extra lock (and we'll never be able to stop holding the creation lock,
> >> > whatever it is). So my main point here is not memcg_mutex vs
> >> > set_limit_mutex, but rather, memcg_mutex is needed anyway, and once it
> >> > is taken, the set_limit_mutex *can* be held, but doesn't need to.
> > So you can update kmem specific usage of set_limit_mutex.
> Meaning ?
I thought you said it was not needed, and the code bears that out:
- memcg_propagate_kmem is called with memcg_mutex held in css_alloc
- memcg_update_kmem_limit takes both of them
- kmem_cache_destroy_memcg_children _doesn't_ take both
So one obvious way to go would be changing
kmem_cache_destroy_memcg_children to take memcg_mutex and removing
set_limit_mutex from the other two paths.
This would leave set_limit_mutex for its original intention.
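As a hypothetical sketch of that proposal (illustrative userspace stubs, not the kernel code), the three kmem paths would all serialize on memcg_mutex, while set_limit_mutex would again guard only plain limit writes:

```c
#include <assert.h>
#include <pthread.h>

static pthread_mutex_t memcg_mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t set_limit_mutex = PTHREAD_MUTEX_INITIALIZER;

static int kmem_ops;
static int limit_ops;

/* Called with memcg_mutex already held; under the proposal it no
 * longer takes set_limit_mutex on top. */
static void memcg_propagate_kmem(void)
{
    kmem_ops++;
}

/* Was: took both locks; now memcg_mutex alone covers it. */
static void memcg_update_kmem_limit(void)
{
    pthread_mutex_lock(&memcg_mutex);
    memcg_propagate_kmem();
    pthread_mutex_unlock(&memcg_mutex);
}

/* Was: took neither lock; switched to memcg_mutex so every kmem
 * state change shares the one lock. */
static void kmem_cache_destroy_memcg_children(void)
{
    pthread_mutex_lock(&memcg_mutex);
    kmem_ops++;
    pthread_mutex_unlock(&memcg_mutex);
}

/* set_limit_mutex keeps its original intention: plain limit writes,
 * never linearized with group creation. */
static void mem_cgroup_resize_limit(void)
{
    pthread_mutex_lock(&set_limit_mutex);
    limit_ops++;
    pthread_mutex_unlock(&set_limit_mutex);
}
```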
--
Michal Hocko
SUSE Labs
* Re: [PATCH v3 4/6] memcg: replace cgroup_lock with memcg specific memcg_lock
2013-01-21 16:33 ` Michal Hocko
@ 2013-01-21 17:37 ` Michal Hocko
0 siblings, 0 replies; 21+ messages in thread
From: Michal Hocko @ 2013-01-21 17:37 UTC (permalink / raw)
To: Glauber Costa
Cc: cgroups, linux-mm, Tejun Heo, Johannes Weiner, kamezawa.hiroyu
On Mon 21-01-13 17:33:49, Michal Hocko wrote:
> On Mon 21-01-13 20:12:17, Glauber Costa wrote:
> > On 01/21/2013 08:07 PM, Michal Hocko wrote:
> > >> > And the reason why kmemcg holds the set_limit mutex
> > >> > is just to protect against itself, so there is no *need* to hold any
> > >> > extra lock (and we'll never be able to stop holding the creation lock,
> > >> > whatever it is). So my main point here is not memcg_mutex vs
> > >> > set_limit_mutex, but rather, memcg_mutex is needed anyway, and once it
> > >> > is taken, the set_limit_mutex *can* be held, but doesn't need to.
> > > So you can update kmem specific usage of set_limit_mutex.
> > Meaning ?
>
> I thought you've said it is not needed and the code says that:
> - memcg_propagate_kmem is called with memcg_mutex held in css_alloc
> - memcg_update_kmem_limit takes both of them
> - kmem_cache_destroy_memcg_children _doesn't_ take both
>
> So one obvious way to go would be changing
> kmem_cache_destroy_memcg_children to take memcg_mutex and removing
> set_limit_mutex from the other two paths.
>
> This would leave set_limit_mutex for its original intention.
And then we could use a more suitable name: memcg_create_lock
--
Michal Hocko
SUSE Labs
Thread overview: 21+ messages
2013-01-21 11:13 [PATCH v3 0/6] replace cgroup_lock with memcg specific locking Glauber Costa
2013-01-21 11:13 ` [PATCH v3 1/6] memcg: prevent changes to move_charge_at_immigrate during task attach Glauber Costa
2013-01-21 11:13 ` [PATCH v3 2/6] memcg: split part of memcg creation to css_online Glauber Costa
2013-01-21 13:56 ` Michal Hocko
2013-01-21 11:13 ` [PATCH v3 3/6] memcg: fast hierarchy-aware child test Glauber Costa
2013-01-21 14:10 ` Michal Hocko
2013-01-21 11:13 ` [PATCH v3 4/6] memcg: replace cgroup_lock with memcg specific memcg_lock Glauber Costa
2013-01-21 14:49 ` Michal Hocko
2013-01-21 15:12 ` Glauber Costa
2013-01-21 15:20 ` Michal Hocko
2013-01-21 15:34 ` Glauber Costa
2013-01-21 16:07 ` Michal Hocko
2013-01-21 16:12 ` Glauber Costa
2013-01-21 16:33 ` Michal Hocko
2013-01-21 17:37 ` Michal Hocko
2013-01-21 11:13 ` [PATCH v3 5/6] memcg: increment static branch right after limit set Glauber Costa
2013-01-21 11:13 ` [PATCH v3 6/6] memcg: avoid dangling reference count in creation failure Glauber Costa
2013-01-21 12:30 ` Michal Hocko
2013-01-21 13:08 ` Glauber Costa
2013-01-21 13:19 ` Michal Hocko
2013-01-21 13:26 ` Glauber Costa