[PATCH] cgroup: Convert synchronize_rcu to call_rcu in cgroup_attach

Linux Container Development
 help / color / mirror / Atom feed

* [PATCH] cgroup: Convert synchronize_rcu to call_rcu in cgroup_attach_task
       [not found] ` <AANLkTikx6d0_VFtZ4zWQucRCf=vFt7N2M6=0jpnKasEE-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2010-11-22  4:06   ` Colin Cross
  0 siblings, 0 replies; 13+ messages in thread
From: Colin Cross @ 2010-11-22  4:06 UTC (permalink / raw)
  To: linux-kernel-u79uwXL29TY76Z2rM5mHXA
  Cc: Paul Menage,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Colin Cross

The synchronize_rcu call in cgroup_attach_task can be very
expensive.  All fastpath accesses to task->cgroups that expect
task->cgroups not to change already use task_lock() or
cgroup_lock() to protect against updates, and, in cgroup.c,
only the CGROUP_DEBUG files have RCU read-side critical
sections.

sched.c uses RCU read-side-critical sections on task->cgroups,
but only to ensure that a dereference of task->cgroups does
not become invalid, not that it doesn't change.

This patch adds a function put_css_set_rcu, which delays the
put until after a grace period has elapsed.  This ensures that
any RCU read-side critical sections that dereferenced
task->cgroups in sched.c have completed before the css_set is
deleted.  The synchronize_rcu()/put_css_set() combo in
cgroup_attach_task() can then be replaced with
put_css_set_rcu().

Also converts the CGROUP_DEBUG files that access
current->cgroups to use task_lock(current) instead of
rcu_read_lock().

Signed-off-by: Colin Cross <ccross-z5hGa2qSFaRBDgjK7y7TUQ@public.gmane.org>

---

This version fixes the problems with the previous patch by
keeping the use of RCU in cgroup_attach_task, but allowing
cgroup_attach_task to return immediately by deferring the
final put_css_reg to an rcu callback.

 include/linux/cgroup.h |    4 +++
 kernel/cgroup.c        |   58 ++++++++++++++++++++++++++++++++++++++----------
 2 files changed, 50 insertions(+), 12 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index ed4ba11..fd26218 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -287,6 +287,10 @@ struct css_set {
 
 	/* For RCU-protected deletion */
 	struct rcu_head rcu_head;
+
+	/* For RCU-delayed puts */
+	atomic_t delayed_put_count;
+	struct work_struct delayed_put_work;
 };
 
 /*
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 66a416b..c7348e7 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -298,7 +298,8 @@ static int cgroup_init_idr(struct cgroup_subsys *ss,
 
 /* css_set_lock protects the list of css_set objects, and the
  * chain of tasks off each css_set.  Nests outside task->alloc_lock
- * due to cgroup_iter_start() */
+ * due to cgroup_iter_start().  Never locked in irq context, so
+ * the non-irq variants of write_lock and read_lock are used. */
 static DEFINE_RWLOCK(css_set_lock);
 static int css_set_count;
 
@@ -396,6 +397,39 @@ static inline void put_css_set_taskexit(struct css_set *cg)
 	__put_css_set(cg, 1);
 }
 
+/* work function, executes in process context */
+static void __put_css_set_rcu(struct work_struct *work)
+{
+	struct css_set *cg;
+	cg = container_of(work, struct css_set, delayed_put_work);
+
+	while (atomic_add_unless(&cg->delayed_put_count, -1, 0))
+		put_css_set(cg);
+}
+
+/* rcu callback, executes in softirq context */
+static void _put_css_set_rcu(struct rcu_head *obj)
+{
+	struct css_set *cg = container_of(obj, struct css_set, rcu_head);
+
+	/* the rcu callback happens in softirq context, but css_set_lock
+	 * is not irq safe, so bounce to process context.
+	 */
+	schedule_work(&cg->delayed_put_work);
+}
+
+/* put_css_set_rcu - helper function to delay a put until after an rcu
+ * grace period
+ *
+ * free_css_set_rcu can never be called if there are outstanding
+ * put_css_set_rcu calls, so we can reuse cg->rcu_head.
+ */
+static inline void put_css_set_rcu(struct css_set *cg)
+{
+	if (atomic_inc_return(&cg->delayed_put_count) == 1)
+		call_rcu(&cg->rcu_head, _put_css_set_rcu);
+}
+
 /*
  * compare_css_sets - helper function for find_existing_css_set().
  * @cg: candidate css_set being tested
@@ -620,9 +654,11 @@ static struct css_set *find_css_set(
 	}
 
 	atomic_set(&res->refcount, 1);
+	atomic_set(&res->delayed_put_count, 0);
 	INIT_LIST_HEAD(&res->cg_links);
 	INIT_LIST_HEAD(&res->tasks);
 	INIT_HLIST_NODE(&res->hlist);
+	INIT_WORK(&res->delayed_put_work, __put_css_set_rcu);
 
 	/* Copy the set of subsystem state objects generated in
 	 * find_existing_css_set() */
@@ -725,9 +761,9 @@ static struct cgroup *task_cgroup_from_root(struct task_struct *task,
  * cgroup_attach_task(), which overwrites one tasks cgroup pointer with
  * another.  It does so using cgroup_mutex, however there are
  * several performance critical places that need to reference
- * task->cgroup without the expense of grabbing a system global
+ * task->cgroups without the expense of grabbing a system global
  * mutex.  Therefore except as noted below, when dereferencing or, as
- * in cgroup_attach_task(), modifying a task'ss cgroup pointer we use
+ * in cgroup_attach_task(), modifying a task's cgroups pointer we use
  * task_lock(), which acts on a spinlock (task->alloc_lock) already in
  * the task_struct routinely used for such matters.
  *
@@ -1802,8 +1838,7 @@ int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
 			ss->attach(ss, cgrp, oldcgrp, tsk, false);
 	}
 	set_bit(CGRP_RELEASABLE, &oldcgrp->flags);
-	synchronize_rcu();
-	put_css_set(cg);
+	put_css_set_rcu(cg);
 
 	/*
 	 * wake up rmdir() waiter. the rmdir should fail since the cgroup
@@ -3900,6 +3935,7 @@ int __init cgroup_init_early(void)
 	INIT_LIST_HEAD(&init_css_set.cg_links);
 	INIT_LIST_HEAD(&init_css_set.tasks);
 	INIT_HLIST_NODE(&init_css_set.hlist);
+	INIT_WORK(&init_css_set.delayed_put_work, __put_css_set_rcu);
 	css_set_count = 1;
 	init_cgroup_root(&rootnode);
 	root_count = 1;
@@ -4827,9 +4863,9 @@ static u64 current_css_set_refcount_read(struct cgroup *cont,
 {
 	u64 count;
 
-	rcu_read_lock();
+	task_lock(current);
 	count = atomic_read(&current->cgroups->refcount);
-	rcu_read_unlock();
+	task_unlock(current);
 	return count;
 }
 
@@ -4838,12 +4874,10 @@ static int current_css_set_cg_links_read(struct cgroup *cont,
 					 struct seq_file *seq)
 {
 	struct cg_cgroup_link *link;
-	struct css_set *cg;
 
 	read_lock(&css_set_lock);
-	rcu_read_lock();
-	cg = rcu_dereference(current->cgroups);
-	list_for_each_entry(link, &cg->cg_links, cg_link_list) {
+	task_lock(current);
+	list_for_each_entry(link, &current->cgroups->cg_links, cg_link_list) {
 		struct cgroup *c = link->cgrp;
 		const char *name;
 
@@ -4854,7 +4888,7 @@ static int current_css_set_cg_links_read(struct cgroup *cont,
 		seq_printf(seq, "Root %d group %s\n",
 			   c->root->hierarchy_id, name);
 	}
-	rcu_read_unlock();
+	task_unlock(current);
 	read_unlock(&css_set_lock);
 	return 0;
 }
-- 
1.7.3.1

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH] cgroup: Convert synchronize_rcu to call_rcu in cgroup_attach_task
       [not found]   ` <1290398767-15230-1-git-send-email-ccross-z5hGa2qSFaRBDgjK7y7TUQ@public.gmane.org>
@ 2010-11-23  8:14     ` Li Zefan
  2010-11-24  1:24     ` Paul Menage
  1 sibling, 0 replies; 13+ messages in thread
From: Li Zefan @ 2010-11-23  8:14 UTC (permalink / raw)
  To: Colin Cross
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Paul Menage, linux-kernel-u79uwXL29TY76Z2rM5mHXA

12:06, Colin Cross wrote:
> The synchronize_rcu call in cgroup_attach_task can be very
> expensive.  All fastpath accesses to task->cgroups that expect
> task->cgroups not to change already use task_lock() or
> cgroup_lock() to protect against updates, and, in cgroup.c,
> only the CGROUP_DEBUG files have RCU read-side critical
> sections.
> 
> sched.c uses RCU read-side-critical sections on task->cgroups,
> but only to ensure that a dereference of task->cgroups does
> not become invalid, not that it doesn't change.
> 

Other cgroup subsystems also use rcu_read_lock to access task->cgroups,
for example net_cls cgroup and device cgroup.

I don't think the performance of task attaching is so critically
important that we have to use call_rcu() instead of synchronize_rcu()?

> This patch adds a function put_css_set_rcu, which delays the
> put until after a grace period has elapsed.  This ensures that
> any RCU read-side critical sections that dereferenced
> task->cgroups in sched.c have completed before the css_set is
> deleted.  The synchronize_rcu()/put_css_set() combo in
> cgroup_attach_task() can then be replaced with
> put_css_set_rcu().
> 

> Also converts the CGROUP_DEBUG files that access
> current->cgroups to use task_lock(current) instead of
> rcu_read_lock().
> 

What for? What do we gain from doing this for those debug
interfaces?

> Signed-off-by: Colin Cross <ccross-z5hGa2qSFaRBDgjK7y7TUQ@public.gmane.org>
> 
> ---
> 
> This version fixes the problems with the previous patch by
> keeping the use of RCU in cgroup_attach_task, but allowing
> cgroup_attach_task to return immediately by deferring the
> final put_css_reg to an rcu callback.
> 
>  include/linux/cgroup.h |    4 +++
>  kernel/cgroup.c        |   58 ++++++++++++++++++++++++++++++++++++++----------
>  2 files changed, 50 insertions(+), 12 deletions(-)

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] cgroup: Convert synchronize_rcu to call_rcu in cgroup_attach_task
       [not found]     ` <4CEB77E0.10202-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
@ 2010-11-23  8:58       ` Colin Cross
  0 siblings, 0 replies; 13+ messages in thread
From: Colin Cross @ 2010-11-23  8:58 UTC (permalink / raw)
  To: Li Zefan
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Paul Menage, linux-kernel-u79uwXL29TY76Z2rM5mHXA

On Tue, Nov 23, 2010 at 12:14 AM, Li Zefan <lizf-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org> wrote:
> 12:06, Colin Cross wrote:
>> The synchronize_rcu call in cgroup_attach_task can be very
>> expensive.  All fastpath accesses to task->cgroups that expect
>> task->cgroups not to change already use task_lock() or
>> cgroup_lock() to protect against updates, and, in cgroup.c,
>> only the CGROUP_DEBUG files have RCU read-side critical
>> sections.
>>
>> sched.c uses RCU read-side-critical sections on task->cgroups,
>> but only to ensure that a dereference of task->cgroups does
>> not become invalid, not that it doesn't change.
>>
>
> Other cgroup subsystems also use rcu_read_lock to access task->cgroups,
> for example net_cls cgroup and device cgroup.
I believe the same comment applies as sched.c, I'll update the commit message.

> I don't think the performance of task attaching is so critically
> important that we have to use call_rcu() instead of synchronize_rcu()?
On my desktop, moving a task between cgroups averages 100 ms, and on
an Tegra2 SMP ARM platform it takes 20 ms.  Moving a task with many
threads can take hundreds of milliseconds or more.  With this patch it
takes 50 microseconds to move one task, a 400x improvement.

>> This patch adds a function put_css_set_rcu, which delays the
>> put until after a grace period has elapsed.  This ensures that
>> any RCU read-side critical sections that dereferenced
>> task->cgroups in sched.c have completed before the css_set is
>> deleted.  The synchronize_rcu()/put_css_set() combo in
>> cgroup_attach_task() can then be replaced with
>> put_css_set_rcu().
>>
>
>> Also converts the CGROUP_DEBUG files that access
>> current->cgroups to use task_lock(current) instead of
>> rcu_read_lock().
>>
>
> What for? What do we gain from doing this for those debug
> interfaces?
Left over from the previous patch that incorrectly dropped RCU
completely.  I'll put the rcu_read_locks back.

>> Signed-off-by: Colin Cross <ccross-z5hGa2qSFaRBDgjK7y7TUQ@public.gmane.org>
>>
>> ---
>>
>> This version fixes the problems with the previous patch by
>> keeping the use of RCU in cgroup_attach_task, but allowing
>> cgroup_attach_task to return immediately by deferring the
>> final put_css_reg to an rcu callback.
>>
>>  include/linux/cgroup.h |    4 +++
>>  kernel/cgroup.c        |   58 ++++++++++++++++++++++++++++++++++++++----------
>>  2 files changed, 50 insertions(+), 12 deletions(-)
>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] cgroup: Convert synchronize_rcu to call_rcu in cgroup_attach_task
       [not found]       ` <AANLkTimjpW6NZ6fEiVi0VzjkpQGVob4=VHsohXUiDQkJ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2010-11-23 20:22         ` Colin Cross
  0 siblings, 0 replies; 13+ messages in thread
From: Colin Cross @ 2010-11-23 20:22 UTC (permalink / raw)
  To: Li Zefan
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Paul Menage, linux-kernel-u79uwXL29TY76Z2rM5mHXA

On Tue, Nov 23, 2010 at 12:58 AM, Colin Cross <ccross-z5hGa2qSFaRBDgjK7y7TUQ@public.gmane.org> wrote:
> On Tue, Nov 23, 2010 at 12:14 AM, Li Zefan <lizf-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org> wrote:
>> 12:06, Colin Cross wrote:
>>> The synchronize_rcu call in cgroup_attach_task can be very
>>> expensive.  All fastpath accesses to task->cgroups that expect
>>> task->cgroups not to change already use task_lock() or
>>> cgroup_lock() to protect against updates, and, in cgroup.c,
>>> only the CGROUP_DEBUG files have RCU read-side critical
>>> sections.
>>>
>>> sched.c uses RCU read-side-critical sections on task->cgroups,
>>> but only to ensure that a dereference of task->cgroups does
>>> not become invalid, not that it doesn't change.
>>>
>>
>> Other cgroup subsystems also use rcu_read_lock to access task->cgroups,
>> for example net_cls cgroup and device cgroup.
> I believe the same comment applies as sched.c, I'll update the commit message.
>
>> I don't think the performance of task attaching is so critically
>> important that we have to use call_rcu() instead of synchronize_rcu()?
> On my desktop, moving a task between cgroups averages 100 ms, and on
> an Tegra2 SMP ARM platform it takes 20 ms.  Moving a task with many
> threads can take hundreds of milliseconds or more.  With this patch it
> takes 50 microseconds to move one task, a 400x improvement.
>
>>> This patch adds a function put_css_set_rcu, which delays the
>>> put until after a grace period has elapsed.  This ensures that
>>> any RCU read-side critical sections that dereferenced
>>> task->cgroups in sched.c have completed before the css_set is
>>> deleted.  The synchronize_rcu()/put_css_set() combo in
>>> cgroup_attach_task() can then be replaced with
>>> put_css_set_rcu().
>>>
>>
>>> Also converts the CGROUP_DEBUG files that access
>>> current->cgroups to use task_lock(current) instead of
>>> rcu_read_lock().
>>>
>>
>> What for? What do we gain from doing this for those debug
>> interfaces?
> Left over from the previous patch that incorrectly dropped RCU
> completely.  I'll put the rcu_read_locks back.
>
>>> Signed-off-by: Colin Cross <ccross-z5hGa2qSFaRBDgjK7y7TUQ@public.gmane.org>
>>>
>>> ---
>>>
>>> This version fixes the problems with the previous patch by
>>> keeping the use of RCU in cgroup_attach_task, but allowing
>>> cgroup_attach_task to return immediately by deferring the
>>> final put_css_reg to an rcu callback.
>>>
>>>  include/linux/cgroup.h |    4 +++
>>>  kernel/cgroup.c        |   58 ++++++++++++++++++++++++++++++++++++++----------
>>>  2 files changed, 50 insertions(+), 12 deletions(-)
>>
>

This patch has another problem - calling put_css_set_rcu twice before
an rcu grace period has elapsed would not guarantee the appropriate
rcu grace period for the second call.  I'll try a new approach, moving
the parts of put_css_set that need to be protected by rcu into
free_css_set_rcu.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] cgroup: Convert synchronize_rcu to call_rcu in cgroup_attach_task
       [not found]   ` <1290398767-15230-1-git-send-email-ccross-z5hGa2qSFaRBDgjK7y7TUQ@public.gmane.org>
  2010-11-23  8:14     ` Li Zefan
@ 2010-11-24  1:24     ` Paul Menage
  1 sibling, 0 replies; 13+ messages in thread
From: Paul Menage @ 2010-11-24  1:24 UTC (permalink / raw)
  To: Colin Cross
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA

On Sun, Nov 21, 2010 at 8:06 PM, Colin Cross <ccross-z5hGa2qSFaRBDgjK7y7TUQ@public.gmane.org> wrote:
> The synchronize_rcu call in cgroup_attach_task can be very
> expensive.  All fastpath accesses to task->cgroups that expect
> task->cgroups not to change already use task_lock() or
> cgroup_lock() to protect against updates, and, in cgroup.c,
> only the CGROUP_DEBUG files have RCU read-side critical
> sections.

I definitely agree with the goal of using lighter-weight
synchronization than the current synchronize_rcu() call. However,
there are definitely some subtleties to worry about in this code.

One of the reasons originally for the current synchronization was to
avoid the case of calling subsystem destroy() callbacks while there
could still be threads with RCU references to the subsystem state. The
fact that synchronize_rcu() was called within a cgroup_mutex critical
section meant that an rmdir (or any other significant cgrooup
management action) couldn't possibly start until any RCU read sections
were done.

I suspect that when we moved a lot of the cgroup teardown code from
cgroup_rmdir() to cgroup_diput() (which also has a synchronize_rcu()
call in it) this restriction could have been eased, but I think I left
it as it was mostly out of paranoia that I was missing/forgetting some
crucial reason for keeping it in place.

I'd suggest trying the following approach, which I suspect is similar
to what you were suggesting in your last email

1) make find_existing_css_set ignore css_set objects with a zero refcount
2) change __put_css_set to be simply

if (atomic_dec_and_test(&cg->refcount)) {
  call_rcu(&cg->rcu_head, free_css_set_rcu);
}

3) move the rest of __put_css_set into a delayed work struct that's
scheduled by free_css_set_rcu

4) Get rid of the taskexit parameter - I think we can do that via a
simple flag that indicates whether any task has ever been moved into
the cgroup.

5) Put extra checks in cgroup_rmdir() such that if it tries to remove
a cgroup that has a non-zero refcount, it scans the cgroup's css_sets
list - if it finds only zero-refcount entries, then wait (via
synchronize_rcu() or some other appropriate means, maybe reusing the
CGRP_WAIT_ON_RMDIR mechanism?) until the css_set objects have been
fully cleaned up and the cgroup's refcounts have been released.
Otherwise the operation of moving the last thread out of a cgroup and
immediately deleting the cgroup would very likely fail with an EBUSY

Paul

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH] cgroup: Remove call to synchronize_rcu in cgroup_attach_task
       [not found]     ` <AANLkTi=4-OgPUugnUBaqSU3oC=3wxTjAsOB_Ais3Or+i-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2010-11-24  1:43       ` Colin Cross
       [not found]         ` <4D3A3024.9040402@codeaurora.org>
       [not found]         ` <1290563018-2804-1-git-send-email-ccross-z5hGa2qSFaRBDgjK7y7TUQ@public.gmane.org>
  2010-11-24  2:06       ` [PATCH] cgroup: Convert synchronize_rcu to call_rcu " Li Zefan
  1 sibling, 2 replies; 13+ messages in thread
From: Colin Cross @ 2010-11-24  1:43 UTC (permalink / raw)
  Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA, Paul Menage,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Colin Cross

synchronize_rcu can be very expensive, averaging 100 ms in
some cases.  In cgroup_attach_task, it is used to prevent
a task->cgroups pointer dereferenced in an RCU read side
critical section from being invalidated by delaying the call
to put_css_set until after an RCU grace period.

To avoid the call to synchronize_rcu, make the put_css_set
call rcu-safe by moving the deletion of the css_set links
into rcu-protected free_css_set_rcu.

The calls to check_for_release in free_css_set_rcu now occur
in softirq context, so convert all uses of the
release_list_lock spinlock to irq safe versions.

The decrement of the cgroup refcount is no longer
synchronous with the call to put_css_set, which can result
in the cgroup refcount staying positive after the last call
to cgroup_attach_task returns.  To allow the cgroup to be
deleted with cgroup_rmdir synchronously after
cgroup_attach_task, introduce a second refcount,
rmdir_count, that is decremented synchronously in
put_css_set.  If cgroup_rmdir is called on a cgroup for
hich rmdir_count is zero but count is nonzero, reuse the
rmdir waitqueue to block the rmdir until the rcu callback
is called.

Signed-off-by: Colin Cross <ccross-z5hGa2qSFaRBDgjK7y7TUQ@public.gmane.org>
---

This patch is similar to what you described.  The main differences are
that I used a new atomic to handle the rmdir case, and I converted
check_for_release to be callable in softirq context rather than schedule
work in free_css_set_rcu.  Your css_set scanning in rmdir sounds better,
I'll take another look at that.  Is there any problem with disabling irqs
with spin_lock_irqsave in check_for_release?

 include/linux/cgroup.h |    6 ++
 kernel/cgroup.c        |  124 ++++++++++++++++++++++++++++--------------------
 2 files changed, 78 insertions(+), 52 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index ed4ba11..3b6e73d 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -202,6 +202,12 @@ struct cgroup {
 	atomic_t count;
 
 	/*
+	 * separate refcount for rmdir on a cgroup.  When rmdir_count is 0,
+	 * rmdir should block until count is 0.
+	 */
+	atomic_t rmdir_count;
+
+	/*
 	 * We link our 'sibling' struct into our parent's 'children'.
 	 * Our children link their 'sibling' into our 'children'.
 	 */
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 66a416b..fa3c0ac 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -267,6 +267,33 @@ static void cgroup_release_agent(struct work_struct *work);
 static DECLARE_WORK(release_agent_work, cgroup_release_agent);
 static void check_for_release(struct cgroup *cgrp);
 
+/*
+ * A queue for waiters to do rmdir() cgroup. A tasks will sleep when
+ * cgroup->count == 0 && list_empty(&cgroup->children) && subsys has some
+ * reference to css->refcnt. In general, this refcnt is expected to goes down
+ * to zero, soon.
+ *
+ * CGRP_WAIT_ON_RMDIR flag is set under cgroup's inode->i_mutex;
+ */
+DECLARE_WAIT_QUEUE_HEAD(cgroup_rmdir_waitq);
+
+static void cgroup_wakeup_rmdir_waiter(struct cgroup *cgrp)
+{
+	if (unlikely(test_and_clear_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags)))
+		wake_up_all(&cgroup_rmdir_waitq);
+}
+
+void cgroup_exclude_rmdir(struct cgroup_subsys_state *css)
+{
+	css_get(css);
+}
+
+void cgroup_release_and_wakeup_rmdir(struct cgroup_subsys_state *css)
+{
+	cgroup_wakeup_rmdir_waiter(css->cgroup);
+	css_put(css);
+}
+
 /* Link structure for associating css_set objects with cgroups */
 struct cg_cgroup_link {
 	/*
@@ -329,6 +356,22 @@ static struct hlist_head *css_set_hash(struct cgroup_subsys_state *css[])
 static void free_css_set_rcu(struct rcu_head *obj)
 {
 	struct css_set *cg = container_of(obj, struct css_set, rcu_head);
+	struct cg_cgroup_link *link;
+	struct cg_cgroup_link *saved_link;
+
+	/* Nothing else can have a reference to cg, no need for css_set_lock */
+	list_for_each_entry_safe(link, saved_link, &cg->cg_links,
+				 cg_link_list) {
+		struct cgroup *cgrp = link->cgrp;
+		list_del(&link->cg_link_list);
+		list_del(&link->cgrp_link_list);
+		if (atomic_dec_and_test(&cgrp->count)) {
+			check_for_release(cgrp);
+			cgroup_wakeup_rmdir_waiter(cgrp);
+		}
+		kfree(link);
+	}
+
 	kfree(cg);
 }
 
@@ -355,23 +398,20 @@ static void __put_css_set(struct css_set *cg, int taskexit)
 		return;
 	}
 
-	/* This css_set is dead. unlink it and release cgroup refcounts */
 	hlist_del(&cg->hlist);
 	css_set_count--;
 
+	/* This css_set is now unreachable from the css_set_table, but RCU
+	 * read-side critical sections may still have a reference to it.
+	 * Decrement the cgroup rmdir_count so that rmdir's on an empty
+	 * cgroup can block until the free_css_set_rcu callback */
 	list_for_each_entry_safe(link, saved_link, &cg->cg_links,
 				 cg_link_list) {
 		struct cgroup *cgrp = link->cgrp;
-		list_del(&link->cg_link_list);
-		list_del(&link->cgrp_link_list);
-		if (atomic_dec_and_test(&cgrp->count) &&
-		    notify_on_release(cgrp)) {
-			if (taskexit)
-				set_bit(CGRP_RELEASABLE, &cgrp->flags);
-			check_for_release(cgrp);
-		}
-
-		kfree(link);
+		if (taskexit)
+			set_bit(CGRP_RELEASABLE, &cgrp->flags);
+		atomic_dec(&cgrp->rmdir_count);
+		smp_mb();
 	}
 
 	write_unlock(&css_set_lock);
@@ -571,6 +611,8 @@ static void link_css_set(struct list_head *tmp_cg_links,
 				cgrp_link_list);
 	link->cg = cg;
 	link->cgrp = cgrp;
+	atomic_inc(&cgrp->rmdir_count);
+	smp_mb(); /* make sure rmdir_count increments first */
 	atomic_inc(&cgrp->count);
 	list_move(&link->cgrp_link_list, &cgrp->css_sets);
 	/*
@@ -725,9 +767,9 @@ static struct cgroup *task_cgroup_from_root(struct task_struct *task,
  * cgroup_attach_task(), which overwrites one tasks cgroup pointer with
  * another.  It does so using cgroup_mutex, however there are
  * several performance critical places that need to reference
- * task->cgroup without the expense of grabbing a system global
+ * task->cgroups without the expense of grabbing a system global
  * mutex.  Therefore except as noted below, when dereferencing or, as
- * in cgroup_attach_task(), modifying a task'ss cgroup pointer we use
+ * in cgroup_attach_task(), modifying a task's cgroups pointer we use
  * task_lock(), which acts on a spinlock (task->alloc_lock) already in
  * the task_struct routinely used for such matters.
  *
@@ -909,33 +951,6 @@ static void cgroup_d_remove_dir(struct dentry *dentry)
 }
 
 /*
- * A queue for waiters to do rmdir() cgroup. A tasks will sleep when
- * cgroup->count == 0 && list_empty(&cgroup->children) && subsys has some
- * reference to css->refcnt. In general, this refcnt is expected to goes down
- * to zero, soon.
- *
- * CGRP_WAIT_ON_RMDIR flag is set under cgroup's inode->i_mutex;
- */
-DECLARE_WAIT_QUEUE_HEAD(cgroup_rmdir_waitq);
-
-static void cgroup_wakeup_rmdir_waiter(struct cgroup *cgrp)
-{
-	if (unlikely(test_and_clear_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags)))
-		wake_up_all(&cgroup_rmdir_waitq);
-}
-
-void cgroup_exclude_rmdir(struct cgroup_subsys_state *css)
-{
-	css_get(css);
-}
-
-void cgroup_release_and_wakeup_rmdir(struct cgroup_subsys_state *css)
-{
-	cgroup_wakeup_rmdir_waiter(css->cgroup);
-	css_put(css);
-}
-
-/*
  * Call with cgroup_mutex held. Drops reference counts on modules, including
  * any duplicate ones that parse_cgroupfs_options took. If this function
  * returns an error, no reference counts are touched.
@@ -1802,7 +1817,7 @@ int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
 			ss->attach(ss, cgrp, oldcgrp, tsk, false);
 	}
 	set_bit(CGRP_RELEASABLE, &oldcgrp->flags);
-	synchronize_rcu();
+	/* put_css_set will not destroy cg until after an RCU grace period */
 	put_css_set(cg);
 
 	/*
@@ -3566,11 +3581,12 @@ static int cgroup_rmdir(struct inode *unused_dir, struct dentry *dentry)
 	DEFINE_WAIT(wait);
 	struct cgroup_event *event, *tmp;
 	int ret;
+	unsigned long flags;
 
 	/* the vfs holds both inode->i_mutex already */
 again:
 	mutex_lock(&cgroup_mutex);
-	if (atomic_read(&cgrp->count) != 0) {
+	if (atomic_read(&cgrp->rmdir_count) != 0) {
 		mutex_unlock(&cgroup_mutex);
 		return -EBUSY;
 	}
@@ -3603,13 +3619,13 @@ again:
 
 	mutex_lock(&cgroup_mutex);
 	parent = cgrp->parent;
-	if (atomic_read(&cgrp->count) || !list_empty(&cgrp->children)) {
+	if (atomic_read(&cgrp->rmdir_count) || !list_empty(&cgrp->children)) {
 		clear_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags);
 		mutex_unlock(&cgroup_mutex);
 		return -EBUSY;
 	}
 	prepare_to_wait(&cgroup_rmdir_waitq, &wait, TASK_INTERRUPTIBLE);
-	if (!cgroup_clear_css_refs(cgrp)) {
+	if (atomic_read(&cgrp->count) != 0 || !cgroup_clear_css_refs(cgrp)) {
 		mutex_unlock(&cgroup_mutex);
 		/*
 		 * Because someone may call cgroup_wakeup_rmdir_waiter() before
@@ -3627,11 +3643,11 @@ again:
 	finish_wait(&cgroup_rmdir_waitq, &wait);
 	clear_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags);
 
-	spin_lock(&release_list_lock);
+	spin_lock_irqsave(&release_list_lock, flags);
 	set_bit(CGRP_REMOVED, &cgrp->flags);
 	if (!list_empty(&cgrp->release_list))
 		list_del(&cgrp->release_list);
-	spin_unlock(&release_list_lock);
+	spin_unlock_irqrestore(&release_list_lock, flags);
 
 	cgroup_lock_hierarchy(cgrp->root);
 	/* delete this cgroup from parent->children */
@@ -4389,6 +4405,8 @@ int cgroup_is_descendant(const struct cgroup *cgrp, struct task_struct *task)
 
 static void check_for_release(struct cgroup *cgrp)
 {
+	unsigned long flags;
+
 	/* All of these checks rely on RCU to keep the cgroup
 	 * structure alive */
 	if (cgroup_is_releasable(cgrp) && !atomic_read(&cgrp->count)
@@ -4397,13 +4415,13 @@ static void check_for_release(struct cgroup *cgrp)
 		 * already queued for a userspace notification, queue
 		 * it now */
 		int need_schedule_work = 0;
-		spin_lock(&release_list_lock);
+		spin_lock_irqsave(&release_list_lock, flags);
 		if (!cgroup_is_removed(cgrp) &&
 		    list_empty(&cgrp->release_list)) {
 			list_add(&cgrp->release_list, &release_list);
 			need_schedule_work = 1;
 		}
-		spin_unlock(&release_list_lock);
+		spin_unlock_irqrestore(&release_list_lock, flags);
 		if (need_schedule_work)
 			schedule_work(&release_agent_work);
 	}
@@ -4453,9 +4471,11 @@ EXPORT_SYMBOL_GPL(__css_put);
  */
 static void cgroup_release_agent(struct work_struct *work)
 {
+	unsigned long flags;
+
 	BUG_ON(work != &release_agent_work);
 	mutex_lock(&cgroup_mutex);
-	spin_lock(&release_list_lock);
+	spin_lock_irqsave(&release_list_lock, flags);
 	while (!list_empty(&release_list)) {
 		char *argv[3], *envp[3];
 		int i;
@@ -4464,7 +4484,7 @@ static void cgroup_release_agent(struct work_struct *work)
 						    struct cgroup,
 						    release_list);
 		list_del_init(&cgrp->release_list);
-		spin_unlock(&release_list_lock);
+		spin_unlock_irqrestore(&release_list_lock, flags);
 		pathbuf = kmalloc(PAGE_SIZE, GFP_KERNEL);
 		if (!pathbuf)
 			goto continue_free;
@@ -4494,9 +4514,9 @@ static void cgroup_release_agent(struct work_struct *work)
  continue_free:
 		kfree(pathbuf);
 		kfree(agentbuf);
-		spin_lock(&release_list_lock);
+		spin_lock_irqsave(&release_list_lock, flags);
 	}
-	spin_unlock(&release_list_lock);
+	spin_unlock_irqrestore(&release_list_lock, flags);
 	mutex_unlock(&cgroup_mutex);
 }
 
-- 
1.7.3.1

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH] cgroup: Convert synchronize_rcu to call_rcu in cgroup_attach_task
       [not found]     ` <AANLkTi=4-OgPUugnUBaqSU3oC=3wxTjAsOB_Ais3Or+i-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2010-11-24  1:43       ` [PATCH] cgroup: Remove call to synchronize_rcu " Colin Cross
@ 2010-11-24  2:06       ` Li Zefan
  1 sibling, 0 replies; 13+ messages in thread
From: Li Zefan @ 2010-11-24  2:06 UTC (permalink / raw)
  To: Paul Menage
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Colin Cross

Paul Menage wrote:
> On Sun, Nov 21, 2010 at 8:06 PM, Colin Cross <ccross-z5hGa2qSFaRBDgjK7y7TUQ@public.gmane.org> wrote:
>> The synchronize_rcu call in cgroup_attach_task can be very
>> expensive.  All fastpath accesses to task->cgroups that expect
>> task->cgroups not to change already use task_lock() or
>> cgroup_lock() to protect against updates, and, in cgroup.c,
>> only the CGROUP_DEBUG files have RCU read-side critical
>> sections.
> 
> I definitely agree with the goal of using lighter-weight
> synchronization than the current synchronize_rcu() call. However,
> there are definitely some subtleties to worry about in this code.
> 
> One of the reasons originally for the current synchronization was to
> avoid the case of calling subsystem destroy() callbacks while there
> could still be threads with RCU references to the subsystem state. The
> fact that synchronize_rcu() was called within a cgroup_mutex critical
> section meant that an rmdir (or any other significant cgrooup
> management action) couldn't possibly start until any RCU read sections
> were done.
> 
> I suspect that when we moved a lot of the cgroup teardown code from
> cgroup_rmdir() to cgroup_diput() (which also has a synchronize_rcu()
> call in it) this restriction could have been eased, but I think I left
> it as it was mostly out of paranoia that I was missing/forgetting some
> crucial reason for keeping it in place.
> 
> I'd suggest trying the following approach, which I suspect is similar
> to what you were suggesting in your last email
> 
> 1) make find_existing_css_set ignore css_set objects with a zero refcount
> 2) change __put_css_set to be simply
> 
> if (atomic_dec_and_test(&cg->refcount)) {
>   call_rcu(&cg->rcu_head, free_css_set_rcu);
> }

If we do this, it's not anymore safe to use get_css_set(), which just
increments the refcount without checking if it's zero.

> 
> 3) move the rest of __put_css_set into a delayed work struct that's
> scheduled by free_css_set_rcu
> 
> 4) Get rid of the taskexit parameter - I think we can do that via a
> simple flag that indicates whether any task has ever been moved into
> the cgroup.
> 
> 5) Put extra checks in cgroup_rmdir() such that if it tries to remove
> a cgroup that has a non-zero refcount, it scans the cgroup's css_sets
> list - if it finds only zero-refcount entries, then wait (via
> synchronize_rcu() or some other appropriate means, maybe reusing the
> CGRP_WAIT_ON_RMDIR mechanism?) until the css_set objects have been
> fully cleaned up and the cgroup's refcounts have been released.
> Otherwise the operation of moving the last thread out of a cgroup and
> immediately deleting the cgroup would very likely fail with an EBUSY
> 

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] cgroup: Convert synchronize_rcu to call_rcu in cgroup_attach_task
       [not found]       ` <4CEC7329.7070909-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
@ 2010-11-24  2:10         ` Colin Cross
  2010-11-24 18:58         ` Paul Menage
  1 sibling, 0 replies; 13+ messages in thread
From: Colin Cross @ 2010-11-24  2:10 UTC (permalink / raw)
  To: Li Zefan
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Paul Menage, linux-kernel-u79uwXL29TY76Z2rM5mHXA

On Tue, Nov 23, 2010 at 6:06 PM, Li Zefan <lizf-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org> wrote:
> Paul Menage wrote:
>> On Sun, Nov 21, 2010 at 8:06 PM, Colin Cross <ccross-z5hGa2qSFaRBDgjK7y7TUQ@public.gmane.org> wrote:
>>> The synchronize_rcu call in cgroup_attach_task can be very
>>> expensive.  All fastpath accesses to task->cgroups that expect
>>> task->cgroups not to change already use task_lock() or
>>> cgroup_lock() to protect against updates, and, in cgroup.c,
>>> only the CGROUP_DEBUG files have RCU read-side critical
>>> sections.
>>
>> I definitely agree with the goal of using lighter-weight
>> synchronization than the current synchronize_rcu() call. However,
>> there are definitely some subtleties to worry about in this code.
>>
>> One of the reasons originally for the current synchronization was to
>> avoid the case of calling subsystem destroy() callbacks while there
>> could still be threads with RCU references to the subsystem state. The
>> fact that synchronize_rcu() was called within a cgroup_mutex critical
>> section meant that an rmdir (or any other significant cgrooup
>> management action) couldn't possibly start until any RCU read sections
>> were done.
>>
>> I suspect that when we moved a lot of the cgroup teardown code from
>> cgroup_rmdir() to cgroup_diput() (which also has a synchronize_rcu()
>> call in it) this restriction could have been eased, but I think I left
>> it as it was mostly out of paranoia that I was missing/forgetting some
>> crucial reason for keeping it in place.
>>
>> I'd suggest trying the following approach, which I suspect is similar
>> to what you were suggesting in your last email
>>
>> 1) make find_existing_css_set ignore css_set objects with a zero refcount
>> 2) change __put_css_set to be simply
>>
>> if (atomic_dec_and_test(&cg->refcount)) {
>>   call_rcu(&cg->rcu_head, free_css_set_rcu);
>> }
>
> If we do this, it's not anymore safe to use get_css_set(), which just
> increments the refcount without checking if it's zero.

I used an alternate approach, removing the css_set from the hash table
in put_css_set, but delaying the deletion to free_css_set_rcu.  That
way, nothing can get another reference to the css_set to call
get_css_set on.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] cgroup: Remove call to synchronize_rcu in cgroup_attach_task
       [not found]         ` <1290563018-2804-1-git-send-email-ccross-z5hGa2qSFaRBDgjK7y7TUQ@public.gmane.org>
@ 2010-11-24  2:29           ` Colin Cross
  2011-01-22  1:17           ` Bryan Huntsman
  2011-01-28  1:17           ` Bryan Huntsman
  2 siblings, 0 replies; 13+ messages in thread
From: Colin Cross @ 2010-11-24  2:29 UTC (permalink / raw)
  To: Paul Menage
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Colin Cross

On Tue, Nov 23, 2010 at 5:43 PM, Colin Cross <ccross-z5hGa2qSFaRBDgjK7y7TUQ@public.gmane.org> wrote:
> This patch is similar to what you described.  The main differences are
> that I used a new atomic to handle the rmdir case, and I converted
> check_for_release to be callable in softirq context rather than schedule
> work in free_css_set_rcu.  Your css_set scanning in rmdir sounds better,
> I'll take another look at that.  Is there any problem with disabling irqs
> with spin_lock_irqsave in check_for_release?

free_css_set_rcu needs to take a write lock on css_set_lock to protect the
list_del(&link->cgrp_link_list).  I'll convert it to schedule work, and change
the spin_lock_irqsave back to spin_lock.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] cgroup: Convert synchronize_rcu to call_rcu in cgroup_attach_task
       [not found]       ` <4CEC7329.7070909-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
  2010-11-24  2:10         ` Colin Cross
@ 2010-11-24 18:58         ` Paul Menage
  1 sibling, 0 replies; 13+ messages in thread
From: Paul Menage @ 2010-11-24 18:58 UTC (permalink / raw)
  To: Li Zefan
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Colin Cross

On Tue, Nov 23, 2010 at 6:06 PM, Li Zefan <lizf-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org> wrote:
>
> If we do this, it's not anymore safe to use get_css_set(), which just
> increments the refcount without checking if it's zero.

I don't believe that it's currently safe to use get_css_set() on a
zero-refcount css_set anyway - a css_set starts with refcount 1 and
can be freed the moment that its refcount reaches 0. You can only use
get_css_set() on a stable reference to an existing css_set, which by
definition has a refcount of at least 1. Were you thinking of any
particular existing uses?

Paul

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] cgroup: Remove call to synchronize_rcu in cgroup_attach_task
       [not found]         ` <1290563018-2804-1-git-send-email-ccross-z5hGa2qSFaRBDgjK7y7TUQ@public.gmane.org>
  2010-11-24  2:29           ` Colin Cross
@ 2011-01-22  1:17           ` Bryan Huntsman
  2011-01-28  1:17           ` Bryan Huntsman
  2 siblings, 0 replies; 13+ messages in thread
From: Bryan Huntsman @ 2011-01-22  1:17 UTC (permalink / raw)
  To: Colin Cross
  Cc: Paul Menage, mbohan-sgV2jX0FEOL9JmXXK+q4OQ,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA

On 11/23/2010 05:43 PM, Colin Cross wrote:
> synchronize_rcu can be very expensive, averaging 100 ms in
> some cases.  In cgroup_attach_task, it is used to prevent
> a task->cgroups pointer dereferenced in an RCU read side
> critical section from being invalidated by delaying the call
> to put_css_set until after an RCU grace period.
> 
> To avoid the call to synchronize_rcu, make the put_css_set
> call rcu-safe by moving the deletion of the css_set links
> into rcu-protected free_css_set_rcu.
> 
> The calls to check_for_release in free_css_set_rcu now occur
> in softirq context, so convert all uses of the
> release_list_lock spinlock to irq safe versions.
> 
> The decrement of the cgroup refcount is no longer
> synchronous with the call to put_css_set, which can result
> in the cgroup refcount staying positive after the last call
> to cgroup_attach_task returns.  To allow the cgroup to be
> deleted with cgroup_rmdir synchronously after
> cgroup_attach_task, introduce a second refcount,
> rmdir_count, that is decremented synchronously in
> put_css_set.  If cgroup_rmdir is called on a cgroup for
> hich rmdir_count is zero but count is nonzero, reuse the
> rmdir waitqueue to block the rmdir until the rcu callback
> is called.
> 
> Signed-off-by: Colin Cross <ccross-z5hGa2qSFaRBDgjK7y7TUQ@public.gmane.org>
> ---
> 
> This patch is similar to what you described.  The main differences are
> that I used a new atomic to handle the rmdir case, and I converted
> check_for_release to be callable in softirq context rather than schedule
> work in free_css_set_rcu.  Your css_set scanning in rmdir sounds better,
> I'll take another look at that.  Is there any problem with disabling irqs
> with spin_lock_irqsave in check_for_release?
> 
>  include/linux/cgroup.h |    6 ++
>  kernel/cgroup.c        |  124 ++++++++++++++++++++++++++++--------------------
>  2 files changed, 78 insertions(+), 52 deletions(-)
> 

Colin, what became of this patch?  I see this in your Tegra tree for
Android.

http://android.git.kernel.org/?p=kernel/tegra.git;a=commit;h=05946a1e0fdb011ac0e6638ee60b181c584f127b

I looked in linux-next but didn't see it there.  This resolves a
performance issue on MSM SMP so I'm curious if this is going upstream.
 Thanks.

- Bryan


-- 
Sent by an employee of the Qualcomm Innovation Center, Inc.
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] cgroup: Remove call to synchronize_rcu in cgroup_attach_task
       [not found]           ` <4D3A3024.9040402-sgV2jX0FEOL9JmXXK+q4OQ@public.gmane.org>
@ 2011-01-22  2:04             ` Colin Cross
  0 siblings, 0 replies; 13+ messages in thread
From: Colin Cross @ 2011-01-22  2:04 UTC (permalink / raw)
  To: Bryan Huntsman
  Cc: Paul Menage, mbohan-sgV2jX0FEOL9JmXXK+q4OQ,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA

On Fri, Jan 21, 2011 at 5:17 PM, Bryan Huntsman <bryanh-sgV2jX0FEOL9JmXXK+q4OQ@public.gmane.org> wrote:
> On 11/23/2010 05:43 PM, Colin Cross wrote:
>> synchronize_rcu can be very expensive, averaging 100 ms in
>> some cases.  In cgroup_attach_task, it is used to prevent
>> a task->cgroups pointer dereferenced in an RCU read side
>> critical section from being invalidated by delaying the call
>> to put_css_set until after an RCU grace period.
>>
>> To avoid the call to synchronize_rcu, make the put_css_set
>> call rcu-safe by moving the deletion of the css_set links
>> into rcu-protected free_css_set_rcu.
>>
>> The calls to check_for_release in free_css_set_rcu now occur
>> in softirq context, so convert all uses of the
>> release_list_lock spinlock to irq safe versions.
>>
>> The decrement of the cgroup refcount is no longer
>> synchronous with the call to put_css_set, which can result
>> in the cgroup refcount staying positive after the last call
>> to cgroup_attach_task returns.  To allow the cgroup to be
>> deleted with cgroup_rmdir synchronously after
>> cgroup_attach_task, introduce a second refcount,
>> rmdir_count, that is decremented synchronously in
>> put_css_set.  If cgroup_rmdir is called on a cgroup for
>> hich rmdir_count is zero but count is nonzero, reuse the
>> rmdir waitqueue to block the rmdir until the rcu callback
>> is called.
>>
>> Signed-off-by: Colin Cross <ccross-z5hGa2qSFaRBDgjK7y7TUQ@public.gmane.org>
>> ---
>>
>> This patch is similar to what you described.  The main differences are
>> that I used a new atomic to handle the rmdir case, and I converted
>> check_for_release to be callable in softirq context rather than schedule
>> work in free_css_set_rcu.  Your css_set scanning in rmdir sounds better,
>> I'll take another look at that.  Is there any problem with disabling irqs
>> with spin_lock_irqsave in check_for_release?
>>
>>  include/linux/cgroup.h |    6 ++
>>  kernel/cgroup.c        |  124 ++++++++++++++++++++++++++++--------------------
>>  2 files changed, 78 insertions(+), 52 deletions(-)
>>
>
> Colin, what became of this patch?  I see this in your Tegra tree for
> Android.
>
> http://android.git.kernel.org/?p=kernel/tegra.git;a=commit;h=05946a1e0fdb011ac0e6638ee60b181c584f127b
>
> I looked in linux-next but didn't see it there.  This resolves a
> performance issue on MSM SMP so I'm curious if this is going upstream.
>  Thanks.
>

It's been posted, there are no outstanding comments I am working on,
but they haven't been picked up.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] cgroup: Remove call to synchronize_rcu in cgroup_attach_task
       [not found]         ` <1290563018-2804-1-git-send-email-ccross-z5hGa2qSFaRBDgjK7y7TUQ@public.gmane.org>
  2010-11-24  2:29           ` Colin Cross
  2011-01-22  1:17           ` Bryan Huntsman
@ 2011-01-28  1:17           ` Bryan Huntsman
  2 siblings, 0 replies; 13+ messages in thread
From: Bryan Huntsman @ 2011-01-28  1:17 UTC (permalink / raw)
  To: Colin Cross
  Cc: Paul Menage, mbohan-sgV2jX0FEOL9JmXXK+q4OQ,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA

On 11/23/2010 05:43 PM, Colin Cross wrote:
> synchronize_rcu can be very expensive, averaging 100 ms in
> some cases.  In cgroup_attach_task, it is used to prevent
> a task->cgroups pointer dereferenced in an RCU read side
> critical section from being invalidated by delaying the call
> to put_css_set until after an RCU grace period.
> 
> To avoid the call to synchronize_rcu, make the put_css_set
> call rcu-safe by moving the deletion of the css_set links
> into rcu-protected free_css_set_rcu.
> 
> The calls to check_for_release in free_css_set_rcu now occur
> in softirq context, so convert all uses of the
> release_list_lock spinlock to irq safe versions.
> 
> The decrement of the cgroup refcount is no longer
> synchronous with the call to put_css_set, which can result
> in the cgroup refcount staying positive after the last call
> to cgroup_attach_task returns.  To allow the cgroup to be
> deleted with cgroup_rmdir synchronously after
> cgroup_attach_task, introduce a second refcount,
> rmdir_count, that is decremented synchronously in
> put_css_set.  If cgroup_rmdir is called on a cgroup for
> hich rmdir_count is zero but count is nonzero, reuse the
> rmdir waitqueue to block the rmdir until the rcu callback
> is called.
> 
> Signed-off-by: Colin Cross <ccross-z5hGa2qSFaRBDgjK7y7TUQ@public.gmane.org>
> ---
> 
> This patch is similar to what you described.  The main differences are
> that I used a new atomic to handle the rmdir case, and I converted
> check_for_release to be callable in softirq context rather than schedule
> work in free_css_set_rcu.  Your css_set scanning in rmdir sounds better,
> I'll take another look at that.  Is there any problem with disabling irqs
> with spin_lock_irqsave in check_for_release?
> 
>  include/linux/cgroup.h |    6 ++
>  kernel/cgroup.c        |  124 ++++++++++++++++++++++++++++--------------------
>  2 files changed, 78 insertions(+), 52 deletions(-)
> 
> diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
> index ed4ba11..3b6e73d 100644
> --- a/include/linux/cgroup.h
> +++ b/include/linux/cgroup.h
> @@ -202,6 +202,12 @@ struct cgroup {
>  	atomic_t count;
>  
>  	/*
> +	 * separate refcount for rmdir on a cgroup.  When rmdir_count is 0,
> +	 * rmdir should block until count is 0.
> +	 */
> +	atomic_t rmdir_count;
> +
> +	/*
>  	 * We link our 'sibling' struct into our parent's 'children'.
>  	 * Our children link their 'sibling' into our 'children'.
>  	 */
> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
> index 66a416b..fa3c0ac 100644
> --- a/kernel/cgroup.c
> +++ b/kernel/cgroup.c
> @@ -267,6 +267,33 @@ static void cgroup_release_agent(struct work_struct *work);
>  static DECLARE_WORK(release_agent_work, cgroup_release_agent);
>  static void check_for_release(struct cgroup *cgrp);
>  
> +/*
> + * A queue for waiters to do rmdir() cgroup. A tasks will sleep when
> + * cgroup->count == 0 && list_empty(&cgroup->children) && subsys has some
> + * reference to css->refcnt. In general, this refcnt is expected to goes down
> + * to zero, soon.
> + *
> + * CGRP_WAIT_ON_RMDIR flag is set under cgroup's inode->i_mutex;
> + */
> +DECLARE_WAIT_QUEUE_HEAD(cgroup_rmdir_waitq);
> +
> +static void cgroup_wakeup_rmdir_waiter(struct cgroup *cgrp)
> +{
> +	if (unlikely(test_and_clear_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags)))
> +		wake_up_all(&cgroup_rmdir_waitq);
> +}
> +
> +void cgroup_exclude_rmdir(struct cgroup_subsys_state *css)
> +{
> +	css_get(css);
> +}
> +
> +void cgroup_release_and_wakeup_rmdir(struct cgroup_subsys_state *css)
> +{
> +	cgroup_wakeup_rmdir_waiter(css->cgroup);
> +	css_put(css);
> +}
> +
>  /* Link structure for associating css_set objects with cgroups */
>  struct cg_cgroup_link {
>  	/*
> @@ -329,6 +356,22 @@ static struct hlist_head *css_set_hash(struct cgroup_subsys_state *css[])
>  static void free_css_set_rcu(struct rcu_head *obj)
>  {
>  	struct css_set *cg = container_of(obj, struct css_set, rcu_head);
> +	struct cg_cgroup_link *link;
> +	struct cg_cgroup_link *saved_link;
> +
> +	/* Nothing else can have a reference to cg, no need for css_set_lock */
> +	list_for_each_entry_safe(link, saved_link, &cg->cg_links,
> +				 cg_link_list) {
> +		struct cgroup *cgrp = link->cgrp;
> +		list_del(&link->cg_link_list);
> +		list_del(&link->cgrp_link_list);
> +		if (atomic_dec_and_test(&cgrp->count)) {
> +			check_for_release(cgrp);
> +			cgroup_wakeup_rmdir_waiter(cgrp);
> +		}
> +		kfree(link);
> +	}
> +
>  	kfree(cg);
>  }
>  
> @@ -355,23 +398,20 @@ static void __put_css_set(struct css_set *cg, int taskexit)
>  		return;
>  	}
>  
> -	/* This css_set is dead. unlink it and release cgroup refcounts */
>  	hlist_del(&cg->hlist);
>  	css_set_count--;
>  
> +	/* This css_set is now unreachable from the css_set_table, but RCU
> +	 * read-side critical sections may still have a reference to it.
> +	 * Decrement the cgroup rmdir_count so that rmdir's on an empty
> +	 * cgroup can block until the free_css_set_rcu callback */
>  	list_for_each_entry_safe(link, saved_link, &cg->cg_links,
>  				 cg_link_list) {
>  		struct cgroup *cgrp = link->cgrp;
> -		list_del(&link->cg_link_list);
> -		list_del(&link->cgrp_link_list);
> -		if (atomic_dec_and_test(&cgrp->count) &&
> -		    notify_on_release(cgrp)) {
> -			if (taskexit)
> -				set_bit(CGRP_RELEASABLE, &cgrp->flags);
> -			check_for_release(cgrp);
> -		}
> -
> -		kfree(link);
> +		if (taskexit)
> +			set_bit(CGRP_RELEASABLE, &cgrp->flags);
> +		atomic_dec(&cgrp->rmdir_count);
> +		smp_mb();
>  	}
>  
>  	write_unlock(&css_set_lock);
> @@ -571,6 +611,8 @@ static void link_css_set(struct list_head *tmp_cg_links,
>  				cgrp_link_list);
>  	link->cg = cg;
>  	link->cgrp = cgrp;
> +	atomic_inc(&cgrp->rmdir_count);
> +	smp_mb(); /* make sure rmdir_count increments first */
>  	atomic_inc(&cgrp->count);
>  	list_move(&link->cgrp_link_list, &cgrp->css_sets);
>  	/*
> @@ -725,9 +767,9 @@ static struct cgroup *task_cgroup_from_root(struct task_struct *task,
>   * cgroup_attach_task(), which overwrites one tasks cgroup pointer with
>   * another.  It does so using cgroup_mutex, however there are
>   * several performance critical places that need to reference
> - * task->cgroup without the expense of grabbing a system global
> + * task->cgroups without the expense of grabbing a system global
>   * mutex.  Therefore except as noted below, when dereferencing or, as
> - * in cgroup_attach_task(), modifying a task'ss cgroup pointer we use
> + * in cgroup_attach_task(), modifying a task's cgroups pointer we use
>   * task_lock(), which acts on a spinlock (task->alloc_lock) already in
>   * the task_struct routinely used for such matters.
>   *
> @@ -909,33 +951,6 @@ static void cgroup_d_remove_dir(struct dentry *dentry)
>  }
>  
>  /*
> - * A queue for waiters to do rmdir() cgroup. A tasks will sleep when
> - * cgroup->count == 0 && list_empty(&cgroup->children) && subsys has some
> - * reference to css->refcnt. In general, this refcnt is expected to goes down
> - * to zero, soon.
> - *
> - * CGRP_WAIT_ON_RMDIR flag is set under cgroup's inode->i_mutex;
> - */
> -DECLARE_WAIT_QUEUE_HEAD(cgroup_rmdir_waitq);
> -
> -static void cgroup_wakeup_rmdir_waiter(struct cgroup *cgrp)
> -{
> -	if (unlikely(test_and_clear_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags)))
> -		wake_up_all(&cgroup_rmdir_waitq);
> -}
> -
> -void cgroup_exclude_rmdir(struct cgroup_subsys_state *css)
> -{
> -	css_get(css);
> -}
> -
> -void cgroup_release_and_wakeup_rmdir(struct cgroup_subsys_state *css)
> -{
> -	cgroup_wakeup_rmdir_waiter(css->cgroup);
> -	css_put(css);
> -}
> -
> -/*
>   * Call with cgroup_mutex held. Drops reference counts on modules, including
>   * any duplicate ones that parse_cgroupfs_options took. If this function
>   * returns an error, no reference counts are touched.
> @@ -1802,7 +1817,7 @@ int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
>  			ss->attach(ss, cgrp, oldcgrp, tsk, false);
>  	}
>  	set_bit(CGRP_RELEASABLE, &oldcgrp->flags);
> -	synchronize_rcu();
> +	/* put_css_set will not destroy cg until after an RCU grace period */
>  	put_css_set(cg);
>  
>  	/*
> @@ -3566,11 +3581,12 @@ static int cgroup_rmdir(struct inode *unused_dir, struct dentry *dentry)
>  	DEFINE_WAIT(wait);
>  	struct cgroup_event *event, *tmp;
>  	int ret;
> +	unsigned long flags;
>  
>  	/* the vfs holds both inode->i_mutex already */
>  again:
>  	mutex_lock(&cgroup_mutex);
> -	if (atomic_read(&cgrp->count) != 0) {
> +	if (atomic_read(&cgrp->rmdir_count) != 0) {
>  		mutex_unlock(&cgroup_mutex);
>  		return -EBUSY;
>  	}
> @@ -3603,13 +3619,13 @@ again:
>  
>  	mutex_lock(&cgroup_mutex);
>  	parent = cgrp->parent;
> -	if (atomic_read(&cgrp->count) || !list_empty(&cgrp->children)) {
> +	if (atomic_read(&cgrp->rmdir_count) || !list_empty(&cgrp->children)) {
>  		clear_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags);
>  		mutex_unlock(&cgroup_mutex);
>  		return -EBUSY;
>  	}
>  	prepare_to_wait(&cgroup_rmdir_waitq, &wait, TASK_INTERRUPTIBLE);
> -	if (!cgroup_clear_css_refs(cgrp)) {
> +	if (atomic_read(&cgrp->count) != 0 || !cgroup_clear_css_refs(cgrp)) {
>  		mutex_unlock(&cgroup_mutex);
>  		/*
>  		 * Because someone may call cgroup_wakeup_rmdir_waiter() before
> @@ -3627,11 +3643,11 @@ again:
>  	finish_wait(&cgroup_rmdir_waitq, &wait);
>  	clear_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags);
>  
> -	spin_lock(&release_list_lock);
> +	spin_lock_irqsave(&release_list_lock, flags);
>  	set_bit(CGRP_REMOVED, &cgrp->flags);
>  	if (!list_empty(&cgrp->release_list))
>  		list_del(&cgrp->release_list);
> -	spin_unlock(&release_list_lock);
> +	spin_unlock_irqrestore(&release_list_lock, flags);
>  
>  	cgroup_lock_hierarchy(cgrp->root);
>  	/* delete this cgroup from parent->children */
> @@ -4389,6 +4405,8 @@ int cgroup_is_descendant(const struct cgroup *cgrp, struct task_struct *task)
>  
>  static void check_for_release(struct cgroup *cgrp)
>  {
> +	unsigned long flags;
> +
>  	/* All of these checks rely on RCU to keep the cgroup
>  	 * structure alive */
>  	if (cgroup_is_releasable(cgrp) && !atomic_read(&cgrp->count)
> @@ -4397,13 +4415,13 @@ static void check_for_release(struct cgroup *cgrp)
>  		 * already queued for a userspace notification, queue
>  		 * it now */
>  		int need_schedule_work = 0;
> -		spin_lock(&release_list_lock);
> +		spin_lock_irqsave(&release_list_lock, flags);
>  		if (!cgroup_is_removed(cgrp) &&
>  		    list_empty(&cgrp->release_list)) {
>  			list_add(&cgrp->release_list, &release_list);
>  			need_schedule_work = 1;
>  		}
> -		spin_unlock(&release_list_lock);
> +		spin_unlock_irqrestore(&release_list_lock, flags);
>  		if (need_schedule_work)
>  			schedule_work(&release_agent_work);
>  	}
> @@ -4453,9 +4471,11 @@ EXPORT_SYMBOL_GPL(__css_put);
>   */
>  static void cgroup_release_agent(struct work_struct *work)
>  {
> +	unsigned long flags;
> +
>  	BUG_ON(work != &release_agent_work);
>  	mutex_lock(&cgroup_mutex);
> -	spin_lock(&release_list_lock);
> +	spin_lock_irqsave(&release_list_lock, flags);
>  	while (!list_empty(&release_list)) {
>  		char *argv[3], *envp[3];
>  		int i;
> @@ -4464,7 +4484,7 @@ static void cgroup_release_agent(struct work_struct *work)
>  						    struct cgroup,
>  						    release_list);
>  		list_del_init(&cgrp->release_list);
> -		spin_unlock(&release_list_lock);
> +		spin_unlock_irqrestore(&release_list_lock, flags);
>  		pathbuf = kmalloc(PAGE_SIZE, GFP_KERNEL);
>  		if (!pathbuf)
>  			goto continue_free;
> @@ -4494,9 +4514,9 @@ static void cgroup_release_agent(struct work_struct *work)
>   continue_free:
>  		kfree(pathbuf);
>  		kfree(agentbuf);
> -		spin_lock(&release_list_lock);
> +		spin_lock_irqsave(&release_list_lock, flags);
>  	}
> -	spin_unlock(&release_list_lock);
> +	spin_unlock_irqrestore(&release_list_lock, flags);
>  	mutex_unlock(&cgroup_mutex);
>  }
>  

Tested-by: Mike Bohan <mbohan-sgV2jX0FEOL9JmXXK+q4OQ@public.gmane.org>

I'm responding on Mike's behalf and adding him to this thread.  This
patch improves launch time of a test app from ~700ms to ~250ms on MSM,
with much lower variance across tests.  We also see UI latency
improvements, but have not quantified the gains.

- Bryan

-- 
Sent by an employee of the Qualcomm Innovation Center, Inc.
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum.

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2011-01-28  1:17 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <AANLkTikx6d0_VFtZ4zWQucRCf=vFt7N2M6=0jpnKasEE@mail.gmail.com>
     [not found] ` <AANLkTikx6d0_VFtZ4zWQucRCf=vFt7N2M6=0jpnKasEE-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2010-11-22  4:06   ` [PATCH] cgroup: Convert synchronize_rcu to call_rcu in cgroup_attach_task Colin Cross
     [not found] ` <1290398767-15230-1-git-send-email-ccross@android.com>
     [not found]   ` <1290398767-15230-1-git-send-email-ccross-z5hGa2qSFaRBDgjK7y7TUQ@public.gmane.org>
2010-11-23  8:14     ` Li Zefan
2010-11-24  1:24     ` Paul Menage
     [not found]   ` <4CEB77E0.10202@cn.fujitsu.com>
     [not found]     ` <4CEB77E0.10202-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
2010-11-23  8:58       ` Colin Cross
     [not found]     ` <AANLkTimjpW6NZ6fEiVi0VzjkpQGVob4=VHsohXUiDQkJ@mail.gmail.com>
     [not found]       ` <AANLkTimjpW6NZ6fEiVi0VzjkpQGVob4=VHsohXUiDQkJ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2010-11-23 20:22         ` Colin Cross
     [not found]   ` <AANLkTi=4-OgPUugnUBaqSU3oC=3wxTjAsOB_Ais3Or+i@mail.gmail.com>
     [not found]     ` <AANLkTi=4-OgPUugnUBaqSU3oC=3wxTjAsOB_Ais3Or+i-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2010-11-24  1:43       ` [PATCH] cgroup: Remove call to synchronize_rcu " Colin Cross
     [not found]         ` <4D3A3024.9040402@codeaurora.org>
     [not found]           ` <4D3A3024.9040402-sgV2jX0FEOL9JmXXK+q4OQ@public.gmane.org>
2011-01-22  2:04             ` Colin Cross
     [not found]         ` <1290563018-2804-1-git-send-email-ccross-z5hGa2qSFaRBDgjK7y7TUQ@public.gmane.org>
2010-11-24  2:29           ` Colin Cross
2011-01-22  1:17           ` Bryan Huntsman
2011-01-28  1:17           ` Bryan Huntsman
2010-11-24  2:06       ` [PATCH] cgroup: Convert synchronize_rcu to call_rcu " Li Zefan
     [not found]     ` <4CEC7329.7070909@cn.fujitsu.com>
     [not found]       ` <4CEC7329.7070909-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
2010-11-24  2:10         ` Colin Cross
2010-11-24 18:58         ` Paul Menage

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox