cgroups.vger.kernel.org archive mirror
* [PATCHSET cgroup-for-3.14] cgroup: restructure pidlist handling
@ 2013-11-24 22:11 Tejun Heo
       [not found] ` <1385331096-7918-1-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
  0 siblings, 1 reply; 16+ messages in thread
From: Tejun Heo @ 2013-11-24 22:11 UTC (permalink / raw)
  To: lizefan-hv44wF8Li93QT0dZR+AlfA
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	cgroups-u79uwXL29TY76Z2rM5mHXA

Hello,

pidlist handling is quite elaborate.  Because the pidlist files -
"tasks" and "cgroup.procs" - guarantee that the result is sorted, and
a task can be associated with different pids, with no inherent order
among them, depending on namespace, it is impossible to assign a fixed
order to the tasks of a cgroup and then just iterate through them.

Instead, we end up creating tables of the relevant ids and then sort
them before serving them out for reads.  As those tables can be huge,
we also implement logic to share those tables if the id type and
namespace match, which in turn involves reference counting those
tables and synchronizing accesses to them.

What could have been a simple iteration through the member tasks
became this unnecessary hunk of complexity because it, for some
reason, wanted to guarantee sorted output, which is extremely unusual
for this type of interface.

The refcounting is done from the open() and release() callbacks, which
kernfs doesn't expose.  This patchset updates pidlist handling so that
pidlists are managed from the seq_file operations proper.  As the
duration between the paired start and stop denotes a single read
invocation, and we don't want to reload the pidlist for each instance
of consecutive read calls, the pidlist is released with a time delay.
This also bounds how stale the output of read calls can be.  This
makes refcounting unnecessary - locking is simplified and refcounting
is dropped.

In the long term, we want to do away with pidlist and make this a
simple iteration over member tasks.  The last patch scrambles the sort
order of "cgroup.procs" if sane_behavior, so that the sorted
expectation is broken in the new interface and we can eventually drop
pidlist logic.

This patchset contains the following nine patches.

 0001-cgroup-don-t-skip-seq_open-on-write-only-opens-on-pi.patch
 0002-cgroup-remove-cftype-release.patch
 0003-cgroup-implement-delayed-destruction-for-cgroup_pidl.patch
 0004-cgroup-introduce-struct-cgroup_pidlist_open_file.patch
 0005-cgroup-refactor-cgroup_pidlist_find.patch
 0006-cgroup-remove-cgroup_pidlist-rwsem.patch
 0007-cgroup-load-and-release-pidlists-from-seq_file-start.patch
 0008-cgroup-remove-cgroup_pidlist-use_count.patch
 0009-cgroup-don-t-guarantee-cgroup.procs-is-sorted-if-san.patch

0001-0002 are prep patches.

0003-0008 restructure pidlist handling so that it's managed from
seq_file operations.

0009 scrambles the sort order of cgroup.procs if sane_behavior.

This patchset is on top of cgroup/for-3.14 edab95103d3a ("cgroup:
Merge branch 'memcg_event' into for-3.14") and available in the
following git branch.

 git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git review-pidlist

diffstat follows.

 include/linux/cgroup.h |    5 
 kernel/cgroup.c        |  310 +++++++++++++++++++++++++++++++------------------
 2 files changed, 204 insertions(+), 111 deletions(-)

Thanks.

--
tejun

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH 1/9] cgroup: don't skip seq_open on write only opens on pidlist files
       [not found] ` <1385331096-7918-1-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
@ 2013-11-24 22:11   ` Tejun Heo
  2013-11-24 22:11   ` [PATCH 2/9] cgroup: remove cftype->release() Tejun Heo
                     ` (9 subsequent siblings)
  10 siblings, 0 replies; 16+ messages in thread
From: Tejun Heo @ 2013-11-24 22:11 UTC (permalink / raw)
  To: lizefan-hv44wF8Li93QT0dZR+AlfA
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	cgroups-u79uwXL29TY76Z2rM5mHXA, Tejun Heo

Currently, cgroup_pidlist_open() skips seq_open() and pidlist loading
if the file is opened write-only, which is a sensible optimization as
pidlist loading can be costly and there often are occasions where
tasks or cgroup.procs is opened write-only.  However, pidlist init and
release are planned to be moved to cgroup_pidlist_start/stop()
respectively which would make this optimization unnecessary.

This patch removes the optimization and always fully initializes
pidlist files regardless of open mode.  This will help move pidlist
handling to start/stop by unifying the rw paths, and removes the need
for specifying cftype->release() in addition to .release in
cgroup_pidlist_operations, as file->f_op is now always overridden.  As
pidlist files were the only user of cftype->release(), the next patch
will remove the method.

Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
---
 kernel/cgroup.c | 13 +------------
 1 file changed, 1 insertion(+), 12 deletions(-)

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index be42967..61a5e2e 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -3778,12 +3778,7 @@ static void cgroup_release_pid_array(struct cgroup_pidlist *l)
 static int cgroup_pidlist_release(struct inode *inode, struct file *file)
 {
 	struct cgroup_pidlist *l;
-	if (!(file->f_mode & FMODE_READ))
-		return 0;
-	/*
-	 * the seq_file will only be initialized if the file was opened for
-	 * reading; hence we check if it's not null only in that case.
-	 */
+
 	l = ((struct seq_file *)file->private_data)->private;
 	cgroup_release_pid_array(l);
 	return seq_release(inode, file);
@@ -3808,10 +3803,6 @@ static int cgroup_pidlist_open(struct file *file, enum cgroup_filetype type)
 	struct cgroup_pidlist *l;
 	int retval;
 
-	/* Nothing to do for write-only files */
-	if (!(file->f_mode & FMODE_READ))
-		return 0;
-
 	/* have the array populated */
 	retval = pidlist_array_load(cgrp, type, &l);
 	if (retval)
@@ -3891,7 +3882,6 @@ static struct cftype cgroup_base_files[] = {
 		.name = "cgroup.procs",
 		.open = cgroup_procs_open,
 		.write_u64 = cgroup_procs_write,
-		.release = cgroup_pidlist_release,
 		.mode = S_IRUGO | S_IWUSR,
 	},
 	{
@@ -3916,7 +3906,6 @@ static struct cftype cgroup_base_files[] = {
 		.flags = CFTYPE_INSANE,		/* use "procs" instead */
 		.open = cgroup_tasks_open,
 		.write_u64 = cgroup_tasks_write,
-		.release = cgroup_pidlist_release,
 		.mode = S_IRUGO | S_IWUSR,
 	},
 	{
-- 
1.8.4.2


* [PATCH 2/9] cgroup: remove cftype->release()
       [not found] ` <1385331096-7918-1-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
  2013-11-24 22:11   ` [PATCH 1/9] cgroup: don't skip seq_open on write only opens on pidlist files Tejun Heo
@ 2013-11-24 22:11   ` Tejun Heo
  2013-11-24 22:11   ` [PATCH 3/9] cgroup: implement delayed destruction for cgroup_pidlist Tejun Heo
                     ` (8 subsequent siblings)
  10 siblings, 0 replies; 16+ messages in thread
From: Tejun Heo @ 2013-11-24 22:11 UTC (permalink / raw)
  To: lizefan-hv44wF8Li93QT0dZR+AlfA
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	cgroups-u79uwXL29TY76Z2rM5mHXA, Tejun Heo

Now that pidlist files don't use cftype->release(), it doesn't have
any user left.  Remove it.

Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
---
 include/linux/cgroup.h | 2 --
 kernel/cgroup.c        | 3 ---
 2 files changed, 5 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 492fa01..5207c28 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -499,8 +499,6 @@ struct cftype {
 	 * kick type for multiplexing.
 	 */
 	int (*trigger)(struct cgroup_subsys_state *css, unsigned int event);
-
-	int (*release)(struct inode *inode, struct file *file);
 };
 
 /*
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 61a5e2e..2fd888f 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -2447,12 +2447,9 @@ static int cgroup_file_open(struct inode *inode, struct file *file)
 static int cgroup_file_release(struct inode *inode, struct file *file)
 {
 	struct cfent *cfe = __d_cfe(file->f_dentry);
-	struct cftype *cft = __d_cft(file->f_dentry);
 	struct cgroup_subsys_state *css = cfe->css;
 	int ret = 0;
 
-	if (cft->release)
-		ret = cft->release(inode, file);
 	if (css->ss)
 		css_put(css);
 	return ret;
-- 
1.8.4.2


* [PATCH 3/9] cgroup: implement delayed destruction for cgroup_pidlist
       [not found] ` <1385331096-7918-1-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
  2013-11-24 22:11   ` [PATCH 1/9] cgroup: don't skip seq_open on write only opens on pidlist files Tejun Heo
  2013-11-24 22:11   ` [PATCH 2/9] cgroup: remove cftype->release() Tejun Heo
@ 2013-11-24 22:11   ` Tejun Heo
  2013-11-24 22:11   ` [PATCH 4/9] cgroup: introduce struct cgroup_pidlist_open_file Tejun Heo
                     ` (7 subsequent siblings)
  10 siblings, 0 replies; 16+ messages in thread
From: Tejun Heo @ 2013-11-24 22:11 UTC (permalink / raw)
  To: lizefan-hv44wF8Li93QT0dZR+AlfA
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	cgroups-u79uwXL29TY76Z2rM5mHXA, Tejun Heo

Currently, pidlists are reference counted from file open and release
methods.  This means that holding onto an open file may waste memory
and reads may return data which is very stale.  Both aren't critical
because pidlists are keyed and shared per namespace and, well, the
user isn't supposed to have a large delay between open and reads.

cgroup is planned to be converted to use kernfs and it'd be best if we
can stick to just the seq_file operations - start, next, stop and
show.  This can be achieved by loading pidlist on demand from start
and release with time delay from stop, so that consecutive reads don't
end up reloading the pidlist on each iteration.  This would remove the
need for hooking into open and release while also avoiding issues with
holding onto pidlist for too long.

This patch implements delayed release of pidlist.  As pidlists could
be lingering on cgroup removal waiting for the timer to expire, cgroup
free path needs to queue the destruction work item immediately and
flush.  As those work items are self-destroying, each work item can't
be flushed directly.  A new workqueue - cgroup_pidlist_destroy_wq - is
added to serve as flush domain.

Note that this patch just adds delayed release on top of the current
implementation and doesn't change where pidlist is loaded and
released.  Following patches will make those changes.

Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
---
 kernel/cgroup.c | 99 +++++++++++++++++++++++++++++++++++++++++++--------------
 1 file changed, 76 insertions(+), 23 deletions(-)

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 2fd888f..2e21490 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -62,6 +62,14 @@
 #include <linux/atomic.h>
 
 /*
+ * pidlists linger the following amount before being destroyed.  The goal
+ * is avoiding frequent destruction in the middle of consecutive read calls.
+ * Expiring in the middle is a performance problem, not a correctness one.
+ * 1 sec should be enough.
+ */
+#define CGROUP_PIDLIST_DESTROY_DELAY	HZ
+
+/*
  * cgroup_mutex is the master lock.  Any modification to cgroup or its
  * hierarchy must be performed while holding it.
  *
@@ -95,6 +103,12 @@ static DEFINE_MUTEX(cgroup_root_mutex);
 static struct workqueue_struct *cgroup_destroy_wq;
 
 /*
+ * pidlist destructions need to be flushed on cgroup destruction.  Use a
+ * separate workqueue as flush domain.
+ */
+static struct workqueue_struct *cgroup_pidlist_destroy_wq;
+
+/*
  * Generate an array of cgroup subsystem pointers. At boot time, this is
  * populated with the built in subsystems, and modular subsystems are
  * registered after that. The mutable section of this array is protected by
@@ -166,6 +180,7 @@ static void cgroup_destroy_css_killed(struct cgroup *cgrp);
 static int cgroup_destroy_locked(struct cgroup *cgrp);
 static int cgroup_addrm_files(struct cgroup *cgrp, struct cftype cfts[],
 			      bool is_add);
+static void cgroup_pidlist_destroy_all(struct cgroup *cgrp);
 
 /**
  * cgroup_css - obtain a cgroup's css for the specified subsystem
@@ -829,11 +844,7 @@ static void cgroup_free_fn(struct work_struct *work)
 	 */
 	deactivate_super(cgrp->root->sb);
 
-	/*
-	 * if we're getting rid of the cgroup, refcount should ensure
-	 * that there are no pidlists left.
-	 */
-	BUG_ON(!list_empty(&cgrp->pidlists));
+	cgroup_pidlist_destroy_all(cgrp);
 
 	simple_xattrs_free(&cgrp->xattrs);
 
@@ -3451,6 +3462,8 @@ struct cgroup_pidlist {
 	struct cgroup *owner;
 	/* protects the other fields */
 	struct rw_semaphore rwsem;
+	/* for delayed destruction */
+	struct delayed_work destroy_dwork;
 };
 
 /*
@@ -3466,6 +3479,7 @@ static void *pidlist_allocate(int count)
 	else
 		return kmalloc(count * sizeof(pid_t), GFP_KERNEL);
 }
+
 static void pidlist_free(void *p)
 {
 	if (is_vmalloc_addr(p))
@@ -3475,6 +3489,49 @@ static void pidlist_free(void *p)
 }
 
 /*
+ * Used to destroy all pidlists still lingering on the destroy timer.  None
+ * should be left afterwards.
+ */
+static void cgroup_pidlist_destroy_all(struct cgroup *cgrp)
+{
+	struct cgroup_pidlist *l, *tmp_l;
+
+	mutex_lock(&cgrp->pidlist_mutex);
+	list_for_each_entry_safe(l, tmp_l, &cgrp->pidlists, links)
+		mod_delayed_work(cgroup_pidlist_destroy_wq, &l->destroy_dwork, 0);
+	mutex_unlock(&cgrp->pidlist_mutex);
+
+	flush_workqueue(cgroup_pidlist_destroy_wq);
+	BUG_ON(!list_empty(&cgrp->pidlists));
+}
+
+static void cgroup_pidlist_destroy_work_fn(struct work_struct *work)
+{
+	struct delayed_work *dwork = to_delayed_work(work);
+	struct cgroup_pidlist *l = container_of(dwork, struct cgroup_pidlist,
+						destroy_dwork);
+	struct cgroup_pidlist *tofree = NULL;
+
+	mutex_lock(&l->owner->pidlist_mutex);
+	down_write(&l->rwsem);
+
+	/*
+	 * Destroy iff we didn't race with a new user or get queued again.
+	 * Queued state won't change as it can only be queued while locked.
+	 */
+	if (!l->use_count && !delayed_work_pending(dwork)) {
+		list_del(&l->links);
+		pidlist_free(l->list);
+		put_pid_ns(l->key.ns);
+		tofree = l;
+	}
+
+	up_write(&l->rwsem);
+	mutex_unlock(&l->owner->pidlist_mutex);
+	kfree(tofree);
+}
+
+/*
  * pidlist_uniq - given a kmalloc()ed list, strip out all duplicate entries
  * Returns the number of unique elements.
  */
@@ -3544,6 +3601,7 @@ static struct cgroup_pidlist *cgroup_pidlist_find(struct cgroup *cgrp,
 		return l;
 	}
 	init_rwsem(&l->rwsem);
+	INIT_DELAYED_WORK(&l->destroy_dwork, cgroup_pidlist_destroy_work_fn);
 	down_write(&l->rwsem);
 	l->key.type = type;
 	l->key.ns = get_pid_ns(ns);
@@ -3749,26 +3807,12 @@ static const struct seq_operations cgroup_pidlist_seq_operations = {
 
 static void cgroup_release_pid_array(struct cgroup_pidlist *l)
 {
-	/*
-	 * the case where we're the last user of this particular pidlist will
-	 * have us remove it from the cgroup's list, which entails taking the
-	 * mutex. since in pidlist_find the pidlist->lock depends on cgroup->
-	 * pidlist_mutex, we have to take pidlist_mutex first.
-	 */
-	mutex_lock(&l->owner->pidlist_mutex);
 	down_write(&l->rwsem);
 	BUG_ON(!l->use_count);
-	if (!--l->use_count) {
-		/* we're the last user if refcount is 0; remove and free */
-		list_del(&l->links);
-		mutex_unlock(&l->owner->pidlist_mutex);
-		pidlist_free(l->list);
-		put_pid_ns(l->key.ns);
-		up_write(&l->rwsem);
-		kfree(l);
-		return;
-	}
-	mutex_unlock(&l->owner->pidlist_mutex);
+	/* if the last user, arm the destroy work */
+	if (!--l->use_count)
+		mod_delayed_work(cgroup_pidlist_destroy_wq, &l->destroy_dwork,
+				 CGROUP_PIDLIST_DESTROY_DELAY);
 	up_write(&l->rwsem);
 }
 
@@ -4810,6 +4854,15 @@ static int __init cgroup_wq_init(void)
 	 */
 	cgroup_destroy_wq = alloc_workqueue("cgroup_destroy", 0, 1);
 	BUG_ON(!cgroup_destroy_wq);
+
+	/*
+	 * Used to destroy pidlists and separate to serve as flush domain.
+	 * Cap @max_active to 1 too.
+	 */
+	cgroup_pidlist_destroy_wq = alloc_workqueue("cgroup_pidlist_destroy",
+						    0, 1);
+	BUG_ON(!cgroup_pidlist_destroy_wq);
+
 	return 0;
 }
 core_initcall(cgroup_wq_init);
-- 
1.8.4.2


* [PATCH 4/9] cgroup: introduce struct cgroup_pidlist_open_file
       [not found] ` <1385331096-7918-1-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
                     ` (2 preceding siblings ...)
  2013-11-24 22:11   ` [PATCH 3/9] cgroup: implement delayed destruction for cgroup_pidlist Tejun Heo
@ 2013-11-24 22:11   ` Tejun Heo
       [not found]     ` <1385331096-7918-5-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
  2013-11-24 22:11   ` [PATCH 5/9] cgroup: refactor cgroup_pidlist_find() Tejun Heo
                     ` (6 subsequent siblings)
  10 siblings, 1 reply; 16+ messages in thread
From: Tejun Heo @ 2013-11-24 22:11 UTC (permalink / raw)
  To: lizefan-hv44wF8Li93QT0dZR+AlfA
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	cgroups-u79uwXL29TY76Z2rM5mHXA, Tejun Heo

For pidlist files, seq_file->private pointed to the loaded
cgroup_pidlist; however, pidlist loading is planned to be moved to
cgroup_pidlist_start() for kernfs conversion and seq_file->private
needs to carry more information from open to allow that.

This patch introduces struct cgroup_pidlist_open_file which contains
type, cgrp and pidlist and updates pidlist seq_file->private to point
to it using seq_open_private() and seq_release_private().  Note that
this eventually will be replaced by kernfs_open_file.

While this patch makes more information available to seq_file
operations, they don't use it yet and this patch doesn't introduce any
behavior changes except for allocation of the extra private struct.

Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
---
 kernel/cgroup.c | 36 ++++++++++++++++++++++++++----------
 1 file changed, 26 insertions(+), 10 deletions(-)

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 2e21490..0e902f4 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -3466,6 +3466,13 @@ struct cgroup_pidlist {
 	struct delayed_work destroy_dwork;
 };
 
+/* seq_file->private points to the following */
+struct cgroup_pidlist_open_file {
+	enum cgroup_filetype		type;
+	struct cgroup			*cgrp;
+	struct cgroup_pidlist		*pidlist;
+};
+
 /*
  * The following two functions "fix" the issue where there are more pids
  * than kmalloc will give memory for; in such cases, we use vmalloc/vfree.
@@ -3737,7 +3744,8 @@ static void *cgroup_pidlist_start(struct seq_file *s, loff_t *pos)
 	 * after a seek to the start). Use a binary-search to find the
 	 * next pid to display, if any
 	 */
-	struct cgroup_pidlist *l = s->private;
+	struct cgroup_pidlist_open_file *of = s->private;
+	struct cgroup_pidlist *l = of->pidlist;
 	int index = 0, pid = *pos;
 	int *iter;
 
@@ -3767,13 +3775,15 @@ static void *cgroup_pidlist_start(struct seq_file *s, loff_t *pos)
 
 static void cgroup_pidlist_stop(struct seq_file *s, void *v)
 {
-	struct cgroup_pidlist *l = s->private;
-	up_read(&l->rwsem);
+	struct cgroup_pidlist_open_file *of = s->private;
+
+	up_read(&of->pidlist->rwsem);
 }
 
 static void *cgroup_pidlist_next(struct seq_file *s, void *v, loff_t *pos)
 {
-	struct cgroup_pidlist *l = s->private;
+	struct cgroup_pidlist_open_file *of = s->private;
+	struct cgroup_pidlist *l = of->pidlist;
 	pid_t *p = v;
 	pid_t *end = l->list + l->length;
 	/*
@@ -3818,11 +3828,11 @@ static void cgroup_release_pid_array(struct cgroup_pidlist *l)
 
 static int cgroup_pidlist_release(struct inode *inode, struct file *file)
 {
-	struct cgroup_pidlist *l;
+	struct cgroup_pidlist_open_file *of;
 
-	l = ((struct seq_file *)file->private_data)->private;
-	cgroup_release_pid_array(l);
-	return seq_release(inode, file);
+	of = ((struct seq_file *)file->private_data)->private;
+	cgroup_release_pid_array(of->pidlist);
+	return seq_release_private(inode, file);
 }
 
 static const struct file_operations cgroup_pidlist_operations = {
@@ -3841,6 +3851,7 @@ static const struct file_operations cgroup_pidlist_operations = {
 static int cgroup_pidlist_open(struct file *file, enum cgroup_filetype type)
 {
 	struct cgroup *cgrp = __d_cgrp(file->f_dentry->d_parent);
+	struct cgroup_pidlist_open_file *of;
 	struct cgroup_pidlist *l;
 	int retval;
 
@@ -3851,12 +3862,17 @@ static int cgroup_pidlist_open(struct file *file, enum cgroup_filetype type)
 	/* configure file information */
 	file->f_op = &cgroup_pidlist_operations;
 
-	retval = seq_open(file, &cgroup_pidlist_seq_operations);
+	retval = seq_open_private(file, &cgroup_pidlist_seq_operations,
+				  sizeof(*of));
 	if (retval) {
 		cgroup_release_pid_array(l);
 		return retval;
 	}
-	((struct seq_file *)file->private_data)->private = l;
+
+	of = ((struct seq_file *)file->private_data)->private;
+	of->type = type;
+	of->cgrp = cgrp;
+	of->pidlist = l;
 	return 0;
 }
 static int cgroup_tasks_open(struct inode *unused, struct file *file)
-- 
1.8.4.2


* [PATCH 5/9] cgroup: refactor cgroup_pidlist_find()
       [not found] ` <1385331096-7918-1-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
                     ` (3 preceding siblings ...)
  2013-11-24 22:11   ` [PATCH 4/9] cgroup: introduce struct cgroup_pidlist_open_file Tejun Heo
@ 2013-11-24 22:11   ` Tejun Heo
  2013-11-24 22:11   ` [PATCH 6/9] cgroup: remove cgroup_pidlist->rwsem Tejun Heo
                     ` (5 subsequent siblings)
  10 siblings, 0 replies; 16+ messages in thread
From: Tejun Heo @ 2013-11-24 22:11 UTC (permalink / raw)
  To: lizefan-hv44wF8Li93QT0dZR+AlfA
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	cgroups-u79uwXL29TY76Z2rM5mHXA, Tejun Heo

Rename cgroup_pidlist_find() to cgroup_pidlist_find_create() and
separate out finding proper to cgroup_pidlist_find().  Also, move
locking to the caller.

This patch is preparation for pidlist restructure and doesn't
introduce any behavior changes.

Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
---
 kernel/cgroup.c | 65 +++++++++++++++++++++++++++++++++------------------------
 1 file changed, 38 insertions(+), 27 deletions(-)

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 0e902f4..33b6c4d 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -3573,48 +3573,50 @@ static int cmppid(const void *a, const void *b)
 	return *(pid_t *)a - *(pid_t *)b;
 }
 
+static struct cgroup_pidlist *cgroup_pidlist_find(struct cgroup *cgrp,
+						  enum cgroup_filetype type)
+{
+	struct cgroup_pidlist *l;
+	/* don't need task_nsproxy() if we're looking at ourself */
+	struct pid_namespace *ns = task_active_pid_ns(current);
+
+	lockdep_assert_held(&cgrp->pidlist_mutex);
+
+	list_for_each_entry(l, &cgrp->pidlists, links)
+		if (l->key.type == type && l->key.ns == ns)
+			return l;
+	return NULL;
+}
+
 /*
  * find the appropriate pidlist for our purpose (given procs vs tasks)
  * returns with the lock on that pidlist already held, and takes care
  * of the use count, or returns NULL with no locks held if we're out of
  * memory.
  */
-static struct cgroup_pidlist *cgroup_pidlist_find(struct cgroup *cgrp,
-						  enum cgroup_filetype type)
+static struct cgroup_pidlist *cgroup_pidlist_find_create(struct cgroup *cgrp,
+						enum cgroup_filetype type)
 {
 	struct cgroup_pidlist *l;
-	/* don't need task_nsproxy() if we're looking at ourself */
-	struct pid_namespace *ns = task_active_pid_ns(current);
 
-	/*
-	 * We can't drop the pidlist_mutex before taking the l->rwsem in case
-	 * the last ref-holder is trying to remove l from the list at the same
-	 * time. Holding the pidlist_mutex precludes somebody taking whichever
-	 * list we find out from under us - compare release_pid_array().
-	 */
-	mutex_lock(&cgrp->pidlist_mutex);
-	list_for_each_entry(l, &cgrp->pidlists, links) {
-		if (l->key.type == type && l->key.ns == ns) {
-			/* make sure l doesn't vanish out from under us */
-			down_write(&l->rwsem);
-			mutex_unlock(&cgrp->pidlist_mutex);
-			return l;
-		}
-	}
+	lockdep_assert_held(&cgrp->pidlist_mutex);
+
+	l = cgroup_pidlist_find(cgrp, type);
+	if (l)
+		return l;
+
 	/* entry not found; create a new one */
 	l = kzalloc(sizeof(struct cgroup_pidlist), GFP_KERNEL);
-	if (!l) {
-		mutex_unlock(&cgrp->pidlist_mutex);
+	if (!l)
 		return l;
-	}
+
 	init_rwsem(&l->rwsem);
 	INIT_DELAYED_WORK(&l->destroy_dwork, cgroup_pidlist_destroy_work_fn);
-	down_write(&l->rwsem);
 	l->key.type = type;
-	l->key.ns = get_pid_ns(ns);
+	/* don't need task_nsproxy() if we're looking at ourself */
+	l->key.ns = get_pid_ns(task_active_pid_ns(current));
 	l->owner = cgrp;
 	list_add(&l->links, &cgrp->pidlists);
-	mutex_unlock(&cgrp->pidlist_mutex);
 	return l;
 }
 
@@ -3660,17 +3662,26 @@ static int pidlist_array_load(struct cgroup *cgrp, enum cgroup_filetype type,
 	sort(array, length, sizeof(pid_t), cmppid, NULL);
 	if (type == CGROUP_FILE_PROCS)
 		length = pidlist_uniq(array, length);
-	l = cgroup_pidlist_find(cgrp, type);
+
+	mutex_lock(&cgrp->pidlist_mutex);
+
+	l = cgroup_pidlist_find_create(cgrp, type);
 	if (!l) {
+		mutex_unlock(&cgrp->pidlist_mutex);
 		pidlist_free(array);
 		return -ENOMEM;
 	}
-	/* store array, freeing old if necessary - lock already held */
+
+	/* store array, freeing old if necessary */
+	down_write(&l->rwsem);
 	pidlist_free(l->list);
 	l->list = array;
 	l->length = length;
 	l->use_count++;
 	up_write(&l->rwsem);
+
+	mutex_unlock(&cgrp->pidlist_mutex);
+
 	*lp = l;
 	return 0;
 }
-- 
1.8.4.2


* [PATCH 6/9] cgroup: remove cgroup_pidlist->rwsem
       [not found] ` <1385331096-7918-1-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
                     ` (4 preceding siblings ...)
  2013-11-24 22:11   ` [PATCH 5/9] cgroup: refactor cgroup_pidlist_find() Tejun Heo
@ 2013-11-24 22:11   ` Tejun Heo
  2013-11-24 22:11   ` [PATCH 7/9] cgroup: load and release pidlists from seq_file start and stop respectively Tejun Heo
                     ` (4 subsequent siblings)
  10 siblings, 0 replies; 16+ messages in thread
From: Tejun Heo @ 2013-11-24 22:11 UTC (permalink / raw)
  To: lizefan-hv44wF8Li93QT0dZR+AlfA
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	cgroups-u79uwXL29TY76Z2rM5mHXA, Tejun Heo

cgroup_pidlist locking is needlessly complicated.  It has outer
cgroup->pidlist_mutex to protect the list of pidlists associated with
a cgroup and then each pidlist has rwsem to synchronize updates and
reads.  Given that the only read access is from seq_file operations
which are always invoked back-to-back, the rwsem is giant overkill.
All it does is add unnecessary complexity.

This patch removes cgroup_pidlist->rwsem and protects all accesses to
pidlists belonging to a cgroup with cgroup->pidlist_mutex.
pidlist->rwsem locking is removed if it's nested inside
cgroup->pidlist_mutex; otherwise, it's replaced with
cgroup->pidlist_mutex locking.

Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
---
 kernel/cgroup.c | 15 ++++-----------
 1 file changed, 4 insertions(+), 11 deletions(-)

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 33b6c4d..4ceeab0 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -3460,8 +3460,6 @@ struct cgroup_pidlist {
 	struct list_head links;
 	/* pointer to the cgroup we belong to, for list removal purposes */
 	struct cgroup *owner;
-	/* protects the other fields */
-	struct rw_semaphore rwsem;
 	/* for delayed destruction */
 	struct delayed_work destroy_dwork;
 };
@@ -3520,7 +3518,6 @@ static void cgroup_pidlist_destroy_work_fn(struct work_struct *work)
 	struct cgroup_pidlist *tofree = NULL;
 
 	mutex_lock(&l->owner->pidlist_mutex);
-	down_write(&l->rwsem);
 
 	/*
 	 * Destroy iff we didn't race with a new user or get queued again.
@@ -3533,7 +3530,6 @@ static void cgroup_pidlist_destroy_work_fn(struct work_struct *work)
 		tofree = l;
 	}
 
-	up_write(&l->rwsem);
 	mutex_unlock(&l->owner->pidlist_mutex);
 	kfree(tofree);
 }
@@ -3610,7 +3606,6 @@ static struct cgroup_pidlist *cgroup_pidlist_find_create(struct cgroup *cgrp,
 	if (!l)
 		return l;
 
-	init_rwsem(&l->rwsem);
 	INIT_DELAYED_WORK(&l->destroy_dwork, cgroup_pidlist_destroy_work_fn);
 	l->key.type = type;
 	/* don't need task_nsproxy() if we're looking at ourself */
@@ -3673,12 +3668,10 @@ static int pidlist_array_load(struct cgroup *cgrp, enum cgroup_filetype type,
 	}
 
 	/* store array, freeing old if necessary */
-	down_write(&l->rwsem);
 	pidlist_free(l->list);
 	l->list = array;
 	l->length = length;
 	l->use_count++;
-	up_write(&l->rwsem);
 
 	mutex_unlock(&cgrp->pidlist_mutex);
 
@@ -3760,7 +3753,7 @@ static void *cgroup_pidlist_start(struct seq_file *s, loff_t *pos)
 	int index = 0, pid = *pos;
 	int *iter;
 
-	down_read(&l->rwsem);
+	mutex_lock(&of->cgrp->pidlist_mutex);
 	if (pid) {
 		int end = l->length;
 
@@ -3788,7 +3781,7 @@ static void cgroup_pidlist_stop(struct seq_file *s, void *v)
 {
 	struct cgroup_pidlist_open_file *of = s->private;
 
-	up_read(&of->pidlist->rwsem);
+	mutex_unlock(&of->cgrp->pidlist_mutex);
 }
 
 static void *cgroup_pidlist_next(struct seq_file *s, void *v, loff_t *pos)
@@ -3828,13 +3821,13 @@ static const struct seq_operations cgroup_pidlist_seq_operations = {
 
 static void cgroup_release_pid_array(struct cgroup_pidlist *l)
 {
-	down_write(&l->rwsem);
+	mutex_lock(&l->owner->pidlist_mutex);
 	BUG_ON(!l->use_count);
 	/* if the last user, arm the destroy work */
 	if (!--l->use_count)
 		mod_delayed_work(cgroup_pidlist_destroy_wq, &l->destroy_dwork,
 				 CGROUP_PIDLIST_DESTROY_DELAY);
-	up_write(&l->rwsem);
+	mutex_unlock(&l->owner->pidlist_mutex);
 }
 
 static int cgroup_pidlist_release(struct inode *inode, struct file *file)
-- 
1.8.4.2


* [PATCH 7/9] cgroup: load and release pidlists from seq_file start and stop respectively
       [not found] ` <1385331096-7918-1-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
                     ` (5 preceding siblings ...)
  2013-11-24 22:11   ` [PATCH 6/9] cgroup: remove cgroup_pidlist->rwsem Tejun Heo
@ 2013-11-24 22:11   ` Tejun Heo
       [not found]     ` <1385331096-7918-8-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
  2013-11-24 22:11   ` [PATCH 8/9] cgroup: remove cgroup_pidlist->use_count Tejun Heo
                     ` (3 subsequent siblings)
  10 siblings, 1 reply; 16+ messages in thread
From: Tejun Heo @ 2013-11-24 22:11 UTC (permalink / raw)
  To: lizefan-hv44wF8Li93QT0dZR+AlfA
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	cgroups-u79uwXL29TY76Z2rM5mHXA, Tejun Heo

Currently, pidlists are reference counted from file open and release
methods.  This means that holding onto an open file may waste memory
and reads may return data which is very stale.  Both aren't critical
because pidlists are keyed and shared per namespace and, well, the
user isn't supposed to have a large delay between open and reads.

cgroup is planned to be converted to use kernfs and it'd be best if we
can stick to just the seq_file operations - start, next, stop and
show.  This can be achieved by loading pidlist on demand from start
and release with time delay from stop, so that consecutive reads don't
end up reloading the pidlist on each iteration.  This would remove the
need for hooking into open and release while also avoiding issues with
holding onto pidlist for too long.

The previous patches implemented delayed release and restructured
pidlist handling so that pidlists can be loaded and released from
seq_file start / stop.  This patch actually moves pidlist load to
start and release to stop.

This means that pidlist is pinned only between start and stop and may
go away between two consecutive read calls if the two calls are more
than CGROUP_PIDLIST_DESTROY_DELAY apart.  cgroup_pidlist_start()
thus can't re-use the stored cgroup_pidlist_open_file->pidlist
directly.  During start, it's only used as a hint indicating whether
this is the first start() after open; the pidlist is always looked
up or created.

pidlist_mutex locking and reference counting are moved out of
pidlist_array_load() so that pidlist_array_load() can perform lookup
and creation atomically.  While this enlarges the area covered by
pidlist_mutex, given how the lock is used, it's highly unlikely to be
noticeable.

Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
---
 kernel/cgroup.c | 62 +++++++++++++++++++++++++++++++--------------------------
 1 file changed, 34 insertions(+), 28 deletions(-)

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 4ceeab0..8d2d9d4 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -3471,6 +3471,8 @@ struct cgroup_pidlist_open_file {
 	struct cgroup_pidlist		*pidlist;
 };
 
+static void cgroup_release_pid_array(struct cgroup_pidlist *l);
+
 /*
  * The following two functions "fix" the issue where there are more pids
  * than kmalloc will give memory for; in such cases, we use vmalloc/vfree.
@@ -3628,6 +3630,8 @@ static int pidlist_array_load(struct cgroup *cgrp, enum cgroup_filetype type,
 	struct task_struct *tsk;
 	struct cgroup_pidlist *l;
 
+	lockdep_assert_held(&cgrp->pidlist_mutex);
+
 	/*
 	 * If cgroup gets more users after we read count, we won't have
 	 * enough space - tough.  This race is indistinguishable to the
@@ -3658,8 +3662,6 @@ static int pidlist_array_load(struct cgroup *cgrp, enum cgroup_filetype type,
 	if (type == CGROUP_FILE_PROCS)
 		length = pidlist_uniq(array, length);
 
-	mutex_lock(&cgrp->pidlist_mutex);
-
 	l = cgroup_pidlist_find_create(cgrp, type);
 	if (!l) {
 		mutex_unlock(&cgrp->pidlist_mutex);
@@ -3671,10 +3673,6 @@ static int pidlist_array_load(struct cgroup *cgrp, enum cgroup_filetype type,
 	pidlist_free(l->list);
 	l->list = array;
 	l->length = length;
-	l->use_count++;
-
-	mutex_unlock(&cgrp->pidlist_mutex);
-
 	*lp = l;
 	return 0;
 }
@@ -3749,11 +3747,34 @@ static void *cgroup_pidlist_start(struct seq_file *s, loff_t *pos)
 	 * next pid to display, if any
 	 */
 	struct cgroup_pidlist_open_file *of = s->private;
-	struct cgroup_pidlist *l = of->pidlist;
+	struct cgroup *cgrp = of->cgrp;
+	struct cgroup_pidlist *l;
 	int index = 0, pid = *pos;
-	int *iter;
+	int *iter, ret;
+
+	mutex_lock(&cgrp->pidlist_mutex);
+
+	/*
+	 * !NULL @of->pidlist indicates that this isn't the first start()
+	 * after open.  If the matching pidlist is around, we can use that.
+	 * Look for it.  Note that @of->pidlist can't be used directly.  It
+	 * could already have been destroyed.
+	 */
+	if (of->pidlist)
+		of->pidlist = cgroup_pidlist_find(cgrp, of->type);
+
+	/*
+	 * Either this is the first start() after open or the matching
+	 * pidlist has been destroyed inbetween.  Create a new one.
+	 */
+	if (!of->pidlist) {
+		ret = pidlist_array_load(of->cgrp, of->type, &of->pidlist);
+		if (ret)
+			return ERR_PTR(ret);
+	}
+	l = of->pidlist;
+	l->use_count++;
 
-	mutex_lock(&of->cgrp->pidlist_mutex);
 	if (pid) {
 		int end = l->length;
 
@@ -3782,6 +3803,8 @@ static void cgroup_pidlist_stop(struct seq_file *s, void *v)
 	struct cgroup_pidlist_open_file *of = s->private;
 
 	mutex_unlock(&of->cgrp->pidlist_mutex);
+	if (of->pidlist)
+		cgroup_release_pid_array(of->pidlist);
 }
 
 static void *cgroup_pidlist_next(struct seq_file *s, void *v, loff_t *pos)
@@ -3830,20 +3853,11 @@ static void cgroup_release_pid_array(struct cgroup_pidlist *l)
 	mutex_unlock(&l->owner->pidlist_mutex);
 }
 
-static int cgroup_pidlist_release(struct inode *inode, struct file *file)
-{
-	struct cgroup_pidlist_open_file *of;
-
-	of = ((struct seq_file *)file->private_data)->private;
-	cgroup_release_pid_array(of->pidlist);
-	return seq_release_private(inode, file);
-}
-
 static const struct file_operations cgroup_pidlist_operations = {
 	.read = seq_read,
 	.llseek = seq_lseek,
 	.write = cgroup_file_write,
-	.release = cgroup_pidlist_release,
+	.release = seq_release_private,
 };
 
 /*
@@ -3856,27 +3870,19 @@ static int cgroup_pidlist_open(struct file *file, enum cgroup_filetype type)
 {
 	struct cgroup *cgrp = __d_cgrp(file->f_dentry->d_parent);
 	struct cgroup_pidlist_open_file *of;
-	struct cgroup_pidlist *l;
 	int retval;
 
-	/* have the array populated */
-	retval = pidlist_array_load(cgrp, type, &l);
-	if (retval)
-		return retval;
 	/* configure file information */
 	file->f_op = &cgroup_pidlist_operations;
 
 	retval = seq_open_private(file, &cgroup_pidlist_seq_operations,
 				  sizeof(*of));
-	if (retval) {
-		cgroup_release_pid_array(l);
+	if (retval)
 		return retval;
-	}
 
 	of = ((struct seq_file *)file->private_data)->private;
 	of->type = type;
 	of->cgrp = cgrp;
-	of->pidlist = l;
 	return 0;
 }
 static int cgroup_tasks_open(struct inode *unused, struct file *file)
-- 
1.8.4.2


* [PATCH 8/9] cgroup: remove cgroup_pidlist->use_count
       [not found] ` <1385331096-7918-1-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
                     ` (6 preceding siblings ...)
  2013-11-24 22:11   ` [PATCH 7/9] cgroup: load and release pidlists from seq_file start and stop respectively Tejun Heo
@ 2013-11-24 22:11   ` Tejun Heo
  2013-11-24 22:11   ` [PATCH 9/9] cgroup: don't guarantee cgroup.procs is sorted if sane_behavior Tejun Heo
                     ` (2 subsequent siblings)
  10 siblings, 0 replies; 16+ messages in thread
From: Tejun Heo @ 2013-11-24 22:11 UTC (permalink / raw)
  To: lizefan-hv44wF8Li93QT0dZR+AlfA
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	cgroups-u79uwXL29TY76Z2rM5mHXA, Tejun Heo

After the recent changes, pidlist ref is held only between
cgroup_pidlist_start() and cgroup_pidlist_stop() during which
cgroup->pidlist_mutex is also held.  IOW, the reference count is
redundant now.  While in use, it's always one and pidlist_mutex is
held; holding the mutex provides exactly the same protection.

This patch collapses destroy_dwork queueing into cgroup_pidlist_stop()
so that pidlist_mutex is not released in between, and drops
pidlist->use_count.

This patch shouldn't introduce any behavior changes.

Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
---
 kernel/cgroup.c | 28 +++++++---------------------
 1 file changed, 7 insertions(+), 21 deletions(-)

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 8d2d9d4..5c8cf0b 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -3454,8 +3454,6 @@ struct cgroup_pidlist {
 	pid_t *list;
 	/* how many elements the above list has */
 	int length;
-	/* how many files are using the current array */
-	int use_count;
 	/* each of these stored in a list by its cgroup */
 	struct list_head links;
 	/* pointer to the cgroup we belong to, for list removal purposes */
@@ -3471,8 +3469,6 @@ struct cgroup_pidlist_open_file {
 	struct cgroup_pidlist		*pidlist;
 };
 
-static void cgroup_release_pid_array(struct cgroup_pidlist *l);
-
 /*
  * The following two functions "fix" the issue where there are more pids
  * than kmalloc will give memory for; in such cases, we use vmalloc/vfree.
@@ -3522,10 +3518,10 @@ static void cgroup_pidlist_destroy_work_fn(struct work_struct *work)
 	mutex_lock(&l->owner->pidlist_mutex);
 
 	/*
-	 * Destroy iff we didn't race with a new user or get queued again.
-	 * Queued state won't change as it can only be queued while locked.
+	 * Destroy iff we didn't get queued again.  The state won't change
+	 * as destroy_dwork can only be queued while locked.
 	 */
-	if (!l->use_count && !delayed_work_pending(dwork)) {
+	if (!delayed_work_pending(dwork)) {
 		list_del(&l->links);
 		pidlist_free(l->list);
 		put_pid_ns(l->key.ns);
@@ -3773,7 +3769,6 @@ static void *cgroup_pidlist_start(struct seq_file *s, loff_t *pos)
 			return ERR_PTR(ret);
 	}
 	l = of->pidlist;
-	l->use_count++;
 
 	if (pid) {
 		int end = l->length;
@@ -3802,9 +3797,11 @@ static void cgroup_pidlist_stop(struct seq_file *s, void *v)
 {
 	struct cgroup_pidlist_open_file *of = s->private;
 
-	mutex_unlock(&of->cgrp->pidlist_mutex);
 	if (of->pidlist)
-		cgroup_release_pid_array(of->pidlist);
+		mod_delayed_work(cgroup_pidlist_destroy_wq,
+				 &of->pidlist->destroy_dwork,
+				 CGROUP_PIDLIST_DESTROY_DELAY);
+	mutex_unlock(&of->cgrp->pidlist_mutex);
 }
 
 static void *cgroup_pidlist_next(struct seq_file *s, void *v, loff_t *pos)
@@ -3842,17 +3839,6 @@ static const struct seq_operations cgroup_pidlist_seq_operations = {
 	.show = cgroup_pidlist_show,
 };
 
-static void cgroup_release_pid_array(struct cgroup_pidlist *l)
-{
-	mutex_lock(&l->owner->pidlist_mutex);
-	BUG_ON(!l->use_count);
-	/* if the last user, arm the destroy work */
-	if (!--l->use_count)
-		mod_delayed_work(cgroup_pidlist_destroy_wq, &l->destroy_dwork,
-				 CGROUP_PIDLIST_DESTROY_DELAY);
-	mutex_unlock(&l->owner->pidlist_mutex);
-}
-
 static const struct file_operations cgroup_pidlist_operations = {
 	.read = seq_read,
 	.llseek = seq_lseek,
-- 
1.8.4.2


* [PATCH 9/9] cgroup: don't guarantee cgroup.procs is sorted if sane_behavior
       [not found] ` <1385331096-7918-1-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
                     ` (7 preceding siblings ...)
  2013-11-24 22:11   ` [PATCH 8/9] cgroup: remove cgroup_pidlist->use_count Tejun Heo
@ 2013-11-24 22:11   ` Tejun Heo
  2013-11-27 23:23   ` [PATCHSET cgroup-for-3.14] cgroup: restructure pidlist handling Tejun Heo
  2013-11-29  1:03   ` Li Zefan
  10 siblings, 0 replies; 16+ messages in thread
From: Tejun Heo @ 2013-11-24 22:11 UTC (permalink / raw)
  To: lizefan-hv44wF8Li93QT0dZR+AlfA
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	cgroups-u79uwXL29TY76Z2rM5mHXA, Tejun Heo

For some reason, tasks and cgroup.procs guarantee that the result is
sorted.  This is the only reason this whole pidlist logic is necessary
instead of just iterating through sorted member tasks.  We can't do
anything about the existing interface but at least ensure that no such
expectation exists for the new interface, so that the pidlist logic
may be removed in the distant future.

This patch scrambles the sort order if sane_behavior so that the
output is usually not sorted in the new interface.

Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
---
 include/linux/cgroup.h |  3 +++
 kernel/cgroup.c        | 51 +++++++++++++++++++++++++++++++++++++++++++++-----
 2 files changed, 49 insertions(+), 5 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 5207c28..50d8cc3 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -275,6 +275,9 @@ enum {
 	 * - "tasks" is removed.  Everything should be at process
 	 *   granularity.  Use "cgroup.procs" instead.
 	 *
+	 * - "cgroup.procs" is not sorted.  pids will be unique unless they
+	 *   got recycled inbetween reads.
+	 *
 	 * - "release_agent" and "notify_on_release" are removed.
 	 *   Replacement notification mechanism will be implemented.
 	 *
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 5c8cf0b..b848b4f 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -3562,11 +3562,49 @@ after:
 	return dest;
 }
 
+/*
+ * The two pid files - task and cgroup.procs - guaranteed that the result
+ * is sorted, which forced this whole pidlist fiasco.  As pid order is
+ * different per namespace, each namespace needs differently sorted list,
+ * making it impossible to use, for example, single rbtree of member tasks
+ * sorted by task pointer.  As pidlists can be fairly large, allocating one
+ * per open file is dangerous, so cgroup had to implement shared pool of
+ * pidlists keyed by cgroup and namespace.
+ *
+ * All this extra complexity was caused by the original implementation
+ * committing to an entirely unnecessary property.  In the long term, we
+ * want to do away with it.  Explicitly scramble sort order if
+ * sane_behavior so that no such expectation exists in the new interface.
+ *
+ * Scrambling is done by swapping every two consecutive bits, which is
+ * non-identity one-to-one mapping which disturbs sort order sufficiently.
+ */
+static pid_t pid_fry(pid_t pid)
+{
+	unsigned a = pid & 0x55555555;
+	unsigned b = pid & 0xAAAAAAAA;
+
+	return (a << 1) | (b >> 1);
+}
+
+static pid_t cgroup_pid_fry(struct cgroup *cgrp, pid_t pid)
+{
+	if (cgroup_sane_behavior(cgrp))
+		return pid_fry(pid);
+	else
+		return pid;
+}
+
 static int cmppid(const void *a, const void *b)
 {
 	return *(pid_t *)a - *(pid_t *)b;
 }
 
+static int fried_cmppid(const void *a, const void *b)
+{
+	return pid_fry(*(pid_t *)a) - pid_fry(*(pid_t *)b);
+}
+
 static struct cgroup_pidlist *cgroup_pidlist_find(struct cgroup *cgrp,
 						  enum cgroup_filetype type)
 {
@@ -3654,7 +3692,10 @@ static int pidlist_array_load(struct cgroup *cgrp, enum cgroup_filetype type,
 	css_task_iter_end(&it);
 	length = n;
 	/* now sort & (if procs) strip out duplicates */
-	sort(array, length, sizeof(pid_t), cmppid, NULL);
+	if (cgroup_sane_behavior(cgrp))
+		sort(array, length, sizeof(pid_t), fried_cmppid, NULL);
+	else
+		sort(array, length, sizeof(pid_t), cmppid, NULL);
 	if (type == CGROUP_FILE_PROCS)
 		length = pidlist_uniq(array, length);
 
@@ -3775,10 +3816,10 @@ static void *cgroup_pidlist_start(struct seq_file *s, loff_t *pos)
 
 		while (index < end) {
 			int mid = (index + end) / 2;
-			if (l->list[mid] == pid) {
+			if (cgroup_pid_fry(cgrp, l->list[mid]) == pid) {
 				index = mid;
 				break;
-			} else if (l->list[mid] <= pid)
+			} else if (cgroup_pid_fry(cgrp, l->list[mid]) <= pid)
 				index = mid + 1;
 			else
 				end = mid;
@@ -3789,7 +3830,7 @@ static void *cgroup_pidlist_start(struct seq_file *s, loff_t *pos)
 		return NULL;
 	/* Update the abstract position to be the actual pid that we found */
 	iter = l->list + index;
-	*pos = *iter;
+	*pos = cgroup_pid_fry(cgrp, *iter);
 	return iter;
 }
 
@@ -3818,7 +3859,7 @@ static void *cgroup_pidlist_next(struct seq_file *s, void *v, loff_t *pos)
 	if (p >= end) {
 		return NULL;
 	} else {
-		*pos = *p;
+		*pos = cgroup_pid_fry(of->cgrp, *p);
 		return p;
 	}
 }
-- 
1.8.4.2


* Re: [PATCHSET cgroup-for-3.14] cgroup: restructure pidlist handling
       [not found] ` <1385331096-7918-1-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
                     ` (8 preceding siblings ...)
  2013-11-24 22:11   ` [PATCH 9/9] cgroup: don't guarantee cgroup.procs is sorted if sane_behavior Tejun Heo
@ 2013-11-27 23:23   ` Tejun Heo
  2013-11-29  1:03   ` Li Zefan
  10 siblings, 0 replies; 16+ messages in thread
From: Tejun Heo @ 2013-11-27 23:23 UTC (permalink / raw)
  To: lizefan-hv44wF8Li93QT0dZR+AlfA
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	cgroups-u79uwXL29TY76Z2rM5mHXA

On Sun, Nov 24, 2013 at 05:11:27PM -0500, Tejun Heo wrote:
> This patchset is on top of cgroup/for-3.14 edab95103d3a ("cgroup:
> Merge branch 'memcg_event' into for-3.14") and available in the
> following git branch.
> 
>  git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git review-pidlist

Patchset rebased on cgroup/for-3.14 c729b11edf74 ("cgroup: Merge
branch 'for-3.13-fixes' into for-3.14 ").  git branch updated
accordingly.

Thanks.

-- 
tejun


* Re: [PATCH 4/9] cgroup: introduce struct cgroup_pidlist_open_file
       [not found]     ` <1385331096-7918-5-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
@ 2013-11-29  1:03       ` Li Zefan
  2013-11-29 15:44       ` [PATCH v2 " Tejun Heo
  1 sibling, 0 replies; 16+ messages in thread
From: Li Zefan @ 2013-11-29  1:03 UTC (permalink / raw)
  To: Tejun Heo
  Cc: cgroups-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

> @@ -3841,6 +3851,7 @@ static const struct file_operations cgroup_pidlist_operations = {
>  static int cgroup_pidlist_open(struct file *file, enum cgroup_filetype type)
>  {
>  	struct cgroup *cgrp = __d_cgrp(file->f_dentry->d_parent);
> +	struct cgroup_pidlist_open_file *of;
>  	struct cgroup_pidlist *l;
>  	int retval;
>  
> @@ -3851,12 +3862,17 @@ static int cgroup_pidlist_open(struct file *file, enum cgroup_filetype type)
>  	/* configure file information */
>  	file->f_op = &cgroup_pidlist_operations;
>  
> -	retval = seq_open(file, &cgroup_pidlist_seq_operations);
> +	retval = seq_open_private(file, &cgroup_pidlist_seq_operations,
> +				  sizeof(*of));

should be a bit simpler to use __seq_open_private():

	of = __seq_open_private(...);
	if (!of) {
		...
		return -ENOMEM;
	}

and then @retval can be removed in 6/7.

>  	if (retval) {
>  		cgroup_release_pid_array(l);
>  		return retval;
>  	}
> -	((struct seq_file *)file->private_data)->private = l;
> +
> +	of = ((struct seq_file *)file->private_data)->private;
> +	of->type = type;
> +	of->cgrp = cgrp;
> +	of->pidlist = l;
>  	return 0;
>  }
>  static int cgroup_tasks_open(struct inode *unused, struct file *file)
> 


* Re: [PATCHSET cgroup-for-3.14] cgroup: restructure pidlist handling
       [not found] ` <1385331096-7918-1-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
                     ` (9 preceding siblings ...)
  2013-11-27 23:23   ` [PATCHSET cgroup-for-3.14] cgroup: restructure pidlist handling Tejun Heo
@ 2013-11-29  1:03   ` Li Zefan
       [not found]     ` <5297E7E2.8080404-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
  10 siblings, 1 reply; 16+ messages in thread
From: Li Zefan @ 2013-11-29  1:03 UTC (permalink / raw)
  To: Tejun Heo
  Cc: cgroups-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

> pidlist handling is quite elaborate.  Because the pidlist files -
> "tasks" and "cgroup.procs" - guarantee that the result is sorted and a
> task can be associated with different pids, with no inherent order
> among them, depending on namespaces, it is impossible to give a
> certain order to tasks of a cgroup and then just iterate through them.
> 
> Instead, we end up creating tables of the relevant ids and then sort
> them before serving them out for reads.  As those tables can be huge,
> we also implement logic to share those tables if the id type and
> namespace match, which in turn involves reference counting those
> tables and synchronizing accesses to them.
> 
> What could have been a simple iteration through the member tasks
> became this unnecessary hunk of complexity because it, for some
> reason, wanted to guarantee sorted output, which is extremely unusual
> for this type of interface.
> 
> The refcnting is done from open() and release() callbacks, which
> kernfs doesn't expose.  This patchset updates pidlist handling so that
> pidlists are managed from seq_file operations proper.  As the duration
> between the paired start and stop denotes a single read invocation and
> we don't want to reload pidlist for each instance of consecutive read
> calls, the pidlist is released with a time delay.  This also bounds how
> stale the output of read calls can be.  This makes refcnting
> unnecessary - locking is simplified and refcnting is dropped.
> 
> In the long term, we want to do away with pidlist and make this a
> simple iteration over member tasks.  The last patch scrambles the sort
> order of "cgroup.procs" if sane_behavior, so that the sorted
> expectation is broken in the new interface and we can eventually drop
> pidlist logic.
> 
> This patchset contains the following nine patches.
> 
>  0001-cgroup-don-t-skip-seq_open-on-write-only-opens-on-pi.patch
>  0002-cgroup-remove-cftype-release.patch
>  0003-cgroup-implement-delayed-destruction-for-cgroup_pidl.patch
>  0004-cgroup-introduce-struct-cgroup_pidlist_open_file.patch
>  0005-cgroup-refactor-cgroup_pidlist_find.patch
>  0006-cgroup-remove-cgroup_pidlist-rwsem.patch
>  0007-cgroup-load-and-release-pidlists-from-seq_file-start.patch
>  0008-cgroup-remove-cgroup_pidlist-use_count.patch
>  0009-cgroup-don-t-guarantee-cgroup.procs-is-sorted-if-san.patch
> 

Acked-by: Li Zefan <lizefan-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>


* [PATCH v2 4/9] cgroup: introduce struct cgroup_pidlist_open_file
       [not found]     ` <1385331096-7918-5-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
  2013-11-29  1:03       ` Li Zefan
@ 2013-11-29 15:44       ` Tejun Heo
  1 sibling, 0 replies; 16+ messages in thread
From: Tejun Heo @ 2013-11-29 15:44 UTC (permalink / raw)
  To: lizefan-hv44wF8Li93QT0dZR+AlfA
  Cc: cgroups-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

From 62236858f3cc558135d1ab6b2af44d22c489bb7f Mon Sep 17 00:00:00 2001
From: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
Date: Fri, 29 Nov 2013 10:42:58 -0500

For pidlist files, seq_file->private pointed to the loaded
cgroup_pidlist; however, pidlist loading is planned to be moved to
cgroup_pidlist_start() for kernfs conversion and seq_file->private
needs to carry more information from open to allow that.

This patch introduces struct cgroup_pidlist_open_file which contains
type, cgrp and pidlist and updates pidlist seq_file->private to point
to it using seq_open_private() and seq_release_private().  Note that
this eventually will be replaced by kernfs_open_file.

While this patch makes more information available to seq_file
operations, they don't use it yet and this patch doesn't introduce any
behavior changes except for allocation of the extra private struct.

v2: use __seq_open_private() instead of seq_open_private() for brevity
    as suggested by Li.

Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
Acked-by: Li Zefan <lizefan-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
---
 kernel/cgroup.c | 39 +++++++++++++++++++++++++++------------
 1 file changed, 27 insertions(+), 12 deletions(-)

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index acdcddf..ef019c0 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -3468,6 +3468,13 @@ struct cgroup_pidlist {
 	struct delayed_work destroy_dwork;
 };
 
+/* seq_file->private points to the following */
+struct cgroup_pidlist_open_file {
+	enum cgroup_filetype		type;
+	struct cgroup			*cgrp;
+	struct cgroup_pidlist		*pidlist;
+};
+
 /*
  * The following two functions "fix" the issue where there are more pids
  * than kmalloc will give memory for; in such cases, we use vmalloc/vfree.
@@ -3739,7 +3746,8 @@ static void *cgroup_pidlist_start(struct seq_file *s, loff_t *pos)
 	 * after a seek to the start). Use a binary-search to find the
 	 * next pid to display, if any
 	 */
-	struct cgroup_pidlist *l = s->private;
+	struct cgroup_pidlist_open_file *of = s->private;
+	struct cgroup_pidlist *l = of->pidlist;
 	int index = 0, pid = *pos;
 	int *iter;
 
@@ -3769,13 +3777,15 @@ static void *cgroup_pidlist_start(struct seq_file *s, loff_t *pos)
 
 static void cgroup_pidlist_stop(struct seq_file *s, void *v)
 {
-	struct cgroup_pidlist *l = s->private;
-	up_read(&l->rwsem);
+	struct cgroup_pidlist_open_file *of = s->private;
+
+	up_read(&of->pidlist->rwsem);
 }
 
 static void *cgroup_pidlist_next(struct seq_file *s, void *v, loff_t *pos)
 {
-	struct cgroup_pidlist *l = s->private;
+	struct cgroup_pidlist_open_file *of = s->private;
+	struct cgroup_pidlist *l = of->pidlist;
 	pid_t *p = v;
 	pid_t *end = l->list + l->length;
 	/*
@@ -3820,11 +3830,11 @@ static void cgroup_release_pid_array(struct cgroup_pidlist *l)
 
 static int cgroup_pidlist_release(struct inode *inode, struct file *file)
 {
-	struct cgroup_pidlist *l;
+	struct cgroup_pidlist_open_file *of;
 
-	l = ((struct seq_file *)file->private_data)->private;
-	cgroup_release_pid_array(l);
-	return seq_release(inode, file);
+	of = ((struct seq_file *)file->private_data)->private;
+	cgroup_release_pid_array(of->pidlist);
+	return seq_release_private(inode, file);
 }
 
 static const struct file_operations cgroup_pidlist_operations = {
@@ -3843,6 +3853,7 @@ static const struct file_operations cgroup_pidlist_operations = {
 static int cgroup_pidlist_open(struct file *file, enum cgroup_filetype type)
 {
 	struct cgroup *cgrp = __d_cgrp(file->f_dentry->d_parent);
+	struct cgroup_pidlist_open_file *of;
 	struct cgroup_pidlist *l;
 	int retval;
 
@@ -3853,12 +3864,16 @@ static int cgroup_pidlist_open(struct file *file, enum cgroup_filetype type)
 	/* configure file information */
 	file->f_op = &cgroup_pidlist_operations;
 
-	retval = seq_open(file, &cgroup_pidlist_seq_operations);
-	if (retval) {
+	of = __seq_open_private(file, &cgroup_pidlist_seq_operations,
+				sizeof(*of));
+	if (!of) {
 		cgroup_release_pid_array(l);
-		return retval;
+		return -ENOMEM;
 	}
-	((struct seq_file *)file->private_data)->private = l;
+
+	of->type = type;
+	of->cgrp = cgrp;
+	of->pidlist = l;
 	return 0;
 }
 static int cgroup_tasks_open(struct inode *unused, struct file *file)
-- 
1.8.4.2


* [PATCH v2 7/9] cgroup: load and release pidlists from seq_file start and stop respectively
       [not found]     ` <1385331096-7918-8-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
@ 2013-11-29 15:45       ` Tejun Heo
  0 siblings, 0 replies; 16+ messages in thread
From: Tejun Heo @ 2013-11-29 15:45 UTC (permalink / raw)
  To: lizefan-hv44wF8Li93QT0dZR+AlfA
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	cgroups-u79uwXL29TY76Z2rM5mHXA

From 4bac00d16a8760eae7205e41d2c246477d42a210 Mon Sep 17 00:00:00 2001
From: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
Date: Fri, 29 Nov 2013 10:42:59 -0500

Currently, pidlists are reference counted from file open and release
methods.  This means that holding onto an open file may waste memory
and reads may return data which is very stale.  Neither is critical
because pidlists are keyed and shared per namespace and, well, the
user isn't supposed to have a large delay between open and reads.

cgroup is planned to be converted to use kernfs and it'd be best if we
can stick to just the seq_file operations - start, next, stop and
show.  This can be achieved by loading the pidlist on demand from
start and releasing it with a time delay from stop, so that
consecutive reads don't end up reloading the pidlist on each
iteration.  This would remove the need for hooking into open and
release while also avoiding issues with holding onto the pidlist for
too long.

The previous patches implemented delayed release and restructured
pidlist handling so that pidlists can be loaded and released from
seq_file start / stop.  This patch actually moves pidlist load to
start and release to stop.

This means that pidlist is pinned only between start and stop and may
go away between two consecutive read calls if the two calls are more
than CGROUP_PIDLIST_DESTROY_DELAY apart.  cgroup_pidlist_start()
thus can't re-use the stored cgroup_pidlist_open_file->pidlist
directly.  During start, it's only used as a hint indicating whether
this is the first start() after open; the pidlist is always looked
up or created.

pidlist_mutex locking and reference counting are moved out of
pidlist_array_load() so that pidlist_array_load() can perform lookup
and creation atomically.  While this enlarges the area covered by
pidlist_mutex, given how the lock is used, it's highly unlikely to be
noticeable.

v2: Refreshed on top of the updated "cgroup: introduce struct
    cgroup_pidlist_open_file".

Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
Acked-by: Li Zefan <lizefan-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
---
 kernel/cgroup.c | 63 +++++++++++++++++++++++++++++++--------------------------
 1 file changed, 34 insertions(+), 29 deletions(-)

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index dc39e17..671cbde 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -3473,6 +3473,8 @@ struct cgroup_pidlist_open_file {
 	struct cgroup_pidlist		*pidlist;
 };
 
+static void cgroup_release_pid_array(struct cgroup_pidlist *l);
+
 /*
  * The following two functions "fix" the issue where there are more pids
  * than kmalloc will give memory for; in such cases, we use vmalloc/vfree.
@@ -3630,6 +3632,8 @@ static int pidlist_array_load(struct cgroup *cgrp, enum cgroup_filetype type,
 	struct task_struct *tsk;
 	struct cgroup_pidlist *l;
 
+	lockdep_assert_held(&cgrp->pidlist_mutex);
+
 	/*
 	 * If cgroup gets more users after we read count, we won't have
 	 * enough space - tough.  This race is indistinguishable to the
@@ -3660,8 +3664,6 @@ static int pidlist_array_load(struct cgroup *cgrp, enum cgroup_filetype type,
 	if (type == CGROUP_FILE_PROCS)
 		length = pidlist_uniq(array, length);
 
-	mutex_lock(&cgrp->pidlist_mutex);
-
 	l = cgroup_pidlist_find_create(cgrp, type);
 	if (!l) {
 		mutex_unlock(&cgrp->pidlist_mutex);
@@ -3673,10 +3675,6 @@ static int pidlist_array_load(struct cgroup *cgrp, enum cgroup_filetype type,
 	pidlist_free(l->list);
 	l->list = array;
 	l->length = length;
-	l->use_count++;
-
-	mutex_unlock(&cgrp->pidlist_mutex);
-
 	*lp = l;
 	return 0;
 }
@@ -3751,11 +3749,34 @@ static void *cgroup_pidlist_start(struct seq_file *s, loff_t *pos)
 	 * next pid to display, if any
 	 */
 	struct cgroup_pidlist_open_file *of = s->private;
-	struct cgroup_pidlist *l = of->pidlist;
+	struct cgroup *cgrp = of->cgrp;
+	struct cgroup_pidlist *l;
 	int index = 0, pid = *pos;
-	int *iter;
+	int *iter, ret;
+
+	mutex_lock(&cgrp->pidlist_mutex);
+
+	/*
+	 * !NULL @of->pidlist indicates that this isn't the first start()
+	 * after open.  If the matching pidlist is around, we can use that.
+	 * Look for it.  Note that @of->pidlist can't be used directly.  It
+	 * could already have been destroyed.
+	 */
+	if (of->pidlist)
+		of->pidlist = cgroup_pidlist_find(cgrp, of->type);
+
+	/*
+	 * Either this is the first start() after open or the matching
+	 * pidlist has been destroyed inbetween.  Create a new one.
+	 */
+	if (!of->pidlist) {
+		ret = pidlist_array_load(of->cgrp, of->type, &of->pidlist);
+		if (ret)
+			return ERR_PTR(ret);
+	}
+	l = of->pidlist;
+	l->use_count++;
 
-	mutex_lock(&of->cgrp->pidlist_mutex);
 	if (pid) {
 		int end = l->length;
 
@@ -3784,6 +3805,8 @@ static void cgroup_pidlist_stop(struct seq_file *s, void *v)
 	struct cgroup_pidlist_open_file *of = s->private;
 
 	mutex_unlock(&of->cgrp->pidlist_mutex);
+	if (of->pidlist)
+		cgroup_release_pid_array(of->pidlist);
 }
 
 static void *cgroup_pidlist_next(struct seq_file *s, void *v, loff_t *pos)
@@ -3832,20 +3855,11 @@ static void cgroup_release_pid_array(struct cgroup_pidlist *l)
 	mutex_unlock(&l->owner->pidlist_mutex);
 }
 
-static int cgroup_pidlist_release(struct inode *inode, struct file *file)
-{
-	struct cgroup_pidlist_open_file *of;
-
-	of = ((struct seq_file *)file->private_data)->private;
-	cgroup_release_pid_array(of->pidlist);
-	return seq_release_private(inode, file);
-}
-
 static const struct file_operations cgroup_pidlist_operations = {
 	.read = seq_read,
 	.llseek = seq_lseek,
 	.write = cgroup_file_write,
-	.release = cgroup_pidlist_release,
+	.release = seq_release_private,
 };
 
 /*
@@ -3858,26 +3872,17 @@ static int cgroup_pidlist_open(struct file *file, enum cgroup_filetype type)
 {
 	struct cgroup *cgrp = __d_cgrp(file->f_dentry->d_parent);
 	struct cgroup_pidlist_open_file *of;
-	struct cgroup_pidlist *l;
-	int retval;
 
-	/* have the array populated */
-	retval = pidlist_array_load(cgrp, type, &l);
-	if (retval)
-		return retval;
 	/* configure file information */
 	file->f_op = &cgroup_pidlist_operations;
 
 	of = __seq_open_private(file, &cgroup_pidlist_seq_operations,
 				sizeof(*of));
-	if (!of) {
-		cgroup_release_pid_array(l);
+	if (!of)
 		return -ENOMEM;
-	}
 
 	of->type = type;
 	of->cgrp = cgrp;
-	of->pidlist = l;
 	return 0;
 }
 static int cgroup_tasks_open(struct inode *unused, struct file *file)
-- 
1.8.4.2


* Re: [PATCHSET cgroup-for-3.14] cgroup: restructure pidlist handling
       [not found]     ` <5297E7E2.8080404-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
@ 2013-11-29 15:46       ` Tejun Heo
  0 siblings, 0 replies; 16+ messages in thread
From: Tejun Heo @ 2013-11-29 15:46 UTC (permalink / raw)
  To: Li Zefan
  Cc: cgroups-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

On Fri, Nov 29, 2013 at 09:03:30AM +0800, Li Zefan wrote:
> 
> Acked-by: Li Zefan <lizefan-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>

Applied to cgroup/for-3.14 with the v2 updates applied.

Thanks.

-- 
tejun


Thread overview: 16+ messages
2013-11-24 22:11 [PATCHSET cgroup-for-3.14] cgroup: restructure pidlist handling Tejun Heo
     [not found] ` <1385331096-7918-1-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
2013-11-24 22:11   ` [PATCH 1/9] cgroup: don't skip seq_open on write only opens on pidlist files Tejun Heo
2013-11-24 22:11   ` [PATCH 2/9] cgroup: remove cftype->release() Tejun Heo
2013-11-24 22:11   ` [PATCH 3/9] cgroup: implement delayed destruction for cgroup_pidlist Tejun Heo
2013-11-24 22:11   ` [PATCH 4/9] cgroup: introduce struct cgroup_pidlist_open_file Tejun Heo
     [not found]     ` <1385331096-7918-5-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
2013-11-29  1:03       ` Li Zefan
2013-11-29 15:44       ` [PATCH v2 " Tejun Heo
2013-11-24 22:11   ` [PATCH 5/9] cgroup: refactor cgroup_pidlist_find() Tejun Heo
2013-11-24 22:11   ` [PATCH 6/9] cgroup: remove cgroup_pidlist->rwsem Tejun Heo
2013-11-24 22:11   ` [PATCH 7/9] cgroup: load and release pidlists from seq_file start and stop respectively Tejun Heo
     [not found]     ` <1385331096-7918-8-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
2013-11-29 15:45       ` [PATCH v2 " Tejun Heo
2013-11-24 22:11   ` [PATCH 8/9] cgroup: remove cgroup_pidlist->use_count Tejun Heo
2013-11-24 22:11   ` [PATCH 9/9] cgroup: don't guarantee cgroup.procs is sorted if sane_behavior Tejun Heo
2013-11-27 23:23   ` [PATCHSET cgroup-for-3.14] cgroup: restructure pidlist handling Tejun Heo
2013-11-29  1:03   ` Li Zefan
     [not found]     ` <5297E7E2.8080404-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
2013-11-29 15:46       ` Tejun Heo
