[PATCHSET cgroup/for-3.12] cgroup: make cgroup

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* [PATCHSET cgroup/for-3.12] cgroup: make cgroup_event specific to memcg
@ 2013-08-04 16:07 Tejun Heo
  2013-08-04 16:07 ` [PATCH 1/5] cgroup: implement CFTYPE_NO_PREFIX Tejun Heo
                   ` (5 more replies)
  0 siblings, 6 replies; 30+ messages in thread
From: Tejun Heo @ 2013-08-04 16:07 UTC (permalink / raw)
  To: lizefan, hannes, mhocko, bsingharora, kamezawa.hiroyu
  Cc: cgroups, linux-mm, linux-kernel

Hello,

Like many other things in cgroup, cgroup_event is way too flexible and
complex - it strives to provide completely flexible event monitoring
facility in cgroup proper which allows any number of users to monitor
custom events.  This is overboard, to say the least, and I strongly
think that cgroup should not any new usages of this facility and
preferably deprecate the existing usages if at all possible.

Fortunately, memcg along with vmpressure is the only user of the
facility and gets to keep it.  This patchset makes cgroup_event
specific to memcg, moves all related code into mm/memcontrol.c and
renames it to mem_cgroup_event so that its usage can't spread to other
subsystems and later deprecation and cleanup can be localized.

Note that after this patchset, cgroup.event_control file exists only
for the hierarchy which has memcg attached to it.  This is a userland
visible change but unlikely to be noticeable as the file has never
been meaningful outside memcg.  If this ever becomes problematic, we
can add a dummy file on hierarchies w/o memcg when !sane_behavior.

This patchset is consited of the following five patches.

 0001-cgroup-implement-CFTYPE_NO_PREFIX.patch
 0002-cgroup-export-__cgroup_from_dentry-and-__cgroup_dput.patch
 0003-cgroup-memcg-move-cgroup_event-implementation-to-mem.patch
 0004-cgroup-memcg-move-cgroup-event_list-_lock-and-event-.patch
 0005-memcg-rename-cgroup_event-to-mem_cgroup_event.patch

The patchset is on top of

  cgroup/for-3.12 61584e3f4 ("cgroup: Merge branch 'for-3.11-fixes' into for-3.12")
+ [1] cgroup: use cgroup_subsys_state as the primary subsystem interface handle

and available in the following branch.

 git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git review-memcg_event

diffstat follows.

 Documentation/cgroups/cgroups.txt |   19 --
 include/linux/cgroup.h            |   29 ---
 include/linux/vmpressure.h        |    1
 kernel/cgroup.c                   |  265 +--------------------------------
 mm/memcontrol.c                   |  302 ++++++++++++++++++++++++++++++++++++--
 5 files changed, 315 insertions(+), 301 deletions(-)

Thanks.

--
tejun

[1] https://lkml.org/lkml/2013/8/1/722

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH 1/5] cgroup: implement CFTYPE_NO_PREFIX
  2013-08-04 16:07 [PATCHSET cgroup/for-3.12] cgroup: make cgroup_event specific to memcg Tejun Heo
@ 2013-08-04 16:07 ` Tejun Heo
  2013-08-04 16:07 ` [PATCH 2/5] cgroup: export __cgroup_from_dentry() and __cgroup_dput() Tejun Heo
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 30+ messages in thread
From: Tejun Heo @ 2013-08-04 16:07 UTC (permalink / raw)
  To: lizefan, hannes, mhocko, bsingharora, kamezawa.hiroyu
  Cc: cgroups, linux-mm, linux-kernel, Tejun Heo, Glauber Costa

When cgroup files are created, cgroup core automatically prepends the
name of the subsystem as prefix.  This patch adds CFTYPE_NO_ which
disables the automatic prefix.  This is to work around historical
baggages and shouldn't be used for new files.

This will be used to move "cgroup.event_control" from cgroup core to
memcg.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Glauber Costa <glommer@gmail.com>
---
 include/linux/cgroup.h | 1 +
 kernel/cgroup.c        | 3 ++-
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index c40e508..30d6ec4 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -405,6 +405,7 @@ enum {
 	CFTYPE_ONLY_ON_ROOT	= (1 << 0),	/* only create on root cgrp */
 	CFTYPE_NOT_ON_ROOT	= (1 << 1),	/* don't create on root cgrp */
 	CFTYPE_INSANE		= (1 << 2),	/* don't create if sane_behavior */
+	CFTYPE_NO_PREFIX	= (1 << 3),	/* (DON'T USE FOR NEW FILES) no subsys prefix */
 };
 
 #define MAX_CFTYPE_NAME		64
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index c02a288..1b87e2b 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -2749,7 +2749,8 @@ static int cgroup_add_file(struct cgroup *cgrp, struct cftype *cft)
 	umode_t mode;
 	char name[MAX_CGROUP_TYPE_NAMELEN + MAX_CFTYPE_NAME + 2] = { 0 };
 
-	if (cft->ss && !(cgrp->root->flags & CGRP_ROOT_NOPREFIX)) {
+	if (cft->ss && !(cft->flags & CFTYPE_NO_PREFIX) &&
+	    !(cgrp->root->flags & CGRP_ROOT_NOPREFIX)) {
 		strcpy(name, cft->ss->name);
 		strcat(name, ".");
 	}
-- 
1.8.3.1

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH 2/5] cgroup: export __cgroup_from_dentry() and __cgroup_dput()
  2013-08-04 16:07 [PATCHSET cgroup/for-3.12] cgroup: make cgroup_event specific to memcg Tejun Heo
  2013-08-04 16:07 ` [PATCH 1/5] cgroup: implement CFTYPE_NO_PREFIX Tejun Heo
@ 2013-08-04 16:07 ` Tejun Heo
  2013-08-05  2:58   ` Li Zefan
  2013-08-05 17:08   ` [PATCH v2 2/5] cgroup: make __cgroup_from_dentry() and __cgroup_dput() global Tejun Heo
  2013-08-04 16:07 ` [PATCH 3/5] cgroup, memcg: move cgroup_event implementation to memcg Tejun Heo
                   ` (3 subsequent siblings)
  5 siblings, 2 replies; 30+ messages in thread
From: Tejun Heo @ 2013-08-04 16:07 UTC (permalink / raw)
  To: lizefan, hannes, mhocko, bsingharora, kamezawa.hiroyu
  Cc: cgroups, linux-mm, linux-kernel, Tejun Heo

cgroup_event will no longer be supported as cgroup generic mechanism
and be moved to memcg.  To enable the relocation, implement and export
__cgroup_from_dentry() which combines cgroup file dentry -> croup
mapping and cft discovery, and prefix cgroup_dput() with __ and export
it.

These functions exist and are exported only to enable moving
cgroup_event implementation to memcg and shouldn't grow any new users
and thus the __ prefix.

This patch is pure reorganization and doesn't introduce any functional
difference.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 include/linux/cgroup.h |  4 ++++
 kernel/cgroup.c        | 31 +++++++++++++++++--------------
 2 files changed, 21 insertions(+), 14 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 30d6ec4..2ac1021 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -923,6 +923,10 @@ bool css_is_ancestor(struct cgroup_subsys_state *cg,
 unsigned short css_id(struct cgroup_subsys_state *css);
 struct cgroup_subsys_state *cgroup_css_from_dir(struct file *f, int id);
 
+/* do not add new users of the following two functions */
+struct cgroup *__cgroup_from_dentry(struct dentry *dentry, struct cftype **cftp);
+void __cgroup_dput(struct cgroup *cgrp);
+
 #else /* !CONFIG_CGROUPS */
 
 static inline int cgroup_init_early(void) { return 0; }
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 1b87e2b..2583b7b 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -2661,15 +2661,18 @@ static struct dentry *cgroup_lookup(struct inode *dir, struct dentry *dentry, un
 	return NULL;
 }
 
-/*
- * Check if a file is a control file
- */
-static inline struct cftype *__file_cft(struct file *file)
+/* do not add new users */
+struct cgroup *__cgroup_from_dentry(struct dentry *dentry, struct cftype **cftp)
 {
-	if (file_inode(file)->i_fop != &cgroup_file_operations)
-		return ERR_PTR(-EINVAL);
-	return __d_cft(file->f_dentry);
+	if (!dentry->d_inode ||
+	    dentry->d_inode->i_op != &cgroup_file_inode_operations)
+		return NULL;
+
+	if (cftp)
+		*cftp = __d_cft(dentry);
+	return __d_cgrp(dentry->d_parent);
 }
+EXPORT_SYMBOL_GPL(__cgroup_from_dentry);
 
 static int cgroup_create_file(struct dentry *dentry, umode_t mode,
 				struct super_block *sb)
@@ -3953,7 +3956,7 @@ static int cgroup_write_notify_on_release(struct cgroup_subsys_state *css,
  *
  * That's why we hold a reference before dput() and drop it right after.
  */
-static void cgroup_dput(struct cgroup *cgrp)
+void __cgroup_dput(struct cgroup *cgrp)
 {
 	struct super_block *sb = cgrp->root->sb;
 
@@ -3961,6 +3964,7 @@ static void cgroup_dput(struct cgroup *cgrp)
 	dput(cgrp->dentry);
 	deactivate_super(sb);
 }
+EXPORT_SYMBOL_GPL(__cgroup_dput);
 
 /*
  * Unregister event and free resources.
@@ -3983,7 +3987,7 @@ static void cgroup_event_remove(struct work_struct *work)
 
 	eventfd_ctx_put(event->eventfd);
 	kfree(event);
-	cgroup_dput(cgrp);
+	__cgroup_dput(cgrp);
 }
 
 /*
@@ -4095,9 +4099,9 @@ static int cgroup_write_event_control(struct cgroup_subsys_state *css,
 	if (ret < 0)
 		goto out_put_cfile;
 
-	event->cft = __file_cft(cfile);
-	if (IS_ERR(event->cft)) {
-		ret = PTR_ERR(event->cft);
+	cgrp_cfile = __cgroup_from_dentry(cfile->f_dentry, &event->cft);
+	if (!cgrp_cfile) {
+		ret = -EINVAL;
 		goto out_put_cfile;
 	}
 
@@ -4105,7 +4109,6 @@ static int cgroup_write_event_control(struct cgroup_subsys_state *css,
 	 * The file to be monitored must be in the same cgroup as
 	 * cgroup.event_control is.
 	 */
-	cgrp_cfile = __d_cgrp(cfile->f_dentry->d_parent);
 	if (cgrp_cfile != cgrp) {
 		ret = -EINVAL;
 		goto out_put_cfile;
@@ -4272,7 +4275,7 @@ static void css_dput_fn(struct work_struct *work)
 	struct cgroup_subsys_state *css =
 		container_of(work, struct cgroup_subsys_state, dput_work);
 
-	cgroup_dput(css->cgroup);
+	__cgroup_dput(css->cgroup);
 }
 
 static void css_release(struct percpu_ref *ref)
-- 
1.8.3.1

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH 3/5] cgroup, memcg: move cgroup_event implementation to memcg
  2013-08-04 16:07 [PATCHSET cgroup/for-3.12] cgroup: make cgroup_event specific to memcg Tejun Heo
  2013-08-04 16:07 ` [PATCH 1/5] cgroup: implement CFTYPE_NO_PREFIX Tejun Heo
  2013-08-04 16:07 ` [PATCH 2/5] cgroup: export __cgroup_from_dentry() and __cgroup_dput() Tejun Heo
@ 2013-08-04 16:07 ` Tejun Heo
  2013-08-05  3:14   ` Li Zefan
                     ` (2 more replies)
  2013-08-04 16:07 ` [PATCH 4/5] cgroup, memcg: move cgroup->event_list[_lock] and event callbacks into memcg Tejun Heo
                   ` (2 subsequent siblings)
  5 siblings, 3 replies; 30+ messages in thread
From: Tejun Heo @ 2013-08-04 16:07 UTC (permalink / raw)
  To: lizefan, hannes, mhocko, bsingharora, kamezawa.hiroyu
  Cc: cgroups, linux-mm, linux-kernel, Tejun Heo

cgroup_event is way over-designed and tries to build a generic
flexible event mechanism into cgroup - fully customizable event
specification for each user of the interface.  This is utterly
unnecessary and overboard especially in the light of the planned
unified hierarchy as there's gonna be single agent.  Simply generating
events at fixed points, or if that's too restrictive, configureable
cadence or single set of configureable points should be enough.

Thankfully, memcg is the only user and gets to keep it.  Replacing it
with something simpler on sane_behavior is strongly recommended.

This patch moves cgroup_event and "cgroup.event_control"
implementation to mm/memcontrol.c.  Clearing of events on cgroup
destruction is moved from cgroup_destroy_locked() to
mem_cgroup_css_offline(), which shouldn't make any noticeable
difference.

Note that "cgroup.event_control" will now exist only on the hierarchy
with memcg attached to it.  While this change is visible to userland,
it is unlikely to be noticeable as the file has never been meaningful
outside memcg.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Balbir Singh <bsingharora@gmail.com>
---
 kernel/cgroup.c | 237 -------------------------------------------------------
 mm/memcontrol.c | 238 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 238 insertions(+), 237 deletions(-)

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 2583b7b..a0b5e22 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -56,7 +56,6 @@
 #include <linux/pid_namespace.h>
 #include <linux/idr.h>
 #include <linux/vmalloc.h> /* TODO: replace with more sophisticated array */
-#include <linux/eventfd.h>
 #include <linux/poll.h>
 #include <linux/flex_array.h> /* used in cgroup_attach_task */
 #include <linux/kthread.h>
@@ -154,36 +153,6 @@ struct css_id {
 	unsigned short stack[0]; /* Array of Length (depth+1) */
 };
 
-/*
- * cgroup_event represents events which userspace want to receive.
- */
-struct cgroup_event {
-	/*
-	 * css which the event belongs to.
-	 */
-	struct cgroup_subsys_state *css;
-	/*
-	 * Control file which the event associated.
-	 */
-	struct cftype *cft;
-	/*
-	 * eventfd to signal userspace about the event.
-	 */
-	struct eventfd_ctx *eventfd;
-	/*
-	 * Each of these stored in a list by the cgroup.
-	 */
-	struct list_head list;
-	/*
-	 * All fields below needed to unregister event when
-	 * userspace closes eventfd.
-	 */
-	poll_table pt;
-	wait_queue_head_t *wqh;
-	wait_queue_t wait;
-	struct work_struct remove;
-};
-
 /* The list of hierarchy roots */
 
 static LIST_HEAD(cgroup_roots);
@@ -3966,194 +3935,6 @@ void __cgroup_dput(struct cgroup *cgrp)
 }
 EXPORT_SYMBOL_GPL(__cgroup_dput);
 
-/*
- * Unregister event and free resources.
- *
- * Gets called from workqueue.
- */
-static void cgroup_event_remove(struct work_struct *work)
-{
-	struct cgroup_event *event = container_of(work, struct cgroup_event,
-			remove);
-	struct cgroup_subsys_state *css = event->css;
-	struct cgroup *cgrp = css->cgroup;
-
-	remove_wait_queue(event->wqh, &event->wait);
-
-	event->cft->unregister_event(css, event->cft, event->eventfd);
-
-	/* Notify userspace the event is going away. */
-	eventfd_signal(event->eventfd, 1);
-
-	eventfd_ctx_put(event->eventfd);
-	kfree(event);
-	__cgroup_dput(cgrp);
-}
-
-/*
- * Gets called on POLLHUP on eventfd when user closes it.
- *
- * Called with wqh->lock held and interrupts disabled.
- */
-static int cgroup_event_wake(wait_queue_t *wait, unsigned mode,
-		int sync, void *key)
-{
-	struct cgroup_event *event = container_of(wait,
-			struct cgroup_event, wait);
-	struct cgroup *cgrp = event->css->cgroup;
-	unsigned long flags = (unsigned long)key;
-
-	if (flags & POLLHUP) {
-		/*
-		 * If the event has been detached at cgroup removal, we
-		 * can simply return knowing the other side will cleanup
-		 * for us.
-		 *
-		 * We can't race against event freeing since the other
-		 * side will require wqh->lock via remove_wait_queue(),
-		 * which we hold.
-		 */
-		spin_lock(&cgrp->event_list_lock);
-		if (!list_empty(&event->list)) {
-			list_del_init(&event->list);
-			/*
-			 * We are in atomic context, but cgroup_event_remove()
-			 * may sleep, so we have to call it in workqueue.
-			 */
-			schedule_work(&event->remove);
-		}
-		spin_unlock(&cgrp->event_list_lock);
-	}
-
-	return 0;
-}
-
-static void cgroup_event_ptable_queue_proc(struct file *file,
-		wait_queue_head_t *wqh, poll_table *pt)
-{
-	struct cgroup_event *event = container_of(pt,
-			struct cgroup_event, pt);
-
-	event->wqh = wqh;
-	add_wait_queue(wqh, &event->wait);
-}
-
-/*
- * Parse input and register new cgroup event handler.
- *
- * Input must be in format '<event_fd> <control_fd> <args>'.
- * Interpretation of args is defined by control file implementation.
- */
-static int cgroup_write_event_control(struct cgroup_subsys_state *css,
-				      struct cftype *cft, const char *buffer)
-{
-	struct cgroup *cgrp = css->cgroup;
-	struct cgroup_event *event;
-	struct cgroup *cgrp_cfile;
-	unsigned int efd, cfd;
-	struct file *efile;
-	struct file *cfile;
-	char *endp;
-	int ret;
-
-	efd = simple_strtoul(buffer, &endp, 10);
-	if (*endp != ' ')
-		return -EINVAL;
-	buffer = endp + 1;
-
-	cfd = simple_strtoul(buffer, &endp, 10);
-	if ((*endp != ' ') && (*endp != '\0'))
-		return -EINVAL;
-	buffer = endp + 1;
-
-	event = kzalloc(sizeof(*event), GFP_KERNEL);
-	if (!event)
-		return -ENOMEM;
-	event->css = css;
-	INIT_LIST_HEAD(&event->list);
-	init_poll_funcptr(&event->pt, cgroup_event_ptable_queue_proc);
-	init_waitqueue_func_entry(&event->wait, cgroup_event_wake);
-	INIT_WORK(&event->remove, cgroup_event_remove);
-
-	efile = eventfd_fget(efd);
-	if (IS_ERR(efile)) {
-		ret = PTR_ERR(efile);
-		goto out_kfree;
-	}
-
-	event->eventfd = eventfd_ctx_fileget(efile);
-	if (IS_ERR(event->eventfd)) {
-		ret = PTR_ERR(event->eventfd);
-		goto out_put_efile;
-	}
-
-	cfile = fget(cfd);
-	if (!cfile) {
-		ret = -EBADF;
-		goto out_put_eventfd;
-	}
-
-	/* the process need read permission on control file */
-	/* AV: shouldn't we check that it's been opened for read instead? */
-	ret = inode_permission(file_inode(cfile), MAY_READ);
-	if (ret < 0)
-		goto out_put_cfile;
-
-	cgrp_cfile = __cgroup_from_dentry(cfile->f_dentry, &event->cft);
-	if (!cgrp_cfile) {
-		ret = -EINVAL;
-		goto out_put_cfile;
-	}
-
-	/*
-	 * The file to be monitored must be in the same cgroup as
-	 * cgroup.event_control is.
-	 */
-	if (cgrp_cfile != cgrp) {
-		ret = -EINVAL;
-		goto out_put_cfile;
-	}
-
-	if (!event->cft->register_event || !event->cft->unregister_event) {
-		ret = -EINVAL;
-		goto out_put_cfile;
-	}
-
-	ret = event->cft->register_event(css, event->cft,
-			event->eventfd, buffer);
-	if (ret)
-		goto out_put_cfile;
-
-	efile->f_op->poll(efile, &event->pt);
-
-	/*
-	 * Events should be removed after rmdir of cgroup directory, but before
-	 * destroying subsystem state objects. Let's take reference to cgroup
-	 * directory dentry to do that.
-	 */
-	dget(cgrp->dentry);
-
-	spin_lock(&cgrp->event_list_lock);
-	list_add(&event->list, &cgrp->event_list);
-	spin_unlock(&cgrp->event_list_lock);
-
-	fput(cfile);
-	fput(efile);
-
-	return 0;
-
-out_put_cfile:
-	fput(cfile);
-out_put_eventfd:
-	eventfd_ctx_put(event->eventfd);
-out_put_efile:
-	fput(efile);
-out_kfree:
-	kfree(event);
-
-	return ret;
-}
-
 static u64 cgroup_clone_children_read(struct cgroup_subsys_state *css,
 				      struct cftype *cft)
 {
@@ -4179,11 +3960,6 @@ static struct cftype cgroup_base_files[] = {
 		.mode = S_IRUGO | S_IWUSR,
 	},
 	{
-		.name = "cgroup.event_control",
-		.write_string = cgroup_write_event_control,
-		.mode = S_IWUGO,
-	},
-	{
 		.name = "cgroup.clone_children",
 		.flags = CFTYPE_INSANE,
 		.read_u64 = cgroup_clone_children_read,
@@ -4567,7 +4343,6 @@ static int cgroup_destroy_locked(struct cgroup *cgrp)
 	__releases(&cgroup_mutex) __acquires(&cgroup_mutex)
 {
 	struct dentry *d = cgrp->dentry;
-	struct cgroup_event *event, *tmp;
 	struct cgroup_subsys *ss;
 	bool empty;
 
@@ -4638,18 +4413,6 @@ static int cgroup_destroy_locked(struct cgroup *cgrp)
 	dget(d);
 	cgroup_d_remove_dir(d);
 
-	/*
-	 * Unregister events and notify userspace.
-	 * Notify userspace about cgroup removing only after rmdir of cgroup
-	 * directory to avoid race between userspace and kernelspace.
-	 */
-	spin_lock(&cgrp->event_list_lock);
-	list_for_each_entry_safe(event, tmp, &cgrp->event_list, list) {
-		list_del_init(&event->list);
-		schedule_work(&event->remove);
-	}
-	spin_unlock(&cgrp->event_list_lock);
-
 	return 0;
 };
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 2885e3e..3700b65 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -239,6 +239,36 @@ struct mem_cgroup_eventfd_list {
 	struct eventfd_ctx *eventfd;
 };
 
+/*
+ * cgroup_event represents events which userspace want to receive.
+ */
+struct cgroup_event {
+	/*
+	 * css which the event belongs to.
+	 */
+	struct cgroup_subsys_state *css;
+	/*
+	 * Control file which the event associated.
+	 */
+	struct cftype *cft;
+	/*
+	 * eventfd to signal userspace about the event.
+	 */
+	struct eventfd_ctx *eventfd;
+	/*
+	 * Each of these stored in a list by the cgroup.
+	 */
+	struct list_head list;
+	/*
+	 * All fields below needed to unregister event when
+	 * userspace closes eventfd.
+	 */
+	poll_table pt;
+	wait_queue_head_t *wqh;
+	wait_queue_t wait;
+	struct work_struct remove;
+};
+
 static void mem_cgroup_threshold(struct mem_cgroup *memcg);
 static void mem_cgroup_oom_notify(struct mem_cgroup *memcg);
 
@@ -5926,6 +5956,194 @@ static void kmem_cgroup_css_offline(struct mem_cgroup *memcg)
 }
 #endif
 
+/*
+ * Unregister event and free resources.
+ *
+ * Gets called from workqueue.
+ */
+static void cgroup_event_remove(struct work_struct *work)
+{
+	struct cgroup_event *event = container_of(work, struct cgroup_event,
+			remove);
+	struct cgroup_subsys_state *css = event->css;
+	struct cgroup *cgrp = css->cgroup;
+
+	remove_wait_queue(event->wqh, &event->wait);
+
+	event->cft->unregister_event(css, event->cft, event->eventfd);
+
+	/* Notify userspace the event is going away. */
+	eventfd_signal(event->eventfd, 1);
+
+	eventfd_ctx_put(event->eventfd);
+	kfree(event);
+	__cgroup_dput(cgrp);
+}
+
+/*
+ * Gets called on POLLHUP on eventfd when user closes it.
+ *
+ * Called with wqh->lock held and interrupts disabled.
+ */
+static int cgroup_event_wake(wait_queue_t *wait, unsigned mode,
+		int sync, void *key)
+{
+	struct cgroup_event *event = container_of(wait,
+			struct cgroup_event, wait);
+	struct cgroup *cgrp = event->css->cgroup;
+	unsigned long flags = (unsigned long)key;
+
+	if (flags & POLLHUP) {
+		/*
+		 * If the event has been detached at cgroup removal, we
+		 * can simply return knowing the other side will cleanup
+		 * for us.
+		 *
+		 * We can't race against event freeing since the other
+		 * side will require wqh->lock via remove_wait_queue(),
+		 * which we hold.
+		 */
+		spin_lock(&cgrp->event_list_lock);
+		if (!list_empty(&event->list)) {
+			list_del_init(&event->list);
+			/*
+			 * We are in atomic context, but cgroup_event_remove()
+			 * may sleep, so we have to call it in workqueue.
+			 */
+			schedule_work(&event->remove);
+		}
+		spin_unlock(&cgrp->event_list_lock);
+	}
+
+	return 0;
+}
+
+static void cgroup_event_ptable_queue_proc(struct file *file,
+		wait_queue_head_t *wqh, poll_table *pt)
+{
+	struct cgroup_event *event = container_of(pt,
+			struct cgroup_event, pt);
+
+	event->wqh = wqh;
+	add_wait_queue(wqh, &event->wait);
+}
+
+/*
+ * Parse input and register new memcg event handler.
+ *
+ * Input must be in format '<event_fd> <control_fd> <args>'.
+ * Interpretation of args is defined by control file implementation.
+ */
+static int cgroup_write_event_control(struct cgroup_subsys_state *css,
+				      struct cftype *cft, const char *buffer)
+{
+	struct cgroup *cgrp = css->cgroup;
+	struct cgroup_event *event;
+	struct cgroup *cgrp_cfile;
+	unsigned int efd, cfd;
+	struct file *efile;
+	struct file *cfile;
+	char *endp;
+	int ret;
+
+	efd = simple_strtoul(buffer, &endp, 10);
+	if (*endp != ' ')
+		return -EINVAL;
+	buffer = endp + 1;
+
+	cfd = simple_strtoul(buffer, &endp, 10);
+	if ((*endp != ' ') && (*endp != '\0'))
+		return -EINVAL;
+	buffer = endp + 1;
+
+	event = kzalloc(sizeof(*event), GFP_KERNEL);
+	if (!event)
+		return -ENOMEM;
+	event->css = css;
+	INIT_LIST_HEAD(&event->list);
+	init_poll_funcptr(&event->pt, cgroup_event_ptable_queue_proc);
+	init_waitqueue_func_entry(&event->wait, cgroup_event_wake);
+	INIT_WORK(&event->remove, cgroup_event_remove);
+
+	efile = eventfd_fget(efd);
+	if (IS_ERR(efile)) {
+		ret = PTR_ERR(efile);
+		goto out_kfree;
+	}
+
+	event->eventfd = eventfd_ctx_fileget(efile);
+	if (IS_ERR(event->eventfd)) {
+		ret = PTR_ERR(event->eventfd);
+		goto out_put_efile;
+	}
+
+	cfile = fget(cfd);
+	if (!cfile) {
+		ret = -EBADF;
+		goto out_put_eventfd;
+	}
+
+	/* the process need read permission on control file */
+	/* AV: shouldn't we check that it's been opened for read instead? */
+	ret = inode_permission(file_inode(cfile), MAY_READ);
+	if (ret < 0)
+		goto out_put_cfile;
+
+	cgrp_cfile = __cgroup_from_dentry(cfile->f_dentry, &event->cft);
+	if (!cgrp_cfile) {
+		ret = -EINVAL;
+		goto out_put_cfile;
+	}
+
+	/*
+	 * The file to be monitored must be in the same cgroup as
+	 * cgroup.event_control is.
+	 */
+	if (cgrp_cfile != cgrp) {
+		ret = -EINVAL;
+		goto out_put_cfile;
+	}
+
+	if (!event->cft->register_event || !event->cft->unregister_event) {
+		ret = -EINVAL;
+		goto out_put_cfile;
+	}
+
+	ret = event->cft->register_event(css, event->cft,
+			event->eventfd, buffer);
+	if (ret)
+		goto out_put_cfile;
+
+	efile->f_op->poll(efile, &event->pt);
+
+	/*
+	 * Events should be removed after rmdir of cgroup directory, but before
+	 * destroying subsystem state objects. Let's take reference to cgroup
+	 * directory dentry to do that.
+	 */
+	dget(cgrp->dentry);
+
+	spin_lock(&cgrp->event_list_lock);
+	list_add(&event->list, &cgrp->event_list);
+	spin_unlock(&cgrp->event_list_lock);
+
+	fput(cfile);
+	fput(efile);
+
+	return 0;
+
+out_put_cfile:
+	fput(cfile);
+out_put_eventfd:
+	eventfd_ctx_put(event->eventfd);
+out_put_efile:
+	fput(efile);
+out_kfree:
+	kfree(event);
+
+	return ret;
+}
+
 static struct cftype mem_cgroup_files[] = {
 	{
 		.name = "usage_in_bytes",
@@ -5973,6 +6191,12 @@ static struct cftype mem_cgroup_files[] = {
 		.read_u64 = mem_cgroup_hierarchy_read,
 	},
 	{
+		.name = "cgroup.event_control",
+		.write_string = cgroup_write_event_control,
+		.flags = CFTYPE_NO_PREFIX,
+		.mode = S_IWUGO,
+	},
+	{
 		.name = "swappiness",
 		.read_u64 = mem_cgroup_swappiness_read,
 		.write_u64 = mem_cgroup_swappiness_write,
@@ -6305,6 +6529,20 @@ static void mem_cgroup_invalidate_reclaim_iterators(struct mem_cgroup *memcg)
 static void mem_cgroup_css_offline(struct cgroup_subsys_state *css)
 {
 	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
+	struct cgroup *cgrp = css->cgroup;
+	struct cgroup_event *event, *tmp;
+
+	/*
+	 * Unregister events and notify userspace.
+	 * Notify userspace about cgroup removing only after rmdir of cgroup
+	 * directory to avoid race between userspace and kernelspace.
+	 */
+	spin_lock(&cgrp->event_list_lock);
+	list_for_each_entry_safe(event, tmp, &cgrp->event_list, list) {
+		list_del_init(&event->list);
+		schedule_work(&event->remove);
+	}
+	spin_unlock(&cgrp->event_list_lock);
 
 	kmem_cgroup_css_offline(memcg);
 
-- 
1.8.3.1

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH 4/5] cgroup, memcg: move cgroup->event_list[_lock] and event callbacks into memcg
  2013-08-04 16:07 [PATCHSET cgroup/for-3.12] cgroup: make cgroup_event specific to memcg Tejun Heo
                   ` (2 preceding siblings ...)
  2013-08-04 16:07 ` [PATCH 3/5] cgroup, memcg: move cgroup_event implementation to memcg Tejun Heo
@ 2013-08-04 16:07 ` Tejun Heo
  2013-08-04 16:07 ` [PATCH 5/5] memcg: rename cgroup_event to mem_cgroup_event Tejun Heo
  2013-08-05 16:01 ` [PATCHSET cgroup/for-3.12] cgroup: make cgroup_event specific to memcg Michal Hocko
  5 siblings, 0 replies; 30+ messages in thread
From: Tejun Heo @ 2013-08-04 16:07 UTC (permalink / raw)
  To: lizefan, hannes, mhocko, bsingharora, kamezawa.hiroyu
  Cc: cgroups, linux-mm, linux-kernel, Tejun Heo

cgroup_event is being moved from cgroup core to memcg and the
implementation is already moved by the previous patch.  This patch
moves the data fields and callbacks.

* cgroup->event_list[_lock] are moved to mem_cgroup.

* cftype->[un]register_event() are moved to cgroup_event.  This makes
  it impossible for individual cftype definitions to specify their
  event callbacks.  This is worked around by simply hard-coding
  filename to event callback mapping in cgroup_write_event_control().
  This is awkward and inflexible, which is actually desirable given
  that we don't want to grow more usages of this feature.

* eventfd_ctx declaration is removed from cgroup.h, which makes
  vmpressure.h miss eventfd_ctx declaration.  Include eventfd.h from
  vmpressure.h.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Balbir Singh <bsingharora@gmail.com>
---
 include/linux/cgroup.h     | 24 ---------------
 include/linux/vmpressure.h |  1 +
 kernel/cgroup.c            |  2 --
 mm/memcontrol.c            | 75 ++++++++++++++++++++++++++++++++--------------
 4 files changed, 54 insertions(+), 48 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 2ac1021..e33eb7b 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -29,7 +29,6 @@ struct cgroup_subsys;
 struct inode;
 struct cgroup;
 struct css_id;
-struct eventfd_ctx;
 
 extern int cgroup_init_early(void);
 extern int cgroup_init(void);
@@ -233,10 +232,6 @@ struct cgroup {
 	struct work_struct destroy_work;
 	atomic_t css_kill_cnt;
 
-	/* List of events which userspace want to receive */
-	struct list_head event_list;
-	spinlock_t event_list_lock;
-
 	/* directory xattrs */
 	struct simple_xattrs xattrs;
 };
@@ -500,25 +495,6 @@ struct cftype {
 	int (*trigger)(struct cgroup_subsys_state *css, unsigned int event);
 
 	int (*release)(struct inode *inode, struct file *file);
-
-	/*
-	 * register_event() callback will be used to add new userspace
-	 * waiter for changes related to the cftype. Implement it if
-	 * you want to provide this functionality. Use eventfd_signal()
-	 * on eventfd to send notification to userspace.
-	 */
-	int (*register_event)(struct cgroup_subsys_state *css,
-			      struct cftype *cft, struct eventfd_ctx *eventfd,
-			      const char *args);
-	/*
-	 * unregister_event() callback will be called when userspace
-	 * closes the eventfd or on cgroup removing.
-	 * This callback must be implemented, if you want provide
-	 * notification functionality.
-	 */
-	void (*unregister_event)(struct cgroup_subsys_state *css,
-				 struct cftype *cft,
-				 struct eventfd_ctx *eventfd);
 };
 
 /*
diff --git a/include/linux/vmpressure.h b/include/linux/vmpressure.h
index b239482..324ea7a 100644
--- a/include/linux/vmpressure.h
+++ b/include/linux/vmpressure.h
@@ -7,6 +7,7 @@
 #include <linux/gfp.h>
 #include <linux/types.h>
 #include <linux/cgroup.h>
+#include <linux/eventfd.h>
 
 struct vmpressure {
 	unsigned long scanned;
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index a0b5e22..2bc45d3 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1352,8 +1352,6 @@ static void init_cgroup_housekeeping(struct cgroup *cgrp)
 	INIT_LIST_HEAD(&cgrp->pidlists);
 	mutex_init(&cgrp->pidlist_mutex);
 	cgrp->dummy_css.cgroup = cgrp;
-	INIT_LIST_HEAD(&cgrp->event_list);
-	spin_lock_init(&cgrp->event_list_lock);
 	simple_xattrs_init(&cgrp->xattrs);
 }
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 3700b65..e988bf1 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -260,6 +260,22 @@ struct cgroup_event {
 	 */
 	struct list_head list;
 	/*
+	 * register_event() callback will be used to add new userspace
+	 * waiter for changes related to this event.  Use eventfd_signal()
+	 * on eventfd to send notification to userspace.
+	 */
+	int (*register_event)(struct cgroup_subsys_state *css,
+			      struct cftype *cft, struct eventfd_ctx *eventfd,
+			      const char *args);
+	/*
+	 * unregister_event() callback will be called when userspace closes
+	 * the eventfd or on cgroup removing.  This callback must be set,
+	 * if you want provide notification functionality.
+	 */
+	void (*unregister_event)(struct cgroup_subsys_state *css,
+				 struct cftype *cft,
+				 struct eventfd_ctx *eventfd);
+	/*
 	 * All fields below needed to unregister event when
 	 * userspace closes eventfd.
 	 */
@@ -372,6 +388,10 @@ struct mem_cgroup {
 	atomic_t	numainfo_updating;
 #endif
 
+	/* List of events which userspace want to receive */
+	struct list_head event_list;
+	spinlock_t event_list_lock;
+
 	struct mem_cgroup_per_node *nodeinfo[0];
 	/* WARNING: nodeinfo must be the last member here */
 };
@@ -5970,7 +5990,7 @@ static void cgroup_event_remove(struct work_struct *work)
 
 	remove_wait_queue(event->wqh, &event->wait);
 
-	event->cft->unregister_event(css, event->cft, event->eventfd);
+	event->unregister_event(css, event->cft, event->eventfd);
 
 	/* Notify userspace the event is going away. */
 	eventfd_signal(event->eventfd, 1);
@@ -5990,7 +6010,7 @@ static int cgroup_event_wake(wait_queue_t *wait, unsigned mode,
 {
 	struct cgroup_event *event = container_of(wait,
 			struct cgroup_event, wait);
-	struct cgroup *cgrp = event->css->cgroup;
+	struct mem_cgroup *memcg = mem_cgroup_from_css(event->css);
 	unsigned long flags = (unsigned long)key;
 
 	if (flags & POLLHUP) {
@@ -6003,7 +6023,7 @@ static int cgroup_event_wake(wait_queue_t *wait, unsigned mode,
 		 * side will require wqh->lock via remove_wait_queue(),
 		 * which we hold.
 		 */
-		spin_lock(&cgrp->event_list_lock);
+		spin_lock(&memcg->event_list_lock);
 		if (!list_empty(&event->list)) {
 			list_del_init(&event->list);
 			/*
@@ -6012,7 +6032,7 @@ static int cgroup_event_wake(wait_queue_t *wait, unsigned mode,
 			 */
 			schedule_work(&event->remove);
 		}
-		spin_unlock(&cgrp->event_list_lock);
+		spin_unlock(&memcg->event_list_lock);
 	}
 
 	return 0;
@@ -6037,6 +6057,7 @@ static void cgroup_event_ptable_queue_proc(struct file *file,
 static int cgroup_write_event_control(struct cgroup_subsys_state *css,
 				      struct cftype *cft, const char *buffer)
 {
+	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
 	struct cgroup *cgrp = css->cgroup;
 	struct cgroup_event *event;
 	struct cgroup *cgrp_cfile;
@@ -6104,13 +6125,30 @@ static int cgroup_write_event_control(struct cgroup_subsys_state *css,
 		goto out_put_cfile;
 	}
 
-	if (!event->cft->register_event || !event->cft->unregister_event) {
+	/*
+	 * Determine the event callbacks and set them in @event.  This used
+	 * to be done via struct cftype but cgroup core no longer knows
+	 * about these events.  The following is crude but the whole thing
+	 * is for compatibility anyway.
+	 */
+	if (!strcmp(event->cft->name, "usage_in_bytes")) {
+		event->register_event = mem_cgroup_usage_register_event;
+		event->unregister_event = mem_cgroup_usage_unregister_event;
+	} else if (!strcmp(event->cft->name, "oom_control")) {
+		event->register_event = mem_cgroup_oom_register_event;
+		event->unregister_event = mem_cgroup_oom_unregister_event;
+	} else if (!strcmp(event->cft->name, "pressure_level")) {
+		event->register_event = vmpressure_register_event;
+		event->unregister_event = vmpressure_unregister_event;
+	} else if (!strcmp(event->cft->name, "memsw.usage_in_bytes")) {
+		event->register_event = mem_cgroup_usage_register_event;
+		event->unregister_event = mem_cgroup_usage_unregister_event;
+	} else {
 		ret = -EINVAL;
 		goto out_put_cfile;
 	}
 
-	ret = event->cft->register_event(css, event->cft,
-			event->eventfd, buffer);
+	ret = event->register_event(css, event->cft, event->eventfd, buffer);
 	if (ret)
 		goto out_put_cfile;
 
@@ -6123,9 +6161,9 @@ static int cgroup_write_event_control(struct cgroup_subsys_state *css,
 	 */
 	dget(cgrp->dentry);
 
-	spin_lock(&cgrp->event_list_lock);
-	list_add(&event->list, &cgrp->event_list);
-	spin_unlock(&cgrp->event_list_lock);
+	spin_lock(&memcg->event_list_lock);
+	list_add(&event->list, &memcg->event_list);
+	spin_unlock(&memcg->event_list_lock);
 
 	fput(cfile);
 	fput(efile);
@@ -6149,8 +6187,6 @@ static struct cftype mem_cgroup_files[] = {
 		.name = "usage_in_bytes",
 		.private = MEMFILE_PRIVATE(_MEM, RES_USAGE),
 		.read = mem_cgroup_read,
-		.register_event = mem_cgroup_usage_register_event,
-		.unregister_event = mem_cgroup_usage_unregister_event,
 	},
 	{
 		.name = "max_usage_in_bytes",
@@ -6210,14 +6246,10 @@ static struct cftype mem_cgroup_files[] = {
 		.name = "oom_control",
 		.read_map = mem_cgroup_oom_control_read,
 		.write_u64 = mem_cgroup_oom_control_write,
-		.register_event = mem_cgroup_oom_register_event,
-		.unregister_event = mem_cgroup_oom_unregister_event,
 		.private = MEMFILE_PRIVATE(_OOM_TYPE, OOM_CONTROL),
 	},
 	{
 		.name = "pressure_level",
-		.register_event = vmpressure_register_event,
-		.unregister_event = vmpressure_unregister_event,
 	},
 #ifdef CONFIG_NUMA
 	{
@@ -6265,8 +6297,6 @@ static struct cftype memsw_cgroup_files[] = {
 		.name = "memsw.usage_in_bytes",
 		.private = MEMFILE_PRIVATE(_MEMSWAP, RES_USAGE),
 		.read = mem_cgroup_read,
-		.register_event = mem_cgroup_usage_register_event,
-		.unregister_event = mem_cgroup_usage_unregister_event,
 	},
 	{
 		.name = "memsw.max_usage_in_bytes",
@@ -6457,6 +6487,8 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
 	mutex_init(&memcg->thresholds_lock);
 	spin_lock_init(&memcg->move_lock);
 	vmpressure_init(&memcg->vmpressure);
+	INIT_LIST_HEAD(&memcg->event_list);
+	spin_lock_init(&memcg->event_list_lock);
 
 	return &memcg->css;
 
@@ -6529,7 +6561,6 @@ static void mem_cgroup_invalidate_reclaim_iterators(struct mem_cgroup *memcg)
 static void mem_cgroup_css_offline(struct cgroup_subsys_state *css)
 {
 	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
-	struct cgroup *cgrp = css->cgroup;
 	struct cgroup_event *event, *tmp;
 
 	/*
@@ -6537,12 +6568,12 @@ static void mem_cgroup_css_offline(struct cgroup_subsys_state *css)
 	 * Notify userspace about cgroup removing only after rmdir of cgroup
 	 * directory to avoid race between userspace and kernelspace.
 	 */
-	spin_lock(&cgrp->event_list_lock);
-	list_for_each_entry_safe(event, tmp, &cgrp->event_list, list) {
+	spin_lock(&memcg->event_list_lock);
+	list_for_each_entry_safe(event, tmp, &memcg->event_list, list) {
 		list_del_init(&event->list);
 		schedule_work(&event->remove);
 	}
-	spin_unlock(&cgrp->event_list_lock);
+	spin_unlock(&memcg->event_list_lock);
 
 	kmem_cgroup_css_offline(memcg);
 
-- 
1.8.3.1

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH 5/5] memcg: rename cgroup_event to mem_cgroup_event
  2013-08-04 16:07 [PATCHSET cgroup/for-3.12] cgroup: make cgroup_event specific to memcg Tejun Heo
                   ` (3 preceding siblings ...)
  2013-08-04 16:07 ` [PATCH 4/5] cgroup, memcg: move cgroup->event_list[_lock] and event callbacks into memcg Tejun Heo
@ 2013-08-04 16:07 ` Tejun Heo
  2013-08-05  3:26   ` Li Zefan
  2013-08-05 17:10   ` [PATCH v2 " Tejun Heo
  2013-08-05 16:01 ` [PATCHSET cgroup/for-3.12] cgroup: make cgroup_event specific to memcg Michal Hocko
  5 siblings, 2 replies; 30+ messages in thread
From: Tejun Heo @ 2013-08-04 16:07 UTC (permalink / raw)
  To: lizefan, hannes, mhocko, bsingharora, kamezawa.hiroyu
  Cc: cgroups, linux-mm, linux-kernel, Tejun Heo

cgroup_event is only available in memcg now.  Let's brand it that way.
While at it, add a comment encouraging deprecation of the feature and
remove the respective section from cgroup documentation.

This patch is cosmetic.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 Documentation/cgroups/cgroups.txt | 19 -------------
 mm/memcontrol.c                   | 57 +++++++++++++++++++++++++--------------
 2 files changed, 37 insertions(+), 39 deletions(-)

diff --git a/Documentation/cgroups/cgroups.txt b/Documentation/cgroups/cgroups.txt
index 638bf17..ca5aee9 100644
--- a/Documentation/cgroups/cgroups.txt
+++ b/Documentation/cgroups/cgroups.txt
@@ -472,25 +472,6 @@ you give a subsystem a name.
 The name of the subsystem appears as part of the hierarchy description
 in /proc/mounts and /proc/<pid>/cgroups.
 
-2.4 Notification API
---------------------
-
-There is mechanism which allows to get notifications about changing
-status of a cgroup.
-
-To register a new notification handler you need to:
- - create a file descriptor for event notification using eventfd(2);
- - open a control file to be monitored (e.g. memory.usage_in_bytes);
- - write "<event_fd> <control_fd> <args>" to cgroup.event_control.
-   Interpretation of args is defined by control file implementation;
-
-eventfd will be woken up by control file implementation or when the
-cgroup is removed.
-
-To unregister a notification handler just close eventfd.
-
-NOTE: Support of notifications should be implemented for the control
-file. See documentation for the subsystem.
 
 3. Kernel API
 =============
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e988bf1..240ae72 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -242,7 +242,7 @@ struct mem_cgroup_eventfd_list {
 /*
  * cgroup_event represents events which userspace want to receive.
  */
-struct cgroup_event {
+struct mem_cgroup_event {
 	/*
 	 * css which the event belongs to.
 	 */
@@ -5977,14 +5977,27 @@ static void kmem_cgroup_css_offline(struct mem_cgroup *memcg)
 #endif
 
 /*
+ * DO NOT USE IN NEW FILES.
+ *
+ * "cgroup.event_control" implementation.
+ *
+ * This is way over-engineered.  It tries to support fully configureable
+ * events for each user.  Such level of flexibility is completely
+ * unnecessary especially in the light of the planned unified hierarchy.
+ *
+ * Please deprecate this and replace with something simpler if at all
+ * possible.
+ */
+
+/*
  * Unregister event and free resources.
  *
  * Gets called from workqueue.
  */
-static void cgroup_event_remove(struct work_struct *work)
+static void memcg_event_remove(struct work_struct *work)
 {
-	struct cgroup_event *event = container_of(work, struct cgroup_event,
-			remove);
+	struct mem_cgroup_event *event = container_of(work,
+					struct mem_cgroup_event, remove);
 	struct cgroup_subsys_state *css = event->css;
 	struct cgroup *cgrp = css->cgroup;
 
@@ -6005,11 +6018,11 @@ static void cgroup_event_remove(struct work_struct *work)
  *
  * Called with wqh->lock held and interrupts disabled.
  */
-static int cgroup_event_wake(wait_queue_t *wait, unsigned mode,
-		int sync, void *key)
+static int memcg_event_wake(wait_queue_t *wait, unsigned mode,
+			    int sync, void *key)
 {
-	struct cgroup_event *event = container_of(wait,
-			struct cgroup_event, wait);
+	struct mem_cgroup_event *event =
+		container_of(wait, struct mem_cgroup_event, wait);
 	struct mem_cgroup *memcg = mem_cgroup_from_css(event->css);
 	unsigned long flags = (unsigned long)key;
 
@@ -6038,28 +6051,30 @@ static int cgroup_event_wake(wait_queue_t *wait, unsigned mode,
 	return 0;
 }
 
-static void cgroup_event_ptable_queue_proc(struct file *file,
+static void memcg_event_ptable_queue_proc(struct file *file,
 		wait_queue_head_t *wqh, poll_table *pt)
 {
-	struct cgroup_event *event = container_of(pt,
-			struct cgroup_event, pt);
+	struct mem_cgroup_event *event =
+		container_of(pt, struct mem_cgroup_event, pt);
 
 	event->wqh = wqh;
 	add_wait_queue(wqh, &event->wait);
 }
 
 /*
+ * DO NOT USE IN NEW FILES.
+ *
  * Parse input and register new memcg event handler.
  *
  * Input must be in format '<event_fd> <control_fd> <args>'.
  * Interpretation of args is defined by control file implementation.
  */
-static int cgroup_write_event_control(struct cgroup_subsys_state *css,
-				      struct cftype *cft, const char *buffer)
+static int memcg_write_event_control(struct cgroup_subsys_state *css,
+				     struct cftype *cft, const char *buffer)
 {
 	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
 	struct cgroup *cgrp = css->cgroup;
-	struct cgroup_event *event;
+	struct mem_cgroup_event *event;
 	struct cgroup *cgrp_cfile;
 	unsigned int efd, cfd;
 	struct file *efile;
@@ -6082,9 +6097,9 @@ static int cgroup_write_event_control(struct cgroup_subsys_state *css,
 		return -ENOMEM;
 	event->css = css;
 	INIT_LIST_HEAD(&event->list);
-	init_poll_funcptr(&event->pt, cgroup_event_ptable_queue_proc);
-	init_waitqueue_func_entry(&event->wait, cgroup_event_wake);
-	INIT_WORK(&event->remove, cgroup_event_remove);
+	init_poll_funcptr(&event->pt, memcg_event_ptable_queue_proc);
+	init_waitqueue_func_entry(&event->wait, memcg_event_wake);
+	INIT_WORK(&event->remove, memcg_event_remove);
 
 	efile = eventfd_fget(efd);
 	if (IS_ERR(efile)) {
@@ -6130,6 +6145,8 @@ static int cgroup_write_event_control(struct cgroup_subsys_state *css,
 	 * to be done via struct cftype but cgroup core no longer knows
 	 * about these events.  The following is crude but the whole thing
 	 * is for compatibility anyway.
+	 *
+	 * DO NOT ADD NEW FILES.
 	 */
 	if (!strcmp(event->cft->name, "usage_in_bytes")) {
 		event->register_event = mem_cgroup_usage_register_event;
@@ -6227,8 +6244,8 @@ static struct cftype mem_cgroup_files[] = {
 		.read_u64 = mem_cgroup_hierarchy_read,
 	},
 	{
-		.name = "cgroup.event_control",
-		.write_string = cgroup_write_event_control,
+		.name = "cgroup.event_control",		/* XXX: for compat */
+		.write_string = memcg_write_event_control,
 		.flags = CFTYPE_NO_PREFIX,
 		.mode = S_IWUGO,
 	},
@@ -6561,7 +6578,7 @@ static void mem_cgroup_invalidate_reclaim_iterators(struct mem_cgroup *memcg)
 static void mem_cgroup_css_offline(struct cgroup_subsys_state *css)
 {
 	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
-	struct cgroup_event *event, *tmp;
+	struct mem_cgroup_event *event, *tmp;
 
 	/*
 	 * Unregister events and notify userspace.
-- 
1.8.3.1

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [PATCH 2/5] cgroup: export __cgroup_from_dentry() and __cgroup_dput()
  2013-08-04 16:07 ` [PATCH 2/5] cgroup: export __cgroup_from_dentry() and __cgroup_dput() Tejun Heo
@ 2013-08-05  2:58   ` Li Zefan
  2013-08-05 15:40     ` Tejun Heo
  2013-08-05 17:08   ` [PATCH v2 2/5] cgroup: make __cgroup_from_dentry() and __cgroup_dput() global Tejun Heo
  1 sibling, 1 reply; 30+ messages in thread
From: Li Zefan @ 2013-08-05  2:58 UTC (permalink / raw)
  To: Tejun Heo
  Cc: hannes, mhocko, bsingharora, kamezawa.hiroyu, cgroups, linux-mm,
	linux-kernel

> +struct cgroup *__cgroup_from_dentry(struct dentry *dentry, struct cftype **cftp)
>  {
> -	if (file_inode(file)->i_fop != &cgroup_file_operations)
> -		return ERR_PTR(-EINVAL);
> -	return __d_cft(file->f_dentry);
> +	if (!dentry->d_inode ||
> +	    dentry->d_inode->i_op != &cgroup_file_inode_operations)
> +		return NULL;
> +
> +	if (cftp)
> +		*cftp = __d_cft(dentry);
> +	return __d_cgrp(dentry->d_parent);
>  }
> +EXPORT_SYMBOL_GPL(__cgroup_from_dentry);

As we don't expect new users, why export this symbol? memcg can't be
built as a module.

>  
>  static int cgroup_create_file(struct dentry *dentry, umode_t mode,
>  				struct super_block *sb)
> @@ -3953,7 +3956,7 @@ static int cgroup_write_notify_on_release(struct cgroup_subsys_state *css,
>   *
>   * That's why we hold a reference before dput() and drop it right after.
>   */
> -static void cgroup_dput(struct cgroup *cgrp)
> +void __cgroup_dput(struct cgroup *cgrp)
>  {
>  	struct super_block *sb = cgrp->root->sb;
>  
> @@ -3961,6 +3964,7 @@ static void cgroup_dput(struct cgroup *cgrp)
>  	dput(cgrp->dentry);
>  	deactivate_super(sb);
>  }
> +EXPORT_SYMBOL_GPL(__cgroup_dput);

ditto

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 3/5] cgroup, memcg: move cgroup_event implementation to memcg
  2013-08-04 16:07 ` [PATCH 3/5] cgroup, memcg: move cgroup_event implementation to memcg Tejun Heo
@ 2013-08-05  3:14   ` Li Zefan
  2013-08-05 17:09   ` [PATCH v2 " Tejun Heo
  2013-08-06  3:26   ` [PATCH " Balbir Singh
  2 siblings, 0 replies; 30+ messages in thread
From: Li Zefan @ 2013-08-05  3:14 UTC (permalink / raw)
  To: Tejun Heo
  Cc: hannes, mhocko, bsingharora, kamezawa.hiroyu, cgroups, linux-mm,
	linux-kernel

OU 2013/8/5 0:07, Tejun Heo D'uA:
> cgroup_event is way over-designed and tries to build a generic
> flexible event mechanism into cgroup - fully customizable event
> specification for each user of the interface.  This is utterly
> unnecessary and overboard especially in the light of the planned
> unified hierarchy as there's gonna be single agent.  Simply generating
> events at fixed points, or if that's too restrictive, configureable
> cadence or single set of configureable points should be enough.
> 
> Thankfully, memcg is the only user and gets to keep it.  Replacing it
> with something simpler on sane_behavior is strongly recommended.
> 
> This patch moves cgroup_event and "cgroup.event_control"
> implementation to mm/memcontrol.c.  Clearing of events on cgroup
> destruction is moved from cgroup_destroy_locked() to
> mem_cgroup_css_offline(), which shouldn't make any noticeable
> difference.
> 
> Note that "cgroup.event_control" will now exist only on the hierarchy
> with memcg attached to it.  While this change is visible to userland,
> it is unlikely to be noticeable as the file has never been meaningful
> outside memcg.
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Michal Hocko <mhocko@suse.cz>
> Cc: Balbir Singh <bsingharora@gmail.com>
> ---
>  kernel/cgroup.c | 237 -------------------------------------------------------
>  mm/memcontrol.c | 238 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 238 insertions(+), 237 deletions(-)
> 

init/Kconfig needs to be updated too:

menuconfig CGROUPS
        boolean "Control Group support"
        depends on EVENTFD
...
config SCHED_AUTOGROUP
        bool "Automatic process group scheduling"
        select EVENTFD
        select CGROUPS

> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
> index 2583b7b..a0b5e22 100644
> --- a/kernel/cgroup.c
> +++ b/kernel/cgroup.c
> @@ -56,7 +56,6 @@
>  #include <linux/pid_namespace.h>
>  #include <linux/idr.h>
>  #include <linux/vmalloc.h> /* TODO: replace with more sophisticated array */
> -#include <linux/eventfd.h>
>  #include <linux/poll.h>

poll.h also can be removed.

>  #include <linux/flex_array.h> /* used in cgroup_attach_task */
>  #include <linux/kthread.h>
> @@ -154,36 +153,6 @@ struct css_id {
>  	unsigned short stack[0]; /* Array of Length (depth+1) */
>  };
>  

[...]

>  
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 2885e3e..3700b65 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c

#include <linux/eventfd.h>
#include <linux/poll.h>

> @@ -239,6 +239,36 @@ struct mem_cgroup_eventfd_list {
>  	struct eventfd_ctx *eventfd;
>  };


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 5/5] memcg: rename cgroup_event to mem_cgroup_event
  2013-08-04 16:07 ` [PATCH 5/5] memcg: rename cgroup_event to mem_cgroup_event Tejun Heo
@ 2013-08-05  3:26   ` Li Zefan
  2013-08-05 17:10   ` [PATCH v2 " Tejun Heo
  1 sibling, 0 replies; 30+ messages in thread
From: Li Zefan @ 2013-08-05  3:26 UTC (permalink / raw)
  To: Tejun Heo
  Cc: hannes, mhocko, bsingharora, kamezawa.hiroyu, cgroups, linux-mm,
	linux-kernel

On 2013/8/5 0:07, Tejun Heo wrote:
> cgroup_event is only available in memcg now.  Let's brand it that way.
> While at it, add a comment encouraging deprecation of the feature and
> remove the respective section from cgroup documentation.
> 
> This patch is cosmetic.
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>
> ---
>  Documentation/cgroups/cgroups.txt | 19 -------------
>  mm/memcontrol.c                   | 57 +++++++++++++++++++++++++--------------
>  2 files changed, 37 insertions(+), 39 deletions(-)
> 
> diff --git a/Documentation/cgroups/cgroups.txt b/Documentation/cgroups/cgroups.txt
> index 638bf17..ca5aee9 100644
> --- a/Documentation/cgroups/cgroups.txt
> +++ b/Documentation/cgroups/cgroups.txt
> @@ -472,25 +472,6 @@ you give a subsystem a name.
>  The name of the subsystem appears as part of the hierarchy description
>  in /proc/mounts and /proc/<pid>/cgroups.
> 

2. Usage Examples and Syntax
  2.1 Basic Usage
  2.2 Attaching processes
  2.3 Mounting hierarchies by name
  2.4 Notification API

remove the index ?
 
> -2.4 Notification API
> ---------------------
> -
> -There is mechanism which allows to get notifications about changing
> -status of a cgroup.
> -
> -To register a new notification handler you need to:
> - - create a file descriptor for event notification using eventfd(2);
> - - open a control file to be monitored (e.g. memory.usage_in_bytes);
> - - write "<event_fd> <control_fd> <args>" to cgroup.event_control.
> -   Interpretation of args is defined by control file implementation;
> -
> -eventfd will be woken up by control file implementation or when the
> -cgroup is removed.
> -
> -To unregister a notification handler just close eventfd.
> -
> -NOTE: Support of notifications should be implemented for the control
> -file. See documentation for the subsystem.
>  

Why not move this section to Documentation/cgroups/memory.txt?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 2/5] cgroup: export __cgroup_from_dentry() and __cgroup_dput()
  2013-08-05  2:58   ` Li Zefan
@ 2013-08-05 15:40     ` Tejun Heo
  0 siblings, 0 replies; 30+ messages in thread
From: Tejun Heo @ 2013-08-05 15:40 UTC (permalink / raw)
  To: Li Zefan
  Cc: hannes, mhocko, bsingharora, kamezawa.hiroyu, cgroups, linux-mm,
	linux-kernel

On Mon, Aug 05, 2013 at 10:58:13AM +0800, Li Zefan wrote:
> > +struct cgroup *__cgroup_from_dentry(struct dentry *dentry, struct cftype **cftp)
> >  {
> > -	if (file_inode(file)->i_fop != &cgroup_file_operations)
> > -		return ERR_PTR(-EINVAL);
> > -	return __d_cft(file->f_dentry);
> > +	if (!dentry->d_inode ||
> > +	    dentry->d_inode->i_op != &cgroup_file_inode_operations)
> > +		return NULL;
> > +
> > +	if (cftp)
> > +		*cftp = __d_cft(dentry);
> > +	return __d_cgrp(dentry->d_parent);
> >  }
> > +EXPORT_SYMBOL_GPL(__cgroup_from_dentry);
> 
> As we don't expect new users, why export this symbol? memcg can't be
> built as a module.

Yeah, I for some reason was thinking memcg could be bulit as module.
Brainfart.  Dropped.

Thanks.

-- 
tejun

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCHSET cgroup/for-3.12] cgroup: make cgroup_event specific to memcg
  2013-08-04 16:07 [PATCHSET cgroup/for-3.12] cgroup: make cgroup_event specific to memcg Tejun Heo
                   ` (4 preceding siblings ...)
  2013-08-04 16:07 ` [PATCH 5/5] memcg: rename cgroup_event to mem_cgroup_event Tejun Heo
@ 2013-08-05 16:01 ` Michal Hocko
  2013-08-05 16:29   ` Tejun Heo
  5 siblings, 1 reply; 30+ messages in thread
From: Michal Hocko @ 2013-08-05 16:01 UTC (permalink / raw)
  To: Tejun Heo
  Cc: lizefan, hannes, bsingharora, kamezawa.hiroyu, cgroups, linux-mm,
	linux-kernel

On Sun 04-08-13 12:07:21, Tejun Heo wrote:
> Hello,

Hi Tejun,

> Like many other things in cgroup, cgroup_event is way too flexible and
> complex - it strives to provide completely flexible event monitoring
> facility in cgroup proper which allows any number of users to monitor
> custom events.  This is overboard, to say the least,

Could you be more specific about what is so "overboard" about this
interface? I am not familiar with internals much, so I cannot judge the
complexity part, but I thought that eventfd was intended for this kind
of kernel->userspace notifications.

> and I strongly think that cgroup should not any new usages of this
> facility and preferably deprecate the existing usages if at all
> possible.

So you think that vmpressure, oom notification or thresholds are
an abuse of this interface? What would you consider a reasonable
replacement for those notifications?  Or do you think that controller
shouldn't be signaling any conditions to the userspace at all?

[...]
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCHSET cgroup/for-3.12] cgroup: make cgroup_event specific to memcg
  2013-08-05 16:01 ` [PATCHSET cgroup/for-3.12] cgroup: make cgroup_event specific to memcg Michal Hocko
@ 2013-08-05 16:29   ` Tejun Heo
  2013-08-05 19:16     ` Michal Hocko
  0 siblings, 1 reply; 30+ messages in thread
From: Tejun Heo @ 2013-08-05 16:29 UTC (permalink / raw)
  To: Michal Hocko
  Cc: lizefan, hannes, bsingharora, kamezawa.hiroyu, cgroups, linux-mm,
	linux-kernel

Hello, Michal.

On Mon, Aug 05, 2013 at 06:01:07PM +0200, Michal Hocko wrote:
> Could you be more specific about what is so "overboard" about this
> interface? I am not familiar with internals much, so I cannot judge the
> complexity part, but I thought that eventfd was intended for this kind
> of kernel->userspace notifications.

It's just way over-engineered like many other things in cgroup, most
likely misguided by the appearance that cgroup could be delegated and
accessed by multiple actors concurrently.

The most clear example would be the vmpressure event.  When it could
have just called fsnotify_modify() unconditionally when the state
changes, now it involves parsing, dynamic list of events and so on
without actually adding any benefits.  For the usage ones,
configurability makes some sense but even then just giving it a single
array of event points of limited size would be sufficient.

It's just way over-done.

> So you think that vmpressure, oom notification or thresholds are
> an abuse of this interface? What would you consider a reasonable
> replacement for those notifications?  Or do you think that controller
> shouldn't be signaling any conditions to the userspace at all?

I don't think the ability to generate events are an abuse, just that
the facility itself is way over-engineered.  Just generate a file
changed event unconditionally for vmpressure and oom and maybe
implement configureable cadence or single set of threshold array for
threshold events.  These are things which can and should be done in a
a few tens of lines of code with far simpler interface.  There's no
need for this obsecenely flexible event infrastructure, which of
course leads to things like shared contiguous threshold table without
any size limit and allocated with kmalloc().

So, let's please move towards something simple.

Thanks.

-- 
tejun

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH v2 2/5] cgroup: make __cgroup_from_dentry() and __cgroup_dput() global
  2013-08-04 16:07 ` [PATCH 2/5] cgroup: export __cgroup_from_dentry() and __cgroup_dput() Tejun Heo
  2013-08-05  2:58   ` Li Zefan
@ 2013-08-05 17:08   ` Tejun Heo
  1 sibling, 0 replies; 30+ messages in thread
From: Tejun Heo @ 2013-08-05 17:08 UTC (permalink / raw)
  To: lizefan, hannes, mhocko, bsingharora, kamezawa.hiroyu
  Cc: cgroups, linux-mm, linux-kernel



^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH v2 3/5] cgroup, memcg: move cgroup_event implementation to memcg
  2013-08-04 16:07 ` [PATCH 3/5] cgroup, memcg: move cgroup_event implementation to memcg Tejun Heo
  2013-08-05  3:14   ` Li Zefan
@ 2013-08-05 17:09   ` Tejun Heo
  2013-08-06  2:02     ` Li Zefan
  2013-08-06  3:26   ` [PATCH " Balbir Singh
  2 siblings, 1 reply; 30+ messages in thread
From: Tejun Heo @ 2013-08-05 17:09 UTC (permalink / raw)
  To: lizefan, hannes, mhocko, bsingharora, kamezawa.hiroyu
  Cc: cgroups, linux-mm, linux-kernel



^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH v2 5/5] memcg: rename cgroup_event to mem_cgroup_event
  2013-08-04 16:07 ` [PATCH 5/5] memcg: rename cgroup_event to mem_cgroup_event Tejun Heo
  2013-08-05  3:26   ` Li Zefan
@ 2013-08-05 17:10   ` Tejun Heo
  1 sibling, 0 replies; 30+ messages in thread
From: Tejun Heo @ 2013-08-05 17:10 UTC (permalink / raw)
  To: lizefan, hannes, mhocko, bsingharora, kamezawa.hiroyu
  Cc: cgroups, linux-mm, linux-kernel



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCHSET cgroup/for-3.12] cgroup: make cgroup_event specific to memcg
  2013-08-05 16:29   ` Tejun Heo
@ 2013-08-05 19:16     ` Michal Hocko
  2013-08-05 19:44       ` Tejun Heo
  0 siblings, 1 reply; 30+ messages in thread
From: Michal Hocko @ 2013-08-05 19:16 UTC (permalink / raw)
  To: Tejun Heo
  Cc: lizefan, hannes, bsingharora, kamezawa.hiroyu, cgroups, linux-mm,
	linux-kernel

On Mon 05-08-13 12:29:58, Tejun Heo wrote:
> Hello, Michal.
> 
> On Mon, Aug 05, 2013 at 06:01:07PM +0200, Michal Hocko wrote:
> > Could you be more specific about what is so "overboard" about this
> > interface? I am not familiar with internals much, so I cannot judge the
> > complexity part, but I thought that eventfd was intended for this kind
> > of kernel->userspace notifications.
> 
> It's just way over-engineered like many other things in cgroup, most
> likely misguided by the appearance that cgroup could be delegated and
> accessed by multiple actors concurrently.

I keep hearing that over and over. And I also keep hearing that there
are users who do not like many simplifications because they are breaking
their usecases. Users are those who matter to me. Hey some of them are
even sane...

> The most clear example would be the vmpressure event.  When it could
> have just called fsnotify_modify() unconditionally when the state
> changes, now it involves parsing, dynamic list of events and so on
> without actually adding any benefits.

I am neither author nor user of this interface but my understanding is
that there are different requirements from different usecases and it
would be hard to satisfy them without having a way for userspace
to tell the kernel what it is interested in. There was a discussion
about edge vs. all-events triggered signaling recently for example.

Besides that, is fsnotify really an interface to be used under memory
pressure? I might be wrong but from a quick look fsnotify depends on
GFP_KERNEL allocation which would be no-go for oom_control at least.
How does the reclaim context gets to struct file to notify? I am pretty
sure we would get to more and more questions when digging further.

I am all for simplifications, but removing interfaces just because you
feel they are "over-done" is not a way to go IMHO. In this particular
case you are removing an interface from cgroup core which has users,
and will have to support them for very long time. "It is just memcg
so move it there" is not a way that different subsystems should work
together and I am _not_ going to ack such a move. All the flexibility that
you are so complaining about is hidden from the cgroup core in register
callbacks and the rest is only the core infrastructure (registration and
unregistration).

And btw. a common notification interface at least makes things
consistent and prevents controllers to invent their one purpose
solutions.

So I am really skeptical about this patch set. It doesn't give anything.
It just moves a code which you do not like out of your sight hoping that
something will change.

There were mistakes done in the past. And some interfaces are really too
flexible but that doesn't mean we should be militant about everything.

> For the usage ones, configurability makes some sense but even then
> just giving it a single array of event points of limited size would be
> sufficient.

This would be a question for users. I am not one of those so I cannot
tell you but I certainly cannot claim that something more coarse would
be sufficient either.

> It's just way over-done.

> > So you think that vmpressure, oom notification or thresholds are
> > an abuse of this interface? What would you consider a reasonable
> > replacement for those notifications?  Or do you think that controller
> > shouldn't be signaling any conditions to the userspace at all?
> 
> I don't think the ability to generate events are an abuse, just that
> the facility itself is way over-engineered.  Just generate a file
> changed event unconditionally for vmpressure and oom and maybe
> implement configureable cadence or single set of threshold array for
> threshold events.  These are things which can and should be done in a
> a few tens of lines of code with far simpler interface. 

These are strong words without any justification.

[...]
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCHSET cgroup/for-3.12] cgroup: make cgroup_event specific to memcg
  2013-08-05 19:16     ` Michal Hocko
@ 2013-08-05 19:44       ` Tejun Heo
  2013-08-06 15:58         ` Michal Hocko
  0 siblings, 1 reply; 30+ messages in thread
From: Tejun Heo @ 2013-08-05 19:44 UTC (permalink / raw)
  To: Michal Hocko
  Cc: lizefan, hannes, bsingharora, kamezawa.hiroyu, cgroups, linux-mm,
	linux-kernel

Hey, Michal.

On Mon, Aug 05, 2013 at 09:16:41PM +0200, Michal Hocko wrote:
> I keep hearing that over and over. And I also keep hearing that there
> are users who do not like many simplifications because they are breaking
> their usecases. Users are those who matter to me. Hey some of them are
> even sane...

There needs to be some balance between serving use cases and cutting
off edge ones which pollutes the code with unjustified complexity.  We
went way far to one end and are trying to cut back, so complaints are
expected.

> Besides that, is fsnotify really an interface to be used under memory
> pressure? I might be wrong but from a quick look fsnotify depends on
> GFP_KERNEL allocation which would be no-go for oom_control at least.

Yeap, I'm pretty sure you're wrong.  I didn't follow the development
but this is wrapper to combine inotify and dnotify and kernel's main
FS event mechanism and none of the two mechanisms required memory
allocation during delivery.

> How does the reclaim context gets to struct file to notify? I am pretty
> sure we would get to more and more questions when digging further.

The file is optional and probably just to extract path for dnotify, I
think.  Not sure about namespace implications but I'll build cgroup
level interface for it so that controllers won't have to deal with it
directly.  ie. there'll be css_notify_file(css, cfe) interface.

> I am all for simplifications, but removing interfaces just because you
> feel they are "over-done" is not a way to go IMHO. In this particular

We can keep it around but it's a pretty good time to gradually move
towards something saner as we're essentially doing v2 of the
interface.  Note that there's no reason for the interface to disappear
immediately.  You can keep it and there won't be any problem for it to
co-exist with simpler mechanism.

> case you are removing an interface from cgroup core which has users,
> and will have to support them for very long time. "It is just memcg
> so move it there" is not a way that different subsystems should work
> together and I am _not_ going to ack such a move. All the flexibility that
> you are so complaining about is hidden from the cgroup core in register
> callbacks and the rest is only the core infrastructure (registration and
> unregistration).

memcg is the only user and will stay that way.  If you wanna keep it
around, be my guest.  Also, cftype is planned to be simplified so that
it just provides seq_file interface and the types become helper
routines like a normal file callback interface.  There'll be nothing
tying the event file handling to cgroup core and it surely won't be
used by anyone else.  memcg is and will be the only user.  It's the
natural place for the code.

> And btw. a common notification interface at least makes things
> consistent and prevents controllers to invent their one purpose
> solutions.

We'll surely have a common notification interface which is simple -
basically just one function - as it should be.

> So I am really skeptical about this patch set. It doesn't give anything.
> It just moves a code which you do not like out of your sight hoping that
> something will change.
> 
> There were mistakes done in the past. And some interfaces are really too
> flexible but that doesn't mean we should be militant about everything.
> 
> > For the usage ones, configurability makes some sense but even then
> > just giving it a single array of event points of limited size would be
> > sufficient.
> 
> This would be a question for users. I am not one of those so I cannot
> tell you but I certainly cannot claim that something more coarse would
> be sufficient either.

I can tell you because there will be a single agent of the hierarchy
with unified hierarchy and both control and events will be routed
through it.  As such usage restriction rises from inherent properties
of the current cgroup design and implementation, it isn't something
which can be properly worked around.

> > It's just way over-done.
> 
> > > So you think that vmpressure, oom notification or thresholds are
> > > an abuse of this interface? What would you consider a reasonable
> > > replacement for those notifications?  Or do you think that controller
> > > shouldn't be signaling any conditions to the userspace at all?
> > 
> > I don't think the ability to generate events are an abuse, just that
> > the facility itself is way over-engineered.  Just generate a file
> > changed event unconditionally for vmpressure and oom and maybe
> > implement configureable cadence or single set of threshold array for
> > threshold events.  These are things which can and should be done in a
> > a few tens of lines of code with far simpler interface. 
> 
> These are strong words without any justification.

If you think this level of flexibility is healthy, we'll have to agree
to disagree.  Just look at it this way - this is memcg specific piece
of code and this patchset is just natural reorganization of code.

Thanks.

-- 
tejun

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 3/5] cgroup, memcg: move cgroup_event implementation to memcg
  2013-08-05 17:09   ` [PATCH v2 " Tejun Heo
@ 2013-08-06  2:02     ` Li Zefan
  2013-08-06  2:21       ` Li Zefan
  0 siblings, 1 reply; 30+ messages in thread
From: Li Zefan @ 2013-08-06  2:02 UTC (permalink / raw)
  To: Tejun Heo
  Cc: hannes, mhocko, bsingharora, kamezawa.hiroyu, cgroups, linux-mm,
	linux-kernel

>  static struct cftype mem_cgroup_files[] = {
>  	{
>  		.name = "usage_in_bytes",
> @@ -5973,6 +6192,12 @@ static struct cftype mem_cgroup_files[] = {
>  		.read_u64 = mem_cgroup_hierarchy_read,
>  	},
>  	{
> +		.name = "cgroup.event_control",
> +		.write_string = cgroup_write_event_control,
> +		.flags = CFTYPE_NO_PREFIX,
> +		.mode = S_IWUGO,
> +	},

One of the misdesign of cgroup eventfd is, cgroup.event_control is
totally redunant...

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 3/5] cgroup, memcg: move cgroup_event implementation to memcg
  2013-08-06  2:02     ` Li Zefan
@ 2013-08-06  2:21       ` Li Zefan
  0 siblings, 0 replies; 30+ messages in thread
From: Li Zefan @ 2013-08-06  2:21 UTC (permalink / raw)
  To: Tejun Heo
  Cc: hannes, mhocko, bsingharora, kamezawa.hiroyu, cgroups, linux-mm,
	linux-kernel

On 2013/8/6 10:02, Li Zefan wrote:
>>  static struct cftype mem_cgroup_files[] = {
>>  	{
>>  		.name = "usage_in_bytes",
>> @@ -5973,6 +6192,12 @@ static struct cftype mem_cgroup_files[] = {
>>  		.read_u64 = mem_cgroup_hierarchy_read,
>>  	},
>>  	{
>> +		.name = "cgroup.event_control",
>> +		.write_string = cgroup_write_event_control,
>> +		.flags = CFTYPE_NO_PREFIX,
>> +		.mode = S_IWUGO,
>> +	},
> 
> One of the misdesign of cgroup eventfd is, cgroup.event_control is
> totally redunant...
> 

ok. write_string() is needed to accept arguments and pass them to
the event register function. still not good.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 3/5] cgroup, memcg: move cgroup_event implementation to memcg
  2013-08-04 16:07 ` [PATCH 3/5] cgroup, memcg: move cgroup_event implementation to memcg Tejun Heo
  2013-08-05  3:14   ` Li Zefan
  2013-08-05 17:09   ` [PATCH v2 " Tejun Heo
@ 2013-08-06  3:26   ` Balbir Singh
  2013-08-06 14:09     ` Tejun Heo
  2 siblings, 1 reply; 30+ messages in thread
From: Balbir Singh @ 2013-08-06  3:26 UTC (permalink / raw)
  To: Tejun Heo
  Cc: lizefan, Johannes Weiner, Michal Hocko, KAMEZAWA Hiroyuki,
	cgroups@vger.kernel.org, linux-mm, linux-kernel@vger.kernel.org

On Sun, Aug 4, 2013 at 9:37 PM, Tejun Heo <tj@kernel.org> wrote:
> cgroup_event is way over-designed and tries to build a generic
> flexible event mechanism into cgroup - fully customizable event
> specification for each user of the interface.  This is utterly
> unnecessary and overboard especially in the light of the planned
> unified hierarchy as there's gonna be single agent.  Simply generating

[off-topic] Has the unified hierarchy been agreed upon? I did not
follow that thread

> events at fixed points, or if that's too restrictive, configureable
> cadence or single set of configureable points should be enough.
>

Nit-pick: typo on the spelling of configurable

> Thankfully, memcg is the only user and gets to keep it.  Replacing it
> with something simpler on sane_behavior is strongly recommended.
>
> This patch moves cgroup_event and "cgroup.event_control"
> implementation to mm/memcontrol.c.  Clearing of events on cgroup
> destruction is moved from cgroup_destroy_locked() to
> mem_cgroup_css_offline(), which shouldn't make any noticeable
> difference.
>
> Note that "cgroup.event_control" will now exist only on the hierarchy
> with memcg attached to it.  While this change is visible to userland,
> it is unlikely to be noticeable as the file has never been meaningful
> outside memcg.
>

Tejun, I think the framework was designed to be flexible. Do you see
cgroup subsystems never using this?

> Signed-off-by: Tejun Heo <tj@kernel.org>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Michal Hocko <mhocko@suse.cz>
> Cc: Balbir Singh <bsingharora@gmail.com>
> ---
>  kernel/cgroup.c | 237 -------------------------------------------------------
>  mm/memcontrol.c | 238 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 238 insertions(+), 237 deletions(-)
>
...
> +/*
> + * cgroup_event represents events which userspace want to receive.
> + */
> +struct cgroup_event {
> +       /*
> +        * css which the event belongs to.
> +        */
> +       struct cgroup_subsys_state *css;
> +       /*
> +        * Control file which the event associated.
> +        */
> +       struct cftype *cft;
> +       /*
> +        * eventfd to signal userspace about the event.
> +        */
> +       struct eventfd_ctx *eventfd;
> +       /*
> +        * Each of these stored in a list by the cgroup.
> +        */
> +       struct list_head list;
> +       /*
> +        * All fields below needed to unregister event when
> +        * userspace closes eventfd.
> +        */
> +       poll_table pt;
> +       wait_queue_head_t *wqh;
> +       wait_queue_t wait;
> +       struct work_struct remove;
> +};
> +
>  static void mem_cgroup_threshold(struct mem_cgroup *memcg);
>  static void mem_cgroup_oom_notify(struct mem_cgroup *memcg);
>
> @@ -5926,6 +5956,194 @@ static void kmem_cgroup_css_offline(struct mem_cgroup *memcg)
>  }
>  #endif
>
> +/*
> + * Unregister event and free resources.
> + *
> + * Gets called from workqueue.
> + */
> +static void cgroup_event_remove(struct work_struct *work)
> +{
> +       struct cgroup_event *event = container_of(work, struct cgroup_event,
> +                       remove);
> +       struct cgroup_subsys_state *css = event->css;
> +       struct cgroup *cgrp = css->cgroup;
> +
> +       remove_wait_queue(event->wqh, &event->wait);
> +
> +       event->cft->unregister_event(css, event->cft, event->eventfd);
> +
> +       /* Notify userspace the event is going away. */
> +       eventfd_signal(event->eventfd, 1);
> +
> +       eventfd_ctx_put(event->eventfd);
> +       kfree(event);
> +       __cgroup_dput(cgrp);
> +}
> +
> +/*
> + * Gets called on POLLHUP on eventfd when user closes it.
> + *
> + * Called with wqh->lock held and interrupts disabled.
> + */
> +static int cgroup_event_wake(wait_queue_t *wait, unsigned mode,
> +               int sync, void *key)
> +{
> +       struct cgroup_event *event = container_of(wait,
> +                       struct cgroup_event, wait);
> +       struct cgroup *cgrp = event->css->cgroup;
> +       unsigned long flags = (unsigned long)key;
> +
> +       if (flags & POLLHUP) {
> +               /*
> +                * If the event has been detached at cgroup removal, we
> +                * can simply return knowing the other side will cleanup
> +                * for us.
> +                *
> +                * We can't race against event freeing since the other
> +                * side will require wqh->lock via remove_wait_queue(),
> +                * which we hold.
> +                */
> +               spin_lock(&cgrp->event_list_lock);
> +               if (!list_empty(&event->list)) {
> +                       list_del_init(&event->list);
> +                       /*
> +                        * We are in atomic context, but cgroup_event_remove()
> +                        * may sleep, so we have to call it in workqueue.
> +                        */
> +                       schedule_work(&event->remove);
> +               }
> +               spin_unlock(&cgrp->event_list_lock);
> +       }
> +
> +       return 0;
> +}
> +
> +static void cgroup_event_ptable_queue_proc(struct file *file,
> +               wait_queue_head_t *wqh, poll_table *pt)
> +{
> +       struct cgroup_event *event = container_of(pt,
> +                       struct cgroup_event, pt);
> +
> +       event->wqh = wqh;
> +       add_wait_queue(wqh, &event->wait);
> +}
> +
> +/*
> + * Parse input and register new memcg event handler.
> + *
> + * Input must be in format '<event_fd> <control_fd> <args>'.
> + * Interpretation of args is defined by control file implementation.
> + */
> +static int cgroup_write_event_control(struct cgroup_subsys_state *css,
> +                                     struct cftype *cft, const char *buffer)
> +{
> +       struct cgroup *cgrp = css->cgroup;
> +       struct cgroup_event *event;
> +       struct cgroup *cgrp_cfile;
> +       unsigned int efd, cfd;
> +       struct file *efile;
> +       struct file *cfile;
> +       char *endp;
> +       int ret;
> +

Can we assert that buffer is NOT NULL here?

> +       efd = simple_strtoul(buffer, &endp, 10);
> +       if (*endp != ' ')
> +               return -EINVAL;
> +       buffer = endp + 1;
> +
> +       cfd = simple_strtoul(buffer, &endp, 10);
> +       if ((*endp != ' ') && (*endp != '\0'))
> +               return -EINVAL;

Shouldn't we be using kstroull(), I thought the simple functions were
obsolete now.

> +       buffer = endp + 1;
> +
> +       event = kzalloc(sizeof(*event), GFP_KERNEL);
> +       if (!event)
> +               return -ENOMEM;
> +       event->css = css;
> +       INIT_LIST_HEAD(&event->list);
> +       init_poll_funcptr(&event->pt, cgroup_event_ptable_queue_proc);
> +       init_waitqueue_func_entry(&event->wait, cgroup_event_wake);
> +       INIT_WORK(&event->remove, cgroup_event_remove);
> +
> +       efile = eventfd_fget(efd);
> +       if (IS_ERR(efile)) {
> +               ret = PTR_ERR(efile);
> +               goto out_kfree;
> +       }
> +
> +       event->eventfd = eventfd_ctx_fileget(efile);
> +       if (IS_ERR(event->eventfd)) {
> +               ret = PTR_ERR(event->eventfd);
> +               goto out_put_efile;
> +       }
> +
> +       cfile = fget(cfd);
> +       if (!cfile) {
> +               ret = -EBADF;
> +               goto out_put_eventfd;
> +       }
> +
> +       /* the process need read permission on control file */
> +       /* AV: shouldn't we check that it's been opened for read instead? */
> +       ret = inode_permission(file_inode(cfile), MAY_READ);
> +       if (ret < 0)
> +               goto out_put_cfile;
> +
> +       cgrp_cfile = __cgroup_from_dentry(cfile->f_dentry, &event->cft);
> +       if (!cgrp_cfile) {
> +               ret = -EINVAL;
> +               goto out_put_cfile;
> +       }
> +
> +       /*
> +        * The file to be monitored must be in the same cgroup as
> +        * cgroup.event_control is.
> +        */
> +       if (cgrp_cfile != cgrp) {
> +               ret = -EINVAL;
> +               goto out_put_cfile;
> +       }
> +
> +       if (!event->cft->register_event || !event->cft->unregister_event) {
> +               ret = -EINVAL;
> +               goto out_put_cfile;
> +       }
> +
> +       ret = event->cft->register_event(css, event->cft,
> +                       event->eventfd, buffer);
> +       if (ret)
> +               goto out_put_cfile;
> +
> +       efile->f_op->poll(efile, &event->pt);
> +
> +       /*
> +        * Events should be removed after rmdir of cgroup directory, but before
> +        * destroying subsystem state objects. Let's take reference to cgroup
> +        * directory dentry to do that.
> +        */
> +       dget(cgrp->dentry);
> +
> +       spin_lock(&cgrp->event_list_lock);
> +       list_add(&event->list, &cgrp->event_list);
> +       spin_unlock(&cgrp->event_list_lock);
> +
> +       fput(cfile);
> +       fput(efile);
> +
> +       return 0;
> +
> +out_put_cfile:
> +       fput(cfile);
> +out_put_eventfd:
> +       eventfd_ctx_put(event->eventfd);
> +out_put_efile:
> +       fput(efile);
> +out_kfree:
> +       kfree(event);
> +
> +       return ret;
> +}
> +
>  static struct cftype mem_cgroup_files[] = {
>         {
>                 .name = "usage_in_bytes",
> @@ -5973,6 +6191,12 @@ static struct cftype mem_cgroup_files[] = {
>                 .read_u64 = mem_cgroup_hierarchy_read,
>         },
>         {
> +               .name = "cgroup.event_control",
> +               .write_string = cgroup_write_event_control,
> +               .flags = CFTYPE_NO_PREFIX,
> +               .mode = S_IWUGO,

So everyone has write permissions? I guess we don't want to break this

> +       },
> +       {
>                 .name = "swappiness",
>                 .read_u64 = mem_cgroup_swappiness_read,
>                 .write_u64 = mem_cgroup_swappiness_write,
> @@ -6305,6 +6529,20 @@ static void mem_cgroup_invalidate_reclaim_iterators(struct mem_cgroup *memcg)
>  static void mem_cgroup_css_offline(struct cgroup_subsys_state *css)
>  {
>         struct mem_cgroup *memcg = mem_cgroup_from_css(css);
> +       struct cgroup *cgrp = css->cgroup;

With the new refactoring, I presume css and cgroup always co-exist, so
css->cgroup cannot change.

> +       struct cgroup_event *event, *tmp;
> +
> +       /*
> +        * Unregister events and notify userspace.
> +        * Notify userspace about cgroup removing only after rmdir of cgroup
> +        * directory to avoid race between userspace and kernelspace.
> +        */
> +       spin_lock(&cgrp->event_list_lock);
> +       list_for_each_entry_safe(event, tmp, &cgrp->event_list, list) {
> +               list_del_init(&event->list);
> +               schedule_work(&event->remove);
> +       }
> +       spin_unlock(&cgrp->event_list_lock);
>
>         kmem_cgroup_css_offline(memcg);
>
> --
> 1.8.3.1
>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 3/5] cgroup, memcg: move cgroup_event implementation to memcg
  2013-08-06  3:26   ` [PATCH " Balbir Singh
@ 2013-08-06 14:09     ` Tejun Heo
  2013-08-06 16:03       ` Balbir Singh
  0 siblings, 1 reply; 30+ messages in thread
From: Tejun Heo @ 2013-08-06 14:09 UTC (permalink / raw)
  To: Balbir Singh
  Cc: lizefan, Johannes Weiner, Michal Hocko, KAMEZAWA Hiroyuki,
	cgroups@vger.kernel.org, linux-mm, linux-kernel@vger.kernel.org

Hello, Balbir.

On Tue, Aug 06, 2013 at 08:56:34AM +0530, Balbir Singh wrote:
> [off-topic] Has the unified hierarchy been agreed upon? I did not
> follow that thread

I consider it agreed upon enough.  There of course are objections but
I feel fairly comfortable with the amount of existing consensus and
considering the current state of cgroup in general, especially the API
leaks, I don't think we have many other choices.  The devil is always
in the details but unless we meet a major technical barrier, I'm
pretty sure it's happening.

> > events at fixed points, or if that's too restrictive, configureable
> > cadence or single set of configureable points should be enough.
> 
> Nit-pick: typo on the spelling of configurable

Will update.

> Tejun, I think the framework was designed to be flexible. Do you see
> cgroup subsystems never using this?

I can't be a hundred percent sure that we won't need events which are
configureable per-listener but it's highly unlikely given that we're
moving onto single agent model and the nature of event delivery -
spurious events are unlikely to be noticeable unless the frequency is
very high.  In general, as anything, aiming for extremes isn't a
healthy design practice.  Maximum flexibility sounds good in isolation
but nothing is free and it entails unneeded complexity both in
implementation and usage.  Note that even for memcg, both oom and
vmpressure don't benefit in any way from all the added complexity at
all.  The only other place that I can see event being useful at the
moment is freezer state change notification and that also would only
require unconditional file modified notification.

> > +static int cgroup_write_event_control(struct cgroup_subsys_state *css,
> > +                                     struct cftype *cft, const char *buffer)
> > +{
> > +       struct cgroup *cgrp = css->cgroup;
> > +       struct cgroup_event *event;
> > +       struct cgroup *cgrp_cfile;
> > +       unsigned int efd, cfd;
> > +       struct file *efile;
> > +       struct file *cfile;
> > +       char *endp;
> > +       int ret;
> > +
> 
> Can we assert that buffer is NOT NULL here?

The patch moves the code as-is as things become difficult to review
otherwise.  After the patchset, it belongs to memcg, please feel free
to modify as memcg people see fit.

Thanks.

-- 
tejun

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCHSET cgroup/for-3.12] cgroup: make cgroup_event specific to memcg
  2013-08-05 19:44       ` Tejun Heo
@ 2013-08-06 15:58         ` Michal Hocko
  2013-08-06 16:15           ` Tejun Heo
  0 siblings, 1 reply; 30+ messages in thread
From: Michal Hocko @ 2013-08-06 15:58 UTC (permalink / raw)
  To: Tejun Heo
  Cc: lizefan, hannes, bsingharora, kamezawa.hiroyu, cgroups, linux-mm,
	linux-kernel

On Mon 05-08-13 15:44:31, Tejun Heo wrote:
> Hey, Michal.
> 
> On Mon, Aug 05, 2013 at 09:16:41PM +0200, Michal Hocko wrote:
[...]
> > Besides that, is fsnotify really an interface to be used under memory
> > pressure? I might be wrong but from a quick look fsnotify depends on
> > GFP_KERNEL allocation which would be no-go for oom_control at least.
> 
> Yeap, I'm pretty sure you're wrong.  I didn't follow the development
> but this is wrapper to combine inotify and dnotify and kernel's main
> FS event mechanism and none of the two mechanisms required memory
> allocation during delivery.

I might be really wrong here (I see memory allocations down the
fsnotify_modify callpath) but this is totally irrelevant to the
patchset.

If there is a better interface in the future then I have no objection
to move to it and keep the old one just for legacy usage for certain
amount of time. I am definitely not arguing for eventfd being the best
interface.

I am objecting to moving the generic part of that code into memcg. The
memcg part and the additional complexity (all the parsing and conditions
for signalling) is already in the memcg code.
 
> > How does the reclaim context gets to struct file to notify? I am pretty
> > sure we would get to more and more questions when digging further.
> 
> The file is optional and probably just to extract path for dnotify, I
> think.  Not sure about namespace implications but I'll build cgroup
> level interface for it so that controllers won't have to deal with it
> directly.  ie. there'll be css_notify_file(css, cfe) interface.

Such an interface would be really welcome but I would also ask how
it would implement/allow context passing. E.g. how do we know which
treshold has been reached? How do we find out the vmpressure level? Is
the consumer supposed to do an additional action after it gets
notification?
Etc.

> > I am all for simplifications, but removing interfaces just because you
> > feel they are "over-done" is not a way to go IMHO. In this particular
> 
> We can keep it around but it's a pretty good time to gradually move
> towards something saner as we're essentially doing v2 of the
> interface.  Note that there's no reason for the interface to disappear
> immediately.  You can keep it and there won't be any problem for it to
> co-exist with simpler mechanism.

Yes that is an expectation of the users...

> > case you are removing an interface from cgroup core which has users,
> > and will have to support them for very long time. "It is just memcg
> > so move it there" is not a way that different subsystems should work
> > together and I am _not_ going to ack such a move. All the flexibility that
> > you are so complaining about is hidden from the cgroup core in register
> > callbacks and the rest is only the core infrastructure (registration and
> > unregistration).
> 
> memcg is the only user and will stay that way.  If you wanna keep it
> around, be my guest.

OK, so why do you move the generic part to the memcg then?

> Also, cftype is planned to be simplified so that
> it just provides seq_file interface and the types become helper
> routines like a normal file callback interface.  There'll be nothing
> tying the event file handling to cgroup core and it surely won't be
> used by anyone else.  memcg is and will be the only user.  It's the
> natural place for the code.

Really that natural? So memcg should touch internals like cgroup dentry
reference counting. You seem have forgotten all the hassles with
cgroup_mutex, haven't you?
No that part doesn't belong to memcg! You can discourage from new usage
of this interface of course.
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 3/5] cgroup, memcg: move cgroup_event implementation to memcg
  2013-08-06 14:09     ` Tejun Heo
@ 2013-08-06 16:03       ` Balbir Singh
  0 siblings, 0 replies; 30+ messages in thread
From: Balbir Singh @ 2013-08-06 16:03 UTC (permalink / raw)
  To: Tejun Heo
  Cc: lizefan, Johannes Weiner, Michal Hocko, KAMEZAWA Hiroyuki,
	cgroups@vger.kernel.org, linux-mm, linux-kernel@vger.kernel.org

On Tue, Aug 6, 2013 at 7:39 PM, Tejun Heo <tj@kernel.org> wrote:
> Hello, Balbir.
>
> On Tue, Aug 06, 2013 at 08:56:34AM +0530, Balbir Singh wrote:
>> [off-topic] Has the unified hierarchy been agreed upon? I did not
>> follow that thread
>
> I consider it agreed upon enough.  There of course are objections but
> I feel fairly comfortable with the amount of existing consensus and
> considering the current state of cgroup in general, especially the API
> leaks, I don't think we have many other choices.  The devil is always
> in the details but unless we meet a major technical barrier, I'm
> pretty sure it's happening.
>

I guess we'll need to see what the details look like :)

>> > events at fixed points, or if that's too restrictive, configureable
>> > cadence or single set of configureable points should be enough.
>>
>> Nit-pick: typo on the spelling of configurable
>
> Will update.
>
>> Tejun, I think the framework was designed to be flexible. Do you see
>> cgroup subsystems never using this?
>
> I can't be a hundred percent sure that we won't need events which are
> configureable per-listener but it's highly unlikely given that we're
> moving onto single agent model and the nature of event delivery -
> spurious events are unlikely to be noticeable unless the frequency is
> very high.  In general, as anything, aiming for extremes isn't a
> healthy design practice.  Maximum flexibility sounds good in isolation
> but nothing is free and it entails unneeded complexity both in
> implementation and usage.  Note that even for memcg, both oom and
> vmpressure don't benefit in any way from all the added complexity at
> all.  The only other place that I can see event being useful at the
> moment is freezer state change notification and that also would only
> require unconditional file modified notification.
>

Hmm.. I think it would be interesting to set thresholds on other
resources instead of pure monitoring. One might want to set thresholds
for CPUACCT usage for example. Freezer is another example like you
mentioned.

>> > +static int cgroup_write_event_control(struct cgroup_subsys_state *css,
>> > +                                     struct cftype *cft, const char *buffer)
>> > +{
>> > +       struct cgroup *cgrp = css->cgroup;
>> > +       struct cgroup_event *event;
>> > +       struct cgroup *cgrp_cfile;
>> > +       unsigned int efd, cfd;
>> > +       struct file *efile;
>> > +       struct file *cfile;
>> > +       char *endp;
>> > +       int ret;
>> > +
>>
>> Can we assert that buffer is NOT NULL here?
>
> The patch moves the code as-is as things become difficult to review
> otherwise.  After the patchset, it belongs to memcg, please feel free
> to modify as memcg people see fit.

OK, Thanks

Balbir Singh

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCHSET cgroup/for-3.12] cgroup: make cgroup_event specific to memcg
  2013-08-06 15:58         ` Michal Hocko
@ 2013-08-06 16:15           ` Tejun Heo
  2013-08-07 12:18             ` Michal Hocko
  0 siblings, 1 reply; 30+ messages in thread
From: Tejun Heo @ 2013-08-06 16:15 UTC (permalink / raw)
  To: Michal Hocko
  Cc: lizefan, hannes, bsingharora, kamezawa.hiroyu, cgroups, linux-mm,
	linux-kernel

Hello, Michal.

On Tue, Aug 06, 2013 at 05:58:04PM +0200, Michal Hocko wrote:
> I am objecting to moving the generic part of that code into memcg. The
> memcg part and the additional complexity (all the parsing and conditions
> for signalling) is already in the memcg code.

But how is it generic if it's specific to memcg?  The practical
purpose here is making it clear that the interface is only used by
memcg and preventing any new usages from sprining up and the best way
to achieve that is making the code actually memcg-specific.  It also
helps cleaning up cftype in general.  I'm not sure what you're
objecting to here.

> Such an interface would be really welcome but I would also ask how
> it would implement/allow context passing. E.g. how do we know which
> treshold has been reached? How do we find out the vmpressure level? Is
> the consumer supposed to do an additional action after it gets
> notification?
> Etc.

Yeap, exactly and that's how it should have been from the beginning.
Attaching information to notification itself isn't a particularly good
design (anyone remembers rtsig?) if there's polling mechanism to
report the current state.  It essentially amounts to duplicate
delivery mechanisms for the same information, which you usually don't
want.  Here, the inconvenience / performance implications are
negligible or even net-positive.  Plain file modified notification is
way more familiar / conventional and the overhead of an extra read
call, which is highly unlikely to be relevant given the expected
frequency of the events we're talking about, is small compared to the
action of event delivery and context switch.

> Really that natural? So memcg should touch internals like cgroup dentry

Functionally, it is completely specific to memcg at this point.  It's
the only user and will stay the only user.

> reference counting. You seem have forgotten all the hassles with
> cgroup_mutex, haven't you?

Was the above sentence necessary?

> No that part doesn't belong to memcg! You can discourage from new usage
> of this interface of course.

Oh, if you're objecting to the details of the implementation, we of
course can clean it up.  It should conceptually and functionally be
part of memcg and that is the guiding line we follow.  Implementations
follow the concepts and functions, not the other way around.  The
refcnt of course can be replaced with memcg css refcnting and we can
of course factor out dentry comparison in a prettier form.

Compare it to the other way around - having event callbacks in cftype
and clearing code embedded in cgroup core destruction path when both
of which are completely irrelevant to all other controllers.  Let's
clean up the implementation details and put things where they belong.
What's the excuse for not doing so when it's almost trivially doable?

Thanks.

-- 
tejun

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCHSET cgroup/for-3.12] cgroup: make cgroup_event specific to memcg
  2013-08-06 16:15           ` Tejun Heo
@ 2013-08-07 12:18             ` Michal Hocko
  2013-08-07 12:43               ` Tejun Heo
  0 siblings, 1 reply; 30+ messages in thread
From: Michal Hocko @ 2013-08-07 12:18 UTC (permalink / raw)
  To: Tejun Heo
  Cc: lizefan, hannes, bsingharora, kamezawa.hiroyu, cgroups, linux-mm,
	linux-kernel

On Tue 06-08-13 12:15:09, Tejun Heo wrote:
> Hello, Michal.
> 
> On Tue, Aug 06, 2013 at 05:58:04PM +0200, Michal Hocko wrote:
> > I am objecting to moving the generic part of that code into memcg. The
> > memcg part and the additional complexity (all the parsing and conditions
> > for signalling) is already in the memcg code.
> 
> But how is it generic if it's specific to memcg?

How is it specific to memcg? The fact only memcg uses the interface
doesn't imply it is memcg specific.

> The practical
> purpose here is making it clear that the interface is only used by
> memcg and preventing any new usages from sprining up and the best way
> to achieve that is making the code actually memcg-specific. 

There are other ways to achieve the same. E.g. not ack new usage of
register callback users. We have done similar with other things like
use_hierarchy...

> It also helps cleaning up cftype in general.  I'm not sure what you're
> objecting to here.

The cleanup is removing 2 callbacks with a cost of moving non-memcg
specific code inside memcg. That is what I am objecting to.
 
> > Such an interface would be really welcome but I would also ask how
> > it would implement/allow context passing. E.g. how do we know which
> > treshold has been reached? How do we find out the vmpressure level? Is
> > the consumer supposed to do an additional action after it gets
> > notification?
> > Etc.
> 
> Yeap, exactly and that's how it should have been from the beginning.
> Attaching information to notification itself isn't a particularly good
> design (anyone remembers rtsig?) if there's polling mechanism to
> report the current state. 

There are pros and cons for both approaches and it should be discussed
in a separate thread with a code to back all the claims.

> It essentially amounts to duplicate delivery mechanisms for the same
> information, which you usually don't want.  Here, the inconvenience /
> performance implications are negligible or even net-positive.  Plain
> file modified notification is way more familiar / conventional and
> the overhead of an extra read call, which is highly unlikely to be
> relevant given the expected frequency of the events we're talking
> about, is small compared to the action of event delivery and context
> switch.
> 
> > Really that natural? So memcg should touch internals like cgroup dentry
> 
> Functionally, it is completely specific to memcg at this point.  It's
> the only user and will stay the only user.
> 
> > reference counting. You seem have forgotten all the hassles with
> > cgroup_mutex, haven't you?
> 
> Was the above sentence necessary?
> 
> > No that part doesn't belong to memcg! You can discourage from new usage
> > of this interface of course.
> 
> Oh, if you're objecting to the details of the implementation, we of
> course can clean it up.  It should conceptually and functionally be
> part of memcg and that is the guiding line we follow.  Implementations
> follow the concepts and functions, not the other way around.  The
> refcnt of course can be replaced with memcg css refcnting and we can
> of course factor out dentry comparison in a prettier form.
> 
> Compare it to the other way around - having event callbacks in cftype
> and clearing code embedded in cgroup core destruction path when both
> of which are completely irrelevant to all other controllers.  Let's
> clean up the implementation details and put things where they belong.
> What's the excuse for not doing so when it's almost trivially doable?

I will not repeat myself. We seem to disagree on where the code belongs.
As I've said I will not ack this code, try to find somebody else who
think it is a good idea. I do not see any added value.
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCHSET cgroup/for-3.12] cgroup: make cgroup_event specific to memcg
  2013-08-07 12:18             ` Michal Hocko
@ 2013-08-07 12:43               ` Tejun Heo
  2013-08-07 13:26                 ` Michal Hocko
  0 siblings, 1 reply; 30+ messages in thread
From: Tejun Heo @ 2013-08-07 12:43 UTC (permalink / raw)
  To: Michal Hocko
  Cc: lizefan, hannes, bsingharora, kamezawa.hiroyu, cgroups, linux-mm,
	linux-kernel

Hello, Michal.

On Wed, Aug 07, 2013 at 02:18:36PM +0200, Michal Hocko wrote:
> How is it specific to memcg? The fact only memcg uses the interface
> doesn't imply it is memcg specific.

I don't follow.  It's only for memcg.  That is *by definition* memcg
specific.  It's the verbatim meaning of the word.  Now, I do
understand that it can be a concern the implementation details as-is
could be a bit too invasive into cgroup core to be moved to memcg, but
that's something we can work on, right?  Can you at least agree that
the feature is nmemcg specific and it'd be better to be located in
memcg if possible?  That really isn't not much to ask and is a logical
thing to do.

> There are other ways to achieve the same. E.g. not ack new usage of
> register callback users. We have done similar with other things like
> use_hierarchy...

Yes, but those are all inferior to actually moving the code where it
belongs.  Those makes the code harder to follow and people
misunderstand and waste time working on stuff (either in the core or
controllers) which eventually end up getting nacked.  Why do that when
we can easily do better?  What's the rationale behind that?

> The cleanup is removing 2 callbacks with a cost of moving non-memcg
> specific code inside memcg. That is what I am objecting to.

I don't really get your "non-memcg" specific code assertion when it is
by definition memcg-specific.  What are you talking about?

> I will not repeat myself. We seem to disagree on where the code belongs.
> As I've said I will not ack this code, try to find somebody else who
> think it is a good idea. I do not see any added value.

Nacking is part of your authority as maintainer but you should still
provide plausible rationale for that.  Are you saying that even if the
code is restructured so that it's not invasive into cgroup core, you
are still gonna disagree with it because it's still somehow not
memcg-specifc?  Please don't repeat yourself but do explain your
rationale.  That's part of your duty as a maintainer too.

Thanks.

-- 
tejun

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCHSET cgroup/for-3.12] cgroup: make cgroup_event specific to memcg
  2013-08-07 12:43               ` Tejun Heo
@ 2013-08-07 13:26                 ` Michal Hocko
  2013-08-07 13:36                   ` Tejun Heo
  0 siblings, 1 reply; 30+ messages in thread
From: Michal Hocko @ 2013-08-07 13:26 UTC (permalink / raw)
  To: Tejun Heo
  Cc: lizefan, hannes, bsingharora, kamezawa.hiroyu, cgroups, linux-mm,
	linux-kernel

On Wed 07-08-13 08:43:21, Tejun Heo wrote:
> Hello, Michal.
> 
> On Wed, Aug 07, 2013 at 02:18:36PM +0200, Michal Hocko wrote:
> > How is it specific to memcg? The fact only memcg uses the interface
> > doesn't imply it is memcg specific.
> 
> I don't follow.  It's only for memcg.  That is *by definition* memcg
> specific.  It's the verbatim meaning of the word.

My understanding of "memcg specific" is that it uses memcg specific
code/data structures. But let's not play with words.

> Now, I do
> understand that it can be a concern the implementation details as-is
> could be a bit too invasive into cgroup core to be moved to memcg, but
> that's something we can work on, right?

Does it really make sense to work on this interface if it is planned to
be replaced by something different. Isn't that just a waste of time?

> Can you at least agree that the feature is nmemcg specific and it'd be
> better to be located in memcg if possible?  That really isn't not much
> to ask and is a logical thing to do.

I would rather see it not changed unless it really is a big win in the
cgroup core. So far I do not see anything like that (just look at
__cgroup_from_dentry which needs to be exported to allow for the move).
You reduce the amount of code in cgroup.c, alright, but the code
doesn't go away really. It just moves out of your sight and moves the
same burden on somebody else without providing a new generic interface.

> > There are other ways to achieve the same. E.g. not ack new usage of
> > register callback users. We have done similar with other things like
> > use_hierarchy...
> 
> Yes, but those are all inferior to actually moving the code where it
> belongs.  Those makes the code harder to follow and people
> misunderstand and waste time working on stuff (either in the core or
> controllers) which eventually end up getting nacked.  Why do that when
> we can easily do better?  What's the rationale behind that?

If somebody needs a notification interface (and there is no one available
right now) then you cannot prevent from such a pointless work anyway...

> > The cleanup is removing 2 callbacks with a cost of moving non-memcg
> > specific code inside memcg. That is what I am objecting to.
> 
> I don't really get your "non-memcg" specific code assertion when it is
> by definition memcg-specific.  What are you talking about?

cgroup_event_* don't sound memcg specific at all. They are playing with
cgroup dentry reference counting and do a generic functionality which
memcg doesn't need to know about.

> > I will not repeat myself. We seem to disagree on where the code belongs.
> > As I've said I will not ack this code, try to find somebody else who
> > think it is a good idea. I do not see any added value.
> 
> Nacking is part of your authority as maintainer but you should still
> provide plausible rationale for that.

I didn't say I Nack it. I said I won't Ack it. If Johannes or Kamezawa
think this is OK and another bloat in memcg is not a big deal I will not
block it. I won't be happy but how is the life.

> Are you saying that even if the
> code is restructured so that it's not invasive into cgroup core, you
> are still gonna disagree with it because it's still somehow not
> memcg-specifc?

I wouldn't object to having non-cgroup internals playing variant. I just
do not think it makes sense to invest time to something that should go
away long term.

> Please don't repeat yourself but do explain your rationale.  That's
> part of your duty as a maintainer too.

I think I am clear what I do not like about this move.
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCHSET cgroup/for-3.12] cgroup: make cgroup_event specific to memcg
  2013-08-07 13:26                 ` Michal Hocko
@ 2013-08-07 13:36                   ` Tejun Heo
  2013-08-08  2:53                     ` Li Zefan
  0 siblings, 1 reply; 30+ messages in thread
From: Tejun Heo @ 2013-08-07 13:36 UTC (permalink / raw)
  To: Michal Hocko
  Cc: lizefan, hannes, bsingharora, kamezawa.hiroyu, cgroups, linux-mm,
	linux-kernel

Hello, Michal.

On Wed, Aug 07, 2013 at 03:26:13PM +0200, Michal Hocko wrote:
> I would rather see it not changed unless it really is a big win in the
> cgroup core. So far I do not see anything like that (just look at
> __cgroup_from_dentry which needs to be exported to allow for the move).

The end goal is cleaning up cftype so that it becomes a thin wrapper
around seq_file and I'd really like to keep the interface minimal so
that it's difficult to misunderstand.

> You reduce the amount of code in cgroup.c, alright, but the code
> doesn't go away really. It just moves out of your sight and moves the
> same burden on somebody else without providing a new generic interface.

If the implementation details are all that you're objecting, I'll be
happy to restructure it.  I just didn't pay too much attention to it
because I considered it to be mostly deprecated.  I don't think it'll
be too much work and strongly think it'll be worth the effort.  Our
code base is extremely nasty is and I'll try to get any ounce of
cleanup I can get.

> If somebody needs a notification interface (and there is no one available
> right now) then you cannot prevent from such a pointless work anyway...

I'm gonna add one for freezer state transitions.  It'll be simple
"this file changed" thing and will probably apply that to at least oom
and vmpressure.  I'm relatively confident that it's gonna be pretty
simple and that's gonna be the cgroup event mechanism.

> cgroup_event_* don't sound memcg specific at all. They are playing with
> cgroup dentry reference counting and do a generic functionality which
> memcg doesn't need to know about.

Sure, I'll try to clean it up so that it doesn't meddle with cgroup
internals directly.

> I wouldn't object to having non-cgroup internals playing variant. I just
> do not think it makes sense to invest time to something that should go
> away long term.

I suppose it's priority thing.  To me, cleaning up cgroup core API is
quite important and I'd be happy to sink time and effort into it and
it's not like we can drop the event thing in a release cycle or two.
We'd have to carry it for years, so I think the effort is justified.

Thanks.

-- 
tejun

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCHSET cgroup/for-3.12] cgroup: make cgroup_event specific to memcg
  2013-08-07 13:36                   ` Tejun Heo
@ 2013-08-08  2:53                     ` Li Zefan
  2013-08-09  1:00                       ` Tejun Heo
  0 siblings, 1 reply; 30+ messages in thread
From: Li Zefan @ 2013-08-08  2:53 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Michal Hocko, hannes, bsingharora, kamezawa.hiroyu, cgroups,
	linux-mm, linux-kernel

>> If somebody needs a notification interface (and there is no one available
>> right now) then you cannot prevent from such a pointless work anyway...
> 
> I'm gonna add one for freezer state transitions.  It'll be simple
> "this file changed" thing and will probably apply that to at least oom
> and vmpressure.  I'm relatively confident that it's gonna be pretty
> simple and that's gonna be the cgroup event mechanism.
> 

I would like to see this happen. I have a feeling that we're deprecating
features a bit aggressively without providing alternatives.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCHSET cgroup/for-3.12] cgroup: make cgroup_event specific to memcg
  2013-08-08  2:53                     ` Li Zefan
@ 2013-08-09  1:00                       ` Tejun Heo
  0 siblings, 0 replies; 30+ messages in thread
From: Tejun Heo @ 2013-08-09  1:00 UTC (permalink / raw)
  To: Li Zefan
  Cc: Michal Hocko, hannes, bsingharora, kamezawa.hiroyu, cgroups,
	linux-mm, linux-kernel

Hello, Li.

On Thu, Aug 08, 2013 at 10:53:16AM +0800, Li Zefan wrote:
> I would like to see this happen. I have a feeling that we're deprecating
> features a bit aggressively without providing alternatives.

I'd rework it prolly next week but this has to go one way or another.
There's no way we're implementing userland interface this complex in
cgroup proper.  It is a gross layering violation.  We don't implement
userland visible interface this complex in low level subsystems.  It's
wrong both in principle and leads to all sorts of problems in practice
like ending up worrying about userland abuses in memcg event source
implementation, which is utterly bonkers if you ask me.

Thanks.

-- 
tejun

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 30+ messages in thread

end of thread, other threads:[~2013-08-09  1:00 UTC | newest]

Thread overview: 30+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2013-08-04 16:07 [PATCHSET cgroup/for-3.12] cgroup: make cgroup_event specific to memcg Tejun Heo
2013-08-04 16:07 ` [PATCH 1/5] cgroup: implement CFTYPE_NO_PREFIX Tejun Heo
2013-08-04 16:07 ` [PATCH 2/5] cgroup: export __cgroup_from_dentry() and __cgroup_dput() Tejun Heo
2013-08-05  2:58   ` Li Zefan
2013-08-05 15:40     ` Tejun Heo
2013-08-05 17:08   ` [PATCH v2 2/5] cgroup: make __cgroup_from_dentry() and __cgroup_dput() global Tejun Heo
2013-08-04 16:07 ` [PATCH 3/5] cgroup, memcg: move cgroup_event implementation to memcg Tejun Heo
2013-08-05  3:14   ` Li Zefan
2013-08-05 17:09   ` [PATCH v2 " Tejun Heo
2013-08-06  2:02     ` Li Zefan
2013-08-06  2:21       ` Li Zefan
2013-08-06  3:26   ` [PATCH " Balbir Singh
2013-08-06 14:09     ` Tejun Heo
2013-08-06 16:03       ` Balbir Singh
2013-08-04 16:07 ` [PATCH 4/5] cgroup, memcg: move cgroup->event_list[_lock] and event callbacks into memcg Tejun Heo
2013-08-04 16:07 ` [PATCH 5/5] memcg: rename cgroup_event to mem_cgroup_event Tejun Heo
2013-08-05  3:26   ` Li Zefan
2013-08-05 17:10   ` [PATCH v2 " Tejun Heo
2013-08-05 16:01 ` [PATCHSET cgroup/for-3.12] cgroup: make cgroup_event specific to memcg Michal Hocko
2013-08-05 16:29   ` Tejun Heo
2013-08-05 19:16     ` Michal Hocko
2013-08-05 19:44       ` Tejun Heo
2013-08-06 15:58         ` Michal Hocko
2013-08-06 16:15           ` Tejun Heo
2013-08-07 12:18             ` Michal Hocko
2013-08-07 12:43               ` Tejun Heo
2013-08-07 13:26                 ` Michal Hocko
2013-08-07 13:36                   ` Tejun Heo
2013-08-08  2:53                     ` Li Zefan
2013-08-09  1:00                       ` Tejun Heo

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).