[RFC] [PATCH] Cgroup based OOM killer controller

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

* [RFC] [PATCH] Cgroup based OOM killer controller
@ 2009-01-21 11:08 Nikanth Karthikesan
  2009-01-21 13:17 ` Evgeniy Polyakov
                   ` (2 more replies)
  0 siblings, 3 replies; 63+ messages in thread
From: Nikanth Karthikesan @ 2009-01-21 11:08 UTC (permalink / raw)
  To: linux-kernel
  Cc: Andrew Morton, Linus Torvalds, Evgeniy Polyakov, Chris Snook,
	Alan Cox, Arve Hjønnevåg, Paul Menage, containers

As Alan Cox suggested/wondered in this thread, 
http://lkml.org/lkml/2009/1/12/235 , this is a container group based approach 
to override the oom killer selection without losing all the benefits of the 
current oom killer heuristics and oom_adj interface.

It adds a tunable oom.victim to the oom cgroup. The oom killer will kill the 
process using the usual badness value but only within the cgroup with the 
maximum value for oom.victim before killing any process from a cgroup with a 
lesser oom.victim number. Oom killing could be disabled by setting 
oom.victim=0.

Signed-off-by: Nikanth Karthikesan <knikanth@suse.de>

---

diff --git a/Documentation/cgroups/oom.txt b/Documentation/cgroups/oom.txt
new file mode 100644
index 0000000..9102830
--- /dev/null
+++ b/Documentation/cgroups/oom.txt
@@ -0,0 +1,29 @@
+OOM Killer controller
+--- ------ ----------
+
+The OOM killer kills the process based on a set of heuristics such that only
+minimum amount of work done will be lost, a large amount of memory would be
+recovered and minimum no of processes are killed.
+
+The user can adjust the score used to select the processes to be killed using
+/proc/<pid>/oom_adj. Giving it a high score will increase the likelihood of 
+this process being killed by the oom-killer.  Valid values are in the range 
+-16 to +15, plus the special value -17, which disables oom-killing altogether
+for that process.
+
+But it is very difficult to suggest an order among tasks to be killed during
+Out Of Memory situation. The OOM Killer controller aids in doing that.
+
+USAGE
+-----
+
+Mount the oom controller by passing 'oom' when mounting cgroups. Echo
+a value in oom.victim file to change the order. The oom killer would kill
+all the processes in a cgroup with a higher oom.victim value before killing a
+process in a cgroup with lower oom.victim value. Among those tasks with same
+oom.victim value, the usual badness heuristics would be applied. The 
+/proc/<pid>/oom_adj still helps adjusting the oom killer score. Also having
+oom.victim = 0 would disable oom killing for the tasks in that cgroup.
+
+Note: If this is used without proper consideration, innocent processes may
+get killed unnecesarily.
diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
index 9c8d31b..6944f99 100644
--- a/include/linux/cgroup_subsys.h
+++ b/include/linux/cgroup_subsys.h
@@ -59,4 +59,8 @@ SUBSYS(freezer)
 SUBSYS(net_cls)
 #endif
 
+#ifdef CONFIG_CGROUP_OOM
+SUBSYS(oom)
+#endif
+
 /* */
diff --git a/include/linux/oomcontrol.h b/include/linux/oomcontrol.h
new file mode 100644
index 0000000..18146ab
--- /dev/null
+++ b/include/linux/oomcontrol.h
@@ -0,0 +1,14 @@
+#ifndef _LINUX_OOMCONTROL_H
+#define _LINUX_OOMCONTROL_H
+
+#ifdef CONFIG_CGROUP_OOM
+struct oom_cgroup { 
+	struct cgroup_subsys_state css;
+	/*
+	 * the order to be victimized for this group
+	 */  
+	u64 victim;
+};
+
+#endif
+#endif
diff --git a/init/Kconfig b/init/Kconfig
index 2af8382..99ed0de 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -354,6 +354,15 @@ config CGROUP_DEBUG
 
 	  Say N if unsure.
 
+config CGROUP_OOM
+	bool "Oom cgroup subsystem"
+	depends on CGROUPS
+	help
+	  This provides a cgroup subsystem which aids controlling
+	  the order in which tasks whould be killed during
+	  out of memory situations.
+	
+
 config CGROUP_NS
 	bool "Namespace cgroup subsystem"
 	depends on CGROUPS
diff --git a/mm/Makefile b/mm/Makefile
index 72255be..a5d7222 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -33,3 +33,4 @@ obj-$(CONFIG_MIGRATION) += migrate.o
 obj-$(CONFIG_SMP) += allocpercpu.o
 obj-$(CONFIG_QUICKLIST) += quicklist.o
 obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o page_cgroup.o
+obj-$(CONFIG_CGROUP_OOM) += oomcontrol.o 
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 40ba050..f697d50 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -26,6 +26,7 @@
 #include <linux/module.h>
 #include <linux/notifier.h>
 #include <linux/memcontrol.h>
+#include <linux/oomcontrol.h>
 #include <linux/security.h>
 
 int sysctl_panic_on_oom;
@@ -205,6 +206,9 @@ static struct task_struct *select_bad_process(unsigned 
long *ppoints,
 	struct task_struct *g, *p;
 	struct task_struct *chosen = NULL;
 	struct timespec uptime;
+#ifdef CONFIG_CGROUP_OOM
+	u64 chosenvictim = 1, taskvictim;
+#endif
 	*ppoints = 0;
 
 	do_posix_clock_monotonic_gettime(&uptime);
@@ -257,10 +261,26 @@ static struct task_struct *select_bad_process(unsigned 
long *ppoints,
 			continue;
 
 		points = badness(p, uptime.tv_sec);
+#ifdef CONFIG_CGROUP_OOM
+		taskvictim = (container_of(p->cgroups->subsys[oom_subsys_id],
+						struct oom_cgroup, css))->victim;
+
+		if (taskvictim > chosenvictim ||
+			(taskvictim == chosenvictim && points > *ppoints) ||
+			!chosen) {
+
+			chosen = p;
+			*ppoints = points;
+			chosenvictim = taskvictim;
+
+		}
+		
+#else
 		if (points > *ppoints || !chosen) {
 			chosen = p;
 			*ppoints = points;
 		}
+#endif
 	} while_each_thread(g, p);
 
 	return chosen;
diff --git a/mm/oomcontrol.c b/mm/oomcontrol.c
new file mode 100644
index 0000000..2bfa325
--- /dev/null
+++ b/mm/oomcontrol.c
@@ -0,0 +1,95 @@
+/*
+ * kernel/cgroup_oom.c - oom handler cgroup.
+ */
+
+#include <linux/cgroup.h>
+#include <linux/fs.h>
+#include <linux/slab.h>
+#include <linux/oomcontrol.h>
+
+/*
+ * Helper to retrieve oom controller data from cgroup
+ */
+static struct oom_cgroup *oom_css_from_cgroup(struct cgroup *cgrp)
+{
+        return container_of(cgroup_subsys_state(cgrp,
+                                oom_subsys_id), struct oom_cgroup,
+                                css);
+}
+
+
+static struct cgroup_subsys_state *oom_create(struct cgroup_subsys *ss,
+						   struct cgroup *cont)
+{
+	struct oom_cgroup *oom_css = kzalloc(sizeof(*oom_css), GFP_KERNEL);
+	struct oom_cgroup *parent;
+
+	if (!oom_css)
+		return ERR_PTR(-ENOMEM);
+
+	/*
+	 * if root last/only group to be victimized
+	 * else inherit parents value
+	 */
+	if (cont->parent == NULL)
+		oom_css->victim = 1;
+	else {
+		parent = oom_css_from_cgroup(cont->parent);
+		oom_css->victim = parent->victim;
+	}
+
+	return &oom_css->css;
+}
+
+static void oom_destroy(struct cgroup_subsys *ss, struct cgroup *cont)
+{
+	kfree(cont->subsys[oom_subsys_id]);
+}
+
+static int oom_victim_write(struct cgroup *cgrp, struct cftype *cft,
+                                       u64 val)
+{
+
+        cgroup_lock();
+
+	(oom_css_from_cgroup(cgrp))->victim = val;
+
+        cgroup_unlock();
+
+        return 0;
+}
+
+static u64 oom_victim_read(struct cgroup *cgrp, struct cftype *cft)
+{
+        u64 victim = (oom_css_from_cgroup(cgrp))->victim;
+
+        return victim;
+}
+
+
+static struct cftype oom_cgroup_files[] = {
+	{
+		.name = "victim",
+		.read_u64 = oom_victim_read,
+		.write_u64 = oom_victim_write,
+	},
+};
+
+static int oom_populate(struct cgroup_subsys *ss,
+                                struct cgroup *cont)
+{
+        int ret;
+
+        ret = cgroup_add_files(cont, ss, oom_cgroup_files,
+                                ARRAY_SIZE(oom_cgroup_files));
+
+        return ret;
+}
+
+struct cgroup_subsys oom_subsys = {
+	.name = "oom",
+	.subsys_id = oom_subsys_id,
+	.create = oom_create,
+	.destroy = oom_destroy,
+	.populate = oom_populate,
+};


^ permalink raw reply related	[flat|nested] 63+ messages in thread

* Re: [RFC] [PATCH] Cgroup based OOM killer controller
  2009-01-21 11:08 [RFC] [PATCH] Cgroup based OOM killer controller Nikanth Karthikesan
@ 2009-01-21 13:17 ` Evgeniy Polyakov
  2009-01-21 15:24   ` Nikanth Karthikesan
  2009-01-22  3:28 ` KAMEZAWA Hiroyuki
  2009-01-26 19:54 ` Balbir Singh
  2 siblings, 1 reply; 63+ messages in thread
From: Evgeniy Polyakov @ 2009-01-21 13:17 UTC (permalink / raw)
  To: Nikanth Karthikesan
  Cc: linux-kernel, Andrew Morton, Linus Torvalds, Chris Snook,
	Alan Cox, Arve Hjønnevåg, Paul Menage, containers

On Wed, Jan 21, 2009 at 04:38:21PM +0530, Nikanth Karthikesan (knikanth@suse.de) wrote:
> As Alan Cox suggested/wondered in this thread, 
> http://lkml.org/lkml/2009/1/12/235 , this is a container group based approach 
> to override the oom killer selection without losing all the benefits of the 
> current oom killer heuristics and oom_adj interface.
> 
> It adds a tunable oom.victim to the oom cgroup. The oom killer will kill the 
> process using the usual badness value but only within the cgroup with the 
> maximum value for oom.victim before killing any process from a cgroup with a 
> lesser oom.victim number. Oom killing could be disabled by setting 
> oom.victim=0.

Looks good, except that in some conditions (everyone has zero for
example) victim=0 does not disalbe oom-killer because of !chosen
part of the condition.

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC] [PATCH] Cgroup based OOM killer controller
  2009-01-21 13:17 ` Evgeniy Polyakov
@ 2009-01-21 15:24   ` Nikanth Karthikesan
  2009-01-21 20:49     ` David Rientjes
  0 siblings, 1 reply; 63+ messages in thread
From: Nikanth Karthikesan @ 2009-01-21 15:24 UTC (permalink / raw)
  To: Evgeniy Polyakov, Andrew Morton, Alan Cox
  Cc: linux-kernel, Linus Torvalds, Chris Snook,
	Arve Hjønnevåg, Paul Menage, containers

On Wednesday 21 January 2009 18:47:39 Evgeniy Polyakov wrote:
> On Wed, Jan 21, 2009 at 04:38:21PM +0530, Nikanth Karthikesan 
(knikanth@suse.de) wrote:
> > As Alan Cox suggested/wondered in this thread,
> > http://lkml.org/lkml/2009/1/12/235 , this is a container group based
> > approach to override the oom killer selection without losing all the
> > benefits of the current oom killer heuristics and oom_adj interface.
> >
> > It adds a tunable oom.victim to the oom cgroup. The oom killer will kill
> > the process using the usual badness value but only within the cgroup with
> > the maximum value for oom.victim before killing any process from a cgroup
> > with a lesser oom.victim number. Oom killing could be disabled by setting
> > oom.victim=0.
>
> Looks good, except that in some conditions (everyone has zero for
> example) victim=0 does not disalbe oom-killer because of !chosen
> part of the condition.

Thanks for reviewing. Yes, I missed that condition. Adding a check for 
victim!=0 should fix it. Here is a revised patch with that fixed.

Thanks
Nikanth Karthikesan

From: Nikanth Karthikesan <knikanth@suse.de>

Cgroup based OOM killer controller

Signed-off-by: Nikanth Karthikesan <knikanth@suse.de>

---

This is a container group based approach to override the oom killer selection 
without losing all the benefits of the current oom killer heuristics and 
oom_adj interface.

It adds a tunable oom.victim to the oom cgroup. The oom killer will kill the 
process using the usual badness value but only within the cgroup with the 
maximum value for oom.victim before killing any process from a cgroup with a 
lesser oom.victim number. Oom killing could be disabled by setting 
oom.victim=0.

diff --git a/Documentation/cgroups/oom.txt b/Documentation/cgroups/oom.txt
new file mode 100644
index 0000000..9102830
--- /dev/null
+++ b/Documentation/cgroups/oom.txt
@@ -0,0 +1,29 @@
+OOM Killer controller
+--- ------ ----------
+
+The OOM killer kills the process based on a set of heuristics such that only
+minimum amount of work done will be lost, a large amount of memory would be
+recovered and minimum no of processes are killed.
+
+The user can adjust the score used to select the processes to be killed using
+/proc/<pid>/oom_adj. Giving it a high score will increase the likelihood of 
+this process being killed by the oom-killer.  Valid values are in the range 
+-16 to +15, plus the special value -17, which disables oom-killing altogether
+for that process.
+
+But it is very difficult to suggest an order among tasks to be killed during
+Out Of Memory situation. The OOM Killer controller aids in doing that.
+
+USAGE
+-----
+
+Mount the oom controller by passing 'oom' when mounting cgroups. Echo
+a value in oom.victim file to change the order. The oom killer would kill
+all the processes in a cgroup with a higher oom.victim value before killing a
+process in a cgroup with lower oom.victim value. Among those tasks with same
+oom.victim value, the usual badness heuristics would be applied. The 
+/proc/<pid>/oom_adj still helps adjusting the oom killer score. Also having
+oom.victim = 0 would disable oom killing for the tasks in that cgroup.
+
+Note: If this is used without proper consideration, innocent processes may
+get killed unnecesarily.
diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
index 9c8d31b..6944f99 100644
--- a/include/linux/cgroup_subsys.h
+++ b/include/linux/cgroup_subsys.h
@@ -59,4 +59,8 @@ SUBSYS(freezer)
 SUBSYS(net_cls)
 #endif
 
+#ifdef CONFIG_CGROUP_OOM
+SUBSYS(oom)
+#endif
+
 /* */
diff --git a/include/linux/oomcontrol.h b/include/linux/oomcontrol.h
new file mode 100644
index 0000000..18146ab
--- /dev/null
+++ b/include/linux/oomcontrol.h
@@ -0,0 +1,14 @@
+#ifndef _LINUX_OOMCONTROL_H
+#define _LINUX_OOMCONTROL_H
+
+#ifdef CONFIG_CGROUP_OOM
+struct oom_cgroup { 
+	struct cgroup_subsys_state css;
+	/*
+	 * the order to be victimized for this group
+	 */  
+	u64 victim;
+};
+
+#endif
+#endif
diff --git a/init/Kconfig b/init/Kconfig
index 2af8382..99ed0de 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -354,6 +354,15 @@ config CGROUP_DEBUG
 
 	  Say N if unsure.
 
+config CGROUP_OOM
+	bool "Oom cgroup subsystem"
+	depends on CGROUPS
+	help
+	  This provides a cgroup subsystem which aids controlling
+	  the order in which tasks whould be killed during
+	  out of memory situations.
+	
+
 config CGROUP_NS
 	bool "Namespace cgroup subsystem"
 	depends on CGROUPS
diff --git a/mm/Makefile b/mm/Makefile
index 72255be..a5d7222 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -33,3 +33,4 @@ obj-$(CONFIG_MIGRATION) += migrate.o
 obj-$(CONFIG_SMP) += allocpercpu.o
 obj-$(CONFIG_QUICKLIST) += quicklist.o
 obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o page_cgroup.o
+obj-$(CONFIG_CGROUP_OOM) += oomcontrol.o 
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 40ba050..64fa38e 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -26,6 +26,7 @@
 #include <linux/module.h>
 #include <linux/notifier.h>
 #include <linux/memcontrol.h>
+#include <linux/oomcontrol.h>
 #include <linux/security.h>
 
 int sysctl_panic_on_oom;
@@ -205,6 +206,9 @@ static struct task_struct *select_bad_process(unsigned 
long *ppoints,
 	struct task_struct *g, *p;
 	struct task_struct *chosen = NULL;
 	struct timespec uptime;
+#ifdef CONFIG_CGROUP_OOM
+	u64 chosenvictim = 1, taskvictim;
+#endif
 	*ppoints = 0;
 
 	do_posix_clock_monotonic_gettime(&uptime);
@@ -257,10 +261,26 @@ static struct task_struct *select_bad_process(unsigned 
long *ppoints,
 			continue;
 
 		points = badness(p, uptime.tv_sec);
+#ifdef CONFIG_CGROUP_OOM
+		taskvictim = (container_of(p->cgroups->subsys[oom_subsys_id],
+						struct oom_cgroup, css))->victim;
+
+		if (taskvictim > chosenvictim ||
+			(taskvictim == chosenvictim && points > *ppoints) ||
+			(taskvictim && !chosen)) {
+
+			chosen = p;
+			*ppoints = points;
+			chosenvictim = taskvictim;
+
+		}
+		
+#else
 		if (points > *ppoints || !chosen) {
 			chosen = p;
 			*ppoints = points;
 		}
+#endif
 	} while_each_thread(g, p);
 
 	return chosen;
diff --git a/mm/oomcontrol.c b/mm/oomcontrol.c
new file mode 100644
index 0000000..2bfa325
--- /dev/null
+++ b/mm/oomcontrol.c
@@ -0,0 +1,95 @@
+/*
+ * kernel/cgroup_oom.c - oom handler cgroup.
+ */
+
+#include <linux/cgroup.h>
+#include <linux/fs.h>
+#include <linux/slab.h>
+#include <linux/oomcontrol.h>
+
+/*
+ * Helper to retrieve oom controller data from cgroup
+ */
+static struct oom_cgroup *oom_css_from_cgroup(struct cgroup *cgrp)
+{
+        return container_of(cgroup_subsys_state(cgrp,
+                                oom_subsys_id), struct oom_cgroup,
+                                css);
+}
+
+
+static struct cgroup_subsys_state *oom_create(struct cgroup_subsys *ss,
+						   struct cgroup *cont)
+{
+	struct oom_cgroup *oom_css = kzalloc(sizeof(*oom_css), GFP_KERNEL);
+	struct oom_cgroup *parent;
+
+	if (!oom_css)
+		return ERR_PTR(-ENOMEM);
+
+	/*
+	 * if root last/only group to be victimized
+	 * else inherit parents value
+	 */
+	if (cont->parent == NULL)
+		oom_css->victim = 1;
+	else {
+		parent = oom_css_from_cgroup(cont->parent);
+		oom_css->victim = parent->victim;
+	}
+
+	return &oom_css->css;
+}
+
+static void oom_destroy(struct cgroup_subsys *ss, struct cgroup *cont)
+{
+	kfree(cont->subsys[oom_subsys_id]);
+}
+
+static int oom_victim_write(struct cgroup *cgrp, struct cftype *cft,
+                                       u64 val)
+{
+
+        cgroup_lock();
+
+	(oom_css_from_cgroup(cgrp))->victim = val;
+
+        cgroup_unlock();
+
+        return 0;
+}
+
+static u64 oom_victim_read(struct cgroup *cgrp, struct cftype *cft)
+{
+        u64 victim = (oom_css_from_cgroup(cgrp))->victim;
+
+        return victim;
+}
+
+
+static struct cftype oom_cgroup_files[] = {
+	{
+		.name = "victim",
+		.read_u64 = oom_victim_read,
+		.write_u64 = oom_victim_write,
+	},
+};
+
+static int oom_populate(struct cgroup_subsys *ss,
+                                struct cgroup *cont)
+{
+        int ret;
+
+        ret = cgroup_add_files(cont, ss, oom_cgroup_files,
+                                ARRAY_SIZE(oom_cgroup_files));
+
+        return ret;
+}
+
+struct cgroup_subsys oom_subsys = {
+	.name = "oom",
+	.subsys_id = oom_subsys_id,
+	.create = oom_create,
+	.destroy = oom_destroy,
+	.populate = oom_populate,
+};


^ permalink raw reply related	[flat|nested] 63+ messages in thread

* Re: [RFC] [PATCH] Cgroup based OOM killer controller
  2009-01-21 15:24   ` Nikanth Karthikesan
@ 2009-01-21 20:49     ` David Rientjes
  2009-01-22  2:53       ` KAMEZAWA Hiroyuki
  2009-01-22  5:12       ` Nikanth Karthikesan
  0 siblings, 2 replies; 63+ messages in thread
From: David Rientjes @ 2009-01-21 20:49 UTC (permalink / raw)
  To: Nikanth Karthikesan
  Cc: Evgeniy Polyakov, Andrew Morton, Alan Cox, linux-kernel,
	Linus Torvalds, Chris Snook, Arve Hjønnevåg,
	Paul Menage, containers

On Wed, 21 Jan 2009, Nikanth Karthikesan wrote:

> This is a container group based approach to override the oom killer selection 
> without losing all the benefits of the current oom killer heuristics and 
> oom_adj interface.
> 
> It adds a tunable oom.victim to the oom cgroup. The oom killer will kill the 
> process using the usual badness value but only within the cgroup with the 
> maximum value for oom.victim before killing any process from a cgroup with a 
> lesser oom.victim number. Oom killing could be disabled by setting 
> oom.victim=0.
> 

This doesn't help in memcg or cpuset constrained oom conditions, which 
still go through select_bad_process().

If the oom.victim value is high for a specific cgroup and a memory 
controller oom occurs in a disjoint cgroup, for example, it's possible to 
needlessly kill tasks.  Obviously that is up to the administrator to 
configure, but may not be his or her desire for system-wide oom 
conditions.

It may be preferred to kill tasks in a specific cgroup first when the 
entire system is out of memory or kill tasks within a cgroup attached to a 
memory controller when it is oom.

The same scenario applies for cpuset-constrained ooms.  Since oom.victim 
is given higher preference than all tasks' oom_adj values, it is possible 
to needlessly kill tasks that do not lead to future memory freeing for the 
nodes attached to that cpuset.

It also requires that you synchronize the oom.victim values amongst your 
cgroups.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC] [PATCH] Cgroup based OOM killer controller
  2009-01-21 20:49     ` David Rientjes
@ 2009-01-22  2:53       ` KAMEZAWA Hiroyuki
  2009-01-22  5:12         ` Nikanth Karthikesan
  2009-01-22  5:12       ` Nikanth Karthikesan
  1 sibling, 1 reply; 63+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-01-22  2:53 UTC (permalink / raw)
  To: David Rientjes
  Cc: Nikanth Karthikesan, containers, linux-kernel,
	Arve Hjønnevåg, Evgeniy Polyakov, Andrew Morton,
	Chris Snook, Linus Torvalds, Paul Menage, Alan Cox

On Wed, 21 Jan 2009 12:49:50 -0800 (PST)
David Rientjes <rientjes@google.com> wrote:

> On Wed, 21 Jan 2009, Nikanth Karthikesan wrote:
> 
> > This is a container group based approach to override the oom killer selection 
> > without losing all the benefits of the current oom killer heuristics and 
> > oom_adj interface.
> > 
> > It adds a tunable oom.victim to the oom cgroup. The oom killer will kill the 
> > process using the usual badness value but only within the cgroup with the 
> > maximum value for oom.victim before killing any process from a cgroup with a 
> > lesser oom.victim number. Oom killing could be disabled by setting 
> > oom.victim=0.
> > 
> 
> This doesn't help in memcg or cpuset constrained oom conditions, which 
> still go through select_bad_process().
> 
> If the oom.victim value is high for a specific cgroup and a memory 
> controller oom occurs in a disjoint cgroup, for example, it's possible to 
> needlessly kill tasks.  Obviously that is up to the administrator to 
> configure, but may not be his or her desire for system-wide oom 
> conditions.
> 
Hmm...after this patch, select_bad_process's filter to select process will be

==
	1. ->mm is NULL ?		=> don't select this
	2. is init task ?		=> don't select this
	3. is under specified memcg ?   => don't select this
	4. marked as MEMDIE ?		=> return -1.
	5. PF_EXITING?			=> select this.
	6. OOM_DISABLE ?		=> don't select this
	points = badness(p, uptime.tv_sec);
	7. adjust point & select logic depends on OOM cgroup
==

Not looks good ;)

> It may be preferred to kill tasks in a specific cgroup first when the 
> entire system is out of memory or kill tasks within a cgroup attached to a 
> memory controller when it is oom.
> 

I agree here.

Above filter logic should be
==
	current_victim_level++;
	1. p is under oom cgroup of victim_level > current_victim_level => don't select this.
	2. ->mm is NULL ?		=> don't select this
	3. is init task ?		=> don't select this
	4. is under specified memcg ?   => don't select this
	5. marked as MEMDIE ?		=> return -1.
	6. PF_EXITING?			=> select this.
	7. OOM_DISABLE ?		=> don't select this
	points = badness(p, uptime.tv_sec)
==
But this will be too slow.

I think do_each_thread() in select_bad_process() should be replaced with
a routine like this, finally.
==
	for_each_oom_cgroup_in_victim_value_order() {
		for_each_threads_in_oom_cgroup(oom) {
			select one bad thread.
		}
		if (selected_one_is_enough_bad ?)
			return selected_one;
	}
==

And this can be a help for "spped up OOM killer" problem. 

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC] [PATCH] Cgroup based OOM killer controller
  2009-01-22  2:53       ` KAMEZAWA Hiroyuki
@ 2009-01-22  5:12         ` Nikanth Karthikesan
  0 siblings, 0 replies; 63+ messages in thread
From: Nikanth Karthikesan @ 2009-01-22  5:12 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: David Rientjes, containers, linux-kernel,
	Arve Hjønnevåg, Evgeniy Polyakov, Andrew Morton,
	Chris Snook, Linus Torvalds, Paul Menage, Alan Cox

On Thursday 22 January 2009 08:23:24 KAMEZAWA Hiroyuki wrote:
> On Wed, 21 Jan 2009 12:49:50 -0800 (PST)
>
> David Rientjes <rientjes@google.com> wrote:
> > On Wed, 21 Jan 2009, Nikanth Karthikesan wrote:
> > > This is a container group based approach to override the oom killer
> > > selection without losing all the benefits of the current oom killer
> > > heuristics and oom_adj interface.
> > >
> > > It adds a tunable oom.victim to the oom cgroup. The oom killer will
> > > kill the process using the usual badness value but only within the
> > > cgroup with the maximum value for oom.victim before killing any process
> > > from a cgroup with a lesser oom.victim number. Oom killing could be
> > > disabled by setting oom.victim=0.
> >
> > This doesn't help in memcg or cpuset constrained oom conditions, which
> > still go through select_bad_process().
> >
> > If the oom.victim value is high for a specific cgroup and a memory
> > controller oom occurs in a disjoint cgroup, for example, it's possible to
> > needlessly kill tasks.  Obviously that is up to the administrator to
> > configure, but may not be his or her desire for system-wide oom
> > conditions.
>
> Hmm...after this patch, select_bad_process's filter to select process will
> be
>
> ==
> 	1. ->mm is NULL ?		=> don't select this
> 	2. is init task ?		=> don't select this
> 	3. is under specified memcg ?   => don't select this
> 	4. marked as MEMDIE ?		=> return -1.
> 	5. PF_EXITING?			=> select this.
> 	6. OOM_DISABLE ?		=> don't select this
> 	points = badness(p, uptime.tv_sec);
> 	7. adjust point & select logic depends on OOM cgroup
> ==
>
> Not looks good ;)
>

Yes, we do throw away a lot of needless work done. But this is how we already 
do and this is not a regression. But this could be used to improve the OOM 
killer's speed.

> > It may be preferred to kill tasks in a specific cgroup first when the
> > entire system is out of memory or kill tasks within a cgroup attached to
> > a memory controller when it is oom.
>
> I agree here.
>
> Above filter logic should be
> ==
> 	current_victim_level++;
> 	1. p is under oom cgroup of victim_level > current_victim_level => don't
> select this. 2. ->mm is NULL ?		=> don't select this
> 	3. is init task ?		=> don't select this
> 	4. is under specified memcg ?   => don't select this
> 	5. marked as MEMDIE ?		=> return -1.
> 	6. PF_EXITING?			=> select this.
> 	7. OOM_DISABLE ?		=> don't select this
> 	points = badness(p, uptime.tv_sec)
> ==
> But this will be too slow.
>
> I think do_each_thread() in select_bad_process() should be replaced with
> a routine like this, finally.
> ==
> 	for_each_oom_cgroup_in_victim_value_order() {
> 		for_each_threads_in_oom_cgroup(oom) {
> 			select one bad thread.
> 		}
> 		if (selected_one_is_enough_bad ?)
> 			return selected_one;
> 	}
> ==
>

Yes.

> And this can be a help for "spped up OOM killer" problem.
>

Yes.

Thanks
Nikanth

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC] [PATCH] Cgroup based OOM killer controller
  2009-01-21 20:49     ` David Rientjes
  2009-01-22  2:53       ` KAMEZAWA Hiroyuki
@ 2009-01-22  5:12       ` Nikanth Karthikesan
  2009-01-22  8:43         ` David Rientjes
  1 sibling, 1 reply; 63+ messages in thread
From: Nikanth Karthikesan @ 2009-01-22  5:12 UTC (permalink / raw)
  To: David Rientjes
  Cc: Evgeniy Polyakov, Andrew Morton, Alan Cox, linux-kernel,
	Linus Torvalds, Chris Snook, Arve Hjønnevåg,
	Paul Menage, containers

On Thursday 22 January 2009 02:19:50 David Rientjes wrote:
> On Wed, 21 Jan 2009, Nikanth Karthikesan wrote:
> > This is a container group based approach to override the oom killer
> > selection without losing all the benefits of the current oom killer
> > heuristics and oom_adj interface.
> >
> > It adds a tunable oom.victim to the oom cgroup. The oom killer will kill
> > the process using the usual badness value but only within the cgroup with
> > the maximum value for oom.victim before killing any process from a cgroup
> > with a lesser oom.victim number. Oom killing could be disabled by setting
> > oom.victim=0.
>
> This doesn't help in memcg or cpuset constrained oom conditions, which
> still go through select_bad_process().
>
> If the oom.victim value is high for a specific cgroup and a memory
> controller oom occurs in a disjoint cgroup, for example, it's possible to
> needlessly kill tasks.  Obviously that is up to the administrator to
> configure, but may not be his or her desire for system-wide oom
> conditions.
>
> It may be preferred to kill tasks in a specific cgroup first when the
> entire system is out of memory or kill tasks within a cgroup attached to a
> memory controller when it is oom.
>
> The same scenario applies for cpuset-constrained ooms.  Since oom.victim
> is given higher preference than all tasks' oom_adj values, it is possible
> to needlessly kill tasks that do not lead to future memory freeing for the
> nodes attached to that cpuset.
>
> It also requires that you synchronize the oom.victim values amongst your
> cgroups.

No, this is not specific to memcg or cpuset cases alone. The same needless 
kills will take place even without memcg or cpuset when an administrator 
specifies a light memory consumer to be killed before a heavy memory user. But 
it is up to the administrator to use it wisely. We also provide a panic_on_oom 
option that an administrator could use, not just to kill few more tasks but 
all tasks in the system ;)

Thanks
Nikanth

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC] [PATCH] Cgroup based OOM killer controller
  2009-01-22  5:12       ` Nikanth Karthikesan
@ 2009-01-22  8:43         ` David Rientjes
  2009-01-22  9:23           ` Nikanth Karthikesan
  2009-01-22  9:50           ` Evgeniy Polyakov
  0 siblings, 2 replies; 63+ messages in thread
From: David Rientjes @ 2009-01-22  8:43 UTC (permalink / raw)
  To: Nikanth Karthikesan
  Cc: Evgeniy Polyakov, Andrew Morton, Alan Cox, linux-kernel,
	Linus Torvalds, Chris Snook, Arve Hjønnevåg,
	Paul Menage, containers

On Thu, 22 Jan 2009, Nikanth Karthikesan wrote:

> No, this is not specific to memcg or cpuset cases alone. The same needless 
> kills will take place even without memcg or cpuset when an administrator 
> specifies a light memory consumer to be killed before a heavy memory user. But 
> it is up to the administrator to use it wisely.

You can't specify different behavior for an oom cgroup depending on what 
type of oom it is, which is the problem with this proposal.

For example, if your task triggers an oom as the result of its exclusive 
cpuset placement, the oom killer should prefer to kill a task within that 
cpuset to allow for future memory freeing.

So, with your proposal, an administrator can specify the oom priority of 
an entire aggregate of tasks but the behavior may not be desired for a 
cpuset-constrained oom, while it may be perfectly legitimate for a global 
unconstrained oom.

I can specify a higher oom priority for a cpuset because its jobs are less 
critical and I would prefer it gets killed in a system-wide oom, but any 
other cpuset that ooms will needlessly kill these tasks when there is no 
benefit.

> We also provide a panic_on_oom 
> option that an administrator could use, not just to kill few more tasks but 
> all tasks in the system ;)
> 

panic_on_oom allows you to specify whether you want to panic on any oom or 
on global non-constrained ooms, as well.  When the sysctl is not set to 2, 
it is a no-op in a cpuset, mempolicy, or memcg oom condition.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC] [PATCH] Cgroup based OOM killer controller
  2009-01-22  8:43         ` David Rientjes
@ 2009-01-22  9:23           ` Nikanth Karthikesan
  2009-01-22  9:39             ` David Rientjes
  2009-01-22  9:50           ` Evgeniy Polyakov
  1 sibling, 1 reply; 63+ messages in thread
From: Nikanth Karthikesan @ 2009-01-22  9:23 UTC (permalink / raw)
  To: David Rientjes
  Cc: Evgeniy Polyakov, Andrew Morton, Alan Cox, linux-kernel,
	Linus Torvalds, Chris Snook, Arve Hjønnevåg,
	Paul Menage, containers

On Thursday 22 January 2009 14:13:38 David Rientjes wrote:
> On Thu, 22 Jan 2009, Nikanth Karthikesan wrote:
> > No, this is not specific to memcg or cpuset cases alone. The same
> > needless kills will take place even without memcg or cpuset when an
> > administrator specifies a light memory consumer to be killed before a
> > heavy memory user. But it is up to the administrator to use it wisely.
>
> You can't specify different behavior for an oom cgroup depending on what
> type of oom it is, which is the problem with this proposal.
>

No. This does not disable any such special selection criteria which is used 
without this controller.

> For example, if your task triggers an oom as the result of its exclusive
> cpuset placement, the oom killer should prefer to kill a task within that
> cpuset to allow for future memory freeing.
>
> So, with your proposal, an administrator can specify the oom priority of
> an entire aggregate of tasks but the behavior may not be desired for a
> cpuset-constrained oom, while it may be perfectly legitimate for a global
> unconstrained oom.
>
> I can specify a higher oom priority for a cpuset because its jobs are less
> critical and I would prefer it gets killed in a system-wide oom, but any
> other cpuset that ooms will needlessly kill these tasks when there is no
> benefit.
>

This patch just chooses the task with highest oom.victim among those tasks 
which would have been chosen without this controller. So all the "kill within 
memcg/cpuset" should work as always! It should just kill a task within the 
memcg with highest oom.victim.

Thanks
Nikanth

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC] [PATCH] Cgroup based OOM killer controller
  2009-01-22  9:23           ` Nikanth Karthikesan
@ 2009-01-22  9:39             ` David Rientjes
  2009-01-22 10:10               ` Nikanth Karthikesan
  0 siblings, 1 reply; 63+ messages in thread
From: David Rientjes @ 2009-01-22  9:39 UTC (permalink / raw)
  To: Nikanth Karthikesan
  Cc: Evgeniy Polyakov, Andrew Morton, Alan Cox, linux-kernel,
	Linus Torvalds, Chris Snook, Arve Hjønnevåg,
	Paul Menage, containers

On Thu, 22 Jan 2009, Nikanth Karthikesan wrote:

> > You can't specify different behavior for an oom cgroup depending on what
> > type of oom it is, which is the problem with this proposal.
> >
> 
> No. This does not disable any such special selection criteria which is used 
> without this controller.
> 

I didn't say it disabled it; the cpuset preference is actually implemented 
in the badness() score and not specifically excluded in 
select_bad_process().  That's because it's quite possible that a task has 
allocated memory in a cpuset and then either moved to a separate cpuset or 
had it's mems_allowed changed.

Please try it and you'll see.  Create two cpusets, cpuset A and cpuset B.  
Elevate cpuset A's oom.victim value and then trigger an oom in cpuset B.  
Your patch will cause a task from cpuset A to be killed for a cpuset B 
triggered oom which, more often than not, will not lead to future memory 
freeing.

It's quite possible that cpuset A would be preferred to be killed in a 
global unconstrained oom condition, however.  That's the only reason why 
one would elevate its oom.victim score to begin with.  But it doesn't work 
for cpuset-constrained ooms.

It's not going to help if it I explain it further and you don't try it out 
on your own.  Thanks.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC] [PATCH] Cgroup based OOM killer controller
  2009-01-22  9:39             ` David Rientjes
@ 2009-01-22 10:10               ` Nikanth Karthikesan
  2009-01-22 10:18                 ` David Rientjes
  0 siblings, 1 reply; 63+ messages in thread
From: Nikanth Karthikesan @ 2009-01-22 10:10 UTC (permalink / raw)
  To: David Rientjes
  Cc: Evgeniy Polyakov, Andrew Morton, Alan Cox, linux-kernel,
	Linus Torvalds, Chris Snook, Arve Hjønnevåg,
	Paul Menage, containers

On Thursday 22 January 2009 15:09:28 David Rientjes wrote:
> On Thu, 22 Jan 2009, Nikanth Karthikesan wrote:
> > > You can't specify different behavior for an oom cgroup depending on
> > > what type of oom it is, which is the problem with this proposal.
> >
> > No. This does not disable any such special selection criteria which is
> > used without this controller.
>
> I didn't say it disabled it; the cpuset preference is actually implemented
> in the badness() score and not specifically excluded in
> select_bad_process().  That's because it's quite possible that a task has
> allocated memory in a cpuset and then either moved to a separate cpuset or
> had it's mems_allowed changed.
>
> Please try it and you'll see.  Create two cpusets, cpuset A and cpuset B.
> Elevate cpuset A's oom.victim value and then trigger an oom in cpuset B.
> Your patch will cause a task from cpuset A to be killed for a cpuset B
> triggered oom which, more often than not, will not lead to future memory
> freeing.
>
> It's quite possible that cpuset A would be preferred to be killed in a
> global unconstrained oom condition, however.  That's the only reason why
> one would elevate its oom.victim score to begin with.  But it doesn't work
> for cpuset-constrained ooms.
>
> It's not going to help if it I explain it further and you don't try it out
> on your own.  Thanks.

Thanks for the clear explanation. Cpuset does it by reducing the badness to 
1/8th for tasks. So using oom-controller could kill some innocent processes on 
some other cpuset! 

But it is possible to have the same effect with oom_adj, having oom_adj=4 for 
a task on a diff cpuset will do the same(assuming they have similar badness).

I think cpusets preference could be improved, not to depend on badness, with 
something similar to what memcg does. With or without adding overhead of 
tracking processes that has memory from a node.

Thanks
Nikanth

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC] [PATCH] Cgroup based OOM killer controller
  2009-01-22 10:10               ` Nikanth Karthikesan
@ 2009-01-22 10:18                 ` David Rientjes
  0 siblings, 0 replies; 63+ messages in thread
From: David Rientjes @ 2009-01-22 10:18 UTC (permalink / raw)
  To: Nikanth Karthikesan
  Cc: Evgeniy Polyakov, Andrew Morton, Alan Cox, linux-kernel,
	Linus Torvalds, Chris Snook, Arve Hjønnevåg,
	Paul Menage, containers

On Thu, 22 Jan 2009, Nikanth Karthikesan wrote:

> I think cpusets preference could be improved, not to depend on badness, with 
> something similar to what memcg does. With or without adding overhead of 
> tracking processes that has memory from a node.
> 

We actually used to do that: we excluded all tasks that did not share the 
same cpuset in select_bad_process().  That exclusion was reimplemented as 
a preference in badness() since, again, it is quite possible that a large 
memory-hogging task without a sufficient oom_adj score, as you mentioned, 
has allocated memory on the cpuset's nodes before being moved to a 
different cpuset or changing its set of allowable nodes.

I think you would find the per-cgroup oom notifier patch[*] of interest.  
It seems to have been dropped after some discussion on improvements, but 
allows you to defer all of these decisions to userspace.  Would something 
like that fix your problem?

 [*] http://marc.info/?l=linux-mm&m=122575082227252

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC] [PATCH] Cgroup based OOM killer controller
  2009-01-22  8:43         ` David Rientjes
  2009-01-22  9:23           ` Nikanth Karthikesan
@ 2009-01-22  9:50           ` Evgeniy Polyakov
  2009-01-22 10:00             ` David Rientjes
  1 sibling, 1 reply; 63+ messages in thread
From: Evgeniy Polyakov @ 2009-01-22  9:50 UTC (permalink / raw)
  To: David Rientjes
  Cc: Nikanth Karthikesan, Andrew Morton, Alan Cox, linux-kernel,
	Linus Torvalds, Chris Snook, Arve Hjønnevåg,
	Paul Menage, containers

On Thu, Jan 22, 2009 at 12:43:38AM -0800, David Rientjes (rientjes@google.com) wrote:
> For example, if your task triggers an oom as the result of its exclusive 
> cpuset placement, the oom killer should prefer to kill a task within that 
> cpuset to allow for future memory freeing.

This it not true for all cases. What if you do need to start this task
and free something else outside the given set? This should be an
administrative decision and not forced by the kernel. We used to have it
that way, but it does not mean that it is the only correct way to do the
things.

> So, with your proposal, an administrator can specify the oom priority of 
> an entire aggregate of tasks but the behavior may not be desired for a 
> cpuset-constrained oom, while it may be perfectly legitimate for a global 
> unconstrained oom.

In this case administrator will not do this. It is up to him to decide
and not some inner kernel policy.

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC] [PATCH] Cgroup based OOM killer controller
  2009-01-22  9:50           ` Evgeniy Polyakov
@ 2009-01-22 10:00             ` David Rientjes
  2009-01-22 10:14               ` Evgeniy Polyakov
  0 siblings, 1 reply; 63+ messages in thread
From: David Rientjes @ 2009-01-22 10:00 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: Nikanth Karthikesan, Andrew Morton, Alan Cox, linux-kernel,
	Linus Torvalds, Chris Snook, Arve Hjønnevåg,
	Paul Menage, containers

On Thu, 22 Jan 2009, Evgeniy Polyakov wrote:

> > For example, if your task triggers an oom as the result of its exclusive 
> > cpuset placement, the oom killer should prefer to kill a task within that 
> > cpuset to allow for future memory freeing.
> 
> This it not true for all cases. What if you do need to start this task
> and free something else outside the given set? This should be an
> administrative decision and not forced by the kernel. We used to have it
> that way, but it does not mean that it is the only correct way to do the
> things.
> 

In an exclusive cpuset, a task's memory is restricted to a set of mems 
that the administrator has designated.  If it is oom, the kernel must free 
memory on those nodes or the next allocation will again trigger an oom 
(leading to a needlessly killed task that was in a disjoint cpuset).

Really.

> > So, with your proposal, an administrator can specify the oom priority of 
> > an entire aggregate of tasks but the behavior may not be desired for a 
> > cpuset-constrained oom, while it may be perfectly legitimate for a global 
> > unconstrained oom.
> 
> In this case administrator will not do this. It is up to him to decide
> and not some inner kernel policy.
> 

Then the scope of this new cgroup is restricted to not being used with 
cpusets that could oom.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC] [PATCH] Cgroup based OOM killer controller
  2009-01-22 10:00             ` David Rientjes
@ 2009-01-22 10:14               ` Evgeniy Polyakov
  2009-01-22 10:27                 ` David Rientjes
  0 siblings, 1 reply; 63+ messages in thread
From: Evgeniy Polyakov @ 2009-01-22 10:14 UTC (permalink / raw)
  To: David Rientjes
  Cc: Nikanth Karthikesan, Andrew Morton, Alan Cox, linux-kernel,
	Linus Torvalds, Chris Snook, Arve Hjønnevåg,
	Paul Menage, containers

On Thu, Jan 22, 2009 at 02:00:55AM -0800, David Rientjes (rientjes@google.com) wrote:
> 
> In an exclusive cpuset, a task's memory is restricted to a set of mems 
> that the administrator has designated.  If it is oom, the kernel must free 
> memory on those nodes or the next allocation will again trigger an oom 
> (leading to a needlessly killed task that was in a disjoint cpuset).
> 
> Really.

The whole point of oom-killer is to kill the most appropriate task to
free the memory. And while task is selected system-wide and some
tunables are added to tweak the behaviour local to some subsystems, this
cpuset feature is hardcoded into the selection algorithm.
And when some tunable starts doing own calculation, behaviour of this
hardcoded feature changes.

This is intended to change it. Because admin has to have ability to tune
system the way he needs and not some special hueristics, which may not
work all the time.

That is the point against cpuset argument. Make it tunable the same way
we have oom_adj and/or this cgroup order feature.

> > In this case administrator will not do this. It is up to him to decide
> > and not some inner kernel policy.
> > 
> 
> Then the scope of this new cgroup is restricted to not being used with 
> cpusets that could oom.

These are perpendicular tasks - cpusets limit one area of the oom
handling, cgroup order - another. Some people needs cpusets, others want
cgroups. cpusets are not something exceptional so that only they have to
be taken into account when doing system-wide operation like OOM
condition handling.

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC] [PATCH] Cgroup based OOM killer controller
  2009-01-22 10:14               ` Evgeniy Polyakov
@ 2009-01-22 10:27                 ` David Rientjes
  2009-01-22 13:21                   ` Evgeniy Polyakov
  2009-01-23  9:45                   ` Nikanth Karthikesan
  0 siblings, 2 replies; 63+ messages in thread
From: David Rientjes @ 2009-01-22 10:27 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: Nikanth Karthikesan, Andrew Morton, Alan Cox, linux-kernel,
	Linus Torvalds, Chris Snook, Arve Hjønnevåg,
	Paul Menage, containers

On Thu, 22 Jan 2009, Evgeniy Polyakov wrote:

> > In an exclusive cpuset, a task's memory is restricted to a set of mems 
> > that the administrator has designated.  If it is oom, the kernel must free 
> > memory on those nodes or the next allocation will again trigger an oom 
> > (leading to a needlessly killed task that was in a disjoint cpuset).
> > 
> > Really.
> 
> The whole point of oom-killer is to kill the most appropriate task to
> free the memory. And while task is selected system-wide and some
> tunables are added to tweak the behaviour local to some subsystems, this
> cpuset feature is hardcoded into the selection algorithm.

Of course, because the oom killer must be aware that tasks in disjoint 
cpusets are more likely than not to result in no memory freeing for 
current's subsequent allocations.

> And when some tunable starts doing own calculation, behaviour of this
> hardcoded feature changes.
> 

Yes, it is possible to elevate oom_adj scores to override the cpuset 
preference.  That's actually intended since it is now possible for the 
administrator to specify that, against the belief of the kernel, that 
killing a task will free memory in these cpuset-constrained ooms.  That's 
probably because it has either been moved to a different cpuset or its set 
of allowable nodes is dynamic.

> > Then the scope of this new cgroup is restricted to not being used with 
> > cpusets that could oom.
> 
> These are perpendicular tasks - cpusets limit one area of the oom
> handling, cgroup order - another. Some people needs cpusets, others want
> cgroups. cpusets are not something exceptional so that only they have to
> be taken into account when doing system-wide operation like OOM
> condition handling.
> 

A cpuset is a cgroup.  If I am using cpusets, this patch fails to 
adequately allow me to describe my oom preferences for both 
cpuset-constrained ooms and global unconstrained ooms, which is a major 
drawback.

I would encourage you to look at the per-cgroup oom notifier patch[*] that 
defers most of these decisions to userspace.  Given your interest in 
priority based oom preferences as exhibited by your oom_victim patch, I 
think you'll find it of interest since it allows you much greater 
flexibility than you could ever hope for from the kernel's heuristics.

 [*] http://marc.info/?l=linux-mm&m=122575082227252

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC] [PATCH] Cgroup based OOM killer controller
  2009-01-22 10:27                 ` David Rientjes
@ 2009-01-22 13:21                   ` Evgeniy Polyakov
  2009-01-22 20:28                     ` David Rientjes
  2009-01-27 23:55                     ` Paul Menage
  2009-01-23  9:45                   ` Nikanth Karthikesan
  1 sibling, 2 replies; 63+ messages in thread
From: Evgeniy Polyakov @ 2009-01-22 13:21 UTC (permalink / raw)
  To: David Rientjes
  Cc: Nikanth Karthikesan, Andrew Morton, Alan Cox, linux-kernel,
	Linus Torvalds, Chris Snook, Arve Hjønnevåg,
	Paul Menage, containers

On Thu, Jan 22, 2009 at 02:27:19AM -0800, David Rientjes (rientjes@google.com) wrote:
> > The whole point of oom-killer is to kill the most appropriate task to
> > free the memory. And while task is selected system-wide and some
> > tunables are added to tweak the behaviour local to some subsystems, this
> > cpuset feature is hardcoded into the selection algorithm.
> 
> Of course, because the oom killer must be aware that tasks in disjoint 
> cpusets are more likely than not to result in no memory freeing for 
> current's subsequent allocations.

And if we replace cpuset with cgroup (or anything else), nothing
changes, so why this was made so special?

> > And when some tunable starts doing own calculation, behaviour of this
> > hardcoded feature changes.
> > 
> 
> Yes, it is possible to elevate oom_adj scores to override the cpuset 
> preference.  That's actually intended since it is now possible for the 
> administrator to specify that, against the belief of the kernel, that 
> killing a task will free memory in these cpuset-constrained ooms.  That's 
> probably because it has either been moved to a different cpuset or its set 
> of allowable nodes is dynamic.

Yes, admin has to be aware of some obscure inner kernel policy which he
may not be using to be able to tune system-wide behaviour... Right now
it is not even documented :)

> > These are perpendicular tasks - cpusets limit one area of the oom
> > handling, cgroup order - another. Some people needs cpusets, others want
> > cgroups. cpusets are not something exceptional so that only they have to
> > be taken into account when doing system-wide operation like OOM
> > condition handling.
> 
> A cpuset is a cgroup.  If I am using cpusets, this patch fails to 
> adequately allow me to describe my oom preferences for both 
> cpuset-constrained ooms and global unconstrained ooms, which is a major 
> drawback.
> 
> I would encourage you to look at the per-cgroup oom notifier patch[*] that 
> defers most of these decisions to userspace.  Given your interest in 
> priority based oom preferences as exhibited by your oom_victim patch, I 
> think you'll find it of interest since it allows you much greater 
> flexibility than you could ever hope for from the kernel's heuristics.
> 
>  [*] http://marc.info/?l=linux-mm&m=122575082227252

Having userspace to decide which task to kill may not work in some cases
at all (when task is swapped and we need to kill someone to get the mem
to swap out the task, which will make that decision).

+			/* If we're planning to retry, we should wake
+			 * up any userspace waiter in order to let it
+			 * handle the OOM
+			 */
+			wake_up_all(&cs->oom_wait);

You must be kidding :)
If by the 'userspace' you do not mean special kernel thread, but in
this case there is no difference compared to existing in-kernel code.

At OOM time there is no userspace. We can not wakeup anyone or send (and
expect it will be received) some notification. The only alive entity in
the system is the kernel, who has to decide who must be killed based on
some data provided by administrator (or by default with some
heuristics).

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC] [PATCH] Cgroup based OOM killer controller
  2009-01-22 13:21                   ` Evgeniy Polyakov
@ 2009-01-22 20:28                     ` David Rientjes
  2009-01-22 21:06                       ` Evgeniy Polyakov
  2009-01-27 23:55                     ` Paul Menage
  1 sibling, 1 reply; 63+ messages in thread
From: David Rientjes @ 2009-01-22 20:28 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: Nikanth Karthikesan, Andrew Morton, Alan Cox, linux-kernel,
	Linus Torvalds, Chris Snook, Arve Hjønnevåg,
	Paul Menage, containers

On Thu, 22 Jan 2009, Evgeniy Polyakov wrote:

> > Of course, because the oom killer must be aware that tasks in disjoint 
> > cpusets are more likely than not to result in no memory freeing for 
> > current's subsequent allocations.
> 
> And if we replace cpuset with cgroup (or anything else), nothing
> changes, so why this was made so special?
> 

I don't know what you're talking about, cpusets are simply a client of the 
cgroup interface.

The problem that I'm identifying with this change is that the 
user-specified priority (in terms of oom.victim settings) conflicts when 
the kernel encounters a system-wide unconstrained oom versus a 
cpuset-constrained oom.

The oom.victim priority may well be valid for a system-wide condition 
where a specific aggregate of tasks are less critical than others and 
would be preferred to be killed.  But that doesn't help in a 
cpuset-constrained oom on the same machine if it doesn't lead to future 
memory freeing.

Cpusets are special in this regard because the oom killer's heuristics in 
this case are based on a function of the oom triggering task, current.  We 
check for tasks sharing current->mems_allowed so that we are freeing 
memory that the task can eventually use.  So this new oom cgroup cannot be 
used on any machine with one or more cpusets that may oom without the 
possibility of needlessly killing tasks.

> Having userspace to decide which task to kill may not work in some cases
> at all (when task is swapped and we need to kill someone to get the mem
> to swap out the task, which will make that decision).
> 

Yes, and it's always possible for userspace to defer back to the kernel in 
such cases.  But in the majority of cases that you seem to be interested 
in, this type of notification system seems like it would work quite well:

 - to replace your oom_victim patch, your userspace handler could simply
   issue a SIGKILL to the chosen job in place of the oom killer.  The 
   added benefit here is that your userspace handler could actually 
   support a prioritized list so if one application isn't running, it
   could check another, and another, etc., and

 - to replace this oom.victim patch, the userspace handler could check
   for the presence of the aggregate of tasks that would have appeared
   attached to the new cgroup and issue the same SIGKILL.

> +			/* If we're planning to retry, we should wake
> +			 * up any userspace waiter in order to let it
> +			 * handle the OOM
> +			 */
> +			wake_up_all(&cs->oom_wait);
> 
> You must be kidding :)
> If by the 'userspace' you do not mean special kernel thread, but in
> this case there is no difference compared to existing in-kernel code.
> 

That wakes up any userspace notifier that you've attached to the cgroup 
representing the tasks you're interested in controlling oom responses for.  
Your userspace application can then effectively take the place of the oom 
killer by expanding a cpuset's memory, elevating a memory controller 
reservation, killing current, or using its own heuristics to respond to 
the problem.

But we should probably discuss the implementation of the per-cgroup oom 
notifier in the thread dedicated to it, not this one.

> At OOM time there is no userspace. We can not wakeup anyone or send (and
> expect it will be received) some notification. The only alive entity in
> the system is the kernel, who has to decide who must be killed based on
> some data provided by administrator (or by default with some
> heuristics).
> 

This is completely and utterly wrong.  If you had read the code 
thoroughly, you would see that the page allocation is deferred until 
userspace has the opportunity to respond.  That time in deferral isn't 
absurd, it would take time for the oom killer's exiting task to free 
memory anyway.  The change simply allows a delay between a page 
allocation failure and invoking the oom killer.

It will even work on UP systems since the page allocator has multiple 
reschedule points where a dedicated or monitoring application attached to 
the cgroup could add or free memory itself so the allocation succeeds when 
current is rescheduled.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC] [PATCH] Cgroup based OOM killer controller
  2009-01-22 20:28                     ` David Rientjes
@ 2009-01-22 21:06                       ` Evgeniy Polyakov
  2009-01-22 21:35                         ` David Rientjes
  0 siblings, 1 reply; 63+ messages in thread
From: Evgeniy Polyakov @ 2009-01-22 21:06 UTC (permalink / raw)
  To: David Rientjes
  Cc: Nikanth Karthikesan, Andrew Morton, Alan Cox, linux-kernel,
	Linus Torvalds, Chris Snook, Arve Hjønnevåg,
	Paul Menage, containers

On Thu, Jan 22, 2009 at 12:28:45PM -0800, David Rientjes (rientjes@google.com) wrote:
> I don't know what you're talking about, cpusets are simply a client of the 
> cgroup interface.

Only the fact that cpusets have _very_ special meaning in the oom-killer
codepath, while it should be just another tunable (if it should be
special code at all at the first place, why there were no objection and
argument, that tasks could have different oom_adj?).

> The problem that I'm identifying with this change is that the 
> user-specified priority (in terms of oom.victim settings) conflicts when 
> the kernel encounters a system-wide unconstrained oom versus a 
> cpuset-constrained oom.
> 
> The oom.victim priority may well be valid for a system-wide condition 
> where a specific aggregate of tasks are less critical than others and 
> would be preferred to be killed.  But that doesn't help in a 
> cpuset-constrained oom on the same machine if it doesn't lead to future 
> memory freeing.

And that is The Main Point. Not to help some subsystem, but be useful
for others. It is different tunable. It has really nothing with cpusets.
It is not aimed to help or make things worse for the cpusets. I even do
not understand why this very specific case not used by the wast majority
of the users was even rised in the discussion of the system-wide
oom-killer bahaviour and its more precise cgroup tunable.

This is different. It has its own tunable, which is used when cpusets
are enabled. Discussed tunable should be used at its own time. Admin has
a right to decide how he wants to tune oom-killer behaviour. And the
only point I hear, is that something does not help when we have cpusets
enabled.

> Cpusets are special in this regard because the oom killer's heuristics in 
> this case are based on a function of the oom triggering task, current.  We 
> check for tasks sharing current->mems_allowed so that we are freeing 
> memory that the task can eventually use.  So this new oom cgroup cannot be 
> used on any machine with one or more cpusets that may oom without the 
> possibility of needlessly killing tasks.

Then do not use anything else when cpusets are enabled.
Do not compile SuperH cpu support when you do not have such processes.
Do not enable iscsi support if you do not need this.
Do not enable cgroup oom-killer tunable when you want cpuset.

> > Having userspace to decide which task to kill may not work in some cases
> > at all (when task is swapped and we need to kill someone to get the mem
> > to swap out the task, which will make that decision).
> > 
> 
> Yes, and it's always possible for userspace to defer back to the kernel in 
> such cases.  But in the majority of cases that you seem to be interested 
> in, this type of notification system seems like it would work quite well:
> 
>  - to replace your oom_victim patch, your userspace handler could simply
>    issue a SIGKILL to the chosen job in place of the oom killer.  The 
>    added benefit here is that your userspace handler could actually 
>    support a prioritized list so if one application isn't running, it
>    could check another, and another, etc., and
> 
>  - to replace this oom.victim patch, the userspace handler could check
>    for the presence of the aggregate of tasks that would have appeared
>    attached to the new cgroup and issue the same SIGKILL.

Besides the fact that it is possible that userspace will not even wake
up. Since to wake it up we have to swap it out, which in turn needs some
memory, which we do not have, since we are in, hmm, Out-Of-Memory
condition.

There may be cases when this will work, I do not argue. But so far the
most common case I imagine is when it will not be able to wake up :)

> > If by the 'userspace' you do not mean special kernel thread, but in
> > this case there is no difference compared to existing in-kernel code.
> 
> That wakes up any userspace notifier that you've attached to the cgroup 
> representing the tasks you're interested in controlling oom responses for.  
> Your userspace application can then effectively take the place of the oom 
> killer by expanding a cpuset's memory, elevating a memory controller 
> reservation, killing current, or using its own heuristics to respond to 
> the problem.
> 
> But we should probably discuss the implementation of the per-cgroup oom 
> notifier in the thread dedicated to it, not this one.

Please explain, how task will make that decision (like freeing some
ram), when it can not be even swapped out? It is a great idea to be able
to monitor memory usage, but relying on that just does not work.

> > At OOM time there is no userspace. We can not wakeup anyone or send (and
> > expect it will be received) some notification. The only alive entity in
> > the system is the kernel, who has to decide who must be killed based on
> > some data provided by administrator (or by default with some
> > heuristics).
> > 
> 
> This is completely and utterly wrong.  If you had read the code 
> thoroughly, you would see that the page allocation is deferred until 
> userspace has the opportunity to respond.  That time in deferral isn't 
> absurd, it would take time for the oom killer's exiting task to free 
> memory anyway.  The change simply allows a delay between a page 
> allocation failure and invoking the oom killer.
> 
> It will even work on UP systems since the page allocator has multiple 
> reschedule points where a dedicated or monitoring application attached to 
> the cgroup could add or free memory itself so the allocation succeeds when 
> current is rescheduled.

Yup, it will wait for some timeout while userspace will respond. The
problem is that it will not in some cases. And so far I think (and I may
be wrong), that it will not be able to respond most of the time.

In every oom-condition I worked with, system was completely unusable
before oom-killer started to berserk several processes. They were not
able to allocate some data to read network input (atomic context) and to
even read a console input.

What I want to say is the simple matter of the iteraction within the
kernel oom-handler. Let it be as smart or as stupid as admin decided. I
have no objection against letting it to get some knowledge from the
outsude of its own world (like wait for userspace to respond). I vote by
both hands for implementing some kind of notification mechanism which
could be used by the userspace to get oom-notifications in particular
and to watch its (or system-wide) memory usage in general.

But all those things should be additional tunables. Which do not even
remotely, even theoretically allow a lockup. So userspace notifications
are great as long as this is not the only way to have a progress in the
OOM situation. And the highest priority in this case should be kernel
and its tunables (like proposed cgroup patch), since relying on
userspace in this case may be fatal. But of course it can be enabled and
contribute.

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC] [PATCH] Cgroup based OOM killer controller
  2009-01-22 21:06                       ` Evgeniy Polyakov
@ 2009-01-22 21:35                         ` David Rientjes
  2009-01-22 22:04                           ` Evgeniy Polyakov
  0 siblings, 1 reply; 63+ messages in thread
From: David Rientjes @ 2009-01-22 21:35 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: Nikanth Karthikesan, Andrew Morton, Alan Cox, linux-kernel,
	Linus Torvalds, Chris Snook, Arve Hjønnevåg,
	Paul Menage, containers

On Fri, 23 Jan 2009, Evgeniy Polyakov wrote:

> Only the fact that cpusets have _very_ special meaning in the oom-killer
> codepath, while it should be just another tunable (if it should be
> special code at all at the first place, why there were no objection and
> argument, that tasks could have different oom_adj?).
> 

That's because the oom killer needs to kill a task that will allow future 
memory freeing on current->mems_allowed in these cases.  Appropriate 
oom_adj scores can override that preference, it's up to the administrator.

> > The oom.victim priority may well be valid for a system-wide condition 
> > where a specific aggregate of tasks are less critical than others and 
> > would be preferred to be killed.  But that doesn't help in a 
> > cpuset-constrained oom on the same machine if it doesn't lead to future 
> > memory freeing.
> 
> And that is The Main Point. Not to help some subsystem, but be useful
> for others. It is different tunable. It has really nothing with cpusets.

It doesn't work with any system that is running with cpusets, that's the 
problem with accepting such code.  There needs to be clear and convincing 
evidence that a userspace oom notifier, which could handle all possible 
cases and is much more robust and flexible, would not suffice for such 
situations.

> Then do not use anything else when cpusets are enabled.
> Do not compile SuperH cpu support when you do not have such processes.
> Do not enable iscsi support if you do not need this.
> Do not enable cgroup oom-killer tunable when you want cpuset.
> 

How do I prioritize oom killing if my system is running cpusets, then?  
Your answer is that I can't, since this new cgroup doesn't work with them; 
so this solution is incomplete.  The oom killer notification patch[*] is 
much more functional and does allow this problem to be solved on all types 
of systems because the implementation of such a handler is left to the 
administrator.

 [*] http://marc.info/?l=linux-mm&m=122575082227252

> Please explain, how task will make that decision (like freeing some
> ram), when it can not be even swapped out? It is a great idea to be able
> to monitor memory usage, but relying on that just does not work.
> 

The userspace handler is a schedulable task resident in memory that, with 
any sane implementation, would not require additional memory when running.

> > This is completely and utterly wrong.  If you had read the code 
> > thoroughly, you would see that the page allocation is deferred until 
> > userspace has the opportunity to respond.  That time in deferral isn't 
> > absurd, it would take time for the oom killer's exiting task to free 
> > memory anyway.  The change simply allows a delay between a page 
> > allocation failure and invoking the oom killer.
> > 
> > It will even work on UP systems since the page allocator has multiple 
> > reschedule points where a dedicated or monitoring application attached to 
> > the cgroup could add or free memory itself so the allocation succeeds when 
> > current is rescheduled.
> 
> Yup, it will wait for some timeout while userspace will respond. The
> problem is that it will not in some cases. And so far I think (and I may
> be wrong), that it will not be able to respond most of the time.
> 

We've used it quite successfully for over a year, so your hypothesis in 
this case may be less than accurate.

> In every oom-condition I worked with, system was completely unusable
> before oom-killer started to berserk several processes. They were not
> able to allocate some data to read network input (atomic context) and to
> even read a console input.
> 

If your system failed to allocate GFP_ATOMIC memory, then those are page 
allocation failures in the buddy allocator and will actually never invoke 
the oom killer since it cannot sleep in that context.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC] [PATCH] Cgroup based OOM killer controller
  2009-01-22 21:35                         ` David Rientjes
@ 2009-01-22 22:04                           ` Evgeniy Polyakov
  2009-01-22 22:28                             ` David Rientjes
  0 siblings, 1 reply; 63+ messages in thread
From: Evgeniy Polyakov @ 2009-01-22 22:04 UTC (permalink / raw)
  To: David Rientjes
  Cc: Nikanth Karthikesan, Andrew Morton, Alan Cox, linux-kernel,
	Linus Torvalds, Chris Snook, Arve Hjønnevåg,
	Paul Menage, containers

On Thu, Jan 22, 2009 at 01:35:13PM -0800, David Rientjes (rientjes@google.com) wrote:
> > And that is The Main Point. Not to help some subsystem, but be useful
> > for others. It is different tunable. It has really nothing with cpusets.
> 
> It doesn't work with any system that is running with cpusets, that's the 
> problem with accepting such code.  There needs to be clear and convincing 
> evidence that a userspace oom notifier, which could handle all possible 
> cases and is much more robust and flexible, would not suffice for such 
> situations.

I showed the case when it does not work at all. And then found (in this
mail), that task (part) has to be present in the memory, which means it
will be locked, which in turns will not work with the system which
already locked its range allowed by the limits.

And returning to the oom_adj and cpusets tunables. Why any new process
started in given cpuset can not be tuned by external application or some
script to have bigger/smaller oom_adj parameter? :)

> > Then do not use anything else when cpusets are enabled.
> > Do not compile SuperH cpu support when you do not have such processes.
> > Do not enable iscsi support if you do not need this.
> > Do not enable cgroup oom-killer tunable when you want cpuset.
> > 
> 
> How do I prioritize oom killing if my system is running cpusets, then?  

Just the way it works right now :)
You do not object against patches which improve superh cpu support
with the argument, that it is not possible to enable that feature,
when system does not have superh cpu.

> Your answer is that I can't, since this new cgroup doesn't work with them; 
> so this solution is incomplete.  The oom killer notification patch[*] is 
> much more functional and does allow this problem to be solved on all types 
> of systems because the implementation of such a handler is left to the 
> administrator.
> 
>  [*] http://marc.info/?l=linux-mm&m=122575082227252

Let me draw the line in this discussion: people propose a patches to tweak
the oom-killer behaviour in some situation. You say that this does not
cover another case, and thus has to be dropped. You instead propose
another solution, which does not work either in some cases (IMO more
wider).

It may be a good idea, and likely it is (at least its notification
part), but please look a bit wider, since in cases when it does not work,
still there are lots of systems which want to have a proper OOM tunables.

> > Please explain, how task will make that decision (like freeing some
> > ram), when it can not be even swapped out? It is a great idea to be able
> > to monitor memory usage, but relying on that just does not work.
> > 
> 
> The userspace handler is a schedulable task resident in memory that, with 
> any sane implementation, would not require additional memory when running.

And what happens when it can not lock the memory because of the limits?

> > Yup, it will wait for some timeout while userspace will respond. The
> > problem is that it will not in some cases. And so far I think (and I may
> > be wrong), that it will not be able to respond most of the time.
> > 
> 
> We've used it quite successfully for over a year, so your hypothesis in 
> this case may be less than accurate.

You also used oom_adj which was not documented. Is this an argument?

> > In every oom-condition I worked with, system was completely unusable
> > before oom-killer started to berserk several processes. They were not
> > able to allocate some data to read network input (atomic context) and to
> > even read a console input.
> > 
> 
> If your system failed to allocate GFP_ATOMIC memory, then those are page 
> allocation failures in the buddy allocator and will actually never invoke 
> the oom killer since it cannot sleep in that context.

Hmm, you likely missed the part in the last line. And in the first two,
where I said that before oom-killer started (and killed some processes,
usually not those which were need, but its a different story). System
just did not have a free memory to have _any_ progress neither in atomic
context, nor in process, so it had to invoke an oom-killer.

In that case userspace just can not reply or even awake. While kernel is
effectively alive if it does not need to allocate a memory. And could
kill some process to free up the ram.

Userspace notifications are great, no problem, but do not rely on them,
since there is a huge world outside the case it works in, which will be
quite unhappy when systems start freezing because oom-killer relied on
the userspace.

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC] [PATCH] Cgroup based OOM killer controller
  2009-01-22 22:04                           ` Evgeniy Polyakov
@ 2009-01-22 22:28                             ` David Rientjes
  2009-01-22 22:53                               ` Evgeniy Polyakov
  0 siblings, 1 reply; 63+ messages in thread
From: David Rientjes @ 2009-01-22 22:28 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: Nikanth Karthikesan, Andrew Morton, Alan Cox, linux-kernel,
	Linus Torvalds, Chris Snook, Arve Hjønnevåg,
	Paul Menage, containers

On Fri, 23 Jan 2009, Evgeniy Polyakov wrote:

> I showed the case when it does not work at all. And then found (in this
> mail), that task (part) has to be present in the memory, which means it
> will be locked, which in turns will not work with the system which
> already locked its range allowed by the limits.
> 

Yes, a userspace oom handler must be sanely implemented.

> And returning to the oom_adj and cpusets tunables. Why any new process
> started in given cpuset can not be tuned by external application or some
> script to have bigger/smaller oom_adj parameter? :)
> 

oom_adj scores are separate from the hard-coded and very fundamental 
heuristic that we should kill a task that has memory allocated on nodes we 
are attempting to free.  Anything else would just be stupid.

> > How do I prioritize oom killing if my system is running cpusets, then?  
> 
> Just the way it works right now :)
> You do not object against patches which improve superh cpu support
> with the argument, that it is not possible to enable that feature,
> when system does not have superh cpu.
> 

No, I object against any patch that isn't a complete solution to the 
problem being presented.  It's purely a matter of good software 
engineering practices and in the interest of a long-term maintainable 
kernel.

> > The userspace handler is a schedulable task resident in memory that, with 
> > any sane implementation, would not require additional memory when running.
> 
> And what happens when it can not lock the memory because of the limits?
> 

Any sane handler for responding to oom conditions will not require 
additional memory from nodes that are under oom, whether that includes all 
system memory or a subset, if it is attached to the oom notifier.

> Hmm, you likely missed the part in the last line. And in the first two,
> where I said that before oom-killer started (and killed some processes,
> usually not those which were need, but its a different story). System
> just did not have a free memory to have _any_ progress neither in atomic
> context, nor in process, so it had to invoke an oom-killer.
> 

The page allocator cannot invoke the oom killer in atomic context, so this 
would be happening in process content where it can sleep.  The userspace 
oom handler will wake up, handle the condition either by relaxing hardwall 
restrictions for either the memory controller or cpusets, or killing a 
task itself unless it chooses to defer to the kernel.

> In that case userspace just can not reply or even awake. While kernel is
> effectively alive if it does not need to allocate a memory. And could
> kill some process to free up the ram.
> 

Wrong, oom conditions do not preempt task scheduling.

> Userspace notifications are great, no problem, but do not rely on them,
> since there is a huge world outside the case it works in, which will be
> quite unhappy when systems start freezing because oom-killer relied on
> the userspace.
> 

I'm quite certain you've spent more time writing emails to me than merging 
the patch and testing its possibilities, given your lack of understanding 
of its very basic concepts.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC] [PATCH] Cgroup based OOM killer controller
  2009-01-22 22:28                             ` David Rientjes
@ 2009-01-22 22:53                               ` Evgeniy Polyakov
  2009-01-22 23:25                                 ` Evgeniy Polyakov
  0 siblings, 1 reply; 63+ messages in thread
From: Evgeniy Polyakov @ 2009-01-22 22:53 UTC (permalink / raw)
  To: David Rientjes
  Cc: Nikanth Karthikesan, Andrew Morton, Alan Cox, linux-kernel,
	Linus Torvalds, Chris Snook, Arve Hjønnevåg,
	Paul Menage, containers

On Thu, Jan 22, 2009 at 02:28:11PM -0800, David Rientjes (rientjes@google.com) wrote:
> > And returning to the oom_adj and cpusets tunables. Why any new process
> > started in given cpuset can not be tuned by external application or some
> > script to have bigger/smaller oom_adj parameter? :)
> 
> oom_adj scores are separate from the hard-coded and very fundamental 
> heuristic that we should kill a task that has memory allocated on nodes we 
> are attempting to free.  Anything else would just be stupid.

The right answer is that oom-adj scores are global, while it is required
in some cases to have local groups processes can be arranged into so
that local tunables could be used. cpusets just happend to be those
groups, while it could be in turn wider cgroups with the proposed
or similar patch.

Or those groups could be processes started with special name :)
It just happend that cpusets were the first, so you can say,
that they are so special, while actually they are not.

We already have this, and history does not tolerate conjunctives,
so let's drop this part :)

> No, I object against any patch that isn't a complete solution to the 
> problem being presented.  It's purely a matter of good software 
> engineering practices and in the interest of a long-term maintainable 
> kernel.

Is this double standards, since your proposal, if implemented a
little bit different, does not solve the problem at all. And you argue
that uncomplete solution should not be merged. Please note, that I do
not object against notifications, but just want to have a system which
will work in all cases or at least in the majority of them.

> > > The userspace handler is a schedulable task resident in memory that, with 
> > > any sane implementation, would not require additional memory when running.
> > 
> > And what happens when it can not lock the memory because of the limits?
> 
> Any sane handler for responding to oom conditions will not require 
> additional memory from nodes that are under oom, whether that includes all 
> system memory or a subset, if it is attached to the oom notifier.

This reminds me a thread, where modporbe stuck because initrd code is
actually not allowed to write something to the console, and while it
happend that bug is somewhere else, the same argument was rised about
such fun limitations. What about just asking userspace developers not to
write a code, which may lead to OOM conditions :)

Even not arguing 'sane implementation' case, what about process which
happend to be in swap? Its oom handler has to be locked in the memory to
be sucessfully invoked, but this does not work if all allowed to be
locked memory is used.

> > Hmm, you likely missed the part in the last line. And in the first two,
> > where I said that before oom-killer started (and killed some processes,
> > usually not those which were need, but its a different story). System
> > just did not have a free memory to have _any_ progress neither in atomic
> > context, nor in process, so it had to invoke an oom-killer.
> > 
> 
> The page allocator cannot invoke the oom killer in atomic context, so this 
> would be happening in process content where it can sleep.  The userspace 
> oom handler will wake up, handle the condition either by relaxing hardwall 
> restrictions for either the memory controller or cpusets, or killing a 
> task itself unless it chooses to defer to the kernel.

Seems like you just do not read what I wrote. Please do. At least the
last sentence :) And what I wrote about swapped and locked memory.

> > In that case userspace just can not reply or even awake. While kernel is
> > effectively alive if it does not need to allocate a memory. And could
> > kill some process to free up the ram.
> 
> Wrong, oom conditions do not preempt task scheduling.

I actually did not tell anything about task scheduling.

I said that in the case above, if oom-killer would depend on the
userspace it could not make a progress, since oom-handlers would
not be able to wake up if swapped, and in this case oom-killer
would need just to select one by its own hueristics.

> > Userspace notifications are great, no problem, but do not rely on them,
> > since there is a huge world outside the case it works in, which will be
> > quite unhappy when systems start freezing because oom-killer relied on
> > the userspace.
> 
> I'm quite certain you've spent more time writing emails to me than merging 
> the patch and testing its possibilities, given your lack of understanding 
> of its very basic concepts.

How cute :)
Any other technical arguments of the same strength?

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC] [PATCH] Cgroup based OOM killer controller
  2009-01-22 22:53                               ` Evgeniy Polyakov
@ 2009-01-22 23:25                                 ` Evgeniy Polyakov
  0 siblings, 0 replies; 63+ messages in thread
From: Evgeniy Polyakov @ 2009-01-22 23:25 UTC (permalink / raw)
  To: David Rientjes
  Cc: Nikanth Karthikesan, Andrew Morton, Alan Cox, linux-kernel,
	Linus Torvalds, Chris Snook, Arve Hjønnevåg,
	Paul Menage, containers

On Fri, Jan 23, 2009 at 01:53:04AM +0300, Evgeniy Polyakov (zbr@ioremap.net) wrote:
> > I'm quite certain you've spent more time writing emails to me than merging 
> > the patch and testing its possibilities, given your lack of understanding 
> > of its very basic concepts.
> 
> How cute :)
> Any other technical arguments of the same strength?

Its my turn now for the professional statements, let's start with this
technical side: you wanted to send a signal for the process to be killed,
but this will need
1. to allocate a signal, which will deadlock
2. no need to do this for sigill, but it will not work if process is in
unkillable state, while oom-killer clears iirc

Now to the oom-handler: if it will want to free some memory, it will
have to call a syscall with some pointer, which in turn may be in the
swapped area, so handler will be locked.

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC] [PATCH] Cgroup based OOM killer controller
  2009-01-22 13:21                   ` Evgeniy Polyakov
  2009-01-22 20:28                     ` David Rientjes
@ 2009-01-27 23:55                     ` Paul Menage
  1 sibling, 0 replies; 63+ messages in thread
From: Paul Menage @ 2009-01-27 23:55 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: David Rientjes, Nikanth Karthikesan, Andrew Morton, Alan Cox,
	linux-kernel, Linus Torvalds, Chris Snook,
	Arve Hjønnevåg, containers

On Thu, Jan 22, 2009 at 5:21 AM, Evgeniy Polyakov <zbr@ioremap.net> wrote:
> Having userspace to decide which task to kill may not work in some cases
> at all (when task is swapped and we need to kill someone to get the mem
> to swap out the task, which will make that decision).

That's true in the case of a global OOM. In the case of a local OOM
(caused by memory limits applied via the cgroup memory controller, or
NUMA affinity enforcement applied by cpusets) the userspace handler
can be in a different domain which isn't OOM, and be quite capable of
figuring out who to kill. In our particular use case, it can happen
that a high-priority job hits its memory limits and triggers an OOM,
which causes the system controller daemon to kill some lower-priority
job and reassign some memory from that now-dead low-priority job (and
thus prevent the OOM from killing any process in the original cgroup).
This is something that would be very hard to express via kernel
policy.

Paul

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC] [PATCH] Cgroup based OOM killer controller
  2009-01-22 10:27                 ` David Rientjes
  2009-01-22 13:21                   ` Evgeniy Polyakov
@ 2009-01-23  9:45                   ` Nikanth Karthikesan
  2009-01-23 10:33                     ` David Rientjes
  1 sibling, 1 reply; 63+ messages in thread
From: Nikanth Karthikesan @ 2009-01-23  9:45 UTC (permalink / raw)
  To: David Rientjes
  Cc: Evgeniy Polyakov, Andrew Morton, Alan Cox, linux-kernel,
	Linus Torvalds, Chris Snook, Arve Hjønnevåg,
	Paul Menage, containers

On Thursday 22 January 2009 15:57:19 David Rientjes wrote:
> On Thu, 22 Jan 2009, Evgeniy Polyakov wrote:
> > > In an exclusive cpuset, a task's memory is restricted to a set of mems
> > > that the administrator has designated.  If it is oom, the kernel must
> > > free memory on those nodes or the next allocation will again trigger an
> > > oom (leading to a needlessly killed task that was in a disjoint
> > > cpuset).
> > >
> > > Really.
> >
> > The whole point of oom-killer is to kill the most appropriate task to
> > free the memory. And while task is selected system-wide and some
> > tunables are added to tweak the behaviour local to some subsystems, this
> > cpuset feature is hardcoded into the selection algorithm.
>
> Of course, because the oom killer must be aware that tasks in disjoint
> cpusets are more likely than not to result in no memory freeing for
> current's subsequent allocations.
>

Yes, the problem is cpuset does not track the tasks which has allocated from 
this node - who has either moved or changed it set of allowable nodes. And 
because of that it does not limit oom killing to the tasks with in those tasks 
and could kill some innocent tasks at times.

As it is unable to take deterministic decision as memcg does, it plays with 
badness value and only suggests but does not restricts within those tasks that 
has to be killed.

This bug is present even without this patch.

> > And when some tunable starts doing own calculation, behaviour of this
> > hardcoded feature changes.
>
> Yes, it is possible to elevate oom_adj scores to override the cpuset
> preference.  That's actually intended since it is now possible for the
> administrator to specify that, against the belief of the kernel, that
> killing a task will free memory in these cpuset-constrained ooms.  That's
> probably because it has either been moved to a different cpuset or its set
> of allowable nodes is dynamic.
>

This patch adds one more easier way for the administrator to over-ride.

> > > Then the scope of this new cgroup is restricted to not being used with
> > > cpusets that could oom.
> >
> > These are perpendicular tasks - cpusets limit one area of the oom
> > handling, cgroup order - another. Some people needs cpusets, others want
> > cgroups. cpusets are not something exceptional so that only they have to
> > be taken into account when doing system-wide operation like OOM
> > condition handling.
>
> A cpuset is a cgroup.  If I am using cpusets, this patch fails to
> adequately allow me to describe my oom preferences for both
> cpuset-constrained ooms and global unconstrained ooms, which is a major
> drawback.
>

The current cpuset oom handling has to be fixed and the exact problem of 
killing innocent processes exists even without the oom-controller.

Thanks
Nikanth

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC] [PATCH] Cgroup based OOM killer controller
  2009-01-23  9:45                   ` Nikanth Karthikesan
@ 2009-01-23 10:33                     ` David Rientjes
  2009-01-23 14:56                       ` Nikanth Karthikesan
  0 siblings, 1 reply; 63+ messages in thread
From: David Rientjes @ 2009-01-23 10:33 UTC (permalink / raw)
  To: Nikanth Karthikesan
  Cc: Evgeniy Polyakov, Andrew Morton, Alan Cox, linux-kernel,
	Linus Torvalds, Chris Snook, Arve Hjønnevåg,
	Paul Menage, containers

On Fri, 23 Jan 2009, Nikanth Karthikesan wrote:

> > Of course, because the oom killer must be aware that tasks in disjoint
> > cpusets are more likely than not to result in no memory freeing for
> > current's subsequent allocations.
> >
> 
> Yes, the problem is cpuset does not track the tasks which has allocated from 
> this node - who has either moved or changed it set of allowable nodes. And 
> because of that it does not limit oom killing to the tasks with in those tasks 
> and could kill some innocent tasks at times.
> 

Right, the logic to prefer tasks that share the same set of allowable 
nodes as the oom-triggering task is implemented in badness() and being in 
a completely disjoint cpuset does not specifically exclude a task from 
being chosen as the oom killer's victim.  That's because, as you said, it 
could have allocated memory elsewhere before changing cpusets or its set 
of allowable mems.

> As it is unable to take deterministic decision as memcg does, it plays with 
> badness value and only suggests but does not restricts within those tasks that 
> has to be killed.
> 
> This bug is present even without this patch.
> 

It's not a bug, it can actually help in a couple instances:

 - a much larger memory hogging task is identified in a disjoint cpuset
   and the liklihood it has allocated memory elsewhere either previously 
   or atomically, or

 - an administrator tunes the oom_adj value for such a task to describe 
   the above behavior even for smaller tasks and their liklihood to 
   allocate outside of their exclusive cpuset.

> This patch adds one more easier way for the administrator to over-ride.
> 

Yeah, I know.  But the problem with the approach is that it specifies an 
oom priority for both global unconstrained ooms and cpuset-constrained 
ooms.

It's quite possible with your patch to identify an aggregate of tasks that 
should be killed first whenever the system is completely out of memory.  
That's great, and solves your problem.  But that same system cannot 
correctly use cpusets that have the potential to ever oom because your 
patch completely overrides the victim priority and could needlessly kill 
tasks that will not lead to future memory freeing.

That's my objection to the proposal: it doesn't behave appropriately for 
both global unconstrained ooms and cpuset-constrained ooms at the same 
time.

I think if you look at the cgroup oom notifier patch I referred you to, 
you will find it trivial to implement a handler that can issue a SIGKILL 
for an aggregate of tasks and implement the functional equivalent of this 
patch in userspace where it belongs.  It's nearly impossible to code oom 
killer heuristics in the kernel that work for every possible workload and 
configuration without some unfortunate casualties.

> The current cpuset oom handling has to be fixed and the exact problem of 
> killing innocent processes exists even without the oom-controller.
> 

I think the heuristic could be tuned to penalize tasks in disjoint cpusets 
a little more, but its implementation as a factor of the badness score is 
actually very well placed.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC] [PATCH] Cgroup based OOM killer controller
  2009-01-23 10:33                     ` David Rientjes
@ 2009-01-23 14:56                       ` Nikanth Karthikesan
  2009-01-23 20:44                         ` David Rientjes
  2009-01-28  1:00                         ` Paul Menage
  0 siblings, 2 replies; 63+ messages in thread
From: Nikanth Karthikesan @ 2009-01-23 14:56 UTC (permalink / raw)
  To: David Rientjes
  Cc: Evgeniy Polyakov, Andrew Morton, Alan Cox, linux-kernel,
	Linus Torvalds, Chris Snook, Arve Hjønnevåg,
	Paul Menage, containers

On Friday 23 January 2009 16:03:49 David Rientjes wrote:
> On Fri, 23 Jan 2009, Nikanth Karthikesan wrote:
> > > Of course, because the oom killer must be aware that tasks in disjoint
> > > cpusets are more likely than not to result in no memory freeing for
> > > current's subsequent allocations.
> >
> > Yes, the problem is cpuset does not track the tasks which has allocated
> > from this node - who has either moved or changed it set of allowable
> > nodes. And because of that it does not limit oom killing to the tasks
> > with in those tasks and could kill some innocent tasks at times.
>
> Right, the logic to prefer tasks that share the same set of allowable
> nodes as the oom-triggering task is implemented in badness() and being in
> a completely disjoint cpuset does not specifically exclude a task from
> being chosen as the oom killer's victim.  That's because, as you said, it
> could have allocated memory elsewhere before changing cpusets or its set
> of allowable mems.
>
> > As it is unable to take deterministic decision as memcg does, it plays
> > with badness value and only suggests but does not restricts within those
> > tasks that has to be killed.
> >
> > This bug is present even without this patch.
>
> It's not a bug, it can actually help in a couple instances:
>
>  - a much larger memory hogging task is identified in a disjoint cpuset
>    and the liklihood it has allocated memory elsewhere either previously
>    or atomically, or
>
>  - an administrator tunes the oom_adj value for such a task to describe
>    the above behavior even for smaller tasks and their liklihood to
>    allocate outside of their exclusive cpuset.
>

In other instances, It can actually also kill some innocent tasks unless the 
administrator tunes oom_adj, say something like kvm which would have a huge 
memory accounted, but might be from a different node altogether. Killing a 
single vm is killing all of the processes in that OS ;)  Don't you think this 
has to be fixed^Wimproved?

> > This patch adds one more easier way for the administrator to over-ride.
>
> Yeah, I know.  But the problem with the approach is that it specifies an
> oom priority for both global unconstrained ooms and cpuset-constrained
> ooms.
>
> It's quite possible with your patch to identify an aggregate of tasks that
> should be killed first whenever the system is completely out of memory.
> That's great, and solves your problem.  But that same system cannot
> correctly use cpusets that have the potential to ever oom because your
> patch completely overrides the victim priority and could needlessly kill
> tasks that will not lead to future memory freeing.
>
> That's my objection to the proposal: it doesn't behave appropriately for
> both global unconstrained ooms and cpuset-constrained ooms at the same
> time.
>

So you are against specifying order when it is a cpuset-constrained oom. Here 
is a revised version of the patch which adds oom.cpuset_constraint, when set 
to 1 would disable the ordering imposed by this controller for cpuset 
constrained ooms! Will this work for you?

Thanks
Nikanth

From: Nikanth Karthikesan <knikanth@suse.de>

Cgroup based OOM killer controller

Signed-off-by: Nikanth Karthikesan <knikanth@suse.de>

---

This is a container group based approach to override the oom killer selection 
without losing all the benefits of the current oom killer heuristics and 
oom_adj interface.

It adds a tunable oom.victim to the oom cgroup. The oom killer will kill the 
process using the usual badness value but only within the cgroup with the 
maximum value for oom.victim before killing any process from a cgroup with a 
lesser oom.victim number. Oom killing could be disabled by setting 
oom.victim=0.

Difference from oom_adj:
This controller allows users to specify "strict order" for oom kills. While 
oom_adj just works as a hint for the kernel, OOM Killer Controller gives users 
full control to decide the order in which processes should be killed. It is 
very hard to specify oom-kill order for several applications using just 
oom_adj because one needs to adjust oom_adj of various task based on oom_score 
of several tasks which keeps changing.

CPUSET constrained OOM:
Also the tunable oom.cpuset_constrained when enabled, would disable the 
ordering imposed by this controller for cpuset constrained OOMs.

diff --git a/Documentation/cgroups/oom.txt b/Documentation/cgroups/oom.txt
new file mode 100644
index 0000000..772fb41
--- /dev/null
+++ b/Documentation/cgroups/oom.txt
@@ -0,0 +1,34 @@
+OOM Killer controller
+--- ------ ----------
+
+The OOM killer kills the process based on a set of heuristics such that only
+minimum amount of work done will be lost, a large amount of memory would be
+recovered and minimum no of processes are killed.
+
+The user can adjust the score used to select the processes to be killed using
+/proc/<pid>/oom_adj. Giving it a high score will increase the likelihood of 
+this process being killed by the oom-killer.  Valid values are in the range 
+-16 to +15, plus the special value -17, which disables oom-killing altogether
+for that process.
+
+But it is very difficult to suggest an order among tasks to be killed during
+Out Of Memory situation. The OOM Killer controller aids in doing that.
+
+USAGE
+-----
+
+Mount the oom controller by passing 'oom' when mounting cgroups. Echo
+a value in oom.victim file to change the order. The oom killer would kill
+all the processes in a cgroup with a higher oom.victim value before killing a
+process in a cgroup with lower oom.victim value. Among those tasks with same
+oom.victim value, the usual badness heuristics would be applied. The 
+/proc/<pid>/oom_adj still helps adjusting the oom killer score. Also having
+oom.victim = 0 would disable oom killing for the tasks in that cgroup.
+
+Note: If this is used without proper consideration, innocent processes may
+get killed unnecesarily.
+
+CPUSET constrained OOM:
+Setting oom.cpuset_constraint=1 would disable the ordering during a cpuset
+constrained oom. Setting oom.cpuset_constraint=0 would not distinguish
+between a cpuset constrained oom and system wide oom.
diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
index 9c8d31b..6944f99 100644
--- a/include/linux/cgroup_subsys.h
+++ b/include/linux/cgroup_subsys.h
@@ -59,4 +59,8 @@ SUBSYS(freezer)
 SUBSYS(net_cls)
 #endif
 
+#ifdef CONFIG_CGROUP_OOM
+SUBSYS(oom)
+#endif
+
 /* */
diff --git a/include/linux/oomcontrol.h b/include/linux/oomcontrol.h
new file mode 100644
index 0000000..d4c4c72
--- /dev/null
+++ b/include/linux/oomcontrol.h
@@ -0,0 +1,19 @@
+#ifndef _LINUX_OOMCONTROL_H
+#define _LINUX_OOMCONTROL_H
+
+#ifdef CONFIG_CGROUP_OOM
+struct oom_cgroup { 
+	struct cgroup_subsys_state css;
+	/*
+	 * the order to be victimized for this group
+	 */  
+	u64 victim;
+
+	/*
+	 * disable during cpuset constrained oom
+	 */
+	bool *cpuset_constraint;
+};
+
+#endif
+#endif
diff --git a/init/Kconfig b/init/Kconfig
index 2af8382..99ed0de 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -354,6 +354,15 @@ config CGROUP_DEBUG
 
 	  Say N if unsure.
 
+config CGROUP_OOM
+	bool "Oom cgroup subsystem"
+	depends on CGROUPS
+	help
+	  This provides a cgroup subsystem which aids controlling
+	  the order in which tasks whould be killed during
+	  out of memory situations.
+	
+
 config CGROUP_NS
 	bool "Namespace cgroup subsystem"
 	depends on CGROUPS
diff --git a/mm/Makefile b/mm/Makefile
index 72255be..a5d7222 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -33,3 +33,4 @@ obj-$(CONFIG_MIGRATION) += migrate.o
 obj-$(CONFIG_SMP) += allocpercpu.o
 obj-$(CONFIG_QUICKLIST) += quicklist.o
 obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o page_cgroup.o
+obj-$(CONFIG_CGROUP_OOM) += oomcontrol.o 
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 40ba050..555ecb6 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -26,6 +26,7 @@
 #include <linux/module.h>
 #include <linux/notifier.h>
 #include <linux/memcontrol.h>
+#include <linux/oomcontrol.h>
 #include <linux/security.h>
 
 int sysctl_panic_on_oom;
@@ -200,11 +201,15 @@ static inline enum oom_constraint 
constrained_alloc(struct zonelist *zonelist,
  * (not docbooked, we don't want this one cluttering up the manual)
  */
 static struct task_struct *select_bad_process(unsigned long *ppoints,
-						struct mem_cgroup *mem)
+			struct mem_cgroup *mem, int cpuset_constrained)
 {
 	struct task_struct *g, *p;
 	struct task_struct *chosen = NULL;
 	struct timespec uptime;
+#ifdef CONFIG_CGROUP_OOM
+	u64 chosenvictim = 1, taskvictim;
+	bool honour_cpuset_constraint;
+#endif
 	*ppoints = 0;
 
 	do_posix_clock_monotonic_gettime(&uptime);
@@ -257,10 +262,30 @@ static struct task_struct *select_bad_process(unsigned 
long *ppoints,
 			continue;
 
 		points = badness(p, uptime.tv_sec);
+#ifdef CONFIG_CGROUP_OOM
+		taskvictim = (container_of(p->cgroups->subsys[oom_subsys_id],
+						struct oom_cgroup, css))->victim;
+		honour_cpuset_constraint = *(container_of(p->cgroups-
>subsys[oom_subsys_id],
+                                                struct oom_cgroup, css))-
>cpuset_constraint;
+
+		if (taskvictim > chosenvictim ||
+			(((taskvictim == chosenvictim) ||
+				(cpuset_constrained && honour_cpuset_constraint))
+				 && points > *ppoints) ||
+			(taskvictim && !chosen)) {
+
+			chosen = p;
+			*ppoints = points;
+			chosenvictim = taskvictim;
+
+		}
+		
+#else
 		if (points > *ppoints || !chosen) {
 			chosen = p;
 			*ppoints = points;
 		}
+#endif
 	} while_each_thread(g, p);
 
 	return chosen;
@@ -431,7 +456,7 @@ void mem_cgroup_out_of_memory(struct mem_cgroup *mem, 
gfp_t gfp_mask)
 
 	read_lock(&tasklist_lock);
 retry:
-	p = select_bad_process(&points, mem);
+	p = select_bad_process(&points, mem, 0); /* not cpuset constrained */
 	if (PTR_ERR(p) == -1UL)
 		goto out;
 
@@ -513,7 +538,7 @@ void clear_zonelist_oom(struct zonelist *zonelist, gfp_t 
gfp_mask)
 /*
  * Must be called with tasklist_lock held for read.
  */
-static void __out_of_memory(gfp_t gfp_mask, int order)
+static void __out_of_memory(gfp_t gfp_mask, int order, int 
cpuset_constrained)
 {
 	if (sysctl_oom_kill_allocating_task) {
 		oom_kill_process(current, gfp_mask, order, 0, NULL,
@@ -528,7 +553,7 @@ retry:
 		 * Rambo mode: Shoot down a process and hope it solves whatever
 		 * issues we may have.
 		 */
-		p = select_bad_process(&points, NULL);
+		p = select_bad_process(&points, NULL, cpuset_constrained);
 
 		if (PTR_ERR(p) == -1UL)
 			return;
@@ -569,7 +594,8 @@ void pagefault_out_of_memory(void)
 		panic("out of memory from page fault. panic_on_oom is selected.\n");
 
 	read_lock(&tasklist_lock);
-	__out_of_memory(0, 0); /* unknown gfp_mask and order */
+	/* unknown gfp_mask and order and not cpuset constrained */
+	__out_of_memory(0, 0, 0); 
 	read_unlock(&tasklist_lock);
 
 	/*
@@ -623,7 +649,7 @@ void out_of_memory(struct zonelist *zonelist, gfp_t 
gfp_mask, int order)
 			panic("out of memory. panic_on_oom is selected\n");
 		/* Fall-through */
 	case CONSTRAINT_CPUSET:
-		__out_of_memory(gfp_mask, order);
+		__out_of_memory(gfp_mask, order, 1);
 		break;
 	}
 
diff --git a/mm/oomcontrol.c b/mm/oomcontrol.c
new file mode 100644
index 0000000..24ee0da
--- /dev/null
+++ b/mm/oomcontrol.c
@@ -0,0 +1,138 @@
+/*
+ * kernel/cgroup_oom.c - oom handler cgroup.
+ */
+
+#include <linux/cgroup.h>
+#include <linux/fs.h>
+#include <linux/slab.h>
+#include <linux/oomcontrol.h>
+
+/*
+ * Helper to retrieve oom controller data from cgroup
+ */
+static struct oom_cgroup *oom_css_from_cgroup(struct cgroup *cgrp)
+{
+        return container_of(cgroup_subsys_state(cgrp,
+                                oom_subsys_id), struct oom_cgroup,
+                                css);
+}
+
+
+static struct cgroup_subsys_state *oom_create(struct cgroup_subsys *ss,
+						   struct cgroup *cont)
+{
+	struct oom_cgroup *oom_css = kzalloc(sizeof(*oom_css), GFP_KERNEL);
+	struct oom_cgroup *parent;
+
+	if (!oom_css)
+		return ERR_PTR(-ENOMEM);
+
+	/*
+	 * if root last/only group to be victimized
+	 * else inherit parents value
+	 */
+	if (cont->parent == NULL) {
+		oom_css->victim = 1;
+		oom_css->cpuset_constraint =
+			kzalloc(sizeof(*oom_css->cpuset_constraint), GFP_KERNEL);
+		*oom_css->cpuset_constraint = false;
+	} else {
+		parent = oom_css_from_cgroup(cont->parent);
+		oom_css->victim = parent->victim;
+		oom_css->cpuset_constraint = parent->cpuset_constraint;
+	}
+
+	return &oom_css->css;
+}
+
+static void oom_destroy(struct cgroup_subsys *ss, struct cgroup *cont)
+{
+	struct oom_cgroup *oom_css = oom_css_from_cgroup(cont);
+	if (cont->parent == NULL)
+		kfree(oom_css->cpuset_constraint);
+	kfree(cont->subsys[oom_subsys_id]);
+}
+
+static int oom_victim_write(struct cgroup *cgrp, struct cftype *cft,
+                                       u64 val)
+{
+
+        cgroup_lock();
+
+	(oom_css_from_cgroup(cgrp))->victim = val;
+
+        cgroup_unlock();
+
+        return 0;
+}
+
+static u64 oom_victim_read(struct cgroup *cgrp, struct cftype *cft)
+{
+        u64 victim = (oom_css_from_cgroup(cgrp))->victim;
+
+        return victim;
+}
+
+static int oom_cpuset_write(struct cgroup *cont, struct cftype *cft,
+                            const char *buffer)
+{
+	if (buffer[0] == '1' && buffer[1] == 0)
+		*(oom_css_from_cgroup(cont))->cpuset_constraint = true;
+	else if (buffer[0] == '0' && buffer[1] == 0)
+		*(oom_css_from_cgroup(cont))->cpuset_constraint = false;
+	else
+		return -EINVAL;
+	return 0;
+}
+
+static u64 oom_cpuset_read(struct cgroup *cgrp, struct cftype *cft)
+{
+        if (*(oom_css_from_cgroup(cgrp))->cpuset_constraint)
+		return 1;
+	else
+		return 0;
+}
+
+static struct cftype oom_cgroup_files[] = {
+	{
+		.name = "victim",
+		.read_u64 = oom_victim_read,
+		.write_u64 = oom_victim_write,
+	},
+};
+
+static struct cftype oom_cgroup_root_files[] = {
+	{
+		.name = "victim",
+		.read_u64 = oom_victim_read,
+		.write_u64 = oom_victim_write,
+	},
+	{
+		.name = "cpuset_constraint",
+		.read_u64 = oom_cpuset_read,
+		.write_string = oom_cpuset_write,
+	},
+};
+
+static int oom_populate(struct cgroup_subsys *ss,
+                                struct cgroup *cont)
+{
+	int ret;
+
+	if  (cont->parent == NULL) {
+		ret = cgroup_add_files(cont, ss, oom_cgroup_root_files,
+				ARRAY_SIZE(oom_cgroup_root_files));
+	} else {
+		ret = cgroup_add_files(cont, ss, oom_cgroup_files,
+				ARRAY_SIZE(oom_cgroup_files));
+	}
+	return ret;
+}
+
+struct cgroup_subsys oom_subsys = {
+	.name = "oom",
+	.subsys_id = oom_subsys_id,
+	.create = oom_create,
+	.destroy = oom_destroy,
+	.populate = oom_populate,
+};


^ permalink raw reply related	[flat|nested] 63+ messages in thread

* Re: [RFC] [PATCH] Cgroup based OOM killer controller
  2009-01-23 14:56                       ` Nikanth Karthikesan
@ 2009-01-23 20:44                         ` David Rientjes
  2009-01-27 10:20                           ` Nikanth Karthikesan
  2009-01-28  1:00                         ` Paul Menage
  1 sibling, 1 reply; 63+ messages in thread
From: David Rientjes @ 2009-01-23 20:44 UTC (permalink / raw)
  To: Nikanth Karthikesan
  Cc: Evgeniy Polyakov, Andrew Morton, Alan Cox, linux-kernel,
	Linus Torvalds, Chris Snook, Arve Hjønnevåg,
	Paul Menage, containers

On Fri, 23 Jan 2009, Nikanth Karthikesan wrote:

> In other instances, It can actually also kill some innocent tasks unless the 
> administrator tunes oom_adj, say something like kvm which would have a huge 
> memory accounted, but might be from a different node altogether. Killing a 
> single vm is killing all of the processes in that OS ;)  Don't you think this 
> has to be fixed^Wimproved?
> 

As previously stated, I think the heuristic to penalize tasks for not 
having an intersection with the set of allowable nodes of the oom 
triggering task could be made slightly more severe.  That's irrelevant to 
your patch, though.

> > That's my objection to the proposal: it doesn't behave appropriately for
> > both global unconstrained ooms and cpuset-constrained ooms at the same
> > time.
> >
> 
> So you are against specifying order when it is a cpuset-constrained oom. Here 
> is a revised version of the patch which adds oom.cpuset_constraint, when set 
> to 1 would disable the ordering imposed by this controller for cpuset 
> constrained ooms! Will this work for you?
> 

No, I don't think it's appropriate to add special exemptions for specific 
subsystems to what should be a generic cgroup.  I think it is much more 
powerful to defer these decisions to userspace so each cgroup can attach 
its own handler and implement the necessary decision-making that the 
kernel could never perfectly handle for all possible workloads.

It is trivial to implement the equivalent of this particular change as a 
userspace handler to SIGKILL all tasks in a specific cgroup when the 
cgroup oom handler is woken up at the time of oom.  Additionally, it could 
also respond in other ways such as adding a node to a cpuset, killing a 
less important cgroup, elevate a memory controller limit, send a signal to 
your application to release memory, etc.

We also talked about a cgroup /dev/mem_notify device file that you can 
poll() and learn of low memory situations so that appropriate action can 
be taken even in lowmem situations as opposed to simply oom conditions.

These types of policy decisions belong in userspace.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC] [PATCH] Cgroup based OOM killer controller
  2009-01-23 20:44                         ` David Rientjes
@ 2009-01-27 10:20                           ` Nikanth Karthikesan
  2009-01-27 10:53                             ` David Rientjes
  0 siblings, 1 reply; 63+ messages in thread
From: Nikanth Karthikesan @ 2009-01-27 10:20 UTC (permalink / raw)
  To: David Rientjes
  Cc: Evgeniy Polyakov, Andrew Morton, Alan Cox, linux-kernel,
	Linus Torvalds, Chris Snook, Arve Hjønnevåg,
	Paul Menage, containers, Balbir Singh, KOSAKI Motohiro

On Saturday 24 January 2009 02:14:59 David Rientjes wrote:
> On Fri, 23 Jan 2009, Nikanth Karthikesan wrote:
> > In other instances, It can actually also kill some innocent tasks unless
> > the administrator tunes oom_adj, say something like kvm which would have
> > a huge memory accounted, but might be from a different node altogether.
> > Killing a single vm is killing all of the processes in that OS ;)  Don't
> > you think this has to be fixed^Wimproved?
>
> As previously stated, I think the heuristic to penalize tasks for not
> having an intersection with the set of allowable nodes of the oom
> triggering task could be made slightly more severe.  That's irrelevant to
> your patch, though.
>

But the heuristic makes it non-deterministic, unlike memcg case. And this 
mandates special handling for cpuset constrained OOM conditions in this patch.

> > > That's my objection to the proposal: it doesn't behave appropriately
> > > for both global unconstrained ooms and cpuset-constrained ooms at the
> > > same time.
> >
> > So you are against specifying order when it is a cpuset-constrained oom.
> > Here is a revised version of the patch which adds oom.cpuset_constraint,
> > when set to 1 would disable the ordering imposed by this controller for
> > cpuset constrained ooms! Will this work for you?
>
> No, I don't think it's appropriate to add special exemptions for specific
> subsystems to what should be a generic cgroup.  I think it is much more
> powerful to defer these decisions to userspace so each cgroup can attach
> its own handler and implement the necessary decision-making that the
> kernel could never perfectly handle for all possible workloads.
>
> It is trivial to implement the equivalent of this particular change as a
> userspace handler to SIGKILL all tasks in a specific cgroup when the
> cgroup oom handler is woken up at the time of oom.  Additionally, it could
> also respond in other ways such as adding a node to a cpuset, killing a
> less important cgroup, elevate a memory controller limit, send a signal to
> your application to release memory, etc.
>
> We also talked about a cgroup /dev/mem_notify device file that you can
> poll() and learn of low memory situations so that appropriate action can
> be taken even in lowmem situations as opposed to simply oom conditions.
>

Userspace also needs to handle the cpuset constrained _almost-oom_'s 
specially? I wonder how easily userspace can handle that.

> These types of policy decisions belong in userspace.

Yes, policy decisions will be made in user-space using this oom-controller. 
This is just a framework/means to enforce policies. We do not make any 
decisions inside the kernel.

But yes, the badness calculation by the oom killer implements some kind of 
policy inside the kernel, but I guess it can stay, as this oom-controller lets 
user policy over-ride kernel policy. ;-)

Thanks
Nikanth

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC] [PATCH] Cgroup based OOM killer controller
  2009-01-27 10:20                           ` Nikanth Karthikesan
@ 2009-01-27 10:53                             ` David Rientjes
  2009-01-27 11:08                               ` Nikanth Karthikesan
  0 siblings, 1 reply; 63+ messages in thread
From: David Rientjes @ 2009-01-27 10:53 UTC (permalink / raw)
  To: Nikanth Karthikesan
  Cc: Evgeniy Polyakov, Andrew Morton, Alan Cox, linux-kernel,
	Linus Torvalds, Chris Snook, Arve Hjønnevåg,
	Paul Menage, containers, Balbir Singh, KOSAKI Motohiro

On Tue, 27 Jan 2009, Nikanth Karthikesan wrote:

> > As previously stated, I think the heuristic to penalize tasks for not
> > having an intersection with the set of allowable nodes of the oom
> > triggering task could be made slightly more severe.  That's irrelevant to
> > your patch, though.
> >
> 
> But the heuristic makes it non-deterministic, unlike memcg case. And this 
> mandates special handling for cpuset constrained OOM conditions in this patch.
> 

Dividing a badness score by 8 if a task's set of allowable nodes do not 
insect with the oom triggering task's set does not make an otherwise 
deterministic algorithm non-deterministic.

I don't understand what you're arguing for here.  Are you suggesting that 
we should not prefer tasks that intersect the set of allowable nodes?  
That makes no sense if the goal is to allow for future memory freeing.

> > We also talked about a cgroup /dev/mem_notify device file that you can
> > poll() and learn of low memory situations so that appropriate action can
> > be taken even in lowmem situations as opposed to simply oom conditions.
> >
> 
> Userspace also needs to handle the cpuset constrained _almost-oom_'s 
> specially? I wonder how easily userspace can handle that.
> 

It handles it very well, cpusets are a client of cgroups and the 
/dev/mem_notify extension would be as well.  So to handle lowmem 
notifications for your cpuset, you would mount both cgroup subsystems at 
the same time and then poll() on the mem_notify file.  It would be 
responsible for the aggregate of tasks that the cgroup represents.

If lowmem notifications are implemented in the reclaim path, this is much 
easier for cpusets than for the memory controller, actually, since we 
already collect per-node ZVC information.

> > These types of policy decisions belong in userspace.
> 
> Yes, policy decisions will be made in user-space using this oom-controller. 
> This is just a framework/means to enforce policies. We do not make any 
> decisions inside the kernel.
> 
> But yes, the badness calculation by the oom killer implements some kind of 
> policy inside the kernel, but I guess it can stay, as this oom-controller lets 
> user policy over-ride kernel policy. ;-)
> 

The goal should not be to override the kernel's choice, because that 
decision depends heavily on the type of oom and the state of the machine 
at the time.  Appropriate changes to the oom killer's heuristics are 
always welcome; in my opinion, we should probably penalize tasks that do 
not intersect the triggering task's set of allowable nodes more than we 
currently do.

I think you'll find that your goals can be accomplished with a mem_notify 
cgroup and that it is a much more powerful interface so that your 
userspace policy can be better informed, especially if it is aware of 
lowmem situations where oom conditions are imminent.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC] [PATCH] Cgroup based OOM killer controller
  2009-01-27 10:53                             ` David Rientjes
@ 2009-01-27 11:08                               ` Nikanth Karthikesan
  2009-01-27 11:21                                 ` David Rientjes
  0 siblings, 1 reply; 63+ messages in thread
From: Nikanth Karthikesan @ 2009-01-27 11:08 UTC (permalink / raw)
  To: David Rientjes
  Cc: Evgeniy Polyakov, Andrew Morton, Alan Cox, linux-kernel,
	Linus Torvalds, Chris Snook, Arve Hjønnevåg,
	Paul Menage, containers, Balbir Singh, KOSAKI Motohiro

On Tuesday 27 January 2009 16:23:00 David Rientjes wrote:
> On Tue, 27 Jan 2009, Nikanth Karthikesan wrote:
> > > As previously stated, I think the heuristic to penalize tasks for not
> > > having an intersection with the set of allowable nodes of the oom
> > > triggering task could be made slightly more severe.  That's irrelevant
> > > to your patch, though.
> >
> > But the heuristic makes it non-deterministic, unlike memcg case. And this
> > mandates special handling for cpuset constrained OOM conditions in this
> > patch.
>
> Dividing a badness score by 8 if a task's set of allowable nodes do not
> insect with the oom triggering task's set does not make an otherwise
> deterministic algorithm non-deterministic.
>
> I don't understand what you're arguing for here.  Are you suggesting that
> we should not prefer tasks that intersect the set of allowable nodes?
> That makes no sense if the goal is to allow for future memory freeing.
>

No. Actually I am just wondering, will it be possible to check whether a 
particular task has memory allocated or mmaped from this node to avoid killing 
an innocent task. I compared with memcg, to say that memcg never kills a task 
not related to the memcg constrained oom. Sorry if I was unclear, earlier. If 
we do this, oom-controller will not require special handling for cpuset 
constrained ooms.

Thanks
Nikanth

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC] [PATCH] Cgroup based OOM killer controller
  2009-01-27 11:08                               ` Nikanth Karthikesan
@ 2009-01-27 11:21                                 ` David Rientjes
  2009-01-27 11:37                                   ` Nikanth Karthikesan
  0 siblings, 1 reply; 63+ messages in thread
From: David Rientjes @ 2009-01-27 11:21 UTC (permalink / raw)
  To: Nikanth Karthikesan
  Cc: Evgeniy Polyakov, Andrew Morton, Alan Cox, linux-kernel,
	Linus Torvalds, Chris Snook, Arve Hjønnevåg,
	Paul Menage, containers, Balbir Singh, KOSAKI Motohiro

On Tue, 27 Jan 2009, Nikanth Karthikesan wrote:

> > I don't understand what you're arguing for here.  Are you suggesting that
> > we should not prefer tasks that intersect the set of allowable nodes?
> > That makes no sense if the goal is to allow for future memory freeing.
> >
> 
> No. Actually I am just wondering, will it be possible to check whether a 
> particular task has memory allocated or mmaped from this node to avoid killing 
> an innocent task.

That's certainly idealistic, but cannot be done in an inexpensive way that 
would scale with the large systems that clients of cpusets typically use.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC] [PATCH] Cgroup based OOM killer controller
  2009-01-27 11:21                                 ` David Rientjes
@ 2009-01-27 11:37                                   ` Nikanth Karthikesan
  2009-01-27 20:29                                     ` David Rientjes
  0 siblings, 1 reply; 63+ messages in thread
From: Nikanth Karthikesan @ 2009-01-27 11:37 UTC (permalink / raw)
  To: David Rientjes
  Cc: Evgeniy Polyakov, Andrew Morton, Alan Cox, linux-kernel,
	Linus Torvalds, Chris Snook, Arve Hjønnevåg,
	Paul Menage, containers, Balbir Singh, KOSAKI Motohiro

On Tuesday 27 January 2009 16:51:26 David Rientjes wrote:
> On Tue, 27 Jan 2009, Nikanth Karthikesan wrote:
> > > I don't understand what you're arguing for here.  Are you suggesting
> > > that we should not prefer tasks that intersect the set of allowable
> > > nodes? That makes no sense if the goal is to allow for future memory
> > > freeing.
> >
> > No. Actually I am just wondering, will it be possible to check whether a
> > particular task has memory allocated or mmaped from this node to avoid
> > killing an innocent task.
>
> That's certainly idealistic, but cannot be done in an inexpensive way that
> would scale with the large systems that clients of cpusets typically use.

If we kill only the tasks for which cpuset_mems_allowed_intersects() is true 
on the first pass and even then if we do not get out of oom, we could go over 
again with this expensive check. Using this scheme, could kill more no of 
tasks than required, if a task with lots of memory has moved to a different 
cpuset. But it should be rare and not that severe compared to killing a 
totally innocent task like the current scheme does?!

Thanks
Nikanth

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC] [PATCH] Cgroup based OOM killer controller
  2009-01-27 11:37                                   ` Nikanth Karthikesan
@ 2009-01-27 20:29                                     ` David Rientjes
  0 siblings, 0 replies; 63+ messages in thread
From: David Rientjes @ 2009-01-27 20:29 UTC (permalink / raw)
  To: Nikanth Karthikesan
  Cc: Evgeniy Polyakov, Andrew Morton, Alan Cox, linux-kernel,
	Linus Torvalds, Chris Snook, Arve Hjønnevåg,
	Paul Menage, containers, Balbir Singh, KOSAKI Motohiro

On Tue, 27 Jan 2009, Nikanth Karthikesan wrote:

> > That's certainly idealistic, but cannot be done in an inexpensive way that
> > would scale with the large systems that clients of cpusets typically use.
> 
> If we kill only the tasks for which cpuset_mems_allowed_intersects() is true 
> on the first pass and even then if we do not get out of oom, we could go over 
> again with this expensive check.

The oom killer has no memory of previous kills, so it's not possible to 
determine if there've been a series of recent needless ones.  Subsequent 
oom conditions should still check for intersection and, since it's only a 
heuristic, a large memory-hogging task will eventually be killed if there 
are no tasks remaining with such an intersection.

I don't know how you're planning on mapping large memory allocations on 
nodes of interest back to specific tasks, however.  Do you have a 
proposal?

> Using this scheme, could kill more no of 
> tasks than required, if a task with lots of memory has moved to a different 
> cpuset.

That's rare, since cpusets are used for NUMA optimizations and a set of 
cpus has a static affinity to certain memory.  It could happen if a 
cpuset's set of allowable nodes is made to be smaller, but that seems like 
it would trigger the oom in the first place and would encourage killing 
tasks with an intersection.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC] [PATCH] Cgroup based OOM killer controller
  2009-01-23 14:56                       ` Nikanth Karthikesan
  2009-01-23 20:44                         ` David Rientjes
@ 2009-01-28  1:00                         ` Paul Menage
  2009-01-29 15:48                           ` Nikanth Karthikesan
  1 sibling, 1 reply; 63+ messages in thread
From: Paul Menage @ 2009-01-28  1:00 UTC (permalink / raw)
  To: Nikanth Karthikesan
  Cc: David Rientjes, Evgeniy Polyakov, Andrew Morton, Alan Cox,
	linux-kernel, Linus Torvalds, Chris Snook,
	Arve Hjønnevåg, containers

Hi Nikanth,

On Fri, Jan 23, 2009 at 6:56 AM, Nikanth Karthikesan <knikanth@suse.de> wrote:
>
> From: Nikanth Karthikesan <knikanth@suse.de>
>
> Cgroup based OOM killer controller
>
> Signed-off-by: Nikanth Karthikesan <knikanth@suse.de>
>
> ---
>
> This is a container group based approach to override the oom killer selection
> without losing all the benefits of the current oom killer heuristics and
> oom_adj interface.

The basic functionality looks useful.

But before we add an OOM subsystem and commit to an API that has to be
supported forever, I think it would be good to have an overall design
for what kinds of things we want to be able to do regarding cgroups
and OOM killing.

Specifying a per-cgroup priority is part of the solution, and is
useful for simple cases. Some kind of userspace notification is also
useful.

The notification system that David/Ying posted has worked pretty well
for us at Google - it's allowed us to use cpusets and fake numa to
provide hard memory controls and guarantees for jobs, while avoiding
having jobs getting killed when they expand faster than we expect. But
we also acknowledge that it's a bit of a hack, and it would be nice to
come up with something more generally acceptable for a real
submission.

>
> It adds a tunable oom.victim to the oom cgroup. The oom killer will kill the
> process using the usual badness value but only within the cgroup with the
> maximum value for oom.victim before killing any process from a cgroup with a
> lesser oom.victim number. Oom killing could be disabled by setting
> oom.victim=0.

"priority" might be a better term than "victim".

>
> CPUSET constrained OOM:
> Also the tunable oom.cpuset_constrained when enabled, would disable the
> ordering imposed by this controller for cpuset constrained OOMs.
>
> diff --git a/Documentation/cgroups/oom.txt b/Documentation/cgroups/oom.txt
> new file mode 100644
> index 0000000..772fb41
> --- /dev/null
> +++ b/Documentation/cgroups/oom.txt
> @@ -0,0 +1,34 @@
> +OOM Killer controller
> +--- ------ ----------
> +
> +The OOM killer kills the process based on a set of heuristics such that only

Might be worth adding "theoretically" in this sentence :-)

>        do_posix_clock_monotonic_gettime(&uptime);
> @@ -257,10 +262,30 @@ static struct task_struct *select_bad_process(unsigned
> long *ppoints,
>                        continue;
>
>                points = badness(p, uptime.tv_sec);
> +#ifdef CONFIG_CGROUP_OOM
> +               taskvictim = (container_of(p->cgroups->subsys[oom_subsys_id],
> +                                               struct oom_cgroup, css))->victim;

Firstly, this ought to be using the task_subsys_state() function to
ensure the appropriate rcu_dereference() calls.

Secondly, is it safe? I'm not sure if we're in an RCU section in this
case, and we certainly haven't called task_lock(p) or cgroup_lock().
You should surround this with rcu_read_lock()/rcu_read_unlock().

And thirdly, it would be better to move the #ifdef to the header file,
and provide dummy functions that return 0 for the kill priority if
CONFIG_CGROUP_OOM isn't defined.

> +               honour_cpuset_constraint = *(container_of(p->cgroups-
>>subsys[oom_subsys_id],
> +                                                struct oom_cgroup, css))-
>>cpuset_constraint;

I think that putting this kind of inter-subsystem dependency in is a
bad idea. If you want to control whether the OOM killer treats cpusets
specially, perhaps that flag should be put in cpusets?

> +
> +               if (taskvictim > chosenvictim ||
> +                       (((taskvictim == chosenvictim) ||
> +                               (cpuset_constrained && honour_cpuset_constraint))
> +                                && points > *ppoints) ||
> +                       (taskvictim && !chosen)) {

This could do with more comments or maybe breaking up into simpler conditions.

> +       if (cont->parent == NULL) {
> +               oom_css->victim = 1;

Any reason to default to 1 rather than 0?

> +               oom_css->cpuset_constraint =
> +                       kzalloc(sizeof(*oom_css->cpuset_constraint), GFP_KERNEL);
> +               *oom_css->cpuset_constraint = false;
> +       } else {
> +               parent = oom_css_from_cgroup(cont->parent);
> +               oom_css->victim = parent->victim;
> +               oom_css->cpuset_constraint = parent->cpuset_constraint;
> +       }

So there's a single cpuset_constraint shared by all cgroups? Isn't
that just a global variable then?

> +
> +static int oom_victim_write(struct cgroup *cgrp, struct cftype *cft,
> +                                       u64 val)
> +{
> +
> +        cgroup_lock();

This isn't really doing much, since you don't synchronize on the read
side (either the file handler or in the OOM killer itself). It might
be better to just make the value an atomic_t and avoid taking
cgroup_lock() here.

Should we enforce any constraint that a cgroup can never have a lower
kill priority than its parent? Or a separate "min child priority"
value, or just make the cgroup's priority be the max of any in its
path to the root? That would allow you to safely delegate OOM priority
control to sub cgroups while still controlling relative priorities for
each subtree.

> +static int oom_cpuset_write(struct cgroup *cont, struct cftype *cft,
> +                            const char *buffer)
> +{
> +       if (buffer[0] == '1' && buffer[1] == 0)
> +               *(oom_css_from_cgroup(cont))->cpuset_constraint = true;
> +       else if (buffer[0] == '0' && buffer[1] == 0)
> +               *(oom_css_from_cgroup(cont))->cpuset_constraint = false;
> +       else
> +               return -EINVAL;
> +       return 0;
> +}

This can be a u64 write handler that just complains if its input isn't 0 or 1.

> +static struct cftype oom_cgroup_files[] = {
> +       {
> +               .name = "victim",
> +               .read_u64 = oom_victim_read,
> +               .write_u64 = oom_victim_write,
> +       },
> +};
> +
> +static struct cftype oom_cgroup_root_files[] = {
> +       {
> +               .name = "victim",
> +               .read_u64 = oom_victim_read,
> +               .write_u64 = oom_victim_write,
> +       },

Don't duplicate here - just have disjoint sets of files, and call
cgroup_add_files(oom_cgroup_root_files) in addition to the regular
files if it's the root. (Although as I mentioned above, I don't really
think this is the right place for the cpuset_constraint file)

Paul

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC] [PATCH] Cgroup based OOM killer controller
  2009-01-28  1:00                         ` Paul Menage
@ 2009-01-29 15:48                           ` Nikanth Karthikesan
  0 siblings, 0 replies; 63+ messages in thread
From: Nikanth Karthikesan @ 2009-01-29 15:48 UTC (permalink / raw)
  To: Paul Menage
  Cc: David Rientjes, Evgeniy Polyakov, Andrew Morton, Alan Cox,
	linux-kernel, Linus Torvalds, Chris Snook,
	Arve Hjønnevåg, containers

On Wednesday 28 January 2009 06:30:42 Paul Menage wrote:
> Hi Nikanth,
>
> On Fri, Jan 23, 2009 at 6:56 AM, Nikanth Karthikesan <knikanth@suse.de> 
wrote:
> > From: Nikanth Karthikesan <knikanth@suse.de>
> >
> > Cgroup based OOM killer controller
> >
> > Signed-off-by: Nikanth Karthikesan <knikanth@suse.de>
> >
> > ---
> >
> > This is a container group based approach to override the oom killer
> > selection without losing all the benefits of the current oom killer
> > heuristics and oom_adj interface.
>
> The basic functionality looks useful.
>

Thanks.

> But before we add an OOM subsystem and commit to an API that has to be
> supported forever, I think it would be good to have an overall design
> for what kinds of things we want to be able to do regarding cgroups
> and OOM killing.
>
> Specifying a per-cgroup priority is part of the solution, and is
> useful for simple cases. Some kind of userspace notification is also
> useful.
>

Yes, very much.

> The notification system that David/Ying posted has worked pretty well
> for us at Google - it's allowed us to use cpusets and fake numa to
> provide hard memory controls and guarantees for jobs, while avoiding
> having jobs getting killed when they expand faster than we expect. But
> we also acknowledge that it's a bit of a hack, and it would be nice to
> come up with something more generally acceptable for a real
> submission.
>
> > It adds a tunable oom.victim to the oom cgroup. The oom killer will kill
> > the process using the usual badness value but only within the cgroup with
> > the maximum value for oom.victim before killing any process from a cgroup
> > with a lesser oom.victim number. Oom killing could be disabled by setting
> > oom.victim=0.
>
> "priority" might be a better term than "victim".
>

Agreed.

> > CPUSET constrained OOM:
> > Also the tunable oom.cpuset_constrained when enabled, would disable the
> > ordering imposed by this controller for cpuset constrained OOMs.
> >
> > diff --git a/Documentation/cgroups/oom.txt
> > b/Documentation/cgroups/oom.txt new file mode 100644
> > index 0000000..772fb41
> > --- /dev/null
> > +++ b/Documentation/cgroups/oom.txt
> > @@ -0,0 +1,34 @@
> > +OOM Killer controller
> > +--- ------ ----------
> > +
> > +The OOM killer kills the process based on a set of heuristics such that
> > only
>
> Might be worth adding "theoretically" in this sentence :-)
>
> >        do_posix_clock_monotonic_gettime(&uptime);
> > @@ -257,10 +262,30 @@ static struct task_struct
> > *select_bad_process(unsigned long *ppoints,
> >                        continue;
> >
> >                points = badness(p, uptime.tv_sec);
> > +#ifdef CONFIG_CGROUP_OOM
> > +               taskvictim =
> > (container_of(p->cgroups->subsys[oom_subsys_id], +                       
> >                        struct oom_cgroup, css))->victim;
>
> Firstly, this ought to be using the task_subsys_state() function to
> ensure the appropriate rcu_dereference() calls.
>

Ok.

> Secondly, is it safe? I'm not sure if we're in an RCU section in this
> case, and we certainly haven't called task_lock(p) or cgroup_lock().
> You should surround this with rcu_read_lock()/rcu_read_unlock().
>

Ok.

> And thirdly, it would be better to move the #ifdef to the header file,
> and provide dummy functions that return 0 for the kill priority if
> CONFIG_CGROUP_OOM isn't defined.
>

Ok. As this patch uses 0 to disable oom_killing completely, the dummy function 
should return 1 instead of zero. It should be documented more clearly.

> > +               honour_cpuset_constraint = *(container_of(p->cgroups-
> >
> >>subsys[oom_subsys_id],
> >
> > +                                                struct oom_cgroup,
> > css))-
> >
> >>cpuset_constraint;
>
> I think that putting this kind of inter-subsystem dependency in is a
> bad idea. If you want to control whether the OOM killer treats cpusets
> specially, perhaps that flag should be put in cpusets?
>

But then won't it add a special variable in cpusets for oom-controller?

> > +
> > +               if (taskvictim > chosenvictim ||
> > +                       (((taskvictim == chosenvictim) ||
> > +                               (cpuset_constrained &&
> > honour_cpuset_constraint)) +                                && points >
> > *ppoints) ||
> > +                       (taskvictim && !chosen)) {
>
> This could do with more comments or maybe breaking up into simpler
> conditions.
>

Ok.

> > +       if (cont->parent == NULL) {
> > +               oom_css->victim = 1;
>
> Any reason to default to 1 rather than 0?
>

0 disables oom killing completely.

> > +               oom_css->cpuset_constraint =
> > +                       kzalloc(sizeof(*oom_css->cpuset_constraint),
> > GFP_KERNEL); +               *oom_css->cpuset_constraint = false;
> > +       } else {
> > +               parent = oom_css_from_cgroup(cont->parent);
> > +               oom_css->victim = parent->victim;
> > +               oom_css->cpuset_constraint = parent->cpuset_constraint;
> > +       }
>
> So there's a single cpuset_constraint shared by all cgroups? Isn't
> that just a global variable then?
>

Yes, it should be a global variable.

> > +
> > +static int oom_victim_write(struct cgroup *cgrp, struct cftype *cft,
> > +                                       u64 val)
> > +{
> > +
> > +        cgroup_lock();
>
> This isn't really doing much, since you don't synchronize on the read
> side (either the file handler or in the OOM killer itself). It might
> be better to just make the value an atomic_t and avoid taking
> cgroup_lock() here.
>

Yes.

> Should we enforce any constraint that a cgroup can never have a lower
> kill priority than its parent? Or a separate "min child priority"
> value, or just make the cgroup's priority be the max of any in its
> path to the root? That would allow you to safely delegate OOM priority
> control to sub cgroups while still controlling relative priorities for
> each subtree.
>

Setting priority to be the maximum of any in its path seems better to me. It 
should make it easier to handle a group of cgroups.

> > +static int oom_cpuset_write(struct cgroup *cont, struct cftype *cft,
> > +                            const char *buffer)
> > +{
> > +       if (buffer[0] == '1' && buffer[1] == 0)
> > +               *(oom_css_from_cgroup(cont))->cpuset_constraint = true;
> > +       else if (buffer[0] == '0' && buffer[1] == 0)
> > +               *(oom_css_from_cgroup(cont))->cpuset_constraint = false;
> > +       else
> > +               return -EINVAL;
> > +       return 0;
> > +}
>
> This can be a u64 write handler that just complains if its input isn't 0 or
> 1.
>

Yes, that would be cleaner.

> > +static struct cftype oom_cgroup_files[] = {
> > +       {
> > +               .name = "victim",
> > +               .read_u64 = oom_victim_read,
> > +               .write_u64 = oom_victim_write,
> > +       },
> > +};
> > +
> > +static struct cftype oom_cgroup_root_files[] = {
> > +       {
> > +               .name = "victim",
> > +               .read_u64 = oom_victim_read,
> > +               .write_u64 = oom_victim_write,
> > +       },
>
> Don't duplicate here - just have disjoint sets of files, and call
> cgroup_add_files(oom_cgroup_root_files) in addition to the regular
> files if it's the root. (Although as I mentioned above, I don't really
> think this is the right place for the cpuset_constraint file)
>

Ok.

Thanks for the detailed review. I have attached the patch with your comments 
incorporated. There is a read-only oom.effective_priority added which is 
computed as the maximum oom.priority along its path.

Thanks
Nikanth

From: Nikanth Karthikesan <knikanth@suse.de>

Cgroup based OOM killer controller

Signed-off-by: Nikanth Karthikesan <knikanth@suse.de>

---

This is a container group based approach to override the oom killer selection 
without losing all the benefits of the current oom killer heuristics and 
oom_adj interface. This controller helps in specifying a strict order between 
tasks that can be killed during a oom.

It adds a tunable oom.priority to the oom cgroup. The oom killer will kill the 
process using the usual badness value but only within the cgroup with the 
maximum value for oom.effective_priority before killing any process from a 
cgroup with a lesser oom.effective_priority number. The oom.effective_priority 
is calculated as the maximum oom.priority along its path. Oom killing could be 
disabled for a cgroup by setting oom.effective_priority=0.

diff --git a/Documentation/cgroups/oom.txt b/Documentation/cgroups/oom.txt
new file mode 100644
index 0000000..5ef34db
--- /dev/null
+++ b/Documentation/cgroups/oom.txt
@@ -0,0 +1,36 @@
+OOM Killer controller
+--- ------ ----------
+
+The OOM killer kills the process based on a set of heuristics such that only
+minimum amount of work done will be lost, a large amount of memory would be
+recovered and minimum no of processes are killed.
+
+The user can adjust the score used to select the processes to be killed using
+/proc/<pid>/oom_adj. Giving it a high score will increase the likelihood of 
+this process being killed by the oom-killer.  Valid values are in the range 
+-16 to +15, plus the special value -17, which disables oom-killing altogether
+for that process.
+
+But it is very difficult to suggest an order among tasks to be killed during
+Out Of Memory situation. The OOM Killer controller aids in doing that.
+
+USAGE
+-----
+
+Mount the oom controller by passing 'oom' when mounting cgroups. Echo
+a value in oom.priority file to change the order. The oom.effective_priority
+is calculated as the highest oom.priority along its path. The oom killer 
would
+kill all the processes in a cgroup with a higher oom.effective_priority 
before
+killing a process in a cgroup with lower oom.effective_priority value. Among
+those tasks with same oom.effective_priority value, the usual badness
+heuristics would be applied. The /proc/<pid>/oom_adj still helps adjusting 
the
+oom killer score. Also having oom.effective_priority = 0 would disable oom
+killing for the tasks in that cgroup.
+
+Note: If this is used without proper consideration, innocent processes may
+get killed unnecesarily.
+
+CPUSET constrained OOM:
+Setting oom.cpuset_constraint=1 would disable the ordering during a cpuset
+constrained oom. Setting oom.cpuset_constraint=0 would not distinguish
+between a cpuset constrained oom and system wide oom.
diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
index 9c8d31b..6944f99 100644
--- a/include/linux/cgroup_subsys.h
+++ b/include/linux/cgroup_subsys.h
@@ -59,4 +59,8 @@ SUBSYS(freezer)
 SUBSYS(net_cls)
 #endif
 
+#ifdef CONFIG_CGROUP_OOM
+SUBSYS(oom)
+#endif
+
 /* */
diff --git a/include/linux/oomcontrol.h b/include/linux/oomcontrol.h
new file mode 100644
index 0000000..8072d7a
--- /dev/null
+++ b/include/linux/oomcontrol.h
@@ -0,0 +1,35 @@
+#ifndef _LINUX_OOMCONTROL_H
+#define _LINUX_OOMCONTROL_H
+
+#ifdef CONFIG_CGROUP_OOM
+
+struct oom_cgroup { 
+	struct cgroup_subsys_state css;
+
+	/*
+	 * the order to be victimized for this group
+	 */  
+	atomic_t priority;
+
+	/*
+	 * the maximum priority along the path from root
+	 */  
+	atomic_t effective_priority;
+
+};
+
+/*
+ * disable during cpuset constrained oom
+ */
+extern atomic_t honour_cpuset_constraint;
+
+u64 task_oom_priority(struct task_struct *p);
+
+#else
+
+#define task_oom_priority(p) (1)
+
+static atomic_t honour_cpuset_constraint; /* unused */
+
+#endif
+#endif
diff --git a/init/Kconfig b/init/Kconfig
index 2af8382..99ed0de 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -354,6 +354,15 @@ config CGROUP_DEBUG
 
 	  Say N if unsure.
 
+config CGROUP_OOM
+	bool "Oom cgroup subsystem"
+	depends on CGROUPS
+	help
+	  This provides a cgroup subsystem which aids controlling
+	  the order in which tasks whould be killed during
+	  out of memory situations.
+	
+
 config CGROUP_NS
 	bool "Namespace cgroup subsystem"
 	depends on CGROUPS
diff --git a/mm/Makefile b/mm/Makefile
index 72255be..a5d7222 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -33,3 +33,4 @@ obj-$(CONFIG_MIGRATION) += migrate.o
 obj-$(CONFIG_SMP) += allocpercpu.o
 obj-$(CONFIG_QUICKLIST) += quicklist.o
 obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o page_cgroup.o
+obj-$(CONFIG_CGROUP_OOM) += oomcontrol.o 
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 40ba050..6851da3 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -26,6 +26,7 @@
 #include <linux/module.h>
 #include <linux/notifier.h>
 #include <linux/memcontrol.h>
+#include <linux/oomcontrol.h>
 #include <linux/security.h>
 
 int sysctl_panic_on_oom;
@@ -200,11 +201,13 @@ static inline enum oom_constraint 
constrained_alloc(struct zonelist *zonelist,
  * (not docbooked, we don't want this one cluttering up the manual)
  */
 static struct task_struct *select_bad_process(unsigned long *ppoints,
-						struct mem_cgroup *mem)
+			struct mem_cgroup *mem, int cpuset_constrained)
 {
 	struct task_struct *g, *p;
 	struct task_struct *chosen = NULL;
 	struct timespec uptime;
+	u64 chosenpriority = 1, taskpriority;
+
 	*ppoints = 0;
 
 	do_posix_clock_monotonic_gettime(&uptime);
@@ -257,10 +260,35 @@ static struct task_struct *select_bad_process(unsigned 
long *ppoints,
 			continue;
 
 		points = badness(p, uptime.tv_sec);
-		if (points > *ppoints || !chosen) {
+
+		taskpriority = task_oom_priority(p);
+
+		/*
+		 * select this task if
+		 * 1. It has higher oom.priority than the previously selected
+		 * task, or
+		 * 2. It has the same priority as previously selected task but
+		 * higher badness score, or
+		 * 3. If this is the first task to be considered and it is not
+		 * protected from oom killer by setting priority as zero, or
+		 * 4. If this is a cpuset constrained oom and
+		 * honour_cpuset_constraint is set
+		 */
+		if (taskpriority > chosenpriority ||
+
+			(((taskpriority == chosenpriority) ||
+			  (cpuset_constrained &&
+				atomic_read(&honour_cpuset_constraint)))
+			 && points > *ppoints) ||
+
+			(taskpriority && !chosen)) {
+
 			chosen = p;
 			*ppoints = points;
+			chosenpriority = taskpriority;
+
 		}
+		
 	} while_each_thread(g, p);
 
 	return chosen;
@@ -431,7 +459,7 @@ void mem_cgroup_out_of_memory(struct mem_cgroup *mem, 
gfp_t gfp_mask)
 
 	read_lock(&tasklist_lock);
 retry:
-	p = select_bad_process(&points, mem);
+	p = select_bad_process(&points, mem, 0); /* not cpuset constrained */
 	if (PTR_ERR(p) == -1UL)
 		goto out;
 
@@ -513,7 +541,7 @@ void clear_zonelist_oom(struct zonelist *zonelist, gfp_t 
gfp_mask)
 /*
  * Must be called with tasklist_lock held for read.
  */
-static void __out_of_memory(gfp_t gfp_mask, int order)
+static void __out_of_memory(gfp_t gfp_mask, int order, int 
cpuset_constrained)
 {
 	if (sysctl_oom_kill_allocating_task) {
 		oom_kill_process(current, gfp_mask, order, 0, NULL,
@@ -528,7 +556,7 @@ retry:
 		 * Rambo mode: Shoot down a process and hope it solves whatever
 		 * issues we may have.
 		 */
-		p = select_bad_process(&points, NULL);
+		p = select_bad_process(&points, NULL, cpuset_constrained);
 
 		if (PTR_ERR(p) == -1UL)
 			return;
@@ -569,7 +597,8 @@ void pagefault_out_of_memory(void)
 		panic("out of memory from page fault. panic_on_oom is selected.\n");
 
 	read_lock(&tasklist_lock);
-	__out_of_memory(0, 0); /* unknown gfp_mask and order */
+	/* unknown gfp_mask and order and not cpuset constrained */
+	__out_of_memory(0, 0, 0); 
 	read_unlock(&tasklist_lock);
 
 	/*
@@ -623,7 +652,7 @@ void out_of_memory(struct zonelist *zonelist, gfp_t 
gfp_mask, int order)
 			panic("out of memory. panic_on_oom is selected\n");
 		/* Fall-through */
 	case CONSTRAINT_CPUSET:
-		__out_of_memory(gfp_mask, order);
+		__out_of_memory(gfp_mask, order, 1);
 		break;
 	}
 
diff --git a/mm/oomcontrol.c b/mm/oomcontrol.c
new file mode 100644
index 0000000..d572b1f
--- /dev/null
+++ b/mm/oomcontrol.c
@@ -0,0 +1,294 @@
+/*
+ * kernel/cgroup_oom.c - oom handler cgroup.
+ */
+
+#include <linux/cgroup.h>
+#include <linux/fs.h>
+#include <linux/slab.h>
+#include <linux/oomcontrol.h>
+#include <asm/atomic.h>
+
+atomic_t honour_cpuset_constraint;
+
+/*
+ * Helper to retrieve oom controller data from cgroup
+ */
+static struct oom_cgroup *oom_css_from_cgroup(struct cgroup *cgrp)
+{
+        return container_of(cgroup_subsys_state(cgrp,
+                                oom_subsys_id), struct oom_cgroup,
+                                css);
+}
+
+u64 task_oom_priority(struct task_struct *p)
+{
+	rcu_read_lock();
+	return atomic_read(&(container_of(task_subsys_state(p,oom_subsys_id),
+				struct oom_cgroup, css))->effective_priority);
+	rcu_read_unlock();
+}
+
+static struct cgroup_subsys_state *oom_create(struct cgroup_subsys *ss,
+						   struct cgroup *cont)
+{
+	struct oom_cgroup *oom_css = kzalloc(sizeof(*oom_css), GFP_KERNEL);
+	struct oom_cgroup *parent;
+	u64 parent_priority, parent_effective_priority;
+
+	if (!oom_css)
+		return ERR_PTR(-ENOMEM);
+
+	/*
+	 * if root last/only group to be victimized
+	 * else inherit parents value
+	 */
+	if (cont->parent == NULL) {
+		atomic_set(&oom_css->priority, 1);
+		atomic_set(&oom_css->effective_priority, 1);
+		atomic_set(&honour_cpuset_constraint, 0);
+	} else {
+		parent = oom_css_from_cgroup(cont->parent);
+		parent_priority = atomic_read(&parent->priority);
+		parent_effective_priority = 
+			atomic_read(&parent->effective_priority);
+		atomic_set(&oom_css->priority, parent_priority);
+		atomic_set(&oom_css->effective_priority,
+					parent_effective_priority);
+	}
+
+	return &oom_css->css;
+}
+
+static void oom_destroy(struct cgroup_subsys *ss, struct cgroup *cont)
+{
+	kfree(cont->subsys[oom_subsys_id]);
+}
+
+static void increase_effective_priority(struct cgroup *cgrp, u64 val)
+{
+	struct cgroup *curr;
+	struct oom_cgroup *oom_css;
+
+	atomic_set( &(oom_css_from_cgroup(cgrp))->effective_priority, val);
+
+	mutex_lock(&oom_subsys.hierarchy_mutex);
+
+	/*
+	 * DFS
+	 */
+	if (!list_empty(&cgrp->children))
+		curr = list_first_entry(&cgrp->children,
+					struct cgroup, sibling);
+	else
+		goto out;
+
+visit_children:
+	oom_css = oom_css_from_cgroup(curr);
+	if (atomic_read(&oom_css->effective_priority) < val)
+		atomic_set(&oom_css->effective_priority, val);
+
+	if (!list_empty(&curr->children)) {
+		curr = list_first_entry(&curr->children,
+					struct cgroup, sibling);
+		goto visit_children;
+	} else {
+visit_siblings:
+		if (curr == 0 || cgrp == curr) goto out;
+
+		if (curr->sibling.next != &curr->parent->children) {
+			curr = list_entry(curr->sibling.next,
+						struct cgroup, sibling);
+			goto visit_children;
+		} else {
+			curr = curr->parent;
+			goto visit_siblings;
+		}
+	}
+out:
+	mutex_unlock(&oom_subsys.hierarchy_mutex);
+
+}
+
+static void decrease_effective_priority(struct cgroup *cgrp, u64 val)
+{
+	struct cgroup *curr;
+	u64 priority, effective_priority;
+
+
+	effective_priority = val;
+
+	atomic_set(&oom_css_from_cgroup(cgrp)->effective_priority,
+							effective_priority);
+
+	mutex_lock(&oom_subsys.hierarchy_mutex);
+
+	/*
+	 * DFS
+	 */
+	if (!list_empty(&cgrp->children))
+		curr = list_first_entry(&cgrp->children,
+					struct cgroup, sibling);
+	else
+		goto out;
+
+visit_children:
+	priority = atomic_read(&oom_css_from_cgroup(curr)->priority);
+
+	if (priority > effective_priority) {
+		atomic_set(&oom_css_from_cgroup(curr)->
+					effective_priority, priority);
+		effective_priority = priority;
+	} else 
+		atomic_set(&oom_css_from_cgroup(curr)->
+				effective_priority,effective_priority);
+
+	if (!list_empty(&curr->children)) {
+		curr = list_first_entry(&curr->children,
+						struct cgroup, sibling);
+		goto visit_children;
+	} else {
+visit_siblings:
+		if (curr == 0 || cgrp == curr)
+			goto out;
+
+		if (curr->parent)
+       	        	effective_priority =
+			  atomic_read(&oom_css_from_cgroup(
+			   curr->parent)->effective_priority);
+		else
+        		effective_priority = val;
+
+		if (curr->sibling.next != &curr->parent->children) {
+			curr = list_entry(curr->sibling.next,
+						struct cgroup, sibling);
+			goto visit_children;
+		} else {
+			curr = curr->parent;
+			goto visit_siblings;
+		}
+	}
+out:
+				
+		mutex_unlock(&oom_subsys.hierarchy_mutex);
+
+}
+
+static int oom_priority_write(struct cgroup *cgrp, struct cftype *cft,
+                                       u64 val)
+{
+	u64 effective_priority;
+	u64 old_priority;
+	u64 parent_effective_priority = 0;
+
+	old_priority = atomic_read(&(oom_css_from_cgroup(cgrp))->priority);
+	atomic_set(&(oom_css_from_cgroup(cgrp))->priority, val);
+
+	effective_priority = atomic_read(
+			&(oom_css_from_cgroup(cgrp))->effective_priority);
+
+	/*
+	 * propagate new effective_priority to sub cgroups
+	 */
+	if (val > effective_priority)
+		increase_effective_priority(cgrp, val);
+	else if (effective_priority == old_priority &&
+						val < effective_priority) {
+		struct oom_cgroup *oom_css = NULL;
+		if (cgrp->parent)
+			oom_css = oom_css_from_cgroup(cgrp->parent);
+		else
+			oom_css = oom_css_from_cgroup(cgrp);
+
+		if (cgrp->parent)
+			parent_effective_priority =
+				atomic_read(&oom_css->effective_priority);
+			
+		if (cgrp->parent == NULL || 
+				parent_effective_priority < effective_priority) {
+			/*
+			 * set effective_priority to max of parents effective and
+			 * new priority
+			 */
+			if (cgrp->parent == NULL || effective_priority < val
+				 	|| parent_effective_priority < val)
+				effective_priority = val;
+			else
+				effective_priority = parent_effective_priority;
+
+			decrease_effective_priority(cgrp, effective_priority);
+
+		} 
+	}
+        return 0;
+}
+
+static u64 oom_effective_priority_read(struct cgroup *cgrp, struct cftype 
*cft)
+{
+        u64 priority = atomic_read(&(oom_css_from_cgroup(cgrp))-
>effective_priority);
+
+        return priority;
+}
+
+static u64 oom_priority_read(struct cgroup *cgrp, struct cftype *cft)
+{
+        u64 priority = atomic_read(&(oom_css_from_cgroup(cgrp))->priority);
+
+        return priority;
+}
+
+static int oom_cpuset_write(struct cgroup *cgrp, struct cftype *cft,
+					u64 val)
+{
+	if (val > 1)
+		return -EINVAL;
+	atomic_set(&honour_cpuset_constraint, val);
+	return 0;
+}
+
+static u64 oom_cpuset_read(struct cgroup *cgrp, struct cftype *cft)
+{
+        return atomic_read(&honour_cpuset_constraint);
+}
+
+static struct cftype oom_cgroup_files[] = {
+	{
+		.name = "priority",
+		.read_u64 = oom_priority_read,
+		.write_u64 = oom_priority_write,
+	},
+	{
+		.name = "effective_priority",
+		.read_u64 = oom_effective_priority_read,
+	},
+};
+
+static struct cftype oom_cgroup_root_only_files[] = {
+	{
+		.name = "cpuset_constraint",
+		.read_u64 = oom_cpuset_read,
+		.write_u64 = oom_cpuset_write,
+	},
+};
+
+static int oom_populate(struct cgroup_subsys *ss,
+                                struct cgroup *cont)
+{
+	int ret;
+
+	ret = cgroup_add_files(cont, ss, oom_cgroup_files,
+				ARRAY_SIZE(oom_cgroup_files));
+	if (!ret && cont->parent == NULL) {
+		ret = cgroup_add_files(cont, ss, oom_cgroup_root_only_files,
+				ARRAY_SIZE(oom_cgroup_root_only_files));
+	}
+
+	return ret;
+}
+
+struct cgroup_subsys oom_subsys = {
+	.name = "oom",
+	.subsys_id = oom_subsys_id,
+	.create = oom_create,
+	.destroy = oom_destroy,
+	.populate = oom_populate,
+};


^ permalink raw reply related	[flat|nested] 63+ messages in thread

* Re: [RFC] [PATCH] Cgroup based OOM killer controller
  2009-01-21 11:08 [RFC] [PATCH] Cgroup based OOM killer controller Nikanth Karthikesan
  2009-01-21 13:17 ` Evgeniy Polyakov
@ 2009-01-22  3:28 ` KAMEZAWA Hiroyuki
  2009-01-22  5:13   ` Nikanth Karthikesan
  2009-01-26 19:54 ` Balbir Singh
  2 siblings, 1 reply; 63+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-01-22  3:28 UTC (permalink / raw)
  To: Nikanth Karthikesan
  Cc: linux-kernel, Andrew Morton, Linus Torvalds, Evgeniy Polyakov,
	Chris Snook, Alan Cox, Arve Hjønnevåg, Paul Menage,
	containers

On Wed, 21 Jan 2009 16:38:21 +0530
Nikanth Karthikesan <knikanth@suse.de> wrote:

> As Alan Cox suggested/wondered in this thread, 
> http://lkml.org/lkml/2009/1/12/235 , this is a container group based approach 
> to override the oom killer selection without losing all the benefits of the 
> current oom killer heuristics and oom_adj interface.
> 
> It adds a tunable oom.victim to the oom cgroup. The oom killer will kill the 
> process using the usual badness value but only within the cgroup with the 
> maximum value for oom.victim before killing any process from a cgroup with a 
> lesser oom.victim number. Oom killing could be disabled by setting 
> oom.victim=0.
> 
> Signed-off-by: Nikanth Karthikesan <knikanth@suse.de>
> 

Assume following
  - the usar can tell "which process should be killed at first"

What is the difference between oom_adj and this cgroup to users ?
If oom_adj is hard to use, making it simpler is a good way, I think.
rather than adding new complication.

It seems both of oom_adj and this cgroup will be hard-to-use functions
for usual system administrators. But no better idea than using memcg
and committing memory usage.

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC] [PATCH] Cgroup based OOM killer controller
  2009-01-22  3:28 ` KAMEZAWA Hiroyuki
@ 2009-01-22  5:13   ` Nikanth Karthikesan
  2009-01-22  5:27     ` KAMEZAWA Hiroyuki
  2009-01-22  5:39     ` Arve Hjønnevåg
  0 siblings, 2 replies; 63+ messages in thread
From: Nikanth Karthikesan @ 2009-01-22  5:13 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-kernel, Andrew Morton, Linus Torvalds, Evgeniy Polyakov,
	Chris Snook, Alan Cox, Arve Hjønnevåg, Paul Menage,
	containers

On Thursday 22 January 2009 08:58:43 KAMEZAWA Hiroyuki wrote:
> On Wed, 21 Jan 2009 16:38:21 +0530
>
> Nikanth Karthikesan <knikanth@suse.de> wrote:
> > As Alan Cox suggested/wondered in this thread,
> > http://lkml.org/lkml/2009/1/12/235 , this is a container group based
> > approach to override the oom killer selection without losing all the
> > benefits of the current oom killer heuristics and oom_adj interface.
> >
> > It adds a tunable oom.victim to the oom cgroup. The oom killer will kill
> > the process using the usual badness value but only within the cgroup with
> > the maximum value for oom.victim before killing any process from a cgroup
> > with a lesser oom.victim number. Oom killing could be disabled by setting
> > oom.victim=0.
> >
> > Signed-off-by: Nikanth Karthikesan <knikanth@suse.de>
>
> Assume following
>   - the usar can tell "which process should be killed at first"
>
> What is the difference between oom_adj and this cgroup to users ?

It is next to impossible to specify the order among say 10 memory hogging 
tasks using oom_adj. Using this oom-controller users can specify the exact 
order.

> If oom_adj is hard to use, making it simpler is a good way, I think.
> rather than adding new complication.
>
> It seems both of oom_adj and this cgroup will be hard-to-use functions
> for usual system administrators. But no better idea than using memcg
> and committing memory usage.
>

To use oom_adj effectively one should continuously monitor oom_score of all 
the processes, which is a complex moving target and keep on adjusting the 
oom_adj of many tasks which still cannot guarantee the order. This controller 
is deterministic and hence easier to use.

Thanks
Nikanth

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC] [PATCH] Cgroup based OOM killer controller
  2009-01-22  5:13   ` Nikanth Karthikesan
@ 2009-01-22  5:27     ` KAMEZAWA Hiroyuki
  2009-01-22  6:11       ` Nikanth Karthikesan
  2009-01-22  5:39     ` Arve Hjønnevåg
  1 sibling, 1 reply; 63+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-01-22  5:27 UTC (permalink / raw)
  To: Nikanth Karthikesan
  Cc: linux-kernel, Andrew Morton, Linus Torvalds, Evgeniy Polyakov,
	Chris Snook, Alan Cox, Arve Hjønnevåg, Paul Menage,
	containers

On Thu, 22 Jan 2009 10:43:12 +0530
Nikanth Karthikesan <knikanth@suse.de> wrote:

> On Thursday 22 January 2009 08:58:43 KAMEZAWA Hiroyuki wrote:
> > On Wed, 21 Jan 2009 16:38:21 +0530
> >
> > Nikanth Karthikesan <knikanth@suse.de> wrote:
> > > As Alan Cox suggested/wondered in this thread,
> > > http://lkml.org/lkml/2009/1/12/235 , this is a container group based
> > > approach to override the oom killer selection without losing all the
> > > benefits of the current oom killer heuristics and oom_adj interface.
> > >
> > > It adds a tunable oom.victim to the oom cgroup. The oom killer will kill
> > > the process using the usual badness value but only within the cgroup with
> > > the maximum value for oom.victim before killing any process from a cgroup
> > > with a lesser oom.victim number. Oom killing could be disabled by setting
> > > oom.victim=0.
> > >
> > > Signed-off-by: Nikanth Karthikesan <knikanth@suse.de>
> >
> > Assume following
> >   - the usar can tell "which process should be killed at first"
> >
> > What is the difference between oom_adj and this cgroup to users ?
> 
> It is next to impossible to specify the order among say 10 memory hogging 
> tasks using oom_adj. Using this oom-controller users can specify the exact 
> order.
> 
> > If oom_adj is hard to use, making it simpler is a good way, I think.
> > rather than adding new complication.
> >
> > It seems both of oom_adj and this cgroup will be hard-to-use functions
> > for usual system administrators. But no better idea than using memcg
> > and committing memory usage.
> >
> 
> To use oom_adj effectively one should continuously monitor oom_score of all 
> the processes, which is a complex moving target and keep on adjusting the 
> oom_adj of many tasks which still cannot guarantee the order. This controller 
> is deterministic and hence easier to use.
> 

Okay, thank you for explanation :)
I think it's better to explain "why this is much easier to use rather
than oom_adj and what is the benefit to users." in your patch description
and to improve your documentation.

+But it is very difficult to suggest an order among tasks to be killed during
+Out Of Memory situation. The OOM Killer controller aids in doing that.

As.
 Difference from oom_adj:
 This allows users to specify "strict order" of oom-kill's select-bad-process
 operation. While oom_adj just works as a hint for the kernel, OOM Killer
 Controller gives users full control.

 In general, it's very hard to specify oom-kill order of several applications
 only by oom_adj because it's just affects "badness" calculation.

A my English skill is poor, you'll be able to write better text ;)

Regards,
-Kame


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC] [PATCH] Cgroup based OOM killer controller
  2009-01-22  5:27     ` KAMEZAWA Hiroyuki
@ 2009-01-22  6:11       ` Nikanth Karthikesan
  0 siblings, 0 replies; 63+ messages in thread
From: Nikanth Karthikesan @ 2009-01-22  6:11 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-kernel, Andrew Morton, Linus Torvalds, Evgeniy Polyakov,
	Chris Snook, Alan Cox, Arve Hjønnevåg, Paul Menage,
	containers

On Thursday 22 January 2009 10:57:21 KAMEZAWA Hiroyuki wrote:
> On Thu, 22 Jan 2009 10:43:12 +0530
>
> Nikanth Karthikesan <knikanth@suse.de> wrote:
> > On Thursday 22 January 2009 08:58:43 KAMEZAWA Hiroyuki wrote:
> > > On Wed, 21 Jan 2009 16:38:21 +0530
> > >
> > > Nikanth Karthikesan <knikanth@suse.de> wrote:
> > > > As Alan Cox suggested/wondered in this thread,
> > > > http://lkml.org/lkml/2009/1/12/235 , this is a container group based
> > > > approach to override the oom killer selection without losing all the
> > > > benefits of the current oom killer heuristics and oom_adj interface.
> > > >
> > > > It adds a tunable oom.victim to the oom cgroup. The oom killer will
> > > > kill the process using the usual badness value but only within the
> > > > cgroup with the maximum value for oom.victim before killing any
> > > > process from a cgroup with a lesser oom.victim number. Oom killing
> > > > could be disabled by setting oom.victim=0.
> > > >
> > > > Signed-off-by: Nikanth Karthikesan <knikanth@suse.de>
> > >
> > > Assume following
> > >   - the usar can tell "which process should be killed at first"
> > >
> > > What is the difference between oom_adj and this cgroup to users ?
> >
> > It is next to impossible to specify the order among say 10 memory hogging
> > tasks using oom_adj. Using this oom-controller users can specify the
> > exact order.
> >
> > > If oom_adj is hard to use, making it simpler is a good way, I think.
> > > rather than adding new complication.
> > >
> > > It seems both of oom_adj and this cgroup will be hard-to-use functions
> > > for usual system administrators. But no better idea than using memcg
> > > and committing memory usage.
> >
> > To use oom_adj effectively one should continuously monitor oom_score of
> > all the processes, which is a complex moving target and keep on adjusting
> > the oom_adj of many tasks which still cannot guarantee the order. This
> > controller is deterministic and hence easier to use.
>
> Okay, thank you for explanation :)
> I think it's better to explain "why this is much easier to use rather
> than oom_adj and what is the benefit to users." in your patch description
> and to improve your documentation.
>
> +But it is very difficult to suggest an order among tasks to be killed
> during +Out Of Memory situation. The OOM Killer controller aids in doing
> that.
>
> As.
>  Difference from oom_adj:
>  This allows users to specify "strict order" of oom-kill's
> select-bad-process operation. While oom_adj just works as a hint for the
> kernel, OOM Killer Controller gives users full control.
>
>  In general, it's very hard to specify oom-kill order of several
> applications only by oom_adj because it's just affects "badness"
> calculation.
>

Yes, this could be added to the patch description or the documentation.

Thanks
Nikanth

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC] [PATCH] Cgroup based OOM killer controller
  2009-01-22  5:13   ` Nikanth Karthikesan
  2009-01-22  5:27     ` KAMEZAWA Hiroyuki
@ 2009-01-22  5:39     ` Arve Hjønnevåg
  2009-01-22  6:12       ` Nikanth Karthikesan
  1 sibling, 1 reply; 63+ messages in thread
From: Arve Hjønnevåg @ 2009-01-22  5:39 UTC (permalink / raw)
  To: Nikanth Karthikesan
  Cc: KAMEZAWA Hiroyuki, linux-kernel, Andrew Morton, Linus Torvalds,
	Evgeniy Polyakov, Chris Snook, Alan Cox, Paul Menage, containers

On Wed, Jan 21, 2009 at 9:13 PM, Nikanth Karthikesan <knikanth@suse.de> wrote:
> To use oom_adj effectively one should continuously monitor oom_score of all
> the processes, which is a complex moving target and keep on adjusting the
> oom_adj of many tasks which still cannot guarantee the order. This controller
> is deterministic and hence easier to use.
>

Why not add an option to make oom_adj ensure strict ordering instead?

-- 
Arve Hjønnevåg

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC] [PATCH] Cgroup based OOM killer controller
  2009-01-22  5:39     ` Arve Hjønnevåg
@ 2009-01-22  6:12       ` Nikanth Karthikesan
  2009-01-22  6:29         ` Arve Hjønnevåg
  0 siblings, 1 reply; 63+ messages in thread
From: Nikanth Karthikesan @ 2009-01-22  6:12 UTC (permalink / raw)
  To: Arve Hjønnevåg
  Cc: KAMEZAWA Hiroyuki, linux-kernel, Andrew Morton, Linus Torvalds,
	Evgeniy Polyakov, Chris Snook, Alan Cox, Paul Menage, containers

On Thursday 22 January 2009 11:09:45 Arve Hjønnevåg wrote:
> On Wed, Jan 21, 2009 at 9:13 PM, Nikanth Karthikesan <knikanth@suse.de> 
wrote:
> > To use oom_adj effectively one should continuously monitor oom_score of
> > all the processes, which is a complex moving target and keep on adjusting
> > the oom_adj of many tasks which still cannot guarantee the order. This
> > controller is deterministic and hence easier to use.
>
> Why not add an option to make oom_adj ensure strict ordering instead?

This could be done in 2 ways.
1. Make oom_adj itself strict.(based on some other parameter?)
- Adds to confusion whether the current oom_adj is a strict value or the usual 
suggestion. 
- It would disable the oom_adj suggestion which could have been used till now. 
- It is a public interface, and changing that might break some one's script.

2. Add addtional parameter, say  /proc/<pid>/oom_order
- Not easy to use.
- Say I had assigned the oom.victim to a task and it had forked a lot. Now to 
change the value for all the tasks it is easier with cgroups.
- Some optimization that Kame specified earlier would be harder to achieve.

Basically oom-controller implements option 2, using cgroups which can be 
thought of as a modern interface for proc. Also it could be used along with 
other cgroup controllers like the group scheduler. Say you have 2 groups of 
tasks, clubed as entertainment and science, you could use the group scheduler 
to give more CPU bandwidth to science and instruct oom-controller to kill 
entertainment tasks in case of OOM situation.

Thanks
Nikanth


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC] [PATCH] Cgroup based OOM killer controller
  2009-01-22  6:12       ` Nikanth Karthikesan
@ 2009-01-22  6:29         ` Arve Hjønnevåg
  2009-01-22  6:42           ` Nikanth Karthikesan
  0 siblings, 1 reply; 63+ messages in thread
From: Arve Hjønnevåg @ 2009-01-22  6:29 UTC (permalink / raw)
  To: Nikanth Karthikesan
  Cc: KAMEZAWA Hiroyuki, linux-kernel, Andrew Morton, Linus Torvalds,
	Evgeniy Polyakov, Chris Snook, Alan Cox, Paul Menage, containers

On Wed, Jan 21, 2009 at 10:12 PM, Nikanth Karthikesan <knikanth@suse.de> wrote:
> On Thursday 22 January 2009 11:09:45 Arve Hjønnevåg wrote:
>> On Wed, Jan 21, 2009 at 9:13 PM, Nikanth Karthikesan <knikanth@suse.de>
> wrote:
>> > To use oom_adj effectively one should continuously monitor oom_score of
>> > all the processes, which is a complex moving target and keep on adjusting
>> > the oom_adj of many tasks which still cannot guarantee the order. This
>> > controller is deterministic and hence easier to use.
>>
>> Why not add an option to make oom_adj ensure strict ordering instead?
>
> This could be done in 2 ways.
> 1. Make oom_adj itself strict.(based on some other parameter?)
> - Adds to confusion whether the current oom_adj is a strict value or the usual
> suggestion.
> - It would disable the oom_adj suggestion which could have been used till now.
> - It is a public interface, and changing that might break some one's script.
>
> 2. Add addtional parameter, say  /proc/<pid>/oom_order
> - Not easy to use.
> - Say I had assigned the oom.victim to a task and it had forked a lot. Now to
> change the value for all the tasks it is easier with cgroups.
> - Some optimization that Kame specified earlier would be harder to achieve.
>

Both options would work for us, but option 1 require no change to our
user space code. I agree that some operations are easier with a
cgroups approach, but since we don't perform these operations it would
be nice to not require cgroups to control the oom killer.

-- 
Arve Hjønnevåg

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC] [PATCH] Cgroup based OOM killer controller
  2009-01-22  6:29         ` Arve Hjønnevåg
@ 2009-01-22  6:42           ` Nikanth Karthikesan
  0 siblings, 0 replies; 63+ messages in thread
From: Nikanth Karthikesan @ 2009-01-22  6:42 UTC (permalink / raw)
  To: Arve Hjønnevåg
  Cc: KAMEZAWA Hiroyuki, linux-kernel, Andrew Morton, Linus Torvalds,
	Evgeniy Polyakov, Chris Snook, Alan Cox, Paul Menage, containers

On Thursday 22 January 2009 11:59:20 Arve Hjønnevåg wrote:
> On Wed, Jan 21, 2009 at 10:12 PM, Nikanth Karthikesan <knikanth@suse.de> 
wrote:
> > On Thursday 22 January 2009 11:09:45 Arve Hjønnevåg wrote:
> >> On Wed, Jan 21, 2009 at 9:13 PM, Nikanth Karthikesan <knikanth@suse.de>
> >
> > wrote:
> >> > To use oom_adj effectively one should continuously monitor oom_score
> >> > of all the processes, which is a complex moving target and keep on
> >> > adjusting the oom_adj of many tasks which still cannot guarantee the
> >> > order. This controller is deterministic and hence easier to use.
> >>
> >> Why not add an option to make oom_adj ensure strict ordering instead?
> >
> > This could be done in 2 ways.
> > 1. Make oom_adj itself strict.(based on some other parameter?)
> > - Adds to confusion whether the current oom_adj is a strict value or the
> > usual suggestion.
> > - It would disable the oom_adj suggestion which could have been used till
> > now. - It is a public interface, and changing that might break some one's
> > script.
> >
> > 2. Add addtional parameter, say  /proc/<pid>/oom_order
> > - Not easy to use.
> > - Say I had assigned the oom.victim to a task and it had forked a lot.
> > Now to change the value for all the tasks it is easier with cgroups.
> > - Some optimization that Kame specified earlier would be harder to
> > achieve.
>
> Both options would work for us, but option 1 require no change to our
> user space code.

Some scripts might be assuming the oom_adj will always be -17 to +15. So not 
more than 32+1 levels or order is possible. Yes it should be enough mostly. 
But incase you want to leave space between for adding tasks in between, one 
has to take extra care or do more work. And someone might still assume old 
behaviour and by seeing oom_score and oom_adj he might expect a different 
behaviour. And if someone wants the old behaviour, we have to provide an 
aditional variable to enable/disable this.

> I agree that some operations are easier with a
> cgroups approach, but since we don't perform these operations it would
> be nice to not require cgroups to control the oom killer.

We would perform these operations if these would be available and easier to 
use, so it would be nice to use cgroups.

Thanks
Nikanth

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC] [PATCH] Cgroup based OOM killer controller
  2009-01-21 11:08 [RFC] [PATCH] Cgroup based OOM killer controller Nikanth Karthikesan
  2009-01-21 13:17 ` Evgeniy Polyakov
  2009-01-22  3:28 ` KAMEZAWA Hiroyuki
@ 2009-01-26 19:54 ` Balbir Singh
  2009-01-26 19:56   ` Alan Cox
  2 siblings, 1 reply; 63+ messages in thread
From: Balbir Singh @ 2009-01-26 19:54 UTC (permalink / raw)
  To: Nikanth Karthikesan
  Cc: linux-kernel, containers, Arve Hj?nnev?g, Evgeniy Polyakov,
	Andrew Morton, Chris Snook, Linus Torvalds, Paul Menage, Alan Cox

* Nikanth Karthikesan <knikanth@suse.de> [2009-01-21 16:38:21]:

> As Alan Cox suggested/wondered in this thread, 
> http://lkml.org/lkml/2009/1/12/235 , this is a container group based approach 
> to override the oom killer selection without losing all the benefits of the 
> current oom killer heuristics and oom_adj interface.
> 
> It adds a tunable oom.victim to the oom cgroup. The oom killer will kill the 
> process using the usual badness value but only within the cgroup with the 
> maximum value for oom.victim before killing any process from a cgroup with a 
> lesser oom.victim number. Oom killing could be disabled by setting 
> oom.victim=0.

Looking at the patch, I wonder if it is time for user space OOM
notifications that were discussed during the containers mini-summit.
The idea is to inform user space about OOM's and let user space take
action, if no action is taken, the default handler kicks in.

-- 
	Balbir

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC] [PATCH] Cgroup based OOM killer controller
  2009-01-26 19:54 ` Balbir Singh
@ 2009-01-26 19:56   ` Alan Cox
  2009-01-27  7:02     ` KOSAKI Motohiro
  0 siblings, 1 reply; 63+ messages in thread
From: Alan Cox @ 2009-01-26 19:56 UTC (permalink / raw)
  To: balbir
  Cc: Nikanth Karthikesan, linux-kernel, containers, Arve Hj?nnev?g,
	Evgeniy Polyakov, Andrew Morton, Chris Snook, Linus Torvalds,
	Paul Menage

On Tue, 27 Jan 2009 01:24:31 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:

> * Nikanth Karthikesan <knikanth@suse.de> [2009-01-21 16:38:21]:
> 
> > As Alan Cox suggested/wondered in this thread, 
> > http://lkml.org/lkml/2009/1/12/235 , this is a container group based approach 
> > to override the oom killer selection without losing all the benefits of the 
> > current oom killer heuristics and oom_adj interface.
> > 
> > It adds a tunable oom.victim to the oom cgroup. The oom killer will kill the 
> > process using the usual badness value but only within the cgroup with the 
> > maximum value for oom.victim before killing any process from a cgroup with a 
> > lesser oom.victim number. Oom killing could be disabled by setting 
> > oom.victim=0.
> 
> Looking at the patch, I wonder if it is time for user space OOM
> notifications that were discussed during the containers mini-summit.
> The idea is to inform user space about OOM's and let user space take
> action, if no action is taken, the default handler kicks in.

The OLPC folks (Marcelo I believe) posted code for this and I believe
OLPC is using this functionality internally so that under memory pressure
(before we actually hit OOM) programs can respond by doing stuff like
evicting caches.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC] [PATCH] Cgroup based OOM killer controller
  2009-01-26 19:56   ` Alan Cox
@ 2009-01-27  7:02     ` KOSAKI Motohiro
  2009-01-27  7:26       ` Balbir Singh
  2009-01-27  7:39       ` David Rientjes
  0 siblings, 2 replies; 63+ messages in thread
From: KOSAKI Motohiro @ 2009-01-27  7:02 UTC (permalink / raw)
  To: Alan Cox
  Cc: kosaki.motohiro, balbir, Nikanth Karthikesan, containers,
	linux-kernel, Torvalds, Arve Hj?nnev?g, Evgeniy Polyakov,
	Andrew Morton, Chris Snook, Linus, Paul Menage

Hi

> > > As Alan Cox suggested/wondered in this thread, 
> > > http://lkml.org/lkml/2009/1/12/235 , this is a container group based approach 
> > > to override the oom killer selection without losing all the benefits of the 
> > > current oom killer heuristics and oom_adj interface.
> > > 
> > > It adds a tunable oom.victim to the oom cgroup. The oom killer will kill the 
> > > process using the usual badness value but only within the cgroup with the 
> > > maximum value for oom.victim before killing any process from a cgroup with a 
> > > lesser oom.victim number. Oom killing could be disabled by setting 
> > > oom.victim=0.
> > 
> > Looking at the patch, I wonder if it is time for user space OOM
> > notifications that were discussed during the containers mini-summit.
> > The idea is to inform user space about OOM's and let user space take
> > action, if no action is taken, the default handler kicks in.
> 
> The OLPC folks (Marcelo I believe) posted code for this and I believe
> OLPC is using this functionality internally so that under memory pressure
> (before we actually hit OOM) programs can respond by doing stuff like
> evicting caches.

Confused.

As far as I know, people want the method of flexible cache treating.
but oom seems less flexible than userland notification.

Why do you think notification is bad?



^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC] [PATCH] Cgroup based OOM killer controller
  2009-01-27  7:02     ` KOSAKI Motohiro
@ 2009-01-27  7:26       ` Balbir Singh
  2009-01-27  7:39       ` David Rientjes
  1 sibling, 0 replies; 63+ messages in thread
From: Balbir Singh @ 2009-01-27  7:26 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Alan Cox, Nikanth Karthikesan, containers, linux-kernel, Torvalds,
	Arve Hj?nnev?g, Evgeniy Polyakov, Andrew Morton, Chris Snook,
	Linus, Paul Menage

* KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> [2009-01-27 16:02:34]:

> Hi
> 
> > > > As Alan Cox suggested/wondered in this thread, 
> > > > http://lkml.org/lkml/2009/1/12/235 , this is a container group based approach 
> > > > to override the oom killer selection without losing all the benefits of the 
> > > > current oom killer heuristics and oom_adj interface.
> > > > 
> > > > It adds a tunable oom.victim to the oom cgroup. The oom killer will kill the 
> > > > process using the usual badness value but only within the cgroup with the 
> > > > maximum value for oom.victim before killing any process from a cgroup with a 
> > > > lesser oom.victim number. Oom killing could be disabled by setting 
> > > > oom.victim=0.
> > > 
> > > Looking at the patch, I wonder if it is time for user space OOM
> > > notifications that were discussed during the containers mini-summit.
> > > The idea is to inform user space about OOM's and let user space take
> > > action, if no action is taken, the default handler kicks in.
> > 
> > The OLPC folks (Marcelo I believe) posted code for this and I believe
> > OLPC is using this functionality internally so that under memory pressure
> > (before we actually hit OOM) programs can respond by doing stuff like
> > evicting caches.
> 

I did see the patches on linux-mm, but a more generic cgroup patch
would help both cases, in the absence of cgroups, the default cgroup
will contain all tasks and can carry out the handling.

> Confused.
> 
> As far as I know, people want the method of flexible cache treating.
> but oom seems less flexible than userland notification.
> 
> Why do you think notification is bad?

I did not find Alan's message confusing or stating that notification
was bad, but I might be misreading it.

-- 
	Balbir

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC] [PATCH] Cgroup based OOM killer controller
  2009-01-27  7:02     ` KOSAKI Motohiro
  2009-01-27  7:26       ` Balbir Singh
@ 2009-01-27  7:39       ` David Rientjes
  2009-01-27  7:44         ` KOSAKI Motohiro
  1 sibling, 1 reply; 63+ messages in thread
From: David Rientjes @ 2009-01-27  7:39 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Alan Cox, balbir, Nikanth Karthikesan, containers, linux-kernel,
	Torvalds, Arve Hj?nnev?g, Evgeniy Polyakov, Andrew Morton,
	Chris Snook, Linus, Paul Menage

On Tue, 27 Jan 2009, KOSAKI Motohiro wrote:

> Confused.
> 
> As far as I know, people want the method of flexible cache treating.
> but oom seems less flexible than userland notification.
> 
> Why do you think notification is bad?
> 

There're a couple of proposals that have been discussed recently that 
share some functional behavior.

One is the cgroup oom notifier that allows you to attach a task to wait on 
an oom condition for a collection of tasks.  That allows userspace to 
respond to the condition by droping caches, adding nodes to a cpuset, 
elevating memory controller limits, sending a signal, etc.  It can also 
defer to the kernel oom killer as a last resort.

The other is /dev/mem_notify that allows you to poll() on a device file 
and be informed of low memory events.  This can include the cgroup oom 
notifier behavior when a collection of tasks is completely out of memory, 
but can also warn when such a condition may be imminent.  I suggested that 
this be implemented as a client of cgroups so that different handlers can 
be responsible for different aggregates of tasks.

I think the latter is a much more powerful tool and includes all the 
behavior of the former.  It preserves the oom killer as a last resort for 
the kernel and defers all preference killing or lowmem responses to 
userspace.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC] [PATCH] Cgroup based OOM killer controller
  2009-01-27  7:39       ` David Rientjes
@ 2009-01-27  7:44         ` KOSAKI Motohiro
  2009-01-27  7:51           ` David Rientjes
  0 siblings, 1 reply; 63+ messages in thread
From: KOSAKI Motohiro @ 2009-01-27  7:44 UTC (permalink / raw)
  To: David Rientjes
  Cc: kosaki.motohiro, Alan Cox, balbir, Nikanth Karthikesan,
	containers, linux-kernel, Torvalds, Arve Hj?nnev?g,
	Evgeniy Polyakov, Andrew Morton, Chris Snook, Linus, Paul Menage

> On Tue, 27 Jan 2009, KOSAKI Motohiro wrote:
> 
> > Confused.
> > 
> > As far as I know, people want the method of flexible cache treating.
> > but oom seems less flexible than userland notification.
> > 
> > Why do you think notification is bad?
> > 
> 
> There're a couple of proposals that have been discussed recently that 
> share some functional behavior.
> 
> One is the cgroup oom notifier that allows you to attach a task to wait on 
> an oom condition for a collection of tasks.  That allows userspace to 
> respond to the condition by droping caches, adding nodes to a cpuset, 
> elevating memory controller limits, sending a signal, etc.  It can also 
> defer to the kernel oom killer as a last resort.
> 
> The other is /dev/mem_notify that allows you to poll() on a device file 
> and be informed of low memory events.  This can include the cgroup oom 
> notifier behavior when a collection of tasks is completely out of memory, 
> but can also warn when such a condition may be imminent.  I suggested that 
> this be implemented as a client of cgroups so that different handlers can 
> be responsible for different aggregates of tasks.
>
> I think the latter is a much more powerful tool and includes all the 
> behavior of the former.  It preserves the oom killer as a last resort for 
> the kernel and defers all preference killing or lowmem responses to 
> userspace.

Yup, indeed. :)
honestly, I talked about the same thingk recently "lowmemory android driver not needed?" thread.




^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC] [PATCH] Cgroup based OOM killer controller
  2009-01-27  7:44         ` KOSAKI Motohiro
@ 2009-01-27  7:51           ` David Rientjes
  2009-01-27  9:31             ` Evgeniy Polyakov
  0 siblings, 1 reply; 63+ messages in thread
From: David Rientjes @ 2009-01-27  7:51 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Alan Cox, balbir, Nikanth Karthikesan, containers, linux-kernel,
	Torvalds, Arve Hj?nnev?g, Evgeniy Polyakov, Andrew Morton,
	Chris Snook, Paul Menage

On Tue, 27 Jan 2009, KOSAKI Motohiro wrote:

> Yup, indeed. :)
> honestly, I talked about the same thingk recently "lowmemory android driver not needed?" thread.
> 

Yeah, I proposed /dev/mem_notify being made as a client of cgroups there 
in http://marc.info/?l=linux-kernel&m=123200623628685

How do you replace the oom killer's capability of giving a killed task 
access to memory reserves with TIF_MEMDIE in userspace?

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC] [PATCH] Cgroup based OOM killer controller
  2009-01-27  7:51           ` David Rientjes
@ 2009-01-27  9:31             ` Evgeniy Polyakov
  2009-01-27  9:37               ` David Rientjes
  2009-01-27 10:40               ` KOSAKI Motohiro
  0 siblings, 2 replies; 63+ messages in thread
From: Evgeniy Polyakov @ 2009-01-27  9:31 UTC (permalink / raw)
  To: David Rientjes
  Cc: KOSAKI Motohiro, Alan Cox, balbir, Nikanth Karthikesan,
	containers, linux-kernel, Torvalds, Arve Hj?nnev?g, Andrew Morton,
	Chris Snook, Paul Menage

On Mon, Jan 26, 2009 at 11:51:27PM -0800, David Rientjes (rientjes@google.com) wrote:
> Yeah, I proposed /dev/mem_notify being made as a client of cgroups there 
> in http://marc.info/?l=linux-kernel&m=123200623628685
> 
> How do you replace the oom killer's capability of giving a killed task 
> access to memory reserves with TIF_MEMDIE in userspace?

/dev/mem_notify is a great idea, but please do not limit existing
oom-killer in its ability to do the job and do not rely on application's
ability to send a SIGKILL which will not kill tasks in unkillable state
contrary to oom-killer.

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC] [PATCH] Cgroup based OOM killer controller
  2009-01-27  9:31             ` Evgeniy Polyakov
@ 2009-01-27  9:37               ` David Rientjes
  2009-01-27 13:40                 ` Evgeniy Polyakov
  2009-01-27 10:40               ` KOSAKI Motohiro
  1 sibling, 1 reply; 63+ messages in thread
From: David Rientjes @ 2009-01-27  9:37 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: KOSAKI Motohiro, Alan Cox, balbir, Nikanth Karthikesan,
	containers, linux-kernel, Torvalds, Arve Hj?nnev?g, Andrew Morton,
	Chris Snook, Paul Menage

On Tue, 27 Jan 2009, Evgeniy Polyakov wrote:

> /dev/mem_notify is a great idea, but please do not limit existing
> oom-killer in its ability to do the job and do not rely on application's
> ability to send a SIGKILL which will not kill tasks in unkillable state
> contrary to oom-killer.
> 

You're missing the point, /dev/mem_notify would notify userspace of lowmem 
situations and allow it to respond appropriately in any number of ways 
before an oom condition exists.

When the system (or cpuset, memory controller, etc) is oom, userspace can 
choose to defer to the oom killer so that it may kill a task that would 
most likely lead to future memory freeing with access to memory reserves.

There is no additional oom killer limitation imposed here, nor can the oom 
killer kill a task hung in D state any better than userspace.

Thanks.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC] [PATCH] Cgroup based OOM killer controller
  2009-01-27  9:37               ` David Rientjes
@ 2009-01-27 13:40                 ` Evgeniy Polyakov
  2009-01-27 20:37                   ` David Rientjes
  0 siblings, 1 reply; 63+ messages in thread
From: Evgeniy Polyakov @ 2009-01-27 13:40 UTC (permalink / raw)
  To: David Rientjes
  Cc: KOSAKI Motohiro, Alan Cox, balbir, Nikanth Karthikesan,
	containers, linux-kernel, Torvalds, Arve Hj?nnev?g, Andrew Morton,
	Chris Snook, Paul Menage

Hi David.

On Tue, Jan 27, 2009 at 01:37:55AM -0800, David Rientjes (rientjes@google.com) wrote:
> > /dev/mem_notify is a great idea, but please do not limit existing
> > oom-killer in its ability to do the job and do not rely on application's
> > ability to send a SIGKILL which will not kill tasks in unkillable state
> > contrary to oom-killer.
> > 
> 
> You're missing the point, /dev/mem_notify would notify userspace of lowmem 
> situations and allow it to respond appropriately in any number of ways 
> before an oom condition exists.

Yes, I know.

> When the system (or cpuset, memory controller, etc) is oom, userspace can 
> choose to defer to the oom killer so that it may kill a task that would 
> most likely lead to future memory freeing with access to memory reserves.
> 
> There is no additional oom killer limitation imposed here, nor can the oom 
> killer kill a task hung in D state any better than userspace.

Well, oom-killer can, since it drops unkillable state from the process
mask, that may be not enough though, but it tries more than userspace.

My main point was to haev a way to monitor memory usage and that any
process could tune own behaviour according to that information. Which is
not realated to the system oom-killer at all. Thus /dev/mem_notify is
interested first (and only the first) as a memory usage notification
interface and not a way to invoke any kind of 'soft' oom-killer.
Application can do whatever it wants of course including killing itself
or the neighbours, but this should not be forced as a usage policy.

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC] [PATCH] Cgroup based OOM killer controller
  2009-01-27 13:40                 ` Evgeniy Polyakov
@ 2009-01-27 20:37                   ` David Rientjes
  2009-01-27 21:51                     ` Evgeniy Polyakov
  0 siblings, 1 reply; 63+ messages in thread
From: David Rientjes @ 2009-01-27 20:37 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: KOSAKI Motohiro, Alan Cox, balbir, Nikanth Karthikesan,
	containers, linux-kernel, Torvalds, Arve Hj?nnev?g, Andrew Morton,
	Chris Snook, Paul Menage

On Tue, 27 Jan 2009, Evgeniy Polyakov wrote:

> > There is no additional oom killer limitation imposed here, nor can the oom 
> > killer kill a task hung in D state any better than userspace.
> 
> Well, oom-killer can, since it drops unkillable state from the process
> mask, that may be not enough though, but it tries more than userspace.
> 

The only thing it does is send a SIGKILL and gives the thread access to 
memory reserves with TIF_MEMDIE, it doesn't drop any unkillable state.  If 
its victim is hung in D state and the memory reserves do not allow it to 
return to being runnable, this task will not die and the oom killer would 
livelock unless given another target.

> My main point was to haev a way to monitor memory usage and that any
> process could tune own behaviour according to that information. Which is
> not realated to the system oom-killer at all. Thus /dev/mem_notify is
> interested first (and only the first) as a memory usage notification
> interface and not a way to invoke any kind of 'soft' oom-killer.

It's a way to prevent invoking the kernel oom killer by allowing userspace 
notification of events where methods such as droping caches, elevating 
limits, adding nodes, sending signals, etc, can prevent such a problem.  
When the system (or cgroup) is completely oom, it can also issue SIGKILLs 
that will free some memory and preempt the oom killer from acting.

I think there might be some confusion about my proposal for extending 
/dev/mem_notify.  Not only should it notify of certain low memory events, 
but it should also allow userspace notification of oom events, just like 
the cgroup oom notifier patch allowed.  Instead of attaching a task to a 
cgroup file in that case, however, this would simply be the responsibility 
of a task that has set up a poll() on the cgroup's mem_notify file.  A 
configurable delay could be imposed so page allocation attempts simply 
loop while the userspace handler responds and then only invoke the oom 
killer when absolutely necessary.

> Application can do whatever it wants of course including killing itself
> or the neighbours, but this should not be forced as a usage policy.
> 

If preference killing is your goal, then userspace can do it with the 
/dev/mem_notify functionality.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC] [PATCH] Cgroup based OOM killer controller
  2009-01-27 20:37                   ` David Rientjes
@ 2009-01-27 21:51                     ` Evgeniy Polyakov
  0 siblings, 0 replies; 63+ messages in thread
From: Evgeniy Polyakov @ 2009-01-27 21:51 UTC (permalink / raw)
  To: David Rientjes
  Cc: KOSAKI Motohiro, Alan Cox, balbir, Nikanth Karthikesan,
	containers, linux-kernel, Torvalds, Arve Hj?nnev?g, Andrew Morton,
	Chris Snook, Paul Menage

On Tue, Jan 27, 2009 at 12:37:21PM -0800, David Rientjes (rientjes@google.com) wrote:
> > Well, oom-killer can, since it drops unkillable state from the process
> > mask, that may be not enough though, but it tries more than userspace.
> > 
> 
> The only thing it does is send a SIGKILL and gives the thread access to 
> memory reserves with TIF_MEMDIE, it doesn't drop any unkillable state.  If 

There is a small difference between force_sig_info() and usual
send_sinal() used by kill.

> its victim is hung in D state and the memory reserves do not allow it to 
> return to being runnable, this task will not die and the oom killer would 
> livelock unless given another target.

D-states are different. In the current tree we even have
page_lock_killable(), so it depends.

> > My main point was to haev a way to monitor memory usage and that any
> > process could tune own behaviour according to that information. Which is
> > not realated to the system oom-killer at all. Thus /dev/mem_notify is
> > interested first (and only the first) as a memory usage notification
> > interface and not a way to invoke any kind of 'soft' oom-killer.
> 
> It's a way to prevent invoking the kernel oom killer by allowing userspace 
> notification of events where methods such as droping caches, elevating 
> limits, adding nodes, sending signals, etc, can prevent such a problem.  
> When the system (or cgroup) is completely oom, it can also issue SIGKILLs 
> that will free some memory and preempt the oom killer from acting.
> 
> I think there might be some confusion about my proposal for extending 
> /dev/mem_notify.  Not only should it notify of certain low memory events, 
> but it should also allow userspace notification of oom events, just like 
> the cgroup oom notifier patch allowed.  Instead of attaching a task to a 
> cgroup file in that case, however, this would simply be the responsibility 
> of a task that has set up a poll() on the cgroup's mem_notify file.  A 
> configurable delay could be imposed so page allocation attempts simply 
> loop while the userspace handler responds and then only invoke the oom 
> killer when absolutely necessary.

I have really no objections against this and extending oom-killer to
allow to wait a bit in the allocation path before userspace makes some
progress. But do not drop existing oom-killer (i.e. its ability to kill
processes) in favour of this new feature. Let's have both and if
extension failed for some reason, old oom-killer will do the things.

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC] [PATCH] Cgroup based OOM killer controller
  2009-01-27  9:31             ` Evgeniy Polyakov
  2009-01-27  9:37               ` David Rientjes
@ 2009-01-27 10:40               ` KOSAKI Motohiro
  2009-01-27 13:45                 ` Evgeniy Polyakov
  1 sibling, 1 reply; 63+ messages in thread
From: KOSAKI Motohiro @ 2009-01-27 10:40 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: kosaki.motohiro, David Rientjes, Alan Cox, balbir,
	Nikanth Karthikesan, containers, linux-kernel, Torvalds,
	Arve Hj?nnev?g, Andrew Morton, Chris Snook, Paul Menage

Hi Evgeniy,

> On Mon, Jan 26, 2009 at 11:51:27PM -0800, David Rientjes (rientjes@google.com) wrote:
> > Yeah, I proposed /dev/mem_notify being made as a client of cgroups there 
> > in http://marc.info/?l=linux-kernel&m=123200623628685
> > 
> > How do you replace the oom killer's capability of giving a killed task 
> > access to memory reserves with TIF_MEMDIE in userspace?
> 
> /dev/mem_notify is a great idea, but please do not limit existing
> oom-killer in its ability to do the job and do not rely on application's
> ability to send a SIGKILL which will not kill tasks in unkillable state
> contrary to oom-killer.

I'd like to respect your requiremnt. but I also would like to know
why you like deterministic hierarchy oom than notification.

I think one of problem is, current patch description is a bit poor
and don't describe from administrator view.

Could you please sort the discssion out and explain your requirement detail?
otherwise (I guess) this discussion don't reach people agreement.

I don't like the implementation idea vs another idea discussion.
it often don't make productive discussion.
I'd like to sort out people requrement.
otherwise I can't review the patch fill requirement or not.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC] [PATCH] Cgroup based OOM killer controller
  2009-01-27 10:40               ` KOSAKI Motohiro
@ 2009-01-27 13:45                 ` Evgeniy Polyakov
  2009-01-27 15:40                   ` Balbir Singh
  2009-01-27 20:41                   ` David Rientjes
  0 siblings, 2 replies; 63+ messages in thread
From: Evgeniy Polyakov @ 2009-01-27 13:45 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: David Rientjes, Alan Cox, balbir, Nikanth Karthikesan, containers,
	linux-kernel, Torvalds, Arve Hj?nnev?g, Andrew Morton,
	Chris Snook, Paul Menage

Hi.

On Tue, Jan 27, 2009 at 07:40:58PM +0900, KOSAKI Motohiro (kosaki.motohiro@jp.fujitsu.com) wrote:
> I'd like to respect your requiremnt. but I also would like to know
> why you like deterministic hierarchy oom than notification.
> 
> I think one of problem is, current patch description is a bit poor
> and don't describe from administrator view.

Notification of the memory state is by no means a great idea.
Any process which cares about the system state can register and make
some decisions based on the memory state. But if it fails to update to
the current situation, the main oom-killer has to enter the scene and
make a progress on the system behaviour.

As I wrote multiple times there may be a quite trivial situation, when
process will not be able to make progress (it will not be able to free
some data even if its memory notification callback is invoked in some
cases), so we just can not rely on that. After all there may be no
processes with given notifications registered, so we should be able to
tune main oom-killer, which is another story compared to the
/dev/mem_notify discussion.

Having some special application which will monitor /dev/mem_notify and
kill processes based on its own hueristics is a good idea, but when it
fails to do its work (or does not exist) system has to have ability to
make a progress and invoke a main oom-killer.

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC] [PATCH] Cgroup based OOM killer controller
  2009-01-27 13:45                 ` Evgeniy Polyakov
@ 2009-01-27 15:40                   ` Balbir Singh
  2009-01-27 21:54                     ` Evgeniy Polyakov
  2009-01-27 20:41                   ` David Rientjes
  1 sibling, 1 reply; 63+ messages in thread
From: Balbir Singh @ 2009-01-27 15:40 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: KOSAKI Motohiro, Nikanth Karthikesan, containers, linux-kernel,
	Arve Hj?nnev?g, David Rientjes, Andrew Morton, Chris Snook,
	Torvalds, Paul Menage, Alan Cox

* Evgeniy Polyakov <zbr@ioremap.net> [2009-01-27 16:45:59]:

> Hi.
> 
> On Tue, Jan 27, 2009 at 07:40:58PM +0900, KOSAKI Motohiro (kosaki.motohiro@jp.fujitsu.com) wrote:
> > I'd like to respect your requiremnt. but I also would like to know
> > why you like deterministic hierarchy oom than notification.
> > 
> > I think one of problem is, current patch description is a bit poor
> > and don't describe from administrator view.
> 
> Notification of the memory state is by no means a great idea.
> Any process which cares about the system state can register and make
> some decisions based on the memory state. But if it fails to update to
> the current situation, the main oom-killer has to enter the scene and
> make a progress on the system behaviour.
> 
> As I wrote multiple times there may be a quite trivial situation, when
> process will not be able to make progress (it will not be able to free
> some data even if its memory notification callback is invoked in some
> cases), so we just can not rely on that. After all there may be no
> processes with given notifications registered, so we should be able to
> tune main oom-killer, which is another story compared to the
> /dev/mem_notify discussion.
> 
> Having some special application which will monitor /dev/mem_notify and
> kill processes based on its own hueristics is a good idea, but when it
> fails to do its work (or does not exist) system has to have ability to
> make a progress and invoke a main oom-killer.
>

The last part is what we've discussed in the mini-summit. There should
be OOM kill notification to user space and if that fails let the kernel
invoke the OOM killer. The exact interface for notification is not
interesting, one could use netlink if that works well.

-- 
	Balbir

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC] [PATCH] Cgroup based OOM killer controller
  2009-01-27 15:40                   ` Balbir Singh
@ 2009-01-27 21:54                     ` Evgeniy Polyakov
  0 siblings, 0 replies; 63+ messages in thread
From: Evgeniy Polyakov @ 2009-01-27 21:54 UTC (permalink / raw)
  To: Balbir Singh
  Cc: KOSAKI Motohiro, Nikanth Karthikesan, containers, linux-kernel,
	Arve Hj?nnev?g, David Rientjes, Andrew Morton, Chris Snook,
	Torvalds, Paul Menage, Alan Cox

On Tue, Jan 27, 2009 at 09:10:53PM +0530, Balbir Singh (balbir@linux.vnet.ibm.com) wrote:
> > Having some special application which will monitor /dev/mem_notify and
> > kill processes based on its own hueristics is a good idea, but when it
> > fails to do its work (or does not exist) system has to have ability to
> > make a progress and invoke a main oom-killer.
> 
> The last part is what we've discussed in the mini-summit. There should
> be OOM kill notification to user space and if that fails let the kernel
> invoke the OOM killer. The exact interface for notification is not
> interesting, one could use netlink if that works well.

Yes, that's exactly what I would like to see.
Btw, netlink will not be a good idea, since it requires additional (and
quite big) allocation. I believe just reading the char device (or maybe
having a syscall) is enough, but its up to the implementation.

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC] [PATCH] Cgroup based OOM killer controller
  2009-01-27 13:45                 ` Evgeniy Polyakov
  2009-01-27 15:40                   ` Balbir Singh
@ 2009-01-27 20:41                   ` David Rientjes
  2009-01-27 21:55                     ` Evgeniy Polyakov
  1 sibling, 1 reply; 63+ messages in thread
From: David Rientjes @ 2009-01-27 20:41 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: KOSAKI Motohiro, Alan Cox, balbir, Nikanth Karthikesan,
	containers, linux-kernel, Torvalds, Arve Hj?nnev?g, Andrew Morton,
	Chris Snook, Paul Menage

On Tue, 27 Jan 2009, Evgeniy Polyakov wrote:

> Having some special application which will monitor /dev/mem_notify and
> kill processes based on its own hueristics is a good idea, but when it
> fails to do its work (or does not exist) system has to have ability to
> make a progress and invoke a main oom-killer.
> 

Agreed, very similiar to the cgroup oom notifier patch that invokes the 
oom killer if there are no attached tasks waiting to handle the situation. 

In this case, it would be a configurable delay to allow userspace to act 
in response to oom conditions before invoking the kernel oom killer.  So 
instead of thinking of this as a userspace replacement for the oom killer, 
it simply preempts it if userspace can provide more memory, including the 
possibility of killing tasks itself.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC] [PATCH] Cgroup based OOM killer controller
  2009-01-27 20:41                   ` David Rientjes
@ 2009-01-27 21:55                     ` Evgeniy Polyakov
  0 siblings, 0 replies; 63+ messages in thread
From: Evgeniy Polyakov @ 2009-01-27 21:55 UTC (permalink / raw)
  To: David Rientjes
  Cc: KOSAKI Motohiro, Alan Cox, balbir, Nikanth Karthikesan,
	containers, linux-kernel, Torvalds, Arve Hj?nnev?g, Andrew Morton,
	Chris Snook, Paul Menage

On Tue, Jan 27, 2009 at 12:41:03PM -0800, David Rientjes (rientjes@google.com) wrote:
> > Having some special application which will monitor /dev/mem_notify and
> > kill processes based on its own hueristics is a good idea, but when it
> > fails to do its work (or does not exist) system has to have ability to
> > make a progress and invoke a main oom-killer.
> 
> Agreed, very similiar to the cgroup oom notifier patch that invokes the 
> oom killer if there are no attached tasks waiting to handle the situation. 
> 
> In this case, it would be a configurable delay to allow userspace to act 
> in response to oom conditions before invoking the kernel oom killer.  So 
> instead of thinking of this as a userspace replacement for the oom killer, 
> it simply preempts it if userspace can provide more memory, including the 
> possibility of killing tasks itself.

How different may look idea expressed by the different phrases and cold
heads :)

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 63+ messages in thread

end of thread, other threads:[~2009-01-29 15:51 UTC | newest]

Thread overview: 63+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2009-01-21 11:08 [RFC] [PATCH] Cgroup based OOM killer controller Nikanth Karthikesan
2009-01-21 13:17 ` Evgeniy Polyakov
2009-01-21 15:24   ` Nikanth Karthikesan
2009-01-21 20:49     ` David Rientjes
2009-01-22  2:53       ` KAMEZAWA Hiroyuki
2009-01-22  5:12         ` Nikanth Karthikesan
2009-01-22  5:12       ` Nikanth Karthikesan
2009-01-22  8:43         ` David Rientjes
2009-01-22  9:23           ` Nikanth Karthikesan
2009-01-22  9:39             ` David Rientjes
2009-01-22 10:10               ` Nikanth Karthikesan
2009-01-22 10:18                 ` David Rientjes
2009-01-22  9:50           ` Evgeniy Polyakov
2009-01-22 10:00             ` David Rientjes
2009-01-22 10:14               ` Evgeniy Polyakov
2009-01-22 10:27                 ` David Rientjes
2009-01-22 13:21                   ` Evgeniy Polyakov
2009-01-22 20:28                     ` David Rientjes
2009-01-22 21:06                       ` Evgeniy Polyakov
2009-01-22 21:35                         ` David Rientjes
2009-01-22 22:04                           ` Evgeniy Polyakov
2009-01-22 22:28                             ` David Rientjes
2009-01-22 22:53                               ` Evgeniy Polyakov
2009-01-22 23:25                                 ` Evgeniy Polyakov
2009-01-27 23:55                     ` Paul Menage
2009-01-23  9:45                   ` Nikanth Karthikesan
2009-01-23 10:33                     ` David Rientjes
2009-01-23 14:56                       ` Nikanth Karthikesan
2009-01-23 20:44                         ` David Rientjes
2009-01-27 10:20                           ` Nikanth Karthikesan
2009-01-27 10:53                             ` David Rientjes
2009-01-27 11:08                               ` Nikanth Karthikesan
2009-01-27 11:21                                 ` David Rientjes
2009-01-27 11:37                                   ` Nikanth Karthikesan
2009-01-27 20:29                                     ` David Rientjes
2009-01-28  1:00                         ` Paul Menage
2009-01-29 15:48                           ` Nikanth Karthikesan
2009-01-22  3:28 ` KAMEZAWA Hiroyuki
2009-01-22  5:13   ` Nikanth Karthikesan
2009-01-22  5:27     ` KAMEZAWA Hiroyuki
2009-01-22  6:11       ` Nikanth Karthikesan
2009-01-22  5:39     ` Arve Hjønnevåg
2009-01-22  6:12       ` Nikanth Karthikesan
2009-01-22  6:29         ` Arve Hjønnevåg
2009-01-22  6:42           ` Nikanth Karthikesan
2009-01-26 19:54 ` Balbir Singh
2009-01-26 19:56   ` Alan Cox
2009-01-27  7:02     ` KOSAKI Motohiro
2009-01-27  7:26       ` Balbir Singh
2009-01-27  7:39       ` David Rientjes
2009-01-27  7:44         ` KOSAKI Motohiro
2009-01-27  7:51           ` David Rientjes
2009-01-27  9:31             ` Evgeniy Polyakov
2009-01-27  9:37               ` David Rientjes
2009-01-27 13:40                 ` Evgeniy Polyakov
2009-01-27 20:37                   ` David Rientjes
2009-01-27 21:51                     ` Evgeniy Polyakov
2009-01-27 10:40               ` KOSAKI Motohiro
2009-01-27 13:45                 ` Evgeniy Polyakov
2009-01-27 15:40                   ` Balbir Singh
2009-01-27 21:54                     ` Evgeniy Polyakov
2009-01-27 20:41                   ` David Rientjes
2009-01-27 21:55                     ` Evgeniy Polyakov

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox