* [patch] memcg: add oom killer delay @ 2010-12-22 7:27 David Rientjes 2010-12-22 7:59 ` Andrew Morton 2010-12-25 10:47 ` [patch] " Balbir Singh 0 siblings, 2 replies; 22+ messages in thread From: David Rientjes @ 2010-12-22 7:27 UTC (permalink / raw) To: Andrew Morton Cc: Balbir Singh, Daisuke Nishimura, KAMEZAWA Hiroyuki, Divyesh Shah, linux-mm Completely disabling the oom killer for a memcg is problematic if userspace is unable to address the condition itself, usually because userspace is unresponsive. This scenario creates a memcg livelock: tasks are continuously trying to allocate memory and nothing is getting killed, so memory freeing is impossible since reclaim has failed, and all work stalls with no remedy in sight. This patch adds an oom killer delay so that a memcg may be configured to wait at least a pre-defined number of milliseconds before calling the oom killer. If the oom condition persists for this number of milliseconds, the oom killer will be called the next time the memory controller attempts to charge a page (and memory.oom_control is set to 0). This allows userspace to have a short period of time to respond to the condition before timing out and deferring to the kernel to kill a task. Admins may set the oom killer timeout using the new interface: # echo 60000 > memory.oom_delay This will defer oom killing to the kernel only after 60 seconds has elapsed. When setting memory.oom_delay, all pending timeouts are restarted. Signed-off-by: David Rientjes <rientjes@google.com> --- Documentation/cgroups/memory.txt | 15 ++++++++++ mm/memcontrol.c | 56 +++++++++++++++++++++++++++++++++---- 2 files changed, 65 insertions(+), 6 deletions(-) diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt --- a/Documentation/cgroups/memory.txt +++ b/Documentation/cgroups/memory.txt @@ -68,6 +68,7 @@ Brief summary of control files. (See sysctl's vm.swappiness) memory.move_charge_at_immigrate # set/show controls of moving charges memory.oom_control # set/show oom controls. + memory.oom_delay # set/show millisecs to wait before oom kill 1. History @@ -640,6 +641,20 @@ At reading, current status of OOM is shown. under_oom 0 or 1 (if 1, the memory cgroup is under OOM, tasks may be stopped.) +It is also possible to configure an oom killer timeout to prevent the +possibility that the memcg will livelock looking for memory if userspace +has disabled the oom killer with oom_control but cannot act to fix the +condition itself (usually because userspace has become unresponsive). + +To set an oom killer timeout for a memcg, write the number of milliseconds +to wait before killing a task to memory.oom_delay: + + # echo 60000 > memory.oom_delay # wait 60 seconds, then oom kill + +This timeout is reset the next time the memcg successfully charges memory +to a task. + + 11. TODO 1. Add support for accounting huge pages (as a separate controller) diff --git a/mm/memcontrol.c b/mm/memcontrol.c --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -233,12 +233,16 @@ struct mem_cgroup { * Should the accounting and control be hierarchical, per subtree? */ bool use_hierarchy; + /* oom_delay has expired and still out of memory? 
*/ + bool oom_kill_delay_expired; atomic_t oom_lock; atomic_t refcnt; unsigned int swappiness; /* OOM-Killer disable */ int oom_kill_disable; + /* min number of ticks to wait before calling oom killer */ + int oom_kill_delay; /* set when res.limit == memsw.limit */ bool memsw_is_minimum; @@ -1524,6 +1528,7 @@ static void memcg_wakeup_oom(struct mem_cgroup *mem) static void memcg_oom_recover(struct mem_cgroup *mem) { + mem->oom_kill_delay_expired = false; if (mem && atomic_read(&mem->oom_lock)) memcg_wakeup_oom(mem); } @@ -1531,17 +1536,18 @@ static void memcg_oom_recover(struct mem_cgroup *mem) /* * try to call OOM killer. returns false if we should exit memory-reclaim loop. */ -bool mem_cgroup_handle_oom(struct mem_cgroup *mem, gfp_t mask) +static bool mem_cgroup_handle_oom(struct mem_cgroup *mem, gfp_t mask) { struct oom_wait_info owait; - bool locked, need_to_kill; + bool locked; + bool need_to_kill = true; + bool need_to_delay = false; owait.mem = mem; owait.wait.flags = 0; owait.wait.func = memcg_oom_wake_function; owait.wait.private = current; INIT_LIST_HEAD(&owait.wait.task_list); - need_to_kill = true; /* At first, try to OOM lock hierarchy under mem.*/ mutex_lock(&memcg_oom_mutex); locked = mem_cgroup_oom_lock(mem); @@ -1553,26 +1559,34 @@ bool mem_cgroup_handle_oom(struct mem_cgroup *mem, gfp_t mask) prepare_to_wait(&memcg_oom_waitq, &owait.wait, TASK_KILLABLE); if (!locked || mem->oom_kill_disable) need_to_kill = false; - if (locked) + if (locked) { mem_cgroup_oom_notify(mem); + if (mem->oom_kill_delay && !mem->oom_kill_delay_expired) { + need_to_kill = false; + need_to_delay = true; + } + } mutex_unlock(&memcg_oom_mutex); if (need_to_kill) { finish_wait(&memcg_oom_waitq, &owait.wait); mem_cgroup_out_of_memory(mem, mask); } else { - schedule(); + schedule_timeout(need_to_delay ? 
mem->oom_kill_delay : + MAX_SCHEDULE_TIMEOUT); finish_wait(&memcg_oom_waitq, &owait.wait); } mutex_lock(&memcg_oom_mutex); mem_cgroup_oom_unlock(mem); memcg_wakeup_oom(mem); + mem->oom_kill_delay_expired = need_to_delay; mutex_unlock(&memcg_oom_mutex); if (test_thread_flag(TIF_MEMDIE) || fatal_signal_pending(current)) return false; /* Give chance to dying process */ - schedule_timeout(1); + if (!need_to_delay) + schedule_timeout(1); return true; } @@ -2007,6 +2021,7 @@ again: refill_stock(mem, csize - PAGE_SIZE); css_put(&mem->css); done: + mem->oom_kill_delay_expired = false; *memcg = mem; return 0; nomem: @@ -4053,6 +4068,29 @@ static int mem_cgroup_oom_control_write(struct cgroup *cgrp, return 0; } +static u64 mem_cgroup_oom_delay_read(struct cgroup *cgrp, struct cftype *cft) +{ + struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp); + + return jiffies_to_msecs(memcg->oom_kill_delay); +} + +static int mem_cgroup_oom_delay_write(struct cgroup *cgrp, struct cftype *cft, + u64 val) +{ + struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp); + + /* Sanity check -- don't wait longer than an hour */ + if (val > (60 * 60 * 1000)) + return -EINVAL; + + cgroup_lock(); + memcg->oom_kill_delay = msecs_to_jiffies(val); + memcg_oom_recover(memcg); + cgroup_unlock(); + return 0; +} + static struct cftype mem_cgroup_files[] = { { .name = "usage_in_bytes", @@ -4116,6 +4154,11 @@ static struct cftype mem_cgroup_files[] = { .unregister_event = mem_cgroup_oom_unregister_event, .private = MEMFILE_PRIVATE(_OOM_TYPE, OOM_CONTROL), }, + { + .name = "oom_delay", + .read_u64 = mem_cgroup_oom_delay_read, + .write_u64 = mem_cgroup_oom_delay_write, + }, }; #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP @@ -4357,6 +4400,7 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont) parent = mem_cgroup_from_cont(cont->parent); mem->use_hierarchy = parent->use_hierarchy; mem->oom_kill_disable = parent->oom_kill_disable; + mem->oom_kill_delay = parent->oom_kill_delay; } if (parent && parent->use_hierarchy) {
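As a usage sketch of the interface above (the /cgroup mount point and the "jobs" group name are illustrative assumptions; memory.oom_control is left at its default of 0 so that the kernel may still kill once the delay expires):

  # mount -t cgroup -o memory none /cgroup
  # mkdir /cgroup/jobs
  # echo $$ > /cgroup/jobs/tasks
  # echo 100M > /cgroup/jobs/memory.limit_in_bytes
  # echo 60000 > /cgroup/jobs/memory.oom_delay

A userspace handler then has up to 60 seconds after each oom to raise the limit, free memory, or kill a task itself before the kernel oom killer steps in on the next failed charge.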
* Re: [patch] memcg: add oom killer delay From: Andrew Morton @ 2010-12-22 7:59 UTC To: David Rientjes Cc: Balbir Singh, Daisuke Nishimura, KAMEZAWA Hiroyuki, Divyesh Shah, linux-mm On Tue, 21 Dec 2010 23:27:25 -0800 (PST) David Rientjes <rientjes@google.com> wrote: > Completely disabling the oom killer for a memcg is problematic if > userspace is unable to address the condition itself, usually because > userspace is unresponsive. This scenario creates a memcg livelock: > tasks are continuously trying to allocate memory and nothing is getting > killed, so memory freeing is impossible since reclaim has failed, and > all work stalls with no remedy in sight. Userspace was buggy, surely. If userspace has elected to disable the oom-killer then it should ensure that it can cope with the ensuing result. One approach might be to run a mlockall()ed watchdog which monitors the worker tasks via shared memory. Another approach would be to run that watchdog in a different memcg, without mlockall(). There are surely plenty of other ways of doing it. > This patch adds an oom killer delay so that a memcg may be configured to > wait at least a pre-defined number of milliseconds before calling the > oom killer. If the oom condition persists for this number of > milliseconds, the oom killer will be called the next time the memory > controller attempts to charge a page (and memory.oom_control is set to > 0). This allows userspace to have a short period of time to respond to > the condition before timing out and deferring to the kernel to kill a > task. > > Admins may set the oom killer timeout using the new interface: > > # echo 60000 > memory.oom_delay > > This will defer oom killing to the kernel only after 60 seconds has > elapsed. When setting memory.oom_delay, all pending timeouts are > restarted. > eww, ick ick ick. Minutiae: - changelog and docs forgot to mention that oom_delay=0 disables. - it's called oom_kill_delay in the kernel and oom_delay in userspace. - oom_delay_millisecs would be a better name for the pseudo file. - Also, ick.
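A minimal sketch of the first watchdog Andrew describes; the shared-memory path, the heartbeat layout, and the recovery actions are all assumptions for illustration. Note that even a locked watchdog must avoid making new allocations that charge against the wedged memcg (system() below forks, for instance), which is what makes the separate-memcg variant the more robust option:

  #include <fcntl.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <time.h>
  #include <unistd.h>
  #include <sys/mman.h>

  struct heartbeat { volatile time_t last_alive; };  /* workers update this */

  int main(void)
  {
          struct heartbeat *hb;
          int fd;

          /* Pin current and future mappings so the watchdog itself never
             waits on reclaim while the memcg is out of memory. */
          if (mlockall(MCL_CURRENT | MCL_FUTURE)) {
                  perror("mlockall");
                  return 1;
          }
          fd = open("/dev/shm/worker-heartbeat", O_RDWR);  /* hypothetical */
          if (fd < 0) {
                  perror("open");
                  return 1;
          }
          hb = mmap(NULL, sizeof(*hb), PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
          if (hb == MAP_FAILED) {
                  perror("mmap");
                  return 1;
          }
          for (;;) {
                  sleep(10);
                  if (time(NULL) - hb->last_alive > 60) {
                          /* Workers look wedged on an oom stall: hand the
                             decision back to the kernel, then resize the
                             limit so sleeping tasks retry their charge. */
                          system("echo 0 > /cgroup/jobs/memory.oom_control");
                          system("echo 110M > /cgroup/jobs/memory.limit_in_bytes");
                  }
          }
  }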
* Re: [patch] memcg: add oom killer delay From: KAMEZAWA Hiroyuki @ 2010-12-22 8:17 UTC To: Andrew Morton Cc: David Rientjes, Balbir Singh, Daisuke Nishimura, Divyesh Shah, linux-mm On Tue, 21 Dec 2010 23:59:24 -0800 Andrew Morton <akpm@linux-foundation.org> wrote: > On Tue, 21 Dec 2010 23:27:25 -0800 (PST) David Rientjes <rientjes@google.com> wrote: > > > Completely disabling the oom killer for a memcg is problematic if > > userspace is unable to address the condition itself, usually because > > userspace is unresponsive. This scenario creates a memcg livelock: > > tasks are continuously trying to allocate memory and nothing is getting > > killed, so memory freeing is impossible since reclaim has failed, and > > all work stalls with no remedy in sight. > > Userspace was buggy, surely. If userspace has elected to disable the > oom-killer then it should ensure that it can cope with the ensuing result. > > One approach might be to run a mlockall()ed watchdog which monitors the > worker tasks via shared memory. Another approach would be to run that > watchdog in a different memcg, without mlockall(). There are surely > plenty of other ways of doing it. > > > This patch adds an oom killer delay so that a memcg may be configured to > > wait at least a pre-defined number of milliseconds before calling the > > oom killer. If the oom condition persists for this number of > > milliseconds, the oom killer will be called the next time the memory > > controller attempts to charge a page (and memory.oom_control is set to > > 0). This allows userspace to have a short period of time to respond to > > the condition before timing out and deferring to the kernel to kill a > > task. > > > > Admins may set the oom killer timeout using the new interface: > > > > # echo 60000 > memory.oom_delay > > > > This will defer oom killing to the kernel only after 60 seconds has > > elapsed. When setting memory.oom_delay, all pending timeouts are > > restarted. > > > > eww, ick ick ick. > > > Minutea: > > - changelog and docs forgot to mention that oom_delay=0 disables. > > - it's called oom_kill_delay in the kernel and oom_delay in userspace. > > - oom_delay_millisecs would be a better name for the pseudo file. > > - Also, ick. > seems to be hard to use. No one can estimate "milliseconds" for avoiding an OOM kill. I think this is very bad. Nack to this feature itself. If you want something smart _in kernel_, please implement the following. - When hit oom, enlarge the limit to some extent. - All processes in the cgroup should be stopped. - A helper application will be called by usermode_helper(). - When the helper application exits, automatically release all processes to run again. Then, you can avoid the oom-kill situation automatically with the kernel's help. BTW, don't call cgroup_lock(). It's always dangerous. You can add your own lock. Thanks, -Kame
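A rough kernel-side sketch of the helper step in that list, built on the existing call_usermodehelper() API from <linux/kmod.h>; the helper path is hypothetical, and the stop/release steps around it are left out:

  #include <linux/kmod.h>

  static int memcg_run_oom_helper(struct mem_cgroup *mem)
  {
          /* hypothetical helper; a real version would pass the cgroup path */
          char *argv[] = { "/sbin/memcg_oom_helper", NULL };
          char *envp[] = { "HOME=/", "PATH=/sbin:/usr/sbin:/bin:/usr/bin", NULL };

          /* UMH_WAIT_PROC blocks until the helper exits, matching the
             "release all processes when the helper exits" step above */
          return call_usermodehelper(argv[0], argv, envp, UMH_WAIT_PROC);
  }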
* Re: [patch] memcg: add oom killer delay From: KOSAKI Motohiro @ 2010-12-22 8:31 UTC To: KAMEZAWA Hiroyuki Cc: kosaki.motohiro, Andrew Morton, David Rientjes, Balbir Singh, Daisuke Nishimura, Divyesh Shah, linux-mm > seems to be hard to use. No one can estimate "milisecond" for avoidling > OOM-kill. I think this is very bad. Nack to this feature itself. > > > If you want something smart _in kernel_, please implement followings. > > - When hit oom, enlarge limit to some extent. > - All processes in cgroup should be stopped. > - A helper application will be called by usermode_helper(). > - When a helper application exit(), automatically release all processes > to run again. > > Then, you can avoid oom-kill situation in automatic with kernel's help. I bet a monitor using a different memcg is the simplest thing.
* Re: [patch] memcg: add oom killer delay 2010-12-22 8:17 ` KAMEZAWA Hiroyuki 2010-12-22 8:31 ` KOSAKI Motohiro @ 2010-12-22 8:48 ` David Rientjes 2010-12-22 8:48 ` KAMEZAWA Hiroyuki 1 sibling, 1 reply; 22+ messages in thread From: David Rientjes @ 2010-12-22 8:48 UTC (permalink / raw) To: KAMEZAWA Hiroyuki Cc: Andrew Morton, Balbir Singh, Daisuke Nishimura, Divyesh Shah, linux-mm On Wed, 22 Dec 2010, KAMEZAWA Hiroyuki wrote: > seems to be hard to use. No one can estimate "milisecond" for avoidling > OOM-kill. I think this is very bad. Nack to this feature itself. > There's no estimation that is really needed, we simply need to be able to stall long enough that we'll eventually kill "something" if userspace fails to act. > If you want something smart _in kernel_, please implement followings. > > - When hit oom, enlarge limit to some extent. > - All processes in cgroup should be stopped. > - A helper application will be called by usermode_helper(). > - When a helper application exit(), automatically release all processes > to run again. > Hmm, that's a _lot_ of policy to be implemented in the kernel itself and comes at the cost of either being faulty (if the limit cannot be increased) or harmful (when increasing the limit is detrimental to other memcg). -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom policy in Canada: sign http://dissolvethecrtc.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [patch] memcg: add oom killer delay From: KAMEZAWA Hiroyuki @ 2010-12-22 8:48 UTC To: David Rientjes Cc: Andrew Morton, Balbir Singh, Daisuke Nishimura, Divyesh Shah, linux-mm On Wed, 22 Dec 2010 00:48:53 -0800 (PST) David Rientjes <rientjes@google.com> wrote: > On Wed, 22 Dec 2010, KAMEZAWA Hiroyuki wrote: > > > seems to be hard to use. No one can estimate "milisecond" for avoidling > > OOM-kill. I think this is very bad. Nack to this feature itself. > > > > There's no estimation that is really needed, we simply need to be able to > stall long enough that we'll eventually kill "something" if userspace > fails to act. > Why do we have to think of usermode failures from misconfiguration or user-mode bugs? That is usually the work of middleware. Please make libcgroup or libvirt more useful. > > If you want something smart _in kernel_, please implement followings. > > > > - When hit oom, enlarge limit to some extent. > > - All processes in cgroup should be stopped. > > - A helper application will be called by usermode_helper(). > > - When a helper application exit(), automatically release all processes > > to run again. > > > > Hmm, that's a _lot_ of policy to be implemented in the kernel itself and > comes at the cost of either being faulty (if the limit cannot be > increased) or harmful (when increasing the limit is detrimental to other > memcg). > Or run the helper in the "root" cgroup, which has no limit at all. Thanks, -Kame
* Re: [patch] memcg: add oom killer delay From: KAMEZAWA Hiroyuki @ 2010-12-22 8:55 UTC To: KAMEZAWA Hiroyuki Cc: David Rientjes, Andrew Morton, Balbir Singh, Daisuke Nishimura, Divyesh Shah, linux-mm On Wed, 22 Dec 2010 17:48:29 +0900 KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote: > On Wed, 22 Dec 2010 00:48:53 -0800 (PST) > David Rientjes <rientjes@google.com> wrote: > > > On Wed, 22 Dec 2010, KAMEZAWA Hiroyuki wrote: > > > > > seems to be hard to use. No one can estimate "milisecond" for avoidling > > > OOM-kill. I think this is very bad. Nack to this feature itself. > > > > > > > There's no estimation that is really needed, we simply need to be able to > > stall long enough that we'll eventually kill "something" if userspace > > fails to act. > > > > Why we have to think of usermode failure by mis configuration or user mode bug ? > It's a work of Middleware in usual. For example, an oom_check_deadlockd can work as: 1. disable oom by memory.oom_disable=1 2. check memory.oom_notify and wait on it by poll() 3. At oom, it wakes up. 4. wait for 60 secs. 5. If the cgroup is still in OOM, set oom_disable=0 This daemon will not use much memory and can run in the /root memory cgroup. Thanks, -Kame
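A sketch of that daemon in C, assuming the eventfd-based notification registered through cgroup.event_control against memory.oom_control (which I take to be the oom_notify mechanism above); the /cgroup/jobs path is illustrative and error handling is elided:

  #include <fcntl.h>
  #include <stdint.h>
  #include <stdio.h>
  #include <string.h>
  #include <unistd.h>
  #include <sys/eventfd.h>

  int main(void)
  {
          char buf[32];
          uint64_t event;
          int efd = eventfd(0, 0);
          int ofd = open("/cgroup/jobs/memory.oom_control", O_RDWR);
          int cfd = open("/cgroup/jobs/cgroup.event_control", O_WRONLY);

          /* steps 1-2: oom killing is already disabled (oom_control = 1);
             register the eventfd so we are notified of oom */
          snprintf(buf, sizeof(buf), "%d %d", efd, ofd);
          write(cfd, buf, strlen(buf));

          for (;;) {
                  read(efd, &event, sizeof(event));  /* step 3: blocks until oom */
                  sleep(60);                         /* step 4 */
                  /* step 5: still under oom (re-reading under_oom from
                     memory.oom_control is elided here): let the kernel kill */
                  write(ofd, "0", 1);
          }
  }

As David points out in the next message, on this kernel the tasks already sleeping on the oom waitqueue are not woken merely by flipping oom_control back to 0, so a sixth step that frees memory, moves a task, or resizes the limit is still needed.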
* Re: [patch] memcg: add oom killer delay From: David Rientjes @ 2010-12-22 9:21 UTC To: KAMEZAWA Hiroyuki Cc: Andrew Morton, Balbir Singh, Daisuke Nishimura, Divyesh Shah, linux-mm On Wed, 22 Dec 2010, KAMEZAWA Hiroyuki wrote: > For example. oom_check_deadlockd can work as > > 1. disable oom by memory.oom_disable=1 > 2. check memory.oom_notify and wait it by poll() > 3. At oom, it wakes up. > 4. wait for 60 secs. > 5. If the cgroup is still in OOM, set oom_disalble=0 > > This daemon will not use much memory and can run in /roog memory cgroup. > Yes, this is almost the same as the "simple and perfect implementation" that I alluded to in my response to Andrew (and I think KOSAKI-san suggested something similar), although it doesn't quite work because all threads in the cgroup are sitting on the waitqueue and don't get woken up to see oom_control == 0 unless memory is freed, a task is moved, or the limit is resized, so this daemon will need to trigger that as step #6. That certainly works if it is indeed perfect and guaranteed to always be running. In the interest of a robust resource isolation model, I don't think we can ever make that conclusion, though, so this discussion is really only about how fault tolerant the kernel is because the end result is that if this daemon fails, the kernel livelocks. I'd personally prefer not to let a buggy or imperfect userspace allow the kernel to livelock; we control the kernel so I think it would be best to ensure that it cannot livelock no matter what userspace happens to do despite its best effort. If you or Andrew come to the conclusion that it's overkill and at the end of the day we have to trust userspace, I really can't argue with that philosophy though :)
* Re: [patch] memcg: add oom killer delay From: KAMEZAWA Hiroyuki @ 2010-12-27 1:47 UTC To: David Rientjes Cc: Andrew Morton, Balbir Singh, Daisuke Nishimura, Divyesh Shah, linux-mm On Wed, 22 Dec 2010 01:21:01 -0800 (PST) David Rientjes <rientjes@google.com> wrote: > On Wed, 22 Dec 2010, KAMEZAWA Hiroyuki wrote: > > > For example. oom_check_deadlockd can work as > > > > 1. disable oom by memory.oom_disable=1 > > 2. check memory.oom_notify and wait it by poll() > > 3. At oom, it wakes up. > > 4. wait for 60 secs. > > 5. If the cgroup is still in OOM, set oom_disalble=0 > > > > This daemon will not use much memory and can run in /roog memory cgroup. > > > > Yes, this is almost the same as the "simple and perfect implementation" > that I eluded to in my response to Andrew (and I think KOSAKI-san > suggested something similiar), although it doesn't quite work because all > threads in the cgroup are sitting on the waitqueue and don't get woken up > to see oom_control == 0 unless memory is freed, a task is moved, or the > limit is resized so this daemon will need to trigger that as step #6. > > That certainly works if it is indeed perfect and guaranteed to always be > running. In the interest of a robust resource isolation model, I don't > think we can ever make that conclusion, though, so this discussion is > really only about how fault tolerant the kernel is because the end result > is if this daemon fails, the kernel livelocks. > > I'd personally prefer not to allow a buggy or imperfect userspace to allow > the kernel to livelock; we control the kernel so I think it would be best > to ensure that it cannot livelock no matter what userspace happens to do > despite its best effort. If you or Andrew come to the conclusion that > it's overkill and at the end of the day we have to trust userspace, I > really can't argue that philosophy though :) > It's not a livelock. A user can create a new thread in a cgroup not under OOM. IMHO, oom-kill itself is totally of no use, and panic_on_oom or stopping the system is always good. We can do cluster keep-alive. If I were you, I'd add a "pool of memory for emergency" cgroup and run watchdog tasks in it. Thanks, -Kame
* Re: [patch] memcg: add oom killer delay 2010-12-22 8:48 ` KAMEZAWA Hiroyuki 2010-12-22 8:55 ` KAMEZAWA Hiroyuki @ 2010-12-22 9:04 ` David Rientjes 1 sibling, 0 replies; 22+ messages in thread From: David Rientjes @ 2010-12-22 9:04 UTC (permalink / raw) To: KAMEZAWA Hiroyuki Cc: Andrew Morton, Balbir Singh, Daisuke Nishimura, Divyesh Shah, linux-mm On Wed, 22 Dec 2010, KAMEZAWA Hiroyuki wrote: > > > seems to be hard to use. No one can estimate "milisecond" for avoidling > > > OOM-kill. I think this is very bad. Nack to this feature itself. > > > > > > > There's no estimation that is really needed, we simply need to be able to > > stall long enough that we'll eventually kill "something" if userspace > > fails to act. > > > > Why we have to think of usermode failure by mis configuration or user mode bug ? > It's a work of Middleware in usual. > Please make libcgroup or libvirt more useful. > It's a general concern for users who wish to defer the kernel oom killer unless userspace chooses not to act or cannot act and the only way to do that without memory.oom_delay is to set all memcgs to have memory.oom_control of 1. memory.oom_control of 1 is equivalent to OOM_DISABLE for all attached tasks and if all tasks are assigned non-root memcg for resource isolation (and the sum of those memcgs' limits equals system RAM), we always get memcg oom kills instead of system wide oom kills. The difference in this case is that with the memcg oom kills, the kernel livelocks whereas the system wide oom kills would panic the machine since all eligible tasks are OOM_DISABLE, the equivalent of all memcgs having memory.oom_control of 1. Since the kernel has opened this possibility up by disabling oom killing without giving userspace any other chance of deferring the oom killer, we need a way to preserve the machine by having a fallback plan if userspace cannot act. The other possibility would be to panic if all memcgs have memory.oom_control of 1 and the sum of their limits equals the machine's memory capacity. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom policy in Canada: sign http://dissolvethecrtc.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [patch] memcg: add oom killer delay 2010-12-22 7:59 ` Andrew Morton 2010-12-22 8:17 ` KAMEZAWA Hiroyuki @ 2010-12-22 8:42 ` David Rientjes 2010-12-22 22:45 ` [patch v2] " David Rientjes 1 sibling, 1 reply; 22+ messages in thread From: David Rientjes @ 2010-12-22 8:42 UTC (permalink / raw) To: Andrew Morton Cc: Balbir Singh, Daisuke Nishimura, KAMEZAWA Hiroyuki, Divyesh Shah, linux-mm On Tue, 21 Dec 2010, Andrew Morton wrote: > > Completely disabling the oom killer for a memcg is problematic if > > userspace is unable to address the condition itself, usually because > > userspace is unresponsive. This scenario creates a memcg livelock: > > tasks are continuously trying to allocate memory and nothing is getting > > killed, so memory freeing is impossible since reclaim has failed, and > > all work stalls with no remedy in sight. > > Userspace was buggy, surely. If userspace has elected to disable the > oom-killer then it should ensure that it can cope with the ensuing result. > I think it would be argued that no such guarantee can ever be made. > One approach might be to run a mlockall()ed watchdog which monitors the > worker tasks via shared memory. Another approach would be to run that > watchdog in a different memcg, without mlockall(). There are surely > plenty of other ways of doing it. > Yeah, we considered a simple and perfect userspace implementation that would be as fault tolerant unless it ends up getting killed (not by the oom killer) or dies itself, but there was a concern that setting every memcg to have oom_control of 0 could render the entire kernel useless without the help of userspace and that is a bad policy. In our particular use case, we _always_ want to defer using the kernel oom killer unless userspace chooses not to act (because the limit is already high enough) or cannot act (because of a bug). The former is accomplished by setting memory.oom_control to 0 originally and then setting it to 1 for that particular memcg to allow the oom kill, but it is not possible for the latter. > Minutea: > > - changelog and docs forgot to mention that oom_delay=0 disables. > I thought it would be intuitive that an oom_delay of 0 would mean there was no delay :) > - it's called oom_kill_delay in the kernel and oom_delay in userspace. > Right, this was because of the symmetry to the oom_kill_disable naming in the struct itself. I'd be happy to change it if we're to go ahead in this direction. > - oom_delay_millisecs would be a better name for the pseudo file. > Agreed. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom policy in Canada: sign http://dissolvethecrtc.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 22+ messages in thread
* [patch v2] memcg: add oom killer delay 2010-12-22 8:42 ` David Rientjes @ 2010-12-22 22:45 ` David Rientjes 2010-12-27 0:52 ` KAMEZAWA Hiroyuki 0 siblings, 1 reply; 22+ messages in thread From: David Rientjes @ 2010-12-22 22:45 UTC (permalink / raw) To: Andrew Morton Cc: Balbir Singh, Daisuke Nishimura, KAMEZAWA Hiroyuki, Divyesh Shah, linux-mm Completely disabling the oom killer for a memcg is problematic if userspace is unable to address the condition itself, usually because userspace is unresponsive. This scenario creates a memcg livelock: tasks are continuously trying to allocate memory and nothing is getting killed, so memory freeing is impossible since reclaim has failed, and all work stalls with no remedy in sight. This patch adds an oom killer delay so that a memcg may be configured to wait at least a pre-defined number of milliseconds before calling the oom killer. If the oom condition persists for this number of milliseconds, the oom killer will be called the next time the memory controller attempts to charge a page (and memory.oom_control is set to 0). This allows userspace to have a short period of time to respond to the condition before timing out and deferring to the kernel to kill a task. Admins may set the oom killer timeout using the new interface: # echo 60000 > memory.oom_delay This will defer oom killing to the kernel only after 60 seconds has elapsed. When setting memory.oom_delay, all pending timeouts are restarted. Signed-off-by: David Rientjes <rientjes@google.com> --- v2 of the patch to address your suggestions -- if we _really_ want to leave the kernel open to the possibility of livelock as the result of a userspace bug, then this doesn't need to be merged. Otherwise, it would be nice to get this support for a more robust memory controller. Documentation/cgroups/memory.txt | 17 +++++++++++ mm/memcontrol.c | 55 +++++++++++++++++++++++++++++++++---- 2 files changed, 66 insertions(+), 6 deletions(-) diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt --- a/Documentation/cgroups/memory.txt +++ b/Documentation/cgroups/memory.txt @@ -68,6 +68,7 @@ Brief summary of control files. (See sysctl's vm.swappiness) memory.move_charge_at_immigrate # set/show controls of moving charges memory.oom_control # set/show oom controls. + memory.oom_delay_millisecs # set/show millisecs to wait before oom kill 1. History @@ -640,6 +641,22 @@ At reading, current status of OOM is shown. under_oom 0 or 1 (if 1, the memory cgroup is under OOM, tasks may be stopped.) +It is also possible to configure an oom killer timeout to prevent the +possibility that the memcg will livelock looking for memory if userspace +has disabled the oom killer with oom_control but cannot act to fix the +condition itself (usually because userspace has become unresponsive). + +To set an oom killer timeout for a memcg, write the number of milliseconds +to wait before killing a task to memory.oom_delay_millisecs: + + # echo 60000 > memory.oom_delay_millisecs # 60 seconds before kill + +This timeout is reset the next time the memcg successfully charges memory +to a task. + +There is no delay if memory.oom_delay_millisecs is set to 0 (default). + + 11. TODO 1. Add support for accounting huge pages (as a separate controller) diff --git a/mm/memcontrol.c b/mm/memcontrol.c --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -233,12 +233,16 @@ struct mem_cgroup { * Should the accounting and control be hierarchical, per subtree? 
*/ bool use_hierarchy; + /* oom_delay has expired and still out of memory? */ + bool oom_delay_expired; atomic_t oom_lock; atomic_t refcnt; unsigned int swappiness; /* OOM-Killer disable */ int oom_kill_disable; + /* number of ticks to stall before calling oom killer */ + int oom_delay; /* set when res.limit == memsw.limit */ bool memsw_is_minimum; @@ -1524,6 +1528,7 @@ static void memcg_wakeup_oom(struct mem_cgroup *mem) static void memcg_oom_recover(struct mem_cgroup *mem) { + mem->oom_delay_expired = false; if (mem && atomic_read(&mem->oom_lock)) memcg_wakeup_oom(mem); } @@ -1531,17 +1536,18 @@ static void memcg_oom_recover(struct mem_cgroup *mem) /* * try to call OOM killer. returns false if we should exit memory-reclaim loop. */ -bool mem_cgroup_handle_oom(struct mem_cgroup *mem, gfp_t mask) +static bool mem_cgroup_handle_oom(struct mem_cgroup *mem, gfp_t mask) { struct oom_wait_info owait; - bool locked, need_to_kill; + bool locked; + bool need_to_kill = true; + bool need_to_delay = false; owait.mem = mem; owait.wait.flags = 0; owait.wait.func = memcg_oom_wake_function; owait.wait.private = current; INIT_LIST_HEAD(&owait.wait.task_list); - need_to_kill = true; /* At first, try to OOM lock hierarchy under mem.*/ mutex_lock(&memcg_oom_mutex); locked = mem_cgroup_oom_lock(mem); @@ -1553,26 +1559,34 @@ bool mem_cgroup_handle_oom(struct mem_cgroup *mem, gfp_t mask) prepare_to_wait(&memcg_oom_waitq, &owait.wait, TASK_KILLABLE); if (!locked || mem->oom_kill_disable) need_to_kill = false; - if (locked) + if (locked) { mem_cgroup_oom_notify(mem); + if (mem->oom_delay && !mem->oom_delay_expired) { + need_to_kill = false; + need_to_delay = true; + } + } mutex_unlock(&memcg_oom_mutex); if (need_to_kill) { finish_wait(&memcg_oom_waitq, &owait.wait); mem_cgroup_out_of_memory(mem, mask); } else { - schedule(); + schedule_timeout(need_to_delay ? 
mem->oom_delay : + MAX_SCHEDULE_TIMEOUT); finish_wait(&memcg_oom_waitq, &owait.wait); } mutex_lock(&memcg_oom_mutex); mem_cgroup_oom_unlock(mem); memcg_wakeup_oom(mem); + mem->oom_delay_expired = need_to_delay; mutex_unlock(&memcg_oom_mutex); if (test_thread_flag(TIF_MEMDIE) || fatal_signal_pending(current)) return false; /* Give chance to dying process */ - schedule_timeout(1); + if (!need_to_delay) + schedule_timeout(1); return true; } @@ -2007,6 +2021,7 @@ again: refill_stock(mem, csize - PAGE_SIZE); css_put(&mem->css); done: + mem->oom_delay_expired = false; *memcg = mem; return 0; nomem: @@ -4053,6 +4068,28 @@ static int mem_cgroup_oom_control_write(struct cgroup *cgrp, return 0; } +static u64 mem_cgroup_oom_delay_millisecs_read(struct cgroup *cgrp, + struct cftype *cft) +{ + struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp); + + return jiffies_to_msecs(memcg->oom_delay); +} + +static int mem_cgroup_oom_delay_millisecs_write(struct cgroup *cgrp, + struct cftype *cft, u64 val) +{ + struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp); + + /* Sanity check -- don't wait longer than an hour */ + if (val > (60 * 60 * 1000)) + return -EINVAL; + + memcg->oom_delay = msecs_to_jiffies(val); + memcg_oom_recover(memcg); + return 0; +} + static struct cftype mem_cgroup_files[] = { { .name = "usage_in_bytes", @@ -4116,6 +4153,11 @@ static struct cftype mem_cgroup_files[] = { .unregister_event = mem_cgroup_oom_unregister_event, .private = MEMFILE_PRIVATE(_OOM_TYPE, OOM_CONTROL), }, + { + .name = "oom_delay_millisecs", + .read_u64 = mem_cgroup_oom_delay_millisecs_read, + .write_u64 = mem_cgroup_oom_delay_millisecs_write, + }, }; #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP @@ -4357,6 +4399,7 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont) parent = mem_cgroup_from_cont(cont->parent); mem->use_hierarchy = parent->use_hierarchy; mem->oom_kill_disable = parent->oom_kill_disable; + mem->oom_delay = parent->oom_delay; } if (parent && parent->use_hierarchy) { -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom policy in Canada: sign http://dissolvethecrtc.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [patch v2] memcg: add oom killer delay From: KAMEZAWA Hiroyuki @ 2010-12-27 0:52 UTC To: David Rientjes Cc: Andrew Morton, Balbir Singh, Daisuke Nishimura, Divyesh Shah, linux-mm On Wed, 22 Dec 2010 14:45:05 -0800 (PST) David Rientjes <rientjes@google.com> wrote: > Completely disabling the oom killer for a memcg is problematic if > userspace is unable to address the condition itself, usually because > userspace is unresponsive. This scenario creates a memcg livelock: > tasks are continuously trying to allocate memory and nothing is getting > killed, so memory freeing is impossible since reclaim has failed, and > all work stalls with no remedy in sight. > > This patch adds an oom killer delay so that a memcg may be configured to > wait at least a pre-defined number of milliseconds before calling the > oom killer. If the oom condition persists for this number of > milliseconds, the oom killer will be called the next time the memory > controller attempts to charge a page (and memory.oom_control is set to > 0). This allows userspace to have a short period of time to respond to > the condition before timing out and deferring to the kernel to kill a > task. > > Admins may set the oom killer timeout using the new interface: > > # echo 60000 > memory.oom_delay > > This will defer oom killing to the kernel only after 60 seconds has > elapsed. When setting memory.oom_delay, all pending timeouts are > restarted. > > Signed-off-by: David Rientjes <rientjes@google.com> I dislike this feature, but if someone other than Google wants this, I'll ack. Some comments below. > --- > v2 of the patch to address your suggestions -- if we _really_ want to > leave the kernel open to the possibility of livelock as the result of > a userspace bug, then this doesn't need to be merged. Otherwise, it > would be nice to get this support for a more robust memory controller. > > Documentation/cgroups/memory.txt | 17 +++++++++++ > mm/memcontrol.c | 55 +++++++++++++++++++++++++++++++++---- > 2 files changed, 66 insertions(+), 6 deletions(-) > > diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt > --- a/Documentation/cgroups/memory.txt > +++ b/Documentation/cgroups/memory.txt > @@ -68,6 +68,7 @@ Brief summary of control files. > (See sysctl's vm.swappiness) > memory.move_charge_at_immigrate # set/show controls of moving charges > memory.oom_control # set/show oom controls. > + memory.oom_delay_millisecs # set/show millisecs to wait before oom kill > > 1. History > > @@ -640,6 +641,22 @@ At reading, current status of OOM is shown. > under_oom 0 or 1 (if 1, the memory cgroup is under OOM, tasks may > be stopped.) > > +It is also possible to configure an oom killer timeout to prevent the > +possibility that the memcg will livelock looking for memory if userspace It's not livelock. It's just 'stop'. No CPU consumption at all even if oom is disabled. > +has disabled the oom killer with oom_control but cannot act to fix the > +condition itself (usually because userspace has become unresponsive). 
> + > +To set an oom killer timeout for a memcg, write the number of milliseconds > +to wait before killing a task to memory.oom_delay_millisecs: > + > + # echo 60000 > memory.oom_delay_millisecs # 60 seconds before kill > + I wonder whether this should be called oom_delay; you mention this feature as a 'timeout' a few times before here. I like 'timeout' rather than 'delay'. And from this document, these points are unclear: 1. what happens when it is used with oom_disable. 2. what kind of timer is this. Is it a one-shot timer? 3. how does it work with hierarchy? My suggestion for 1. is: Please return -EBUSY or similar if oom_disable=true and allow setting the timeout only when oom_disable=false. Using both of the two interfaces at the same time is too complex. > +This timeout is reset the next time the memcg successfully charges memory > +to a task. > + > +There is no delay if memory.oom_delay_millisecs is set to 0 (default). > + > + > 11. TODO > > 1. Add support for accounting huge pages (as a separate controller) > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -233,12 +233,16 @@ struct mem_cgroup { > * Should the accounting and control be hierarchical, per subtree? > */ > bool use_hierarchy; > + /* oom_delay has expired and still out of memory? */ > + bool oom_delay_expired; > atomic_t oom_lock; > atomic_t refcnt; > > unsigned int swappiness; > /* OOM-Killer disable */ > int oom_kill_disable; > + /* number of ticks to stall before calling oom killer */ > + int oom_delay; > > /* set when res.limit == memsw.limit */ > bool memsw_is_minimum; > @@ -1524,6 +1528,7 @@ static void memcg_wakeup_oom(struct mem_cgroup *mem) > > static void memcg_oom_recover(struct mem_cgroup *mem) > { > + mem->oom_delay_expired = false; > if (mem && atomic_read(&mem->oom_lock)) > memcg_wakeup_oom(mem); > } > @@ -1531,17 +1536,18 @@ static void memcg_oom_recover(struct mem_cgroup *mem) > /* > * try to call OOM killer. returns false if we should exit memory-reclaim loop. > */ > -bool mem_cgroup_handle_oom(struct mem_cgroup *mem, gfp_t mask) > +static bool mem_cgroup_handle_oom(struct mem_cgroup *mem, gfp_t mask) > { > struct oom_wait_info owait; > - bool locked, need_to_kill; > + bool locked; > + bool need_to_kill = true; > + bool need_to_delay = false; > > owait.mem = mem; > owait.wait.flags = 0; > owait.wait.func = memcg_oom_wake_function; > owait.wait.private = current; > INIT_LIST_HEAD(&owait.wait.task_list); > - need_to_kill = true; > /* At first, try to OOM lock hierarchy under mem.*/ > mutex_lock(&memcg_oom_mutex); > locked = mem_cgroup_oom_lock(mem); > @@ -1553,26 +1559,34 @@ bool mem_cgroup_handle_oom(struct mem_cgroup *mem, gfp_t mask) > prepare_to_wait(&memcg_oom_waitq, &owait.wait, TASK_KILLABLE); > if (!locked || mem->oom_kill_disable) > need_to_kill = false; > - if (locked) > + if (locked) { > mem_cgroup_oom_notify(mem); > + if (mem->oom_delay && !mem->oom_delay_expired) { > + need_to_kill = false; > + need_to_delay = true; > + } > + } Hmm. When threads T1 and T2 enter this routine, it seems broken.

Case 1)
	T1				T2
	lock_oom.
	locked=true			lock_oom.
	oom_notify()			locked = false.
	wait for msecs.			wait until wakeup.
	......
	unlock_oom.
	wakes up.
	wake up all threads.
	oom_delay_expired=true.		wakes up.
					oom_delay_expired=false.

					2nd call of oom.
					lock_oom.
					locked=true.
					oom_notify.
					wait for msecs.

Then, oom_notify is duplicated and no OOM happens. memcg_wakeup_oom() wakes up all threads. So, I guess you should avoid calling that. But hmm...I think there are other pitfalls. 
Assume a hierarchy like this:

	     A
	    / \
	   B   C

A.memory.use_hierarchy = 1 and (A, B, C) are under hierarchical control. At first, oom_disable is allowed to be set only against A. By setting oom_disable to A, OOM in B and C is disabled, too. For that purpose, mem_cgroup_oom_lock/unlock is provided. With your patch, even if oom_delay is set to A, B and C will never delay. Please fix. > mutex_unlock(&memcg_oom_mutex); > > if (need_to_kill) { > finish_wait(&memcg_oom_waitq, &owait.wait); > mem_cgroup_out_of_memory(mem, mask); > } else { > - schedule(); > + schedule_timeout(need_to_delay ? mem->oom_delay : > + MAX_SCHEDULE_TIMEOUT); > finish_wait(&memcg_oom_waitq, &owait.wait); > } > mutex_lock(&memcg_oom_mutex); > mem_cgroup_oom_unlock(mem); > memcg_wakeup_oom(mem); > + mem->oom_delay_expired = need_to_delay; If someone charges successfully, this oom_delay_expired will turn back to false. I think this is not good....the race is complicated. If I were you, I'd add a function like memcg_oom_recover() to update the status. (You need something like that for supporting hierarchy, I think.) > mutex_unlock(&memcg_oom_mutex); > > if (test_thread_flag(TIF_MEMDIE) || fatal_signal_pending(current)) > return false; > /* Give chance to dying process */ > - schedule_timeout(1); > + if (!need_to_delay) > + schedule_timeout(1); This is unnecessary. After mem_cgroup_oom_unlock(), memcg_wakeup_oom() is called and all threads wake up. I think all threads other than the TIF_MEMDIE one should sleep for a while. > return true; > } > > @@ -2007,6 +2021,7 @@ again: > refill_stock(mem, csize - PAGE_SIZE); > css_put(&mem->css); > done: > + mem->oom_delay_expired = false; Don't add this to the 'fast path'. This is too bad. Have a good new year. Thanks, -Kame
* Re: [patch v2] memcg: add oom killer delay From: David Rientjes @ 2010-12-28 5:22 UTC To: KAMEZAWA Hiroyuki Cc: Andrew Morton, Balbir Singh, Daisuke Nishimura, Divyesh Shah, linux-mm On Mon, 27 Dec 2010, KAMEZAWA Hiroyuki wrote: > I dislike this feature but if someone other than goole want this, I'll ack. > some comments below. > I don't really understand that; if you think the feature is useful for the described use-case with minimal impact to memcg, then I'd hope you would ack it without needing to build a coalition of companies to ask for the feature to be merged. We're going to use it regardless of whether it's merged upstream, but I think there are other use cases where it could potentially be helpful: for example, to avoid a quick oom kill when there's a temporary spike in memory usage over time without modifying the memcg's limit. It seems plausible that a user would want to defer oom killing only if a memcg is oom for a certain amount of time. That's distinct and in addition to our use case, which is to not rely solely on userspace to solve these memcg deadlocks when it fails to respond appropriately and in a timely manner. > > diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt > > --- a/Documentation/cgroups/memory.txt > > +++ b/Documentation/cgroups/memory.txt > > @@ -68,6 +68,7 @@ Brief summary of control files. > > (See sysctl's vm.swappiness) > > memory.move_charge_at_immigrate # set/show controls of moving charges > > memory.oom_control # set/show oom controls. > > + memory.oom_delay_millisecs # set/show millisecs to wait before oom kill > > > > 1. History > > > > @@ -640,6 +641,22 @@ At reading, current status of OOM is shown. > > under_oom 0 or 1 (if 1, the memory cgroup is under OOM, tasks may > > be stopped.) > > > > +It is also possible to configure an oom killer timeout to prevent the > > +possibility that the memcg will livelock looking for memory if userspace > > It's not livelock. It's just 'stop'. No cpu consumption at all even if oom is > disabled. > I did s/livelock/deadlock. > > +has disabled the oom killer with oom_control but cannot act to fix the > > +condition itself (usually because userspace has become unresponsive). > > + > > +To set an oom killer timeout for a memcg, write the number of milliseconds > > +to wait before killing a task to memory.oom_delay_millisecs: > > + > > + # echo 60000 > memory.oom_delay_millisecs # 60 seconds before kill > > + > > I wonder whether this should be call as oom_delay you mention this feature as > 'timeout' a few times before here. I like 'timeout' rather than 'delay'. > And from this ducument, They are unclear that > 1. what happens when it used with oom_disable. > 2. what kind of timer is this. Is it a one-shot timer ? > 3. how work with hierarchy ? > These are three very good questions regarding my (currently inadequate) documentation and I'll add the answers to v3. I'd prefer to keep the name oom_delay_millisecs, however, since "delay" means oom killing will be deferred rather than "timeout," which means oom killing has taken too long and expired. > My suggestion for 1. is: > Please return -EBUSY or some if oom_disable=true and allow set timeout only when > oom_disable=false. Using both of two interface at the same time is too complex. 
> I added documentation that states that memory.oom_delay_millisecs is only effective if memory.oom_control == 0. I think configuring both should be allowed and memory.oom_control == 1 takes precedence anytime it is set; if it is cleared later, then any existing memory.oom_delay_millisecs becomes effective. > > +This timeout is reset the next time the memcg successfully charges memory > > +to a task. > > + > > +There is no delay if memory.oom_delay_millisecs is set to 0 (default). > > + > > + > > 11. TODO > > > > 1. Add support for accounting huge pages (as a separate controller) > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > > --- a/mm/memcontrol.c > > +++ b/mm/memcontrol.c > > @@ -233,12 +233,16 @@ struct mem_cgroup { > > * Should the accounting and control be hierarchical, per subtree? > > */ > > bool use_hierarchy; > > + /* oom_delay has expired and still out of memory? */ > > + bool oom_delay_expired; > > atomic_t oom_lock; > > atomic_t refcnt; > > > > unsigned int swappiness; > > /* OOM-Killer disable */ > > int oom_kill_disable; > > + /* number of ticks to stall before calling oom killer */ > > + int oom_delay; > > > > /* set when res.limit == memsw.limit */ > > bool memsw_is_minimum; > > @@ -1524,6 +1528,7 @@ static void memcg_wakeup_oom(struct mem_cgroup *mem) > > > > static void memcg_oom_recover(struct mem_cgroup *mem) > > { > > + mem->oom_delay_expired = false; > > if (mem && atomic_read(&mem->oom_lock)) > > memcg_wakeup_oom(mem); > > } > > @@ -1531,17 +1536,18 @@ static void memcg_oom_recover(struct mem_cgroup *mem) > > /* > > * try to call OOM killer. returns false if we should exit memory-reclaim loop. > > */ > > -bool mem_cgroup_handle_oom(struct mem_cgroup *mem, gfp_t mask) > > +static bool mem_cgroup_handle_oom(struct mem_cgroup *mem, gfp_t mask) > > { > > struct oom_wait_info owait; > > - bool locked, need_to_kill; > > + bool locked; > > + bool need_to_kill = true; > > + bool need_to_delay = false; > > > > owait.mem = mem; > > owait.wait.flags = 0; > > owait.wait.func = memcg_oom_wake_function; > > owait.wait.private = current; > > INIT_LIST_HEAD(&owait.wait.task_list); > > - need_to_kill = true; > > /* At first, try to OOM lock hierarchy under mem.*/ > > mutex_lock(&memcg_oom_mutex); > > locked = mem_cgroup_oom_lock(mem); > > @@ -1553,26 +1559,34 @@ bool mem_cgroup_handle_oom(struct mem_cgroup *mem, gfp_t mask) > > prepare_to_wait(&memcg_oom_waitq, &owait.wait, TASK_KILLABLE); > > if (!locked || mem->oom_kill_disable) > > need_to_kill = false; > > - if (locked) > > + if (locked) { > > mem_cgroup_oom_notify(mem); > > + if (mem->oom_delay && !mem->oom_delay_expired) { > > + need_to_kill = false; > > + need_to_delay = true; > > + } > > + } > > Hmm. When threads T1 and T2 enters this routine, it seems broken. > > Case 1) > T1 T2 > lock_oom. > locked=true lock_oom. > oom_notify() locked = false. > wait for msecs. wait until wakeup. > ...... > unlock_oom. > wakes up. > wake up all threads. > oom_delay_expired=true. wakes up. > oom_delay_expired=false. > > 2nd call of oom. > lock_oom. > locked=true. > oom_notify. > wait for msecs. > > Then, oom_notify is duplicated and no OOM happens. > memcg_wakeup_oom() wakes up all threads. So, I guess you should avoid to call > that. But hmm...I think there are other pitfalls. > > > Assume a hierachy as this. > > A > / \ > B C > > A.memory.use_hierarchy= 1 and (A,B,C) are under hierarchical control. > > At first, oom_disable is allowed to be set only against A. By setting > oom_disable to A, OOM in B and C are disabled, too. 
For that purpose, > mem_cgroup_oom_lock/unlock is provided. > > With your patch, even if oom_delay is set to A, B and C will never delay. > Please fix. > I think what might be clearer is if we made memory.oom_delay_millisecs a one-shot timer, as you mentioned previously. So once a memcg is oom, the task that grabs the oom_lock sleeps for the delay and then it is cleared. Later threads that are oom and grab oom_lock will immediately call the oom killer. Then, if userspace truly is dead, oom killing won't be delayed in the future and the kernel can take immediate action. And if userspace is alive, it can reset memory.oom_delay_millisecs. What do you think about that functionality instead? > Have a good new year. > Same to you, Kame!
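Roughly, the admin-visible behaviour of that one-shot proposal (as implemented in the v3 patch below, where the delay field is zeroed once consumed) would be:

  # echo 60000 > memory.oom_delay_millisecs    # arm a 60s delay
    ... first oom: tasks stall for up to 60 seconds, then the kernel
        kills; the delay is consumed in the process ...
  # cat memory.oom_delay_millisecs
  0
  # echo 60000 > memory.oom_delay_millisecs    # re-arm for the next oom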
* [patch v3] memcg: add oom killer delay 2010-12-28 5:22 ` David Rientjes @ 2010-12-28 6:29 ` David Rientjes 2011-01-04 1:41 ` KAMEZAWA Hiroyuki 0 siblings, 1 reply; 22+ messages in thread From: David Rientjes @ 2010-12-28 6:29 UTC (permalink / raw) To: Andrew Morton Cc: KAMEZAWA Hiroyuki, Balbir Singh, Daisuke Nishimura, Divyesh Shah, linux-mm Completely disabling the oom killer for a memcg is problematic if userspace is unable to address the condition itself, usually because it is unresponsive. This scenario creates a memcg deadlock: tasks are sitting in TASK_KILLABLE waiting for the limit to be increased, a task to exit or move, or the oom killer reenabled and userspace is unable to do so. An additional possible use case is to defer oom killing within a memcg for a set period of time, probably to prevent unnecessary kills due to temporary memory spikes, before allowing the kernel to handle the condition. This patch adds an oom killer delay so that a memcg may be configured to wait at least a pre-defined number of milliseconds before calling the oom killer. If the oom condition persists for this number of milliseconds, the oom killer will be called the next time the memory controller attempts to charge a page (and memory.oom_control is set to 0). This allows userspace to have a short period of time to respond to the condition before deferring to the kernel to kill a task. Admins may set the oom killer delay using the new interface: # echo 60000 > memory.oom_delay_millisecs This will defer oom killing to the kernel only after 60 seconds has elapsed. When setting memory.oom_delay, all pending delays have their charge retried and, if necessary, the new delay is then effected. The delay is cleared the first time the memcg is oom to avoid unnecessary waiting when userspace is unresponsive for future oom conditions. It may be set again using the above interface to enforce a delay on the next oom. Signed-off-by: David Rientjes <rientjes@google.com> --- Documentation/cgroups/memory.txt | 26 +++++++++++++++++++++ mm/memcontrol.c | 46 ++++++++++++++++++++++++++++++++++--- 2 files changed, 68 insertions(+), 4 deletions(-) diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt index 7781857..e426733 100644 --- a/Documentation/cgroups/memory.txt +++ b/Documentation/cgroups/memory.txt @@ -68,6 +68,7 @@ Brief summary of control files. (See sysctl's vm.swappiness) memory.move_charge_at_immigrate # set/show controls of moving charges memory.oom_control # set/show oom controls. + memory.oom_delay_millisecs # set/show millisecs to wait before oom kill 1. History @@ -640,6 +641,31 @@ At reading, current status of OOM is shown. under_oom 0 or 1 (if 1, the memory cgroup is under OOM, tasks may be stopped.) +It is also possible to configure an oom killer timeout to prevent the +possibility that the memcg will deadlock looking for memory if userspace +has disabled the oom killer with oom_control but cannot act to fix the +condition itself (usually because userspace has become unresponsive). + +To set an oom killer timeout for a memcg, write the number of milliseconds +to wait before killing a task to memory.oom_delay_millisecs: + + # echo 60000 > memory.oom_delay_millisecs # 60 seconds before kill + +This timeout is reset the first time the memcg is oom to prevent needlessly +waiting for the next oom when userspace is truly unresponsive. It may be +set again using the above interface to defer killing a task the next time +the memcg is oom. 
+ +Disabling the oom killer for a memcg with memory.oom_control takes +precedence over memory.oom_delay_millisecs, so it must be set to 0 +(default) to allow the oom kill after the delay has expired. + +This value is inherited from the memcg's parent on creation. + +There is no delay if memory.oom_delay_millisecs is set to 0 (default). +This tunable's upper bound is 60 minutes. + + 11. TODO 1. Add support for accounting huge pages (as a separate controller) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index e6aadd6..951a22c 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -229,6 +229,8 @@ struct mem_cgroup { unsigned int swappiness; /* OOM-Killer disable */ int oom_kill_disable; + /* number of ticks to stall before calling oom killer */ + int oom_delay; /* set when res.limit == memsw.limit */ bool memsw_is_minimum; @@ -1415,10 +1417,11 @@ static void memcg_oom_recover(struct mem_cgroup *mem) /* * try to call OOM killer. returns false if we should exit memory-reclaim loop. */ -bool mem_cgroup_handle_oom(struct mem_cgroup *mem, gfp_t mask) +static bool mem_cgroup_handle_oom(struct mem_cgroup *mem, gfp_t mask) { struct oom_wait_info owait; bool locked, need_to_kill; + long timeout = MAX_SCHEDULE_TIMEOUT; owait.mem = mem; owait.wait.flags = 0; @@ -1437,15 +1440,21 @@ bool mem_cgroup_handle_oom(struct mem_cgroup *mem, gfp_t mask) prepare_to_wait(&memcg_oom_waitq, &owait.wait, TASK_KILLABLE); if (!locked || mem->oom_kill_disable) need_to_kill = false; - if (locked) + if (locked) { + if (mem->oom_delay) { + need_to_kill = false; + timeout = mem->oom_delay; + mem->oom_delay = 0; + } mem_cgroup_oom_notify(mem); + } mutex_unlock(&memcg_oom_mutex); if (need_to_kill) { finish_wait(&memcg_oom_waitq, &owait.wait); mem_cgroup_out_of_memory(mem, mask); } else { - schedule(); + schedule_timeout(timeout); finish_wait(&memcg_oom_waitq, &owait.wait); } mutex_lock(&memcg_oom_mutex); @@ -1456,7 +1465,8 @@ bool mem_cgroup_handle_oom(struct mem_cgroup *mem, gfp_t mask) if (test_thread_flag(TIF_MEMDIE) || fatal_signal_pending(current)) return false; /* Give chance to dying process */ - schedule_timeout(1); + if (timeout != MAX_SCHEDULE_TIMEOUT) + schedule_timeout(1); return true; } @@ -3863,6 +3873,28 @@ static int mem_cgroup_oom_control_write(struct cgroup *cgrp, return 0; } +static u64 mem_cgroup_oom_delay_millisecs_read(struct cgroup *cgrp, + struct cftype *cft) +{ + struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp); + + return jiffies_to_msecs(memcg->oom_delay); +} + +static int mem_cgroup_oom_delay_millisecs_write(struct cgroup *cgrp, + struct cftype *cft, u64 val) +{ + struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp); + + /* Sanity check -- don't wait longer than an hour */ + if (val > (60 * 60 * 1000)) + return -EINVAL; + + memcg->oom_delay = msecs_to_jiffies(val); + memcg_oom_recover(memcg); + return 0; +} + static struct cftype mem_cgroup_files[] = { { .name = "usage_in_bytes", @@ -3926,6 +3958,11 @@ static struct cftype mem_cgroup_files[] = { .unregister_event = mem_cgroup_oom_unregister_event, .private = MEMFILE_PRIVATE(_OOM_TYPE, OOM_CONTROL), }, + { + .name = "oom_delay_millisecs", + .read_u64 = mem_cgroup_oom_delay_millisecs_read, + .write_u64 = mem_cgroup_oom_delay_millisecs_write, + }, }; #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP @@ -4164,6 +4201,7 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont) parent = mem_cgroup_from_cont(cont->parent); mem->use_hierarchy = parent->use_hierarchy; mem->oom_kill_disable = parent->oom_kill_disable; + mem->oom_delay = 
parent->oom_delay; } if (parent && parent->use_hierarchy) {
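One detail worth noting in the hunk above: replacing schedule() with schedule_timeout(timeout) keeps the indefinite-wait case intact, because MAX_SCHEDULE_TIMEOUT is special-cased by the scheduler. Schematically, as a condensed paraphrase of kernel/timer.c from this era (timer setup and teardown elided, so this is not the verbatim source):

    signed long __sched schedule_timeout(signed long timeout)
    {
            unsigned long expire;

            if (timeout == MAX_SCHEDULE_TIMEOUT) {
                    schedule();             /* sleep with no timer armed */
                    return timeout;         /* the old schedule() behavior */
            }
            expire = timeout + jiffies;
            /* ... arm a one-shot timer for 'expire', then sleep ... */
            schedule();
            /* ... remove the timer on wakeup ... */
            timeout = expire - jiffies;
            return timeout < 0 ? 0 : timeout;
    }

So a memcg with no delay configured still waits exactly as before, while a configured delay simply bounds the sleep.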
* Re: [patch v3] memcg: add oom killer delay 2010-12-28 6:29 ` [patch v3] " David Rientjes @ 2011-01-04 1:41 ` KAMEZAWA Hiroyuki 2011-01-04 3:59 ` Balbir Singh 0 siblings, 1 reply; 22+ messages in thread From: KAMEZAWA Hiroyuki @ 2011-01-04 1:41 UTC (permalink / raw) To: David Rientjes Cc: Andrew Morton, Balbir Singh, Daisuke Nishimura, Divyesh Shah, linux-mm On Mon, 27 Dec 2010 22:29:05 -0800 (PST) David Rientjes <rientjes@google.com> wrote: > Completely disabling the oom killer for a memcg is problematic if > userspace is unable to address the condition itself, usually because it > is unresponsive. This scenario creates a memcg deadlock: tasks are > sitting in TASK_KILLABLE waiting for the limit to be increased, a task to > exit or move, or the oom killer reenabled and userspace is unable to do > so. > > An additional possible use case is to defer oom killing within a memcg > for a set period of time, probably to prevent unnecessary kills due to > temporary memory spikes, before allowing the kernel to handle the > condition. > > This patch adds an oom killer delay so that a memcg may be configured to > wait at least a pre-defined number of milliseconds before calling the oom > killer. If the oom condition persists for this number of milliseconds, > the oom killer will be called the next time the memory controller > attempts to charge a page (and memory.oom_control is set to 0). This > allows userspace to have a short period of time to respond to the > condition before deferring to the kernel to kill a task. > > Admins may set the oom killer delay using the new interface: > > # echo 60000 > memory.oom_delay_millisecs > > This will defer oom killing to the kernel only after 60 seconds has > elapsed. When setting memory.oom_delay, all pending delays have their > charge retried and, if necessary, the new delay is then effected. > > The delay is cleared the first time the memcg is oom to avoid unnecessary > waiting when userspace is unresponsive for future oom conditions. It may > be set again using the above interface to enforce a delay on the next > oom. > > Signed-off-by: David Rientjes <rientjes@google.com> Changelog please. > --- > Documentation/cgroups/memory.txt | 26 +++++++++++++++++++++ > mm/memcontrol.c | 46 ++++++++++++++++++++++++++++++++++--- > 2 files changed, 68 insertions(+), 4 deletions(-) > > diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt > index 7781857..e426733 100644 > --- a/Documentation/cgroups/memory.txt > +++ b/Documentation/cgroups/memory.txt > @@ -68,6 +68,7 @@ Brief summary of control files. > (See sysctl's vm.swappiness) > memory.move_charge_at_immigrate # set/show controls of moving charges > memory.oom_control # set/show oom controls. > + memory.oom_delay_millisecs # set/show millisecs to wait before oom kill > > 1. History > > @@ -640,6 +641,31 @@ At reading, current status of OOM is shown. > under_oom 0 or 1 (if 1, the memory cgroup is under OOM, tasks may > be stopped.) > > +It is also possible to configure an oom killer timeout to prevent the > +possibility that the memcg will deadlock looking for memory if userspace > +has disabled the oom killer with oom_control but cannot act to fix the > +condition itself (usually because userspace has become unresponsive). 
> + > +To set an oom killer timeout for a memcg, write the number of milliseconds > +to wait before killing a task to memory.oom_delay_millisecs: > + > + # echo 60000 > memory.oom_delay_millisecs # 60 seconds before kill > + > +This timeout is reset the first time the memcg is oom to prevent needlessly > +waiting for the next oom when userspace is truly unresponsive. It may be > +set again using the above interface to defer killing a task the next time > +the memcg is oom. > + > +Disabling the oom killer for a memcg with memory.oom_control takes > +precedence over memory.oom_delay_millisecs, so it must be set to 0 > +(default) to allow the oom kill after the delay has expired. > + > +This value is inherited from the memcg's parent on creation. > + > +There is no delay if memory.oom_delay_millisecs is set to 0 (default). > +This tunable's upper bound is 60 minutes. Why upper-bounds is 60 minutes ? Do we have to have a limit ? Hmm, I feel 60minutes is too short. I like 32 or 31 bit limit. > + > + > 11. TODO > > 1. Add support for accounting huge pages (as a separate controller) > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index e6aadd6..951a22c 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -229,6 +229,8 @@ struct mem_cgroup { > unsigned int swappiness; > /* OOM-Killer disable */ > int oom_kill_disable; > + /* number of ticks to stall before calling oom killer */ > + int oom_delay; > > /* set when res.limit == memsw.limit */ > bool memsw_is_minimum; > @@ -1415,10 +1417,11 @@ static void memcg_oom_recover(struct mem_cgroup *mem) > /* > * try to call OOM killer. returns false if we should exit memory-reclaim loop. > */ > -bool mem_cgroup_handle_oom(struct mem_cgroup *mem, gfp_t mask) > +static bool mem_cgroup_handle_oom(struct mem_cgroup *mem, gfp_t mask) > { > struct oom_wait_info owait; > bool locked, need_to_kill; > + long timeout = MAX_SCHEDULE_TIMEOUT; > > owait.mem = mem; > owait.wait.flags = 0; > @@ -1437,15 +1440,21 @@ bool mem_cgroup_handle_oom(struct mem_cgroup *mem, gfp_t mask) > prepare_to_wait(&memcg_oom_waitq, &owait.wait, TASK_KILLABLE); > if (!locked || mem->oom_kill_disable) > need_to_kill = false; > - if (locked) > + if (locked) { > + if (mem->oom_delay) { > + need_to_kill = false; > + timeout = mem->oom_delay; > + mem->oom_delay = 0; > + } > mem_cgroup_oom_notify(mem); > + } > mutex_unlock(&memcg_oom_mutex); > > if (need_to_kill) { > finish_wait(&memcg_oom_waitq, &owait.wait); > mem_cgroup_out_of_memory(mem, mask); > } else { > - schedule(); > + schedule_timeout(timeout); > finish_wait(&memcg_oom_waitq, &owait.wait); > } > mutex_lock(&memcg_oom_mutex); > @@ -1456,7 +1465,8 @@ bool mem_cgroup_handle_oom(struct mem_cgroup *mem, gfp_t mask) > if (test_thread_flag(TIF_MEMDIE) || fatal_signal_pending(current)) > return false; > /* Give chance to dying process */ > - schedule_timeout(1); > + if (timeout != MAX_SCHEDULE_TIMEOUT) != ? This seems to change existing behavior. 
> + schedule_timeout(1); > return true; > } > > @@ -3863,6 +3873,28 @@ static int mem_cgroup_oom_control_write(struct cgroup *cgrp, > return 0; > } > > +static u64 mem_cgroup_oom_delay_millisecs_read(struct cgroup *cgrp, > + struct cftype *cft) > +{ > + struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp); > + > + return jiffies_to_msecs(memcg->oom_delay); > +} > + > +static int mem_cgroup_oom_delay_millisecs_write(struct cgroup *cgrp, > + struct cftype *cft, u64 val) > +{ > + struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp); > + > + /* Sanity check -- don't wait longer than an hour */ > + if (val > (60 * 60 * 1000)) > + return -EINVAL; > + > + memcg->oom_delay = msecs_to_jiffies(val); > + memcg_oom_recover(memcg); > + return 0; > +} > + Please allow this only at the root of a sub-hierarchy with no children....(*) (please check how mem_cgroup_oom_lock/unlock() works under use_hierarchy=1) > static struct cftype mem_cgroup_files[] = { > { > .name = "usage_in_bytes", > @@ -3926,6 +3958,11 @@ static struct cftype mem_cgroup_files[] = { > .unregister_event = mem_cgroup_oom_unregister_event, > .private = MEMFILE_PRIVATE(_OOM_TYPE, OOM_CONTROL), > }, > + { > + .name = "oom_delay_millisecs", > + .read_u64 = mem_cgroup_oom_delay_millisecs_read, > + .write_u64 = mem_cgroup_oom_delay_millisecs_write, > + }, > }; > > #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP > > @@ -4164,6 +4201,7 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont) > parent = mem_cgroup_from_cont(cont->parent); > mem->use_hierarchy = parent->use_hierarchy; > mem->oom_kill_disable = parent->oom_kill_disable; > + mem->oom_delay = parent->oom_delay; Because of (*), oom_kill_disable can be copied here. If you want to inherit this, you should do (*) or update the value across the whole hierarchy. Thanks, -Kame
* Re: [patch v3] memcg: add oom killer delay 2011-01-04 1:41 ` KAMEZAWA Hiroyuki @ 2011-01-04 3:59 ` Balbir Singh 2011-01-06 1:53 ` KAMEZAWA Hiroyuki 0 siblings, 1 reply; 22+ messages in thread From: Balbir Singh @ 2011-01-04 3:59 UTC (permalink / raw) To: KAMEZAWA Hiroyuki Cc: David Rientjes, Andrew Morton, Daisuke Nishimura, Divyesh Shah, linux-mm * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2011-01-04 10:41:30]: > On Mon, 27 Dec 2010 22:29:05 -0800 (PST) > David Rientjes <rientjes@google.com> wrote: > > > Completely disabling the oom killer for a memcg is problematic if > > userspace is unable to address the condition itself, usually because it > > is unresponsive. This scenario creates a memcg deadlock: tasks are > > sitting in TASK_KILLABLE waiting for the limit to be increased, a task to > > exit or move, or the oom killer reenabled and userspace is unable to do > > so. > > > > An additional possible use case is to defer oom killing within a memcg > > for a set period of time, probably to prevent unnecessary kills due to > > temporary memory spikes, before allowing the kernel to handle the > > condition. > > > > This patch adds an oom killer delay so that a memcg may be configured to > > wait at least a pre-defined number of milliseconds before calling the oom > > killer. If the oom condition persists for this number of milliseconds, > > the oom killer will be called the next time the memory controller > > attempts to charge a page (and memory.oom_control is set to 0). This > > allows userspace to have a short period of time to respond to the > > condition before deferring to the kernel to kill a task. > > > > Admins may set the oom killer delay using the new interface: > > > > # echo 60000 > memory.oom_delay_millisecs > > > > This will defer oom killing to the kernel only after 60 seconds has > > elapsed. When setting memory.oom_delay, all pending delays have their > > charge retried and, if necessary, the new delay is then effected. > > > > The delay is cleared the first time the memcg is oom to avoid unnecessary > > waiting when userspace is unresponsive for future oom conditions. It may > > be set again using the above interface to enforce a delay on the next > > oom. > > > > Signed-off-by: David Rientjes <rientjes@google.com> > > Changelog please. > > > > --- > > Documentation/cgroups/memory.txt | 26 +++++++++++++++++++++ > > mm/memcontrol.c | 46 ++++++++++++++++++++++++++++++++++--- > > 2 files changed, 68 insertions(+), 4 deletions(-) > > > > diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt > > index 7781857..e426733 100644 > > --- a/Documentation/cgroups/memory.txt > > +++ b/Documentation/cgroups/memory.txt > > @@ -68,6 +68,7 @@ Brief summary of control files. > > (See sysctl's vm.swappiness) > > memory.move_charge_at_immigrate # set/show controls of moving charges > > memory.oom_control # set/show oom controls. > > + memory.oom_delay_millisecs # set/show millisecs to wait before oom kill > > > > 1. History > > > > @@ -640,6 +641,31 @@ At reading, current status of OOM is shown. > > under_oom 0 or 1 (if 1, the memory cgroup is under OOM, tasks may > > be stopped.) > > > > +It is also possible to configure an oom killer timeout to prevent the > > +possibility that the memcg will deadlock looking for memory if userspace > > +has disabled the oom killer with oom_control but cannot act to fix the > > +condition itself (usually because userspace has become unresponsive). 
> > + > > +To set an oom killer timeout for a memcg, write the number of milliseconds > > +to wait before killing a task to memory.oom_delay_millisecs: > > + > > + # echo 60000 > memory.oom_delay_millisecs # 60 seconds before kill > > + > > +This timeout is reset the first time the memcg is oom to prevent needlessly > > +waiting for the next oom when userspace is truly unresponsive. It may be > > +set again using the above interface to defer killing a task the next time > > +the memcg is oom. > > + > > +Disabling the oom killer for a memcg with memory.oom_control takes > > +precedence over memory.oom_delay_millisecs, so it must be set to 0 > > +(default) to allow the oom kill after the delay has expired. > > + > > +This value is inherited from the memcg's parent on creation. > > + > > +There is no delay if memory.oom_delay_millisecs is set to 0 (default). > > +This tunable's upper bound is 60 minutes. > > Why upper-bounds is 60 minutes ? Do we have to have a limit ? > Hmm, I feel 60minutes is too short. I like 32 or 31 bit limit. > I agree > > > + > > + > > 11. TODO > > > > 1. Add support for accounting huge pages (as a separate controller) > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > > index e6aadd6..951a22c 100644 > > --- a/mm/memcontrol.c > > +++ b/mm/memcontrol.c > > @@ -229,6 +229,8 @@ struct mem_cgroup { > > unsigned int swappiness; > > /* OOM-Killer disable */ > > int oom_kill_disable; > > + /* number of ticks to stall before calling oom killer */ > > + int oom_delay; > > > > /* set when res.limit == memsw.limit */ > > bool memsw_is_minimum; > > @@ -1415,10 +1417,11 @@ static void memcg_oom_recover(struct mem_cgroup *mem) > > /* > > * try to call OOM killer. returns false if we should exit memory-reclaim loop. > > */ > > -bool mem_cgroup_handle_oom(struct mem_cgroup *mem, gfp_t mask) > > +static bool mem_cgroup_handle_oom(struct mem_cgroup *mem, gfp_t mask) > > { > > struct oom_wait_info owait; > > bool locked, need_to_kill; > > + long timeout = MAX_SCHEDULE_TIMEOUT; > > > > owait.mem = mem; > > owait.wait.flags = 0; > > @@ -1437,15 +1440,21 @@ bool mem_cgroup_handle_oom(struct mem_cgroup *mem, gfp_t mask) > > prepare_to_wait(&memcg_oom_waitq, &owait.wait, TASK_KILLABLE); > > if (!locked || mem->oom_kill_disable) > > need_to_kill = false; > > - if (locked) > > + if (locked) { > > + if (mem->oom_delay) { > > + need_to_kill = false; > > + timeout = mem->oom_delay; > > + mem->oom_delay = 0; > > + } > > mem_cgroup_oom_notify(mem); > > + } > > mutex_unlock(&memcg_oom_mutex); > > > > if (need_to_kill) { > > finish_wait(&memcg_oom_waitq, &owait.wait); > > mem_cgroup_out_of_memory(mem, mask); > > } else { > > - schedule(); > > + schedule_timeout(timeout); > > finish_wait(&memcg_oom_waitq, &owait.wait); > > } > > mutex_lock(&memcg_oom_mutex); > > @@ -1456,7 +1465,8 @@ bool mem_cgroup_handle_oom(struct mem_cgroup *mem, gfp_t mask) > > if (test_thread_flag(TIF_MEMDIE) || fatal_signal_pending(current)) > > return false; > > /* Give chance to dying process */ > > - schedule_timeout(1); > > + if (timeout != MAX_SCHEDULE_TIMEOUT) > > != ? > > This seems to change existing behavior. > Ideally it should be "==", if oom_delay was never set, we want to schedule_timeout(1). BTW, the sched* makes me wonder by how much we increase the ctxsw rate, but I guess in the OOM path, we should not bother much. I'll do some testing around this. 
> > + schedule_timeout(1); > > return true; > > } > > > > @@ -3863,6 +3873,28 @@ static int mem_cgroup_oom_control_write(struct cgroup *cgrp, > > return 0; > > } > > > > +static u64 mem_cgroup_oom_delay_millisecs_read(struct cgroup *cgrp, > > + struct cftype *cft) > > +{ > > + struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp); > > + > > + return jiffies_to_msecs(memcg->oom_delay); > > +} > > + > > +static int mem_cgroup_oom_delay_millisecs_write(struct cgroup *cgrp, > > + struct cftype *cft, u64 val) > > +{ > > + struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp); > > + > > + /* Sanity check -- don't wait longer than an hour */ > > + if (val > (60 * 60 * 1000)) > > + return -EINVAL; > > + > > + memcg->oom_delay = msecs_to_jiffies(val); > > + memcg_oom_recover(memcg); > > + return 0; > > +} > > + > > Please allow this only at the root of a sub-hierarchy with no children....(*) > > (please check how mem_cgroup_oom_lock/unlock() works under use_hierarchy=1) > Kamezawa-San, I am not sure your comment is clear. Are you suggesting that, since memcg is the root of a hierarchy, we need to use hierarchical locking before changing the value of the root's oom_delay? > > static struct cftype mem_cgroup_files[] = { > > { > > .name = "usage_in_bytes", > > @@ -3926,6 +3958,11 @@ static struct cftype mem_cgroup_files[] = { > > .unregister_event = mem_cgroup_oom_unregister_event, > > .private = MEMFILE_PRIVATE(_OOM_TYPE, OOM_CONTROL), > > }, > > + { > > + .name = "oom_delay_millisecs", > > + .read_u64 = mem_cgroup_oom_delay_millisecs_read, > > + .write_u64 = mem_cgroup_oom_delay_millisecs_write, > > + }, > > }; > > > > #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP > > @@ -4164,6 +4201,7 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont) > > parent = mem_cgroup_from_cont(cont->parent); > > mem->use_hierarchy = parent->use_hierarchy; > > mem->oom_kill_disable = parent->oom_kill_disable; > > + mem->oom_delay = parent->oom_delay; > > Because of (*), oom_kill_disable can be copied here. > > If you want to inherit this, you should do (*) or update the value across the whole hierarchy. > Not sure I understand this either. I would ideally like to see these copied if use_hierarchy is set. > > Thanks, > -Kame > -- Three Cheers, Balbir
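To spell out the "==" versus "!=" point, the tail of mem_cgroup_handle_oom() under Balbir's suggested reading would be (a sketch for discussion, not a posted patch):

            if (test_thread_flag(TIF_MEMDIE) || fatal_signal_pending(current))
                    return false;
            /*
             * Give the dying process a chance, as the existing code
             * always did -- but skip the extra tick if we already
             * slept for an armed oom_delay above.
             */
            if (timeout == MAX_SCHEDULE_TIMEOUT)
                    schedule_timeout(1);
            return true;

With "!=" as posted, the yield happens only when a delay was armed, which inverts the existing behavior for the common no-delay case.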
* Re: [patch v3] memcg: add oom killer delay 2011-01-04 3:59 ` Balbir Singh @ 2011-01-06 1:53 ` KAMEZAWA Hiroyuki 2011-01-06 5:46 ` Balbir Singh 0 siblings, 1 reply; 22+ messages in thread From: KAMEZAWA Hiroyuki @ 2011-01-06 1:53 UTC (permalink / raw) To: balbir Cc: David Rientjes, Andrew Morton, Daisuke Nishimura, Divyesh Shah, linux-mm On Tue, 4 Jan 2011 09:29:57 +0530 Balbir Singh <balbir@linux.vnet.ibm.com> wrote: > * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2011-01-04 10:41:30]: > > > On Mon, 27 Dec 2010 22:29:05 -0800 (PST) > > David Rientjes <rientjes@google.com> wrote: > > > > > Completely disabling the oom killer for a memcg is problematic if > > > userspace is unable to address the condition itself, usually because it > > > is unresponsive. This scenario creates a memcg deadlock: tasks are > > > sitting in TASK_KILLABLE waiting for the limit to be increased, a task to > > > exit or move, or the oom killer reenabled and userspace is unable to do > > > so. > > > > > > An additional possible use case is to defer oom killing within a memcg > > > for a set period of time, probably to prevent unnecessary kills due to > > > temporary memory spikes, before allowing the kernel to handle the > > > condition. > > > > > > This patch adds an oom killer delay so that a memcg may be configured to > > > wait at least a pre-defined number of milliseconds before calling the oom > > > killer. If the oom condition persists for this number of milliseconds, > > > the oom killer will be called the next time the memory controller > > > attempts to charge a page (and memory.oom_control is set to 0). This > > > allows userspace to have a short period of time to respond to the > > > condition before deferring to the kernel to kill a task. > > > > > > Admins may set the oom killer delay using the new interface: > > > > > > # echo 60000 > memory.oom_delay_millisecs > > > > > > This will defer oom killing to the kernel only after 60 seconds has > > > elapsed. When setting memory.oom_delay, all pending delays have their > > > charge retried and, if necessary, the new delay is then effected. > > > > > > The delay is cleared the first time the memcg is oom to avoid unnecessary > > > waiting when userspace is unresponsive for future oom conditions. It may > > > be set again using the above interface to enforce a delay on the next > > > oom. > > > > > > Signed-off-by: David Rientjes <rientjes@google.com> > > > > Changelog please. > > > > > > > --- > > > Documentation/cgroups/memory.txt | 26 +++++++++++++++++++++ > > > mm/memcontrol.c | 46 ++++++++++++++++++++++++++++++++++--- > > > 2 files changed, 68 insertions(+), 4 deletions(-) > > > > > > diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt > > > index 7781857..e426733 100644 > > > --- a/Documentation/cgroups/memory.txt > > > +++ b/Documentation/cgroups/memory.txt > > > @@ -68,6 +68,7 @@ Brief summary of control files. > > > (See sysctl's vm.swappiness) > > > memory.move_charge_at_immigrate # set/show controls of moving charges > > > memory.oom_control # set/show oom controls. > > > + memory.oom_delay_millisecs # set/show millisecs to wait before oom kill > > > > > > 1. History > > > > > > @@ -640,6 +641,31 @@ At reading, current status of OOM is shown. > > > under_oom 0 or 1 (if 1, the memory cgroup is under OOM, tasks may > > > be stopped.) 
> > > > > > +It is also possible to configure an oom killer timeout to prevent the > > > +possibility that the memcg will deadlock looking for memory if userspace > > > +has disabled the oom killer with oom_control but cannot act to fix the > > > +condition itself (usually because userspace has become unresponsive). > > > + > > > +To set an oom killer timeout for a memcg, write the number of milliseconds > > > +to wait before killing a task to memory.oom_delay_millisecs: > > > + > > > + # echo 60000 > memory.oom_delay_millisecs # 60 seconds before kill > > > + > > > +This timeout is reset the first time the memcg is oom to prevent needlessly > > > +waiting for the next oom when userspace is truly unresponsive. It may be > > > +set again using the above interface to defer killing a task the next time > > > +the memcg is oom. > > > + > > > +Disabling the oom killer for a memcg with memory.oom_control takes > > > +precedence over memory.oom_delay_millisecs, so it must be set to 0 > > > +(default) to allow the oom kill after the delay has expired. > > > + > > > +This value is inherited from the memcg's parent on creation. > > > + > > > +There is no delay if memory.oom_delay_millisecs is set to 0 (default). > > > +This tunable's upper bound is 60 minutes. > > > > Why upper-bounds is 60 minutes ? Do we have to have a limit ? > > Hmm, I feel 60minutes is too short. I like 32 or 31 bit limit. > > > > I agree > > > > > > + > > > + > > > 11. TODO > > > > > > 1. Add support for accounting huge pages (as a separate controller) > > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > > > index e6aadd6..951a22c 100644 > > > --- a/mm/memcontrol.c > > > +++ b/mm/memcontrol.c > > > @@ -229,6 +229,8 @@ struct mem_cgroup { > > > unsigned int swappiness; > > > /* OOM-Killer disable */ > > > int oom_kill_disable; > > > + /* number of ticks to stall before calling oom killer */ > > > + int oom_delay; > > > > > > /* set when res.limit == memsw.limit */ > > > bool memsw_is_minimum; > > > @@ -1415,10 +1417,11 @@ static void memcg_oom_recover(struct mem_cgroup *mem) > > > /* > > > * try to call OOM killer. returns false if we should exit memory-reclaim loop. 
> > > */ > > > -bool mem_cgroup_handle_oom(struct mem_cgroup *mem, gfp_t mask) > > > +static bool mem_cgroup_handle_oom(struct mem_cgroup *mem, gfp_t mask) > > > { > > > struct oom_wait_info owait; > > > bool locked, need_to_kill; > > > + long timeout = MAX_SCHEDULE_TIMEOUT; > > > > > > owait.mem = mem; > > > owait.wait.flags = 0; > > > @@ -1437,15 +1440,21 @@ bool mem_cgroup_handle_oom(struct mem_cgroup *mem, gfp_t mask) > > > prepare_to_wait(&memcg_oom_waitq, &owait.wait, TASK_KILLABLE); > > > if (!locked || mem->oom_kill_disable) > > > need_to_kill = false; > > > - if (locked) > > > + if (locked) { > > > + if (mem->oom_delay) { > > > + need_to_kill = false; > > > + timeout = mem->oom_delay; > > > + mem->oom_delay = 0; > > > + } > > > mem_cgroup_oom_notify(mem); > > > + } > > > mutex_unlock(&memcg_oom_mutex); > > > > > > if (need_to_kill) { > > > finish_wait(&memcg_oom_waitq, &owait.wait); > > > mem_cgroup_out_of_memory(mem, mask); > > > } else { > > > - schedule(); > > > + schedule_timeout(timeout); > > > finish_wait(&memcg_oom_waitq, &owait.wait); > > > } > > > mutex_lock(&memcg_oom_mutex); > > > @@ -1456,7 +1465,8 @@ bool mem_cgroup_handle_oom(struct mem_cgroup *mem, gfp_t mask) > > > if (test_thread_flag(TIF_MEMDIE) || fatal_signal_pending(current)) > > > return false; > > > /* Give chance to dying process */ > > > - schedule_timeout(1); > > > + if (timeout != MAX_SCHEDULE_TIMEOUT) > > > > != ? > > > > This seems to change existing behavior. > > > > Ideally it should be "==", if oom_delay was never set, we want to > schedule_timeout(1). BTW, the sched* makes me wonder by how much we > increase the ctxsw rate, but I guess in the OOM path, we should not > bother much. I'll do some testing around this. > Anyway, please do in other patch if you change existing behavior for some purpose. > > > + schedule_timeout(1); > > > return true; > > > } > > > > > > @@ -3863,6 +3873,28 @@ static int mem_cgroup_oom_control_write(struct cgroup *cgrp, > > > return 0; > > > } > > > > > > +static u64 mem_cgroup_oom_delay_millisecs_read(struct cgroup *cgrp, > > > + struct cftype *cft) > > > +{ > > > + struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp); > > > + > > > + return jiffies_to_msecs(memcg->oom_delay); > > > +} > > > + > > > +static int mem_cgroup_oom_delay_millisecs_write(struct cgroup *cgrp, > > > + struct cftype *cft, u64 val) > > > +{ > > > + struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp); > > > + > > > + /* Sanity check -- don't wait longer than an hour */ > > > + if (val > (60 * 60 * 1000)) > > > + return -EINVAL; > > > + > > > + memcg->oom_delay = msecs_to_jiffies(val); > > > + memcg_oom_recover(memcg); > > > + return 0; > > > +} > > > + > > > > Please allow this to the root of sub-hierarchy and no children....(*) > > (please check how mem_cgroup_oom_lock/unlock() works under use_hierarchy=1) > > > > Kamezawa-San, not sure if your comment is clear, are you suggesting > > Since memcg is the root of a hierarchy, we need to use hierarchical > locking before changing the value of the root oom_delay? > No. mem_cgroup_oom_lock() is a lock for hierarchy, not for a group. For example, A / \ B C In above hierarchy, when C is in OOM, A's OOM will be blocked by C's OOM. Because A's OOM can be fixed by C's oom-kill. This means oom_delay for A should be for C (and B), IOW, for hierarchy. A and B, C should have the same oom_delay, oom_disable value. About oom_disable, - It can be set only when A has no children and root of hierarchy. - It's inherited at creating children. 
Then A, B, and C all have the same value. Considering race conditions, I like the current oom_disable approach. Thanks, -Kame
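For concreteness, the restriction described here could look like the following version of the patch's write handler, modeled on the checks mem_cgroup_oom_control_write() already applies to oom_control (a sketch only; the exact checks are illustrative, not from a posted patch):

    static int mem_cgroup_oom_delay_millisecs_write(struct cgroup *cgrp,
                                                    struct cftype *cft, u64 val)
    {
            struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
            struct mem_cgroup *parent;

            if (!cgrp->parent)      /* cannot be set on the root cgroup */
                    return -EINVAL;
            parent = mem_cgroup_from_cont(cgrp->parent);

            cgroup_lock();
            /*
             * Only the root of a sub-hierarchy, before it has children,
             * may set the delay; children then inherit it at creation,
             * so A, B and C always agree on the value.
             */
            if (parent->use_hierarchy ||
                (memcg->use_hierarchy && !list_empty(&cgrp->children))) {
                    cgroup_unlock();
                    return -EINVAL;
            }
            memcg->oom_delay = msecs_to_jiffies(val);
            cgroup_unlock();
            return 0;
    }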
* Re: [patch v3] memcg: add oom killer delay 2011-01-06 1:53 ` KAMEZAWA Hiroyuki @ 2011-01-06 5:46 ` Balbir Singh 2011-01-06 5:52 ` KAMEZAWA Hiroyuki 0 siblings, 1 reply; 22+ messages in thread From: Balbir Singh @ 2011-01-06 5:46 UTC (permalink / raw) To: KAMEZAWA Hiroyuki Cc: David Rientjes, Andrew Morton, Daisuke Nishimura, Divyesh Shah, linux-mm * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2011-01-06 10:53:15]: > > Kamezawa-San, I am not sure your comment is clear. Are you suggesting > > that, since memcg is the root of a hierarchy, we need to use hierarchical > > locking before changing the value of the root's oom_delay? > > No. mem_cgroup_oom_lock() is a lock for hierarchy, not for a group. > For example, > A > / \ > B C > > In above hierarchy, when C is in OOM, A's OOM will be blocked by C's OOM. > Because A's OOM can be fixed by C's oom-kill. > This means oom_delay for A should be for C (and B), IOW, for hierarchy. > > A and B, C should have the same oom_delay, oom_disable value. Why so? You already mentioned that A's OOM will be blocked by C's OOM. If we keep that behaviour and C has a different oom_delay value, it won't matter, since we'll never go up to A. If the patch breaks that behaviour then we are in trouble. With hierarchy we need to ensure that if A has an oom_delay set and C does not, A's setting takes precedence. In the absence of that logic, what you say makes sense. > About oom_disable, > - It can be set only when A has no children and root of hierarchy. > - It's inherited at creating children. > > Then A, B, and C all have the same value. > > Considering race conditions, I like the current oom_disable approach. Thanks for clarifying. -- Three Cheers, Balbir
* Re: [patch v3] memcg: add oom killer delay 2011-01-06 5:46 ` Balbir Singh @ 2011-01-06 5:52 ` KAMEZAWA Hiroyuki 0 siblings, 0 replies; 22+ messages in thread From: KAMEZAWA Hiroyuki @ 2011-01-06 5:52 UTC (permalink / raw) To: balbir Cc: David Rientjes, Andrew Morton, Daisuke Nishimura, Divyesh Shah, linux-mm On Thu, 6 Jan 2011 11:16:00 +0530 Balbir Singh <balbir@linux.vnet.ibm.com> wrote: > * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2011-01-06 10:53:15]: > > > Kamezawa-San, I am not sure your comment is clear. Are you suggesting > > > that, since memcg is the root of a hierarchy, we need to use hierarchical > > > locking before changing the value of the root's oom_delay? > > No. mem_cgroup_oom_lock() is a lock for hierarchy, not for a group. > > For example, > > A > > / \ > > B C > > In above hierarchy, when C is in OOM, A's OOM will be blocked by C's OOM. > > Because A's OOM can be fixed by C's oom-kill. > > This means oom_delay for A should be for C (and B), IOW, for hierarchy. > > A and B, C should have the same oom_delay, oom_disable value. > Why so? You already mentioned that A's OOM will be blocked by C's OOM. > If we keep that behaviour and C has a different oom_delay value, it > won't matter, since we'll never go up to A. When C's oom_delay is 10min and A's oom_delay is 1min, A can be blocked for 10min even though it has a 1min delay. I don't want such a complex rule. > If the patch breaks that behaviour then we are in trouble. With hierarchy > we need to ensure that if A has an oom_delay set and C does not, A's > setting takes precedence. In the absence of that logic, what you say makes sense. His implementation doesn't do that, and I want a simple rule, even if it means acking a feature that seems of no use to me. Thanks, -Kame
* Re: [patch] memcg: add oom killer delay 2010-12-22 7:27 [patch] memcg: add oom killer delay David Rientjes 2010-12-22 7:59 ` Andrew Morton @ 2010-12-25 10:47 ` Balbir Singh 2010-12-26 20:35 ` David Rientjes 1 sibling, 1 reply; 22+ messages in thread From: Balbir Singh @ 2010-12-25 10:47 UTC (permalink / raw) To: David Rientjes Cc: Andrew Morton, Daisuke Nishimura, KAMEZAWA Hiroyuki, Divyesh Shah, linux-mm * David Rientjes <rientjes@google.com> [2010-12-21 23:27:25]: > Completely disabling the oom killer for a memcg is problematic if > userspace is unable to address the condition itself, usually because > userspace is unresponsive. This scenario creates a memcg livelock: > tasks are continuously trying to allocate memory and nothing is getting > killed, so memory freeing is impossible since reclaim has failed, and > all work stalls with no remedy in sight. > > This patch adds an oom killer delay so that a memcg may be configured to > wait at least a pre-defined number of milliseconds before calling the > oom killer. If the oom condition persists for this number of > milliseconds, the oom killer will be called the next time the memory > controller attempts to charge a page (and memory.oom_control is set to > 0). This allows userspace to have a short period of time to respond to > the condition before timing out and deferring to the kernel to kill a > task. > > Admins may set the oom killer timeout using the new interface: > > # echo 60000 > memory.oom_delay > > This will defer oom killing to the kernel only after 60 seconds has > elapsed. When setting memory.oom_delay, all pending timeouts are > restarted. I think Paul mentioned this problem and solution (I think already in use at google) some time back. Is x miliseconds a per oom kill decision timer or is it global? > > Signed-off-by: David Rientjes <rientjes@google.com> > --- > Documentation/cgroups/memory.txt | 15 ++++++++++ > mm/memcontrol.c | 56 +++++++++++++++++++++++++++++++++---- > 2 files changed, 65 insertions(+), 6 deletions(-) > > diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt > --- a/Documentation/cgroups/memory.txt > +++ b/Documentation/cgroups/memory.txt > @@ -68,6 +68,7 @@ Brief summary of control files. > (See sysctl's vm.swappiness) > memory.move_charge_at_immigrate # set/show controls of moving charges > memory.oom_control # set/show oom controls. > + memory.oom_delay # set/show millisecs to wait before oom kill > > 1. History > > @@ -640,6 +641,20 @@ At reading, current status of OOM is shown. > under_oom 0 or 1 (if 1, the memory cgroup is under OOM, tasks may > be stopped.) > > +It is also possible to configure an oom killer timeout to prevent the > +possibility that the memcg will livelock looking for memory if userspace > +has disabled the oom killer with oom_control but cannot act to fix the > +condition itself (usually because userspace has become unresponsive). > + > +To set an oom killer timeout for a memcg, write the number of milliseconds > +to wait before killing a task to memory.oom_delay: > + > + # echo 60000 > memory.oom_delay # wait 60 seconds, then oom kill > + > +This timeout is reset the next time the memcg successfully charges memory > +to a task. > + > + > 11. TODO > > 1. Add support for accounting huge pages (as a separate controller) > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -233,12 +233,16 @@ struct mem_cgroup { > * Should the accounting and control be hierarchical, per subtree? 
> */ > bool use_hierarchy; > + /* oom_delay has expired and still out of memory? */ > + bool oom_kill_delay_expired; > atomic_t oom_lock; > atomic_t refcnt; > > unsigned int swappiness; > /* OOM-Killer disable */ > int oom_kill_disable; > + /* min number of ticks to wait before calling oom killer */ > + int oom_kill_delay; > > /* set when res.limit == memsw.limit */ > bool memsw_is_minimum; > @@ -1524,6 +1528,7 @@ static void memcg_wakeup_oom(struct mem_cgroup *mem) > > static void memcg_oom_recover(struct mem_cgroup *mem) > { > + mem->oom_kill_delay_expired = false; > if (mem && atomic_read(&mem->oom_lock)) > memcg_wakeup_oom(mem); > } > @@ -1531,17 +1536,18 @@ static void memcg_oom_recover(struct mem_cgroup *mem) > /* > * try to call OOM killer. returns false if we should exit memory-reclaim loop. > */ > -bool mem_cgroup_handle_oom(struct mem_cgroup *mem, gfp_t mask) > +static bool mem_cgroup_handle_oom(struct mem_cgroup *mem, gfp_t mask) > { > struct oom_wait_info owait; > - bool locked, need_to_kill; > + bool locked; > + bool need_to_kill = true; > + bool need_to_delay = false; > > owait.mem = mem; > owait.wait.flags = 0; > owait.wait.func = memcg_oom_wake_function; > owait.wait.private = current; > INIT_LIST_HEAD(&owait.wait.task_list); > - need_to_kill = true; > /* At first, try to OOM lock hierarchy under mem.*/ > mutex_lock(&memcg_oom_mutex); > locked = mem_cgroup_oom_lock(mem); > @@ -1553,26 +1559,34 @@ bool mem_cgroup_handle_oom(struct mem_cgroup *mem, gfp_t mask) > prepare_to_wait(&memcg_oom_waitq, &owait.wait, TASK_KILLABLE); > if (!locked || mem->oom_kill_disable) > need_to_kill = false; > - if (locked) > + if (locked) { > mem_cgroup_oom_notify(mem); > + if (mem->oom_kill_delay && !mem->oom_kill_delay_expired) { > + need_to_kill = false; > + need_to_delay = true; > + } > + } > mutex_unlock(&memcg_oom_mutex); > > if (need_to_kill) { > finish_wait(&memcg_oom_waitq, &owait.wait); > mem_cgroup_out_of_memory(mem, mask); > } else { > - schedule(); > + schedule_timeout(need_to_delay ? mem->oom_kill_delay : > + MAX_SCHEDULE_TIMEOUT); > finish_wait(&memcg_oom_waitq, &owait.wait); > } > mutex_lock(&memcg_oom_mutex); > mem_cgroup_oom_unlock(mem); > memcg_wakeup_oom(mem); > + mem->oom_kill_delay_expired = need_to_delay; > mutex_unlock(&memcg_oom_mutex); > > if (test_thread_flag(TIF_MEMDIE) || fatal_signal_pending(current)) > return false; > /* Give chance to dying process */ > - schedule_timeout(1); > + if (!need_to_delay) > + schedule_timeout(1); > return true; > } I think we need additional statistics for tracking oom kills due to timer expiry. > > @@ -2007,6 +2021,7 @@ again: > refill_stock(mem, csize - PAGE_SIZE); > css_put(&mem->css); > done: > + mem->oom_kill_delay_expired = false; > *memcg = mem; > return 0; > nomem: > @@ -4053,6 +4068,29 @@ static int mem_cgroup_oom_control_write(struct cgroup *cgrp, > return 0; > } > > +static u64 mem_cgroup_oom_delay_read(struct cgroup *cgrp, struct cftype *cft) > +{ > + struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp); > + > + return jiffies_to_msecs(memcg->oom_kill_delay); > +} > + > +static int mem_cgroup_oom_delay_write(struct cgroup *cgrp, struct cftype *cft, > + u64 val) > +{ > + struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp); > + > + /* Sanity check -- don't wait longer than an hour */ > + if (val > (60 * 60 * 1000)) > + return -EINVAL; Why do this and not document it? These sort of things get exremely confusing. 
I would prefer not to have it without a reasonable use case or very good documentation explaining why we need an upper bound. > + > + cgroup_lock(); > + memcg->oom_kill_delay = msecs_to_jiffies(val); > + memcg_oom_recover(memcg); > + cgroup_unlock(); > + return 0; > +} > + > static struct cftype mem_cgroup_files[] = { > { > .name = "usage_in_bytes", > @@ -4116,6 +4154,11 @@ static struct cftype mem_cgroup_files[] = { > .unregister_event = mem_cgroup_oom_unregister_event, > .private = MEMFILE_PRIVATE(_OOM_TYPE, OOM_CONTROL), > }, > + { > + .name = "oom_delay", > + .read_u64 = mem_cgroup_oom_delay_read, > + .write_u64 = mem_cgroup_oom_delay_write, > + }, > }; > > #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP > > @@ -4357,6 +4400,7 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont) > parent = mem_cgroup_from_cont(cont->parent); > mem->use_hierarchy = parent->use_hierarchy; > mem->oom_kill_disable = parent->oom_kill_disable; > + mem->oom_kill_delay = parent->oom_kill_delay; > } > > if (parent && parent->use_hierarchy) { -- Three Cheers, Balbir
* Re: [patch] memcg: add oom killer delay 2010-12-25 10:47 ` [patch] " Balbir Singh @ 2010-12-26 20:35 ` David Rientjes 0 siblings, 0 replies; 22+ messages in thread From: David Rientjes @ 2010-12-26 20:35 UTC (permalink / raw) To: Balbir Singh Cc: Andrew Morton, Daisuke Nishimura, KAMEZAWA Hiroyuki, Divyesh Shah, linux-mm On Sat, 25 Dec 2010, Balbir Singh wrote: > > Completely disabling the oom killer for a memcg is problematic if > > userspace is unable to address the condition itself, usually because > > userspace is unresponsive. This scenario creates a memcg livelock: > > tasks are continuously trying to allocate memory and nothing is getting > > killed, so memory freeing is impossible since reclaim has failed, and > > all work stalls with no remedy in sight. > > > > This patch adds an oom killer delay so that a memcg may be configured to > > wait at least a pre-defined number of milliseconds before calling the > > oom killer. If the oom condition persists for this number of > > milliseconds, the oom killer will be called the next time the memory > > controller attempts to charge a page (and memory.oom_control is set to > > 0). This allows userspace to have a short period of time to respond to > > the condition before timing out and deferring to the kernel to kill a > > task. > > > > Admins may set the oom killer timeout using the new interface: > > > > # echo 60000 > memory.oom_delay > > > > This will defer oom killing to the kernel only after 60 seconds has > > elapsed. When setting memory.oom_delay, all pending timeouts are > > restarted. > > I think Paul mentioned this problem and solution (I think already in > use at google) some time back. Is x milliseconds a per oom kill decision > timer or is it global? It's global for that memcg and works for all oom kills in that memcg until changed by userspace. The example given in the changelog of 60 seconds would be for a use case where we're only concerned about deadlock because nothing can allocate memory, the oom killer is disabled, and userspace fails to act for whatever reason. It's necessary because of how memcg handles oom disabling compared to OOM_DISABLE: if all tasks are OOM_DISABLE for a system-wide oom then the machine panics, whereas memcg will simply deadlock for memory.oom_control == 0. [ Your email is in response to the first version of the patch that doesn't have the changes requested by Andrew, but this response will be in terms of how it is implemented in v2. ]
> > > > +It is also possible to configure an oom killer timeout to prevent the > > +possibility that the memcg will livelock looking for memory if userspace > > +has disabled the oom killer with oom_control but cannot act to fix the > > +condition itself (usually because userspace has become unresponsive). > > + > > +To set an oom killer timeout for a memcg, write the number of milliseconds > > +to wait before killing a task to memory.oom_delay: > > + > > + # echo 60000 > memory.oom_delay # wait 60 seconds, then oom kill > > + > > +This timeout is reset the next time the memcg successfully charges memory > > +to a task. > > + > > + > > 11. TODO > > > > 1. Add support for accounting huge pages (as a separate controller) > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > > --- a/mm/memcontrol.c > > +++ b/mm/memcontrol.c > > @@ -233,12 +233,16 @@ struct mem_cgroup { > > * Should the accounting and control be hierarchical, per subtree? > > */ > > bool use_hierarchy; > > + /* oom_delay has expired and still out of memory? */ > > + bool oom_kill_delay_expired; > > atomic_t oom_lock; > > atomic_t refcnt; > > > > unsigned int swappiness; > > /* OOM-Killer disable */ > > int oom_kill_disable; > > + /* min number of ticks to wait before calling oom killer */ > > + int oom_kill_delay; > > > > /* set when res.limit == memsw.limit */ > > bool memsw_is_minimum; > > @@ -1524,6 +1528,7 @@ static void memcg_wakeup_oom(struct mem_cgroup *mem) > > > > static void memcg_oom_recover(struct mem_cgroup *mem) > > { > > + mem->oom_kill_delay_expired = false; > > if (mem && atomic_read(&mem->oom_lock)) > > memcg_wakeup_oom(mem); > > } > > @@ -1531,17 +1536,18 @@ static void memcg_oom_recover(struct mem_cgroup *mem) > > /* > > * try to call OOM killer. returns false if we should exit memory-reclaim loop. > > */ > > -bool mem_cgroup_handle_oom(struct mem_cgroup *mem, gfp_t mask) > > +static bool mem_cgroup_handle_oom(struct mem_cgroup *mem, gfp_t mask) > > { > > struct oom_wait_info owait; > > - bool locked, need_to_kill; > > + bool locked; > > + bool need_to_kill = true; > > + bool need_to_delay = false; > > > > owait.mem = mem; > > owait.wait.flags = 0; > > owait.wait.func = memcg_oom_wake_function; > > owait.wait.private = current; > > INIT_LIST_HEAD(&owait.wait.task_list); > > - need_to_kill = true; > > /* At first, try to OOM lock hierarchy under mem.*/ > > mutex_lock(&memcg_oom_mutex); > > locked = mem_cgroup_oom_lock(mem); > > @@ -1553,26 +1559,34 @@ bool mem_cgroup_handle_oom(struct mem_cgroup *mem, gfp_t mask) > > prepare_to_wait(&memcg_oom_waitq, &owait.wait, TASK_KILLABLE); > > if (!locked || mem->oom_kill_disable) > > need_to_kill = false; > > - if (locked) > > + if (locked) { > > mem_cgroup_oom_notify(mem); > > + if (mem->oom_kill_delay && !mem->oom_kill_delay_expired) { > > + need_to_kill = false; > > + need_to_delay = true; > > + } > > + } > > mutex_unlock(&memcg_oom_mutex); > > > > if (need_to_kill) { > > finish_wait(&memcg_oom_waitq, &owait.wait); > > mem_cgroup_out_of_memory(mem, mask); > > } else { > > - schedule(); > > + schedule_timeout(need_to_delay ? 
mem->oom_kill_delay : > > + MAX_SCHEDULE_TIMEOUT); > > finish_wait(&memcg_oom_waitq, &owait.wait); > > } > > mutex_lock(&memcg_oom_mutex); > > mem_cgroup_oom_unlock(mem); > > memcg_wakeup_oom(mem); > > + mem->oom_kill_delay_expired = need_to_delay; > > mutex_unlock(&memcg_oom_mutex); > > > > if (test_thread_flag(TIF_MEMDIE) || fatal_signal_pending(current)) > > return false; > > /* Give chance to dying process */ > > - schedule_timeout(1); > > + if (!need_to_delay) > > + schedule_timeout(1); > > return true; > > } > I think we need additional statistics for tracking oom kills due to timer expiry. I can add an export for how many times memory.oom_delay_millisecs expired and allowed an oom kill, if you'd like. If you have a special use case for that, please let me know and I'll put it in the changelog. > > @@ -2007,6 +2021,7 @@ again: > > refill_stock(mem, csize - PAGE_SIZE); > > css_put(&mem->css); > > done: > > + mem->oom_kill_delay_expired = false; > > *memcg = mem; > > return 0; > > nomem: > > @@ -4053,6 +4068,29 @@ static int mem_cgroup_oom_control_write(struct cgroup *cgrp, > > return 0; > > } > > > > +static u64 mem_cgroup_oom_delay_read(struct cgroup *cgrp, struct cftype *cft) > > +{ > > + struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp); > > + > > + return jiffies_to_msecs(memcg->oom_kill_delay); > > +} > > + > > +static int mem_cgroup_oom_delay_write(struct cgroup *cgrp, struct cftype *cft, > > + u64 val) > > +{ > > + struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp); > > + > > + /* Sanity check -- don't wait longer than an hour */ > > + if (val > (60 * 60 * 1000)) > > + return -EINVAL; > Why do this and not document it? These sorts of things get extremely > confusing. I would prefer not to have it without a reasonable use case > or very good documentation explaining why we need an upper bound. The upper bound is necessary, although not specifically at 60 minutes, because mem_cgroup_oom_delay_write() takes a u64 and memcg->oom_delay_millisecs is an int. I don't think it makes sense for a memcg to deadlock and fail all memory allocations for over an hour, so it's a fairly arbitrary value. I'll document it in the change to Documentation/cgroups/memory.txt. Thanks for the review!
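The truncation point made here is easy to demonstrate in isolation with a few lines of userspace C (HZ=1000 is assumed so milliseconds and jiffies coincide; the kernel's int field is only mimicked here, not used):

    #include <limits.h>
    #include <stdio.h>

    int main(void)
    {
            unsigned long long val = 3000000000ULL; /* ~34.7 days in ms */
            int oom_delay = (int)val;               /* the patch stores an int */

            /* 3000000000 does not fit in a 32-bit int, so the value wraps */
            printf("stored delay = %d\n", oom_delay);       /* negative on the
                                                               usual 32-bit int */
            printf("INT_MAX ms is about %d days\n",
                   INT_MAX / (1000 * 60 * 60 * 24));        /* ~24 days */
            return 0;
    }

So without some cap, a large write would silently wrap to a negative or bogus delay; whether the cap sits at an hour or near the 31-bit limit Kame prefers is then purely a policy choice.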