* [patch] memcg: add oom killer delay @ 2010-12-22 7:27 David Rientjes 2010-12-22 7:59 ` Andrew Morton 2010-12-25 10:47 ` [patch] " Balbir Singh 0 siblings, 2 replies; 22+ messages in thread From: David Rientjes @ 2010-12-22 7:27 UTC (permalink / raw) To: Andrew Morton Cc: Balbir Singh, Daisuke Nishimura, KAMEZAWA Hiroyuki, Divyesh Shah, linux-mm Completely disabling the oom killer for a memcg is problematic if userspace is unable to address the condition itself, usually because userspace is unresponsive. This scenario creates a memcg livelock: tasks are continuously trying to allocate memory and nothing is getting killed, so memory freeing is impossible since reclaim has failed, and all work stalls with no remedy in sight. This patch adds an oom killer delay so that a memcg may be configured to wait at least a pre-defined number of milliseconds before calling the oom killer. If the oom condition persists for this number of milliseconds, the oom killer will be called the next time the memory controller attempts to charge a page (and memory.oom_control is set to 0). This allows userspace to have a short period of time to respond to the condition before timing out and deferring to the kernel to kill a task. Admins may set the oom killer timeout using the new interface: # echo 60000 > memory.oom_delay This will defer oom killing to the kernel only after 60 seconds has elapsed. When setting memory.oom_delay, all pending timeouts are restarted. Signed-off-by: David Rientjes <rientjes@google.com> --- Documentation/cgroups/memory.txt | 15 ++++++++++ mm/memcontrol.c | 56 +++++++++++++++++++++++++++++++++---- 2 files changed, 65 insertions(+), 6 deletions(-) diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt --- a/Documentation/cgroups/memory.txt +++ b/Documentation/cgroups/memory.txt @@ -68,6 +68,7 @@ Brief summary of control files. (See sysctl's vm.swappiness) memory.move_charge_at_immigrate # set/show controls of moving charges memory.oom_control # set/show oom controls. + memory.oom_delay # set/show millisecs to wait before oom kill 1. History @@ -640,6 +641,20 @@ At reading, current status of OOM is shown. under_oom 0 or 1 (if 1, the memory cgroup is under OOM, tasks may be stopped.) +It is also possible to configure an oom killer timeout to prevent the +possibility that the memcg will livelock looking for memory if userspace +has disabled the oom killer with oom_control but cannot act to fix the +condition itself (usually because userspace has become unresponsive). + +To set an oom killer timeout for a memcg, write the number of milliseconds +to wait before killing a task to memory.oom_delay: + + # echo 60000 > memory.oom_delay # wait 60 seconds, then oom kill + +This timeout is reset the next time the memcg successfully charges memory +to a task. + + 11. TODO 1. Add support for accounting huge pages (as a separate controller) diff --git a/mm/memcontrol.c b/mm/memcontrol.c --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -233,12 +233,16 @@ struct mem_cgroup { * Should the accounting and control be hierarchical, per subtree? */ bool use_hierarchy; + /* oom_delay has expired and still out of memory? 
*/ + bool oom_kill_delay_expired; atomic_t oom_lock; atomic_t refcnt; unsigned int swappiness; /* OOM-Killer disable */ int oom_kill_disable; + /* min number of ticks to wait before calling oom killer */ + int oom_kill_delay; /* set when res.limit == memsw.limit */ bool memsw_is_minimum; @@ -1524,6 +1528,7 @@ static void memcg_wakeup_oom(struct mem_cgroup *mem) static void memcg_oom_recover(struct mem_cgroup *mem) { + mem->oom_kill_delay_expired = false; if (mem && atomic_read(&mem->oom_lock)) memcg_wakeup_oom(mem); } @@ -1531,17 +1536,18 @@ static void memcg_oom_recover(struct mem_cgroup *mem) /* * try to call OOM killer. returns false if we should exit memory-reclaim loop. */ -bool mem_cgroup_handle_oom(struct mem_cgroup *mem, gfp_t mask) +static bool mem_cgroup_handle_oom(struct mem_cgroup *mem, gfp_t mask) { struct oom_wait_info owait; - bool locked, need_to_kill; + bool locked; + bool need_to_kill = true; + bool need_to_delay = false; owait.mem = mem; owait.wait.flags = 0; owait.wait.func = memcg_oom_wake_function; owait.wait.private = current; INIT_LIST_HEAD(&owait.wait.task_list); - need_to_kill = true; /* At first, try to OOM lock hierarchy under mem.*/ mutex_lock(&memcg_oom_mutex); locked = mem_cgroup_oom_lock(mem); @@ -1553,26 +1559,34 @@ bool mem_cgroup_handle_oom(struct mem_cgroup *mem, gfp_t mask) prepare_to_wait(&memcg_oom_waitq, &owait.wait, TASK_KILLABLE); if (!locked || mem->oom_kill_disable) need_to_kill = false; - if (locked) + if (locked) { mem_cgroup_oom_notify(mem); + if (mem->oom_kill_delay && !mem->oom_kill_delay_expired) { + need_to_kill = false; + need_to_delay = true; + } + } mutex_unlock(&memcg_oom_mutex); if (need_to_kill) { finish_wait(&memcg_oom_waitq, &owait.wait); mem_cgroup_out_of_memory(mem, mask); } else { - schedule(); + schedule_timeout(need_to_delay ? 
mem->oom_kill_delay : + MAX_SCHEDULE_TIMEOUT); finish_wait(&memcg_oom_waitq, &owait.wait); } mutex_lock(&memcg_oom_mutex); mem_cgroup_oom_unlock(mem); memcg_wakeup_oom(mem); + mem->oom_kill_delay_expired = need_to_delay; mutex_unlock(&memcg_oom_mutex); if (test_thread_flag(TIF_MEMDIE) || fatal_signal_pending(current)) return false; /* Give chance to dying process */ - schedule_timeout(1); + if (!need_to_delay) + schedule_timeout(1); return true; } @@ -2007,6 +2021,7 @@ again: refill_stock(mem, csize - PAGE_SIZE); css_put(&mem->css); done: + mem->oom_kill_delay_expired = false; *memcg = mem; return 0; nomem: @@ -4053,6 +4068,29 @@ static int mem_cgroup_oom_control_write(struct cgroup *cgrp, return 0; } +static u64 mem_cgroup_oom_delay_read(struct cgroup *cgrp, struct cftype *cft) +{ + struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp); + + return jiffies_to_msecs(memcg->oom_kill_delay); +} + +static int mem_cgroup_oom_delay_write(struct cgroup *cgrp, struct cftype *cft, + u64 val) +{ + struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp); + + /* Sanity check -- don't wait longer than an hour */ + if (val > (60 * 60 * 1000)) + return -EINVAL; + + cgroup_lock(); + memcg->oom_kill_delay = msecs_to_jiffies(val); + memcg_oom_recover(memcg); + cgroup_unlock(); + return 0; +} + static struct cftype mem_cgroup_files[] = { { .name = "usage_in_bytes", @@ -4116,6 +4154,11 @@ static struct cftype mem_cgroup_files[] = { .unregister_event = mem_cgroup_oom_unregister_event, .private = MEMFILE_PRIVATE(_OOM_TYPE, OOM_CONTROL), }, + { + .name = "oom_delay", + .read_u64 = mem_cgroup_oom_delay_read, + .write_u64 = mem_cgroup_oom_delay_write, + }, }; #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP @@ -4357,6 +4400,7 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont) parent = mem_cgroup_from_cont(cont->parent); mem->use_hierarchy = parent->use_hierarchy; mem->oom_kill_disable = parent->oom_kill_disable; + mem->oom_kill_delay = parent->oom_kill_delay; } if (parent && parent->use_hierarchy) {
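As a usage sketch of the interface above (the /cgroup mount point and the "jobs" group name are illustrative assumptions; memory.oom_control is left at its default of 0 so that the kernel may still kill once the delay expires):

  # mount -t cgroup -o memory none /cgroup
  # mkdir /cgroup/jobs
  # echo $$ > /cgroup/jobs/tasks
  # echo 100M > /cgroup/jobs/memory.limit_in_bytes
  # echo 60000 > /cgroup/jobs/memory.oom_delay

A userspace handler then has up to 60 seconds after each oom to raise the limit, free memory, or kill a task itself before the kernel oom killer steps in on the next failed charge.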
* Re: [patch] memcg: add oom killer delay From: Andrew Morton @ 2010-12-22 7:59 UTC To: David Rientjes Cc: Balbir Singh, Daisuke Nishimura, KAMEZAWA Hiroyuki, Divyesh Shah, linux-mm On Tue, 21 Dec 2010 23:27:25 -0800 (PST) David Rientjes <rientjes@google.com> wrote: > Completely disabling the oom killer for a memcg is problematic if > userspace is unable to address the condition itself, usually because > userspace is unresponsive. This scenario creates a memcg livelock: > tasks are continuously trying to allocate memory and nothing is getting > killed, so memory freeing is impossible since reclaim has failed, and > all work stalls with no remedy in sight. Userspace was buggy, surely. If userspace has elected to disable the oom-killer then it should ensure that it can cope with the ensuing result. One approach might be to run a mlockall()ed watchdog which monitors the worker tasks via shared memory. Another approach would be to run that watchdog in a different memcg, without mlockall(). There are surely plenty of other ways of doing it. > This patch adds an oom killer delay so that a memcg may be configured to > wait at least a pre-defined number of milliseconds before calling the > oom killer. If the oom condition persists for this number of > milliseconds, the oom killer will be called the next time the memory > controller attempts to charge a page (and memory.oom_control is set to > 0). This allows userspace to have a short period of time to respond to > the condition before timing out and deferring to the kernel to kill a > task. > > Admins may set the oom killer timeout using the new interface: > > # echo 60000 > memory.oom_delay > > This will defer oom killing to the kernel only after 60 seconds has > elapsed. When setting memory.oom_delay, all pending timeouts are > restarted. > eww, ick ick ick. Minutiae: - changelog and docs forgot to mention that oom_delay=0 disables. - it's called oom_kill_delay in the kernel and oom_delay in userspace. - oom_delay_millisecs would be a better name for the pseudo file. - Also, ick.
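A minimal sketch of the first watchdog Andrew describes; the shared-memory path, the heartbeat layout, and the recovery actions are all assumptions for illustration. Note that even a locked watchdog must avoid making new allocations that charge against the wedged memcg (system() below forks, for instance), which is what makes the separate-memcg variant the more robust option:

  #include <fcntl.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <time.h>
  #include <unistd.h>
  #include <sys/mman.h>

  struct heartbeat { volatile time_t last_alive; };  /* workers update this */

  int main(void)
  {
          struct heartbeat *hb;
          int fd;

          /* Pin current and future mappings so the watchdog itself never
             waits on reclaim while the memcg is out of memory. */
          if (mlockall(MCL_CURRENT | MCL_FUTURE)) {
                  perror("mlockall");
                  return 1;
          }
          fd = open("/dev/shm/worker-heartbeat", O_RDWR);  /* hypothetical */
          if (fd < 0) {
                  perror("open");
                  return 1;
          }
          hb = mmap(NULL, sizeof(*hb), PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
          if (hb == MAP_FAILED) {
                  perror("mmap");
                  return 1;
          }
          for (;;) {
                  sleep(10);
                  if (time(NULL) - hb->last_alive > 60) {
                          /* Workers look wedged on an oom stall: hand the
                             decision back to the kernel, then resize the
                             limit so sleeping tasks retry their charge. */
                          system("echo 0 > /cgroup/jobs/memory.oom_control");
                          system("echo 110M > /cgroup/jobs/memory.limit_in_bytes");
                  }
          }
  }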
* Re: [patch] memcg: add oom killer delay From: KAMEZAWA Hiroyuki @ 2010-12-22 8:17 UTC To: Andrew Morton Cc: David Rientjes, Balbir Singh, Daisuke Nishimura, Divyesh Shah, linux-mm On Tue, 21 Dec 2010 23:59:24 -0800 Andrew Morton <akpm@linux-foundation.org> wrote: > On Tue, 21 Dec 2010 23:27:25 -0800 (PST) David Rientjes <rientjes@google.com> wrote: > > > Completely disabling the oom killer for a memcg is problematic if > > userspace is unable to address the condition itself, usually because > > userspace is unresponsive. This scenario creates a memcg livelock: > > tasks are continuously trying to allocate memory and nothing is getting > > killed, so memory freeing is impossible since reclaim has failed, and > > all work stalls with no remedy in sight. > > Userspace was buggy, surely. If userspace has elected to disable the > oom-killer then it should ensure that it can cope with the ensuing result. > > One approach might be to run a mlockall()ed watchdog which monitors the > worker tasks via shared memory. Another approach would be to run that > watchdog in a different memcg, without mlockall(). There are surely > plenty of other ways of doing it. > > > This patch adds an oom killer delay so that a memcg may be configured to > > wait at least a pre-defined number of milliseconds before calling the > > oom killer. If the oom condition persists for this number of > > milliseconds, the oom killer will be called the next time the memory > > controller attempts to charge a page (and memory.oom_control is set to > > 0). This allows userspace to have a short period of time to respond to > > the condition before timing out and deferring to the kernel to kill a > > task. > > > > Admins may set the oom killer timeout using the new interface: > > > > # echo 60000 > memory.oom_delay > > > > This will defer oom killing to the kernel only after 60 seconds has > > elapsed. When setting memory.oom_delay, all pending timeouts are > > restarted. > > > > eww, ick ick ick. > > > Minutea: > > - changelog and docs forgot to mention that oom_delay=0 disables. > > - it's called oom_kill_delay in the kernel and oom_delay in userspace. > > - oom_delay_millisecs would be a better name for the pseudo file. > > - Also, ick. > seems to be hard to use. No one can estimate "milliseconds" for avoiding an OOM kill. I think this is very bad. Nack to this feature itself. If you want something smart _in kernel_, please implement the following. - When hit oom, enlarge the limit to some extent. - All processes in the cgroup should be stopped. - A helper application will be called by usermode_helper(). - When the helper application exits, automatically release all processes to run again. Then, you can avoid the oom-kill situation automatically with the kernel's help. BTW, don't call cgroup_lock(). It's always dangerous. You can add your own lock. Thanks, -Kame
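A rough kernel-side sketch of the helper step in that list, built on the existing call_usermodehelper() API from <linux/kmod.h>; the helper path is hypothetical, and the stop/release steps around it are left out:

  #include <linux/kmod.h>

  static int memcg_run_oom_helper(struct mem_cgroup *mem)
  {
          /* hypothetical helper; a real version would pass the cgroup path */
          char *argv[] = { "/sbin/memcg_oom_helper", NULL };
          char *envp[] = { "HOME=/", "PATH=/sbin:/usr/sbin:/bin:/usr/bin", NULL };

          /* UMH_WAIT_PROC blocks until the helper exits, matching the
             "release all processes when the helper exits" step above */
          return call_usermodehelper(argv[0], argv, envp, UMH_WAIT_PROC);
  }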
* Re: [patch] memcg: add oom killer delay From: KOSAKI Motohiro @ 2010-12-22 8:31 UTC To: KAMEZAWA Hiroyuki Cc: kosaki.motohiro, Andrew Morton, David Rientjes, Balbir Singh, Daisuke Nishimura, Divyesh Shah, linux-mm > seems to be hard to use. No one can estimate "milisecond" for avoidling > OOM-kill. I think this is very bad. Nack to this feature itself. > > > If you want something smart _in kernel_, please implement followings. > > - When hit oom, enlarge limit to some extent. > - All processes in cgroup should be stopped. > - A helper application will be called by usermode_helper(). > - When a helper application exit(), automatically release all processes > to run again. > > Then, you can avoid oom-kill situation in automatic with kernel's help. I bet a monitor using a different memcg is the simplest thing.
* Re: [patch] memcg: add oom killer delay 2010-12-22 8:17 ` KAMEZAWA Hiroyuki 2010-12-22 8:31 ` KOSAKI Motohiro @ 2010-12-22 8:48 ` David Rientjes 2010-12-22 8:48 ` KAMEZAWA Hiroyuki 1 sibling, 1 reply; 22+ messages in thread From: David Rientjes @ 2010-12-22 8:48 UTC (permalink / raw) To: KAMEZAWA Hiroyuki Cc: Andrew Morton, Balbir Singh, Daisuke Nishimura, Divyesh Shah, linux-mm On Wed, 22 Dec 2010, KAMEZAWA Hiroyuki wrote: > seems to be hard to use. No one can estimate "milisecond" for avoidling > OOM-kill. I think this is very bad. Nack to this feature itself. > There's no estimation that is really needed, we simply need to be able to stall long enough that we'll eventually kill "something" if userspace fails to act. > If you want something smart _in kernel_, please implement followings. > > - When hit oom, enlarge limit to some extent. > - All processes in cgroup should be stopped. > - A helper application will be called by usermode_helper(). > - When a helper application exit(), automatically release all processes > to run again. > Hmm, that's a _lot_ of policy to be implemented in the kernel itself and comes at the cost of either being faulty (if the limit cannot be increased) or harmful (when increasing the limit is detrimental to other memcg). -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom policy in Canada: sign http://dissolvethecrtc.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [patch] memcg: add oom killer delay From: KAMEZAWA Hiroyuki @ 2010-12-22 8:48 UTC To: David Rientjes Cc: Andrew Morton, Balbir Singh, Daisuke Nishimura, Divyesh Shah, linux-mm On Wed, 22 Dec 2010 00:48:53 -0800 (PST) David Rientjes <rientjes@google.com> wrote: > On Wed, 22 Dec 2010, KAMEZAWA Hiroyuki wrote: > > > seems to be hard to use. No one can estimate "milisecond" for avoidling > > OOM-kill. I think this is very bad. Nack to this feature itself. > > > > There's no estimation that is really needed, we simply need to be able to > stall long enough that we'll eventually kill "something" if userspace > fails to act. > Why do we have to think of usermode failures from misconfiguration or user-mode bugs? That is usually the work of middleware. Please make libcgroup or libvirt more useful. > > If you want something smart _in kernel_, please implement followings. > > > > - When hit oom, enlarge limit to some extent. > > - All processes in cgroup should be stopped. > > - A helper application will be called by usermode_helper(). > > - When a helper application exit(), automatically release all processes > > to run again. > > > > Hmm, that's a _lot_ of policy to be implemented in the kernel itself and > comes at the cost of either being faulty (if the limit cannot be > increased) or harmful (when increasing the limit is detrimental to other > memcg). > Or run the helper in the "root" cgroup, which has no limit at all. Thanks, -Kame
* Re: [patch] memcg: add oom killer delay From: KAMEZAWA Hiroyuki @ 2010-12-22 8:55 UTC To: KAMEZAWA Hiroyuki Cc: David Rientjes, Andrew Morton, Balbir Singh, Daisuke Nishimura, Divyesh Shah, linux-mm On Wed, 22 Dec 2010 17:48:29 +0900 KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote: > On Wed, 22 Dec 2010 00:48:53 -0800 (PST) > David Rientjes <rientjes@google.com> wrote: > > > On Wed, 22 Dec 2010, KAMEZAWA Hiroyuki wrote: > > > > > seems to be hard to use. No one can estimate "milisecond" for avoidling > > > OOM-kill. I think this is very bad. Nack to this feature itself. > > > > > > > There's no estimation that is really needed, we simply need to be able to > > stall long enough that we'll eventually kill "something" if userspace > > fails to act. > > > > Why we have to think of usermode failure by mis configuration or user mode bug ? > It's a work of Middleware in usual. For example, an oom_check_deadlockd can work as: 1. disable oom by memory.oom_disable=1 2. check memory.oom_notify and wait on it by poll() 3. At oom, it wakes up. 4. wait for 60 secs. 5. If the cgroup is still in OOM, set oom_disable=0 This daemon will not use much memory and can run in the /root memory cgroup. Thanks, -Kame
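A sketch of that daemon in C, assuming the eventfd-based notification registered through cgroup.event_control against memory.oom_control (which I take to be the oom_notify mechanism above); the /cgroup/jobs path is illustrative and error handling is elided:

  #include <fcntl.h>
  #include <stdint.h>
  #include <stdio.h>
  #include <string.h>
  #include <unistd.h>
  #include <sys/eventfd.h>

  int main(void)
  {
          char buf[32];
          uint64_t event;
          int efd = eventfd(0, 0);
          int ofd = open("/cgroup/jobs/memory.oom_control", O_RDWR);
          int cfd = open("/cgroup/jobs/cgroup.event_control", O_WRONLY);

          /* steps 1-2: oom killing is already disabled (oom_control = 1);
             register the eventfd so we are notified of oom */
          snprintf(buf, sizeof(buf), "%d %d", efd, ofd);
          write(cfd, buf, strlen(buf));

          for (;;) {
                  read(efd, &event, sizeof(event));  /* step 3: blocks until oom */
                  sleep(60);                         /* step 4 */
                  /* step 5: still under oom (re-reading under_oom from
                     memory.oom_control is elided here): let the kernel kill */
                  write(ofd, "0", 1);
          }
  }

As David points out in the next message, on this kernel the tasks already sleeping on the oom waitqueue are not woken merely by flipping oom_control back to 0, so a sixth step that frees memory, moves a task, or resizes the limit is still needed.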
* Re: [patch] memcg: add oom killer delay From: David Rientjes @ 2010-12-22 9:21 UTC To: KAMEZAWA Hiroyuki Cc: Andrew Morton, Balbir Singh, Daisuke Nishimura, Divyesh Shah, linux-mm On Wed, 22 Dec 2010, KAMEZAWA Hiroyuki wrote: > For example. oom_check_deadlockd can work as > > 1. disable oom by memory.oom_disable=1 > 2. check memory.oom_notify and wait it by poll() > 3. At oom, it wakes up. > 4. wait for 60 secs. > 5. If the cgroup is still in OOM, set oom_disalble=0 > > This daemon will not use much memory and can run in /roog memory cgroup. > Yes, this is almost the same as the "simple and perfect implementation" that I alluded to in my response to Andrew (and I think KOSAKI-san suggested something similar), although it doesn't quite work because all threads in the cgroup are sitting on the waitqueue and don't get woken up to see oom_control == 0 unless memory is freed, a task is moved, or the limit is resized, so this daemon will need to trigger that as step #6. That certainly works if it is indeed perfect and guaranteed to always be running. In the interest of a robust resource isolation model, I don't think we can ever make that conclusion, though, so this discussion is really only about how fault tolerant the kernel is because the end result is that if this daemon fails, the kernel livelocks. I'd personally prefer not to let a buggy or imperfect userspace allow the kernel to livelock; we control the kernel so I think it would be best to ensure that it cannot livelock no matter what userspace happens to do despite its best effort. If you or Andrew come to the conclusion that it's overkill and at the end of the day we have to trust userspace, I really can't argue with that philosophy though :)
* Re: [patch] memcg: add oom killer delay From: KAMEZAWA Hiroyuki @ 2010-12-27 1:47 UTC To: David Rientjes Cc: Andrew Morton, Balbir Singh, Daisuke Nishimura, Divyesh Shah, linux-mm On Wed, 22 Dec 2010 01:21:01 -0800 (PST) David Rientjes <rientjes@google.com> wrote: > On Wed, 22 Dec 2010, KAMEZAWA Hiroyuki wrote: > > > For example. oom_check_deadlockd can work as > > > > 1. disable oom by memory.oom_disable=1 > > 2. check memory.oom_notify and wait it by poll() > > 3. At oom, it wakes up. > > 4. wait for 60 secs. > > 5. If the cgroup is still in OOM, set oom_disalble=0 > > > > This daemon will not use much memory and can run in /roog memory cgroup. > > > > Yes, this is almost the same as the "simple and perfect implementation" > that I eluded to in my response to Andrew (and I think KOSAKI-san > suggested something similiar), although it doesn't quite work because all > threads in the cgroup are sitting on the waitqueue and don't get woken up > to see oom_control == 0 unless memory is freed, a task is moved, or the > limit is resized so this daemon will need to trigger that as step #6. > > That certainly works if it is indeed perfect and guaranteed to always be > running. In the interest of a robust resource isolation model, I don't > think we can ever make that conclusion, though, so this discussion is > really only about how fault tolerant the kernel is because the end result > is if this daemon fails, the kernel livelocks. > > I'd personally prefer not to allow a buggy or imperfect userspace to allow > the kernel to livelock; we control the kernel so I think it would be best > to ensure that it cannot livelock no matter what userspace happens to do > despite its best effort. If you or Andrew come to the conclusion that > it's overkill and at the end of the day we have to trust userspace, I > really can't argue that philosophy though :) > It's not a livelock. A user can create a new thread in a cgroup not under OOM. IMHO, oom-kill itself is totally of no use, and panic_on_oom or stopping the system is always good. We can do cluster keep-alive. If I were you, I'd add a "pool of memory for emergency" cgroup and run watchdog tasks in it. Thanks, -Kame
* Re: [patch] memcg: add oom killer delay 2010-12-22 8:48 ` KAMEZAWA Hiroyuki 2010-12-22 8:55 ` KAMEZAWA Hiroyuki @ 2010-12-22 9:04 ` David Rientjes 1 sibling, 0 replies; 22+ messages in thread From: David Rientjes @ 2010-12-22 9:04 UTC (permalink / raw) To: KAMEZAWA Hiroyuki Cc: Andrew Morton, Balbir Singh, Daisuke Nishimura, Divyesh Shah, linux-mm On Wed, 22 Dec 2010, KAMEZAWA Hiroyuki wrote: > > > seems to be hard to use. No one can estimate "milisecond" for avoidling > > > OOM-kill. I think this is very bad. Nack to this feature itself. > > > > > > > There's no estimation that is really needed, we simply need to be able to > > stall long enough that we'll eventually kill "something" if userspace > > fails to act. > > > > Why we have to think of usermode failure by mis configuration or user mode bug ? > It's a work of Middleware in usual. > Please make libcgroup or libvirt more useful. > It's a general concern for users who wish to defer the kernel oom killer unless userspace chooses not to act or cannot act and the only way to do that without memory.oom_delay is to set all memcgs to have memory.oom_control of 1. memory.oom_control of 1 is equivalent to OOM_DISABLE for all attached tasks and if all tasks are assigned non-root memcg for resource isolation (and the sum of those memcgs' limits equals system RAM), we always get memcg oom kills instead of system wide oom kills. The difference in this case is that with the memcg oom kills, the kernel livelocks whereas the system wide oom kills would panic the machine since all eligible tasks are OOM_DISABLE, the equivalent of all memcgs having memory.oom_control of 1. Since the kernel has opened this possibility up by disabling oom killing without giving userspace any other chance of deferring the oom killer, we need a way to preserve the machine by having a fallback plan if userspace cannot act. The other possibility would be to panic if all memcgs have memory.oom_control of 1 and the sum of their limits equals the machine's memory capacity. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom policy in Canada: sign http://dissolvethecrtc.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [patch] memcg: add oom killer delay 2010-12-22 7:59 ` Andrew Morton 2010-12-22 8:17 ` KAMEZAWA Hiroyuki @ 2010-12-22 8:42 ` David Rientjes 2010-12-22 22:45 ` [patch v2] " David Rientjes 1 sibling, 1 reply; 22+ messages in thread From: David Rientjes @ 2010-12-22 8:42 UTC (permalink / raw) To: Andrew Morton Cc: Balbir Singh, Daisuke Nishimura, KAMEZAWA Hiroyuki, Divyesh Shah, linux-mm On Tue, 21 Dec 2010, Andrew Morton wrote: > > Completely disabling the oom killer for a memcg is problematic if > > userspace is unable to address the condition itself, usually because > > userspace is unresponsive. This scenario creates a memcg livelock: > > tasks are continuously trying to allocate memory and nothing is getting > > killed, so memory freeing is impossible since reclaim has failed, and > > all work stalls with no remedy in sight. > > Userspace was buggy, surely. If userspace has elected to disable the > oom-killer then it should ensure that it can cope with the ensuing result. > I think it would be argued that no such guarantee can ever be made. > One approach might be to run a mlockall()ed watchdog which monitors the > worker tasks via shared memory. Another approach would be to run that > watchdog in a different memcg, without mlockall(). There are surely > plenty of other ways of doing it. > Yeah, we considered a simple and perfect userspace implementation that would be as fault tolerant unless it ends up getting killed (not by the oom killer) or dies itself, but there was a concern that setting every memcg to have oom_control of 0 could render the entire kernel useless without the help of userspace and that is a bad policy. In our particular use case, we _always_ want to defer using the kernel oom killer unless userspace chooses not to act (because the limit is already high enough) or cannot act (because of a bug). The former is accomplished by setting memory.oom_control to 0 originally and then setting it to 1 for that particular memcg to allow the oom kill, but it is not possible for the latter. > Minutea: > > - changelog and docs forgot to mention that oom_delay=0 disables. > I thought it would be intuitive that an oom_delay of 0 would mean there was no delay :) > - it's called oom_kill_delay in the kernel and oom_delay in userspace. > Right, this was because of the symmetry to the oom_kill_disable naming in the struct itself. I'd be happy to change it if we're to go ahead in this direction. > - oom_delay_millisecs would be a better name for the pseudo file. > Agreed. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom policy in Canada: sign http://dissolvethecrtc.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 22+ messages in thread
* [patch v2] memcg: add oom killer delay 2010-12-22 8:42 ` David Rientjes @ 2010-12-22 22:45 ` David Rientjes 2010-12-27 0:52 ` KAMEZAWA Hiroyuki 0 siblings, 1 reply; 22+ messages in thread From: David Rientjes @ 2010-12-22 22:45 UTC (permalink / raw) To: Andrew Morton Cc: Balbir Singh, Daisuke Nishimura, KAMEZAWA Hiroyuki, Divyesh Shah, linux-mm Completely disabling the oom killer for a memcg is problematic if userspace is unable to address the condition itself, usually because userspace is unresponsive. This scenario creates a memcg livelock: tasks are continuously trying to allocate memory and nothing is getting killed, so memory freeing is impossible since reclaim has failed, and all work stalls with no remedy in sight. This patch adds an oom killer delay so that a memcg may be configured to wait at least a pre-defined number of milliseconds before calling the oom killer. If the oom condition persists for this number of milliseconds, the oom killer will be called the next time the memory controller attempts to charge a page (and memory.oom_control is set to 0). This allows userspace to have a short period of time to respond to the condition before timing out and deferring to the kernel to kill a task. Admins may set the oom killer timeout using the new interface: # echo 60000 > memory.oom_delay This will defer oom killing to the kernel only after 60 seconds has elapsed. When setting memory.oom_delay, all pending timeouts are restarted. Signed-off-by: David Rientjes <rientjes@google.com> --- v2 of the patch to address your suggestions -- if we _really_ want to leave the kernel open to the possibility of livelock as the result of a userspace bug, then this doesn't need to be merged. Otherwise, it would be nice to get this support for a more robust memory controller. Documentation/cgroups/memory.txt | 17 +++++++++++ mm/memcontrol.c | 55 +++++++++++++++++++++++++++++++++---- 2 files changed, 66 insertions(+), 6 deletions(-) diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt --- a/Documentation/cgroups/memory.txt +++ b/Documentation/cgroups/memory.txt @@ -68,6 +68,7 @@ Brief summary of control files. (See sysctl's vm.swappiness) memory.move_charge_at_immigrate # set/show controls of moving charges memory.oom_control # set/show oom controls. + memory.oom_delay_millisecs # set/show millisecs to wait before oom kill 1. History @@ -640,6 +641,22 @@ At reading, current status of OOM is shown. under_oom 0 or 1 (if 1, the memory cgroup is under OOM, tasks may be stopped.) +It is also possible to configure an oom killer timeout to prevent the +possibility that the memcg will livelock looking for memory if userspace +has disabled the oom killer with oom_control but cannot act to fix the +condition itself (usually because userspace has become unresponsive). + +To set an oom killer timeout for a memcg, write the number of milliseconds +to wait before killing a task to memory.oom_delay_millisecs: + + # echo 60000 > memory.oom_delay_millisecs # 60 seconds before kill + +This timeout is reset the next time the memcg successfully charges memory +to a task. + +There is no delay if memory.oom_delay_millisecs is set to 0 (default). + + 11. TODO 1. Add support for accounting huge pages (as a separate controller) diff --git a/mm/memcontrol.c b/mm/memcontrol.c --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -233,12 +233,16 @@ struct mem_cgroup { * Should the accounting and control be hierarchical, per subtree? 
*/ bool use_hierarchy; + /* oom_delay has expired and still out of memory? */ + bool oom_delay_expired; atomic_t oom_lock; atomic_t refcnt; unsigned int swappiness; /* OOM-Killer disable */ int oom_kill_disable; + /* number of ticks to stall before calling oom killer */ + int oom_delay; /* set when res.limit == memsw.limit */ bool memsw_is_minimum; @@ -1524,6 +1528,7 @@ static void memcg_wakeup_oom(struct mem_cgroup *mem) static void memcg_oom_recover(struct mem_cgroup *mem) { + mem->oom_delay_expired = false; if (mem && atomic_read(&mem->oom_lock)) memcg_wakeup_oom(mem); } @@ -1531,17 +1536,18 @@ static void memcg_oom_recover(struct mem_cgroup *mem) /* * try to call OOM killer. returns false if we should exit memory-reclaim loop. */ -bool mem_cgroup_handle_oom(struct mem_cgroup *mem, gfp_t mask) +static bool mem_cgroup_handle_oom(struct mem_cgroup *mem, gfp_t mask) { struct oom_wait_info owait; - bool locked, need_to_kill; + bool locked; + bool need_to_kill = true; + bool need_to_delay = false; owait.mem = mem; owait.wait.flags = 0; owait.wait.func = memcg_oom_wake_function; owait.wait.private = current; INIT_LIST_HEAD(&owait.wait.task_list); - need_to_kill = true; /* At first, try to OOM lock hierarchy under mem.*/ mutex_lock(&memcg_oom_mutex); locked = mem_cgroup_oom_lock(mem); @@ -1553,26 +1559,34 @@ bool mem_cgroup_handle_oom(struct mem_cgroup *mem, gfp_t mask) prepare_to_wait(&memcg_oom_waitq, &owait.wait, TASK_KILLABLE); if (!locked || mem->oom_kill_disable) need_to_kill = false; - if (locked) + if (locked) { mem_cgroup_oom_notify(mem); + if (mem->oom_delay && !mem->oom_delay_expired) { + need_to_kill = false; + need_to_delay = true; + } + } mutex_unlock(&memcg_oom_mutex); if (need_to_kill) { finish_wait(&memcg_oom_waitq, &owait.wait); mem_cgroup_out_of_memory(mem, mask); } else { - schedule(); + schedule_timeout(need_to_delay ? 
mem->oom_delay : + MAX_SCHEDULE_TIMEOUT); finish_wait(&memcg_oom_waitq, &owait.wait); } mutex_lock(&memcg_oom_mutex); mem_cgroup_oom_unlock(mem); memcg_wakeup_oom(mem); + mem->oom_delay_expired = need_to_delay; mutex_unlock(&memcg_oom_mutex); if (test_thread_flag(TIF_MEMDIE) || fatal_signal_pending(current)) return false; /* Give chance to dying process */ - schedule_timeout(1); + if (!need_to_delay) + schedule_timeout(1); return true; } @@ -2007,6 +2021,7 @@ again: refill_stock(mem, csize - PAGE_SIZE); css_put(&mem->css); done: + mem->oom_delay_expired = false; *memcg = mem; return 0; nomem: @@ -4053,6 +4068,28 @@ static int mem_cgroup_oom_control_write(struct cgroup *cgrp, return 0; } +static u64 mem_cgroup_oom_delay_millisecs_read(struct cgroup *cgrp, + struct cftype *cft) +{ + struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp); + + return jiffies_to_msecs(memcg->oom_delay); +} + +static int mem_cgroup_oom_delay_millisecs_write(struct cgroup *cgrp, + struct cftype *cft, u64 val) +{ + struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp); + + /* Sanity check -- don't wait longer than an hour */ + if (val > (60 * 60 * 1000)) + return -EINVAL; + + memcg->oom_delay = msecs_to_jiffies(val); + memcg_oom_recover(memcg); + return 0; +} + static struct cftype mem_cgroup_files[] = { { .name = "usage_in_bytes", @@ -4116,6 +4153,11 @@ static struct cftype mem_cgroup_files[] = { .unregister_event = mem_cgroup_oom_unregister_event, .private = MEMFILE_PRIVATE(_OOM_TYPE, OOM_CONTROL), }, + { + .name = "oom_delay_millisecs", + .read_u64 = mem_cgroup_oom_delay_millisecs_read, + .write_u64 = mem_cgroup_oom_delay_millisecs_write, + }, }; #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP @@ -4357,6 +4399,7 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont) parent = mem_cgroup_from_cont(cont->parent); mem->use_hierarchy = parent->use_hierarchy; mem->oom_kill_disable = parent->oom_kill_disable; + mem->oom_delay = parent->oom_delay; } if (parent && parent->use_hierarchy) { -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom policy in Canada: sign http://dissolvethecrtc.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [patch v2] memcg: add oom killer delay From: KAMEZAWA Hiroyuki @ 2010-12-27 0:52 UTC To: David Rientjes Cc: Andrew Morton, Balbir Singh, Daisuke Nishimura, Divyesh Shah, linux-mm On Wed, 22 Dec 2010 14:45:05 -0800 (PST) David Rientjes <rientjes@google.com> wrote: > Completely disabling the oom killer for a memcg is problematic if > userspace is unable to address the condition itself, usually because > userspace is unresponsive. This scenario creates a memcg livelock: > tasks are continuously trying to allocate memory and nothing is getting > killed, so memory freeing is impossible since reclaim has failed, and > all work stalls with no remedy in sight. > > This patch adds an oom killer delay so that a memcg may be configured to > wait at least a pre-defined number of milliseconds before calling the > oom killer. If the oom condition persists for this number of > milliseconds, the oom killer will be called the next time the memory > controller attempts to charge a page (and memory.oom_control is set to > 0). This allows userspace to have a short period of time to respond to > the condition before timing out and deferring to the kernel to kill a > task. > > Admins may set the oom killer timeout using the new interface: > > # echo 60000 > memory.oom_delay > > This will defer oom killing to the kernel only after 60 seconds has > elapsed. When setting memory.oom_delay, all pending timeouts are > restarted. > > Signed-off-by: David Rientjes <rientjes@google.com> I dislike this feature, but if someone other than Google wants this, I'll ack. Some comments below. > --- > v2 of the patch to address your suggestions -- if we _really_ want to > leave the kernel open to the possibility of livelock as the result of > a userspace bug, then this doesn't need to be merged. Otherwise, it > would be nice to get this support for a more robust memory controller. > > Documentation/cgroups/memory.txt | 17 +++++++++++ > mm/memcontrol.c | 55 +++++++++++++++++++++++++++++++++---- > 2 files changed, 66 insertions(+), 6 deletions(-) > > diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt > --- a/Documentation/cgroups/memory.txt > +++ b/Documentation/cgroups/memory.txt > @@ -68,6 +68,7 @@ Brief summary of control files. > (See sysctl's vm.swappiness) > memory.move_charge_at_immigrate # set/show controls of moving charges > memory.oom_control # set/show oom controls. > + memory.oom_delay_millisecs # set/show millisecs to wait before oom kill > > 1. History > > @@ -640,6 +641,22 @@ At reading, current status of OOM is shown. > under_oom 0 or 1 (if 1, the memory cgroup is under OOM, tasks may > be stopped.) > > +It is also possible to configure an oom killer timeout to prevent the > +possibility that the memcg will livelock looking for memory if userspace It's not livelock. It's just 'stop'. No CPU consumption at all even if oom is disabled. > +has disabled the oom killer with oom_control but cannot act to fix the > +condition itself (usually because userspace has become unresponsive). 
> + > +To set an oom killer timeout for a memcg, write the number of milliseconds > +to wait before killing a task to memory.oom_delay_millisecs: > + > + # echo 60000 > memory.oom_delay_millisecs # 60 seconds before kill > + I wonder whether this should be called oom_delay; you mention this feature as a 'timeout' a few times before here. I like 'timeout' rather than 'delay'. And from this document, these points are unclear: 1. what happens when it is used with oom_disable. 2. what kind of timer is this. Is it a one-shot timer? 3. how does it work with hierarchy? My suggestion for 1. is: Please return -EBUSY or similar if oom_disable=true and allow setting the timeout only when oom_disable=false. Using both of the two interfaces at the same time is too complex. > +This timeout is reset the next time the memcg successfully charges memory > +to a task. > + > +There is no delay if memory.oom_delay_millisecs is set to 0 (default). > + > + > 11. TODO > > 1. Add support for accounting huge pages (as a separate controller) > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -233,12 +233,16 @@ struct mem_cgroup { > * Should the accounting and control be hierarchical, per subtree? > */ > bool use_hierarchy; > + /* oom_delay has expired and still out of memory? */ > + bool oom_delay_expired; > atomic_t oom_lock; > atomic_t refcnt; > > unsigned int swappiness; > /* OOM-Killer disable */ > int oom_kill_disable; > + /* number of ticks to stall before calling oom killer */ > + int oom_delay; > > /* set when res.limit == memsw.limit */ > bool memsw_is_minimum; > @@ -1524,6 +1528,7 @@ static void memcg_wakeup_oom(struct mem_cgroup *mem) > > static void memcg_oom_recover(struct mem_cgroup *mem) > { > + mem->oom_delay_expired = false; > if (mem && atomic_read(&mem->oom_lock)) > memcg_wakeup_oom(mem); > } > @@ -1531,17 +1536,18 @@ static void memcg_oom_recover(struct mem_cgroup *mem) > /* > * try to call OOM killer. returns false if we should exit memory-reclaim loop. > */ > -bool mem_cgroup_handle_oom(struct mem_cgroup *mem, gfp_t mask) > +static bool mem_cgroup_handle_oom(struct mem_cgroup *mem, gfp_t mask) > { > struct oom_wait_info owait; > - bool locked, need_to_kill; > + bool locked; > + bool need_to_kill = true; > + bool need_to_delay = false; > > owait.mem = mem; > owait.wait.flags = 0; > owait.wait.func = memcg_oom_wake_function; > owait.wait.private = current; > INIT_LIST_HEAD(&owait.wait.task_list); > - need_to_kill = true; > /* At first, try to OOM lock hierarchy under mem.*/ > mutex_lock(&memcg_oom_mutex); > locked = mem_cgroup_oom_lock(mem); > @@ -1553,26 +1559,34 @@ bool mem_cgroup_handle_oom(struct mem_cgroup *mem, gfp_t mask) > prepare_to_wait(&memcg_oom_waitq, &owait.wait, TASK_KILLABLE); > if (!locked || mem->oom_kill_disable) > need_to_kill = false; > - if (locked) > + if (locked) { > mem_cgroup_oom_notify(mem); > + if (mem->oom_delay && !mem->oom_delay_expired) { > + need_to_kill = false; > + need_to_delay = true; > + } > + } Hmm. When threads T1 and T2 enter this routine, it seems broken.

Case 1)
	T1				T2
	lock_oom.
	locked=true			lock_oom.
	oom_notify()			locked = false.
	wait for msecs.			wait until wakeup.
	......
	unlock_oom.
	wakes up.
	wake up all threads.
	oom_delay_expired=true.		wakes up.
					oom_delay_expired=false.

					2nd call of oom.
					lock_oom.
					locked=true.
					oom_notify.
					wait for msecs.

Then, oom_notify is duplicated and no OOM happens. memcg_wakeup_oom() wakes up all threads. So, I guess you should avoid calling that. But hmm...I think there are other pitfalls. 
Assume a hierarchy like this:

	     A
	    / \
	   B   C

A.memory.use_hierarchy = 1 and (A, B, C) are under hierarchical control. At first, oom_disable is allowed to be set only against A. By setting oom_disable to A, OOM in B and C is disabled, too. For that purpose, mem_cgroup_oom_lock/unlock is provided. With your patch, even if oom_delay is set to A, B and C will never delay. Please fix. > mutex_unlock(&memcg_oom_mutex); > > if (need_to_kill) { > finish_wait(&memcg_oom_waitq, &owait.wait); > mem_cgroup_out_of_memory(mem, mask); > } else { > - schedule(); > + schedule_timeout(need_to_delay ? mem->oom_delay : > + MAX_SCHEDULE_TIMEOUT); > finish_wait(&memcg_oom_waitq, &owait.wait); > } > mutex_lock(&memcg_oom_mutex); > mem_cgroup_oom_unlock(mem); > memcg_wakeup_oom(mem); > + mem->oom_delay_expired = need_to_delay; If someone charges successfully, this oom_delay_expired will turn back to false. I think this is not good....the race is complicated. If I were you, I'd add a function like memcg_oom_recover() to update the status. (You need something like that for supporting hierarchy, I think.) > mutex_unlock(&memcg_oom_mutex); > > if (test_thread_flag(TIF_MEMDIE) || fatal_signal_pending(current)) > return false; > /* Give chance to dying process */ > - schedule_timeout(1); > + if (!need_to_delay) > + schedule_timeout(1); This is unnecessary. After mem_cgroup_oom_unlock(), memcg_wakeup_oom() is called and all threads wake up. I think all threads other than the TIF_MEMDIE one should sleep for a while. > return true; > } > > @@ -2007,6 +2021,7 @@ again: > refill_stock(mem, csize - PAGE_SIZE); > css_put(&mem->css); > done: > + mem->oom_delay_expired = false; Don't add this to the 'fast path'. This is too bad. Have a good new year. Thanks, -Kame
* Re: [patch v2] memcg: add oom killer delay From: David Rientjes @ 2010-12-28 5:22 UTC To: KAMEZAWA Hiroyuki Cc: Andrew Morton, Balbir Singh, Daisuke Nishimura, Divyesh Shah, linux-mm On Mon, 27 Dec 2010, KAMEZAWA Hiroyuki wrote: > I dislike this feature but if someone other than goole want this, I'll ack. > some comments below. > I don't really understand that; if you think the feature is useful for the described use-case with minimal impact to memcg, then I'd hope you would ack it without needing to build a coalition of companies to ask for the feature to be merged. We're going to use it regardless of whether it's merged upstream, but I think there are other use cases where it could potentially be helpful: for example, to avoid a quick oom kill when there's a temporary spike in memory usage over time without modifying the memcg's limit. It seems plausible that a user would want to defer oom killing only if a memcg is oom for a certain amount of time. That's distinct and in addition to our use case, which is to not rely solely on userspace to solve these memcg deadlocks when it fails to respond appropriately and in a timely manner. > > diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt > > --- a/Documentation/cgroups/memory.txt > > +++ b/Documentation/cgroups/memory.txt > > @@ -68,6 +68,7 @@ Brief summary of control files. > > (See sysctl's vm.swappiness) > > memory.move_charge_at_immigrate # set/show controls of moving charges > > memory.oom_control # set/show oom controls. > > + memory.oom_delay_millisecs # set/show millisecs to wait before oom kill > > > > 1. History > > > > @@ -640,6 +641,22 @@ At reading, current status of OOM is shown. > > under_oom 0 or 1 (if 1, the memory cgroup is under OOM, tasks may > > be stopped.) > > > > +It is also possible to configure an oom killer timeout to prevent the > > +possibility that the memcg will livelock looking for memory if userspace > > It's not livelock. It's just 'stop'. No cpu consumption at all even if oom is > disabled. > I did s/livelock/deadlock. > > +has disabled the oom killer with oom_control but cannot act to fix the > > +condition itself (usually because userspace has become unresponsive). > > + > > +To set an oom killer timeout for a memcg, write the number of milliseconds > > +to wait before killing a task to memory.oom_delay_millisecs: > > + > > + # echo 60000 > memory.oom_delay_millisecs # 60 seconds before kill > > + > > I wonder whether this should be call as oom_delay you mention this feature as > 'timeout' a few times before here. I like 'timeout' rather than 'delay'. > And from this ducument, They are unclear that > 1. what happens when it used with oom_disable. > 2. what kind of timer is this. Is it a one-shot timer ? > 3. how work with hierarchy ? > These are three very good questions regarding my (currently inadequate) documentation and I'll add the answers to v3. I'd prefer to keep the name oom_delay_millisecs, however, since "delay" means oom killing will be deferred rather than "timeout," which means oom killing has taken too long and expired. > My suggestion for 1. is: > Please return -EBUSY or some if oom_disable=true and allow set timeout only when > oom_disable=false. Using both of two interface at the same time is too complex. 
> I added documentation that states that memory.oom_delay_millisecs is only effective if memory.oom_control == 0. I think configuring both should be allowed and memory.oom_control == 1 takes precedence anytime it is set; if it is cleared later, then any existing memory.oom_delay_millisecs becomes effective. > > +This timeout is reset the next time the memcg successfully charges memory > > +to a task. > > + > > +There is no delay if memory.oom_delay_millisecs is set to 0 (default). > > + > > + > > 11. TODO > > > > 1. Add support for accounting huge pages (as a separate controller) > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > > --- a/mm/memcontrol.c > > +++ b/mm/memcontrol.c > > @@ -233,12 +233,16 @@ struct mem_cgroup { > > * Should the accounting and control be hierarchical, per subtree? > > */ > > bool use_hierarchy; > > + /* oom_delay has expired and still out of memory? */ > > + bool oom_delay_expired; > > atomic_t oom_lock; > > atomic_t refcnt; > > > > unsigned int swappiness; > > /* OOM-Killer disable */ > > int oom_kill_disable; > > + /* number of ticks to stall before calling oom killer */ > > + int oom_delay; > > > > /* set when res.limit == memsw.limit */ > > bool memsw_is_minimum; > > @@ -1524,6 +1528,7 @@ static void memcg_wakeup_oom(struct mem_cgroup *mem) > > > > static void memcg_oom_recover(struct mem_cgroup *mem) > > { > > + mem->oom_delay_expired = false; > > if (mem && atomic_read(&mem->oom_lock)) > > memcg_wakeup_oom(mem); > > } > > @@ -1531,17 +1536,18 @@ static void memcg_oom_recover(struct mem_cgroup *mem) > > /* > > * try to call OOM killer. returns false if we should exit memory-reclaim loop. > > */ > > -bool mem_cgroup_handle_oom(struct mem_cgroup *mem, gfp_t mask) > > +static bool mem_cgroup_handle_oom(struct mem_cgroup *mem, gfp_t mask) > > { > > struct oom_wait_info owait; > > - bool locked, need_to_kill; > > + bool locked; > > + bool need_to_kill = true; > > + bool need_to_delay = false; > > > > owait.mem = mem; > > owait.wait.flags = 0; > > owait.wait.func = memcg_oom_wake_function; > > owait.wait.private = current; > > INIT_LIST_HEAD(&owait.wait.task_list); > > - need_to_kill = true; > > /* At first, try to OOM lock hierarchy under mem.*/ > > mutex_lock(&memcg_oom_mutex); > > locked = mem_cgroup_oom_lock(mem); > > @@ -1553,26 +1559,34 @@ bool mem_cgroup_handle_oom(struct mem_cgroup *mem, gfp_t mask) > > prepare_to_wait(&memcg_oom_waitq, &owait.wait, TASK_KILLABLE); > > if (!locked || mem->oom_kill_disable) > > need_to_kill = false; > > - if (locked) > > + if (locked) { > > mem_cgroup_oom_notify(mem); > > + if (mem->oom_delay && !mem->oom_delay_expired) { > > + need_to_kill = false; > > + need_to_delay = true; > > + } > > + } > > Hmm. When threads T1 and T2 enters this routine, it seems broken. > > Case 1) > T1 T2 > lock_oom. > locked=true lock_oom. > oom_notify() locked = false. > wait for msecs. wait until wakeup. > ...... > unlock_oom. > wakes up. > wake up all threads. > oom_delay_expired=true. wakes up. > oom_delay_expired=false. > > 2nd call of oom. > lock_oom. > locked=true. > oom_notify. > wait for msecs. > > Then, oom_notify is duplicated and no OOM happens. > memcg_wakeup_oom() wakes up all threads. So, I guess you should avoid to call > that. But hmm...I think there are other pitfalls. > > > Assume a hierachy as this. > > A > / \ > B C > > A.memory.use_hierarchy= 1 and (A,B,C) are under hierarchical control. > > At first, oom_disable is allowed to be set only against A. By setting > oom_disable to A, OOM in B and C are disabled, too. 
For that purpose, > mem_cgroup_oom_lock/unlock is provided. > > With your patch, even if oom_delay is set to A, B and C will never delay. > Please fix. > I think what might be clearer is if we made memory.oom_delay_millisecs a one-shot timer, as you mentioned previously. So once a memcg is oom, the task that grabs the oom_lock sleeps for the delay and then it is cleared. Later threads that are oom and grab oom_lock will immediately call the oom killer. Then, if userspace truly is dead, oom killing won't be delayed in the future and the kernel can take immediate action. And if userspace is alive, it can reset memory.oom_delay_millisecs. What do you think about that functionality instead? > Have a good new year. > Same to you, Kame!
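Roughly, the admin-visible behaviour of that one-shot proposal (as implemented in the v3 patch below, where the delay field is zeroed once consumed) would be:

  # echo 60000 > memory.oom_delay_millisecs    # arm a 60s delay
    ... first oom: tasks stall for up to 60 seconds, then the kernel
        kills; the delay is consumed in the process ...
  # cat memory.oom_delay_millisecs
  0
  # echo 60000 > memory.oom_delay_millisecs    # re-arm for the next oom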
* [patch v3] memcg: add oom killer delay 2010-12-28 5:22 ` David Rientjes @ 2010-12-28 6:29 ` David Rientjes 2011-01-04 1:41 ` KAMEZAWA Hiroyuki 0 siblings, 1 reply; 22+ messages in thread From: David Rientjes @ 2010-12-28 6:29 UTC (permalink / raw) To: Andrew Morton Cc: KAMEZAWA Hiroyuki, Balbir Singh, Daisuke Nishimura, Divyesh Shah, linux-mm Completely disabling the oom killer for a memcg is problematic if userspace is unable to address the condition itself, usually because it is unresponsive. This scenario creates a memcg deadlock: tasks are sitting in TASK_KILLABLE waiting for the limit to be increased, a task to exit or move, or the oom killer reenabled and userspace is unable to do so. An additional possible use case is to defer oom killing within a memcg for a set period of time, probably to prevent unnecessary kills due to temporary memory spikes, before allowing the kernel to handle the condition. This patch adds an oom killer delay so that a memcg may be configured to wait at least a pre-defined number of milliseconds before calling the oom killer. If the oom condition persists for this number of milliseconds, the oom killer will be called the next time the memory controller attempts to charge a page (and memory.oom_control is set to 0). This allows userspace to have a short period of time to respond to the condition before deferring to the kernel to kill a task. Admins may set the oom killer delay using the new interface: # echo 60000 > memory.oom_delay_millisecs This will defer oom killing to the kernel only after 60 seconds has elapsed. When setting memory.oom_delay, all pending delays have their charge retried and, if necessary, the new delay is then effected. The delay is cleared the first time the memcg is oom to avoid unnecessary waiting when userspace is unresponsive for future oom conditions. It may be set again using the above interface to enforce a delay on the next oom. Signed-off-by: David Rientjes <rientjes@google.com> --- Documentation/cgroups/memory.txt | 26 +++++++++++++++++++++ mm/memcontrol.c | 46 ++++++++++++++++++++++++++++++++++--- 2 files changed, 68 insertions(+), 4 deletions(-) diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt index 7781857..e426733 100644 --- a/Documentation/cgroups/memory.txt +++ b/Documentation/cgroups/memory.txt @@ -68,6 +68,7 @@ Brief summary of control files. (See sysctl's vm.swappiness) memory.move_charge_at_immigrate # set/show controls of moving charges memory.oom_control # set/show oom controls. + memory.oom_delay_millisecs # set/show millisecs to wait before oom kill 1. History @@ -640,6 +641,31 @@ At reading, current status of OOM is shown. under_oom 0 or 1 (if 1, the memory cgroup is under OOM, tasks may be stopped.) +It is also possible to configure an oom killer timeout to prevent the +possibility that the memcg will deadlock looking for memory if userspace +has disabled the oom killer with oom_control but cannot act to fix the +condition itself (usually because userspace has become unresponsive). + +To set an oom killer timeout for a memcg, write the number of milliseconds +to wait before killing a task to memory.oom_delay_millisecs: + + # echo 60000 > memory.oom_delay_millisecs # 60 seconds before kill + +This timeout is reset the first time the memcg is oom to prevent needlessly +waiting for the next oom when userspace is truly unresponsive. It may be +set again using the above interface to defer killing a task the next time +the memcg is oom. 
+ +Disabling the oom killer for a memcg with memory.oom_control takes +precedence over memory.oom_delay_millisecs, so it must be set to 0 +(default) to allow the oom kill after the delay has expired. + +This value is inherited from the memcg's parent on creation. + +There is no delay if memory.oom_delay_millisecs is set to 0 (default). +This tunable's upper bound is 60 minutes. + + 11. TODO 1. Add support for accounting huge pages (as a separate controller) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index e6aadd6..951a22c 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -229,6 +229,8 @@ struct mem_cgroup { unsigned int swappiness; /* OOM-Killer disable */ int oom_kill_disable; + /* number of ticks to stall before calling oom killer */ + int oom_delay; /* set when res.limit == memsw.limit */ bool memsw_is_minimum; @@ -1415,10 +1417,11 @@ static void memcg_oom_recover(struct mem_cgroup *mem) /* * try to call OOM killer. returns false if we should exit memory-reclaim loop. */ -bool mem_cgroup_handle_oom(struct mem_cgroup *mem, gfp_t mask) +static bool mem_cgroup_handle_oom(struct mem_cgroup *mem, gfp_t mask) { struct oom_wait_info owait; bool locked, need_to_kill; + long timeout = MAX_SCHEDULE_TIMEOUT; owait.mem = mem; owait.wait.flags = 0; @@ -1437,15 +1440,21 @@ bool mem_cgroup_handle_oom(struct mem_cgroup *mem, gfp_t mask) prepare_to_wait(&memcg_oom_waitq, &owait.wait, TASK_KILLABLE); if (!locked || mem->oom_kill_disable) need_to_kill = false; - if (locked) + if (locked) { + if (mem->oom_delay) { + need_to_kill = false; + timeout = mem->oom_delay; + mem->oom_delay = 0; + } mem_cgroup_oom_notify(mem); + } mutex_unlock(&memcg_oom_mutex); if (need_to_kill) { finish_wait(&memcg_oom_waitq, &owait.wait); mem_cgroup_out_of_memory(mem, mask); } else { - schedule(); + schedule_timeout(timeout); finish_wait(&memcg_oom_waitq, &owait.wait); } mutex_lock(&memcg_oom_mutex); @@ -1456,7 +1465,8 @@ bool mem_cgroup_handle_oom(struct mem_cgroup *mem, gfp_t mask) if (test_thread_flag(TIF_MEMDIE) || fatal_signal_pending(current)) return false; /* Give chance to dying process */ - schedule_timeout(1); + if (timeout != MAX_SCHEDULE_TIMEOUT) + schedule_timeout(1); return true; } @@ -3863,6 +3873,28 @@ static int mem_cgroup_oom_control_write(struct cgroup *cgrp, return 0; } +static u64 mem_cgroup_oom_delay_millisecs_read(struct cgroup *cgrp, + struct cftype *cft) +{ + struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp); + + return jiffies_to_msecs(memcg->oom_delay); +} + +static int mem_cgroup_oom_delay_millisecs_write(struct cgroup *cgrp, + struct cftype *cft, u64 val) +{ + struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp); + + /* Sanity check -- don't wait longer than an hour */ + if (val > (60 * 60 * 1000)) + return -EINVAL; + + memcg->oom_delay = msecs_to_jiffies(val); + memcg_oom_recover(memcg); + return 0; +} + static struct cftype mem_cgroup_files[] = { { .name = "usage_in_bytes", @@ -3926,6 +3958,11 @@ static struct cftype mem_cgroup_files[] = { .unregister_event = mem_cgroup_oom_unregister_event, .private = MEMFILE_PRIVATE(_OOM_TYPE, OOM_CONTROL), }, + { + .name = "oom_delay_millisecs", + .read_u64 = mem_cgroup_oom_delay_millisecs_read, + .write_u64 = mem_cgroup_oom_delay_millisecs_write, + }, }; #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP @@ -4164,6 +4201,7 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont) parent = mem_cgroup_from_cont(cont->parent); mem->use_hierarchy = parent->use_hierarchy; mem->oom_kill_disable = parent->oom_kill_disable; + mem->oom_delay = 
parent->oom_delay; } if (parent && parent->use_hierarchy) {
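One detail worth noting in the hunk above: replacing schedule() with schedule_timeout(timeout) keeps the indefinite-wait case intact, because MAX_SCHEDULE_TIMEOUT is special-cased by the scheduler. Schematically, as a condensed paraphrase of kernel/timer.c from this era (timer setup and teardown elided, so this is not the verbatim source):

    signed long __sched schedule_timeout(signed long timeout)
    {
            unsigned long expire;

            if (timeout == MAX_SCHEDULE_TIMEOUT) {
                    schedule();             /* sleep with no timer armed */
                    return timeout;         /* the old schedule() behavior */
            }
            expire = timeout + jiffies;
            /* ... arm a one-shot timer for 'expire', then sleep ... */
            schedule();
            /* ... remove the timer on wakeup ... */
            timeout = expire - jiffies;
            return timeout < 0 ? 0 : timeout;
    }

So a memcg with no delay configured still waits exactly as before, while a configured delay simply bounds the sleep.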
* Re: [patch v3] memcg: add oom killer delay 2010-12-28 6:29 ` [patch v3] " David Rientjes @ 2011-01-04 1:41 ` KAMEZAWA Hiroyuki 2011-01-04 3:59 ` Balbir Singh 0 siblings, 1 reply; 22+ messages in thread From: KAMEZAWA Hiroyuki @ 2011-01-04 1:41 UTC (permalink / raw) To: David Rientjes Cc: Andrew Morton, Balbir Singh, Daisuke Nishimura, Divyesh Shah, linux-mm On Mon, 27 Dec 2010 22:29:05 -0800 (PST) David Rientjes <rientjes@google.com> wrote: > Completely disabling the oom killer for a memcg is problematic if > userspace is unable to address the condition itself, usually because it > is unresponsive. This scenario creates a memcg deadlock: tasks are > sitting in TASK_KILLABLE waiting for the limit to be increased, a task to > exit or move, or the oom killer reenabled and userspace is unable to do > so. > > An additional possible use case is to defer oom killing within a memcg > for a set period of time, probably to prevent unnecessary kills due to > temporary memory spikes, before allowing the kernel to handle the > condition. > > This patch adds an oom killer delay so that a memcg may be configured to > wait at least a pre-defined number of milliseconds before calling the oom > killer. If the oom condition persists for this number of milliseconds, > the oom killer will be called the next time the memory controller > attempts to charge a page (and memory.oom_control is set to 0). This > allows userspace to have a short period of time to respond to the > condition before deferring to the kernel to kill a task. > > Admins may set the oom killer delay using the new interface: > > # echo 60000 > memory.oom_delay_millisecs > > This will defer oom killing to the kernel only after 60 seconds has > elapsed. When setting memory.oom_delay, all pending delays have their > charge retried and, if necessary, the new delay is then effected. > > The delay is cleared the first time the memcg is oom to avoid unnecessary > waiting when userspace is unresponsive for future oom conditions. It may > be set again using the above interface to enforce a delay on the next > oom. > > Signed-off-by: David Rientjes <rientjes@google.com> Changelog please. > --- > Documentation/cgroups/memory.txt | 26 +++++++++++++++++++++ > mm/memcontrol.c | 46 ++++++++++++++++++++++++++++++++++--- > 2 files changed, 68 insertions(+), 4 deletions(-) > > diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt > index 7781857..e426733 100644 > --- a/Documentation/cgroups/memory.txt > +++ b/Documentation/cgroups/memory.txt > @@ -68,6 +68,7 @@ Brief summary of control files. > (See sysctl's vm.swappiness) > memory.move_charge_at_immigrate # set/show controls of moving charges > memory.oom_control # set/show oom controls. > + memory.oom_delay_millisecs # set/show millisecs to wait before oom kill > > 1. History > > @@ -640,6 +641,31 @@ At reading, current status of OOM is shown. > under_oom 0 or 1 (if 1, the memory cgroup is under OOM, tasks may > be stopped.) > > +It is also possible to configure an oom killer timeout to prevent the > +possibility that the memcg will deadlock looking for memory if userspace > +has disabled the oom killer with oom_control but cannot act to fix the > +condition itself (usually because userspace has become unresponsive). 
> + > +To set an oom killer timeout for a memcg, write the number of milliseconds > +to wait before killing a task to memory.oom_delay_millisecs: > + > + # echo 60000 > memory.oom_delay_millisecs # 60 seconds before kill > + > +This timeout is reset the first time the memcg is oom to prevent needlessly > +waiting for the next oom when userspace is truly unresponsive. It may be > +set again using the above interface to defer killing a task the next time > +the memcg is oom. > + > +Disabling the oom killer for a memcg with memory.oom_control takes > +precedence over memory.oom_delay_millisecs, so it must be set to 0 > +(default) to allow the oom kill after the delay has expired. > + > +This value is inherited from the memcg's parent on creation. > + > +There is no delay if memory.oom_delay_millisecs is set to 0 (default). > +This tunable's upper bound is 60 minutes. Why upper-bounds is 60 minutes ? Do we have to have a limit ? Hmm, I feel 60minutes is too short. I like 32 or 31 bit limit. > + > + > 11. TODO > > 1. Add support for accounting huge pages (as a separate controller) > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index e6aadd6..951a22c 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -229,6 +229,8 @@ struct mem_cgroup { > unsigned int swappiness; > /* OOM-Killer disable */ > int oom_kill_disable; > + /* number of ticks to stall before calling oom killer */ > + int oom_delay; > > /* set when res.limit == memsw.limit */ > bool memsw_is_minimum; > @@ -1415,10 +1417,11 @@ static void memcg_oom_recover(struct mem_cgroup *mem) > /* > * try to call OOM killer. returns false if we should exit memory-reclaim loop. > */ > -bool mem_cgroup_handle_oom(struct mem_cgroup *mem, gfp_t mask) > +static bool mem_cgroup_handle_oom(struct mem_cgroup *mem, gfp_t mask) > { > struct oom_wait_info owait; > bool locked, need_to_kill; > + long timeout = MAX_SCHEDULE_TIMEOUT; > > owait.mem = mem; > owait.wait.flags = 0; > @@ -1437,15 +1440,21 @@ bool mem_cgroup_handle_oom(struct mem_cgroup *mem, gfp_t mask) > prepare_to_wait(&memcg_oom_waitq, &owait.wait, TASK_KILLABLE); > if (!locked || mem->oom_kill_disable) > need_to_kill = false; > - if (locked) > + if (locked) { > + if (mem->oom_delay) { > + need_to_kill = false; > + timeout = mem->oom_delay; > + mem->oom_delay = 0; > + } > mem_cgroup_oom_notify(mem); > + } > mutex_unlock(&memcg_oom_mutex); > > if (need_to_kill) { > finish_wait(&memcg_oom_waitq, &owait.wait); > mem_cgroup_out_of_memory(mem, mask); > } else { > - schedule(); > + schedule_timeout(timeout); > finish_wait(&memcg_oom_waitq, &owait.wait); > } > mutex_lock(&memcg_oom_mutex); > @@ -1456,7 +1465,8 @@ bool mem_cgroup_handle_oom(struct mem_cgroup *mem, gfp_t mask) > if (test_thread_flag(TIF_MEMDIE) || fatal_signal_pending(current)) > return false; > /* Give chance to dying process */ > - schedule_timeout(1); > + if (timeout != MAX_SCHEDULE_TIMEOUT) != ? This seems to change existing behavior. 
> + schedule_timeout(1); > return true; > } > > @@ -3863,6 +3873,28 @@ static int mem_cgroup_oom_control_write(struct cgroup *cgrp, > return 0; > } > > +static u64 mem_cgroup_oom_delay_millisecs_read(struct cgroup *cgrp, > + struct cftype *cft) > +{ > + struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp); > + > + return jiffies_to_msecs(memcg->oom_delay); > +} > + > +static int mem_cgroup_oom_delay_millisecs_write(struct cgroup *cgrp, > + struct cftype *cft, u64 val) > +{ > + struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp); > + > + /* Sanity check -- don't wait longer than an hour */ > + if (val > (60 * 60 * 1000)) > + return -EINVAL; > + > + memcg->oom_delay = msecs_to_jiffies(val); > + memcg_oom_recover(memcg); > + return 0; > +} > + Please allow this only at the root of a sub-hierarchy with no children....(*) (please check how mem_cgroup_oom_lock/unlock() works under use_hierarchy=1) > static struct cftype mem_cgroup_files[] = { > { > .name = "usage_in_bytes", > @@ -3926,6 +3958,11 @@ static struct cftype mem_cgroup_files[] = { > .unregister_event = mem_cgroup_oom_unregister_event, > .private = MEMFILE_PRIVATE(_OOM_TYPE, OOM_CONTROL), > }, > + { > + .name = "oom_delay_millisecs", > + .read_u64 = mem_cgroup_oom_delay_millisecs_read, > + .write_u64 = mem_cgroup_oom_delay_millisecs_write, > + }, > }; > > #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP > > @@ -4164,6 +4201,7 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont) > parent = mem_cgroup_from_cont(cont->parent); > mem->use_hierarchy = parent->use_hierarchy; > mem->oom_kill_disable = parent->oom_kill_disable; > + mem->oom_delay = parent->oom_delay; Because of (*), oom_kill_disable can be copied here. If you want to inherit this, you should do (*) or update the value across the whole hierarchy. Thanks, -Kame
* Re: [patch v3] memcg: add oom killer delay 2011-01-04 1:41 ` KAMEZAWA Hiroyuki @ 2011-01-04 3:59 ` Balbir Singh 2011-01-06 1:53 ` KAMEZAWA Hiroyuki 0 siblings, 1 reply; 22+ messages in thread From: Balbir Singh @ 2011-01-04 3:59 UTC (permalink / raw) To: KAMEZAWA Hiroyuki Cc: David Rientjes, Andrew Morton, Daisuke Nishimura, Divyesh Shah, linux-mm * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2011-01-04 10:41:30]: > On Mon, 27 Dec 2010 22:29:05 -0800 (PST) > David Rientjes <rientjes@google.com> wrote: > > > Completely disabling the oom killer for a memcg is problematic if > > userspace is unable to address the condition itself, usually because it > > is unresponsive. This scenario creates a memcg deadlock: tasks are > > sitting in TASK_KILLABLE waiting for the limit to be increased, a task to > > exit or move, or the oom killer reenabled and userspace is unable to do > > so. > > > > An additional possible use case is to defer oom killing within a memcg > > for a set period of time, probably to prevent unnecessary kills due to > > temporary memory spikes, before allowing the kernel to handle the > > condition. > > > > This patch adds an oom killer delay so that a memcg may be configured to > > wait at least a pre-defined number of milliseconds before calling the oom > > killer. If the oom condition persists for this number of milliseconds, > > the oom killer will be called the next time the memory controller > > attempts to charge a page (and memory.oom_control is set to 0). This > > allows userspace to have a short period of time to respond to the > > condition before deferring to the kernel to kill a task. > > > > Admins may set the oom killer delay using the new interface: > > > > # echo 60000 > memory.oom_delay_millisecs > > > > This will defer oom killing to the kernel only after 60 seconds has > > elapsed. When setting memory.oom_delay, all pending delays have their > > charge retried and, if necessary, the new delay is then effected. > > > > The delay is cleared the first time the memcg is oom to avoid unnecessary > > waiting when userspace is unresponsive for future oom conditions. It may > > be set again using the above interface to enforce a delay on the next > > oom. > > > > Signed-off-by: David Rientjes <rientjes@google.com> > > Changelog please. > > > > --- > > Documentation/cgroups/memory.txt | 26 +++++++++++++++++++++ > > mm/memcontrol.c | 46 ++++++++++++++++++++++++++++++++++--- > > 2 files changed, 68 insertions(+), 4 deletions(-) > > > > diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt > > index 7781857..e426733 100644 > > --- a/Documentation/cgroups/memory.txt > > +++ b/Documentation/cgroups/memory.txt > > @@ -68,6 +68,7 @@ Brief summary of control files. > > (See sysctl's vm.swappiness) > > memory.move_charge_at_immigrate # set/show controls of moving charges > > memory.oom_control # set/show oom controls. > > + memory.oom_delay_millisecs # set/show millisecs to wait before oom kill > > > > 1. History > > > > @@ -640,6 +641,31 @@ At reading, current status of OOM is shown. > > under_oom 0 or 1 (if 1, the memory cgroup is under OOM, tasks may > > be stopped.) > > > > +It is also possible to configure an oom killer timeout to prevent the > > +possibility that the memcg will deadlock looking for memory if userspace > > +has disabled the oom killer with oom_control but cannot act to fix the > > +condition itself (usually because userspace has become unresponsive). 
> > + > > +To set an oom killer timeout for a memcg, write the number of milliseconds > > +to wait before killing a task to memory.oom_delay_millisecs: > > + > > + # echo 60000 > memory.oom_delay_millisecs # 60 seconds before kill > > + > > +This timeout is reset the first time the memcg is oom to prevent needlessly > > +waiting for the next oom when userspace is truly unresponsive. It may be > > +set again using the above interface to defer killing a task the next time > > +the memcg is oom. > > + > > +Disabling the oom killer for a memcg with memory.oom_control takes > > +precedence over memory.oom_delay_millisecs, so it must be set to 0 > > +(default) to allow the oom kill after the delay has expired. > > + > > +This value is inherited from the memcg's parent on creation. > > + > > +There is no delay if memory.oom_delay_millisecs is set to 0 (default). > > +This tunable's upper bound is 60 minutes. > > Why upper-bounds is 60 minutes ? Do we have to have a limit ? > Hmm, I feel 60minutes is too short. I like 32 or 31 bit limit. > I agree > > > + > > + > > 11. TODO > > > > 1. Add support for accounting huge pages (as a separate controller) > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > > index e6aadd6..951a22c 100644 > > --- a/mm/memcontrol.c > > +++ b/mm/memcontrol.c > > @@ -229,6 +229,8 @@ struct mem_cgroup { > > unsigned int swappiness; > > /* OOM-Killer disable */ > > int oom_kill_disable; > > + /* number of ticks to stall before calling oom killer */ > > + int oom_delay; > > > > /* set when res.limit == memsw.limit */ > > bool memsw_is_minimum; > > @@ -1415,10 +1417,11 @@ static void memcg_oom_recover(struct mem_cgroup *mem) > > /* > > * try to call OOM killer. returns false if we should exit memory-reclaim loop. > > */ > > -bool mem_cgroup_handle_oom(struct mem_cgroup *mem, gfp_t mask) > > +static bool mem_cgroup_handle_oom(struct mem_cgroup *mem, gfp_t mask) > > { > > struct oom_wait_info owait; > > bool locked, need_to_kill; > > + long timeout = MAX_SCHEDULE_TIMEOUT; > > > > owait.mem = mem; > > owait.wait.flags = 0; > > @@ -1437,15 +1440,21 @@ bool mem_cgroup_handle_oom(struct mem_cgroup *mem, gfp_t mask) > > prepare_to_wait(&memcg_oom_waitq, &owait.wait, TASK_KILLABLE); > > if (!locked || mem->oom_kill_disable) > > need_to_kill = false; > > - if (locked) > > + if (locked) { > > + if (mem->oom_delay) { > > + need_to_kill = false; > > + timeout = mem->oom_delay; > > + mem->oom_delay = 0; > > + } > > mem_cgroup_oom_notify(mem); > > + } > > mutex_unlock(&memcg_oom_mutex); > > > > if (need_to_kill) { > > finish_wait(&memcg_oom_waitq, &owait.wait); > > mem_cgroup_out_of_memory(mem, mask); > > } else { > > - schedule(); > > + schedule_timeout(timeout); > > finish_wait(&memcg_oom_waitq, &owait.wait); > > } > > mutex_lock(&memcg_oom_mutex); > > @@ -1456,7 +1465,8 @@ bool mem_cgroup_handle_oom(struct mem_cgroup *mem, gfp_t mask) > > if (test_thread_flag(TIF_MEMDIE) || fatal_signal_pending(current)) > > return false; > > /* Give chance to dying process */ > > - schedule_timeout(1); > > + if (timeout != MAX_SCHEDULE_TIMEOUT) > > != ? > > This seems to change existing behavior. > Ideally it should be "==", if oom_delay was never set, we want to schedule_timeout(1). BTW, the sched* makes me wonder by how much we increase the ctxsw rate, but I guess in the OOM path, we should not bother much. I'll do some testing around this. 
> > + schedule_timeout(1); > > return true; > > } > > > > @@ -3863,6 +3873,28 @@ static int mem_cgroup_oom_control_write(struct cgroup *cgrp, > > return 0; > > } > > > > +static u64 mem_cgroup_oom_delay_millisecs_read(struct cgroup *cgrp, > > + struct cftype *cft) > > +{ > > + struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp); > > + > > + return jiffies_to_msecs(memcg->oom_delay); > > +} > > + > > +static int mem_cgroup_oom_delay_millisecs_write(struct cgroup *cgrp, > > + struct cftype *cft, u64 val) > > +{ > > + struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp); > > + > > + /* Sanity check -- don't wait longer than an hour */ > > + if (val > (60 * 60 * 1000)) > > + return -EINVAL; > > + > > + memcg->oom_delay = msecs_to_jiffies(val); > > + memcg_oom_recover(memcg); > > + return 0; > > +} > > + > > Please allow this only at the root of a sub-hierarchy with no children....(*) > > (please check how mem_cgroup_oom_lock/unlock() works under use_hierarchy=1) > Kamezawa-San, I am not sure your comment is clear. Are you suggesting that, since memcg is the root of a hierarchy, we need to use hierarchical locking before changing the value of the root's oom_delay? > > static struct cftype mem_cgroup_files[] = { > > { > > .name = "usage_in_bytes", > > @@ -3926,6 +3958,11 @@ static struct cftype mem_cgroup_files[] = { > > .unregister_event = mem_cgroup_oom_unregister_event, > > .private = MEMFILE_PRIVATE(_OOM_TYPE, OOM_CONTROL), > > }, > > + { > > + .name = "oom_delay_millisecs", > > + .read_u64 = mem_cgroup_oom_delay_millisecs_read, > > + .write_u64 = mem_cgroup_oom_delay_millisecs_write, > > + }, > > }; > > > > #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP > > @@ -4164,6 +4201,7 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont) > > parent = mem_cgroup_from_cont(cont->parent); > > mem->use_hierarchy = parent->use_hierarchy; > > mem->oom_kill_disable = parent->oom_kill_disable; > > + mem->oom_delay = parent->oom_delay; > > Because of (*), oom_kill_disable can be copied here. > > If you want to inherit this, you should do (*) or update the value across the whole hierarchy. > Not sure I understand this either. I would ideally like to see these copied if use_hierarchy is set. > > Thanks, > -Kame > -- Three Cheers, Balbir
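To spell out the "==" versus "!=" point, the tail of mem_cgroup_handle_oom() under Balbir's suggested reading would be (a sketch for discussion, not a posted patch):

            if (test_thread_flag(TIF_MEMDIE) || fatal_signal_pending(current))
                    return false;
            /*
             * Give the dying process a chance, as the existing code
             * always did -- but skip the extra tick if we already
             * slept for an armed oom_delay above.
             */
            if (timeout == MAX_SCHEDULE_TIMEOUT)
                    schedule_timeout(1);
            return true;

With "!=" as posted, the yield happens only when a delay was armed, which inverts the existing behavior for the common no-delay case.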
* Re: [patch v3] memcg: add oom killer delay 2011-01-04 3:59 ` Balbir Singh @ 2011-01-06 1:53 ` KAMEZAWA Hiroyuki 2011-01-06 5:46 ` Balbir Singh 0 siblings, 1 reply; 22+ messages in thread From: KAMEZAWA Hiroyuki @ 2011-01-06 1:53 UTC (permalink / raw) To: balbir Cc: David Rientjes, Andrew Morton, Daisuke Nishimura, Divyesh Shah, linux-mm On Tue, 4 Jan 2011 09:29:57 +0530 Balbir Singh <balbir@linux.vnet.ibm.com> wrote: > * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2011-01-04 10:41:30]: > > > On Mon, 27 Dec 2010 22:29:05 -0800 (PST) > > David Rientjes <rientjes@google.com> wrote: > > > > > Completely disabling the oom killer for a memcg is problematic if > > > userspace is unable to address the condition itself, usually because it > > > is unresponsive. This scenario creates a memcg deadlock: tasks are > > > sitting in TASK_KILLABLE waiting for the limit to be increased, a task to > > > exit or move, or the oom killer reenabled and userspace is unable to do > > > so. > > > > > > An additional possible use case is to defer oom killing within a memcg > > > for a set period of time, probably to prevent unnecessary kills due to > > > temporary memory spikes, before allowing the kernel to handle the > > > condition. > > > > > > This patch adds an oom killer delay so that a memcg may be configured to > > > wait at least a pre-defined number of milliseconds before calling the oom > > > killer. If the oom condition persists for this number of milliseconds, > > > the oom killer will be called the next time the memory controller > > > attempts to charge a page (and memory.oom_control is set to 0). This > > > allows userspace to have a short period of time to respond to the > > > condition before deferring to the kernel to kill a task. > > > > > > Admins may set the oom killer delay using the new interface: > > > > > > # echo 60000 > memory.oom_delay_millisecs > > > > > > This will defer oom killing to the kernel only after 60 seconds has > > > elapsed. When setting memory.oom_delay, all pending delays have their > > > charge retried and, if necessary, the new delay is then effected. > > > > > > The delay is cleared the first time the memcg is oom to avoid unnecessary > > > waiting when userspace is unresponsive for future oom conditions. It may > > > be set again using the above interface to enforce a delay on the next > > > oom. > > > > > > Signed-off-by: David Rientjes <rientjes@google.com> > > > > Changelog please. > > > > > > > --- > > > Documentation/cgroups/memory.txt | 26 +++++++++++++++++++++ > > > mm/memcontrol.c | 46 ++++++++++++++++++++++++++++++++++--- > > > 2 files changed, 68 insertions(+), 4 deletions(-) > > > > > > diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt > > > index 7781857..e426733 100644 > > > --- a/Documentation/cgroups/memory.txt > > > +++ b/Documentation/cgroups/memory.txt > > > @@ -68,6 +68,7 @@ Brief summary of control files. > > > (See sysctl's vm.swappiness) > > > memory.move_charge_at_immigrate # set/show controls of moving charges > > > memory.oom_control # set/show oom controls. > > > + memory.oom_delay_millisecs # set/show millisecs to wait before oom kill > > > > > > 1. History > > > > > > @@ -640,6 +641,31 @@ At reading, current status of OOM is shown. > > > under_oom 0 or 1 (if 1, the memory cgroup is under OOM, tasks may > > > be stopped.) 
> > > > > > +It is also possible to configure an oom killer timeout to prevent the > > > +possibility that the memcg will deadlock looking for memory if userspace > > > +has disabled the oom killer with oom_control but cannot act to fix the > > > +condition itself (usually because userspace has become unresponsive). > > > + > > > +To set an oom killer timeout for a memcg, write the number of milliseconds > > > +to wait before killing a task to memory.oom_delay_millisecs: > > > + > > > + # echo 60000 > memory.oom_delay_millisecs # 60 seconds before kill > > > + > > > +This timeout is reset the first time the memcg is oom to prevent needlessly > > > +waiting for the next oom when userspace is truly unresponsive. It may be > > > +set again using the above interface to defer killing a task the next time > > > +the memcg is oom. > > > + > > > +Disabling the oom killer for a memcg with memory.oom_control takes > > > +precedence over memory.oom_delay_millisecs, so it must be set to 0 > > > +(default) to allow the oom kill after the delay has expired. > > > + > > > +This value is inherited from the memcg's parent on creation. > > > + > > > +There is no delay if memory.oom_delay_millisecs is set to 0 (default). > > > +This tunable's upper bound is 60 minutes. > > > > Why upper-bounds is 60 minutes ? Do we have to have a limit ? > > Hmm, I feel 60minutes is too short. I like 32 or 31 bit limit. > > > > I agree > > > > > > + > > > + > > > 11. TODO > > > > > > 1. Add support for accounting huge pages (as a separate controller) > > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > > > index e6aadd6..951a22c 100644 > > > --- a/mm/memcontrol.c > > > +++ b/mm/memcontrol.c > > > @@ -229,6 +229,8 @@ struct mem_cgroup { > > > unsigned int swappiness; > > > /* OOM-Killer disable */ > > > int oom_kill_disable; > > > + /* number of ticks to stall before calling oom killer */ > > > + int oom_delay; > > > > > > /* set when res.limit == memsw.limit */ > > > bool memsw_is_minimum; > > > @@ -1415,10 +1417,11 @@ static void memcg_oom_recover(struct mem_cgroup *mem) > > > /* > > > * try to call OOM killer. returns false if we should exit memory-reclaim loop. 
> > > */ > > > -bool mem_cgroup_handle_oom(struct mem_cgroup *mem, gfp_t mask) > > > +static bool mem_cgroup_handle_oom(struct mem_cgroup *mem, gfp_t mask) > > > { > > > struct oom_wait_info owait; > > > bool locked, need_to_kill; > > > + long timeout = MAX_SCHEDULE_TIMEOUT; > > > > > > owait.mem = mem; > > > owait.wait.flags = 0; > > > @@ -1437,15 +1440,21 @@ bool mem_cgroup_handle_oom(struct mem_cgroup *mem, gfp_t mask) > > > prepare_to_wait(&memcg_oom_waitq, &owait.wait, TASK_KILLABLE); > > > if (!locked || mem->oom_kill_disable) > > > need_to_kill = false; > > > - if (locked) > > > + if (locked) { > > > + if (mem->oom_delay) { > > > + need_to_kill = false; > > > + timeout = mem->oom_delay; > > > + mem->oom_delay = 0; > > > + } > > > mem_cgroup_oom_notify(mem); > > > + } > > > mutex_unlock(&memcg_oom_mutex); > > > > > > if (need_to_kill) { > > > finish_wait(&memcg_oom_waitq, &owait.wait); > > > mem_cgroup_out_of_memory(mem, mask); > > > } else { > > > - schedule(); > > > + schedule_timeout(timeout); > > > finish_wait(&memcg_oom_waitq, &owait.wait); > > > } > > > mutex_lock(&memcg_oom_mutex); > > > @@ -1456,7 +1465,8 @@ bool mem_cgroup_handle_oom(struct mem_cgroup *mem, gfp_t mask) > > > if (test_thread_flag(TIF_MEMDIE) || fatal_signal_pending(current)) > > > return false; > > > /* Give chance to dying process */ > > > - schedule_timeout(1); > > > + if (timeout != MAX_SCHEDULE_TIMEOUT) > > > > != ? > > > > This seems to change existing behavior. > > > > Ideally it should be "==", if oom_delay was never set, we want to > schedule_timeout(1). BTW, the sched* makes me wonder by how much we > increase the ctxsw rate, but I guess in the OOM path, we should not > bother much. I'll do some testing around this. > Anyway, please do in other patch if you change existing behavior for some purpose. > > > + schedule_timeout(1); > > > return true; > > > } > > > > > > @@ -3863,6 +3873,28 @@ static int mem_cgroup_oom_control_write(struct cgroup *cgrp, > > > return 0; > > > } > > > > > > +static u64 mem_cgroup_oom_delay_millisecs_read(struct cgroup *cgrp, > > > + struct cftype *cft) > > > +{ > > > + struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp); > > > + > > > + return jiffies_to_msecs(memcg->oom_delay); > > > +} > > > + > > > +static int mem_cgroup_oom_delay_millisecs_write(struct cgroup *cgrp, > > > + struct cftype *cft, u64 val) > > > +{ > > > + struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp); > > > + > > > + /* Sanity check -- don't wait longer than an hour */ > > > + if (val > (60 * 60 * 1000)) > > > + return -EINVAL; > > > + > > > + memcg->oom_delay = msecs_to_jiffies(val); > > > + memcg_oom_recover(memcg); > > > + return 0; > > > +} > > > + > > > > Please allow this to the root of sub-hierarchy and no children....(*) > > (please check how mem_cgroup_oom_lock/unlock() works under use_hierarchy=1) > > > > Kamezawa-San, not sure if your comment is clear, are you suggesting > > Since memcg is the root of a hierarchy, we need to use hierarchical > locking before changing the value of the root oom_delay? > No. mem_cgroup_oom_lock() is a lock for hierarchy, not for a group. For example, A / \ B C In above hierarchy, when C is in OOM, A's OOM will be blocked by C's OOM. Because A's OOM can be fixed by C's oom-kill. This means oom_delay for A should be for C (and B), IOW, for hierarchy. A and B, C should have the same oom_delay, oom_disable value. About oom_disable, - It can be set only when A has no children and root of hierarchy. - It's inherited at creating children. 
Then A, B, and C all have the same value. Considering race conditions, I like the current oom_disable approach. Thanks, -Kame
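For concreteness, the restriction described here could look like the following version of the patch's write handler, modeled on the checks mem_cgroup_oom_control_write() already applies to oom_control (a sketch only; the exact checks are illustrative, not from a posted patch):

    static int mem_cgroup_oom_delay_millisecs_write(struct cgroup *cgrp,
                                                    struct cftype *cft, u64 val)
    {
            struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
            struct mem_cgroup *parent;

            if (!cgrp->parent)      /* cannot be set on the root cgroup */
                    return -EINVAL;
            parent = mem_cgroup_from_cont(cgrp->parent);

            cgroup_lock();
            /*
             * Only the root of a sub-hierarchy, before it has children,
             * may set the delay; children then inherit it at creation,
             * so A, B and C always agree on the value.
             */
            if (parent->use_hierarchy ||
                (memcg->use_hierarchy && !list_empty(&cgrp->children))) {
                    cgroup_unlock();
                    return -EINVAL;
            }
            memcg->oom_delay = msecs_to_jiffies(val);
            cgroup_unlock();
            return 0;
    }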
* Re: [patch v3] memcg: add oom killer delay 2011-01-06 1:53 ` KAMEZAWA Hiroyuki @ 2011-01-06 5:46 ` Balbir Singh 2011-01-06 5:52 ` KAMEZAWA Hiroyuki 0 siblings, 1 reply; 22+ messages in thread From: Balbir Singh @ 2011-01-06 5:46 UTC (permalink / raw) To: KAMEZAWA Hiroyuki Cc: David Rientjes, Andrew Morton, Daisuke Nishimura, Divyesh Shah, linux-mm * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2011-01-06 10:53:15]: > > Kamezawa-San, I am not sure your comment is clear. Are you suggesting > > that, since memcg is the root of a hierarchy, we need to use hierarchical > > locking before changing the value of the root's oom_delay? > > No. mem_cgroup_oom_lock() is a lock for hierarchy, not for a group. > For example, > A > / \ > B C > > In above hierarchy, when C is in OOM, A's OOM will be blocked by C's OOM. > Because A's OOM can be fixed by C's oom-kill. > This means oom_delay for A should be for C (and B), IOW, for hierarchy. > > A and B, C should have the same oom_delay, oom_disable value. Why so? You already mentioned that A's OOM will be blocked by C's OOM. If we keep that behaviour and C has a different oom_delay value, it won't matter, since we'll never go up to A. If the patch breaks that behaviour then we are in trouble. With hierarchy we need to ensure that if A has an oom_delay set and C does not, A's setting takes precedence. In the absence of that logic, what you say makes sense. > About oom_disable, > - It can be set only when A has no children and root of hierarchy. > - It's inherited at creating children. > > Then A, B, and C all have the same value. > > Considering race conditions, I like the current oom_disable approach. Thanks for clarifying. -- Three Cheers, Balbir
* Re: [patch v3] memcg: add oom killer delay 2011-01-06 5:46 ` Balbir Singh @ 2011-01-06 5:52 ` KAMEZAWA Hiroyuki 0 siblings, 0 replies; 22+ messages in thread From: KAMEZAWA Hiroyuki @ 2011-01-06 5:52 UTC (permalink / raw) To: balbir Cc: David Rientjes, Andrew Morton, Daisuke Nishimura, Divyesh Shah, linux-mm On Thu, 6 Jan 2011 11:16:00 +0530 Balbir Singh <balbir@linux.vnet.ibm.com> wrote: > * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2011-01-06 10:53:15]: > > > Kamezawa-San, I am not sure your comment is clear. Are you suggesting > > > that, since memcg is the root of a hierarchy, we need to use hierarchical > > > locking before changing the value of the root's oom_delay? > > No. mem_cgroup_oom_lock() is a lock for hierarchy, not for a group. > > For example, > > A > > / \ > > B C > > In above hierarchy, when C is in OOM, A's OOM will be blocked by C's OOM. > > Because A's OOM can be fixed by C's oom-kill. > > This means oom_delay for A should be for C (and B), IOW, for hierarchy. > > A and B, C should have the same oom_delay, oom_disable value. > Why so? You already mentioned that A's OOM will be blocked by C's OOM. > If we keep that behaviour and C has a different oom_delay value, it > won't matter, since we'll never go up to A. When C's oom_delay is 10min and A's oom_delay is 1min, A can be blocked for 10min even though it has a 1min delay. I don't want such a complex rule. > If the patch breaks that behaviour then we are in trouble. With hierarchy > we need to ensure that if A has an oom_delay set and C does not, A's > setting takes precedence. In the absence of that logic, what you say makes sense. His implementation doesn't do that, and I want a simple rule, even if it means acking a feature that seems of no use to me. Thanks, -Kame
* Re: [patch] memcg: add oom killer delay 2010-12-22 7:27 [patch] memcg: add oom killer delay David Rientjes 2010-12-22 7:59 ` Andrew Morton @ 2010-12-25 10:47 ` Balbir Singh 2010-12-26 20:35 ` David Rientjes 1 sibling, 1 reply; 22+ messages in thread From: Balbir Singh @ 2010-12-25 10:47 UTC (permalink / raw) To: David Rientjes Cc: Andrew Morton, Daisuke Nishimura, KAMEZAWA Hiroyuki, Divyesh Shah, linux-mm * David Rientjes <rientjes@google.com> [2010-12-21 23:27:25]: > Completely disabling the oom killer for a memcg is problematic if > userspace is unable to address the condition itself, usually because > userspace is unresponsive. This scenario creates a memcg livelock: > tasks are continuously trying to allocate memory and nothing is getting > killed, so memory freeing is impossible since reclaim has failed, and > all work stalls with no remedy in sight. > > This patch adds an oom killer delay so that a memcg may be configured to > wait at least a pre-defined number of milliseconds before calling the > oom killer. If the oom condition persists for this number of > milliseconds, the oom killer will be called the next time the memory > controller attempts to charge a page (and memory.oom_control is set to > 0). This allows userspace to have a short period of time to respond to > the condition before timing out and deferring to the kernel to kill a > task. > > Admins may set the oom killer timeout using the new interface: > > # echo 60000 > memory.oom_delay > > This will defer oom killing to the kernel only after 60 seconds has > elapsed. When setting memory.oom_delay, all pending timeouts are > restarted. I think Paul mentioned this problem and solution (I think already in use at google) some time back. Is x miliseconds a per oom kill decision timer or is it global? > > Signed-off-by: David Rientjes <rientjes@google.com> > --- > Documentation/cgroups/memory.txt | 15 ++++++++++ > mm/memcontrol.c | 56 +++++++++++++++++++++++++++++++++---- > 2 files changed, 65 insertions(+), 6 deletions(-) > > diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt > --- a/Documentation/cgroups/memory.txt > +++ b/Documentation/cgroups/memory.txt > @@ -68,6 +68,7 @@ Brief summary of control files. > (See sysctl's vm.swappiness) > memory.move_charge_at_immigrate # set/show controls of moving charges > memory.oom_control # set/show oom controls. > + memory.oom_delay # set/show millisecs to wait before oom kill > > 1. History > > @@ -640,6 +641,20 @@ At reading, current status of OOM is shown. > under_oom 0 or 1 (if 1, the memory cgroup is under OOM, tasks may > be stopped.) > > +It is also possible to configure an oom killer timeout to prevent the > +possibility that the memcg will livelock looking for memory if userspace > +has disabled the oom killer with oom_control but cannot act to fix the > +condition itself (usually because userspace has become unresponsive). > + > +To set an oom killer timeout for a memcg, write the number of milliseconds > +to wait before killing a task to memory.oom_delay: > + > + # echo 60000 > memory.oom_delay # wait 60 seconds, then oom kill > + > +This timeout is reset the next time the memcg successfully charges memory > +to a task. > + > + > 11. TODO > > 1. Add support for accounting huge pages (as a separate controller) > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -233,12 +233,16 @@ struct mem_cgroup { > * Should the accounting and control be hierarchical, per subtree? 
> */ > bool use_hierarchy; > + /* oom_delay has expired and still out of memory? */ > + bool oom_kill_delay_expired; > atomic_t oom_lock; > atomic_t refcnt; > > unsigned int swappiness; > /* OOM-Killer disable */ > int oom_kill_disable; > + /* min number of ticks to wait before calling oom killer */ > + int oom_kill_delay; > > /* set when res.limit == memsw.limit */ > bool memsw_is_minimum; > @@ -1524,6 +1528,7 @@ static void memcg_wakeup_oom(struct mem_cgroup *mem) > > static void memcg_oom_recover(struct mem_cgroup *mem) > { > + mem->oom_kill_delay_expired = false; > if (mem && atomic_read(&mem->oom_lock)) > memcg_wakeup_oom(mem); > } > @@ -1531,17 +1536,18 @@ static void memcg_oom_recover(struct mem_cgroup *mem) > /* > * try to call OOM killer. returns false if we should exit memory-reclaim loop. > */ > -bool mem_cgroup_handle_oom(struct mem_cgroup *mem, gfp_t mask) > +static bool mem_cgroup_handle_oom(struct mem_cgroup *mem, gfp_t mask) > { > struct oom_wait_info owait; > - bool locked, need_to_kill; > + bool locked; > + bool need_to_kill = true; > + bool need_to_delay = false; > > owait.mem = mem; > owait.wait.flags = 0; > owait.wait.func = memcg_oom_wake_function; > owait.wait.private = current; > INIT_LIST_HEAD(&owait.wait.task_list); > - need_to_kill = true; > /* At first, try to OOM lock hierarchy under mem.*/ > mutex_lock(&memcg_oom_mutex); > locked = mem_cgroup_oom_lock(mem); > @@ -1553,26 +1559,34 @@ bool mem_cgroup_handle_oom(struct mem_cgroup *mem, gfp_t mask) > prepare_to_wait(&memcg_oom_waitq, &owait.wait, TASK_KILLABLE); > if (!locked || mem->oom_kill_disable) > need_to_kill = false; > - if (locked) > + if (locked) { > mem_cgroup_oom_notify(mem); > + if (mem->oom_kill_delay && !mem->oom_kill_delay_expired) { > + need_to_kill = false; > + need_to_delay = true; > + } > + } > mutex_unlock(&memcg_oom_mutex); > > if (need_to_kill) { > finish_wait(&memcg_oom_waitq, &owait.wait); > mem_cgroup_out_of_memory(mem, mask); > } else { > - schedule(); > + schedule_timeout(need_to_delay ? mem->oom_kill_delay : > + MAX_SCHEDULE_TIMEOUT); > finish_wait(&memcg_oom_waitq, &owait.wait); > } > mutex_lock(&memcg_oom_mutex); > mem_cgroup_oom_unlock(mem); > memcg_wakeup_oom(mem); > + mem->oom_kill_delay_expired = need_to_delay; > mutex_unlock(&memcg_oom_mutex); > > if (test_thread_flag(TIF_MEMDIE) || fatal_signal_pending(current)) > return false; > /* Give chance to dying process */ > - schedule_timeout(1); > + if (!need_to_delay) > + schedule_timeout(1); > return true; > } I think we need additional statistics for tracking oom kills due to timer expiry. > > @@ -2007,6 +2021,7 @@ again: > refill_stock(mem, csize - PAGE_SIZE); > css_put(&mem->css); > done: > + mem->oom_kill_delay_expired = false; > *memcg = mem; > return 0; > nomem: > @@ -4053,6 +4068,29 @@ static int mem_cgroup_oom_control_write(struct cgroup *cgrp, > return 0; > } > > +static u64 mem_cgroup_oom_delay_read(struct cgroup *cgrp, struct cftype *cft) > +{ > + struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp); > + > + return jiffies_to_msecs(memcg->oom_kill_delay); > +} > + > +static int mem_cgroup_oom_delay_write(struct cgroup *cgrp, struct cftype *cft, > + u64 val) > +{ > + struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp); > + > + /* Sanity check -- don't wait longer than an hour */ > + if (val > (60 * 60 * 1000)) > + return -EINVAL; Why do this and not document it? These sort of things get exremely confusing. 
I would prefer not to have it without a reasonable use case or very good documentation explaining why we need an upper bound. > + > + cgroup_lock(); > + memcg->oom_kill_delay = msecs_to_jiffies(val); > + memcg_oom_recover(memcg); > + cgroup_unlock(); > + return 0; > +} > + > static struct cftype mem_cgroup_files[] = { > { > .name = "usage_in_bytes", > @@ -4116,6 +4154,11 @@ static struct cftype mem_cgroup_files[] = { > .unregister_event = mem_cgroup_oom_unregister_event, > .private = MEMFILE_PRIVATE(_OOM_TYPE, OOM_CONTROL), > }, > + { > + .name = "oom_delay", > + .read_u64 = mem_cgroup_oom_delay_read, > + .write_u64 = mem_cgroup_oom_delay_write, > + }, > }; > > #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP > > @@ -4357,6 +4400,7 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont) > parent = mem_cgroup_from_cont(cont->parent); > mem->use_hierarchy = parent->use_hierarchy; > mem->oom_kill_disable = parent->oom_kill_disable; > + mem->oom_kill_delay = parent->oom_kill_delay; > } > > if (parent && parent->use_hierarchy) { -- Three Cheers, Balbir
* Re: [patch] memcg: add oom killer delay 2010-12-25 10:47 ` [patch] " Balbir Singh @ 2010-12-26 20:35 ` David Rientjes 0 siblings, 0 replies; 22+ messages in thread From: David Rientjes @ 2010-12-26 20:35 UTC (permalink / raw) To: Balbir Singh Cc: Andrew Morton, Daisuke Nishimura, KAMEZAWA Hiroyuki, Divyesh Shah, linux-mm On Sat, 25 Dec 2010, Balbir Singh wrote: > > Completely disabling the oom killer for a memcg is problematic if > > userspace is unable to address the condition itself, usually because > > userspace is unresponsive. This scenario creates a memcg livelock: > > tasks are continuously trying to allocate memory and nothing is getting > > killed, so memory freeing is impossible since reclaim has failed, and > > all work stalls with no remedy in sight. > > > > This patch adds an oom killer delay so that a memcg may be configured to > > wait at least a pre-defined number of milliseconds before calling the > > oom killer. If the oom condition persists for this number of > > milliseconds, the oom killer will be called the next time the memory > > controller attempts to charge a page (and memory.oom_control is set to > > 0). This allows userspace to have a short period of time to respond to > > the condition before timing out and deferring to the kernel to kill a > > task. > > > > Admins may set the oom killer timeout using the new interface: > > > > # echo 60000 > memory.oom_delay > > > > This will defer oom killing to the kernel only after 60 seconds has > > elapsed. When setting memory.oom_delay, all pending timeouts are > > restarted. > > I think Paul mentioned this problem and solution (I think already in > use at google) some time back. Is x milliseconds a per oom kill decision > timer or is it global? It's global for that memcg and works for all oom kills in that memcg until changed by userspace. The example given in the changelog of 60 seconds would be for a use case where we're only concerned about deadlock because nothing can allocate memory, the oom killer is disabled, and userspace fails to act for whatever reason. It's necessary because of how memcg handles oom disabling compared to OOM_DISABLE: if all tasks are OOM_DISABLE for a system-wide oom then the machine panics, whereas memcg will simply deadlock for memory.oom_control == 0. [ Your email is in response to the first version of the patch that doesn't have the changes requested by Andrew, but this response will be in terms of how it is implemented in v2. ]
> > > > +It is also possible to configure an oom killer timeout to prevent the > > +possibility that the memcg will livelock looking for memory if userspace > > +has disabled the oom killer with oom_control but cannot act to fix the > > +condition itself (usually because userspace has become unresponsive). > > + > > +To set an oom killer timeout for a memcg, write the number of milliseconds > > +to wait before killing a task to memory.oom_delay: > > + > > + # echo 60000 > memory.oom_delay # wait 60 seconds, then oom kill > > + > > +This timeout is reset the next time the memcg successfully charges memory > > +to a task. > > + > > + > > 11. TODO > > > > 1. Add support for accounting huge pages (as a separate controller) > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > > --- a/mm/memcontrol.c > > +++ b/mm/memcontrol.c > > @@ -233,12 +233,16 @@ struct mem_cgroup { > > * Should the accounting and control be hierarchical, per subtree? > > */ > > bool use_hierarchy; > > + /* oom_delay has expired and still out of memory? */ > > + bool oom_kill_delay_expired; > > atomic_t oom_lock; > > atomic_t refcnt; > > > > unsigned int swappiness; > > /* OOM-Killer disable */ > > int oom_kill_disable; > > + /* min number of ticks to wait before calling oom killer */ > > + int oom_kill_delay; > > > > /* set when res.limit == memsw.limit */ > > bool memsw_is_minimum; > > @@ -1524,6 +1528,7 @@ static void memcg_wakeup_oom(struct mem_cgroup *mem) > > > > static void memcg_oom_recover(struct mem_cgroup *mem) > > { > > + mem->oom_kill_delay_expired = false; > > if (mem && atomic_read(&mem->oom_lock)) > > memcg_wakeup_oom(mem); > > } > > @@ -1531,17 +1536,18 @@ static void memcg_oom_recover(struct mem_cgroup *mem) > > /* > > * try to call OOM killer. returns false if we should exit memory-reclaim loop. > > */ > > -bool mem_cgroup_handle_oom(struct mem_cgroup *mem, gfp_t mask) > > +static bool mem_cgroup_handle_oom(struct mem_cgroup *mem, gfp_t mask) > > { > > struct oom_wait_info owait; > > - bool locked, need_to_kill; > > + bool locked; > > + bool need_to_kill = true; > > + bool need_to_delay = false; > > > > owait.mem = mem; > > owait.wait.flags = 0; > > owait.wait.func = memcg_oom_wake_function; > > owait.wait.private = current; > > INIT_LIST_HEAD(&owait.wait.task_list); > > - need_to_kill = true; > > /* At first, try to OOM lock hierarchy under mem.*/ > > mutex_lock(&memcg_oom_mutex); > > locked = mem_cgroup_oom_lock(mem); > > @@ -1553,26 +1559,34 @@ bool mem_cgroup_handle_oom(struct mem_cgroup *mem, gfp_t mask) > > prepare_to_wait(&memcg_oom_waitq, &owait.wait, TASK_KILLABLE); > > if (!locked || mem->oom_kill_disable) > > need_to_kill = false; > > - if (locked) > > + if (locked) { > > mem_cgroup_oom_notify(mem); > > + if (mem->oom_kill_delay && !mem->oom_kill_delay_expired) { > > + need_to_kill = false; > > + need_to_delay = true; > > + } > > + } > > mutex_unlock(&memcg_oom_mutex); > > > > if (need_to_kill) { > > finish_wait(&memcg_oom_waitq, &owait.wait); > > mem_cgroup_out_of_memory(mem, mask); > > } else { > > - schedule(); > > + schedule_timeout(need_to_delay ? 
mem->oom_kill_delay : > > + MAX_SCHEDULE_TIMEOUT); > > finish_wait(&memcg_oom_waitq, &owait.wait); > > } > > mutex_lock(&memcg_oom_mutex); > > mem_cgroup_oom_unlock(mem); > > memcg_wakeup_oom(mem); > > + mem->oom_kill_delay_expired = need_to_delay; > > mutex_unlock(&memcg_oom_mutex); > > > > if (test_thread_flag(TIF_MEMDIE) || fatal_signal_pending(current)) > > return false; > > /* Give chance to dying process */ > > - schedule_timeout(1); > > + if (!need_to_delay) > > + schedule_timeout(1); > > return true; > > } > I think we need additional statistics for tracking oom kills due to timer expiry. I can add an export for how many times memory.oom_delay_millisecs expired and allowed an oom kill, if you'd like. If you have a special use case for that, please let me know and I'll put it in the changelog. > > @@ -2007,6 +2021,7 @@ again: > > refill_stock(mem, csize - PAGE_SIZE); > > css_put(&mem->css); > > done: > > + mem->oom_kill_delay_expired = false; > > *memcg = mem; > > return 0; > > nomem: > > @@ -4053,6 +4068,29 @@ static int mem_cgroup_oom_control_write(struct cgroup *cgrp, > > return 0; > > } > > > > +static u64 mem_cgroup_oom_delay_read(struct cgroup *cgrp, struct cftype *cft) > > +{ > > + struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp); > > + > > + return jiffies_to_msecs(memcg->oom_kill_delay); > > +} > > + > > +static int mem_cgroup_oom_delay_write(struct cgroup *cgrp, struct cftype *cft, > > + u64 val) > > +{ > > + struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp); > > + > > + /* Sanity check -- don't wait longer than an hour */ > > + if (val > (60 * 60 * 1000)) > > + return -EINVAL; > Why do this and not document it? These sorts of things get extremely > confusing. I would prefer not to have it without a reasonable use case > or very good documentation explaining why we need an upper bound. The upper bound is necessary, although not specifically at 60 minutes, because mem_cgroup_oom_delay_write() takes a u64 and memcg->oom_delay_millisecs is an int. I don't think it makes sense for a memcg to deadlock and fail all memory allocations for over an hour, so it's a fairly arbitrary value. I'll document it in the change to Documentation/cgroups/memory.txt. Thanks for the review!
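The truncation point made here is easy to demonstrate in isolation with a few lines of userspace C (HZ=1000 is assumed so milliseconds and jiffies coincide; the kernel's int field is only mimicked here, not used):

    #include <limits.h>
    #include <stdio.h>

    int main(void)
    {
            unsigned long long val = 3000000000ULL; /* ~34.7 days in ms */
            int oom_delay = (int)val;               /* the patch stores an int */

            /* 3000000000 does not fit in a 32-bit int, so the value wraps */
            printf("stored delay = %d\n", oom_delay);       /* negative on the
                                                               usual 32-bit int */
            printf("INT_MAX ms is about %d days\n",
                   INT_MAX / (1000 * 60 * 60 * 24));        /* ~24 days */
            return 0;
    }

So without some cap, a large write would silently wrap to a negative or bogus delay; whether the cap sits at an hour or near the 31-bit limit Kame prefers is then purely a policy choice.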