linux-mm.kvack.org archive mirror
* [RFC][PATCH 0/2]  memcg: oom notifier and handling oom by user
@ 2010-03-08  7:24 KAMEZAWA Hiroyuki
  2010-03-08  7:25 ` [RFC][PATCH 1/2] memcg: oom notifier KAMEZAWA Hiroyuki
                   ` (2 more replies)
  0 siblings, 3 replies; 7+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-03-08  7:24 UTC (permalink / raw)
  To: linux-mm@kvack.org
  Cc: balbir@linux.vnet.ibm.com, nishimura@mxp.nes.nec.co.jp,
	linux-kernel@vger.kernel.org

These two patches are for memcg's OOM handling.

First of all, memcg's OOM doesn't mean "no more resources" but "we hit the limit."
So, daemons/user shells outside of the memcg can still work even while it's under OOM.
Thus, with a notifier and a few more features, we can do something more moderate
than killing at OOM.

This patch set includes
[1/2] oom notifier for memcg (using the eventfd framework of cgroups)
[2/2] oom killer disabling, and hooks for the waitqueue and wake-up

When a memcg's oom-killer is disabled, all tasks which request accountable memory
will sleep on a waitqueue. They will be woken up by user actions such as:
 - enlarging the limit (memory or memsw)
 - killing some tasks
 - moving some tasks out (if account migration is enabled)

As an example, a moderate recovery sequence is:
 - send SIGSTOP to all tasks under the memcg
 - send a termination signal to one process, or make it shrink its usage
 - enlarge the limit temporarily, then send SIGCONT to the tasks
 - reduce the limit after the task exits
 or
 - move a terminating task to the root cgroup

With a sequence like the one sketched below, we may even be able to take
a core dump of a memory-leaking process.
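
As an illustration only, here is a minimal shell sketch of that sequence;
the group path /cgroup/A, the limit values, and LEAKER_PID are all
hypothetical placeholders:

== (sketch; paths and values are placeholders)
#!/bin/bash
G=/cgroup/A

# Freeze every task in the group so the situation stops changing.
for pid in $(cat $G/cgroup.procs); do kill -STOP $pid; done

# Take a core dump of the suspected leaker (PID chosen by inspection).
gcore -o /tmp/leaker $LEAKER_PID

# Give the group temporary headroom, then let the tasks continue.
echo 512M > $G/memory.limit_in_bytes
for pid in $(cat $G/cgroup.procs); do kill -CONT $pid; done

# After the leaking task exits, tighten the limit again.
echo 256M > $G/memory.limit_in_bytes
==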

The following is a sample script that shows all processes in the group when an
OOM happens. Perhaps a pop-up notification under X could present this more nicely.

I have done only light testing; more is clearly needed.
Any comments are welcome,
especially on the user interface and on the overhead of all the added checks.

== memcg_oom_ps.sh
#!/bin/bash -x
# Usage:  ./memcg_oom_ps.sh <path-to-cgroup>

./memcg_oom_waiter "$1/memory.oom_control"

if [ $? -ne 0 ]; then
        echo "something unexpected happened"
        exit 1
fi

# Show a snapshot of every process left in the cgroup.
ps -o pid,ppid,uid,vsz,rss,args -p `cat $1/cgroup.procs`
==

/*
 * memcg_oom_waiter: simple waiter for a memcg's OOM.
 *
 * Based on cgroup_event_listener.c,
 * Copyright (C) Kirill A. Shutemov <kirill@shutemov.name>
 */

#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <libgen.h>
#include <limits.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#include <sys/eventfd.h>

#define USAGE_STR "Usage: memcg_oom_waiter <path-to-control-file>\n"

int main(int argc, char **argv)
{
	int efd = -1;
	int cfd = -1;
	int event_control = -1;
	char event_control_path[PATH_MAX];
	char line[LINE_MAX];
	uint64_t result;
	int ret = -1;	/* initialized so an early "goto out" exits non-zero */

	if (argc != 2) {
		fputs(USAGE_STR, stderr);
		goto out;
	}

	cfd = open(argv[1], O_RDONLY);
	if (cfd == -1) {
		fprintf(stderr, "Cannot open %s: %s\n", argv[1],
				strerror(errno));
		goto out;
	}

	ret = snprintf(event_control_path, PATH_MAX, "%s/cgroup.event_control",
			dirname(argv[1]));
	if (ret >= PATH_MAX) {
		fputs("Path to cgroup.event_control is too long\n", stderr);
		goto out;
	}

	event_control = open(event_control_path, O_WRONLY);
	if (event_control == -1) {
		fprintf(stderr, "Cannot open %s: %s\n", event_control_path,
				strerror(errno));
		goto out;
	}

	efd = eventfd(0, 0);
	if (efd == -1) {
		perror("eventfd() failed");
		goto out;
	}

	ret = snprintf(line, LINE_MAX, "%d %d", efd, cfd);
	if (ret >= LINE_MAX) {
		fputs("Arguments string is too long\n", stderr);
		goto out;
	}

	ret = write(event_control, line, strlen(line) + 1);
	if (ret == -1) {
		perror("Cannot write to cgroup.event_control");
		goto out;
	}

	while (1) {
		ret = read(efd, &result, sizeof(result));
		if (ret == -1) {
			if (errno == EINTR)
				continue;
			perror("Cannot read from eventfd");
			break;
		} else
			break;
	}
	assert(ret == sizeof(result));

	ret = access(event_control_path, W_OK);
	if ((ret == -1) && (errno == ENOENT)) {
		puts("The cgroup seems to have been removed.");
		ret = 0;
		goto out;
	}

	if (ret == -1)
		perror("cgroup.event_control "
				"is not accessible any more");
out:
	if (efd >= 0)
		close(efd);
	if (event_control >= 0)
		close(event_control);
	if (cfd >= 0)
		close(cfd);

	return (ret != 0);
}
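
A possible way to build and try the pair above (a sketch; the compiler
invocation and the cgroup path are illustrative):

== (usage sketch)
gcc -o memcg_oom_waiter memcg_oom_waiter.c
chmod +x memcg_oom_ps.sh
./memcg_oom_ps.sh /cgroup/A   # blocks until /cgroup/A reports an OOM, then runs ps
==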


^ permalink raw reply	[flat|nested] 7+ messages in thread

* [RFC][PATCH 1/2]  memcg: oom notifier
  2010-03-08  7:24 [RFC][PATCH 0/2] memcg: oom notifier and handling oom by user KAMEZAWA Hiroyuki
@ 2010-03-08  7:25 ` KAMEZAWA Hiroyuki
  2010-03-08  8:32   ` Kirill A. Shutemov
  2010-03-08  7:27 ` [RFC][PATCH 2/2] memcg: oom killer disable and hooks for stop and recover KAMEZAWA Hiroyuki
  2010-03-08 17:26 ` [RFC][PATCH 0/2] memcg: oom notifier and handling oom by user Balbir Singh
  2 siblings, 1 reply; 7+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-03-08  7:25 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm@kvack.org, balbir@linux.vnet.ibm.com,
	nishimura@mxp.nes.nec.co.jp, linux-kernel@vger.kernel.org

From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

Considering containers or other resource-management software in userland,
event notification of OOM in a memcg should be implemented.
memcg already has a "threshold" notifier which uses eventfd; we can make
use of it for OOM notification.

This patch adds an OOM-notification eventfd callback for memcg. The usage
is very similar to the threshold notifier, but the control file is
memory.oom_control and no argument other than the eventfd is required.

	% cgroup_event_notifier /cgroup/A/memory.oom_control dummy
	(About cgroup_event_notifier, see Documentation/cgroups/)

TODO:
 - add a knob to disable oom-kill under a memcg.
 - add read/write function to oom_control

Changelog: 20100304
 - reworked implementation.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 Documentation/cgroups/memory.txt |   20 ++++-
 mm/memcontrol.c                  |  155 ++++++++++++++++++++++++++++-----------
 2 files changed, 131 insertions(+), 44 deletions(-)

Index: mmotm-2.6.33-Mar5/mm/memcontrol.c
===================================================================
--- mmotm-2.6.33-Mar5.orig/mm/memcontrol.c
+++ mmotm-2.6.33-Mar5/mm/memcontrol.c
@@ -159,6 +159,7 @@ struct mem_cgroup_threshold_ary {
 };
 
 static void mem_cgroup_threshold(struct mem_cgroup *mem);
+static void mem_cgroup_oom_notify(struct mem_cgroup *mem);
 
 /*
  * The memory controller data structure. The memory controller controls both
@@ -220,6 +221,9 @@ struct mem_cgroup {
 	/* thresholds for mem+swap usage. RCU-protected */
 	struct mem_cgroup_threshold_ary *memsw_thresholds;
 
+	/* For oom notifier event fd */
+	struct mem_cgroup_threshold_ary *oom_notify;
+
 	/*
 	 * Should we move charges of a task when a task is moved into this
 	 * mem_cgroup ? And what type of charges should we move ?
@@ -282,9 +286,12 @@ enum charge_type {
 /* for encoding cft->private value on file */
 #define _MEM			(0)
 #define _MEMSWAP		(1)
+#define _OOM_TYPE		(2)
 #define MEMFILE_PRIVATE(x, val)	(((x) << 16) | (val))
 #define MEMFILE_TYPE(val)	(((val) >> 16) & 0xffff)
 #define MEMFILE_ATTR(val)	((val) & 0xffff)
+/* Used for OOM notifier */
+#define OOM_CONTROL		(0)
 
 /*
  * Reclaim flags for mem_cgroup_hierarchical_reclaim
@@ -1313,9 +1320,10 @@ bool mem_cgroup_handle_oom(struct mem_cg
 		prepare_to_wait(&memcg_oom_waitq, &wait, TASK_KILLABLE);
 	mutex_unlock(&memcg_oom_mutex);
 
-	if (locked)
+	if (locked) {
+		mem_cgroup_oom_notify(mem);
 		mem_cgroup_out_of_memory(mem, mask);
-	else {
+	} else {
 		schedule();
 		finish_wait(&memcg_oom_waitq, &wait);
 	}
@@ -3363,33 +3371,65 @@ static int compare_thresholds(const void
 	return _a->threshold - _b->threshold;
 }
 
+static int mem_cgroup_oom_notify_cb(struct mem_cgroup *mem, void *data)
+{
+	struct mem_cgroup_threshold_ary *x;
+	int i;
+
+	rcu_read_lock();
+	x = rcu_dereference(mem->oom_notify);
+	for (i = 0; x && i < x->size; i++)
+		eventfd_signal(x->entries[i].eventfd, 1);
+	rcu_read_unlock();
+	return 0;
+}
+
+static void mem_cgroup_oom_notify(struct mem_cgroup *mem)
+{
+	mem_cgroup_walk_tree(mem, NULL, mem_cgroup_oom_notify_cb);
+}
+
 static int mem_cgroup_register_event(struct cgroup *cgrp, struct cftype *cft,
 		struct eventfd_ctx *eventfd, const char *args)
 {
 	struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
 	struct mem_cgroup_threshold_ary *thresholds, *thresholds_new;
 	int type = MEMFILE_TYPE(cft->private);
-	u64 threshold, usage;
+	u64 threshold;
+	u64 usage = 0;
 	int size;
 	int i, ret;
 
-	ret = res_counter_memparse_write_strategy(args, &threshold);
-	if (ret)
-		return ret;
+	if (type != _OOM_TYPE) {
+		ret = res_counter_memparse_write_strategy(args, &threshold);
+		if (ret)
+			return ret;
+	} else if (mem_cgroup_is_root(memcg)) /* root cgroup ? */
+		return -ENOTSUPP;
 
 	mutex_lock(&memcg->thresholds_lock);
-	if (type == _MEM)
+	/* For waiting OOM notify, "-1" is passed */
+
+	switch (type) {
+	case _MEM:
 		thresholds = memcg->thresholds;
-	else if (type == _MEMSWAP)
+		break;
+	case _MEMSWAP:
 		thresholds = memcg->memsw_thresholds;
-	else
+		break;
+	case _OOM_TYPE:
+		thresholds = memcg->oom_notify;
+		break;
+	default:
 		BUG();
+	}
 
-	usage = mem_cgroup_usage(memcg, type == _MEMSWAP);
-
-	/* Check if a threshold crossed before adding a new one */
-	if (thresholds)
-		__mem_cgroup_threshold(memcg, type == _MEMSWAP);
+	if (type != _OOM_TYPE) {
+		usage = mem_cgroup_usage(memcg, type == _MEMSWAP);
+		/* Check if a threshold crossed before adding a new one */
+		if (thresholds)
+			__mem_cgroup_threshold(memcg, type == _MEMSWAP);
+	}
 
 	if (thresholds)
 		size = thresholds->size + 1;
@@ -3416,27 +3456,34 @@ static int mem_cgroup_register_event(str
 	thresholds_new->entries[size - 1].threshold = threshold;
 
 	/* Sort thresholds. Registering of new threshold isn't time-critical */
-	sort(thresholds_new->entries, size,
+	if (type != _OOM_TYPE) {
+		sort(thresholds_new->entries, size,
 			sizeof(struct mem_cgroup_threshold),
 			compare_thresholds, NULL);
-
-	/* Find current threshold */
-	atomic_set(&thresholds_new->current_threshold, -1);
-	for (i = 0; i < size; i++) {
-		if (thresholds_new->entries[i].threshold < usage) {
-			/*
-			 * thresholds_new->current_threshold will not be used
-			 * until rcu_assign_pointer(), so it's safe to increment
-			 * it here.
-			 */
-			atomic_inc(&thresholds_new->current_threshold);
+		/* Find current threshold */
+		atomic_set(&thresholds_new->current_threshold, -1);
+		for (i = 0; i < size; i++) {
+			if (thresholds_new->entries[i].threshold < usage) {
+				/*
+				 * thresholds_new->current_threshold will not
+				 * be used until rcu_assign_pointer(), so it's
+				 * safe to increment it here.
+				 */
+				atomic_inc(&thresholds_new->current_threshold);
+			}
 		}
 	}
-
-	if (type == _MEM)
+	switch (type) {
+	case _MEM:
 		rcu_assign_pointer(memcg->thresholds, thresholds_new);
-	else
+		break;
+	case _MEMSWAP:
 		rcu_assign_pointer(memcg->memsw_thresholds, thresholds_new);
+		break;
+	case _OOM_TYPE:
+		rcu_assign_pointer(memcg->oom_notify, thresholds_new);
+		break;
+	}
 
 	/* To be sure that nobody uses thresholds before freeing it */
 	synchronize_rcu();
@@ -3454,17 +3501,25 @@ static int mem_cgroup_unregister_event(s
 	struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
 	struct mem_cgroup_threshold_ary *thresholds, *thresholds_new;
 	int type = MEMFILE_TYPE(cft->private);
-	u64 usage;
+	u64 usage = 0;
 	int size = 0;
 	int i, j, ret;
 
 	mutex_lock(&memcg->thresholds_lock);
-	if (type == _MEM)
+	/* check whether this eventfd is for OOM or not */
+	switch (type) {
+	case _MEM:
 		thresholds = memcg->thresholds;
-	else if (type == _MEMSWAP)
+		break;
+	case _MEMSWAP:
 		thresholds = memcg->memsw_thresholds;
-	else
+		break;
+	case _OOM_TYPE:
+		thresholds = memcg->oom_notify;
+		break;
+	default:
 		BUG();
+	}
 
 	/*
 	 * Something went wrong if we trying to unregister a threshold
@@ -3472,11 +3527,11 @@ static int mem_cgroup_unregister_event(s
 	 */
 	BUG_ON(!thresholds);
 
-	usage = mem_cgroup_usage(memcg, type == _MEMSWAP);
-
-	/* Check if a threshold crossed before removing */
-	__mem_cgroup_threshold(memcg, type == _MEMSWAP);
-
+	if (type != _OOM_TYPE) {
+		usage = mem_cgroup_usage(memcg, type == _MEMSWAP);
+		/* Check if a threshold crossed before removing */
+		__mem_cgroup_threshold(memcg, type == _MEMSWAP);
+	}
 	/* Calculate new number of threshold */
 	for (i = 0; i < thresholds->size; i++) {
 		if (thresholds->entries[i].eventfd != eventfd)
@@ -3500,13 +3555,15 @@ static int mem_cgroup_unregister_event(s
 	thresholds_new->size = size;
 
 	/* Copy thresholds and find current threshold */
-	atomic_set(&thresholds_new->current_threshold, -1);
+	if (type != _OOM_TYPE)
+		atomic_set(&thresholds_new->current_threshold, -1);
 	for (i = 0, j = 0; i < thresholds->size; i++) {
 		if (thresholds->entries[i].eventfd == eventfd)
 			continue;
 
 		thresholds_new->entries[j] = thresholds->entries[i];
-		if (thresholds_new->entries[j].threshold < usage) {
+		if (type != _OOM_TYPE &&
+			thresholds_new->entries[j].threshold < usage) {
 			/*
 			 * thresholds_new->current_threshold will not be used
 			 * until rcu_assign_pointer(), so it's safe to increment
@@ -3518,11 +3575,17 @@ static int mem_cgroup_unregister_event(s
 	}
 
 assign:
-	if (type == _MEM)
+	switch (type) {
+	case _MEM:
 		rcu_assign_pointer(memcg->thresholds, thresholds_new);
-	else
+		break;
+	case _MEMSWAP:
 		rcu_assign_pointer(memcg->memsw_thresholds, thresholds_new);
-
+		break;
+	case _OOM_TYPE:
+		rcu_assign_pointer(memcg->oom_notify, thresholds_new);
+		break;
+	}
 	/* To be sure that nobody uses thresholds before freeing it */
 	synchronize_rcu();
 
@@ -3588,6 +3651,12 @@ static struct cftype mem_cgroup_files[] 
 		.read_u64 = mem_cgroup_move_charge_read,
 		.write_u64 = mem_cgroup_move_charge_write,
 	},
+	{
+		.name = "oom_control",
+		.register_event = mem_cgroup_register_event,
+		.unregister_event = mem_cgroup_unregister_event,
+		.private = MEMFILE_PRIVATE(_OOM_TYPE, OOM_CONTROL),
+	},
 };
 
 #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
Index: mmotm-2.6.33-Mar5/Documentation/cgroups/memory.txt
===================================================================
--- mmotm-2.6.33-Mar5.orig/Documentation/cgroups/memory.txt
+++ mmotm-2.6.33-Mar5/Documentation/cgroups/memory.txt
@@ -184,6 +184,9 @@ limits on the root cgroup.
 
 Note2: When panic_on_oom is set to "2", the whole system will panic.
 
+When an oom event notifier is registered, an event will be delivered on OOM.
+(See the oom_control section.)
+
 2. Locking
 
 The memory controller uses the following hierarchy
@@ -486,7 +489,22 @@ threshold in any direction.
 
 It's applicable for root and non-root cgroup.
 
-10. TODO
+10. OOM Control
+
+The memory controller implements an oom notifier using the cgroup notification
+API (see cgroups.txt). It allows multiple oom notification deliveries to be
+registered; each is notified when an oom happens.
+
+To register a notifier, an application needs to:
+ - create an eventfd using eventfd(2)
+ - open the memory.oom_control file
+ - write a string like "<event_fd> <fd of memory.oom_control>" to cgroup.event_control
+
+The application will be notified through the eventfd when an oom happens.
+OOM notification does not work for the root cgroup.
+
+
+11. TODO
 
 1. Add support for accounting huge pages (as a separate controller)
 2. Make per-cgroup scanner reclaim not-shared pages first

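For a quick manual test of the notifier (a sketch, not part of the patch):
assuming the event-listener helper from Documentation/cgroups/ is built (this
mail calls it cgroup_event_notifier; the in-tree example is
cgroup_event_listener.c) and /cgroup/A exists, something like this should make
the event fire:

== (sketch; paths, the 8M/64M sizes, and swap behavior are assumptions)
# shell 1: block until the group reports an OOM
./cgroup_event_listener /cgroup/A/memory.oom_control dummy

# shell 2: enter the group, clamp its limit, then allocate anonymous memory
echo $$ > /cgroup/A/tasks
echo 8M > /cgroup/A/memory.limit_in_bytes
x=$(head -c 64M /dev/zero | tr '\0' x)   # may swap rather than OOM if memsw is unlimited
==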

^ permalink raw reply	[flat|nested] 7+ messages in thread

* [RFC][PATCH 2/2]  memcg: oom killer disable and hooks for stop and recover
  2010-03-08  7:24 [RFC][PATCH 0/2] memcg: oom notifier and handling oom by user KAMEZAWA Hiroyuki
  2010-03-08  7:25 ` [RFC][PATCH 1/2] memcg: oom notifier KAMEZAWA Hiroyuki
@ 2010-03-08  7:27 ` KAMEZAWA Hiroyuki
  2010-03-08 17:26 ` [RFC][PATCH 0/2] memcg: oom notifier and handling oom by user Balbir Singh
  2 siblings, 0 replies; 7+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-03-08  7:27 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm@kvack.org, balbir@linux.vnet.ibm.com,
	nishimura@mxp.nes.nec.co.jp, linux-kernel@vger.kernel.org

From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

This adds a feature to disable the oom-killer for a memcg. If it is
disabled, tasks under the memcg will of course stop (sleep) at OOM.

But now we have an oom notifier for memcg, and the world around the
memcg is not out of memory; a memcg's out-of-memory just
means that the memcg hit its limit. An administrator or a
management daemon can then recover the situation by (a shell sketch
follows this list):
	- killing some processes
	- enlarging the limit, or adding more swap
	- migrating some tasks
	- removing file caches on tmpfs (difficult?)
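
A possible end-to-end flow from the shell (a sketch; /cgroup/A and the
512M value are placeholders):

== (sketch)
echo 1 > /cgroup/A/memory.oom_control        # disable the oom-killer for this group
cat /cgroup/A/memory.oom_control             # shows oom_kill_disable and under_oom
# ... once under_oom reads 1, recover, e.g. by enlarging the limit;
# this wakes up the tasks sleeping on the oom waitqueue:
echo 512M > /cgroup/A/memory.limit_in_bytes
==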

TODO:
	more brush-up; hunt for remaining races.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 Documentation/cgroups/memory.txt |   19 ++++++
 mm/memcontrol.c                  |  118 ++++++++++++++++++++++++++++++++++-----
 2 files changed, 122 insertions(+), 15 deletions(-)

Index: mmotm-2.6.33-Mar5/mm/memcontrol.c
===================================================================
--- mmotm-2.6.33-Mar5.orig/mm/memcontrol.c
+++ mmotm-2.6.33-Mar5/mm/memcontrol.c
@@ -229,7 +229,8 @@ struct mem_cgroup {
 	 * mem_cgroup ? And what type of charges should we move ?
 	 */
 	unsigned long 	move_charge_at_immigrate;
-
+	/* Disable OOM killer */
+	unsigned long	oom_kill_disable;
 	/*
 	 * percpu counter.
 	 */
@@ -1300,14 +1301,30 @@ static void mem_cgroup_oom_unlock(struct
 static DEFINE_MUTEX(memcg_oom_mutex);
 static DECLARE_WAIT_QUEUE_HEAD(memcg_oom_waitq);
 
+void memcg_oom_recover(struct mem_cgroup *mem)
+{
+	/*
+	 * This may wake up unrelated threads, but handling a complex
+	 * hierarchy is painful and there is no big side-effect from a
+	 * spurious wake-up.
+	 *
+	 * Note: This function is called by __do_uncharge(). In an extreme
+	 * case, we may not be able to guarantee that *mem is a valid memcg.
+	 * But we do no "write"; the side-effect is just a (false) wake-up.
+	 */
+	if (mem->oom_kill_disable && atomic_read(&mem->oom_lock))
+		wake_up_all(&memcg_oom_waitq);
+}
+
 /*
  * try to call OOM killer. returns false if we should exit memory-reclaim loop.
  */
 bool mem_cgroup_handle_oom(struct mem_cgroup *mem, gfp_t mask)
 {
 	DEFINE_WAIT(wait);
-	bool locked;
+	bool locked, notify;
 
+	notify = false;
 	/* At first, try to OOM lock hierarchy under mem.*/
 	mutex_lock(&memcg_oom_mutex);
 	locked = mem_cgroup_oom_lock(mem);
@@ -1316,12 +1333,17 @@ bool mem_cgroup_handle_oom(struct mem_cg
 	 * accounting. So, UNINTERRUPTIBLE is appropriate. But SIGKILL
 	 * under OOM is always welcomed, use TASK_KILLABLE here.
 	 */
-	if (!locked)
+	if (!locked || mem->oom_kill_disable) {
+		notify = !waitqueue_active(&memcg_oom_waitq);
 		prepare_to_wait(&memcg_oom_waitq, &wait, TASK_KILLABLE);
+		locked = false;
+	}
 	mutex_unlock(&memcg_oom_mutex);
 
-	if (locked) {
+	if (locked || notify) /* we do lock or we're the 1st waiter */
 		mem_cgroup_oom_notify(mem);
+
+	if (locked) {
 		mem_cgroup_out_of_memory(mem, mask);
 	} else {
 		schedule();
@@ -2128,15 +2150,6 @@ __do_uncharge(struct mem_cgroup *mem, co
 	/* If swapout, usage of swap doesn't decrease */
 	if (!do_swap_account || ctype == MEM_CGROUP_CHARGE_TYPE_SWAPOUT)
 		uncharge_memsw = false;
-	/*
-	 * do_batch > 0 when unmapping pages or inode invalidate/truncate.
-	 * In those cases, all pages freed continously can be expected to be in
-	 * the same cgroup and we have chance to coalesce uncharges.
-	 * But we do uncharge one by one if this is killed by OOM(TIF_MEMDIE)
-	 * because we want to do uncharge as soon as possible.
-	 */
-	if (!current->memcg_batch.do_batch || test_thread_flag(TIF_MEMDIE))
-		goto direct_uncharge;
 
 	batch = &current->memcg_batch;
 	/*
@@ -2147,6 +2160,17 @@ __do_uncharge(struct mem_cgroup *mem, co
 	if (!batch->memcg)
 		batch->memcg = mem;
 	/*
+	 * do_batch > 0 when unmapping pages or inode invalidate/truncate.
+	 * In those cases, all pages freed continously can be expected to be in
+	 * the same cgroup and we have chance to coalesce uncharges.
+	 * But we do uncharge one by one if this is killed by OOM(TIF_MEMDIE)
+	 * because we want to do uncharge as soon as possible.
+	 */
+
+	if (!batch->do_batch || test_thread_flag(TIF_MEMDIE))
+		goto direct_uncharge;
+
+	/*
 	 * In typical case, batch->memcg == mem. This means we can
 	 * merge a series of uncharges to an uncharge of res_counter.
 	 * If not, we uncharge res_counter ony by one.
@@ -2162,6 +2186,8 @@ direct_uncharge:
 	res_counter_uncharge(&mem->res, PAGE_SIZE);
 	if (uncharge_memsw)
 		res_counter_uncharge(&mem->memsw, PAGE_SIZE);
+	if (unlikely(batch->memcg != mem))
+		memcg_oom_recover(mem);
 	return;
 }
 
@@ -2298,6 +2324,7 @@ void mem_cgroup_uncharge_end(void)
 		res_counter_uncharge(&batch->memcg->res, batch->bytes);
 	if (batch->memsw_bytes)
 		res_counter_uncharge(&batch->memcg->memsw, batch->memsw_bytes);
+	memcg_oom_recover(batch->memcg);
 	/* forget this pointer (for sanity check) */
 	batch->memcg = NULL;
 }
@@ -2534,10 +2561,11 @@ static int mem_cgroup_resize_limit(struc
 				unsigned long long val)
 {
 	int retry_count;
-	u64 memswlimit;
+	u64 memswlimit, memlimit;
 	int ret = 0;
 	int children = mem_cgroup_count_children(memcg);
 	u64 curusage, oldusage;
+	int enlarge;
 
 	/*
 	 * For keeping hierarchical_reclaim simple, how long we should retry
@@ -2548,6 +2576,7 @@ static int mem_cgroup_resize_limit(struc
 
 	oldusage = res_counter_read_u64(&memcg->res, RES_USAGE);
 
+	enlarge = 0;
 	while (retry_count) {
 		if (signal_pending(current)) {
 			ret = -EINTR;
@@ -2565,6 +2594,11 @@ static int mem_cgroup_resize_limit(struc
 			mutex_unlock(&set_limit_mutex);
 			break;
 		}
+
+		memlimit = res_counter_read_u64(&memcg->res, RES_LIMIT);
+		if (memlimit < val)
+			enlarge = 1;
+
 		ret = res_counter_set_limit(&memcg->res, val);
 		if (!ret) {
 			if (memswlimit == val)
@@ -2586,6 +2620,8 @@ static int mem_cgroup_resize_limit(struc
 		else
 			oldusage = curusage;
 	}
+	if (!ret && enlarge)
+		memcg_oom_recover(memcg);
 
 	return ret;
 }
@@ -2594,9 +2630,10 @@ static int mem_cgroup_resize_memsw_limit
 					unsigned long long val)
 {
 	int retry_count;
-	u64 memlimit, oldusage, curusage;
+	u64 memlimit, memswlimit, oldusage, curusage;
 	int children = mem_cgroup_count_children(memcg);
 	int ret = -EBUSY;
+	int enlarge = 0;
 
 	/* see mem_cgroup_resize_res_limit */
  	retry_count = children * MEM_CGROUP_RECLAIM_RETRIES;
@@ -2618,6 +2655,9 @@ static int mem_cgroup_resize_memsw_limit
 			mutex_unlock(&set_limit_mutex);
 			break;
 		}
+		memswlimit = res_counter_read_u64(&memcg->memsw, RES_LIMIT);
+		if (memswlimit < val)
+			enlarge = 1;
 		ret = res_counter_set_limit(&memcg->memsw, val);
 		if (!ret) {
 			if (memlimit == val)
@@ -2640,6 +2680,8 @@ static int mem_cgroup_resize_memsw_limit
 		else
 			oldusage = curusage;
 	}
+	if (!ret && enlarge)
+		memcg_oom_recover(memcg);
 	return ret;
 }
 
@@ -2831,6 +2873,7 @@ move_account:
 			if (ret)
 				break;
 		}
+		memcg_oom_recover(mem);
 		/* it seems parent cgroup doesn't have enough mem */
 		if (ret == -ENOMEM)
 			goto try_to_free;
@@ -3596,6 +3639,46 @@ unlock:
 	return ret;
 }
 
+static int mem_cgroup_oom_control_read(struct cgroup *cgrp,
+	struct cftype *cft,  struct cgroup_map_cb *cb)
+{
+	struct mem_cgroup *mem = mem_cgroup_from_cont(cgrp);
+
+	cb->fill(cb, "oom_kill_disable", mem->oom_kill_disable);
+
+	if (atomic_read(&mem->oom_lock))
+		cb->fill(cb, "under_oom", 1);
+	else
+		cb->fill(cb, "under_oom", 0);
+	return 0;
+}
+
+/*
+ */
+static int mem_cgroup_oom_control_write(struct cgroup *cgrp,
+	struct cftype *cft, u64 val)
+{
+	struct mem_cgroup *mem = mem_cgroup_from_cont(cgrp);
+	struct mem_cgroup *parent;
+
+	/* cannot set to root cgroup and only 0 and 1 are allowed */
+	if (!cgrp->parent || !((val == 0) || (val == 1)))
+		return -EINVAL;
+
+	parent = mem_cgroup_from_cont(cgrp->parent);
+
+	cgroup_lock();
+	/* oom-kill-disable is a flag for the whole sub-hierarchy. */
+	if ((parent->use_hierarchy) ||
+	    (mem->use_hierarchy && !list_empty(&cgrp->children))) {
+		cgroup_unlock();
+		return -EINVAL;
+	}
+	mem->oom_kill_disable = val;
+	cgroup_unlock();
+	return 0;
+}
+
 static struct cftype mem_cgroup_files[] = {
 	{
 		.name = "usage_in_bytes",
@@ -3653,6 +3736,8 @@ static struct cftype mem_cgroup_files[] 
 	},
 	{
 		.name = "oom_control",
+		.read_map = mem_cgroup_oom_control_read,
+		.write_u64 = mem_cgroup_oom_control_write,
 		.register_event = mem_cgroup_register_event,
 		.unregister_event = mem_cgroup_unregister_event,
 		.private = MEMFILE_PRIVATE(_OOM_TYPE, OOM_CONTROL),
@@ -3892,6 +3977,7 @@ mem_cgroup_create(struct cgroup_subsys *
 	} else {
 		parent = mem_cgroup_from_cont(cont->parent);
 		mem->use_hierarchy = parent->use_hierarchy;
+		mem->oom_kill_disable = parent->oom_kill_disable;
 	}
 
 	if (parent && parent->use_hierarchy) {
@@ -4162,6 +4248,7 @@ static void mem_cgroup_clear_mc(void)
 	if (mc.precharge) {
 		__mem_cgroup_cancel_charge(mc.to, mc.precharge);
 		mc.precharge = 0;
+		memcg_oom_recover(mc.to);
 	}
 	/*
 	 * we didn't uncharge from mc.from at mem_cgroup_move_account(), so
@@ -4170,6 +4257,7 @@ static void mem_cgroup_clear_mc(void)
 	if (mc.moved_charge) {
 		__mem_cgroup_cancel_charge(mc.from, mc.moved_charge);
 		mc.moved_charge = 0;
+		memcg_oom_recover(mc.from);
 	}
 	/* we must fixup refcnts and charges */
 	if (mc.moved_swap) {
Index: mmotm-2.6.33-Mar5/Documentation/cgroups/memory.txt
===================================================================
--- mmotm-2.6.33-Mar5.orig/Documentation/cgroups/memory.txt
+++ mmotm-2.6.33-Mar5/Documentation/cgroups/memory.txt
@@ -491,6 +491,8 @@ It's applicable for root and non-root cg
 
 10. OOM Control
 
+The memory.oom_control file is for OOM notification and other controls.
+
 The memory controller implements an oom notifier using the cgroup notification
 API (see cgroups.txt). It allows multiple oom notification deliveries to be
 registered; each is notified when an oom happens.
@@ -503,6 +505,23 @@ To register a notifier, an application needs
 The application will be notified through the eventfd when an oom happens.
 OOM notification does not work for the root cgroup.
 
+You can disable the oom-killer by writing "1" to the memory.oom_control file,
+as in:
+	# echo 1 > memory.oom_control
+
+This operation is only allowed for the top cgroup of a sub-hierarchy.
+If the oom-killer is disabled, tasks under the cgroup will hang/sleep
+on the memcg's oom waitqueue when they request accountable memory.
+To let them run again, you have to relax the memcg's oom situation by
+	* enlarging the limit
+	* killing some tasks
+	* moving some tasks to another group with account migration enabled
+Then, the stopped tasks will run again.
+
+On reading, the current OOM status is shown:
+	oom_kill_disable 0 or 1 (if 1, the oom-killer is disabled)
+	under_oom	 0 or 1 (if 1, the memcg is under OOM and tasks may
+				 be stopped)
 
 11. TODO
 


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [RFC][PATCH 1/2] memcg: oom notifier
  2010-03-08  7:25 ` [RFC][PATCH 1/2] memcg: oom notifier KAMEZAWA Hiroyuki
@ 2010-03-08  8:32   ` Kirill A. Shutemov
  2010-03-08  8:33     ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 7+ messages in thread
From: Kirill A. Shutemov @ 2010-03-08  8:32 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm@kvack.org, balbir@linux.vnet.ibm.com,
	nishimura@mxp.nes.nec.co.jp, linux-kernel@vger.kernel.org

On Mon, Mar 8, 2010 at 9:25 AM, KAMEZAWA Hiroyuki
<kamezawa.hiroyu@jp.fujitsu.com> wrote:
> From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>
> Considering containers or other resource-management software in userland,
> event notification of OOM in a memcg should be implemented.
> memcg already has a "threshold" notifier which uses eventfd; we can make
> use of it for OOM notification.
>
> This patch adds an OOM-notification eventfd callback for memcg. The usage
> is very similar to the threshold notifier, but the control file is
> memory.oom_control and no argument other than the eventfd is required.
>
>        % cgroup_event_notifier /cgroup/A/memory.oom_control dummy
>        (About cgroup_event_notifier, see Documentation/cgroups/)

Nice idea!

But I don't think that sharing mem_cgroup_(un)register_event()
with the thresholds is a good idea. There are too many
"if (type != _OOM_TYPE)" checks. It's probably cleaner to create separate
register/unregister functions for oom events, since an oom event is quite
different from a threshold. We also don't need RCU for oom events; it's not
a critical path.

> [... remainder of the patch quoted in full, trimmed ...]

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [RFC][PATCH 1/2] memcg: oom notifier
  2010-03-08  8:32   ` Kirill A. Shutemov
@ 2010-03-08  8:33     ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 7+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-03-08  8:33 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: linux-mm@kvack.org, balbir@linux.vnet.ibm.com,
	nishimura@mxp.nes.nec.co.jp, linux-kernel@vger.kernel.org

On Mon, 8 Mar 2010 10:32:59 +0200
"Kirill A. Shutemov" <kirill@shutemov.name> wrote:

> On Mon, Mar 8, 2010 at 9:25 AM, KAMEZAWA Hiroyuki
> <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> >
> > Considering containers or other resource-management software in userland,
> > event notification of OOM in a memcg should be implemented.
> > memcg already has a "threshold" notifier which uses eventfd; we can make
> > use of it for OOM notification.
> >
> > This patch adds an OOM-notification eventfd callback for memcg. The usage
> > is very similar to the threshold notifier, but the control file is
> > memory.oom_control and no argument other than the eventfd is required.
> >
> > 	% cgroup_event_notifier /cgroup/A/memory.oom_control dummy
> > 	(About cgroup_event_notifier, see Documentation/cgroups/)
> 
> Nice idea!
> 
> But I don't think that sharing mem_cgroup_(un)register_event()
> with thresholds is a good idea. There are too many
> "if (type != _OOM_TYPE)". Probably, it's cleaner to create separate
> register/unregister for oom events, since oom event is quite different
> from threshold. We, also, don't need RCU for oom events. It's not
> a critical path.
> 

Ah, okay. I'll write independent functions. I just wanted to reuse the existing
good code :)

Thanks,
-Kame


> > [... remainder of the patch quoted in full, trimmed ...]


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [RFC][PATCH 0/2]  memcg: oom notifier and handling oom by user
  2010-03-08  7:24 [RFC][PATCH 0/2] memcg: oom notifier and handling oom by user KAMEZAWA Hiroyuki
  2010-03-08  7:25 ` [RFC][PATCH 1/2] memcg: oom notifier KAMEZAWA Hiroyuki
  2010-03-08  7:27 ` [RFC][PATCH 2/2] memcg: oom killer disable and hooks for stop and recover KAMEZAWA Hiroyuki
@ 2010-03-08 17:26 ` Balbir Singh
  2010-03-08 23:57   ` KAMEZAWA Hiroyuki
  2 siblings, 1 reply; 7+ messages in thread
From: Balbir Singh @ 2010-03-08 17:26 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm@kvack.org, nishimura@mxp.nes.nec.co.jp,
	linux-kernel@vger.kernel.org

* KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2010-03-08 16:24:14]:

> These two patches are for memcg's OOM handling.
> 
> First of all, memcg's OOM doesn't mean "no more resources" but "we hit the limit."
> So, daemons/user shells outside of the memcg can still work even while it's under OOM.
> Thus, with a notifier and a few more features, we can do something more moderate
> than killing at OOM.
> 
> This patch set includes
> [1/2] oom notifier for memcg (using the eventfd framework of cgroups)
> [2/2] oom killer disabling, and hooks for the waitqueue and wake-up
> 
> When a memcg's oom-killer is disabled, all tasks which request accountable memory
> will sleep on a waitqueue. They will be woken up by user actions such as:
>  - enlarging the limit (memory or memsw)
>  - killing some tasks
>  - moving some tasks out (if account migration is enabled)
> 

Hmm... I've not seen the waitq and wake-up patches, but does that mean
user space will control resumption of the tasks?


-- 
	Three Cheers,
	Balbir


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [RFC][PATCH 0/2]  memcg: oom notifier and handling oom by user
  2010-03-08 17:26 ` [RFC][PATCH 0/2] memcg: oom notifier and handling oom by user Balbir Singh
@ 2010-03-08 23:57   ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 7+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-03-08 23:57 UTC (permalink / raw)
  To: balbir
  Cc: linux-mm@kvack.org, nishimura@mxp.nes.nec.co.jp,
	linux-kernel@vger.kernel.org

On Mon, 8 Mar 2010 22:56:09 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:

> * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2010-03-08 16:24:14]:
> 
> > These two patches are for memcg's OOM handling.
> > 
> > First of all, memcg's OOM doesn't mean "no more resources" but "we hit the limit."
> > So, daemons/user shells outside of the memcg can still work even while it's under OOM.
> > Thus, with a notifier and a few more features, we can do something more moderate
> > than killing at OOM.
> > 
> > This patch set includes
> > [1/2] oom notifier for memcg (using the eventfd framework of cgroups)
> > [2/2] oom killer disabling, and hooks for the waitqueue and wake-up
> > 
> > When a memcg's oom-killer is disabled, all tasks which request accountable memory
> > will sleep on a waitqueue. They will be woken up by user actions such as:
> >  - enlarging the limit (memory or memsw)
> >  - killing some tasks
> >  - moving some tasks out (if account migration is enabled)
> > 
> 
> Hmm... I've not seen the waitq and wake-up patches, but does that mean
> user space will control resumption of the tasks?
> 
Yes. And what makes this behavior more useful than an oom-kill (SIGKILL) by
the kernel is that users can take a core dump (with gcore et al.) and a snapshot of
all tasks' resource usage (with ps et al.) even if a task has to be killed in the end.


Thanks,
-Kame


^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread

Thread overview: 7+ messages
2010-03-08  7:24 [RFC][PATCH 0/2] memcg: oom notifier and handling oom by user KAMEZAWA Hiroyuki
2010-03-08  7:25 ` [RFC][PATCH 1/2] memcg: oom notifier KAMEZAWA Hiroyuki
2010-03-08  8:32   ` Kirill A. Shutemov
2010-03-08  8:33     ` KAMEZAWA Hiroyuki
2010-03-08  7:27 ` [RFC][PATCH 2/2] memcg: oom killer disable and hooks for stop and recover KAMEZAWA Hiroyuki
2010-03-08 17:26 ` [RFC][PATCH 0/2] memcg: oom notifier and handling oom by user Balbir Singh
2010-03-08 23:57   ` KAMEZAWA Hiroyuki
