[PATCH v3 0/6] Fix oom killer doesn't work at all if system have > gigabytes memory (aka CAI founded issue)

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* [PATCH v3 0/6]  Fix oom killer doesn't work at all if system have > gigabytes memory  (aka CAI founded issue)
@ 2011-06-22 10:45 KOSAKI Motohiro
  2011-06-22 10:46 ` [PATCH 1/6] oom: use euid instead of CAP_SYS_ADMIN for protection root process KOSAKI Motohiro
                   ` (5 more replies)
  0 siblings, 6 replies; 12+ messages in thread
From: KOSAKI Motohiro @ 2011-06-22 10:45 UTC (permalink / raw)
  To: linux-mm, linux-kernel, akpm, caiqian, rientjes, hughd,
	kamezawa.hiroyu, minchan.kim, oleg
  Cc: kosaki.motohiro

CAI Qian reported current oom logic doesn't work at all on his 16GB RAM
machine. oom killer killed all system daemon at first and his system
stopped responding.

The brief log is below.

> > Out of memory: Kill process 1175 (dhclient) score 1 or sacrifice child
> > Out of memory: Kill process 1247 (rsyslogd) score 1 or sacrifice child
> > Out of memory: Kill process 1284 (irqbalance) score 1 or sacrifice child
> > Out of memory: Kill process 1303 (rpcbind) score 1 or sacrifice child
> > Out of memory: Kill process 1321 (rpc.statd) score 1 or sacrifice child
> > Out of memory: Kill process 1333 (mdadm) score 1 or sacrifice child
> > Out of memory: Kill process 1365 (rpc.idmapd) score 1 or sacrifice child
> > Out of memory: Kill process 1403 (dbus-daemon) score 1 or sacrifice child
> > Out of memory: Kill process 1438 (acpid) score 1 or sacrifice child
> > Out of memory: Kill process 1447 (hald) score 1 or sacrifice child
> > Out of memory: Kill process 1447 (hald) score 1 or sacrifice child
> > Out of memory: Kill process 1487 (hald-addon-inpu) score 1 or sacrifice child
> > Out of memory: Kill process 1488 (hald-addon-acpi) score 1 or sacrifice child
> > Out of memory: Kill process 1507 (automount) score 1 or sacrifice child

The problems are three.

1) if two processes have the same oom score, we should kill younger process.
but current logic kill older. Typically oldest processes are system daemons.
2) Current logic use 'unsigned int' for internal score calculation. (exactly says,
it only use 0-1000 value). its very low precision calculation makes a lot of
same oom score and kill an ineligible process.
3) Current logic give 3% of SystemRAM to root processes. It obviously too big
if you have plenty memory. Now, your fork-bomb processes have 500MB OOM immune
bonus. then your fork-bomb never ever be killed.

Changes from v2
 o added [patch 1/5] use euid instead of CAP_SYS_ADMIN


KOSAKI Motohiro (6):
  oom: use euid instead of CAP_SYS_ADMIN for protection root process
  oom: improve dump_tasks() show items
  oom: kill younger process first
  oom: oom-killer don't use proportion of system-ram internally
  oom: don't kill random process
  oom: merge oom_kill_process() with oom_kill_task()

 fs/proc/base.c        |   13 ++-
 include/linux/oom.h   |    5 +-
 include/linux/sched.h |   11 +++
 mm/oom_kill.c         |  201 ++++++++++++++++++++++++++----------------------
 4 files changed, 131 insertions(+), 99 deletions(-)

-- 
1.7.3.1


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH 1/6] oom: use euid instead of CAP_SYS_ADMIN for protection root process
  2011-06-22 10:45 [PATCH v3 0/6] Fix oom killer doesn't work at all if system have > gigabytes memory (aka CAI founded issue) KOSAKI Motohiro
@ 2011-06-22 10:46 ` KOSAKI Motohiro
  2011-06-22 22:57   ` David Rientjes
  2011-06-22 10:47 ` [PATCH 2/6] oom: improve dump_tasks() show items KOSAKI Motohiro
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 12+ messages in thread
From: KOSAKI Motohiro @ 2011-06-22 10:46 UTC (permalink / raw)
  To: kosaki.motohiro
  Cc: linux-mm, linux-kernel, akpm, caiqian, rientjes, hughd,
	kamezawa.hiroyu, minchan.kim, oleg

Recently, many userland daemon prefer to use libcap-ng and drop
all privilege just after startup. Because of (1) Almost privilege
are necessary only when special file open, and aren't necessary
read and write. (2) In general, privilege dropping brings better
protection from exploit when bugs are found in the daemon.

But, it makes suboptimal oom-killer behavior. CAI Qian reported
oom killer killed some important daemon at first on his fedora
like distro. Because they've lost CAP_SYS_ADMIN.

Of course, we recommend to drop privileges as far as possible
instead of keeping them. Thus, oom killer don't have to check
any capability. It implicitly suggest wrong programming style.

This patch change root process check way from CAP_SYS_ADMIN to
just euid==0.

Reported-by: CAI Qian <caiqian@redhat.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
 mm/oom_kill.c |    9 +++++----
 1 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index e4b0991..f552e39 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -203,7 +203,7 @@ unsigned int oom_badness(struct task_struct *p, struct mem_cgroup *mem,
 	 * Root processes get 3% bonus, just like the __vm_enough_memory()
 	 * implementation used by LSMs.
 	 */
-	if (has_capability_noaudit(p, CAP_SYS_ADMIN))
+	if (task_euid(p) == 0)
 		points -= 30;

 	/*
@@ -373,7 +373,7 @@ static void dump_tasks(const struct mem_cgroup *mem, const nodemask_t *nodemask)
 	struct task_struct *p;
 	struct task_struct *task;

-	pr_info("[ pid ]   uid  tgid total_vm      rss cpu oom_adj oom_score_adj name\n");
+	pr_info("[ pid ]   uid  euid  tgid total_vm      rss cpu oom_adj oom_score_adj name\n");
 	for_each_process(p) {
 		if (oom_unkillable_task(p, mem, nodemask))
 			continue;
@@ -388,8 +388,9 @@ static void dump_tasks(const struct mem_cgroup *mem, const nodemask_t *nodemask)
 			continue;
 		}

-		pr_info("[%5d] %5d %5d %8lu %8lu %3u     %3d         %5d %s\n",
-			task->pid, task_uid(task), task->tgid,
+		pr_info("[%5d] %5d %5d %5d %8lu %8lu %3u     %3d         %5d %s\n",
+			task->pid, task_uid(task), task_euid(task),
+			task->tgid,
 			task->mm->total_vm, get_mm_rss(task->mm),
 			task_cpu(task), task->signal->oom_adj,
 			task->signal->oom_score_adj, task->comm);
-- 
1.7.3.1



--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH 2/6] oom: improve dump_tasks() show items
  2011-06-22 10:45 [PATCH v3 0/6] Fix oom killer doesn't work at all if system have > gigabytes memory (aka CAI founded issue) KOSAKI Motohiro
  2011-06-22 10:46 ` [PATCH 1/6] oom: use euid instead of CAP_SYS_ADMIN for protection root process KOSAKI Motohiro
@ 2011-06-22 10:47 ` KOSAKI Motohiro
  2011-06-22 22:59   ` David Rientjes
  2011-06-22 10:47 ` [PATCH 3/6] oom: kill younger process first KOSAKI Motohiro
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 12+ messages in thread
From: KOSAKI Motohiro @ 2011-06-22 10:47 UTC (permalink / raw)
  To: kosaki.motohiro
  Cc: linux-mm, linux-kernel, akpm, caiqian, rientjes, hughd,
	kamezawa.hiroyu, minchan.kim, oleg

Recently, oom internal logic was dramatically changed. Thus
dump_tasks() doesn't show enough information for bug report
analysis. it has some meaningless items and don't have some
oom socre related items.

This patch adapt displaying fields to new oom logic.

details
--------
removed: pid (we always kill process. don't need thread id),
         signal->oom_adj (we no longer uses it internally)
	 cpu (we no longer uses it)
added:  ppid (we often kill sacrifice child process)
        swap (it's accounted)
modify: RSS (account mm->nr_ptes too)

<old>
[ pid ]   uid  tgid total_vm      rss cpu oom_adj oom_score_adj name
[ 3886]     0  3886     2893      441   1       0             0 bash
[ 3905]     0  3905    29361    25833   0       0             0 memtoy

<new>
[   pid]   ppid   uid euid total_vm      rss     swap score_adj name
[   417]      1     0    0     3298       12      184     -1000 udevd
[   830]      1     0    0     1776       11       16         0 system-setup-ke
[   973]      1     0    0    61179       35      116         0 rsyslogd
[  1733]   1732     0    0  1052337   958582        0         0 memtoy

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
 mm/oom_kill.c |   16 +++++++++-------
 1 files changed, 9 insertions(+), 7 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index f552e39..9412657 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -373,7 +373,7 @@ static void dump_tasks(const struct mem_cgroup *mem, const nodemask_t *nodemask)
 	struct task_struct *p;
 	struct task_struct *task;

-	pr_info("[ pid ]   uid  euid  tgid total_vm      rss cpu oom_adj oom_score_adj name\n");
+	pr_info("[   pid]   ppid   uid  euid total_vm      rss     swap score_adj name\n");
 	for_each_process(p) {
 		if (oom_unkillable_task(p, mem, nodemask))
 			continue;
@@ -388,12 +388,14 @@ static void dump_tasks(const struct mem_cgroup *mem, const nodemask_t *nodemask)
 			continue;
 		}

-		pr_info("[%5d] %5d %5d %5d %8lu %8lu %3u     %3d         %5d %s\n",
-			task->pid, task_uid(task), task_euid(task),
-			task->tgid,
-			task->mm->total_vm, get_mm_rss(task->mm),
-			task_cpu(task), task->signal->oom_adj,
-			task->signal->oom_score_adj, task->comm);
+		pr_info("[%6d] %6d %5d %5d %8lu %8lu %8lu %9d %s\n",
+			task_tgid_nr(task), task_tgid_nr(task->real_parent),
+			task_uid(task),	task_euid(task),
+			task->mm->total_vm,
+			get_mm_rss(task->mm) + task->mm->nr_ptes,
+			get_mm_counter(task->mm, MM_SWAPENTS),
+			task->signal->oom_score_adj,
+			task->comm);
 		task_unlock(task);
 	}
 }
-- 
1.7.3.1



--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH 3/6] oom: kill younger process first
  2011-06-22 10:45 [PATCH v3 0/6] Fix oom killer doesn't work at all if system have > gigabytes memory (aka CAI founded issue) KOSAKI Motohiro
  2011-06-22 10:46 ` [PATCH 1/6] oom: use euid instead of CAP_SYS_ADMIN for protection root process KOSAKI Motohiro
  2011-06-22 10:47 ` [PATCH 2/6] oom: improve dump_tasks() show items KOSAKI Motohiro
@ 2011-06-22 10:47 ` KOSAKI Motohiro
  2011-06-22 23:01   ` David Rientjes
  2011-06-22 10:48 ` [PATCH 4/6] oom: oom-killer don't use proportion of system-ram internally KOSAKI Motohiro
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 12+ messages in thread
From: KOSAKI Motohiro @ 2011-06-22 10:47 UTC (permalink / raw)
  To: kosaki.motohiro
  Cc: linux-mm, linux-kernel, akpm, caiqian, rientjes, hughd,
	kamezawa.hiroyu, minchan.kim, oleg

This patch introduces do_each_thread_reverse() and select_bad_process()
uses it. The benefits are two, 1) oom-killer can kill younger process
than older if they have a same oom score. Usually younger process is
less important. 2) younger task often have PF_EXITING because shell
script makes a lot of short lived processes. Reverse order search can
detect it faster.

Reported-by: CAI Qian <caiqian@redhat.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 include/linux/sched.h |   11 +++++++++++
 mm/oom_kill.c         |    2 +-
 2 files changed, 12 insertions(+), 1 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index e4e6d7b..392ff30 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2257,6 +2257,9 @@ static inline unsigned long wait_task_inactive(struct task_struct *p,
 #define next_task(p) \
 	list_entry_rcu((p)->tasks.next, struct task_struct, tasks)

+#define prev_task(p) \
+	list_entry((p)->tasks.prev, struct task_struct, tasks)
+
 #define for_each_process(p) \
 	for (p = &init_task ; (p = next_task(p)) != &init_task ; )

@@ -2269,6 +2272,14 @@ extern bool current_is_single_threaded(void);
 #define do_each_thread(g, t) \
 	for (g = t = &init_task ; (g = t = next_task(g)) != &init_task ; ) do

+/*
+ * Similar to do_each_thread(). but two difference are there.
+ *  - traverse tasks reverse order (i.e. younger to older)
+ *  - caller must hold tasklist_lock. rcu_read_lock isn't enough
+*/
+#define do_each_thread_reverse(g, t) \
+	for (g = t = &init_task ; (g = t = prev_task(g)) != &init_task ; ) do
+
 #define while_each_thread(g, t) \
 	while ((t = next_thread(t)) != g)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 9412657..797308b 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -300,7 +300,7 @@ static struct task_struct *select_bad_process(unsigned int *ppoints,
 	struct task_struct *chosen = NULL;
 	*ppoints = 0;

-	do_each_thread(g, p) {
+	do_each_thread_reverse(g, p) {
 		unsigned int points;

 		if (!p->mm)
-- 
1.7.3.1



--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH 4/6] oom: oom-killer don't use proportion of system-ram internally
  2011-06-22 10:45 [PATCH v3 0/6] Fix oom killer doesn't work at all if system have > gigabytes memory (aka CAI founded issue) KOSAKI Motohiro
                   ` (2 preceding siblings ...)
  2011-06-22 10:47 ` [PATCH 3/6] oom: kill younger process first KOSAKI Motohiro
@ 2011-06-22 10:48 ` KOSAKI Motohiro
  2011-06-22 23:16   ` David Rientjes
  2011-06-22 10:48 ` [PATCH 5/6] oom: don't kill random process KOSAKI Motohiro
  2011-06-22 10:49 ` [PATCH 6/6] oom: merge oom_kill_process() with oom_kill_task() KOSAKI Motohiro
  5 siblings, 1 reply; 12+ messages in thread
From: KOSAKI Motohiro @ 2011-06-22 10:48 UTC (permalink / raw)
  To: kosaki.motohiro
  Cc: linux-mm, linux-kernel, akpm, caiqian, rientjes, hughd,
	kamezawa.hiroyu, minchan.kim, oleg

CAI Qian reported his kernel did hang-up if he ran fork intensive
workload and then invoke oom-killer.

The problem is, current oom calculation uses 0-1000 normalized value
(The unit is a permillage of system-ram). Its low precision make
a lot of same oom score. IOW, in his case, all processes have smaller
oom score than 1 and internal calculation round it to 1.

Thus oom-killer kill ineligible process. This regression is caused by
commit a63d83f427 (oom: badness heuristic rewrite).

The solution is, the internal calculation just use number of pages
instead of permillage of system-ram. And convert it to permillage
value at displaying time.

This patch doesn't change any ABI (included  /proc/<pid>/oom_score_adj)
even though current logic has a lot of my dislike thing.

Reported-by: CAI Qian <caiqian@redhat.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
 fs/proc/base.c      |   13 ++++++----
 include/linux/oom.h |    2 +-
 mm/oom_kill.c       |   60 ++++++++++++++++++++++++++++++--------------------
 3 files changed, 45 insertions(+), 30 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index 14def99..4a10763 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -479,14 +479,17 @@ static const struct file_operations proc_lstats_operations = {

 static int proc_oom_score(struct task_struct *task, char *buffer)
 {
-	unsigned long points = 0;
+	unsigned long points;
+	unsigned long ratio = 0;
+	unsigned long totalpages = totalram_pages + total_swap_pages + 1;

 	read_lock(&tasklist_lock);
-	if (pid_alive(task))
-		points = oom_badness(task, NULL, NULL,
-					totalram_pages + total_swap_pages);
+	if (pid_alive(task)) {
+		points = oom_badness(task, NULL, NULL, totalpages);
+		ratio = points * 1000 / totalpages;
+	}
 	read_unlock(&tasklist_lock);
-	return sprintf(buffer, "%lu\n", points);
+	return sprintf(buffer, "%lu\n", ratio);
 }

 struct limit_names {
diff --git a/include/linux/oom.h b/include/linux/oom.h
index 4952fb8..75b104c 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -42,7 +42,7 @@ enum oom_constraint {

 extern int test_set_oom_score_adj(int new_val);

-extern unsigned int oom_badness(struct task_struct *p, struct mem_cgroup *mem,
+extern unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *mem,
 			const nodemask_t *nodemask, unsigned long totalpages);
 extern int try_set_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_flags);
 extern void clear_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_flags);
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 797308b..cff8000 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -159,10 +159,11 @@ static bool oom_unkillable_task(struct task_struct *p,
  * predictable as possible.  The goal is to return the highest value for the
  * task consuming the most memory to avoid subsequent oom failures.
  */
-unsigned int oom_badness(struct task_struct *p, struct mem_cgroup *mem,
+unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *mem,
 		      const nodemask_t *nodemask, unsigned long totalpages)
 {
-	int points;
+	unsigned long points;
+	unsigned long score_adj = 0;

 	if (oom_unkillable_task(p, mem, nodemask))
 		return 0;
@@ -194,33 +195,44 @@ unsigned int oom_badness(struct task_struct *p, struct mem_cgroup *mem,
 	 */
 	points = get_mm_rss(p->mm) + p->mm->nr_ptes;
 	points += get_mm_counter(p->mm, MM_SWAPENTS);
-
-	points *= 1000;
-	points /= totalpages;
 	task_unlock(p);

-	/*
-	 * Root processes get 3% bonus, just like the __vm_enough_memory()
-	 * implementation used by LSMs.
-	 */
-	if (task_euid(p) == 0)
-		points -= 30;
+	/* Root processes get 3% bonus. */
+	if (task_euid(p) == 0) {
+		if (points >= totalpages / 32)
+			points -= totalpages / 32;
+		else
+			points = 0;
+	}

 	/*
 	 * /proc/pid/oom_score_adj ranges from -1000 to +1000 such that it may
 	 * either completely disable oom killing or always prefer a certain
 	 * task.
 	 */
-	points += p->signal->oom_score_adj;
+	if (p->signal->oom_score_adj >= 0) {
+		score_adj = p->signal->oom_score_adj * (totalpages / 1000);
+		if (ULONG_MAX - points >= score_adj)
+			points += score_adj;
+		else
+			points = ULONG_MAX;
+	} else {
+		score_adj = -p->signal->oom_score_adj * (totalpages / 1000);
+		if (points >= score_adj)
+			points -= score_adj;
+		else
+			points = 0;
+	}

 	/*
 	 * Never return 0 for an eligible task that may be killed since it's
 	 * possible that no single user task uses more than 0.1% of memory and
 	 * no single admin tasks uses more than 3.0%.
 	 */
-	if (points <= 0)
-		return 1;
-	return (points < 1000) ? points : 1000;
+	if (!points)
+		points = 1;
+
+	return points;
 }

 /*
@@ -292,7 +304,7 @@ static enum oom_constraint constrained_alloc(struct zonelist *zonelist,
  *
  * (not docbooked, we don't want this one cluttering up the manual)
  */
-static struct task_struct *select_bad_process(unsigned int *ppoints,
+static struct task_struct *select_bad_process(unsigned long *ppoints,
 		unsigned long totalpages, struct mem_cgroup *mem,
 		const nodemask_t *nodemask)
 {
@@ -301,7 +313,7 @@ static struct task_struct *select_bad_process(unsigned int *ppoints,
 	*ppoints = 0;

 	do_each_thread_reverse(g, p) {
-		unsigned int points;
+		unsigned long points;

 		if (!p->mm)
 			continue;
@@ -332,7 +344,7 @@ static struct task_struct *select_bad_process(unsigned int *ppoints,
 			 */
 			if (p == current) {
 				chosen = p;
-				*ppoints = 1000;
+				*ppoints = ULONG_MAX;
 			} else {
 				/*
 				 * If this task is not being ptraced on exit,
@@ -463,14 +475,14 @@ static int oom_kill_task(struct task_struct *p, struct mem_cgroup *mem)
 #undef K

 static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
-			    unsigned int points, unsigned long totalpages,
+			    unsigned long points, unsigned long totalpages,
 			    struct mem_cgroup *mem, nodemask_t *nodemask,
 			    const char *message)
 {
 	struct task_struct *victim = p;
 	struct task_struct *child;
 	struct task_struct *t = p;
-	unsigned int victim_points = 0;
+	unsigned long victim_points = 0;

 	if (printk_ratelimit())
 		dump_header(p, gfp_mask, order, mem, nodemask);
@@ -485,7 +497,7 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 	}

 	task_lock(p);
-	pr_err("%s: Kill process %d (%s) score %d or sacrifice child\n",
+	pr_err("%s: Kill process %d (%s) points %lu or sacrifice child\n",
 		message, task_pid_nr(p), p->comm, points);
 	task_unlock(p);

@@ -497,7 +509,7 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 	 */
 	do {
 		list_for_each_entry(child, &t->children, sibling) {
-			unsigned int child_points;
+			unsigned long child_points;

 			if (child->mm == p->mm)
 				continue;
@@ -544,7 +556,7 @@ static void check_panic_on_oom(enum oom_constraint constraint, gfp_t gfp_mask,
 void mem_cgroup_out_of_memory(struct mem_cgroup *mem, gfp_t gfp_mask)
 {
 	unsigned long limit;
-	unsigned int points = 0;
+	unsigned long points = 0;
 	struct task_struct *p;

 	/*
@@ -693,7 +705,7 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
 	struct task_struct *p;
 	unsigned long totalpages;
 	unsigned long freed = 0;
-	unsigned int points;
+	unsigned long points;
 	enum oom_constraint constraint = CONSTRAINT_NONE;
 	int killed = 0;

-- 
1.7.3.1



--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH 5/6] oom: don't kill random process
  2011-06-22 10:45 [PATCH v3 0/6] Fix oom killer doesn't work at all if system have > gigabytes memory (aka CAI founded issue) KOSAKI Motohiro
                   ` (3 preceding siblings ...)
  2011-06-22 10:48 ` [PATCH 4/6] oom: oom-killer don't use proportion of system-ram internally KOSAKI Motohiro
@ 2011-06-22 10:48 ` KOSAKI Motohiro
  2011-06-22 23:22   ` David Rientjes
  2011-06-22 10:49 ` [PATCH 6/6] oom: merge oom_kill_process() with oom_kill_task() KOSAKI Motohiro
  5 siblings, 1 reply; 12+ messages in thread
From: KOSAKI Motohiro @ 2011-06-22 10:48 UTC (permalink / raw)
  To: kosaki.motohiro
  Cc: linux-mm, linux-kernel, akpm, caiqian, rientjes, hughd,
	kamezawa.hiroyu, minchan.kim, oleg

CAI Qian reported oom-killer killed all system daemons in his
system at first if he ran fork bomb as root. The problem is,
current logic give them bonus of 3% of system ram. Example,
he has 16GB machine, then root processes have ~500MB oom
immune. It bring us crazy bad result. _all_ processes have
oom-score=1 and then, oom killer ignore process memroy usage
and kill random process. This regression is caused by commit
a63d83f427 (oom: badness heuristic rewrite).

This patch changes select_bad_process() slightly. If oom points == 1,
it's a sign that the system have only root privileged processes or
similar. Thus, select_bad_process() calculate oom badness without
root bonus and select eligible process.

Also, this patch move finding sacrifice child logic into
select_bad_process(). It's necessary to implement adequate
no root bonus recalculation. and it makes good side effect,
current logic doesn't behave as the doc.

Documentation/sysctl/vm.txt says

    oom_kill_allocating_task

    If this is set to non-zero, the OOM killer simply kills the task that
    triggered the out-of-memory condition.  This avoids the expensive
    tasklist scan.

IOW, oom_kill_allocating_task shouldn't search sacrifice child.
This patch also fixes this issue.

Reported-by: CAI Qian <caiqian@redhat.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
 fs/proc/base.c      |    2 +-
 include/linux/oom.h |    3 +-
 mm/oom_kill.c       |   89 ++++++++++++++++++++++++++++----------------------
 3 files changed, 53 insertions(+), 41 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index 4a10763..5e4a8a1 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -485,7 +485,7 @@ static int proc_oom_score(struct task_struct *task, char *buffer)

 	read_lock(&tasklist_lock);
 	if (pid_alive(task)) {
-		points = oom_badness(task, NULL, NULL, totalpages);
+		points = oom_badness(task, NULL, NULL, totalpages, 1);
 		ratio = points * 1000 / totalpages;
 	}
 	read_unlock(&tasklist_lock);
diff --git a/include/linux/oom.h b/include/linux/oom.h
index 75b104c..272e3bb 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -43,7 +43,8 @@ enum oom_constraint {
 extern int test_set_oom_score_adj(int new_val);

 extern unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *mem,
-			const nodemask_t *nodemask, unsigned long totalpages);
+			const nodemask_t *nodemask, unsigned long totalpages,
+			int protect_root);
 extern int try_set_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_flags);
 extern void clear_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_flags);

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index cff8000..cf48fd5 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -160,7 +160,8 @@ static bool oom_unkillable_task(struct task_struct *p,
  * task consuming the most memory to avoid subsequent oom failures.
  */
 unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *mem,
-		      const nodemask_t *nodemask, unsigned long totalpages)
+			 const nodemask_t *nodemask, unsigned long totalpages,
+			 int protect_root)
 {
 	unsigned long points;
 	unsigned long score_adj = 0;
@@ -198,7 +199,7 @@ unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *mem,
 	task_unlock(p);

 	/* Root processes get 3% bonus. */
-	if (task_euid(p) == 0) {
+	if (protect_root && task_euid(p) == 0) {
 		if (points >= totalpages / 32)
 			points -= totalpages / 32;
 		else
@@ -310,8 +311,11 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
 {
 	struct task_struct *g, *p;
 	struct task_struct *chosen = NULL;
-	*ppoints = 0;
+	int protect_root = 1;
+	unsigned long chosen_points = 0;
+	struct task_struct *child;

+ retry:
 	do_each_thread_reverse(g, p) {
 		unsigned long points;

@@ -344,7 +348,7 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
 			 */
 			if (p == current) {
 				chosen = p;
-				*ppoints = ULONG_MAX;
+				chosen_points = ULONG_MAX;
 			} else {
 				/*
 				 * If this task is not being ptraced on exit,
@@ -357,13 +361,49 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
 			}
 		}

-		points = oom_badness(p, mem, nodemask, totalpages);
-		if (points > *ppoints) {
+		points = oom_badness(p, mem, nodemask, totalpages, protect_root);
+		if (points > chosen_points) {
 			chosen = p;
-			*ppoints = points;
+			chosen_points = points;
 		}
 	} while_each_thread(g, p);

+	/*
+	 * chosen_point==1 may be a sign that root privilege bonus is too large
+	 * and we choose wrong task. Let's recalculate oom score without the
+	 * dubious bonus.
+	 */
+	if (protect_root && (chosen_points == 1)) {
+		protect_root = 0;
+		goto retry;
+	}
+
+	/*
+	 * If any of p's children has a different mm and is eligible for kill,
+	 * the one with the highest badness() score is sacrificed for its
+	 * parent.  This attempts to lose the minimal amount of work done while
+	 * still freeing memory.
+	 */
+	g = p = chosen;
+	do {
+		list_for_each_entry(child, &p->children, sibling) {
+			unsigned long child_points;
+
+			if (child->mm == p->mm)
+				continue;
+			/*
+			 * oom_badness() returns 0 if the thread is unkillable
+			 */
+			child_points = oom_badness(child, mem, nodemask,
+						   totalpages, protect_root);
+			if (child_points > chosen_points) {
+				chosen = child;
+				chosen_points = child_points;
+			}
+		}
+	} while_each_thread(g, p);
+
+	*ppoints = chosen_points;
 	return chosen;
 }

@@ -479,11 +519,6 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 			    struct mem_cgroup *mem, nodemask_t *nodemask,
 			    const char *message)
 {
-	struct task_struct *victim = p;
-	struct task_struct *child;
-	struct task_struct *t = p;
-	unsigned long victim_points = 0;
-
 	if (printk_ratelimit())
 		dump_header(p, gfp_mask, order, mem, nodemask);

@@ -497,35 +532,11 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 	}

 	task_lock(p);
-	pr_err("%s: Kill process %d (%s) points %lu or sacrifice child\n",
-		message, task_pid_nr(p), p->comm, points);
+	pr_err("%s: Kill process %d (%s) points %lu\n",
+	       message, task_pid_nr(p), p->comm, points);
 	task_unlock(p);

-	/*
-	 * If any of p's children has a different mm and is eligible for kill,
-	 * the one with the highest badness() score is sacrificed for its
-	 * parent.  This attempts to lose the minimal amount of work done while
-	 * still freeing memory.
-	 */
-	do {
-		list_for_each_entry(child, &t->children, sibling) {
-			unsigned long child_points;
-
-			if (child->mm == p->mm)
-				continue;
-			/*
-			 * oom_badness() returns 0 if the thread is unkillable
-			 */
-			child_points = oom_badness(child, mem, nodemask,
-								totalpages);
-			if (child_points > victim_points) {
-				victim = child;
-				victim_points = child_points;
-			}
-		}
-	} while_each_thread(p, t);
-
-	return oom_kill_task(victim, mem);
+	return oom_kill_task(p, mem);
 }

 /*
-- 
1.7.3.1



--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH 6/6] oom: merge oom_kill_process() with oom_kill_task()
  2011-06-22 10:45 [PATCH v3 0/6] Fix oom killer doesn't work at all if system have > gigabytes memory (aka CAI founded issue) KOSAKI Motohiro
                   ` (4 preceding siblings ...)
  2011-06-22 10:48 ` [PATCH 5/6] oom: don't kill random process KOSAKI Motohiro
@ 2011-06-22 10:49 ` KOSAKI Motohiro
  5 siblings, 0 replies; 12+ messages in thread
From: KOSAKI Motohiro @ 2011-06-22 10:49 UTC (permalink / raw)
  To: kosaki.motohiro
  Cc: linux-mm, linux-kernel, akpm, caiqian, rientjes, hughd,
	kamezawa.hiroyu, minchan.kim, oleg

Now, oom_kill_process() become almost empty function. Let's
merge it with oom_kill_task().

Also, this patch replace task_pid_nr() with task_tgid_nr().
Because 1) oom killer kill a process, not thread. 2) a userland
don't care thread id.

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
 mm/oom_kill.c |   53 ++++++++++++++++++++++-------------------------------
 1 files changed, 22 insertions(+), 31 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index cf48fd5..2fdbb96 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -470,11 +470,26 @@ static void dump_header(struct task_struct *p, gfp_t gfp_mask, int order,
 }

 #define K(x) ((x) << (PAGE_SHIFT-10))
-static int oom_kill_task(struct task_struct *p, struct mem_cgroup *mem)
+static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
+			    unsigned long points, unsigned long totalpages,
+			    struct mem_cgroup *mem, nodemask_t *nodemask,
+			    const char *message)
 {
 	struct task_struct *q;
 	struct mm_struct *mm;

+	if (printk_ratelimit())
+		dump_header(p, gfp_mask, order, mem, nodemask);
+
+	/*
+	 * If the task is already exiting, don't alarm the sysadmin or kill
+	 * its children or threads, just set TIF_MEMDIE so it can die quickly
+	 */
+	if (p->flags & PF_EXITING) {
+		set_tsk_thread_flag(p, TIF_MEMDIE);
+		return 0;
+	}
+
 	p = find_lock_task_mm(p);
 	if (!p)
 		return 1;
@@ -482,10 +497,11 @@ static int oom_kill_task(struct task_struct *p, struct mem_cgroup *mem)
 	/* mm cannot be safely dereferenced after task_unlock(p) */
 	mm = p->mm;

-	pr_err("Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB\n",
-		task_pid_nr(p), p->comm, K(p->mm->total_vm),
-		K(get_mm_counter(p->mm, MM_ANONPAGES)),
-		K(get_mm_counter(p->mm, MM_FILEPAGES)));
+	pr_err("%s: Kill process %d (%s) points:%lu total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB\n",
+	       message, task_tgid_nr(p), p->comm, points,
+	       K(p->mm->total_vm),
+	       K(get_mm_counter(p->mm, MM_ANONPAGES)),
+	       K(get_mm_counter(p->mm, MM_FILEPAGES)));
 	task_unlock(p);

 	/*
@@ -502,7 +518,7 @@ static int oom_kill_task(struct task_struct *p, struct mem_cgroup *mem)
 		if (q->mm == mm && !same_thread_group(q, p)) {
 			task_lock(q);	/* Protect ->comm from prctl() */
 			pr_err("Kill process %d (%s) sharing same memory\n",
-				task_pid_nr(q), q->comm);
+				task_tgid_nr(q), q->comm);
 			task_unlock(q);
 			force_sig(SIGKILL, q);
 		}
@@ -514,31 +530,6 @@ static int oom_kill_task(struct task_struct *p, struct mem_cgroup *mem)
 }
 #undef K

-static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
-			    unsigned long points, unsigned long totalpages,
-			    struct mem_cgroup *mem, nodemask_t *nodemask,
-			    const char *message)
-{
-	if (printk_ratelimit())
-		dump_header(p, gfp_mask, order, mem, nodemask);
-
-	/*
-	 * If the task is already exiting, don't alarm the sysadmin or kill
-	 * its children or threads, just set TIF_MEMDIE so it can die quickly
-	 */
-	if (p->flags & PF_EXITING) {
-		set_tsk_thread_flag(p, TIF_MEMDIE);
-		return 0;
-	}
-
-	task_lock(p);
-	pr_err("%s: Kill process %d (%s) points %lu\n",
-	       message, task_pid_nr(p), p->comm, points);
-	task_unlock(p);
-
-	return oom_kill_task(p, mem);
-}
-
 /*
  * Determines whether the kernel must panic because of the panic_on_oom sysctl.
  */
-- 
1.7.3.1



--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH 1/6] oom: use euid instead of CAP_SYS_ADMIN for protection root process
  2011-06-22 10:46 ` [PATCH 1/6] oom: use euid instead of CAP_SYS_ADMIN for protection root process KOSAKI Motohiro
@ 2011-06-22 22:57   ` David Rientjes
  0 siblings, 0 replies; 12+ messages in thread
From: David Rientjes @ 2011-06-22 22:57 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: linux-mm, linux-kernel, Andrew Morton, caiqian, Hugh Dickins,
	KAMEZAWA Hiroyuki, Minchan Kim, Oleg Nesterov

On Wed, 22 Jun 2011, KOSAKI Motohiro wrote:

> Recently, many userland daemon prefer to use libcap-ng and drop
> all privilege just after startup. Because of (1) Almost privilege
> are necessary only when special file open, and aren't necessary
> read and write. (2) In general, privilege dropping brings better
> protection from exploit when bugs are found in the daemon.
> 

You could also say that dropping the capability drops the bonus it is 
given in the oom killer.  We've never promised any benefit in the oom 
killer badness scoring without the capability.

> But, it makes suboptimal oom-killer behavior. CAI Qian reported
> oom killer killed some important daemon at first on his fedora
> like distro. Because they've lost CAP_SYS_ADMIN.
> 

I disagree that we should be identifying "important daemons" by tying it 
the effective uid of the process and thus making some sort of inference 
because a thread was forked by root.  I think it is more clear to tie that 
to an actual capability that is present, such as CAP_SYS_ADMIN, or suggest 
that the user give the "important daemon" it's own bonus by tuning 
/proc/pid/oom_score_adj.

We already know that the kernel will not be able to identify critical 
processes perfectly, that's an assumption that we can live with.  We must 
rely on userspace to influence that decision by using the tunable.

If this patch were merged, I could easily imagine an argument in the 
reverse that would just simply revert it: it would be very easy to say 
that CAP_SYS_ADMIN has always given this bonus in recent memory so 
changing it would be a regression over the previous behavior and/or that 
giving the capability to a thread as it runs implies that it should have 
the bonus when the euid may not be 0.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 2/6] oom: improve dump_tasks() show items
  2011-06-22 10:47 ` [PATCH 2/6] oom: improve dump_tasks() show items KOSAKI Motohiro
@ 2011-06-22 22:59   ` David Rientjes
  0 siblings, 0 replies; 12+ messages in thread
From: David Rientjes @ 2011-06-22 22:59 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: linux-mm, linux-kernel, akpm, caiqian, hughd, kamezawa.hiroyu,
	minchan.kim, oleg

On Wed, 22 Jun 2011, KOSAKI Motohiro wrote:

> Recently, oom internal logic was dramatically changed. Thus
> dump_tasks() doesn't show enough information for bug report
> analysis. it has some meaningless items and don't have some
> oom socre related items.
> 
> This patch adapt displaying fields to new oom logic.
> 
> details
> --------
> removed: pid (we always kill process. don't need thread id),
>          signal->oom_adj (we no longer uses it internally)
> 	 cpu (we no longer uses it)
> added:  ppid (we often kill sacrifice child process)
>         swap (it's accounted)
> modify: RSS (account mm->nr_ptes too)
> 
> <old>
> [ pid ]   uid  tgid total_vm      rss cpu oom_adj oom_score_adj name
> [ 3886]     0  3886     2893      441   1       0             0 bash
> [ 3905]     0  3905    29361    25833   0       0             0 memtoy
> 
> <new>
> [   pid]   ppid   uid euid total_vm      rss     swap score_adj name
> [   417]      1     0    0     3298       12      184     -1000 udevd
> [   830]      1     0    0     1776       11       16         0 system-setup-ke
> [   973]      1     0    0    61179       35      116         0 rsyslogd
> [  1733]   1732     0    0  1052337   958582        0         0 memtoy
> 

I like this very much!  I'm always supportive of providing additional 
information that will allow users to investigate oom conditions more 
thoroughly.

I'm not sure that we should be exporting the euid, however, since I 
disagreed with using it in the badness heuristic of the first patch.  
Let's talk about it there and then perhaps it can be removed from the 
tasklist dump if we don't actually end up using it?

Otherwise, it looks good!

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 3/6] oom: kill younger process first
  2011-06-22 10:47 ` [PATCH 3/6] oom: kill younger process first KOSAKI Motohiro
@ 2011-06-22 23:01   ` David Rientjes
  0 siblings, 0 replies; 12+ messages in thread
From: David Rientjes @ 2011-06-22 23:01 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: linux-mm, linux-kernel, Andrew Morton, caiqian, Hugh Dickins,
	KAMEZAWA Hiroyuki, Minchan Kim, Oleg Nesterov

On Wed, 22 Jun 2011, KOSAKI Motohiro wrote:

> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index e4e6d7b..392ff30 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -2257,6 +2257,9 @@ static inline unsigned long wait_task_inactive(struct task_struct *p,
>  #define next_task(p) \
>  	list_entry_rcu((p)->tasks.next, struct task_struct, tasks)
> 
> +#define prev_task(p) \
> +	list_entry((p)->tasks.prev, struct task_struct, tasks)
> +
>  #define for_each_process(p) \
>  	for (p = &init_task ; (p = next_task(p)) != &init_task ; )
> 
> @@ -2269,6 +2272,14 @@ extern bool current_is_single_threaded(void);
>  #define do_each_thread(g, t) \
>  	for (g = t = &init_task ; (g = t = next_task(g)) != &init_task ; ) do
> 
> +/*
> + * Similar to do_each_thread(). but two difference are there.
> + *  - traverse tasks reverse order (i.e. younger to older)
> + *  - caller must hold tasklist_lock. rcu_read_lock isn't enough
> +*/
> +#define do_each_thread_reverse(g, t) \
> +	for (g = t = &init_task ; (g = t = prev_task(g)) != &init_task ; ) do
> +
>  #define while_each_thread(g, t) \
>  	while ((t = next_thread(t)) != g)
> 

I've already ack'd the patch, but if there is another version posted as a 
result of our discussion of using euid in the heuristic, I think it would 
be helpful to reiterate in this comment that, like do_each_thread(), 
do_each_thread_reverse() is not break-safe either.  It might end up 
preventing a bug down the road.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 4/6] oom: oom-killer don't use proportion of system-ram internally
  2011-06-22 10:48 ` [PATCH 4/6] oom: oom-killer don't use proportion of system-ram internally KOSAKI Motohiro
@ 2011-06-22 23:16   ` David Rientjes
  0 siblings, 0 replies; 12+ messages in thread
From: David Rientjes @ 2011-06-22 23:16 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: linux-mm, linux-kernel, Andrew Morton, caiqian, Hugh Dickins,
	KAMEZAWA Hiroyuki, Minchan Kim, Oleg Nesterov

On Wed, 22 Jun 2011, KOSAKI Motohiro wrote:

> CAI Qian reported his kernel did hang-up if he ran fork intensive
> workload and then invoke oom-killer.
> 
> The problem is, current oom calculation uses 0-1000 normalized value
> (The unit is a permillage of system-ram). Its low precision make
> a lot of same oom score. IOW, in his case, all processes have smaller
> oom score than 1 and internal calculation round it to 1.
> 
> Thus oom-killer kill ineligible process. This regression is caused by
> commit a63d83f427 (oom: badness heuristic rewrite).
> 
> The solution is, the internal calculation just use number of pages
> instead of permillage of system-ram. And convert it to permillage
> value at displaying time.
> 

Ok, I agree this is better and I like that you've kept the userspace 
interfaces compatible.

> This patch doesn't change any ABI (included  /proc/<pid>/oom_score_adj)
> even though current logic has a lot of my dislike thing.
> 
> Reported-by: CAI Qian <caiqian@redhat.com>
> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> ---
>  fs/proc/base.c      |   13 ++++++----
>  include/linux/oom.h |    2 +-
>  mm/oom_kill.c       |   60 ++++++++++++++++++++++++++++++--------------------
>  3 files changed, 45 insertions(+), 30 deletions(-)
> 
> diff --git a/fs/proc/base.c b/fs/proc/base.c
> index 14def99..4a10763 100644
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -479,14 +479,17 @@ static const struct file_operations proc_lstats_operations = {
> 
>  static int proc_oom_score(struct task_struct *task, char *buffer)
>  {
> -	unsigned long points = 0;
> +	unsigned long points;
> +	unsigned long ratio = 0;
> +	unsigned long totalpages = totalram_pages + total_swap_pages + 1;
> 
>  	read_lock(&tasklist_lock);
> -	if (pid_alive(task))
> -		points = oom_badness(task, NULL, NULL,
> -					totalram_pages + total_swap_pages);
> +	if (pid_alive(task)) {
> +		points = oom_badness(task, NULL, NULL, totalpages);
> +		ratio = points * 1000 / totalpages;
> +	}
>  	read_unlock(&tasklist_lock);
> -	return sprintf(buffer, "%lu\n", points);
> +	return sprintf(buffer, "%lu\n", ratio);
>  }
> 
>  struct limit_names {
> diff --git a/include/linux/oom.h b/include/linux/oom.h
> index 4952fb8..75b104c 100644
> --- a/include/linux/oom.h
> +++ b/include/linux/oom.h
> @@ -42,7 +42,7 @@ enum oom_constraint {
> 
>  extern int test_set_oom_score_adj(int new_val);
> 
> -extern unsigned int oom_badness(struct task_struct *p, struct mem_cgroup *mem,
> +extern unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *mem,
>  			const nodemask_t *nodemask, unsigned long totalpages);
>  extern int try_set_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_flags);
>  extern void clear_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_flags);
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 797308b..cff8000 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -159,10 +159,11 @@ static bool oom_unkillable_task(struct task_struct *p,
>   * predictable as possible.  The goal is to return the highest value for the
>   * task consuming the most memory to avoid subsequent oom failures.
>   */
> -unsigned int oom_badness(struct task_struct *p, struct mem_cgroup *mem,
> +unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *mem,
>  		      const nodemask_t *nodemask, unsigned long totalpages)
>  {
> -	int points;
> +	unsigned long points;
> +	unsigned long score_adj = 0;

Does this need to be initialized to 0?

> 
>  	if (oom_unkillable_task(p, mem, nodemask))
>  		return 0;

I was going to suggest changing the comment for oom_badness(), but then 
realized that it never stated that it returns a proportion in the first 
place!  I suggest, however, that you modify the comment to specify what 
the return value is: a value up to the point of totalpages that represents 
the amount of rss, swap, and ptes that the process is using.

> @@ -194,33 +195,44 @@ unsigned int oom_badness(struct task_struct *p, struct mem_cgroup *mem,
>  	 */
>  	points = get_mm_rss(p->mm) + p->mm->nr_ptes;
>  	points += get_mm_counter(p->mm, MM_SWAPENTS);
> -
> -	points *= 1000;
> -	points /= totalpages;
>  	task_unlock(p);
> 
> -	/*
> -	 * Root processes get 3% bonus, just like the __vm_enough_memory()
> -	 * implementation used by LSMs.
> -	 */
> -	if (task_euid(p) == 0)
> -		points -= 30;
> +	/* Root processes get 3% bonus. */
> +	if (task_euid(p) == 0) {
> +		if (points >= totalpages / 32)
> +			points -= totalpages / 32;
> +		else
> +			points = 0;
> +	}
> 
>  	/*
>  	 * /proc/pid/oom_score_adj ranges from -1000 to +1000 such that it may
>  	 * either completely disable oom killing or always prefer a certain
>  	 * task.
>  	 */
> -	points += p->signal->oom_score_adj;
> +	if (p->signal->oom_score_adj >= 0) {
> +		score_adj = p->signal->oom_score_adj * (totalpages / 1000);
> +		if (ULONG_MAX - points >= score_adj)
> +			points += score_adj;
> +		else
> +			points = ULONG_MAX;

Does points = max(points + score_adj, ULONG_MAX) work here?

> +	} else {
> +		score_adj = -p->signal->oom_score_adj * (totalpages / 1000);
> +		if (points >= score_adj)
> +			points -= score_adj;
> +		else
> +			points = 0;
> +	}
> 

points = min(points - score_adj, 0)?

>  	/*
>  	 * Never return 0 for an eligible task that may be killed since it's
>  	 * possible that no single user task uses more than 0.1% of memory and
>  	 * no single admin tasks uses more than 3.0%.
>  	 */
> -	if (points <= 0)
> -		return 1;
> -	return (points < 1000) ? points : 1000;
> +	if (!points)
> +		points = 1;
> +

Comment needs to be updated to say that an eligible task gets at least a 
charge of 1 page instead of 0.1% of memory.

Everything else looks good, thanks for looking at this KOSAKI!

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 5/6] oom: don't kill random process
  2011-06-22 10:48 ` [PATCH 5/6] oom: don't kill random process KOSAKI Motohiro
@ 2011-06-22 23:22   ` David Rientjes
  0 siblings, 0 replies; 12+ messages in thread
From: David Rientjes @ 2011-06-22 23:22 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: linux-mm, linux-kernel, akpm, caiqian, hughd, kamezawa.hiroyu,
	minchan.kim, oleg

On Wed, 22 Jun 2011, KOSAKI Motohiro wrote:

> CAI Qian reported oom-killer killed all system daemons in his
> system at first if he ran fork bomb as root. The problem is,
> current logic give them bonus of 3% of system ram. Example,
> he has 16GB machine, then root processes have ~500MB oom
> immune. It bring us crazy bad result. _all_ processes have
> oom-score=1 and then, oom killer ignore process memroy usage
> and kill random process. This regression is caused by commit
> a63d83f427 (oom: badness heuristic rewrite).
> 

Isn't it better to give admin processes a proportional bonus instead of a 
strict 3% bonus?  I suggested 1% per 10% of memory used earlier and I 
think it would work quite well as an alternative to this.  The highest 
bonus that would actually make any differences in which thread to kill 
would be 5% when an admin process is using 50% of memory: in that case, 
another non-admin thread would have to be using >45% of memory to be 
killed instead.

Would you be satisfied with something like

	points -= (points * 10 / totalpages);

be better?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2011-06-22 23:22 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2011-06-22 10:45 [PATCH v3 0/6] Fix oom killer doesn't work at all if system have > gigabytes memory (aka CAI founded issue) KOSAKI Motohiro
2011-06-22 10:46 ` [PATCH 1/6] oom: use euid instead of CAP_SYS_ADMIN for protection root process KOSAKI Motohiro
2011-06-22 22:57   ` David Rientjes
2011-06-22 10:47 ` [PATCH 2/6] oom: improve dump_tasks() show items KOSAKI Motohiro
2011-06-22 22:59   ` David Rientjes
2011-06-22 10:47 ` [PATCH 3/6] oom: kill younger process first KOSAKI Motohiro
2011-06-22 23:01   ` David Rientjes
2011-06-22 10:48 ` [PATCH 4/6] oom: oom-killer don't use proportion of system-ram internally KOSAKI Motohiro
2011-06-22 23:16   ` David Rientjes
2011-06-22 10:48 ` [PATCH 5/6] oom: don't kill random process KOSAKI Motohiro
2011-06-22 23:22   ` David Rientjes
2011-06-22 10:49 ` [PATCH 6/6] oom: merge oom_kill_process() with oom_kill_task() KOSAKI Motohiro

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).