All of lore.kernel.org
 help / color / mirror / Atom feed
From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
To: kosaki.motohiro@jp.fujitsu.com
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	akpm@linux-foundation.org, caiqian@redhat.com,
	rientjes@google.com, hughd@google.com,
	kamezawa.hiroyu@jp.fujitsu.com, minchan.kim@gmail.com,
	oleg@redhat.com
Subject: [PATCH 5/6] oom: don't kill random process
Date: Wed, 22 Jun 2011 19:48:46 +0900	[thread overview]
Message-ID: <4E01C88E.3070806@jp.fujitsu.com> (raw)
In-Reply-To: <4E01C7D5.3060603@jp.fujitsu.com>

CAI Qian reported oom-killer killed all system daemons in his
system at first if he ran fork bomb as root. The problem is,
current logic give them bonus of 3% of system ram. Example,
he has 16GB machine, then root processes have ~500MB oom
immune. It bring us crazy bad result. _all_ processes have
oom-score=1 and then, oom killer ignore process memroy usage
and kill random process. This regression is caused by commit
a63d83f427 (oom: badness heuristic rewrite).

This patch changes select_bad_process() slightly. If oom points == 1,
it's a sign that the system have only root privileged processes or
similar. Thus, select_bad_process() calculate oom badness without
root bonus and select eligible process.

Also, this patch move finding sacrifice child logic into
select_bad_process(). It's necessary to implement adequate
no root bonus recalculation. and it makes good side effect,
current logic doesn't behave as the doc.

Documentation/sysctl/vm.txt says

    oom_kill_allocating_task

    If this is set to non-zero, the OOM killer simply kills the task that
    triggered the out-of-memory condition.  This avoids the expensive
    tasklist scan.

IOW, oom_kill_allocating_task shouldn't search sacrifice child.
This patch also fixes this issue.

Reported-by: CAI Qian <caiqian@redhat.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
 fs/proc/base.c      |    2 +-
 include/linux/oom.h |    3 +-
 mm/oom_kill.c       |   89 ++++++++++++++++++++++++++++----------------------
 3 files changed, 53 insertions(+), 41 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index 4a10763..5e4a8a1 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -485,7 +485,7 @@ static int proc_oom_score(struct task_struct *task, char *buffer)

 	read_lock(&tasklist_lock);
 	if (pid_alive(task)) {
-		points = oom_badness(task, NULL, NULL, totalpages);
+		points = oom_badness(task, NULL, NULL, totalpages, 1);
 		ratio = points * 1000 / totalpages;
 	}
 	read_unlock(&tasklist_lock);
diff --git a/include/linux/oom.h b/include/linux/oom.h
index 75b104c..272e3bb 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -43,7 +43,8 @@ enum oom_constraint {
 extern int test_set_oom_score_adj(int new_val);

 extern unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *mem,
-			const nodemask_t *nodemask, unsigned long totalpages);
+			const nodemask_t *nodemask, unsigned long totalpages,
+			int protect_root);
 extern int try_set_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_flags);
 extern void clear_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_flags);

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index cff8000..cf48fd5 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -160,7 +160,8 @@ static bool oom_unkillable_task(struct task_struct *p,
  * task consuming the most memory to avoid subsequent oom failures.
  */
 unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *mem,
-		      const nodemask_t *nodemask, unsigned long totalpages)
+			 const nodemask_t *nodemask, unsigned long totalpages,
+			 int protect_root)
 {
 	unsigned long points;
 	unsigned long score_adj = 0;
@@ -198,7 +199,7 @@ unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *mem,
 	task_unlock(p);

 	/* Root processes get 3% bonus. */
-	if (task_euid(p) == 0) {
+	if (protect_root && task_euid(p) == 0) {
 		if (points >= totalpages / 32)
 			points -= totalpages / 32;
 		else
@@ -310,8 +311,11 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
 {
 	struct task_struct *g, *p;
 	struct task_struct *chosen = NULL;
-	*ppoints = 0;
+	int protect_root = 1;
+	unsigned long chosen_points = 0;
+	struct task_struct *child;

+ retry:
 	do_each_thread_reverse(g, p) {
 		unsigned long points;

@@ -344,7 +348,7 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
 			 */
 			if (p == current) {
 				chosen = p;
-				*ppoints = ULONG_MAX;
+				chosen_points = ULONG_MAX;
 			} else {
 				/*
 				 * If this task is not being ptraced on exit,
@@ -357,13 +361,49 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
 			}
 		}

-		points = oom_badness(p, mem, nodemask, totalpages);
-		if (points > *ppoints) {
+		points = oom_badness(p, mem, nodemask, totalpages, protect_root);
+		if (points > chosen_points) {
 			chosen = p;
-			*ppoints = points;
+			chosen_points = points;
 		}
 	} while_each_thread(g, p);

+	/*
+	 * chosen_point==1 may be a sign that root privilege bonus is too large
+	 * and we choose wrong task. Let's recalculate oom score without the
+	 * dubious bonus.
+	 */
+	if (protect_root && (chosen_points == 1)) {
+		protect_root = 0;
+		goto retry;
+	}
+
+	/*
+	 * If any of p's children has a different mm and is eligible for kill,
+	 * the one with the highest badness() score is sacrificed for its
+	 * parent.  This attempts to lose the minimal amount of work done while
+	 * still freeing memory.
+	 */
+	g = p = chosen;
+	do {
+		list_for_each_entry(child, &p->children, sibling) {
+			unsigned long child_points;
+
+			if (child->mm == p->mm)
+				continue;
+			/*
+			 * oom_badness() returns 0 if the thread is unkillable
+			 */
+			child_points = oom_badness(child, mem, nodemask,
+						   totalpages, protect_root);
+			if (child_points > chosen_points) {
+				chosen = child;
+				chosen_points = child_points;
+			}
+		}
+	} while_each_thread(g, p);
+
+	*ppoints = chosen_points;
 	return chosen;
 }

@@ -479,11 +519,6 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 			    struct mem_cgroup *mem, nodemask_t *nodemask,
 			    const char *message)
 {
-	struct task_struct *victim = p;
-	struct task_struct *child;
-	struct task_struct *t = p;
-	unsigned long victim_points = 0;
-
 	if (printk_ratelimit())
 		dump_header(p, gfp_mask, order, mem, nodemask);

@@ -497,35 +532,11 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 	}

 	task_lock(p);
-	pr_err("%s: Kill process %d (%s) points %lu or sacrifice child\n",
-		message, task_pid_nr(p), p->comm, points);
+	pr_err("%s: Kill process %d (%s) points %lu\n",
+	       message, task_pid_nr(p), p->comm, points);
 	task_unlock(p);

-	/*
-	 * If any of p's children has a different mm and is eligible for kill,
-	 * the one with the highest badness() score is sacrificed for its
-	 * parent.  This attempts to lose the minimal amount of work done while
-	 * still freeing memory.
-	 */
-	do {
-		list_for_each_entry(child, &t->children, sibling) {
-			unsigned long child_points;
-
-			if (child->mm == p->mm)
-				continue;
-			/*
-			 * oom_badness() returns 0 if the thread is unkillable
-			 */
-			child_points = oom_badness(child, mem, nodemask,
-								totalpages);
-			if (child_points > victim_points) {
-				victim = child;
-				victim_points = child_points;
-			}
-		}
-	} while_each_thread(p, t);
-
-	return oom_kill_task(victim, mem);
+	return oom_kill_task(p, mem);
 }

 /*
-- 
1.7.3.1




WARNING: multiple messages have this Message-ID (diff)
From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
To: kosaki.motohiro@jp.fujitsu.com
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	akpm@linux-foundation.org, caiqian@redhat.com,
	rientjes@google.com, hughd@google.com,
	kamezawa.hiroyu@jp.fujitsu.com, minchan.kim@gmail.com,
	oleg@redhat.com
Subject: [PATCH 5/6] oom: don't kill random process
Date: Wed, 22 Jun 2011 19:48:46 +0900	[thread overview]
Message-ID: <4E01C88E.3070806@jp.fujitsu.com> (raw)
In-Reply-To: <4E01C7D5.3060603@jp.fujitsu.com>

CAI Qian reported oom-killer killed all system daemons in his
system at first if he ran fork bomb as root. The problem is,
current logic give them bonus of 3% of system ram. Example,
he has 16GB machine, then root processes have ~500MB oom
immune. It bring us crazy bad result. _all_ processes have
oom-score=1 and then, oom killer ignore process memroy usage
and kill random process. This regression is caused by commit
a63d83f427 (oom: badness heuristic rewrite).

This patch changes select_bad_process() slightly. If oom points == 1,
it's a sign that the system have only root privileged processes or
similar. Thus, select_bad_process() calculate oom badness without
root bonus and select eligible process.

Also, this patch move finding sacrifice child logic into
select_bad_process(). It's necessary to implement adequate
no root bonus recalculation. and it makes good side effect,
current logic doesn't behave as the doc.

Documentation/sysctl/vm.txt says

    oom_kill_allocating_task

    If this is set to non-zero, the OOM killer simply kills the task that
    triggered the out-of-memory condition.  This avoids the expensive
    tasklist scan.

IOW, oom_kill_allocating_task shouldn't search sacrifice child.
This patch also fixes this issue.

Reported-by: CAI Qian <caiqian@redhat.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
 fs/proc/base.c      |    2 +-
 include/linux/oom.h |    3 +-
 mm/oom_kill.c       |   89 ++++++++++++++++++++++++++++----------------------
 3 files changed, 53 insertions(+), 41 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index 4a10763..5e4a8a1 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -485,7 +485,7 @@ static int proc_oom_score(struct task_struct *task, char *buffer)

 	read_lock(&tasklist_lock);
 	if (pid_alive(task)) {
-		points = oom_badness(task, NULL, NULL, totalpages);
+		points = oom_badness(task, NULL, NULL, totalpages, 1);
 		ratio = points * 1000 / totalpages;
 	}
 	read_unlock(&tasklist_lock);
diff --git a/include/linux/oom.h b/include/linux/oom.h
index 75b104c..272e3bb 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -43,7 +43,8 @@ enum oom_constraint {
 extern int test_set_oom_score_adj(int new_val);

 extern unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *mem,
-			const nodemask_t *nodemask, unsigned long totalpages);
+			const nodemask_t *nodemask, unsigned long totalpages,
+			int protect_root);
 extern int try_set_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_flags);
 extern void clear_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_flags);

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index cff8000..cf48fd5 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -160,7 +160,8 @@ static bool oom_unkillable_task(struct task_struct *p,
  * task consuming the most memory to avoid subsequent oom failures.
  */
 unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *mem,
-		      const nodemask_t *nodemask, unsigned long totalpages)
+			 const nodemask_t *nodemask, unsigned long totalpages,
+			 int protect_root)
 {
 	unsigned long points;
 	unsigned long score_adj = 0;
@@ -198,7 +199,7 @@ unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *mem,
 	task_unlock(p);

 	/* Root processes get 3% bonus. */
-	if (task_euid(p) == 0) {
+	if (protect_root && task_euid(p) == 0) {
 		if (points >= totalpages / 32)
 			points -= totalpages / 32;
 		else
@@ -310,8 +311,11 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
 {
 	struct task_struct *g, *p;
 	struct task_struct *chosen = NULL;
-	*ppoints = 0;
+	int protect_root = 1;
+	unsigned long chosen_points = 0;
+	struct task_struct *child;

+ retry:
 	do_each_thread_reverse(g, p) {
 		unsigned long points;

@@ -344,7 +348,7 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
 			 */
 			if (p == current) {
 				chosen = p;
-				*ppoints = ULONG_MAX;
+				chosen_points = ULONG_MAX;
 			} else {
 				/*
 				 * If this task is not being ptraced on exit,
@@ -357,13 +361,49 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
 			}
 		}

-		points = oom_badness(p, mem, nodemask, totalpages);
-		if (points > *ppoints) {
+		points = oom_badness(p, mem, nodemask, totalpages, protect_root);
+		if (points > chosen_points) {
 			chosen = p;
-			*ppoints = points;
+			chosen_points = points;
 		}
 	} while_each_thread(g, p);

+	/*
+	 * chosen_point==1 may be a sign that root privilege bonus is too large
+	 * and we choose wrong task. Let's recalculate oom score without the
+	 * dubious bonus.
+	 */
+	if (protect_root && (chosen_points == 1)) {
+		protect_root = 0;
+		goto retry;
+	}
+
+	/*
+	 * If any of p's children has a different mm and is eligible for kill,
+	 * the one with the highest badness() score is sacrificed for its
+	 * parent.  This attempts to lose the minimal amount of work done while
+	 * still freeing memory.
+	 */
+	g = p = chosen;
+	do {
+		list_for_each_entry(child, &p->children, sibling) {
+			unsigned long child_points;
+
+			if (child->mm == p->mm)
+				continue;
+			/*
+			 * oom_badness() returns 0 if the thread is unkillable
+			 */
+			child_points = oom_badness(child, mem, nodemask,
+						   totalpages, protect_root);
+			if (child_points > chosen_points) {
+				chosen = child;
+				chosen_points = child_points;
+			}
+		}
+	} while_each_thread(g, p);
+
+	*ppoints = chosen_points;
 	return chosen;
 }

@@ -479,11 +519,6 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 			    struct mem_cgroup *mem, nodemask_t *nodemask,
 			    const char *message)
 {
-	struct task_struct *victim = p;
-	struct task_struct *child;
-	struct task_struct *t = p;
-	unsigned long victim_points = 0;
-
 	if (printk_ratelimit())
 		dump_header(p, gfp_mask, order, mem, nodemask);

@@ -497,35 +532,11 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 	}

 	task_lock(p);
-	pr_err("%s: Kill process %d (%s) points %lu or sacrifice child\n",
-		message, task_pid_nr(p), p->comm, points);
+	pr_err("%s: Kill process %d (%s) points %lu\n",
+	       message, task_pid_nr(p), p->comm, points);
 	task_unlock(p);

-	/*
-	 * If any of p's children has a different mm and is eligible for kill,
-	 * the one with the highest badness() score is sacrificed for its
-	 * parent.  This attempts to lose the minimal amount of work done while
-	 * still freeing memory.
-	 */
-	do {
-		list_for_each_entry(child, &t->children, sibling) {
-			unsigned long child_points;
-
-			if (child->mm == p->mm)
-				continue;
-			/*
-			 * oom_badness() returns 0 if the thread is unkillable
-			 */
-			child_points = oom_badness(child, mem, nodemask,
-								totalpages);
-			if (child_points > victim_points) {
-				victim = child;
-				victim_points = child_points;
-			}
-		}
-	} while_each_thread(p, t);
-
-	return oom_kill_task(victim, mem);
+	return oom_kill_task(p, mem);
 }

 /*
-- 
1.7.3.1



--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

  parent reply	other threads:[~2011-06-22 10:48 UTC|newest]

Thread overview: 24+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2011-06-22 10:45 [PATCH v3 0/6] Fix oom killer doesn't work at all if system have > gigabytes memory (aka CAI founded issue) KOSAKI Motohiro
2011-06-22 10:45 ` KOSAKI Motohiro
2011-06-22 10:46 ` [PATCH 1/6] oom: use euid instead of CAP_SYS_ADMIN for protection root process KOSAKI Motohiro
2011-06-22 10:46   ` KOSAKI Motohiro
2011-06-22 22:57   ` David Rientjes
2011-06-22 22:57     ` David Rientjes
2011-06-22 10:47 ` [PATCH 2/6] oom: improve dump_tasks() show items KOSAKI Motohiro
2011-06-22 10:47   ` KOSAKI Motohiro
2011-06-22 22:59   ` David Rientjes
2011-06-22 22:59     ` David Rientjes
2011-06-22 10:47 ` [PATCH 3/6] oom: kill younger process first KOSAKI Motohiro
2011-06-22 10:47   ` KOSAKI Motohiro
2011-06-22 23:01   ` David Rientjes
2011-06-22 23:01     ` David Rientjes
2011-06-22 10:48 ` [PATCH 4/6] oom: oom-killer don't use proportion of system-ram internally KOSAKI Motohiro
2011-06-22 10:48   ` KOSAKI Motohiro
2011-06-22 23:16   ` David Rientjes
2011-06-22 23:16     ` David Rientjes
2011-06-22 10:48 ` KOSAKI Motohiro [this message]
2011-06-22 10:48   ` [PATCH 5/6] oom: don't kill random process KOSAKI Motohiro
2011-06-22 23:22   ` David Rientjes
2011-06-22 23:22     ` David Rientjes
2011-06-22 10:49 ` [PATCH 6/6] oom: merge oom_kill_process() with oom_kill_task() KOSAKI Motohiro
2011-06-22 10:49   ` KOSAKI Motohiro

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=4E01C88E.3070806@jp.fujitsu.com \
    --to=kosaki.motohiro@jp.fujitsu.com \
    --cc=akpm@linux-foundation.org \
    --cc=caiqian@redhat.com \
    --cc=hughd@google.com \
    --cc=kamezawa.hiroyu@jp.fujitsu.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=minchan.kim@gmail.com \
    --cc=oleg@redhat.com \
    --cc=rientjes@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.