* [resend][patch 0/4 v3] oom: deadlock avoidance collection
@ 2011-04-11 5:29 KOSAKI Motohiro
2011-04-11 5:30 ` [PATCH 1/4] vmscan: all_unreclaimable() use zone->all_unreclaimable as a name KOSAKI Motohiro
` (3 more replies)
0 siblings, 4 replies; 13+ messages in thread
From: KOSAKI Motohiro @ 2011-04-11 5:29 UTC (permalink / raw)
To: Andrey Vagin, Minchan Kim, KAMEZAWA Hiroyuki,
Luis Claudio R. Goncalves, LKML, linux-mm, David Rientjes,
Oleg Nesterov, Andrew Morton, Linus Torvalds
Cc: kosaki.motohiro
Hi,
Here is a resend of the fixes that avoid the oom livelock issue Andrey reported.

Andrew, please let me know if you would like this series to be routed via my tree.
Thanks.
Changes from v2
- no change.
KOSAKI Motohiro (4):
vmscan: all_unreclaimable() use zone->all_unreclaimable as a name
remove boost_dying_task_prio()
mm: introduce wait_on_page_locked_killable
x86,mm: make pagefault killable
arch/x86/mm/fault.c | 12 +++++++++++-
include/linux/mm.h | 1 +
include/linux/pagemap.h | 9 +++++++++
mm/filemap.c | 42 +++++++++++++++++++++++++++++++++++-------
mm/oom_kill.c | 28 ----------------------------
mm/vmscan.c | 24 +++++++++++++-----------
6 files changed, 69 insertions(+), 47 deletions(-)
--
1.7.3.1
* [PATCH 1/4] vmscan: all_unreclaimable() use zone->all_unreclaimable as a name
2011-04-11 5:29 [resend][patch 0/4 v3] oom: deadlock avoidance collection KOSAKI Motohiro
@ 2011-04-11 5:30 ` KOSAKI Motohiro
2011-04-11 21:53 ` Andrew Morton
2011-04-13 18:48 ` David Rientjes
2011-04-11 5:31 ` [PATCH 2/4] remove boost_dying_task_prio() KOSAKI Motohiro
` (2 subsequent siblings)
3 siblings, 2 replies; 13+ messages in thread
From: KOSAKI Motohiro @ 2011-04-11 5:30 UTC (permalink / raw)
To: KOSAKI Motohiro
Cc: Andrey Vagin, Minchan Kim, KAMEZAWA Hiroyuki,
Luis Claudio R. Goncalves, LKML, linux-mm, David Rientjes,
Oleg Nesterov, Andrew Morton, Linus Torvalds
The all_unreclaimable check in direct reclaim was introduced in 2.6.19 by
the following commit.

    2006 Sep 25; commit 408d8544; oom: use unreclaimable info

It has had a strange history since then. First, the following commit broke
the logic unintentionally.

    2008 Apr 29; commit a41f24ea; page allocator: smarter retry of
    costly-order allocations

Two years later, I found the obviously meaningless code fragment and
restored the original intention with the following commit.

    2010 Jun 04; commit bb21c7ce; vmscan: fix do_try_to_free_pages()
    return value when priority==0

But that logic didn't work when a 32bit highmem system went into
hibernation, so Minchan slightly changed the algorithm and fixed it.

    2010 Sep 22; commit d1908362; vmscan: check all_unreclaimable
    in direct reclaim path

Recently, Andrey Vagin found a new corner case. Look:

    struct zone {
        ..
        int all_unreclaimable;
        ..
        unsigned long pages_scanned;
        ..
    }

zone->all_unreclaimable and zone->pages_scanned are neither atomic
variables nor protected by a lock, so a zone can end up in the state
zone->pages_scanned=0 and zone->all_unreclaimable=1. In that case the
current all_unreclaimable() returns false even though
zone->all_unreclaimable=1.

Is this an ignorable minor issue? No. Unfortunately, x86 has a very small
DMA zone, which reaches zone->all_unreclaimable=1 easily, and once a zone
becomes all_unreclaimable=1 it never goes back to all_unreclaimable=0.
Why? When all_unreclaimable=1, vmscan only tries DEF_PRIORITY reclaim,
and a-few-lru-pages>>DEF_PRIORITY is always 0, which means no pages are
scanned at all!

Eventually, the oom-killer never works on such systems. So we can't use
zone->pages_scanned for this purpose. This patch restores
all_unreclaimable() to using zone->all_unreclaimable, as before, and in
addition adds an oom_killer_disabled check to avoid reintroducing the
issue fixed by commit d1908362.
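
For illustration only, here is a tiny user-space sketch of that arithmetic
(this is not kernel code, and the 800-page figure is just a made-up stand-in
for a small DMA zone's LRU):

    #include <stdio.h>

    /* Illustrative sketch: DEF_PRIORITY is 12 in mm/vmscan.c, so any zone
     * holding fewer than 4096 LRU pages gets a scan target of 0 when
     * reclaim only ever runs at DEF_PRIORITY. */
    #define DEF_PRIORITY 12

    int main(void)
    {
        unsigned long lru_pages = 800;  /* hypothetical small x86 DMA zone */
        unsigned long nr_to_scan = lru_pages >> DEF_PRIORITY;

        printf("nr_to_scan = %lu\n", nr_to_scan);  /* prints 0 */
        return 0;
    }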
Reported-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Nick Piggin <npiggin@kernel.dk>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
mm/vmscan.c | 24 +++++++++++++-----------
1 files changed, 13 insertions(+), 11 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 0c5a3d6..468c2a2 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -41,6 +41,7 @@
#include <linux/memcontrol.h>
#include <linux/delayacct.h>
#include <linux/sysctl.h>
+#include <linux/oom.h>
#include <asm/tlbflush.h>
#include <asm/div64.h>
@@ -1988,17 +1989,12 @@ static bool zone_reclaimable(struct zone *zone)
return zone->pages_scanned < zone_reclaimable_pages(zone) * 6;
}
-/*
- * As hibernation is going on, kswapd is freezed so that it can't mark
- * the zone into all_unreclaimable. It can't handle OOM during hibernation.
- * So let's check zone's unreclaimable in direct reclaim as well as kswapd.
- */
+/* All zones in zonelist are unreclaimable? */
static bool all_unreclaimable(struct zonelist *zonelist,
struct scan_control *sc)
{
struct zoneref *z;
struct zone *zone;
- bool all_unreclaimable = true;
for_each_zone_zonelist_nodemask(zone, z, zonelist,
gfp_zone(sc->gfp_mask), sc->nodemask) {
@@ -2006,13 +2002,11 @@ static bool all_unreclaimable(struct zonelist *zonelist,
continue;
if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
continue;
- if (zone_reclaimable(zone)) {
- all_unreclaimable = false;
- break;
- }
+ if (!zone->all_unreclaimable)
+ return false;
}
- return all_unreclaimable;
+ return true;
}
/*
@@ -2108,6 +2102,14 @@ out:
if (sc->nr_reclaimed)
return sc->nr_reclaimed;
+ /*
+ * As hibernation is going on, kswapd is freezed so that it can't mark
+ * the zone into all_unreclaimable. Thus bypassing all_unreclaimable
+ * check.
+ */
+ if (oom_killer_disabled)
+ return 0;
+
/* top priority shrink_zones still had more to do? don't OOM, then */
if (scanning_global_lru(sc) && !all_unreclaimable(zonelist, sc))
return 1;
--
1.7.3.1
* [PATCH 2/4] remove boost_dying_task_prio()
2011-04-11 5:29 [resend][patch 0/4 v3] oom: deadlock avoidance collection KOSAKI Motohiro
2011-04-11 5:30 ` [PATCH 1/4] vmscan: all_unreclaimable() use zone->all_unreclaimable as a name KOSAKI Motohiro
@ 2011-04-11 5:31 ` KOSAKI Motohiro
2011-04-11 21:58 ` Andrew Morton
2011-04-13 18:41 ` David Rientjes
2011-04-11 5:31 ` [PATCH 3/4] mm: introduce wait_on_page_locked_killable KOSAKI Motohiro
2011-04-11 5:32 ` [PATCH 4/4] x86,mm: make pagefault killable KOSAKI Motohiro
3 siblings, 2 replies; 13+ messages in thread
From: KOSAKI Motohiro @ 2011-04-11 5:31 UTC (permalink / raw)
To: KOSAKI Motohiro
Cc: Andrey Vagin, Minchan Kim, KAMEZAWA Hiroyuki,
Luis Claudio R. Goncalves, LKML, linux-mm, David Rientjes,
Oleg Nesterov, Andrew Morton, Linus Torvalds
This is almost a revert of commit 93b43fa (oom: give the dying task a
higher priority).

That commit dramatically improved the oom killer logic when a fork-bomb
occurs. But I've found it has a nasty corner case. The cpu cgroup has a
strange default RT runtime: it's 0! That means that if a process in a
cpu cgroup is promoted to the RT scheduling class, the process never
runs at all.

Eventually, the kernel may hang up when an oom kill occurs.

Luis, the original author, and I agreed to disable this logic for now.
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Luis Claudio R. Goncalves <lclaudio@uudg.org>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
---
mm/oom_kill.c | 28 ----------------------------
1 files changed, 0 insertions(+), 28 deletions(-)
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 6a819d1..83fb72c1 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -84,24 +84,6 @@ static bool has_intersects_mems_allowed(struct task_struct *tsk,
#endif /* CONFIG_NUMA */
/*
- * If this is a system OOM (not a memcg OOM) and the task selected to be
- * killed is not already running at high (RT) priorities, speed up the
- * recovery by boosting the dying task to the lowest FIFO priority.
- * That helps with the recovery and avoids interfering with RT tasks.
- */
-static void boost_dying_task_prio(struct task_struct *p,
- struct mem_cgroup *mem)
-{
- struct sched_param param = { .sched_priority = 1 };
-
- if (mem)
- return;
-
- if (!rt_task(p))
- sched_setscheduler_nocheck(p, SCHED_FIFO, &param);
-}
-
-/*
* The process p may have detached its own ->mm while exiting or through
* use_mm(), but one or more of its subthreads may still have a valid
* pointer. Return p, or any of its subthreads with a valid ->mm, with
@@ -452,13 +434,6 @@ static int oom_kill_task(struct task_struct *p, struct mem_cgroup *mem)
set_tsk_thread_flag(p, TIF_MEMDIE);
force_sig(SIGKILL, p);
- /*
- * We give our sacrificial lamb high priority and access to
- * all the memory it needs. That way it should be able to
- * exit() and clear out its resources quickly...
- */
- boost_dying_task_prio(p, mem);
-
return 0;
}
#undef K
@@ -482,7 +457,6 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
*/
if (p->flags & PF_EXITING) {
set_tsk_thread_flag(p, TIF_MEMDIE);
- boost_dying_task_prio(p, mem);
return 0;
}
@@ -556,7 +530,6 @@ void mem_cgroup_out_of_memory(struct mem_cgroup *mem, gfp_t gfp_mask)
*/
if (fatal_signal_pending(current)) {
set_thread_flag(TIF_MEMDIE);
- boost_dying_task_prio(current, NULL);
return;
}
@@ -712,7 +685,6 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
*/
if (fatal_signal_pending(current)) {
set_thread_flag(TIF_MEMDIE);
- boost_dying_task_prio(current, NULL);
return;
}
--
1.7.3.1
* [PATCH 3/4] mm: introduce wait_on_page_locked_killable
2011-04-11 5:29 [resend][patch 0/4 v3] oom: deadlock avoidance collection KOSAKI Motohiro
2011-04-11 5:30 ` [PATCH 1/4] vmscan: all_unreclaimable() use zone->all_unreclaimable as a name KOSAKI Motohiro
2011-04-11 5:31 ` [PATCH 2/4] remove boost_dying_task_prio() KOSAKI Motohiro
@ 2011-04-11 5:31 ` KOSAKI Motohiro
2011-04-11 5:32 ` [PATCH 4/4] x86,mm: make pagefault killable KOSAKI Motohiro
3 siblings, 0 replies; 13+ messages in thread
From: KOSAKI Motohiro @ 2011-04-11 5:31 UTC (permalink / raw)
To: KOSAKI Motohiro
Cc: Andrey Vagin, Minchan Kim, KAMEZAWA Hiroyuki,
Luis Claudio R. Goncalves, LKML, linux-mm, David Rientjes,
Oleg Nesterov, Andrew Morton, Linus Torvalds
Commit 2687a356 (Add lock_page_killable) introduced a killable
lock_page(). Similarly, this patch introduces a killable
wait_on_page_locked().
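
For context, a caller would use it roughly as in the sketch below
(illustrative only; example_wait_for_page() is a made-up name, and the real
caller is the page fault path added in patch 4/4):

    /* Sketch of a caller: sleep until the page is unlocked, but give up as
     * soon as a fatal signal (SIGKILL) is pending. */
    static int example_wait_for_page(struct page *page)
    {
        if (wait_on_page_locked_killable(page))
            return -EINTR;  /* interrupted by a fatal signal */
        return 0;           /* the page has been unlocked */
    }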
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
---
include/linux/pagemap.h | 9 +++++++++
mm/filemap.c | 11 +++++++++++
2 files changed, 20 insertions(+), 0 deletions(-)
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index c119506..ea26808 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -357,6 +357,15 @@ static inline int lock_page_or_retry(struct page *page, struct mm_struct *mm,
*/
extern void wait_on_page_bit(struct page *page, int bit_nr);
+extern int wait_on_page_bit_killable(struct page *page, int bit_nr);
+
+static inline int wait_on_page_locked_killable(struct page *page)
+{
+ if (PageLocked(page))
+ return wait_on_page_bit_killable(page, PG_locked);
+ return 0;
+}
+
/*
* Wait for a page to be unlocked.
*
diff --git a/mm/filemap.c b/mm/filemap.c
index 1c63865..507349d 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -573,6 +573,17 @@ void wait_on_page_bit(struct page *page, int bit_nr)
}
EXPORT_SYMBOL(wait_on_page_bit);
+int wait_on_page_bit_killable(struct page *page, int bit_nr)
+{
+ DEFINE_WAIT_BIT(wait, &page->flags, bit_nr);
+
+ if (!test_bit(bit_nr, &page->flags))
+ return 0;
+
+ return __wait_on_bit(page_waitqueue(page), &wait,
+ sleep_on_page_killable, TASK_KILLABLE);
+}
+
/**
* add_page_wait_queue - Add an arbitrary waiter to a page's wait queue
* @page: Page defining the wait queue of interest
--
1.7.3.1
* [PATCH 4/4] x86,mm: make pagefault killable
2011-04-11 5:29 [resend][patch 0/4 v3] oom: deadlock avoidance collection KOSAKI Motohiro
` (2 preceding siblings ...)
2011-04-11 5:31 ` [PATCH 3/4] mm: introduce wait_on_page_locked_killable KOSAKI Motohiro
@ 2011-04-11 5:32 ` KOSAKI Motohiro
3 siblings, 0 replies; 13+ messages in thread
From: KOSAKI Motohiro @ 2011-04-11 5:32 UTC (permalink / raw)
To: KOSAKI Motohiro
Cc: Andrey Vagin, Minchan Kim, KAMEZAWA Hiroyuki,
Luis Claudio R. Goncalves, LKML, linux-mm, David Rientjes,
Oleg Nesterov, Andrew Morton, Linus Torvalds
When the oom killer fires, most processes get stuck at one of the
following two points:

1) __alloc_pages_nodemask
2) __lock_page_or_retry

1) is not much of a problem because TIF_MEMDIE makes the allocation fail,
so the task gets out of the page allocator. 2) is more problematic: in an
OOM situation zones typically have no page cache at all, and memory
starvation may reduce IO performance significantly. When a fork bomb is
running, a TIF_MEMDIE task that doesn't die quickly means the fork bomb
can create new processes faster than the oom-killer can kill them. The
system may then livelock.

This patch makes the page fault path interruptible by SIGKILL.
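
Schematically, the change boils down to the following sketch (simplified
from the hunks below; the flag name and the error-code handling are exactly
as in the diff):

    /* filemap side: with FAULT_FLAG_KILLABLE the wait for the page lock is
     * killable, so a task hit by SIGKILL comes back with VM_FAULT_RETRY
     * instead of sleeping indefinitely. */
    if (flags & FAULT_FLAG_KILLABLE)
        wait_on_page_locked_killable(page);
    else
        wait_on_page_locked(page);

    /* arch side: do not retry the fault for a task that is being killed;
     * return so the fatal signal can be delivered (kernel-mode faults
     * still go through no_context()). */
    if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current)) {
        if (!(error_code & PF_USER))
            no_context(regs, error_code, address);
        return;
    }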
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
arch/x86/mm/fault.c | 12 +++++++++++-
include/linux/mm.h | 1 +
mm/filemap.c | 31 ++++++++++++++++++++++++-------
3 files changed, 36 insertions(+), 8 deletions(-)
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 20e3f87..57a9fce 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -964,7 +964,7 @@ do_page_fault(struct pt_regs *regs, unsigned long error_code)
struct mm_struct *mm;
int fault;
int write = error_code & PF_WRITE;
- unsigned int flags = FAULT_FLAG_ALLOW_RETRY |
+ unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE |
(write ? FAULT_FLAG_WRITE : 0);
tsk = current;
@@ -1138,6 +1138,16 @@ good_area:
}
/*
+ * Pagefault was interrupted by SIGKILL. We have no reason to
+ * continue pagefault.
+ */
+ if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current)) {
+ if (!(error_code & PF_USER))
+ no_context(regs, error_code, address);
+ return;
+ }
+
+ /*
* Major/minor page fault accounting is only done on the
* initial attempt. If we go through a retry, it is extremely
* likely that the page will be found in page cache at that point.
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 628b31c..9c41b32 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -152,6 +152,7 @@ extern pgprot_t protection_map[16];
#define FAULT_FLAG_MKWRITE 0x04 /* Fault was mkwrite of existing pte */
#define FAULT_FLAG_ALLOW_RETRY 0x08 /* Retry fault if blocking */
#define FAULT_FLAG_RETRY_NOWAIT 0x10 /* Don't drop mmap_sem and wait when retrying */
+#define FAULT_FLAG_KILLABLE 0x20 /* The fault task is in SIGKILL killable region */
/*
* This interface is used by x86 PAT code to identify a pfn mapping that is
diff --git a/mm/filemap.c b/mm/filemap.c
index 507349d..df08f89 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -665,15 +665,32 @@ EXPORT_SYMBOL_GPL(__lock_page_killable);
int __lock_page_or_retry(struct page *page, struct mm_struct *mm,
unsigned int flags)
{
- if (!(flags & FAULT_FLAG_ALLOW_RETRY)) {
- __lock_page(page);
- return 1;
- } else {
- if (!(flags & FAULT_FLAG_RETRY_NOWAIT)) {
- up_read(&mm->mmap_sem);
+ if (flags & FAULT_FLAG_ALLOW_RETRY) {
+ /*
+ * CAUTION! In this case, mmap_sem is not released
+ * even though return 0.
+ */
+ if (flags & FAULT_FLAG_RETRY_NOWAIT)
+ return 0;
+
+ up_read(&mm->mmap_sem);
+ if (flags & FAULT_FLAG_KILLABLE)
+ wait_on_page_locked_killable(page);
+ else
wait_on_page_locked(page);
- }
return 0;
+ } else {
+ if (flags & FAULT_FLAG_KILLABLE) {
+ int ret;
+
+ ret = __lock_page_killable(page);
+ if (ret) {
+ up_read(&mm->mmap_sem);
+ return 0;
+ }
+ } else
+ __lock_page(page);
+ return 1;
}
}
--
1.7.3.1
* Re: [PATCH 1/4] vmscan: all_unreclaimable() use zone->all_unreclaimable as a name
2011-04-11 5:30 ` [PATCH 1/4] vmscan: all_unreclaimable() use zone->all_unreclaimable as a name KOSAKI Motohiro
@ 2011-04-11 21:53 ` Andrew Morton
2011-04-12 1:04 ` KOSAKI Motohiro
2011-04-13 18:48 ` David Rientjes
1 sibling, 1 reply; 13+ messages in thread
From: Andrew Morton @ 2011-04-11 21:53 UTC (permalink / raw)
To: KOSAKI Motohiro
Cc: Andrey Vagin, Minchan Kim, KAMEZAWA Hiroyuki,
Luis Claudio R. Goncalves, LKML, linux-mm, David Rientjes,
Oleg Nesterov, Linus Torvalds
On Mon, 11 Apr 2011 14:30:31 +0900 (JST)
KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:
> The all_unreclaimable check in direct reclaim was introduced in 2.6.19 by
> the following commit.
>
>     2006 Sep 25; commit 408d8544; oom: use unreclaimable info
>
> It has had a strange history since then. First, the following commit broke
> the logic unintentionally.
>
>     2008 Apr 29; commit a41f24ea; page allocator: smarter retry of
>     costly-order allocations
>
> Two years later, I found the obviously meaningless code fragment and
> restored the original intention with the following commit.
>
>     2010 Jun 04; commit bb21c7ce; vmscan: fix do_try_to_free_pages()
>     return value when priority==0
>
> But that logic didn't work when a 32bit highmem system went into
> hibernation, so Minchan slightly changed the algorithm and fixed it.
>
>     2010 Sep 22; commit d1908362; vmscan: check all_unreclaimable
>     in direct reclaim path
>
> Recently, Andrey Vagin found a new corner case. Look:
>
>     struct zone {
>         ..
>         int all_unreclaimable;
>         ..
>         unsigned long pages_scanned;
>         ..
>     }
>
> zone->all_unreclaimable and zone->pages_scanned are neither atomic
> variables nor protected by a lock, so a zone can end up in the state
> zone->pages_scanned=0 and zone->all_unreclaimable=1. In that case the
> current all_unreclaimable() returns false even though
> zone->all_unreclaimable=1.
>
> Is this an ignorable minor issue? No. Unfortunately, x86 has a very small
> DMA zone, which reaches zone->all_unreclaimable=1 easily, and once a zone
> becomes all_unreclaimable=1 it never goes back to all_unreclaimable=0.
> Why? When all_unreclaimable=1, vmscan only tries DEF_PRIORITY reclaim,
> and a-few-lru-pages>>DEF_PRIORITY is always 0, which means no pages are
> scanned at all!
>
> Eventually, the oom-killer never works on such systems. So we can't use
> zone->pages_scanned for this purpose. This patch restores
> all_unreclaimable() to using zone->all_unreclaimable, as before, and in
> addition adds an oom_killer_disabled check to avoid reintroducing the
> issue fixed by commit d1908362.
The above is a nice analysis of the bug and how it came to be
introduced. But we don't actually have a bug description! What was
the observable problem which got fixed?
Such a description will help people understand the importance of the
patch and will help people (eg, distros) who are looking at a user's
bug report and wondering whether your patch will fix it.
* Re: [PATCH 2/4] remove boost_dying_task_prio()
2011-04-11 5:31 ` [PATCH 2/4] remove boost_dying_task_prio() KOSAKI Motohiro
@ 2011-04-11 21:58 ` Andrew Morton
2011-04-12 0:35 ` KOSAKI Motohiro
2011-04-13 18:41 ` David Rientjes
1 sibling, 1 reply; 13+ messages in thread
From: Andrew Morton @ 2011-04-11 21:58 UTC (permalink / raw)
To: KOSAKI Motohiro
Cc: Andrey Vagin, Minchan Kim, KAMEZAWA Hiroyuki,
Luis Claudio R. Goncalves, LKML, linux-mm, David Rientjes,
Oleg Nesterov, Linus Torvalds
On Mon, 11 Apr 2011 14:31:18 +0900 (JST)
KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:
> This is almost a revert of commit 93b43fa (oom: give the dying task a
> higher priority).
>
> That commit dramatically improved the oom killer logic when a fork-bomb
> occurs. But I've found it has a nasty corner case. The cpu cgroup has a
> strange default RT runtime: it's 0! That means that if a process in a
> cpu cgroup is promoted to the RT scheduling class, the process never
> runs at all.
hm. How did that happen? I thought that sched_setscheduler() modifies
only a single thread, and that thread is in the process of exiting?
> Eventually, the kernel may hang up when an oom kill occurs.
>
> Luis, the original author, and I agreed to disable this logic for now.
>
> ...
>
> index 6a819d1..83fb72c1 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -84,24 +84,6 @@ static bool has_intersects_mems_allowed(struct task_struct *tsk,
> #endif /* CONFIG_NUMA */
>
> /*
> - * If this is a system OOM (not a memcg OOM) and the task selected to be
> - * killed is not already running at high (RT) priorities, speed up the
> - * recovery by boosting the dying task to the lowest FIFO priority.
> - * That helps with the recovery and avoids interfering with RT tasks.
> - */
> -static void boost_dying_task_prio(struct task_struct *p,
> - struct mem_cgroup *mem)
> -{
> - struct sched_param param = { .sched_priority = 1 };
> -
> - if (mem)
> - return;
> -
> - if (!rt_task(p))
> - sched_setscheduler_nocheck(p, SCHED_FIFO, &param);
> -}
I'm rather glad to see that code go away though - SCHED_FIFO is
dangerous...
* Re: [PATCH 2/4] remove boost_dying_task_prio()
2011-04-11 21:58 ` Andrew Morton
@ 2011-04-12 0:35 ` KOSAKI Motohiro
0 siblings, 0 replies; 13+ messages in thread
From: KOSAKI Motohiro @ 2011-04-12 0:35 UTC (permalink / raw)
To: Andrew Morton
Cc: kosaki.motohiro, Andrey Vagin, Minchan Kim, KAMEZAWA Hiroyuki,
Luis Claudio R. Goncalves, LKML, linux-mm, David Rientjes,
Oleg Nesterov, Linus Torvalds
Hi
> On Mon, 11 Apr 2011 14:31:18 +0900 (JST)
> KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:
>
> > This is almost a revert of commit 93b43fa (oom: give the dying task a
> > higher priority).
> >
> > That commit dramatically improved the oom killer logic when a fork-bomb
> > occurs. But I've found it has a nasty corner case. The cpu cgroup has a
> > strange default RT runtime: it's 0! That means that if a process in a
> > cpu cgroup is promoted to the RT scheduling class, the process never
> > runs at all.
>
> hm. How did that happen? I thought that sched_setscheduler() modifies
> only a single thread, and that thread is in the process of exiting?
If the admin puts a !RT process into a cpu cgroup whose rt_runtime is 0,
it usually runs perfectly, because a !RT task isn't affected by the
rt_runtime knob. But if the task is promoted to the RT class, by an
explicit setscheduler() syscall or by the OOM code, it can't run at all.

In short, the oom killer currently doesn't work at all if the admin is
using cpu cgroups and doesn't touch the rt_runtime knob.
* Re: [PATCH 1/4] vmscan: all_unreclaimable() use zone->all_unreclaimable as a name
2011-04-11 21:53 ` Andrew Morton
@ 2011-04-12 1:04 ` KOSAKI Motohiro
2011-04-12 1:26 ` Andrew Morton
0 siblings, 1 reply; 13+ messages in thread
From: KOSAKI Motohiro @ 2011-04-12 1:04 UTC (permalink / raw)
To: Andrew Morton
Cc: kosaki.motohiro, Andrey Vagin, Minchan Kim, KAMEZAWA Hiroyuki,
Luis Claudio R. Goncalves, LKML, linux-mm, David Rientjes,
Oleg Nesterov, Linus Torvalds
Hi
> > zone->all_unreclaimable and zone->pages_scanned are neither atomic
> > variables nor protected by a lock, so a zone can end up in the state
> > zone->pages_scanned=0 and zone->all_unreclaimable=1. In that case the
> > current all_unreclaimable() returns false even though
> > zone->all_unreclaimable=1.
> >
> > Is this an ignorable minor issue? No. Unfortunately, x86 has a very small
> > DMA zone, which reaches zone->all_unreclaimable=1 easily, and once a zone
> > becomes all_unreclaimable=1 it never goes back to all_unreclaimable=0.
> > Why? When all_unreclaimable=1, vmscan only tries DEF_PRIORITY reclaim,
> > and a-few-lru-pages>>DEF_PRIORITY is always 0, which means no pages are
> > scanned at all!
> >
> > Eventually, the oom-killer never works on such systems. So we can't use
> > zone->pages_scanned for this purpose. This patch restores
> > all_unreclaimable() to using zone->all_unreclaimable, as before, and in
> > addition adds an oom_killer_disabled check to avoid reintroducing the
> > issue fixed by commit d1908362.
>
> The above is a nice analysis of the bug and how it came to be
> introduced. But we don't actually have a bug description! What was
> the observable problem which got fixed?
The above says "Eventually, oom-killer never works". Is this no enough?
The above says
1) current logic have a race
2) x86 increase a chance of the race by dma zone
3) if race is happen, oom killer don't work
>
> Such a description will help people understand the importance of the
> patch and will help people (eg, distros) who are looking at a user's
> bug report and wondering whether your patch will fix it.
>
* Re: [PATCH 1/4] vmscan: all_unreclaimable() use zone->all_unreclaimable as a name
2011-04-12 1:04 ` KOSAKI Motohiro
@ 2011-04-12 1:26 ` Andrew Morton
2011-04-12 10:55 ` KOSAKI Motohiro
0 siblings, 1 reply; 13+ messages in thread
From: Andrew Morton @ 2011-04-12 1:26 UTC (permalink / raw)
To: KOSAKI Motohiro
Cc: Andrey Vagin, Minchan Kim, KAMEZAWA Hiroyuki,
Luis Claudio R. Goncalves, LKML, linux-mm, David Rientjes,
Oleg Nesterov, Linus Torvalds
On Tue, 12 Apr 2011 10:04:15 +0900 (JST) KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:
> Hi
>
> > > zone->all_unreclaimable and zone->pages_scanned are neither atomic
> > > variables nor protected by a lock, so a zone can end up in the state
> > > zone->pages_scanned=0 and zone->all_unreclaimable=1. In that case the
> > > current all_unreclaimable() returns false even though
> > > zone->all_unreclaimable=1.
> > >
> > > Is this an ignorable minor issue? No. Unfortunately, x86 has a very small
> > > DMA zone, which reaches zone->all_unreclaimable=1 easily, and once a zone
> > > becomes all_unreclaimable=1 it never goes back to all_unreclaimable=0.
> > > Why? When all_unreclaimable=1, vmscan only tries DEF_PRIORITY reclaim,
> > > and a-few-lru-pages>>DEF_PRIORITY is always 0, which means no pages are
> > > scanned at all!
> > >
> > > Eventually, the oom-killer never works on such systems. So we can't use
> > > zone->pages_scanned for this purpose. This patch restores
> > > all_unreclaimable() to using zone->all_unreclaimable, as before, and in
> > > addition adds an oom_killer_disabled check to avoid reintroducing the
> > > issue fixed by commit d1908362.
> >
> > The above is a nice analysis of the bug and how it came to be
> > introduced. But we don't actually have a bug description! What was
> > the observable problem which got fixed?
>
> The above says "Eventually, oom-killer never works". Is this no enough?
> The above says
> 1) current logic have a race
> 2) x86 increase a chance of the race by dma zone
> 3) if race is happen, oom killer don't work
And the system hangs up, so it's a local DoS and I guess we should
backport the fix into -stable. I added this:
: This resulted in the kernel hanging up when executing a loop of the form
:
: 1. fork
: 2. mmap
: 3. touch memory
: 4. read memory
: 5. munmap
:
: as described in
: http://www.gossamer-threads.com/lists/linux/kernel/1348725#1348725
And the problems which the other patches in this series address are
pretty deadly as well. Should we backport everything?
* Re: [PATCH 1/4] vmscan: all_unreclaimable() use zone->all_unreclaimable as a name
2011-04-12 1:26 ` Andrew Morton
@ 2011-04-12 10:55 ` KOSAKI Motohiro
0 siblings, 0 replies; 13+ messages in thread
From: KOSAKI Motohiro @ 2011-04-12 10:55 UTC (permalink / raw)
To: Andrew Morton
Cc: kosaki.motohiro, Andrey Vagin, Minchan Kim, KAMEZAWA Hiroyuki,
Luis Claudio R. Goncalves, LKML, linux-mm, David Rientjes,
Oleg Nesterov, Linus Torvalds
Hi
> > The above says "Eventually, the oom-killer never works". Is that not enough?
> > The above says:
> > 1) the current logic has a race
> > 2) x86 increases the chance of hitting the race because of its small DMA zone
> > 3) if the race happens, the oom killer doesn't work
>
> And the system hangs up, so it's a local DoS and I guess we should
> backport the fix into -stable. I added this:
>
> : This resulted in the kernel hanging up when executing a loop of the form
> :
> : 1. fork
> : 2. mmap
> : 3. touch memory
> : 4. read memory
> : 5. munmap
> :
> : as described in
> : http://www.gossamer-threads.com/lists/linux/kernel/1348725#1348725
>
> And the problems which the other patches in this series address are
> pretty deadly as well. Should we backport everything?
Patches [1/4] and [2/4] should be backported because they are regression
fixes. But [3/4] and [4/4] are borderline to me. They improve the recovery
time from oom; sometimes that is very important, sometimes not. And they
are not regression fixes. Our oom-killer has been very weak against
forkbomb attacks since very old days.

Thanks.
* Re: [PATCH 2/4] remove boost_dying_task_prio()
2011-04-11 5:31 ` [PATCH 2/4] remove boost_dying_task_prio() KOSAKI Motohiro
2011-04-11 21:58 ` Andrew Morton
@ 2011-04-13 18:41 ` David Rientjes
1 sibling, 0 replies; 13+ messages in thread
From: David Rientjes @ 2011-04-13 18:41 UTC (permalink / raw)
To: KOSAKI Motohiro
Cc: Andrey Vagin, Minchan Kim, KAMEZAWA Hiroyuki,
Luis Claudio R. Goncalves, LKML, linux-mm, Oleg Nesterov,
Andrew Morton, Linus Torvalds
On Mon, 11 Apr 2011, KOSAKI Motohiro wrote:
> This is almost a revert of commit 93b43fa (oom: give the dying task a
> higher priority).
>
> That commit dramatically improved the oom killer logic when a fork-bomb
> occurs. But I've found it has a nasty corner case. The cpu cgroup has a
> strange default RT runtime: it's 0! That means that if a process in a
> cpu cgroup is promoted to the RT scheduling class, the process never
> runs at all.
>
> Eventually, the kernel may hang up when an oom kill occurs.
>
> Luis, the original author, and I agreed to disable this logic for now.
>
> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> Acked-by: Luis Claudio R. Goncalves <lclaudio@uudg.org>
> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Acked-by: David Rientjes <rientjes@google.com>
* Re: [PATCH 1/4] vmscan: all_unreclaimable() use zone->all_unreclaimable as a name
2011-04-11 5:30 ` [PATCH 1/4] vmscan: all_unreclaimable() use zone->all_unreclaimable as a name KOSAKI Motohiro
2011-04-11 21:53 ` Andrew Morton
@ 2011-04-13 18:48 ` David Rientjes
1 sibling, 0 replies; 13+ messages in thread
From: David Rientjes @ 2011-04-13 18:48 UTC (permalink / raw)
To: KOSAKI Motohiro
Cc: Andrey Vagin, Minchan Kim, KAMEZAWA Hiroyuki,
Luis Claudio R. Goncalves, LKML, linux-mm, Oleg Nesterov,
Andrew Morton, Linus Torvalds
On Mon, 11 Apr 2011, KOSAKI Motohiro wrote:
> The all_unreclaimable check in direct reclaim was introduced in 2.6.19 by
> the following commit.
>
>     2006 Sep 25; commit 408d8544; oom: use unreclaimable info
>
> It has had a strange history since then. First, the following commit broke
> the logic unintentionally.
>
>     2008 Apr 29; commit a41f24ea; page allocator: smarter retry of
>     costly-order allocations
>
> Two years later, I found the obviously meaningless code fragment and
> restored the original intention with the following commit.
>
>     2010 Jun 04; commit bb21c7ce; vmscan: fix do_try_to_free_pages()
>     return value when priority==0
>
> But that logic didn't work when a 32bit highmem system went into
> hibernation, so Minchan slightly changed the algorithm and fixed it.
>
>     2010 Sep 22; commit d1908362; vmscan: check all_unreclaimable
>     in direct reclaim path
>
> Recently, Andrey Vagin found a new corner case. Look:
>
>     struct zone {
>         ..
>         int all_unreclaimable;
>         ..
>         unsigned long pages_scanned;
>         ..
>     }
>
> zone->all_unreclaimable and zone->pages_scanned are neither atomic
> variables nor protected by a lock, so a zone can end up in the state
> zone->pages_scanned=0 and zone->all_unreclaimable=1. In that case the
> current all_unreclaimable() returns false even though
> zone->all_unreclaimable=1.
>
> Is this an ignorable minor issue? No. Unfortunately, x86 has a very small
> DMA zone, which reaches zone->all_unreclaimable=1 easily, and once a zone
> becomes all_unreclaimable=1 it never goes back to all_unreclaimable=0.
> Why? When all_unreclaimable=1, vmscan only tries DEF_PRIORITY reclaim,
> and a-few-lru-pages>>DEF_PRIORITY is always 0, which means no pages are
> scanned at all!
>
> Eventually, the oom-killer never works on such systems. So we can't use
> zone->pages_scanned for this purpose. This patch restores
> all_unreclaimable() to using zone->all_unreclaimable, as before, and in
> addition adds an oom_killer_disabled check to avoid reintroducing the
> issue fixed by commit d1908362.
>
> Reported-by: Andrey Vagin <avagin@openvz.org>
> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> Cc: Nick Piggin <npiggin@kernel.dk>
> Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: David Rientjes <rientjes@google.com>
Seems like it should be a candidate for stable inclusion as well, nice
catch.
Thread overview: 13+ messages
2011-04-11 5:29 [resend][patch 0/4 v3] oom: deadlock avoidance collection KOSAKI Motohiro
2011-04-11 5:30 ` [PATCH 1/4] vmscan: all_unreclaimable() use zone->all_unreclaimable as a name KOSAKI Motohiro
2011-04-11 21:53 ` Andrew Morton
2011-04-12 1:04 ` KOSAKI Motohiro
2011-04-12 1:26 ` Andrew Morton
2011-04-12 10:55 ` KOSAKI Motohiro
2011-04-13 18:48 ` David Rientjes
2011-04-11 5:31 ` [PATCH 2/4] remove boost_dying_task_prio() KOSAKI Motohiro
2011-04-11 21:58 ` Andrew Morton
2011-04-12 0:35 ` KOSAKI Motohiro
2011-04-13 18:41 ` David Rientjes
2011-04-11 5:31 ` [PATCH 3/4] mm: introduce wait_on_page_locked_killable KOSAKI Motohiro
2011-04-11 5:32 ` [PATCH 4/4] x86,mm: make pagefault killable KOSAKI Motohiro