* [PATCH 00 of 24] OOM related fixes
@ 2007-08-22 12:48 Andrea Arcangeli
  2007-08-22 12:48 ` [PATCH 01 of 24] remove nr_scan_inactive/active Andrea Arcangeli
                   ` (23 more replies)
  0 siblings, 24 replies; 113+ messages in thread
From: Andrea Arcangeli @ 2007-08-22 12:48 UTC (permalink / raw)
  To: linux-mm; +Cc: David Rientjes

This is a set of fixes done in the context of a quite evil workload: many tasks
reading large files from NFS with big read buffers, all in parallel, until the
system goes oom. Almost all of these fixes seem to be required to fix the
customer workload on top of an older SLES kernel. The forward port of the fixes
has already been tested successfully on similar evil workloads.
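
For the record, the reproducer boils down to something like this sketch (the
task count, buffer size and path here are invented for illustration, it's not
the customer's literal testcase):

#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/* hypothetical reproducer: many tasks doing huge buffered reads
 * from NFS in parallel until the box goes oom */
#define TASKS 64
#define BUFSZ (256UL << 20)		/* 256M read buffer */

int main(void)
{
	int i;

	for (i = 0; i < TASKS; i++) {
		if (fork() == 0) {
			char *buf = malloc(BUFSZ);
			int fd = open("/mnt/nfs/bigfile", O_RDONLY);
			for (;;)
				if (read(fd, buf, BUFSZ) <= 0)
					lseek(fd, 0, SEEK_SET);
		}
	}
	for (;;)
		pause();
	return 0;
}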

The oom deadlock detection triggers a couple of times against the PG_locked
deadlock:

Jun  8 13:51:19 kvm kernel: Killed process 3504 (recursive_readd)
Jun  8 13:51:19 kvm kernel: detected probable OOM deadlock, so killing another task
Jun  8 13:51:19 kvm kernel: Out of memory: kill process 3532 (recursive_readd)
score 1225 or a child

Example stack trace of a TIF_MEMDIE-killed task (not literally verified that
this was the one with TIF_MEMDIE set, but it's the same as the previously
verified one):

recursive_rea D ffff810001056418     0  3548   3544 (NOTLB)
 ffff81000e57dba8 0000000000000082 ffff8100010af5e8 ffff8100148df730
 ffff81001ff3ea10 0000000000bd2e1b ffff8100148df908 0000000000000046
 ffff81001fd5f170 ffffffff8031c36d ffff81001fd5f170 ffff810001056418
Call Trace:
 [<ffffffff8031c36d>] __generic_unplug_device+0x13/0x24
 [<ffffffff80244163>] sync_page+0x0/0x40
 [<ffffffff804cdf5b>] io_schedule+0xf/0x17
 [<ffffffff8024419e>] sync_page+0x3b/0x40
 [<ffffffff804ce162>] __wait_on_bit_lock+0x36/0x65
 [<ffffffff80244150>] __lock_page+0x5e/0x64
 [<ffffffff802321f1>] wake_bit_function+0x0/0x23
 [<ffffffff802440c0>] find_get_page+0xe/0x40
 [<ffffffff80244a33>] do_generic_mapping_read+0x200/0x450
 [<ffffffff80243f26>] file_read_actor+0x0/0x11d
 [<ffffffff80247fd4>] get_page_from_freelist+0x2d3/0x36e
 [<ffffffff802464d0>] generic_file_aio_read+0x11d/0x159
 [<ffffffff80260bdc>] do_sync_read+0xc9/0x10c
 [<ffffffff80252adb>] vma_merge+0x10c/0x195
 [<ffffffff802321c3>] autoremove_wake_function+0x0/0x2e
 [<ffffffff80253a06>] do_mmap_pgoff+0x5e1/0x74c
 [<ffffffff8026134d>] vfs_read+0xaa/0x132
 [<ffffffff80261662>] sys_read+0x45/0x6e
 [<ffffffff8020991e>] system_call+0x7e/0x83

At the end I merged David Rientjes's patches to adapt cpuset oom killing to
the new changes and to further improve it.

There's one patch that is controversial (remove_nr_scan) and that can be
deferred, though I guess that if it slows down AIM we should fix it some other
way, not by leaving that patch out. I'll do some local testing with AIM soon.


* [PATCH 01 of 24] remove nr_scan_inactive/active
  2007-08-22 12:48 [PATCH 00 of 24] OOM related fixes Andrea Arcangeli
@ 2007-08-22 12:48 ` Andrea Arcangeli
  2007-09-12 11:44   ` Andrew Morton
  2007-08-22 12:48 ` [PATCH 02 of 24] avoid oom deadlock in nfs_create_request Andrea Arcangeli
                   ` (22 subsequent siblings)
  23 siblings, 1 reply; 113+ messages in thread
From: Andrea Arcangeli @ 2007-08-22 12:48 UTC (permalink / raw)
  To: linux-mm; +Cc: David Rientjes

# HG changeset patch
# User Andrea Arcangeli <andrea@suse.de>
# Date 1187778124 -7200
# Node ID c8ec651562ad6514753e408596e30d7d9e448a51
# Parent  b03dfad58a311488ec373c30fd5dc97dc03aecae
remove nr_scan_inactive/active

The older atomic_add/atomic_set were pointless (atomic_set vs atomic_add would
race), but removing them didn't actually remove the race: it's still there, for
the same reasons atomic_add/set couldn't prevent it. This is really the kind of
code that I dislike, because it's sort of buggy, it shouldn't be making any
measurable difference, and when it does do something for real it can only hurt!

The real focus is on shrink_zone (ignore the other places where it's used,
which are even less interesting). Assume two tasks add to nr_scan_*active at
the same time (first line of the old buggy code): they'll effectively double
their scan rate for no good reason. Instead of scanning nr_entries each,
they'll scan nr_entries*2 each. The more CPUs, the bigger the race, the higher
the multiplication effect, and the harder it will be to detect oom. In the case
that nr_*active < sc->swap_cluster_max, regardless of any future invocation of
alloc_pages, we'll be going down the priorities in the current alloc_pages
invocation anyway if DEF_PRIORITY was too high to get any work done, so
accumulating nr_scan_*active isn't interesting even when it's smaller than
sc->swap_cluster_max. Each task should work for itself without much care for
what the others are doing.
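
To make the race concrete, here's the first half of the old shrink_zone code
reduced to a sketch, with the interleaving of two racing tasks in the comments
(nr_entries stands for the per-priority scan goal computed from
zone_page_state):

/* old code, simplified */
zone->nr_scan_active += nr_entries;	/* task A: 0 -> N; task B: N -> 2N */
nr_active = zone->nr_scan_active;	/* both tasks read back 2N */
if (nr_active >= sc->swap_cluster_max)
	zone->nr_scan_active = 0;	/* both tasks now scan 2N entries each */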

Signed-off-by: Andrea Arcangeli <andrea@suse.de>

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -229,8 +229,6 @@ struct zone {
 	spinlock_t		lru_lock;	
 	struct list_head	active_list;
 	struct list_head	inactive_list;
-	unsigned long		nr_scan_active;
-	unsigned long		nr_scan_inactive;
 	unsigned long		pages_scanned;	   /* since last reclaim */
 	int			all_unreclaimable; /* All pages pinned */
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2959,8 +2959,6 @@ static void __meminit free_area_init_cor
 		zone_pcp_init(zone);
 		INIT_LIST_HEAD(&zone->active_list);
 		INIT_LIST_HEAD(&zone->inactive_list);
-		zone->nr_scan_active = 0;
-		zone->nr_scan_inactive = 0;
 		zap_zone_vm_stats(zone);
 		atomic_set(&zone->reclaim_in_progress, 0);
 		if (!size)
diff --git a/mm/vmscan.c b/mm/vmscan.c
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1020,20 +1020,11 @@ static unsigned long shrink_zone(int pri
 	 * Add one to `nr_to_scan' just to make sure that the kernel will
 	 * slowly sift through the active list.
 	 */
-	zone->nr_scan_active +=
-		(zone_page_state(zone, NR_ACTIVE) >> priority) + 1;
-	nr_active = zone->nr_scan_active;
-	if (nr_active >= sc->swap_cluster_max)
-		zone->nr_scan_active = 0;
-	else
+	nr_active = zone_page_state(zone, NR_ACTIVE) >> priority;
+	if (nr_active < sc->swap_cluster_max)
 		nr_active = 0;
-
-	zone->nr_scan_inactive +=
-		(zone_page_state(zone, NR_INACTIVE) >> priority) + 1;
-	nr_inactive = zone->nr_scan_inactive;
-	if (nr_inactive >= sc->swap_cluster_max)
-		zone->nr_scan_inactive = 0;
-	else
+	nr_inactive = zone_page_state(zone, NR_INACTIVE) >> priority;
+	if (nr_inactive < sc->swap_cluster_max)
 		nr_inactive = 0;
 
 	while (nr_active || nr_inactive) {
@@ -1500,22 +1491,14 @@ static unsigned long shrink_all_zones(un
 
 		/* For pass = 0 we don't shrink the active list */
 		if (pass > 0) {
-			zone->nr_scan_active +=
-				(zone_page_state(zone, NR_ACTIVE) >> prio) + 1;
-			if (zone->nr_scan_active >= nr_pages || pass > 3) {
-				zone->nr_scan_active = 0;
-				nr_to_scan = min(nr_pages,
-					zone_page_state(zone, NR_ACTIVE));
+			nr_to_scan = (zone_page_state(zone, NR_ACTIVE) >> prio) + 1;
+			if (nr_to_scan >= nr_pages || pass > 3) {
 				shrink_active_list(nr_to_scan, zone, sc, prio);
 			}
 		}
 
-		zone->nr_scan_inactive +=
-			(zone_page_state(zone, NR_INACTIVE) >> prio) + 1;
-		if (zone->nr_scan_inactive >= nr_pages || pass > 3) {
-			zone->nr_scan_inactive = 0;
-			nr_to_scan = min(nr_pages,
-				zone_page_state(zone, NR_INACTIVE));
+		nr_to_scan = (zone_page_state(zone, NR_INACTIVE) >> prio) + 1;
+		if (nr_to_scan >= nr_pages || pass > 3) {
 			ret += shrink_inactive_list(nr_to_scan, zone, sc);
 			if (ret >= nr_pages)
 				return ret;
diff --git a/mm/vmstat.c b/mm/vmstat.c
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -555,7 +555,7 @@ static int zoneinfo_show(struct seq_file
 			   "\n        min      %lu"
 			   "\n        low      %lu"
 			   "\n        high     %lu"
-			   "\n        scanned  %lu (a: %lu i: %lu)"
+			   "\n        scanned  %lu"
 			   "\n        spanned  %lu"
 			   "\n        present  %lu",
 			   zone_page_state(zone, NR_FREE_PAGES),
@@ -563,7 +563,6 @@ static int zoneinfo_show(struct seq_file
 			   zone->pages_low,
 			   zone->pages_high,
 			   zone->pages_scanned,
-			   zone->nr_scan_active, zone->nr_scan_inactive,
 			   zone->spanned_pages,
 			   zone->present_pages);
 


* [PATCH 02 of 24] avoid oom deadlock in nfs_create_request
  2007-08-22 12:48 [PATCH 00 of 24] OOM related fixes Andrea Arcangeli
  2007-08-22 12:48 ` [PATCH 01 of 24] remove nr_scan_inactive/active Andrea Arcangeli
@ 2007-08-22 12:48 ` Andrea Arcangeli
  2007-09-12 23:54   ` Christoph Lameter
  2007-08-22 12:48 ` [PATCH 03 of 24] prevent oom deadlocks during read/write operations Andrea Arcangeli
                   ` (21 subsequent siblings)
  23 siblings, 1 reply; 113+ messages in thread
From: Andrea Arcangeli @ 2007-08-22 12:48 UTC (permalink / raw)
  To: linux-mm; +Cc: David Rientjes

# HG changeset patch
# User Andrea Arcangeli <andrea@suse.de>
# Date 1187778124 -7200
# Node ID 90afd499e8ca0dfd2e0284372dca50f2e6149700
# Parent  c8ec651562ad6514753e408596e30d7d9e448a51
avoid oom deadlock in nfs_create_request

When sigkill is pending after the oom killer set TIF_MEMDIE, the task
must go away or the VM will malfunction.
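
For clarity, this is what callers see after the change (a hypothetical caller
pattern, not a hunk from this series):

req = nfs_create_request(ctx, inode, page, offset, count);
if (IS_ERR(req))
	/* -ENOMEM only shows up here if we were oom-killed
	 * (TIF_MEMDIE), so the task is about to die anyway */
	return PTR_ERR(req);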

Signed-off-by: Andrea Arcangeli <andrea@suse.de>

diff --git a/fs/nfs/pagelist.c b/fs/nfs/pagelist.c
--- a/fs/nfs/pagelist.c
+++ b/fs/nfs/pagelist.c
@@ -61,16 +61,20 @@ nfs_create_request(struct nfs_open_conte
 	struct nfs_server *server = NFS_SERVER(inode);
 	struct nfs_page		*req;
 
-	for (;;) {
-		/* try to allocate the request struct */
-		req = nfs_page_alloc();
-		if (req != NULL)
-			break;
-
-		if (signalled() && (server->flags & NFS_MOUNT_INTR))
-			return ERR_PTR(-ERESTARTSYS);
-		yield();
-	}
+	/* try to allocate the request struct */
+	req = nfs_page_alloc();
+	if (unlikely(!req)) {
+		/*
+		 * -ENOMEM will be returned only when TIF_MEMDIE is set
+		 * so userland shouldn't risk to get confused by a new
+		 * unhandled ENOMEM errno.
+		 */
+		WARN_ON(!test_thread_flag(TIF_MEMDIE));
+		return ERR_PTR(-ENOMEM);
+	}
+
+	if (signalled() && (server->flags & NFS_MOUNT_INTR))
+		return ERR_PTR(-ERESTARTSYS);
 
 	/* Initialize the request struct. Initially, we assume a
 	 * long write-back delay. This will be adjusted in


* [PATCH 03 of 24] prevent oom deadlocks during read/write operations
  2007-08-22 12:48 [PATCH 00 of 24] OOM related fixes Andrea Arcangeli
  2007-08-22 12:48 ` [PATCH 01 of 24] remove nr_scan_inactive/active Andrea Arcangeli
  2007-08-22 12:48 ` [PATCH 02 of 24] avoid oom deadlock in nfs_create_request Andrea Arcangeli
@ 2007-08-22 12:48 ` Andrea Arcangeli
  2007-09-12 11:56   ` Andrew Morton
  2007-08-22 12:48 ` [PATCH 04 of 24] serialize oom killer Andrea Arcangeli
                   ` (20 subsequent siblings)
  23 siblings, 1 reply; 113+ messages in thread
From: Andrea Arcangeli @ 2007-08-22 12:48 UTC (permalink / raw)
  To: linux-mm; +Cc: David Rientjes

# HG changeset patch
# User Andrea Arcangeli <andrea@suse.de>
# Date 1187778124 -7200
# Node ID 5566f2af006a171cd47d596c6654f51beca74203
# Parent  90afd499e8ca0dfd2e0284372dca50f2e6149700
prevent oom deadlocks during read/write operations

We need to react to SIGKILL during read/write with huge buffers, or it becomes
too easy to prevent a SIGKILLed task from running do_exit promptly after it has
been selected for oom-killage.
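
The pattern applied to both loops is a bail-out check at the top of each
per-page iteration, sketched here out of context (bytes_left is an illustrative
name):

while (bytes_left) {
	/* bail out as soon as SIGKILL is queued: a short
	 * read/write beats hanging in D state during oom */
	if (unlikely(sigismember(&current->pending.signal, SIGKILL)))
		break;
	/* ... copy one PAGE_CACHE_SIZE chunk ... */
}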

Signed-off-by: Andrea Arcangeli <andrea@suse.de>

diff --git a/mm/filemap.c b/mm/filemap.c
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -925,6 +925,13 @@ page_ok:
 			goto out;
 		}
 
+		if (unlikely(sigismember(&current->pending.signal, SIGKILL)))
+			/*
+			 * Must not hang almost forever in D state in presence of sigkill
+			 * and lots of ram/swap (think during OOM).
+			 */
+			break;
+
 		/* nr is the maximum number of bytes to copy from this page */
 		nr = PAGE_CACHE_SIZE;
 		if (index == end_index) {
@@ -1868,6 +1875,13 @@ generic_file_buffered_write(struct kiocb
 		unsigned long index;
 		unsigned long offset;
 		size_t copied;
+
+		if (unlikely(sigismember(&current->pending.signal, SIGKILL)))
+			/*
+			 * Must not hang almost forever in D state in presence of sigkill
+			 * and lots of ram/swap (think during OOM).
+			 */
+			break;
 
 		offset = (pos & (PAGE_CACHE_SIZE -1)); /* Within page */
 		index = pos >> PAGE_CACHE_SHIFT;


* [PATCH 04 of 24] serialize oom killer
  2007-08-22 12:48 [PATCH 00 of 24] OOM related fixes Andrea Arcangeli
                   ` (2 preceding siblings ...)
  2007-08-22 12:48 ` [PATCH 03 of 24] prevent oom deadlocks during read/write operations Andrea Arcangeli
@ 2007-08-22 12:48 ` Andrea Arcangeli
  2007-09-12 12:02   ` Andrew Morton
  2007-09-13  0:09   ` Christoph Lameter
  2007-08-22 12:48 ` [PATCH 05 of 24] avoid selecting already killed tasks Andrea Arcangeli
                   ` (19 subsequent siblings)
  23 siblings, 2 replies; 113+ messages in thread
From: Andrea Arcangeli @ 2007-08-22 12:48 UTC (permalink / raw)
  To: linux-mm; +Cc: David Rientjes

# HG changeset patch
# User Andrea Arcangeli <andrea@suse.de>
# Date 1187778125 -7200
# Node ID 871b7a4fd566de0811207628b74abea0a73341f6
# Parent  5566f2af006a171cd47d596c6654f51beca74203
serialize oom killer

It's risky and useless to run two oom killers in parallel, so let's serialize
them to reduce the probability of spurious oom-killage.
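
The serialization is a plain trylock, as sketched below: the loser doesn't
sleep on the lock, it returns to the allocator and retries, most likely finding
the memory freed by the winner's victim instead of picking a second victim
itself.

static DECLARE_MUTEX(OOM_lock);

if (down_trylock(&OOM_lock))
	return;		/* another oom killer is running: skip, go
			 * back to alloc_pages and retry the allocation */
/* ... select and kill the victim ... */
up(&OOM_lock);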

Signed-off-by: Andrea Arcangeli <andrea@suse.de>

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -401,12 +401,15 @@ void out_of_memory(struct zonelist *zone
 	unsigned long points = 0;
 	unsigned long freed = 0;
 	int constraint;
+	static DECLARE_MUTEX(OOM_lock);
 
 	blocking_notifier_call_chain(&oom_notify_list, 0, &freed);
 	if (freed > 0)
 		/* Got some memory back in the last second. */
 		return;
 
+	if (down_trylock(&OOM_lock))
+		return;
 	if (printk_ratelimit()) {
 		printk(KERN_WARNING "%s invoked oom-killer: "
 			"gfp_mask=0x%x, order=%d, oomkilladj=%d\n",
@@ -473,4 +476,6 @@ out:
 	 */
 	if (!test_thread_flag(TIF_MEMDIE))
 		schedule_timeout_uninterruptible(1);
-}
+
+	up(&OOM_lock);
+}


* [PATCH 05 of 24] avoid selecting already killed tasks
  2007-08-22 12:48 [PATCH 00 of 24] OOM related fixes Andrea Arcangeli
                   ` (3 preceding siblings ...)
  2007-08-22 12:48 ` [PATCH 04 of 24] serialize oom killer Andrea Arcangeli
@ 2007-08-22 12:48 ` Andrea Arcangeli
  2007-09-13  0:13   ` Christoph Lameter
  2007-08-22 12:48 ` [PATCH 06 of 24] reduce the probability of an OOM livelock Andrea Arcangeli
                   ` (18 subsequent siblings)
  23 siblings, 1 reply; 113+ messages in thread
From: Andrea Arcangeli @ 2007-08-22 12:48 UTC (permalink / raw)
  To: linux-mm; +Cc: David Rientjes

# HG changeset patch
# User Andrea Arcangeli <andrea@suse.de>
# Date 1187778125 -7200
# Node ID de62eb332b1dfee7e493043b20e560283ef42f67
# Parent  871b7a4fd566de0811207628b74abea0a73341f6
avoid selecting already killed tasks

If the killed task doesn't go away because it's waiting on some other task that
needs to allocate memory to release the i_sem or some other lock, we must fall
back to killing another task so that the originally selected, already
oom-killed task can eventually die. But the logic that kills the children first
would deadlock if the already oom-killed task was actually the first child of
the newly oom-killed task.

Signed-off-by: Andrea Arcangeli <andrea@suse.de>

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -367,6 +367,12 @@ static int oom_kill_process(struct task_
 		c = list_entry(tsk, struct task_struct, sibling);
 		if (c->mm == p->mm)
 			continue;
+		/*
+		 * We cannot select tasks with TIF_MEMDIE already set
+		 * or we'll hard deadlock.
+		 */
+		if (unlikely(test_tsk_thread_flag(c, TIF_MEMDIE)))
+			continue;
 		if (!oom_kill_task(c))
 			return 0;
 	}


* [PATCH 06 of 24] reduce the probability of an OOM livelock
  2007-08-22 12:48 [PATCH 00 of 24] OOM related fixes Andrea Arcangeli
                   ` (4 preceding siblings ...)
  2007-08-22 12:48 ` [PATCH 05 of 24] avoid selecting already killed tasks Andrea Arcangeli
@ 2007-08-22 12:48 ` Andrea Arcangeli
  2007-09-12 12:17   ` Andrew Morton
  2007-08-22 12:48 ` [PATCH 07 of 24] balance_pgdat doesn't return the number of pages freed Andrea Arcangeli
                   ` (17 subsequent siblings)
  23 siblings, 1 reply; 113+ messages in thread
From: Andrea Arcangeli @ 2007-08-22 12:48 UTC (permalink / raw)
  To: linux-mm; +Cc: David Rientjes

# HG changeset patch
# User Andrea Arcangeli <andrea@suse.de>
# Date 1187778125 -7200
# Node ID 49e2d90eb0d7b1021b1e1e841bef22fdc647766e
# Parent  de62eb332b1dfee7e493043b20e560283ef42f67
reduce the probability of an OOM livelock

There's no need to loop way too many times over the lrus in order to declare
defeat and decide to kill a task. The more loops we do, the more likely we'll
run into a livelock with a page bouncing back and forth between tasks. The
maximum number of entries to check in a loop that returns less than
swap-cluster-max pages freed should be the size of the list (or at most twice
the size of the list, if you want to be really paranoid about the
PG_referenced bit).

Our objective here is to know reliably when it's time to kill a task; trying
to free a few more pages at that already critical point is worthless.

This seems to have the effect of reducing the "hang" time during oom
killing.
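
(For a list of N entries the per-priority goal is N >> priority, so even a
full pass from DEF_PRIORITY (12) down to 0 examines at most
N/4096 + ... + N/2 + N, i.e. just under 2*N entries, which matches the "at
most twice the size of the list" bound above.)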

Signed-off-by: Andrea Arcangeli <andrea@suse.de>

diff --git a/mm/vmscan.c b/mm/vmscan.c
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1112,7 +1112,7 @@ unsigned long try_to_free_pages(struct z
 	int priority;
 	int ret = 0;
 	unsigned long total_scanned = 0;
-	unsigned long nr_reclaimed = 0;
+	unsigned long nr_reclaimed;
 	struct reclaim_state *reclaim_state = current->reclaim_state;
 	unsigned long lru_pages = 0;
 	int i;
@@ -1141,12 +1141,12 @@ unsigned long try_to_free_pages(struct z
 		sc.nr_scanned = 0;
 		if (!priority)
 			disable_swap_token();
-		nr_reclaimed += shrink_zones(priority, zones, &sc);
+		nr_reclaimed = shrink_zones(priority, zones, &sc);
+		if (reclaim_state)
+			reclaim_state->reclaimed_slab = 0;
 		shrink_slab(sc.nr_scanned, gfp_mask, lru_pages);
-		if (reclaim_state) {
+		if (reclaim_state)
 			nr_reclaimed += reclaim_state->reclaimed_slab;
-			reclaim_state->reclaimed_slab = 0;
-		}
 		total_scanned += sc.nr_scanned;
 		if (nr_reclaimed >= sc.swap_cluster_max) {
 			ret = 1;
@@ -1238,7 +1238,6 @@ static unsigned long balance_pgdat(pg_da
 
 loop_again:
 	total_scanned = 0;
-	nr_reclaimed = 0;
 	sc.may_writepage = !laptop_mode;
 	count_vm_event(PAGEOUTRUN);
 
@@ -1293,6 +1292,7 @@ loop_again:
 		 * pages behind kswapd's direction of progress, which would
 		 * cause too much scanning of the lower zones.
 		 */
+		nr_reclaimed = 0;
 		for (i = 0; i <= end_zone; i++) {
 			struct zone *zone = pgdat->node_zones + i;
 			int nr_slab;


* [PATCH 07 of 24] balance_pgdat doesn't return the number of pages freed
  2007-08-22 12:48 [PATCH 00 of 24] OOM related fixes Andrea Arcangeli
                   ` (5 preceding siblings ...)
  2007-08-22 12:48 ` [PATCH 06 of 24] reduce the probability of an OOM livelock Andrea Arcangeli
@ 2007-08-22 12:48 ` Andrea Arcangeli
  2007-09-12 12:18   ` Andrew Morton
  2007-08-22 12:48 ` [PATCH 08 of 24] don't depend on PF_EXITING tasks to go away Andrea Arcangeli
                   ` (16 subsequent siblings)
  23 siblings, 1 reply; 113+ messages in thread
From: Andrea Arcangeli @ 2007-08-22 12:48 UTC (permalink / raw)
  To: linux-mm; +Cc: David Rientjes

# HG changeset patch
# User Andrea Arcangeli <andrea@suse.de>
# Date 1187778125 -7200
# Node ID b66d8470c04ed836787f69c7578d5fea4f18c322
# Parent  49e2d90eb0d7b1021b1e1e841bef22fdc647766e
balance_pgdat doesn't return the number of pages freed

nr_reclaimed would only be the number of pages freed in the last pass, not the
total the comment promised.

Signed-off-by: Andrea Arcangeli <andrea@suse.de>

diff --git a/mm/vmscan.c b/mm/vmscan.c
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1198,8 +1198,6 @@ out:
  * For kswapd, balance_pgdat() will work across all this node's zones until
  * they are all at pages_high.
  *
- * Returns the number of pages which were actually freed.
- *
  * There is special handling here for zones which are full of pinned pages.
  * This can happen if the pages are all mlocked, or if they are all used by
  * device drivers (say, ZONE_DMA).  Or if they are all in use by hugetlb.
@@ -1215,7 +1213,7 @@ out:
  * the page allocator fallback scheme to ensure that aging of pages is balanced
  * across the zones.
  */
-static unsigned long balance_pgdat(pg_data_t *pgdat, int order)
+static void balance_pgdat(pg_data_t *pgdat, int order)
 {
 	int all_zones_ok;
 	int priority;
@@ -1366,8 +1364,6 @@ out:
 
 		goto loop_again;
 	}
-
-	return nr_reclaimed;
 }
 
 /*


* [PATCH 08 of 24] don't depend on PF_EXITING tasks to go away
  2007-08-22 12:48 [PATCH 00 of 24] OOM related fixes Andrea Arcangeli
                   ` (6 preceding siblings ...)
  2007-08-22 12:48 ` [PATCH 07 of 24] balance_pgdat doesn't return the number of pages freed Andrea Arcangeli
@ 2007-08-22 12:48 ` Andrea Arcangeli
  2007-09-12 12:20   ` Andrew Morton
  2007-08-22 12:48 ` [PATCH 09 of 24] fallback killing more tasks if tif-memdie doesn't " Andrea Arcangeli
                   ` (15 subsequent siblings)
  23 siblings, 1 reply; 113+ messages in thread
From: Andrea Arcangeli @ 2007-08-22 12:48 UTC (permalink / raw)
  To: linux-mm; +Cc: David Rientjes

# HG changeset patch
# User Andrea Arcangeli <andrea@suse.de>
# Date 1187778125 -7200
# Node ID ffdc30241856d7155ceedd4132eef684f7cc7059
# Parent  b66d8470c04ed836787f69c7578d5fea4f18c322
don't depend on PF_EXITING tasks to go away

A PF_EXITING task doesn't have TIF_MEMDIE set, so it might get stuck in memory
allocations without access to the PF_MEMALLOC pool (that said, ideally do_exit
had better not require memory allocations, especially not before calling
exit_mm). The same way we raise its privilege to TIF_MEMDIE when it's the
current task, we should do it even when it's not the current task, to speed up
oom killing.

Signed-off-by: Andrea Arcangeli <andrea@suse.de>

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -234,27 +234,13 @@ static struct task_struct *select_bad_pr
 		 * Note: this may have a chance of deadlock if it gets
 		 * blocked waiting for another task which itself is waiting
 		 * for memory. Is there a better alternative?
+		 *
+		 * Better not to skip PF_EXITING tasks, since they
+		 * don't have access to the PF_MEMALLOC pool until
+		 * we select them here first.
 		 */
 		if (test_tsk_thread_flag(p, TIF_MEMDIE))
 			return ERR_PTR(-1UL);
-
-		/*
-		 * This is in the process of releasing memory so wait for it
-		 * to finish before killing some other task by mistake.
-		 *
-		 * However, if p is the current task, we allow the 'kill' to
-		 * go ahead if it is exiting: this will simply set TIF_MEMDIE,
-		 * which will allow it to gain access to memory reserves in
-		 * the process of exiting and releasing its resources.
-		 * Otherwise we could get an easy OOM deadlock.
-		 */
-		if (p->flags & PF_EXITING) {
-			if (p != current)
-				return ERR_PTR(-1UL);
-
-			chosen = p;
-			*ppoints = ULONG_MAX;
-		}
 
 		if (p->oomkilladj == OOM_DISABLE)
 			continue;


* [PATCH 09 of 24] fallback killing more tasks if tif-memdie doesn't go away
  2007-08-22 12:48 [PATCH 00 of 24] OOM related fixes Andrea Arcangeli
                   ` (7 preceding siblings ...)
  2007-08-22 12:48 ` [PATCH 08 of 24] don't depend on PF_EXITING tasks to go away Andrea Arcangeli
@ 2007-08-22 12:48 ` Andrea Arcangeli
  2007-09-12 12:30   ` Andrew Morton
  2007-08-22 12:48 ` [PATCH 10 of 24] stop useless vm trashing while we wait the TIF_MEMDIE task to exit Andrea Arcangeli
                   ` (14 subsequent siblings)
  23 siblings, 1 reply; 113+ messages in thread
From: Andrea Arcangeli @ 2007-08-22 12:48 UTC (permalink / raw)
  To: linux-mm; +Cc: David Rientjes

# HG changeset patch
# User Andrea Arcangeli <andrea@suse.de>
# Date 1187778125 -7200
# Node ID 9bf6a66eab3c52327daa831ef101d7802bc71791
# Parent  ffdc30241856d7155ceedd4132eef684f7cc7059
fallback killing more tasks if tif-memdie doesn't go away

Waiting indefinitely for a TIF_MEMDIE task to go away will deadlock. Two tasks
reading from the same inode at the same time, both going out of memory inside
a read(largebuffer) syscall, will even deadlock through contention over the
PG_locked bitflag: the task holding the semaphore detects oom, but the oom
killer decides to kill the task blocked in wait_on_page_locked(). The task
holding the semaphore will then hang inside alloc_pages, which will never
return because it waits for the TIF_MEMDIE task to go away, but the TIF_MEMDIE
task can't go away until the task holding the semaphore is killed in the first
place.

It's quite impractical to teach the oom killer the locking dependencies across
running tasks, so the feasible fix is a logic that, after waiting a long time
for a TIF_MEMDIE task to go away, falls back to killing one more task. This
also all but eliminates spurious oom killage (i.e. two tasks killed despite
only one had to be killed). It's not a mathematical guarantee, because we can't
demonstrate that a TIF_MEMDIE SIGKILLed task that didn't manage to complete
do_exit within 10sec never will; but the current probability of spurious oom
killing is surely much higher than with this patch applied.

The whole locking is around the tasklist_lock. On one side do_exit reads
TIF_MEMDIE and clears VM_is_OOM under the lock; on the other side the oom
killer accesses VM_is_OOM and TIF_MEMDIE under the lock. This is a read_lock
in the oom killer, but it's effectively a write lock thanks to the OOM_lock
semaphore running one oom killer at once (the locking rule is: either use
write_lock_irq, or read_lock+OOM_lock).
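
The two sides of the rule, condensed into a sketch:

/* oom killer side: a "read" lock that is exclusive in practice,
 * because OOM_lock admits only one oom killer at a time */
down(&OOM_lock);
read_lock(&tasklist_lock);
if (test_bit(0, &VM_is_OOM)) {
	/* ... compare jiffies against last_tif_memdie_jiffies ... */
}
read_unlock(&tasklist_lock);
up(&OOM_lock);

/* exit side: a real write lock, so it can't race with the above */
write_lock_irq(&tasklist_lock);
if (test_tsk_thread_flag(tsk, TIF_MEMDIE))
	clear_bit(0, &VM_is_OOM);
write_unlock_irq(&tasklist_lock);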

Signed-off-by: Andrea Arcangeli <andrea@suse.de>

diff --git a/kernel/exit.c b/kernel/exit.c
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -849,6 +849,15 @@ static void exit_notify(struct task_stru
 	if (tsk->exit_signal == -1 && likely(!tsk->ptrace))
 		state = EXIT_DEAD;
 	tsk->exit_state = state;
+
+	/*
+	 * Read TIF_MEMDIE and set VM_is_OOM to 0 atomically inside
+	 * the tasklist_lock.
+	 */
+	if (unlikely(test_tsk_thread_flag(tsk, TIF_MEMDIE))) {
+		extern unsigned long VM_is_OOM;
+		clear_bit(0, &VM_is_OOM);
+	}
 
 	write_unlock_irq(&tasklist_lock);
 
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -29,6 +29,9 @@ int sysctl_panic_on_oom;
 int sysctl_panic_on_oom;
 /* #define DEBUG */
 
+unsigned long VM_is_OOM;
+static unsigned long last_tif_memdie_jiffies;
+
 /**
  * badness - calculate a numeric value for how bad this task has been
  * @p: task struct of which task we should calculate
@@ -226,21 +229,14 @@ static struct task_struct *select_bad_pr
 		if (is_init(p))
 			continue;
 
-		/*
-		 * This task already has access to memory reserves and is
-		 * being killed. Don't allow any other task access to the
-		 * memory reserve.
-		 *
-		 * Note: this may have a chance of deadlock if it gets
-		 * blocked waiting for another task which itself is waiting
-		 * for memory. Is there a better alternative?
-		 *
-		 * Better not to skip PF_EXITING tasks, since they
-		 * don't have access to the PF_MEMALLOC pool until
-		 * we select them here first.
-		 */
-		if (test_tsk_thread_flag(p, TIF_MEMDIE))
-			return ERR_PTR(-1UL);
+		if (unlikely(test_tsk_thread_flag(p, TIF_MEMDIE))) {
+			/*
+			 * Either we already waited long enough,
+			 * or exit_mm already run, so we must
+			 * try to kill another task.
+			 */
+			continue;
+		}
 
 		if (p->oomkilladj == OOM_DISABLE)
 			continue;
@@ -277,13 +273,16 @@ static void __oom_kill_task(struct task_
 	if (verbose)
 		printk(KERN_ERR "Killed process %d (%s)\n", p->pid, p->comm);
 
+	if (!test_and_set_tsk_thread_flag(p, TIF_MEMDIE)) {
+		last_tif_memdie_jiffies = jiffies;
+		set_bit(0, &VM_is_OOM);
+	}
 	/*
 	 * We give our sacrificial lamb high priority and access to
 	 * all the memory it needs. That way it should be able to
 	 * exit() and clear out its resources quickly...
 	 */
 	p->time_slice = HZ;
-	set_tsk_thread_flag(p, TIF_MEMDIE);
 
 	force_sig(SIGKILL, p);
 }
@@ -420,6 +419,18 @@ void out_of_memory(struct zonelist *zone
 	constraint = constrained_alloc(zonelist, gfp_mask);
 	cpuset_lock();
 	read_lock(&tasklist_lock);
+
+	/*
+	 * This holds the down(OOM_lock)+read_lock(tasklist_lock), so it's
+	 * equivalent to write_lock_irq(tasklist_lock) as far as VM_is_OOM
+	 * is concerned.
+	 */
+	if (unlikely(test_bit(0, &VM_is_OOM))) {
+		if (time_before(jiffies, last_tif_memdie_jiffies + 10*HZ))
+			goto out;
+		printk("detected probable OOM deadlock, so killing another task\n");
+		last_tif_memdie_jiffies = jiffies;
+	}
 
 	switch (constraint) {
 	case CONSTRAINT_MEMORY_POLICY:
@@ -441,10 +452,6 @@ retry:
 		 * issues we may have.
 		 */
 		p = select_bad_process(&points);
-
-		if (PTR_ERR(p) == -1UL)
-			goto out;
-
 		/* Found nothing?!?! Either we hang forever, or we panic. */
 		if (!p) {
 			read_unlock(&tasklist_lock);


* [PATCH 10 of 24] stop useless vm trashing while we wait the TIF_MEMDIE task to exit
  2007-08-22 12:48 [PATCH 00 of 24] OOM related fixes Andrea Arcangeli
                   ` (8 preceding siblings ...)
  2007-08-22 12:48 ` [PATCH 09 of 24] fallback killing more tasks if tif-memdie doesn't " Andrea Arcangeli
@ 2007-08-22 12:48 ` Andrea Arcangeli
  2007-09-12 12:42   ` Andrew Morton
  2007-09-21 19:10   ` David Rientjes
  2007-08-22 12:48 ` [PATCH 11 of 24] the oom schedule timeout isn't needed with the VM_is_OOM logic Andrea Arcangeli
                   ` (13 subsequent siblings)
  23 siblings, 2 replies; 113+ messages in thread
From: Andrea Arcangeli @ 2007-08-22 12:48 UTC (permalink / raw)
  To: linux-mm; +Cc: David Rientjes

# HG changeset patch
# User Andrea Arcangeli <andrea@suse.de>
# Date 1187778125 -7200
# Node ID edb3af3e0d4f2c083c8ddd9857073a3c8393ab8e
# Parent  9bf6a66eab3c52327daa831ef101d7802bc71791
stop useless vm trashing while we wait the TIF_MEMDIE task to exit

There's no point in trying to free memory if we're oom.

Signed-off-by: Andrea Arcangeli <andrea@suse.de>

diff --git a/include/linux/swap.h b/include/linux/swap.h
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -159,6 +159,8 @@ struct swap_list_t {
 #define vm_swap_full() (nr_swap_pages*2 < total_swap_pages)
 
 /* linux/mm/oom_kill.c */
+extern unsigned long VM_is_OOM;
+#define is_VM_OOM() unlikely(test_bit(0, &VM_is_OOM))
 extern void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, int order);
 extern int register_oom_notifier(struct notifier_block *nb);
 extern int unregister_oom_notifier(struct notifier_block *nb);
diff --git a/mm/vmscan.c b/mm/vmscan.c
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1028,6 +1028,8 @@ static unsigned long shrink_zone(int pri
 		nr_inactive = 0;
 
 	while (nr_active || nr_inactive) {
+		if (is_VM_OOM())
+			break;
 		if (nr_active) {
 			nr_to_scan = min(nr_active,
 					(unsigned long)sc->swap_cluster_max);
@@ -1138,6 +1140,17 @@ unsigned long try_to_free_pages(struct z
 	}
 
 	for (priority = DEF_PRIORITY; priority >= 0; priority--) {
+		if (is_VM_OOM()) {
+			if (!test_thread_flag(TIF_MEMDIE)) {
+				/* get out of the way */
+				schedule_timeout_interruptible(1);
+				/* don't waste cpu if we're still oom */
+				if (is_VM_OOM())
+					goto out;
+			} else
+				goto out;
+		}
+
 		sc.nr_scanned = 0;
 		if (!priority)
 			disable_swap_token();


* [PATCH 11 of 24] the oom schedule timeout isn't needed with the VM_is_OOM logic
  2007-08-22 12:48 [PATCH 00 of 24] OOM related fixes Andrea Arcangeli
                   ` (9 preceding siblings ...)
  2007-08-22 12:48 ` [PATCH 10 of 24] stop useless vm trashing while we wait the TIF_MEMDIE task to exit Andrea Arcangeli
@ 2007-08-22 12:48 ` Andrea Arcangeli
  2007-09-12 12:44   ` Andrew Morton
  2007-08-22 12:48 ` [PATCH 12 of 24] show mem information only when a task is actually being killed Andrea Arcangeli
                   ` (12 subsequent siblings)
  23 siblings, 1 reply; 113+ messages in thread
From: Andrea Arcangeli @ 2007-08-22 12:48 UTC (permalink / raw)
  To: linux-mm; +Cc: David Rientjes

# HG changeset patch
# User Andrea Arcangeli <andrea@suse.de>
# Date 1187778125 -7200
# Node ID adf88d0ba0d17beaceee47f7b8e0acbd97ddc320
# Parent  edb3af3e0d4f2c083c8ddd9857073a3c8393ab8e
the oom schedule timeout isn't needed with the VM_is_OOM logic

The whole point of VM_is_OOM is to give the TIF_MEMDIE task a proper amount of
time to exit.

Signed-off-by: Andrea Arcangeli <andrea@suse.de>

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -469,12 +469,5 @@ out:
 	read_unlock(&tasklist_lock);
 	cpuset_unlock();
 
-	/*
-	 * Give "p" a good chance of killing itself before we
-	 * retry to allocate memory unless "p" is current
-	 */
-	if (!test_thread_flag(TIF_MEMDIE))
-		schedule_timeout_uninterruptible(1);
-
 	up(&OOM_lock);
 }


* [PATCH 12 of 24] show mem information only when a task is actually being killed
  2007-08-22 12:48 [PATCH 00 of 24] OOM related fixes Andrea Arcangeli
                   ` (10 preceding siblings ...)
  2007-08-22 12:48 ` [PATCH 11 of 24] the oom schedule timeout isn't needed with the VM_is_OOM logic Andrea Arcangeli
@ 2007-08-22 12:48 ` Andrea Arcangeli
  2007-09-12 12:49   ` Andrew Morton
  2007-08-22 12:49 ` [PATCH 13 of 24] simplify oom heuristics Andrea Arcangeli
                   ` (11 subsequent siblings)
  23 siblings, 1 reply; 113+ messages in thread
From: Andrea Arcangeli @ 2007-08-22 12:48 UTC (permalink / raw)
  To: linux-mm; +Cc: David Rientjes

# HG changeset patch
# User Andrea Arcangeli <andrea@suse.de>
# Date 1187778125 -7200
# Node ID 1473d573b9ba8a913bafa42da2cac5dcca274204
# Parent  adf88d0ba0d17beaceee47f7b8e0acbd97ddc320
show mem information only when a task is actually being killed

Don't print the mem information while VM_is_OOM is set and the deadlock
timeout hasn't triggered yet.

Signed-off-by: Andrea Arcangeli <andrea@suse.de>

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -287,7 +287,7 @@ static void __oom_kill_task(struct task_
 	force_sig(SIGKILL, p);
 }
 
-static int oom_kill_task(struct task_struct *p)
+static int oom_kill_task(struct task_struct *p, gfp_t gfp_mask, int order)
 {
 	struct mm_struct *mm;
 	struct task_struct *g, *q;
@@ -314,93 +314,6 @@ static int oom_kill_task(struct task_str
 			return 1;
 	} while_each_thread(g, q);
 
-	__oom_kill_task(p, 1);
-
-	/*
-	 * kill all processes that share the ->mm (i.e. all threads),
-	 * but are in a different thread group. Don't let them have access
-	 * to memory reserves though, otherwise we might deplete all memory.
-	 */
-	do_each_thread(g, q) {
-		if (q->mm == mm && q->tgid != p->tgid)
-			force_sig(SIGKILL, q);
-	} while_each_thread(g, q);
-
-	return 0;
-}
-
-static int oom_kill_process(struct task_struct *p, unsigned long points,
-		const char *message)
-{
-	struct task_struct *c;
-	struct list_head *tsk;
-
-	/*
-	 * If the task is already exiting, don't alarm the sysadmin or kill
-	 * its children or threads, just set TIF_MEMDIE so it can die quickly
-	 */
-	if (p->flags & PF_EXITING) {
-		__oom_kill_task(p, 0);
-		return 0;
-	}
-
-	printk(KERN_ERR "%s: kill process %d (%s) score %li or a child\n",
-					message, p->pid, p->comm, points);
-
-	/* Try to kill a child first */
-	list_for_each(tsk, &p->children) {
-		c = list_entry(tsk, struct task_struct, sibling);
-		if (c->mm == p->mm)
-			continue;
-		/*
-		 * We cannot select tasks with TIF_MEMDIE already set
-		 * or we'll hard deadlock.
-		 */
-		if (unlikely(test_tsk_thread_flag(c, TIF_MEMDIE)))
-			continue;
-		if (!oom_kill_task(c))
-			return 0;
-	}
-	return oom_kill_task(p);
-}
-
-static BLOCKING_NOTIFIER_HEAD(oom_notify_list);
-
-int register_oom_notifier(struct notifier_block *nb)
-{
-	return blocking_notifier_chain_register(&oom_notify_list, nb);
-}
-EXPORT_SYMBOL_GPL(register_oom_notifier);
-
-int unregister_oom_notifier(struct notifier_block *nb)
-{
-	return blocking_notifier_chain_unregister(&oom_notify_list, nb);
-}
-EXPORT_SYMBOL_GPL(unregister_oom_notifier);
-
-/**
- * out_of_memory - kill the "best" process when we run out of memory
- *
- * If we run out of memory, we have the choice between either
- * killing a random task (bad), letting the system crash (worse)
- * OR try to be smart about which process to kill. Note that we
- * don't have to be perfect here, we just have to be good.
- */
-void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, int order)
-{
-	struct task_struct *p;
-	unsigned long points = 0;
-	unsigned long freed = 0;
-	int constraint;
-	static DECLARE_MUTEX(OOM_lock);
-
-	blocking_notifier_call_chain(&oom_notify_list, 0, &freed);
-	if (freed > 0)
-		/* Got some memory back in the last second. */
-		return;
-
-	if (down_trylock(&OOM_lock))
-		return;
 	if (printk_ratelimit()) {
 		printk(KERN_WARNING "%s invoked oom-killer: "
 			"gfp_mask=0x%x, order=%d, oomkilladj=%d\n",
@@ -409,6 +322,94 @@ void out_of_memory(struct zonelist *zone
 		show_mem();
 	}
 
+	__oom_kill_task(p, 1);
+
+	/*
+	 * kill all processes that share the ->mm (i.e. all threads),
+	 * but are in a different thread group. Don't let them have access
+	 * to memory reserves though, otherwise we might deplete all memory.
+	 */
+	do_each_thread(g, q) {
+		if (q->mm == mm && q->tgid != p->tgid)
+			force_sig(SIGKILL, q);
+	} while_each_thread(g, q);
+
+	return 0;
+}
+
+static int oom_kill_process(struct task_struct *p, unsigned long points,
+			    const char *message, gfp_t gfp_mask, int order)
+{
+	struct task_struct *c;
+	struct list_head *tsk;
+
+	/*
+	 * If the task is already exiting, don't alarm the sysadmin or kill
+	 * its children or threads, just set TIF_MEMDIE so it can die quickly
+	 */
+	if (p->flags & PF_EXITING) {
+		__oom_kill_task(p, 0);
+		return 0;
+	}
+
+	printk(KERN_ERR "%s: kill process %d (%s) score %li or a child\n",
+					message, p->pid, p->comm, points);
+
+	/* Try to kill a child first */
+	list_for_each(tsk, &p->children) {
+		c = list_entry(tsk, struct task_struct, sibling);
+		if (c->mm == p->mm)
+			continue;
+		/*
+		 * We cannot select tasks with TIF_MEMDIE already set
+		 * or we'll hard deadlock.
+		 */
+		if (unlikely(test_tsk_thread_flag(c, TIF_MEMDIE)))
+			continue;
+		if (!oom_kill_task(c, gfp_mask, order))
+			return 0;
+	}
+	return oom_kill_task(p, gfp_mask, order);
+}
+
+static BLOCKING_NOTIFIER_HEAD(oom_notify_list);
+
+int register_oom_notifier(struct notifier_block *nb)
+{
+	return blocking_notifier_chain_register(&oom_notify_list, nb);
+}
+EXPORT_SYMBOL_GPL(register_oom_notifier);
+
+int unregister_oom_notifier(struct notifier_block *nb)
+{
+	return blocking_notifier_chain_unregister(&oom_notify_list, nb);
+}
+EXPORT_SYMBOL_GPL(unregister_oom_notifier);
+
+/**
+ * out_of_memory - kill the "best" process when we run out of memory
+ *
+ * If we run out of memory, we have the choice between either
+ * killing a random task (bad), letting the system crash (worse)
+ * OR try to be smart about which process to kill. Note that we
+ * don't have to be perfect here, we just have to be good.
+ */
+void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, int order)
+{
+	struct task_struct *p;
+	unsigned long points = 0;
+	unsigned long freed = 0;
+	int constraint;
+	static DECLARE_MUTEX(OOM_lock);
+
+	blocking_notifier_call_chain(&oom_notify_list, 0, &freed);
+	if (freed > 0)
+		/* Got some memory back in the last second. */
+		return;
+
+	if (down_trylock(&OOM_lock))
+		return;
+
 	if (sysctl_panic_on_oom == 2)
 		panic("out of memory. Compulsory panic_on_oom is selected.\n");
 
@@ -435,12 +436,12 @@ void out_of_memory(struct zonelist *zone
 	switch (constraint) {
 	case CONSTRAINT_MEMORY_POLICY:
 		oom_kill_process(current, points,
-				"No available memory (MPOL_BIND)");
+				 "No available memory (MPOL_BIND)", gfp_mask, order);
 		break;
 
 	case CONSTRAINT_CPUSET:
 		oom_kill_process(current, points,
-				"No available memory in cpuset");
+				 "No available memory in cpuset", gfp_mask, order);
 		break;
 
 	case CONSTRAINT_NONE:
@@ -459,7 +460,7 @@ retry:
 			panic("Out of memory and no killable processes...\n");
 		}
 
-		if (oom_kill_process(p, points, "Out of memory"))
+		if (oom_kill_process(p, points, "Out of memory", gfp_mask, order))
 			goto retry;
 
 		break;


* [PATCH 13 of 24] simplify oom heuristics
  2007-08-22 12:48 [PATCH 00 of 24] OOM related fixes Andrea Arcangeli
                   ` (11 preceding siblings ...)
  2007-08-22 12:48 ` [PATCH 12 of 24] show mem information only when a task is actually being killed Andrea Arcangeli
@ 2007-08-22 12:49 ` Andrea Arcangeli
  2007-09-12 12:52   ` Andrew Morton
  2007-08-22 12:49 ` [PATCH 14 of 24] oom select should only take rss into account Andrea Arcangeli
                   ` (10 subsequent siblings)
  23 siblings, 1 reply; 113+ messages in thread
From: Andrea Arcangeli @ 2007-08-22 12:49 UTC (permalink / raw)
  To: linux-mm; +Cc: David Rientjes

# HG changeset patch
# User Andrea Arcangeli <andrea@suse.de>
# Date 1187778125 -7200
# Node ID cd70d64570b9add8072f7abe952b34fe57c60086
# Parent  1473d573b9ba8a913bafa42da2cac5dcca274204
simplify oom heuristics

Over time somebody had the good idea to remove the rcvd_sigterm points; this
removes more of them. The selected task should be the one that, if not killed,
will turn the system oom again sooner rather than later. The cpu_time and
run_time factors tell us nothing about which task is best to kill, so they
should be removed.
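
For reference, this is what the removed scaling did in practice, as a worked
example (the numbers are invented):

/* long-running daemon: points = 10000, cpu_time = 100, run_time = 16 */
points /= int_sqrt(100);		/* 10000 / 10 -> 1000 */
points /= int_sqrt(int_sqrt(16));	/* 1000 / 2 -> 500 */
/* a fresh memory hog with the same footprint keeps all 10000
 * points, so the old code systematically preferred killing the
 * newcomer over whoever had been running (and leaking) for days */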

Signed-off-by: Andrea Arcangeli <andrea@suse.de>

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -53,7 +53,7 @@ static unsigned long last_tif_memdie_jif
 
 unsigned long badness(struct task_struct *p, unsigned long uptime)
 {
-	unsigned long points, cpu_time, run_time, s;
+	unsigned long points;
 	struct mm_struct *mm;
 	struct task_struct *child;
 
@@ -94,26 +94,6 @@ unsigned long badness(struct task_struct
 			points += child->mm->total_vm/2 + 1;
 		task_unlock(child);
 	}
-
-	/*
-	 * CPU time is in tens of seconds and run time is in thousands
-         * of seconds. There is no particular reason for this other than
-         * that it turned out to work very well in practice.
-	 */
-	cpu_time = (cputime_to_jiffies(p->utime) + cputime_to_jiffies(p->stime))
-		>> (SHIFT_HZ + 3);
-
-	if (uptime >= p->start_time.tv_sec)
-		run_time = (uptime - p->start_time.tv_sec) >> 10;
-	else
-		run_time = 0;
-
-	s = int_sqrt(cpu_time);
-	if (s)
-		points /= s;
-	s = int_sqrt(int_sqrt(run_time));
-	if (s)
-		points /= s;
 
 	/*
 	 * Niced processes are most likely less important, so double


* [PATCH 14 of 24] oom select should only take rss into account
  2007-08-22 12:48 [PATCH 00 of 24] OOM related fixes Andrea Arcangeli
                   ` (12 preceding siblings ...)
  2007-08-22 12:49 ` [PATCH 13 of 24] simplify oom heuristics Andrea Arcangeli
@ 2007-08-22 12:49 ` Andrea Arcangeli
  2007-09-13  0:43   ` Christoph Lameter
  2007-08-22 12:49 ` [PATCH 15 of 24] limit reclaim if enough pages have been freed Andrea Arcangeli
                   ` (9 subsequent siblings)
  23 siblings, 1 reply; 113+ messages in thread
From: Andrea Arcangeli @ 2007-08-22 12:49 UTC (permalink / raw)
  To: linux-mm; +Cc: David Rientjes

# HG changeset patch
# User Andrea Arcangeli <andrea@suse.de>
# Date 1187778125 -7200
# Node ID dde19626aa495cd8a6fa6b14a4f195438c2039ba
# Parent  cd70d64570b9add8072f7abe952b34fe57c60086
oom select should only take rss into account

When running workloads where many tasks grow their virtual memory
simultaneously, so that they all have a relatively small virtual memory when
oom triggers (compared to innocent long-standing tasks), the oom killer ends
up selecting mysql/apache and other things with very large VM but very small
RSS. RSS is the only thing that matters: killing a task with huge VM but zero
RSS is not useful. Many apps tend to have large VM but small RSS in the first
place (regardless of swapping activity) and they shouldn't be penalized like
this.
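
An invented example of what the switch changes in the selection:

/* numbers in pages, invented for illustration:
 *   mysqld:    total_vm = 500000, rss =  5000 (mostly unmapped or swapped)
 *   fresh hog: total_vm = 100000, rss = 90000 (actually eating ram)
 */
points = mm->total_vm;		/* old: mysqld wins, 500000 > 100000 */
points = get_mm_rss(mm);	/* new: the hog wins, 90000 > 5000 */
/* only the new choice actually gives memory back when the victim dies */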

Signed-off-by: Andrea Arcangeli <andrea@suse.de>

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -67,7 +67,7 @@ unsigned long badness(struct task_struct
 	/*
 	 * The memory size of the process is the basis for the badness.
 	 */
-	points = mm->total_vm;
+	points = get_mm_rss(mm);
 
 	/*
 	 * After this unlock we can no longer dereference local variable `mm'
@@ -91,7 +91,7 @@ unsigned long badness(struct task_struct
 	list_for_each_entry(child, &p->children, sibling) {
 		task_lock(child);
 		if (child->mm != mm && child->mm)
-			points += child->mm->total_vm/2 + 1;
+			points += get_mm_rss(child->mm)/2 + 1;
 		task_unlock(child);
 	}
 


* [PATCH 15 of 24] limit reclaim if enough pages have been freed
  2007-08-22 12:48 [PATCH 00 of 24] OOM related fixes Andrea Arcangeli
                   ` (13 preceding siblings ...)
  2007-08-22 12:49 ` [PATCH 14 of 24] oom select should only take rss into account Andrea Arcangeli
@ 2007-08-22 12:49 ` Andrea Arcangeli
  2007-09-12 12:57   ` Andrew Morton
  2007-09-12 12:58   ` Andrew Morton
  2007-08-22 12:49 ` [PATCH 16 of 24] avoid some lock operation in vm fast path Andrea Arcangeli
                   ` (8 subsequent siblings)
  23 siblings, 2 replies; 113+ messages in thread
From: Andrea Arcangeli @ 2007-08-22 12:49 UTC (permalink / raw)
  To: linux-mm; +Cc: David Rientjes

# HG changeset patch
# User Andrea Arcangeli <andrea@suse.de>
# Date 1187778125 -7200
# Node ID 94686cfcd27347e83a6aa145c77457ca6455366d
# Parent  dde19626aa495cd8a6fa6b14a4f195438c2039ba
limit reclaim if enough pages have been freed

No need to wipe out a huge chunk of the cache.

Signed-off-by: Andrea Arcangeli <andrea@suse.de>

diff --git a/mm/vmscan.c b/mm/vmscan.c
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1043,6 +1043,8 @@ static unsigned long shrink_zone(int pri
 			nr_inactive -= nr_to_scan;
 			nr_reclaimed += shrink_inactive_list(nr_to_scan, zone,
 								sc);
+			if (nr_reclaimed >= sc->swap_cluster_max)
+				break;
 		}
 	}
 


* [PATCH 16 of 24] avoid some lock operation in vm fast path
  2007-08-22 12:48 [PATCH 00 of 24] OOM related fixes Andrea Arcangeli
                   ` (14 preceding siblings ...)
  2007-08-22 12:49 ` [PATCH 15 of 24] limit reclaim if enough pages have been freed Andrea Arcangeli
@ 2007-08-22 12:49 ` Andrea Arcangeli
  2007-09-12 12:59   ` Andrew Morton
  2007-08-22 12:49 ` [PATCH 17 of 24] apply the anti deadlock features only to global oom Andrea Arcangeli
                   ` (7 subsequent siblings)
  23 siblings, 1 reply; 113+ messages in thread
From: Andrea Arcangeli @ 2007-08-22 12:49 UTC (permalink / raw)
  To: linux-mm; +Cc: David Rientjes

# HG changeset patch
# User Andrea Arcangeli <andrea@suse.de>
# Date 1187778125 -7200
# Node ID b343d1056f356d60de868bd92422b33290e3c514
# Parent  94686cfcd27347e83a6aa145c77457ca6455366d
avoid some lock operation in vm fast path

Let's not bloat the kernel for numa. The raw #ifdefs are not nice, but at
least this way perhaps somebody will clean them up instead of leaving the
inefficiency hidden in there.
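
One possible cleanup, sketched here only as a suggestion (the helper names are
invented, this is not part of the series): hide the raw #ifdefs behind inline
helpers that compile away on !CONFIG_NUMA.

#ifdef CONFIG_NUMA
static inline void zone_reclaim_begin(struct zone *zone)
{
	atomic_inc(&zone->reclaim_in_progress);
}
static inline void zone_reclaim_end(struct zone *zone)
{
	atomic_dec(&zone->reclaim_in_progress);
}
#else
static inline void zone_reclaim_begin(struct zone *zone) { }
static inline void zone_reclaim_end(struct zone *zone) { }
#endif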

Signed-off-by: Andrea Arcangeli <andrea@suse.de>

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -232,8 +232,10 @@ struct zone {
 	unsigned long		pages_scanned;	   /* since last reclaim */
 	int			all_unreclaimable; /* All pages pinned */
 
+#ifdef CONFIG_NUMA
 	/* A count of how many reclaimers are scanning this zone */
 	atomic_t		reclaim_in_progress;
+#endif
 
 	/* Zone statistics */
 	atomic_long_t		vm_stat[NR_VM_ZONE_STAT_ITEMS];
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2960,7 +2960,9 @@ static void __meminit free_area_init_cor
 		INIT_LIST_HEAD(&zone->active_list);
 		INIT_LIST_HEAD(&zone->inactive_list);
 		zap_zone_vm_stats(zone);
+#ifdef CONFIG_NUMA
 		atomic_set(&zone->reclaim_in_progress, 0);
+#endif
 		if (!size)
 			continue;
 
diff --git a/mm/vmscan.c b/mm/vmscan.c
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1014,7 +1014,9 @@ static unsigned long shrink_zone(int pri
 	unsigned long nr_to_scan;
 	unsigned long nr_reclaimed = 0;
 
+#ifdef CONFIG_NUMA
 	atomic_inc(&zone->reclaim_in_progress);
+#endif
 
 	/*
 	 * Add one to `nr_to_scan' just to make sure that the kernel will
@@ -1050,7 +1052,9 @@ static unsigned long shrink_zone(int pri
 
 	throttle_vm_writeout(sc->gfp_mask);
 
+#ifdef CONFIG_NUMA
 	atomic_dec(&zone->reclaim_in_progress);
+#endif
 	return nr_reclaimed;
 }
 


* [PATCH 17 of 24] apply the anti deadlock features only to global oom
  2007-08-22 12:48 [PATCH 00 of 24] OOM related fixes Andrea Arcangeli
                   ` (15 preceding siblings ...)
  2007-08-22 12:49 ` [PATCH 16 of 24] avoid some lock operation in vm fast path Andrea Arcangeli
@ 2007-08-22 12:49 ` Andrea Arcangeli
  2007-09-12 13:02   ` Andrew Morton
  2007-09-13  0:52   ` Christoph Lameter
  2007-08-22 12:49 ` [PATCH 18 of 24] run panic the same way in both places Andrea Arcangeli
                   ` (6 subsequent siblings)
  23 siblings, 2 replies; 113+ messages in thread
From: Andrea Arcangeli @ 2007-08-22 12:49 UTC (permalink / raw)
  To: linux-mm; +Cc: David Rientjes

# HG changeset patch
# User Andrea Arcangeli <andrea@suse.de>
# Date 1187778125 -7200
# Node ID efd1da1efb392cc4e015740d088ea9c6235901e0
# Parent  b343d1056f356d60de868bd92422b33290e3c514
apply the anti deadlock features only to global oom

Cc: Christoph Lameter <clameter@sgi.com>
The local numa oom will keep killing the current task, hoping it's not an
innocent task, and it won't alter the behavior of the rest of the VM. The
global oom will not wait for TIF_MEMDIE tasks anymore, so this will be a
really local event, not like before, when the local TIF_MEMDIE was effectively
a global flag that the global oom would depend on too.

Signed-off-by: Andrea Arcangeli <andrea@suse.de>

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -387,9 +387,6 @@ void out_of_memory(struct zonelist *zone
 		/* Got some memory back in the last second. */
 		return;
 
-	if (down_trylock(&OOM_lock))
-		return;
-
 	if (sysctl_panic_on_oom == 2)
 		panic("out of memory. Compulsory panic_on_oom is selected.\n");
 
@@ -399,32 +396,39 @@ void out_of_memory(struct zonelist *zone
 	 */
 	constraint = constrained_alloc(zonelist, gfp_mask);
 	cpuset_lock();
-	read_lock(&tasklist_lock);
-
-	/*
-	 * This holds the down(OOM_lock)+read_lock(tasklist_lock), so it's
-	 * equivalent to write_lock_irq(tasklist_lock) as far as VM_is_OOM
-	 * is concerned.
-	 */
-	if (unlikely(test_bit(0, &VM_is_OOM))) {
-		if (time_before(jiffies, last_tif_memdie_jiffies + 10*HZ))
-			goto out;
-		printk("detected probable OOM deadlock, so killing another task\n");
-		last_tif_memdie_jiffies = jiffies;
-	}
 
 	switch (constraint) {
 	case CONSTRAINT_MEMORY_POLICY:
+		read_lock(&tasklist_lock);
 		oom_kill_process(current, points,
 				 "No available memory (MPOL_BIND)", gfp_mask, order);
+		read_unlock(&tasklist_lock);
 		break;
 
 	case CONSTRAINT_CPUSET:
+		read_lock(&tasklist_lock);
 		oom_kill_process(current, points,
 				 "No available memory in cpuset", gfp_mask, order);
+		read_unlock(&tasklist_lock);
 		break;
 
 	case CONSTRAINT_NONE:
+		if (down_trylock(&OOM_lock))
+			break;
+		read_lock(&tasklist_lock);
+
+		/*
+		 * This holds the down(OOM_lock)+read_lock(tasklist_lock),
+		 * so it's equivalent to write_lock_irq(tasklist_lock) as
+		 * far as VM_is_OOM is concerned.
+		 */
+		if (unlikely(test_bit(0, &VM_is_OOM))) {
+			if (time_before(jiffies, last_tif_memdie_jiffies + 10*HZ))
+				goto out;
+			printk("detected probable OOM deadlock, so killing another task\n");
+			last_tif_memdie_jiffies = jiffies;
+		}
+
 		if (sysctl_panic_on_oom)
 			panic("out of memory. panic_on_oom is selected\n");
 retry:
@@ -443,12 +447,11 @@ retry:
 		if (oom_kill_process(p, points, "Out of memory", gfp_mask, order))
 			goto retry;
 
+	out:
+		read_unlock(&tasklist_lock);
+		up(&OOM_lock);
 		break;
 	}
 
-out:
-	read_unlock(&tasklist_lock);
 	cpuset_unlock();
-
-	up(&OOM_lock);
-}
+}


^ permalink raw reply	[flat|nested] 113+ messages in thread

* [PATCH 18 of 24] run panic the same way in both places
  2007-08-22 12:48 [PATCH 00 of 24] OOM related fixes Andrea Arcangeli
                   ` (16 preceding siblings ...)
  2007-08-22 12:49 ` [PATCH 17 of 24] apply the anti deadlock features only to global oom Andrea Arcangeli
@ 2007-08-22 12:49 ` Andrea Arcangeli
  2007-09-13  0:54   ` Christoph Lameter
  2007-08-22 12:49 ` [PATCH 19 of 24] cacheline align VM_is_OOM to prevent false sharing Andrea Arcangeli
                   ` (5 subsequent siblings)
  23 siblings, 1 reply; 113+ messages in thread
From: Andrea Arcangeli @ 2007-08-22 12:49 UTC (permalink / raw)
  To: linux-mm; +Cc: David Rientjes

# HG changeset patch
# User Andrea Arcangeli <andrea@suse.de>
# Date 1187778125 -7200
# Node ID 040cab5c8aafe1efcb6fc21d1f268c11202dac02
# Parent  efd1da1efb392cc4e015740d088ea9c6235901e0
run panic the same way in both places

The other panic is called after releasing some core global locks; that
seems safer to do for both panics (just in case panic tries to do
anything more than an oops does).

Signed-off-by: Andrea Arcangeli <andrea@suse.de>

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -429,8 +429,11 @@ void out_of_memory(struct zonelist *zone
 			last_tif_memdie_jiffies = jiffies;
 		}
 
-		if (sysctl_panic_on_oom)
+		if (sysctl_panic_on_oom) {
+			read_unlock(&tasklist_lock);
+			cpuset_unlock();
 			panic("out of memory. panic_on_oom is selected\n");
+		}
 retry:
 		/*
 		 * Rambo mode: Shoot down a process and hope it solves whatever
@@ -438,7 +441,7 @@ retry:
 		 */
 		p = select_bad_process(&points);
 		/* Found nothing?!?! Either we hang forever, or we panic. */
-		if (!p) {
+		if (unlikely(!p)) {
 			read_unlock(&tasklist_lock);
 			cpuset_unlock();
 			panic("Out of memory and no killable processes...\n");


^ permalink raw reply	[flat|nested] 113+ messages in thread

* [PATCH 19 of 24] cacheline align VM_is_OOM to prevent false sharing
  2007-08-22 12:48 [PATCH 00 of 24] OOM related fixes Andrea Arcangeli
                   ` (17 preceding siblings ...)
  2007-08-22 12:49 ` [PATCH 18 of 24] run panic the same way in both places Andrea Arcangeli
@ 2007-08-22 12:49 ` Andrea Arcangeli
  2007-09-12 13:02   ` Andrew Morton
  2007-08-22 12:49 ` [PATCH 20 of 24] extract deadlock helper function Andrea Arcangeli
                   ` (4 subsequent siblings)
  23 siblings, 1 reply; 113+ messages in thread
From: Andrea Arcangeli @ 2007-08-22 12:49 UTC (permalink / raw)
  To: linux-mm; +Cc: David Rientjes

# HG changeset patch
# User Andrea Arcangeli <andrea@suse.de>
# Date 1187778125 -7200
# Node ID be2fc447cec06990a2a31658b166f0c909777260
# Parent  040cab5c8aafe1efcb6fc21d1f268c11202dac02
cacheline align VM_is_OOM to prevent false sharing

This flag is better off cacheline aligned on SMP kernels, just in case,
so that unrelated hot variables don't end up sharing its cache line.

Signed-off-by: Andrea Arcangeli <andrea@suse.de>

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -29,7 +29,7 @@ int sysctl_panic_on_oom;
 int sysctl_panic_on_oom;
 /* #define DEBUG */
 
-unsigned long VM_is_OOM;
+unsigned long VM_is_OOM __cacheline_aligned_in_smp;
 static unsigned long last_tif_memdie_jiffies;
 
 /**

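False sharing is the cost being avoided here: two unrelated hot
variables that land on the same cache line force each writing CPU to
steal the line back on every store. Below is a minimal userspace
sketch of the same alignment trick -- not kernel code, and the 64-byte
line size is an assumption for x86-like hardware:

#include <pthread.h>
#include <stdio.h>

#define CACHE_LINE 64

/* Drop the aligned() attributes and the two counters share a line. */
static struct {
	volatile unsigned long a __attribute__((aligned(CACHE_LINE)));
	volatile unsigned long b __attribute__((aligned(CACHE_LINE)));
} s;

static void *bump_a(void *arg)
{
	(void)arg;
	for (long i = 0; i < 100000000; i++)
		s.a++;
	return NULL;
}

static void *bump_b(void *arg)
{
	(void)arg;
	for (long i = 0; i < 100000000; i++)
		s.b++;
	return NULL;
}

int main(void)
{
	pthread_t t1, t2;

	pthread_create(&t1, NULL, bump_a, NULL);
	pthread_create(&t2, NULL, bump_b, NULL);
	pthread_join(t1, NULL);
	pthread_join(t2, NULL);
	/* Time this with and without the alignment to see the difference. */
	printf("a=%lu b=%lu\n", s.a, s.b);
	return 0;
}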

^ permalink raw reply	[flat|nested] 113+ messages in thread

* [PATCH 20 of 24] extract deadlock helper function
  2007-08-22 12:48 [PATCH 00 of 24] OOM related fixes Andrea Arcangeli
                   ` (18 preceding siblings ...)
  2007-08-22 12:49 ` [PATCH 19 of 24] cacheline align VM_is_OOM to prevent false sharing Andrea Arcangeli
@ 2007-08-22 12:49 ` Andrea Arcangeli
  2007-08-22 12:49 ` [PATCH 21 of 24] select process to kill for cpusets Andrea Arcangeli
                   ` (3 subsequent siblings)
  23 siblings, 0 replies; 113+ messages in thread
From: Andrea Arcangeli @ 2007-08-22 12:49 UTC (permalink / raw)
  To: linux-mm; +Cc: David Rientjes

# HG changeset patch
# User David Rientjes <rientjes@google.com>
# Date 1187778125 -7200
# Node ID 2c9417ab4c1ff81a77bca4767207338e43b5cd69
# Parent  be2fc447cec06990a2a31658b166f0c909777260
extract deadlock helper function

Extracts the jiffies comparison, the update of the last_tif_memdie
timestamp, and the diagnostic message into a helper function of its own.

Cc: Andrea Arcangeli <andrea@suse.de>
Signed-off-by: David Rientjes <rientjes@google.com>
---
 mm/oom_kill.c |   27 +++++++++++++++++++++------
 1 files changed, 21 insertions(+), 6 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -29,6 +29,8 @@ int sysctl_panic_on_oom;
 int sysctl_panic_on_oom;
 /* #define DEBUG */
 
+#define OOM_DEADLOCK_TIMEOUT	(10*HZ)
+
 unsigned long VM_is_OOM __cacheline_aligned_in_smp;
 static unsigned long last_tif_memdie_jiffies;
 
@@ -366,6 +368,22 @@ int unregister_oom_notifier(struct notif
 }
 EXPORT_SYMBOL_GPL(unregister_oom_notifier);
 
+/*
+ * Returns 1 if the OOM killer is deadlocked, meaning more than
+ * OOM_DEADLOCK_TIMEOUT time has elapsed since the last task was set to
+ * TIF_MEMDIE.  If it is deadlocked, the actual is updated to jiffies to check
+ * for future timeouts.  Otherwise, return 0.
+ */
+static int oom_is_deadlocked(unsigned long *last_tif_memdie)
+{
+	if (unlikely(time_before(jiffies, *last_tif_memdie +
+					  OOM_DEADLOCK_TIMEOUT)))
+		return 0;
+	*last_tif_memdie = jiffies;
+	printk("detected probable OOM deadlock, so killing another task\n");
+	return 1;
+}
+
 /**
  * out_of_memory - kill the "best" process when we run out of memory
  *
@@ -422,12 +440,9 @@ void out_of_memory(struct zonelist *zone
 		 * so it's equivalent to write_lock_irq(tasklist_lock) as
 		 * far as VM_is_OOM is concerned.
 		 */
-		if (unlikely(test_bit(0, &VM_is_OOM))) {
-			if (time_before(jiffies, last_tif_memdie_jiffies + 10*HZ))
-				goto out;
-			printk("detected probable OOM deadlock, so killing another task\n");
-			last_tif_memdie_jiffies = jiffies;
-		}
+		if (unlikely(test_bit(0, &VM_is_OOM)) &&
+		    !oom_is_deadlocked(&last_tif_memdie_jiffies))
+			goto out;
 
 		if (sysctl_panic_on_oom) {
 			read_unlock(&tasklist_lock);

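The helper leans on time_before(), which is wraparound-safe: it
compares the signed difference of the two jiffies values rather than
the raw values. A runnable userspace sketch of the idiom follows (the
macros mirror the kernel's; HZ=250 and the tick values are made up):

#include <stdio.h>

#define time_after(a, b)	((long)((b) - (a)) < 0)
#define time_before(a, b)	time_after(b, a)

int main(void)
{
	unsigned long timeout = 10 * 250;		/* OOM_DEADLOCK_TIMEOUT, HZ=250 */
	unsigned long last = (unsigned long)-100;	/* set just before the wrap */
	unsigned long now = 50;				/* sampled just after the wrap */

	/* Only ~150 ticks really elapsed, despite now being numerically tiny. */
	if (time_before(now, last + timeout))
		printf("within the timeout: assume no deadlock yet\n");
	else
		printf("timeout expired: probable OOM deadlock\n");
	return 0;
}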

^ permalink raw reply	[flat|nested] 113+ messages in thread

* [PATCH 21 of 24] select process to kill for cpusets
  2007-08-22 12:48 [PATCH 00 of 24] OOM related fixes Andrea Arcangeli
                   ` (19 preceding siblings ...)
  2007-08-22 12:49 ` [PATCH 20 of 24] extract deadlock helper function Andrea Arcangeli
@ 2007-08-22 12:49 ` Andrea Arcangeli
  2007-09-12 13:05   ` Andrew Morton
  2007-08-22 12:49 ` [PATCH 22 of 24] extract select helper function Andrea Arcangeli
                   ` (2 subsequent siblings)
  23 siblings, 1 reply; 113+ messages in thread
From: Andrea Arcangeli @ 2007-08-22 12:49 UTC (permalink / raw)
  To: linux-mm; +Cc: David Rientjes

# HG changeset patch
# User David Rientjes <rientjes@google.com>
# Date 1187778125 -7200
# Node ID 855dc37d74ab151d7a0c640d687b34ee05996235
# Parent  2c9417ab4c1ff81a77bca4767207338e43b5cd69
select process to kill for cpusets

Passes the memory allocation constraint into select_bad_process() so
that, in the CONSTRAINT_CPUSET case, we can exclude tasks that do not
overlap nodes with the triggering task's cpuset.

The OOM killer now invokes select_bad_process() even in the cpuset case
to select a rogue task to kill instead of simply using current.  Although
killing current is guaranteed to help alleviate the OOM condition, it is
by no means guaranteed to be the "best" process to kill.  The
select_bad_process() heuristics will do a much better job of determining
that.

As an added bonus, this also addresses an issue whereby current could be
set to OOM_DISABLE yet that setting was not respected in the
CONSTRAINT_CPUSET case.  Currently we loop back out to __alloc_pages()
waiting for another cpuset task to trigger the OOM killer that hopefully
won't be OOM_DISABLE.  With this patch, we're guaranteed to find a task
to kill that is not OOM_DISABLE, if one matches our eligibility
requirements, the first time.

If we cannot find any tasks to kill in the cpuset case, we simply make
the entire OOM killer a no-op, since it's better for one cpuset to fail
memory allocations repeatedly than to panic the entire system.

Cc: Andrea Arcangeli <andrea@suse.de>
Cc: Christoph Lameter <clameter@sgi.com>
Signed-off-by: David Rientjes <rientjes@google.com>
---
 mm/oom_kill.c |   25 +++++++++++++++++--------
 1 files changed, 17 insertions(+), 8 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -188,9 +188,13 @@ static inline int constrained_alloc(stru
  * Simple selection loop. We chose the process with the highest
  * number of 'points'. We expect the caller will lock the tasklist.
  *
+ * If constraint is CONSTRAINT_CPUSET, then only choose a task that overlaps
+ * the nodes of the task that triggered the OOM killer.
+ *
  * (not docbooked, we don't want this one cluttering up the manual)
  */
-static struct task_struct *select_bad_process(unsigned long *ppoints)
+static struct task_struct *select_bad_process(unsigned long *ppoints,
+					      int constraint)
 {
 	struct task_struct *g, *p;
 	struct task_struct *chosen = NULL;
@@ -221,6 +225,9 @@ static struct task_struct *select_bad_pr
 		}
 
 		if (p->oomkilladj == OOM_DISABLE)
+			continue;
+		if (constraint == CONSTRAINT_CPUSET &&
+		    !cpuset_excl_nodes_overlap(p))
 			continue;
 
 		points = badness(p, uptime.tv_sec);
@@ -424,12 +431,6 @@ void out_of_memory(struct zonelist *zone
 		break;
 
 	case CONSTRAINT_CPUSET:
-		read_lock(&tasklist_lock);
-		oom_kill_process(current, points,
-				 "No available memory in cpuset", gfp_mask, order);
-		read_unlock(&tasklist_lock);
-		break;
-
 	case CONSTRAINT_NONE:
 		if (down_trylock(&OOM_lock))
 			break;
@@ -454,9 +455,17 @@ retry:
 		 * Rambo mode: Shoot down a process and hope it solves whatever
 		 * issues we may have.
 		 */
-		p = select_bad_process(&points);
+		p = select_bad_process(&points, constraint);
 		/* Found nothing?!?! Either we hang forever, or we panic. */
 		if (unlikely(!p)) {
+			/*
+			 * We shouldn't panic the entire system if we can't
+			 * find any eligible tasks to kill in a
+			 * cpuset-constrained OOM condition.  Instead, we do
+			 * nothing and allow other cpusets to continue.
+			 */
+			if (constraint == CONSTRAINT_CPUSET)
+				goto out;
 			read_unlock(&tasklist_lock);
 			cpuset_unlock();
 			panic("Out of memory and no killable processes...\n");

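The effect of the new filter is easiest to see in miniature. Here is a
runnable userspace sketch of the selection loop; the task data, the
badness scores, and the overlap test are stand-ins, not the kernel's:

#include <stdio.h>

enum constraint { CONSTRAINT_NONE, CONSTRAINT_CPUSET };

struct task {
	const char *name;
	unsigned long nodemask;
	unsigned long badness;
};

static unsigned long trigger_nodes = 0x3;	/* triggering cpuset: nodes 0-1 */

static int nodes_overlap(const struct task *t)
{
	return (t->nodemask & trigger_nodes) != 0;
}

static const struct task *select_bad_process(struct task *tasks, int n,
					     enum constraint c)
{
	const struct task *chosen = NULL;
	int i;

	for (i = 0; i < n; i++) {
		/* killing a task on disjoint nodes can't help this cpuset */
		if (c == CONSTRAINT_CPUSET && !nodes_overlap(&tasks[i]))
			continue;
		if (!chosen || tasks[i].badness > chosen->badness)
			chosen = &tasks[i];
	}
	return chosen;	/* NULL here makes the cpuset oom a no-op */
}

int main(void)
{
	struct task tasks[] = {
		{ "hog-elsewhere", 0xc, 1000 },	/* nodes 2-3: skipped */
		{ "hog-here",      0x1,  400 },	/* node 0: eligible */
	};
	const struct task *p = select_bad_process(tasks, 2, CONSTRAINT_CPUSET);

	printf("victim: %s\n", p ? p->name : "(none)");
	return 0;
}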

^ permalink raw reply	[flat|nested] 113+ messages in thread

* [PATCH 22 of 24] extract select helper function
  2007-08-22 12:48 [PATCH 00 of 24] OOM related fixes Andrea Arcangeli
                   ` (20 preceding siblings ...)
  2007-08-22 12:49 ` [PATCH 21 of 24] select process to kill for cpusets Andrea Arcangeli
@ 2007-08-22 12:49 ` Andrea Arcangeli
  2007-08-22 12:49 ` [PATCH 23 of 24] serialize for cpusets Andrea Arcangeli
  2007-08-22 12:49 ` [PATCH 24 of 24] add oom_kill_asking_task flag Andrea Arcangeli
  23 siblings, 0 replies; 113+ messages in thread
From: Andrea Arcangeli @ 2007-08-22 12:49 UTC (permalink / raw)
  To: linux-mm; +Cc: David Rientjes

# HG changeset patch
# User David Rientjes <rientjes@google.com>
# Date 1187778125 -7200
# Node ID 8807a4d14b241b2d1132fde7f83834603b6cf093
# Parent  855dc37d74ab151d7a0c640d687b34ee05996235
extract select helper function

Extracts the call to select_bad_process(), the check for a NULL return
value, and the call to oom_kill_process() into a function of its own.
This will be used later for the cpuset case, where we will require
different locking mechanisms than in the generic case.

Cc: Andrea Arcangeli <andrea@suse.de>
Signed-off-by: David Rientjes <rientjes@google.com>
---
 mm/oom_kill.c |   53 ++++++++++++++++++++++++++++-------------------------
 1 files changed, 28 insertions(+), 25 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -391,6 +391,32 @@ static int oom_is_deadlocked(unsigned lo
 	return 1;
 }
 
+static void select_and_kill_process(gfp_t gfp_mask, int order, int constraint)
+{
+	struct task_struct *p;
+	unsigned long points = 0;
+
+retry:
+	p = select_bad_process(&points, constraint);
+	/* Found nothing?!?! Either we hang forever, or we panic. */
+	if (unlikely(!p)) {
+		/*
+		 * We shouldn't panic the entire system if we can't find any
+		 * eligible tasks to kill in a cpuset-constrained OOM
+		 * condition.  Instead, we do nothing and allow other cpusets
+		 * to continue.
+		 */
+		if (constraint == CONSTRAINT_CPUSET)
+			return;
+		read_unlock(&tasklist_lock);
+		cpuset_unlock();
+		panic("Out of memory and no killable processes...\n");
+	}
+
+	if (oom_kill_process(p, points, "Out of memory", gfp_mask, order))
+		goto retry;
+}
+
 /**
  * out_of_memory - kill the "best" process when we run out of memory
  *
@@ -401,8 +427,6 @@ static int oom_is_deadlocked(unsigned lo
  */
 void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, int order)
 {
-	struct task_struct *p;
-	unsigned long points = 0;
 	unsigned long freed = 0;
 	int constraint;
 	static DECLARE_MUTEX(OOM_lock);
@@ -425,7 +449,7 @@ void out_of_memory(struct zonelist *zone
 	switch (constraint) {
 	case CONSTRAINT_MEMORY_POLICY:
 		read_lock(&tasklist_lock);
-		oom_kill_process(current, points,
+		oom_kill_process(current, 0,
 				 "No available memory (MPOL_BIND)", gfp_mask, order);
 		read_unlock(&tasklist_lock);
 		break;
@@ -450,29 +474,8 @@ void out_of_memory(struct zonelist *zone
 			cpuset_unlock();
 			panic("out of memory. panic_on_oom is selected\n");
 		}
-retry:
-		/*
-		 * Rambo mode: Shoot down a process and hope it solves whatever
-		 * issues we may have.
-		 */
-		p = select_bad_process(&points, constraint);
-		/* Found nothing?!?! Either we hang forever, or we panic. */
-		if (unlikely(!p)) {
-			/*
-			 * We shouldn't panic the entire system if we can't
-			 * find any eligible tasks to kill in a
-			 * cpuset-constrained OOM condition.  Instead, we do
-			 * nothing and allow other cpusets to continue.
-			 */
-			if (constraint == CONSTRAINT_CPUSET)
-				goto out;
-			read_unlock(&tasklist_lock);
-			cpuset_unlock();
-			panic("Out of memory and no killable processes...\n");
-		}
-
-		if (oom_kill_process(p, points, "Out of memory", gfp_mask, order))
-			goto retry;
+
+		select_and_kill_process(gfp_mask, order, constraint);
 
 	out:
 		read_unlock(&tasklist_lock);


^ permalink raw reply	[flat|nested] 113+ messages in thread

* [PATCH 23 of 24] serialize for cpusets
  2007-08-22 12:48 [PATCH 00 of 24] OOM related fixes Andrea Arcangeli
                   ` (21 preceding siblings ...)
  2007-08-22 12:49 ` [PATCH 22 of 24] extract select helper function Andrea Arcangeli
@ 2007-08-22 12:49 ` Andrea Arcangeli
  2007-09-12 13:10   ` Andrew Morton
  2007-08-22 12:49 ` [PATCH 24 of 24] add oom_kill_asking_task flag Andrea Arcangeli
  23 siblings, 1 reply; 113+ messages in thread
From: Andrea Arcangeli @ 2007-08-22 12:49 UTC (permalink / raw)
  To: linux-mm; +Cc: David Rientjes

# HG changeset patch
# User David Rientjes <rientjes@google.com>
# Date 1187778125 -7200
# Node ID a3d679df54ebb1f977b97ab6b3e501134bf9e7ef
# Parent  8807a4d14b241b2d1132fde7f83834603b6cf093
serialize for cpusets

Adds a last_tif_memdie_jiffies field to struct cpuset to store the
jiffies value at the last OOM kill.  This will detect deadlocks in the
CONSTRAINT_CPUSET case and kill another task if one is detected.

Adds a CS_OOM bit to struct cpuset's flags field.  This will be tested,
set, and cleared atomically to denote a cpuset that currently has an
attached task exiting as a result of the OOM killer.  We are required to
take p->alloc_lock to dereference p->cpuset, so this cannot be
implemented as a simple trylock.

As a result, we cannot allow the detachment of a task from a cpuset that
is currently OOM killing one of its tasks.  If we did, we would end up
clearing the CS_OOM bit in the wrong cpuset upon that task's exit.

The panic_on_oom sysctl now takes effect only in the
non-cpuset-constrained case.

Cc: Andrea Arcangeli <andrea@suse.de>
Cc: Christoph Lameter <clameter@sgi.com>
Signed-off-by: David Rientjes <rientjes@google.com>
---
 include/linux/cpuset.h |   19 ++++++++++++++
 kernel/cpuset.c        |   65 +++++++++++++++++++++++++++++++++++++++++++++---
 kernel/exit.c          |    1 +
 mm/oom_kill.c          |   21 ++++++++++++++-
 4 files changed, 100 insertions(+), 6 deletions(-)

diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -46,6 +46,12 @@ static int inline cpuset_zone_allowed_ha
 }
 
 extern int cpuset_excl_nodes_overlap(const struct task_struct *p);
+
+extern int cpuset_get_last_tif_memdie(struct task_struct *task);
+extern void cpuset_set_last_tif_memdie(struct task_struct *task,
+				       unsigned long last_tif_memdie);
+extern int cpuset_set_oom(struct task_struct *task);
+extern void cpuset_clear_oom(struct task_struct *task);
 
 #define cpuset_memory_pressure_bump() 				\
 	do {							\
@@ -118,6 +124,19 @@ static inline int cpuset_excl_nodes_over
 	return 1;
 }
 
+static inline int cpuset_get_last_tif_memdie(struct task_struct *task)
+{
+	return jiffies;
+}
+static inline void cpuset_set_last_tif_memdie(struct task_struct *task,
+					      unsigned long last_tif_memdie) {}
+
+static inline int cpuset_set_oom(struct task_struct *task)
+{
+	return 0;
+}
+static inline void cpuset_clear_oom(struct task_struct *task) {}
+
 static inline void cpuset_memory_pressure_bump(void) {}
 
 static inline char *cpuset_task_status_allowed(struct task_struct *task,
diff --git a/kernel/cpuset.c b/kernel/cpuset.c
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -98,6 +98,12 @@ struct cpuset {
 	int mems_generation;
 
 	struct fmeter fmeter;		/* memory_pressure filter */
+
+	/*
+	 * The jiffies at the last time TIF_MEMDIE was set for a task
+	 * associated with this cpuset.
+	 */
+	unsigned long last_tif_memdie_jiffies;
 };
 
 /* bits in struct cpuset flags field */
@@ -109,6 +115,7 @@ typedef enum {
 	CS_NOTIFY_ON_RELEASE,
 	CS_SPREAD_PAGE,
 	CS_SPREAD_SLAB,
+	CS_OOM,
 } cpuset_flagbits_t;
 
 /* convenient tests for these bits */
@@ -145,6 +152,11 @@ static inline int is_spread_slab(const s
 static inline int is_spread_slab(const struct cpuset *cs)
 {
 	return test_bit(CS_SPREAD_SLAB, &cs->flags);
+}
+
+static inline int is_oom(const struct cpuset *cs)
+{
+	return test_bit(CS_OOM, &cs->flags);
 }
 
 /*
@@ -1251,10 +1263,16 @@ static int attach_task(struct cpuset *cs
 	 * then fail this attach_task(), to avoid breaking top_cpuset.count.
 	 */
 	if (tsk->flags & PF_EXITING) {
-		task_unlock(tsk);
-		mutex_unlock(&callback_mutex);
-		put_task_struct(tsk);
-		return -ESRCH;
+		retval = -ESRCH;
+		goto error;
+	}
+	/*
+	 * If the task's cpuset is currently in the OOM killer, we cannot
+	 * move it or we'll clear the CS_OOM flag in the new cpuset.
+	 */
+	if (unlikely(is_oom(oldcs))) {
+		retval = -EBUSY;
+		goto error;
 	}
 	atomic_inc(&cs->count);
 	rcu_assign_pointer(tsk->cpuset, cs);
@@ -1281,6 +1299,12 @@ static int attach_task(struct cpuset *cs
 	if (atomic_dec_and_test(&oldcs->count))
 		check_for_release(oldcs, ppathbuf);
 	return 0;
+
+error:
+	task_unlock(tsk);
+	mutex_unlock(&callback_mutex);
+	put_task_struct(tsk);
+	return retval;
 }
 
 /* The various types of files and directories in a cpuset file system */
@@ -2603,6 +2627,39 @@ done:
 	return overlap;
 }
 
+int cpuset_get_last_tif_memdie(struct task_struct *task)
+{
+	unsigned long ret;
+	task_lock(task);
+	ret = task->cpuset->last_tif_memdie_jiffies;
+	task_unlock(task);
+	return ret;
+}
+
+void cpuset_set_last_tif_memdie(struct task_struct *task,
+				unsigned long last_tif_memdie)
+{
+	task_lock(task);
+	task->cpuset->last_tif_memdie_jiffies = last_tif_memdie;
+	task_unlock(task);
+}
+
+int cpuset_set_oom(struct task_struct *task)
+{
+	int ret;
+	task_lock(task);
+	ret = test_and_set_bit(CS_OOM, &task->cpuset->flags);
+	task_unlock(task);
+	return ret;
+}
+
+void cpuset_clear_oom(struct task_struct *task)
+{
+	task_lock(task);
+	clear_bit(CS_OOM, &task->cpuset->flags);
+	task_unlock(task);
+}
+
 /*
  * Collection of memory_pressure is suppressed unless
  * this flag is enabled by writing "1" to the special
diff --git a/kernel/exit.c b/kernel/exit.c
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -857,6 +857,7 @@ static void exit_notify(struct task_stru
 	if (unlikely(test_tsk_thread_flag(tsk, TIF_MEMDIE))) {
 		extern unsigned long VM_is_OOM;
 		clear_bit(0, &VM_is_OOM);
+		cpuset_clear_oom(tsk);
 	}
 
 	write_unlock_irq(&tasklist_lock);
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -428,6 +428,7 @@ void out_of_memory(struct zonelist *zone
 void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, int order)
 {
 	unsigned long freed = 0;
+	unsigned long last_tif_memdie;
 	int constraint;
 	static DECLARE_MUTEX(OOM_lock);
 
@@ -455,6 +456,22 @@ void out_of_memory(struct zonelist *zone
 		break;
 
 	case CONSTRAINT_CPUSET:
+		read_lock(&tasklist_lock);
+		last_tif_memdie = cpuset_get_last_tif_memdie(current);
+		/*
+		 * If current's cpuset is already in the OOM killer or its killed
+		 * task has not yet exited and a deadlock hasn't been detected, then
+		 * do nothing.
+		 */
+		if (unlikely(cpuset_set_oom(current)) &&
+		    !oom_is_deadlocked(&last_tif_memdie))
+			goto out_cpuset;
+		cpuset_set_last_tif_memdie(current, last_tif_memdie);
+		select_and_kill_process(gfp_mask, order, constraint);
+
+	out_cpuset:
+		read_unlock(&tasklist_lock);
+		break;
 	case CONSTRAINT_NONE:
 		if (down_trylock(&OOM_lock))
 			break;
@@ -467,7 +484,7 @@ void out_of_memory(struct zonelist *zone
 		 */
 		if (unlikely(test_bit(0, &VM_is_OOM)) &&
 		    !oom_is_deadlocked(&last_tif_memdie_jiffies))
-			goto out;
+			goto out_none;
 
 		if (sysctl_panic_on_oom) {
 			read_unlock(&tasklist_lock);
@@ -477,7 +494,7 @@ void out_of_memory(struct zonelist *zone
 
 		select_and_kill_process(gfp_mask, order, constraint);
 
-	out:
+	out_none:
 		read_unlock(&tasklist_lock);
 		up(&OOM_lock);
 		break;

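The CS_OOM bit behaves like a per-cpuset trylock: the first caller to
set it proceeds with the kill, and later callers see it already set
and back off (unless the deadlock timeout fires). A runnable userspace
sketch of the pattern, using C11 atomics in place of the kernel's
test_and_set_bit()/task_lock() pair:

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

static atomic_flag cs_oom = ATOMIC_FLAG_INIT;

/* Returns true if another task in this cpuset is already OOM-killing. */
static bool cpuset_set_oom_sketch(void)
{
	return atomic_flag_test_and_set(&cs_oom);
}

/* Done by the killed task on its way out (cpuset_clear_oom in the patch). */
static void cpuset_clear_oom_sketch(void)
{
	atomic_flag_clear(&cs_oom);
}

int main(void)
{
	if (!cpuset_set_oom_sketch())
		printf("first oom in this cpuset: select and kill a task\n");
	if (cpuset_set_oom_sketch())
		printf("already oom-killing here: back off (or check the timeout)\n");
	cpuset_clear_oom_sketch();
	return 0;
}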

^ permalink raw reply	[flat|nested] 113+ messages in thread

* [PATCH 24 of 24] add oom_kill_asking_task flag
  2007-08-22 12:48 [PATCH 00 of 24] OOM related fixes Andrea Arcangeli
                   ` (22 preceding siblings ...)
  2007-08-22 12:49 ` [PATCH 23 of 24] serialize for cpusets Andrea Arcangeli
@ 2007-08-22 12:49 ` Andrea Arcangeli
  2007-09-12 13:11   ` Andrew Morton
  23 siblings, 1 reply; 113+ messages in thread
From: Andrea Arcangeli @ 2007-08-22 12:49 UTC (permalink / raw)
  To: linux-mm; +Cc: David Rientjes

# HG changeset patch
# User David Rientjes <rientjes@google.com>
# Date 1187778125 -7200
# Node ID 96b5899e730ecaa2078883f75e86765fa1a36431
# Parent  a3d679df54ebb1f977b97ab6b3e501134bf9e7ef
add oom_kill_asking_task flag

Adds an oom_kill_asking_task flag to cpusets.  If unset (the default),
we iterate through the task list via select_bad_process() during a
cpuset-constrained OOM to find the best candidate task to kill.  If set,
we simply kill current to avoid that overhead, which matters for some
customers with a large number of threads or heavy workloads.

Cc: Andrea Arcangeli <andrea@suse.de>
Cc: Paul Jackson <pj@sgi.com>
Cc: Christoph Lameter <clameter@sgi.com>
Signed-off-by: David Rientjes <rientjes@google.com>
---
 Documentation/cpusets.txt |    3 +++
 include/linux/cpuset.h    |    5 +++++
 kernel/cpuset.c           |   39 ++++++++++++++++++++++++++++++++++++++-
 mm/oom_kill.c             |    7 +++++++
 4 files changed, 53 insertions(+), 1 deletions(-)

diff --git a/Documentation/cpusets.txt b/Documentation/cpusets.txt
--- a/Documentation/cpusets.txt
+++ b/Documentation/cpusets.txt
@@ -181,6 +181,9 @@ containing the following files describin
  - tasks: list of tasks (by pid) attached to that cpuset
  - notify_on_release flag: run /sbin/cpuset_release_agent on exit?
  - memory_pressure: measure of how much paging pressure in cpuset
+ - oom_kill_asking_task flag: when this cpuset OOM's, should we kill
+	the task that asked for the memory or should we iterate through
+	the task list to find the best task to kill (can be expensive)?
 
 In addition, the root cpuset only has the following file:
  - memory_pressure_enabled flag: compute memory_pressure?
diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -52,6 +52,7 @@ extern void cpuset_set_last_tif_memdie(s
 				       unsigned long last_tif_memdie);
 extern int cpuset_set_oom(struct task_struct *task);
 extern void cpuset_clear_oom(struct task_struct *task);
+extern int cpuset_oom_kill_asking_task(struct task_struct *task);
 
 #define cpuset_memory_pressure_bump() 				\
 	do {							\
@@ -136,6 +137,10 @@ static inline int cpuset_set_oom(struct 
 	return 0;
 }
 static inline void cpuset_clear_oom(struct task_struct *task) {}
+static inline int cpuset_oom_kill_asking_task(struct task_struct *task)
+{
+	return 0;
+}
 
 static inline void cpuset_memory_pressure_bump(void) {}
 
diff --git a/kernel/cpuset.c b/kernel/cpuset.c
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -116,6 +116,7 @@ typedef enum {
 	CS_SPREAD_PAGE,
 	CS_SPREAD_SLAB,
 	CS_OOM,
+	CS_OOM_KILL_ASKING_TASK,
 } cpuset_flagbits_t;
 
 /* convenient tests for these bits */
@@ -157,6 +158,11 @@ static inline int is_oom(const struct cp
 static inline int is_oom(const struct cpuset *cs)
 {
 	return test_bit(CS_OOM, &cs->flags);
+}
+
+static inline int is_oom_kill_asking_task(const struct cpuset *cs)
+{
+	return test_bit(CS_OOM_KILL_ASKING_TASK, &cs->flags);
 }
 
 /*
@@ -1068,7 +1074,8 @@ static int update_memory_pressure_enable
  * update_flag - read a 0 or a 1 in a file and update associated flag
  * bit:	the bit to update (CS_CPU_EXCLUSIVE, CS_MEM_EXCLUSIVE,
  *				CS_NOTIFY_ON_RELEASE, CS_MEMORY_MIGRATE,
- *				CS_SPREAD_PAGE, CS_SPREAD_SLAB)
+ *				CS_SPREAD_PAGE, CS_SPREAD_SLAB,
+ *				CS_OOM_KILL_ASKING_TASK)
  * cs:	the cpuset to update
  * buf:	the buffer where we read the 0 or 1
  *
@@ -1320,6 +1327,7 @@ typedef enum {
 	FILE_NOTIFY_ON_RELEASE,
 	FILE_MEMORY_PRESSURE_ENABLED,
 	FILE_MEMORY_PRESSURE,
+	FILE_OOM_KILL_ASKING_TASK,
 	FILE_SPREAD_PAGE,
 	FILE_SPREAD_SLAB,
 	FILE_TASKLIST,
@@ -1382,6 +1390,9 @@ static ssize_t cpuset_common_file_write(
 	case FILE_MEMORY_PRESSURE:
 		retval = -EACCES;
 		break;
+	case FILE_OOM_KILL_ASKING_TASK:
+		retval = update_flag(CS_OOM_KILL_ASKING_TASK, cs, buffer);
+		break;
 	case FILE_SPREAD_PAGE:
 		retval = update_flag(CS_SPREAD_PAGE, cs, buffer);
 		cs->mems_generation = cpuset_mems_generation++;
@@ -1499,6 +1510,9 @@ static ssize_t cpuset_common_file_read(s
 	case FILE_MEMORY_PRESSURE:
 		s += sprintf(s, "%d", fmeter_getrate(&cs->fmeter));
 		break;
+	case FILE_OOM_KILL_ASKING_TASK:
+		*s++ = is_oom_kill_asking_task(cs) ? '1' : '0';
+		break;
 	case FILE_SPREAD_PAGE:
 		*s++ = is_spread_page(cs) ? '1' : '0';
 		break;
@@ -1861,6 +1875,11 @@ static struct cftype cft_memory_pressure
 static struct cftype cft_memory_pressure = {
 	.name = "memory_pressure",
 	.private = FILE_MEMORY_PRESSURE,
+};
+
+static struct cftype cft_oom_kill_asking_task = {
+	.name = "oom_kill_asking_task",
+	.private = FILE_OOM_KILL_ASKING_TASK,
 };
 
 static struct cftype cft_spread_page = {
@@ -1891,6 +1910,8 @@ static int cpuset_populate_dir(struct de
 		return err;
 	if ((err = cpuset_add_file(cs_dentry, &cft_memory_pressure)) < 0)
 		return err;
+	if ((err = cpuset_add_file(cs_dentry, &cft_oom_kill_asking_task)) < 0)
+		return err;
 	if ((err = cpuset_add_file(cs_dentry, &cft_spread_page)) < 0)
 		return err;
 	if ((err = cpuset_add_file(cs_dentry, &cft_spread_slab)) < 0)
@@ -1923,6 +1944,8 @@ static long cpuset_create(struct cpuset 
 	cs->flags = 0;
 	if (notify_on_release(parent))
 		set_bit(CS_NOTIFY_ON_RELEASE, &cs->flags);
+	if (is_oom_kill_asking_task(parent))
+		set_bit(CS_OOM_KILL_ASKING_TASK, &cs->flags);
 	if (is_spread_page(parent))
 		set_bit(CS_SPREAD_PAGE, &cs->flags);
 	if (is_spread_slab(parent))
@@ -2661,6 +2684,20 @@ void cpuset_clear_oom(struct task_struct
 }
 
 /*
+ * Returns 1 if current should simply be killed when a cpuset-constrained OOM
+ * occurs.  Otherwise, we iterate through the task list and select the best
+ * candidate we can find.
+ */
+int cpuset_oom_kill_asking_task(struct task_struct *task)
+{
+	int ret;
+	task_lock(task);
+	ret = is_oom_kill_asking_task(task->cpuset);
+	task_unlock(task);
+	return ret;
+}
+
+/*
  * Collection of memory_pressure is suppressed unless
  * this flag is enabled by writing "1" to the special
  * cpuset file 'memory_pressure_enabled' in the root cpuset.
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -457,6 +457,13 @@ void out_of_memory(struct zonelist *zone
 
 	case CONSTRAINT_CPUSET:
 		read_lock(&tasklist_lock);
+		if (cpuset_oom_kill_asking_task(current)) {
+			oom_kill_process(current, 0,
+					 "No available memory in cpuset", gfp_mask,
+					 order);
+			goto out_cpuset;
+		}
+
 		last_tif_memdie = cpuset_get_last_tif_memdie(current);
 		/*
 		 * If current's cpuset is already in the OOM killer or its killed

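The flag is toggled through the cpuset filesystem like the other
per-cpuset files. A small userspace sketch follows; the /dev/cpuset
mount point and the "batch" cpuset name are assumptions for
illustration, while the file name is the one the patch creates:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	const char *path = "/dev/cpuset/batch/oom_kill_asking_task";
	int fd = open(path, O_WRONLY);

	if (fd < 0) {
		perror("open");
		return 1;
	}
	/* "1": on a cpuset oom, kill the allocating task instead of scanning */
	if (write(fd, "1", 1) != 1)
		perror("write");
	close(fd);
	return 0;
}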

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 03 of 24] prevent oom deadlocks during read/write operations
  2007-09-12 11:56   ` Andrew Morton
@ 2007-09-12  2:18     ` Nick Piggin
  2008-01-03  0:53     ` Andrea Arcangeli
  1 sibling, 0 replies; 113+ messages in thread
From: Nick Piggin @ 2007-09-12  2:18 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Andrea Arcangeli, linux-mm, David Rientjes

On Wednesday 12 September 2007 21:56, Andrew Morton wrote:

> I had to rejig this code quite a lot on top of the stuff which is pending
> in -mm and I might have missed a path.  Nick, can you please review this
> closely?

I think it looks OK. Is -ENOMEM the right thing to return here? I guess
userspace won't see it if they have a SIGKILL pending? (EINTR or
something may be more logical, but maybe the call chain can't
handle it?)

>
> The patch adds sixty-odd bytes of text to some of the most-used code in the
> kernel.  Based on the above problem description I'm doubting that this is
> justified.  Please tell us more?
>
> diff -puN mm/filemap.c~oom-handling-prevent-oom-deadlocks-during-read-write-operations mm/filemap.c
> --- a/mm/filemap.c~oom-handling-prevent-oom-deadlocks-during-read-write-operations
> +++ a/mm/filemap.c
> @@ -916,6 +916,15 @@ page_ok:
>  			goto out;
>  		}
>
> +		if (unlikely(sigismember(&current->pending.signal, SIGKILL))) {
> +			/*
> +			 * Must not hang almost forever in D state in presence
> +			 * of sigkill and lots of ram/swap (think during OOM).
> +			 */
> +			page_cache_release(page);
> +			goto out;
> +		}
> +
>  		/* nr is the maximum number of bytes to copy from this page */
>  		nr = PAGE_CACHE_SIZE;
>  		if (index == end_index) {
> @@ -2050,6 +2059,15 @@ static ssize_t generic_perform_write_2co
>  			break;
>  		}
>
> +		if (unlikely(sigismember(&current->pending.signal, SIGKILL))) {
> +			/*
> +			 * Must not hang almost forever in D state in presence
> +			 * of sigkill and lots of ram/swap (think during OOM).
> +			 */
> +			status = -ENOMEM;
> +			break;
> +		}
> +
>  		page = __grab_cache_page(mapping, index);
>  		if (!page) {
>  			status = -ENOMEM;
> @@ -2220,6 +2238,15 @@ again:
>  			break;
>  		}
>
> +		if (unlikely(sigismember(&current->pending.signal, SIGKILL))) {
> +			/*
> +			 * Must not hang almost forever in D state in presence
> +			 * of sigkill and lots of ram/swap (think during OOM).
> +			 */
> +			status = -ENOMEM;
> +			break;
> +		}
> +
>  		status = a_ops->write_begin(file, mapping, pos, bytes, flags,
>  						&page, &fsdata);
>  		if (unlikely(status))
> _


^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 01 of 24] remove nr_scan_inactive/active
  2007-08-22 12:48 ` [PATCH 01 of 24] remove nr_scan_inactive/active Andrea Arcangeli
@ 2007-09-12 11:44   ` Andrew Morton
  2008-01-02 17:50     ` Andrea Arcangeli
  0 siblings, 1 reply; 113+ messages in thread
From: Andrew Morton @ 2007-09-12 11:44 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: linux-mm, David Rientjes

On Wed, 22 Aug 2007 14:48:48 +0200 Andrea Arcangeli <andrea@suse.de> wrote:

> # HG changeset patch
> # User Andrea Arcangeli <andrea@suse.de>
> # Date 1187778124 -7200
> # Node ID c8ec651562ad6514753e408596e30d7d9e448a51
> # Parent  b03dfad58a311488ec373c30fd5dc97dc03aecae
> remove nr_scan_inactive/active
> 
> The older atomic_add/atomic_set were pointless (atomic_set vs atomic_add would
> race), but removing them didn't actually remove the race; the race is still
> there, for the same reasons atomic_add/set couldn't prevent it. This is really
> the kind of code that I dislike: it's sort of buggy, it shouldn't be making
> any measurable difference, and when it does do something for real it can
> only hurt!
> 
> The real focus is on shrink_zone (ignore the other places where it's being used,
> which are even less interesting). Assume two tasks add to nr_scan_*active at
> the same time (first line of the old buggy code): they'll effectively double their
> scan rate, for no good reason. What can happen is that instead of scanning
> nr_entries each, they'll scan nr_entries*2 each. The more CPUs, the bigger the
> race, the higher the multiplication effect, and the harder it will be to
> detect oom. In the case that nr_*active < sc->swap_cluster_max, regardless of
> whatever future invocation of alloc_pages, we'll be going down in the
> priorities in the current alloc_pages invocation if the DEF_PRIORITY was too
> high to do any work, so again accumulating the nr_scan_*active doesn't seem
> interesting even when it's smaller than sc->swap_cluster_max. Each task should
> work for itself without much care of what the others are doing.

You're coming at this from the wrong end of town.  The code in there is to
address small zones (actually small LRU lists) at "easy" scanning
priorities.  I suspect you just broke it in that region of operation.

Does that above text describe something which you've observed and measured
in practice, or is it theoretical-from-code-inspection?


> ...
>

We go from this:

	/*
	 * Add one to `nr_to_scan' just to make sure that the kernel will
	 * slowly sift through the active list.
	 */
	zone->nr_scan_active +=
		(zone_page_state(zone, NR_ACTIVE) >> priority) + 1;
	nr_active = zone->nr_scan_active;
	if (nr_active >= sc->swap_cluster_max)
		zone->nr_scan_active = 0;
	else
		nr_active = 0;

	zone->nr_scan_inactive +=
		(zone_page_state(zone, NR_INACTIVE) >> priority) + 1;
	nr_inactive = zone->nr_scan_inactive;
	if (nr_inactive >= sc->swap_cluster_max)
		zone->nr_scan_inactive = 0;
	else
		nr_inactive = 0;

	while (nr_active || nr_inactive) {


to this:


	/*
	 * Add one to `nr_to_scan' just to make sure that the kernel will
	 * slowly sift through the active list.
	 */
	nr_active = zone_page_state(zone, NR_ACTIVE) >> priority;
	if (nr_active < sc->swap_cluster_max)
		nr_active = 0;
	nr_inactive = zone_page_state(zone, NR_INACTIVE) >> priority;
	if (nr_inactive < sc->swap_cluster_max)
		nr_inactive = 0;

	while (nr_active || nr_inactive) {


I have issues.


The old code took care of the situation where zone_page_state(zone,
NR_ACTIVE) is smaller than (1 << priority): do a bit of reclaim in that
case anyway.  This is a minor issue, as we'll at least perform some
scanning when priority is low.  But you should have deleted the now-wrong
comment.


More serious issue: the logic in there takes care of balancing a small LRU
list.  If (zone_page_state(zone, NR_ACTIVE)>>priority) is, umm, "3" then
we'll add "3" into zone->nr_scan_active and then leave the zone alone. 
Once we've done this enough times, the "3"s will add up to something which
is larger than swap_cluster_max and then we'll do a round of scanning for
real.

Your change breaks that logic and there is potential that a small LRU will
be underscanned, especially when reclaim is not under distress.

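The underscanning risk is easy to demonstrate numerically. A toy,
runnable sketch (the per-pass figure of 3 pages and the loop count are
made up) contrasting the accumulate-then-batch logic with the patched
threshold-only logic:

#include <stdio.h>

#define SWAP_CLUSTER_MAX 32

int main(void)
{
	unsigned long acc = 0, scanned_old = 0, scanned_new = 0;
	int pass;

	for (pass = 0; pass < 100; pass++) {
		/* tiny LRU: NR_ACTIVE >> priority evaluates to only 3 */
		unsigned long want = 3;

		/* old behaviour: bank the credit until a full batch is due */
		acc += want + 1;
		if (acc >= SWAP_CLUSTER_MAX) {
			scanned_old += acc;
			acc = 0;
		}

		/* patched behaviour: below the batch size, scan nothing */
		if (want >= SWAP_CLUSTER_MAX)
			scanned_new += want;
	}
	printf("with accumulation: %lu pages scanned\n", scanned_old);
	printf("after the patch:   %lu pages scanned\n", scanned_new);	/* 0 */
	return 0;
}
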
I don't know how serious this change is, but it's a change for the worse
and it would take quite a bit of thought and careful testing to be able to
justify this change.

According to the above-described logic, one would think that it would be
more accurate to replace the existing

	if (nr_active >= sc->swap_cluster_max)
		zone->nr_scan_active = 0;

with

	if (nr_active >= sc->swap_cluster_max)
		zone->nr_scan_active -= sc->swap_cluster_max;

and for twelve seconds on 12 March 2004 we were partially doing that, but
then I merged this:

commit 4d5e349b89e4017ddbdbd06345e94c59e8b851b7
Author: akpm <akpm>
Date:   Fri Mar 12 16:25:24 2004 +0000

    [PATCH] fix vm-batch-inactive-scanning.patch
    
    - prevent nr_scan_inactive from going negative
    
    - compare `count' with SWAP_CLUSTER_MAX, not `max_scan'
    
    - Use ">= SWAP_CLUSTER_MAX", not "> SWAP_CLUSTER_MAX".
    
    BKrev: 4051e474u37Zwj2o6Q5o5NeVCL-5kQ

diff --git a/mm/vmscan.c b/mm/vmscan.c
index fb86cb2..65824df 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -757,14 +757,14 @@ shrink_zone(struct zone *zone, int max_s
 	ratio = (unsigned long)SWAP_CLUSTER_MAX * zone->nr_active /
 				((zone->nr_inactive | 1) * 2);
 	atomic_add(ratio+1, &zone->nr_scan_active);
-	if (atomic_read(&zone->nr_scan_active) > SWAP_CLUSTER_MAX) {
+	count = atomic_read(&zone->nr_scan_active);
+	if (count >= SWAP_CLUSTER_MAX) {
 		/*
 		 * Don't try to bring down too many pages in one attempt.
 		 * If this fails, the caller will increase `priority' and
 		 * we'll try again, with an increased chance of reclaiming
 		 * mapped memory.
 		 */
-		count = atomic_read(&zone->nr_scan_active);
 		if (count > SWAP_CLUSTER_MAX * 4)
 			count = SWAP_CLUSTER_MAX * 4;
 		atomic_set(&zone->nr_scan_active, 0);
@@ -773,8 +773,8 @@ shrink_zone(struct zone *zone, int max_s
 
 	atomic_add(max_scan, &zone->nr_scan_inactive);
 	count = atomic_read(&zone->nr_scan_inactive);
-	if (max_scan > SWAP_CLUSTER_MAX) {
-		atomic_sub(count, &zone->nr_scan_inactive);
+	if (count >= SWAP_CLUSTER_MAX) {
+		atomic_set(&zone->nr_scan_inactive, 0);
 		return shrink_cache(zone, gfp_mask, count, total_scanned);
 	}
 	return 0;


which made both the inactive and active list scanning the same (and
inaccurate).

So I'm thinking that a correct fix to all these problems is to go back to
atomics and to not just set the counters to zero, but to subtract the
number-of-scanned-pages from them, as we're supposed to do.

An alternative approach might be to touch nr_scan_[in]active at all only
when (zone_page_state(zone, NR_ACTIVE) >> priority) is less than
(1<<priority).  So most of the time we'll just go in there and scan the
full swap_cluster_max pages.  And the nr_scan_[in]active counters are
purely used as "fractional" counters to prevent the underscanning in the
corner cases to which I referred above.

Yet another alternative approach would be to remove the batching
altogether.  If (zone_page_state(zone, NR_ACTIVE) >> priority) evaluates to
"3", well, just go in and scan three pages.  That should address any
accuracy problems and it will address the problem which you're addressing,
but it will add unknown-but-probably-small computational cost.



^ permalink raw reply related	[flat|nested] 113+ messages in thread

* Re: [PATCH 03 of 24] prevent oom deadlocks during read/write operations
  2007-08-22 12:48 ` [PATCH 03 of 24] prevent oom deadlocks during read/write operations Andrea Arcangeli
@ 2007-09-12 11:56   ` Andrew Morton
  2007-09-12  2:18     ` Nick Piggin
  2008-01-03  0:53     ` Andrea Arcangeli
  0 siblings, 2 replies; 113+ messages in thread
From: Andrew Morton @ 2007-09-12 11:56 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: linux-mm, David Rientjes, Nick Piggin

On Wed, 22 Aug 2007 14:48:50 +0200 Andrea Arcangeli <andrea@suse.de> wrote:

> # HG changeset patch
> # User Andrea Arcangeli <andrea@suse.de>
> # Date 1187778124 -7200
> # Node ID 5566f2af006a171cd47d596c6654f51beca74203
> # Parent  90afd499e8ca0dfd2e0284372dca50f2e6149700
> prevent oom deadlocks during read/write operations
> 
> We need to react to SIGKILL during read/write with huge buffers, or it
> becomes too easy to prevent a SIGKILLed task from running do_exit promptly
> after it has been selected for oom-killage.
> 
> Signed-off-by: Andrea Arcangeli <andrea@suse.de>
> 
> diff --git a/mm/filemap.c b/mm/filemap.c
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -925,6 +925,13 @@ page_ok:
>  			goto out;
>  		}
>  
> +		if (unlikely(sigismember(&current->pending.signal, SIGKILL)))
> +			/*
> +			 * Must not hang almost forever in D state in presence of sigkill
> +			 * and lots of ram/swap (think during OOM).
> +			 */
> +			break;
> +

please try to keep the code inside 80 cols.

this code leaks a page ref.

>  		/* nr is the maximum number of bytes to copy from this page */
>  		nr = PAGE_CACHE_SIZE;
>  		if (index == end_index) {
> @@ -1868,6 +1875,13 @@ generic_file_buffered_write(struct kiocb
>  		unsigned long index;
>  		unsigned long offset;
>  		size_t copied;
> +
> +		if (unlikely(sigismember(&current->pending.signal, SIGKILL)))
> +			/*
> +			 * Must not hang almost forever in D state in presence of sigkill
> +			 * and lots of ram/swap (think during OOM).
> +			 */
> +			break;
>  
>  		offset = (pos & (PAGE_CACHE_SIZE -1)); /* Within page */
>  		index = pos >> PAGE_CACHE_SHIFT;
> 

I had to rejig this code quite a lot on top of the stuff which is pending
in -mm and I might have missed a path.  Nick, can you please review this
closely?

The patch adds sixty-odd bytes of text to some of the most-used code in the
kernel.  Based on the above problem description I'm doubting that this is
justified.  Please tell us more?

diff -puN mm/filemap.c~oom-handling-prevent-oom-deadlocks-during-read-write-operations mm/filemap.c
--- a/mm/filemap.c~oom-handling-prevent-oom-deadlocks-during-read-write-operations
+++ a/mm/filemap.c
@@ -916,6 +916,15 @@ page_ok:
 			goto out;
 		}
 
+		if (unlikely(sigismember(&current->pending.signal, SIGKILL))) {
+			/*
+			 * Must not hang almost forever in D state in presence
+			 * of sigkill and lots of ram/swap (think during OOM).
+			 */
+			page_cache_release(page);
+			goto out;
+		}
+
 		/* nr is the maximum number of bytes to copy from this page */
 		nr = PAGE_CACHE_SIZE;
 		if (index == end_index) {
@@ -2050,6 +2059,15 @@ static ssize_t generic_perform_write_2co
 			break;
 		}
 
+		if (unlikely(sigismember(&current->pending.signal, SIGKILL))) {
+			/*
+			 * Must not hang almost forever in D state in presence
+			 * of sigkill and lots of ram/swap (think during OOM).
+			 */
+			status = -ENOMEM;
+			break;
+		}
+
 		page = __grab_cache_page(mapping, index);
 		if (!page) {
 			status = -ENOMEM;
@@ -2220,6 +2238,15 @@ again:
 			break;
 		}
 
+		if (unlikely(sigismember(&current->pending.signal, SIGKILL))) {
+			/*
+			 * Must not hang almost forever in D state in presence
+			 * of sigkill and lots of ram/swap (think during OOM).
+			 */
+			status = -ENOMEM;
+			break;
+		}
+
 		status = a_ops->write_begin(file, mapping, pos, bytes, flags,
 						&page, &fsdata);
 		if (unlikely(status))
_


^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 04 of 24] serialize oom killer
  2007-08-22 12:48 ` [PATCH 04 of 24] serialize oom killer Andrea Arcangeli
@ 2007-09-12 12:02   ` Andrew Morton
  2007-09-12 12:04     ` Andrew Morton
  2008-01-03  0:55     ` Andrea Arcangeli
  2007-09-13  0:09   ` Christoph Lameter
  1 sibling, 2 replies; 113+ messages in thread
From: Andrew Morton @ 2007-09-12 12:02 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: linux-mm, David Rientjes

On Wed, 22 Aug 2007 14:48:51 +0200 Andrea Arcangeli <andrea@suse.de> wrote:

> # HG changeset patch
> # User Andrea Arcangeli <andrea@suse.de>
> # Date 1187778125 -7200
> # Node ID 871b7a4fd566de0811207628b74abea0a73341f6
> # Parent  5566f2af006a171cd47d596c6654f51beca74203
> serialize oom killer
> 
> It's risky and useless to run two oom killers in parallel; let's serialize them
> to reduce the probability of spurious oom-killage.
> 
> Signed-off-by: Andrea Arcangeli <andrea@suse.de>
> 
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -401,12 +401,15 @@ void out_of_memory(struct zonelist *zone
>  	unsigned long points = 0;
>  	unsigned long freed = 0;
>  	int constraint;
> +	static DECLARE_MUTEX(OOM_lock);
>  
>  	blocking_notifier_call_chain(&oom_notify_list, 0, &freed);
>  	if (freed > 0)
>  		/* Got some memory back in the last second. */
>  		return;
>  
> +	if (down_trylock(&OOM_lock))
> +		return;
>  	if (printk_ratelimit()) {
>  		printk(KERN_WARNING "%s invoked oom-killer: "
>  			"gfp_mask=0x%x, order=%d, oomkilladj=%d\n",
> @@ -473,4 +476,6 @@ out:
>  	 */
>  	if (!test_thread_flag(TIF_MEMDIE))
>  		schedule_timeout_uninterruptible(1);
> -}
> +
> +	up(&OOM_lock);
> +}

Please use mutexes, not semaphores.  I'll make this change.

I think this patch needs more explanation/justification.

What problems were observed, and what effect did this change have upon the
system behaviour?

What happens to all the tasks which fail to grab the lock?  Do they return
to sleep in congestion_wait() for a bit?  If so, OK.  But are there
scenarios in which they'll go nuts consuming CPU?  Because if there are, a
non-preemptible uniproc kernel could wedge up forever: the task which holds
OOM_lock is not running and the task which is trying to get it never gives
up the CPU?

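For reference, the mutex version of the same serialization looks like
this: a minimal userspace sketch with a pthread mutex standing in for
the kernel mutex being asked for, where losers of the trylock simply
return and retry the allocation path rather than spinning:

#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t oom_lock = PTHREAD_MUTEX_INITIALIZER;

static void out_of_memory_sketch(void)
{
	if (pthread_mutex_trylock(&oom_lock) != 0)
		return;	/* someone else is OOM-killing: back off, don't spin */
	printf("selecting and killing a task...\n");
	pthread_mutex_unlock(&oom_lock);
}

int main(void)
{
	out_of_memory_sketch();
	return 0;
}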

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 04 of 24] serialize oom killer
  2007-09-12 12:02   ` Andrew Morton
@ 2007-09-12 12:04     ` Andrew Morton
  2007-09-12 12:11       ` Andrea Arcangeli
  2008-01-03  0:55     ` Andrea Arcangeli
  1 sibling, 1 reply; 113+ messages in thread
From: Andrew Morton @ 2007-09-12 12:04 UTC (permalink / raw)
  To: Andrea Arcangeli, linux-mm, David Rientjes

On Wed, 12 Sep 2007 05:02:05 -0700 Andrew Morton <akpm@linux-foundation.org> wrote:

> > +	up(&OOM_lock);
> > +}
> 
> Please use mutexes, not semaphores.  I'll make this change.

gargh, shit, OOM_lock is all over the patch series.

Is there some reason why it had to be a semaphore?  Does it get upped by
tasks which didn't down it?  Does the semaphore counting feature get used?


^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 04 of 24] serialize oom killer
  2007-09-12 12:04     ` Andrew Morton
@ 2007-09-12 12:11       ` Andrea Arcangeli
  0 siblings, 0 replies; 113+ messages in thread
From: Andrea Arcangeli @ 2007-09-12 12:11 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-mm, David Rientjes

On Wed, Sep 12, 2007 at 05:04:47AM -0700, Andrew Morton wrote:
> Is there some reason why it had to be a semaphore?  Does it get upped by
> tasks which didn't down it?  Does the semaphore counting feature get used?

No, you're right, this can be a mutex.  The reason it is a semaphore is
that those bugs had to be fixed against a 2.6.5 kernel first ;)


^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 06 of 24] reduce the probability of an OOM livelock
  2007-08-22 12:48 ` [PATCH 06 of 24] reduce the probability of an OOM livelock Andrea Arcangeli
@ 2007-09-12 12:17   ` Andrew Morton
  2008-01-03  1:03     ` Andrea Arcangeli
  0 siblings, 1 reply; 113+ messages in thread
From: Andrew Morton @ 2007-09-12 12:17 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: linux-mm, David Rientjes

On Wed, 22 Aug 2007 14:48:53 +0200 Andrea Arcangeli <andrea@suse.de> wrote:

> # HG changeset patch
> # User Andrea Arcangeli <andrea@suse.de>
> # Date 1187778125 -7200
> # Node ID 49e2d90eb0d7b1021b1e1e841bef22fdc647766e
> # Parent  de62eb332b1dfee7e493043b20e560283ef42f67
> reduce the probability of an OOM livelock
> 
> There's no need to loop way too many times over the lrus in order to
> declare defeat and decide to kill a task. The more loops we do, the more
> likely we are to run into a livelock with a page bouncing back and
> forth between tasks. The maximum number of entries to check in a loop
> that returns less than swap-cluster-max pages freed should be the size
> of the list (or at most twice the size of the list if you want to be
> really paranoid about the PG_referenced bit).
> 
> Our objective there is to know reliably when it's time to kill a
> task; trying to free a few more pages at that already critical point is
> worthless.
> 
> This seems to have the effect of reducing the "hang" time during oom
> killing.
> 
> Signed-off-by: Andrea Arcangeli <andrea@suse.de>
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1112,7 +1112,7 @@ unsigned long try_to_free_pages(struct z
>  	int priority;
>  	int ret = 0;
>  	unsigned long total_scanned = 0;
> -	unsigned long nr_reclaimed = 0;
> +	unsigned long nr_reclaimed;
>  	struct reclaim_state *reclaim_state = current->reclaim_state;
>  	unsigned long lru_pages = 0;
>  	int i;
> @@ -1141,12 +1141,12 @@ unsigned long try_to_free_pages(struct z
>  		sc.nr_scanned = 0;
>  		if (!priority)
>  			disable_swap_token();
> -		nr_reclaimed += shrink_zones(priority, zones, &sc);
> +		nr_reclaimed = shrink_zones(priority, zones, &sc);
> +		if (reclaim_state)
> +			reclaim_state->reclaimed_slab = 0;
>  		shrink_slab(sc.nr_scanned, gfp_mask, lru_pages);
> -		if (reclaim_state) {
> +		if (reclaim_state)
>  			nr_reclaimed += reclaim_state->reclaimed_slab;
> -			reclaim_state->reclaimed_slab = 0;
> -		}
>  		total_scanned += sc.nr_scanned;
>  		if (nr_reclaimed >= sc.swap_cluster_max) {
>  			ret = 1;

I don't get it.  This code changes try_to_free_pages() so that it will only
bale out when a single scan of the zone at a particular priority reclaimed
more than swap_cluster_max pages.  Previously we'd include the results of all the
lower-priority scanning in that comparison too.

So this patch will make try_to_free_pages() do _more_ scanning than it used
to, in some situations.  Which seems opposite to what you're trying to do
here.

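The semantic shift is visible with toy numbers. A runnable sketch (the
per-priority reclaim figures are invented) of the cumulative versus
last-pass-only comparison against the threshold:

#include <stdio.h>

#define SWAP_CLUSTER_MAX 32

int main(void)
{
	unsigned long per_pass[] = { 10, 12, 15, 40 };	/* reclaim per priority */
	unsigned long cum = 0;
	int i, old_done = 0, new_done = 0;

	for (i = 0; i < 4; i++) {
		cum += per_pass[i];
		/* old code: the running total crosses the threshold first */
		if (!old_done && cum >= SWAP_CLUSTER_MAX) {
			printf("old code bales out after pass %d\n", i + 1);
			old_done = 1;
		}
		/* patched code: only a single big pass can bale out */
		if (!new_done && per_pass[i] >= SWAP_CLUSTER_MAX) {
			printf("patched code bales out after pass %d\n", i + 1);
			new_done = 1;
		}
	}
	return 0;
}
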
> @@ -1238,7 +1238,6 @@ static unsigned long balance_pgdat(pg_da
>  
>  loop_again:
>  	total_scanned = 0;
> -	nr_reclaimed = 0;
>  	sc.may_writepage = !laptop_mode;
>  	count_vm_event(PAGEOUTRUN);
>  
> @@ -1293,6 +1292,7 @@ loop_again:
>  		 * pages behind kswapd's direction of progress, which would
>  		 * cause too much scanning of the lower zones.
>  		 */
> +		nr_reclaimed = 0;
>  		for (i = 0; i <= end_zone; i++) {
>  			struct zone *zone = pgdat->node_zones + i;
>  			int nr_slab;
> 

A similar situation exists with this change.

Your changelog made no mention of the change to balance_pgdat() and I'm
struggling a bit to see what it's doing in there.


In both places, the definition of local variable nr_reclaimed can be moved
into a more inner scope.  This makes the code easier to follow.  Please
watch out for cleanup opportunities like that.

I'll skip this patch.


^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 07 of 24] balance_pgdat doesn't return the number of pages freed
  2007-08-22 12:48 ` [PATCH 07 of 24] balance_pgdat doesn't return the number of pages freed Andrea Arcangeli
@ 2007-09-12 12:18   ` Andrew Morton
  2007-09-13  0:26     ` Christoph Lameter
  0 siblings, 1 reply; 113+ messages in thread
From: Andrew Morton @ 2007-09-12 12:18 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: linux-mm, David Rientjes

On Wed, 22 Aug 2007 14:48:54 +0200 Andrea Arcangeli <andrea@suse.de> wrote:

> # HG changeset patch
> # User Andrea Arcangeli <andrea@suse.de>
> # Date 1187778125 -7200
> # Node ID b66d8470c04ed836787f69c7578d5fea4f18c322
> # Parent  49e2d90eb0d7b1021b1e1e841bef22fdc647766e
> balance_pgdat doesn't return the number of pages freed
> 
> nr_reclaimed would be the number of pages freed in the last pass.
> 
> Signed-off-by: Andrea Arcangeli <andrea@suse.de>
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1198,8 +1198,6 @@ out:
>   * For kswapd, balance_pgdat() will work across all this node's zones until
>   * they are all at pages_high.
>   *
> - * Returns the number of pages which were actually freed.
> - *
>   * There is special handling here for zones which are full of pinned pages.
>   * This can happen if the pages are all mlocked, or if they are all used by
>   * device drivers (say, ZONE_DMA).  Or if they are all in use by hugetlb.
> @@ -1215,7 +1213,7 @@ out:
>   * the page allocator fallback scheme to ensure that aging of pages is balanced
>   * across the zones.
>   */
> -static unsigned long balance_pgdat(pg_data_t *pgdat, int order)
> +static void balance_pgdat(pg_data_t *pgdat, int order)
>  {
>  	int all_zones_ok;
>  	int priority;
> @@ -1366,8 +1364,6 @@ out:
>  
>  		goto loop_again;
>  	}
> -
> -	return nr_reclaimed;
>  }
>  

I'll skip this due to its dependency on
[PATCH 06 of 24] reduce the probability of an OOM livelock


^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 08 of 24] don't depend on PF_EXITING tasks to go away
  2007-08-22 12:48 ` [PATCH 08 of 24] don't depend on PF_EXITING tasks to go away Andrea Arcangeli
@ 2007-09-12 12:20   ` Andrew Morton
  2008-01-03  0:56     ` Andrea Arcangeli
  0 siblings, 1 reply; 113+ messages in thread
From: Andrew Morton @ 2007-09-12 12:20 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: linux-mm, David Rientjes

On Wed, 22 Aug 2007 14:48:55 +0200 Andrea Arcangeli <andrea@suse.de> wrote:

> # HG changeset patch
> # User Andrea Arcangeli <andrea@suse.de>
> # Date 1187778125 -7200
> # Node ID ffdc30241856d7155ceedd4132eef684f7cc7059
> # Parent  b66d8470c04ed836787f69c7578d5fea4f18c322
> don't depend on PF_EXITING tasks to go away
> 
> A PF_EXITING task doesn't have TIF_MEMDIE set, so it might get stuck in
> memory allocations without access to the PF_MEMALLOC pool (that said,
> ideally do_exit wouldn't require memory allocations at all, especially
> not before calling exit_mm). The same way we raise its privilege to
> TIF_MEMDIE if it's the current task, we should do it even if it's not
> the current task, to speed up oom killing.
> 
> Signed-off-by: Andrea Arcangeli <andrea@suse.de>
> 
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -234,27 +234,13 @@ static struct task_struct *select_bad_pr
>  		 * Note: this may have a chance of deadlock if it gets
>  		 * blocked waiting for another task which itself is waiting
>  		 * for memory. Is there a better alternative?
> +		 *
> +		 * Better not to skip PF_EXITING tasks, since they
> +		 * don't have access to the PF_MEMALLOC pool until
> +		 * we select them here first.
>  		 */
>  		if (test_tsk_thread_flag(p, TIF_MEMDIE))
>  			return ERR_PTR(-1UL);
> -
> -		/*
> -		 * This is in the process of releasing memory so wait for it
> -		 * to finish before killing some other task by mistake.
> -		 *
> -		 * However, if p is the current task, we allow the 'kill' to
> -		 * go ahead if it is exiting: this will simply set TIF_MEMDIE,
> -		 * which will allow it to gain access to memory reserves in
> -		 * the process of exiting and releasing its resources.
> -		 * Otherwise we could get an easy OOM deadlock.
> -		 */
> -		if (p->flags & PF_EXITING) {
> -			if (p != current)
> -				return ERR_PTR(-1UL);
> -
> -			chosen = p;
> -			*ppoints = ULONG_MAX;
> -		}
>  
>  		if (p->oomkilladj == OOM_DISABLE)
>  			continue;
> 

hm, I'll believe you.

Does this address any problem which was actually observed in real life?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 09 of 24] fallback killing more tasks if tif-memdie doesn't go away
  2007-08-22 12:48 ` [PATCH 09 of 24] fallback killing more tasks if tif-memdie doesn't " Andrea Arcangeli
@ 2007-09-12 12:30   ` Andrew Morton
  2007-09-12 12:34     ` Andrew Morton
  2008-01-03  1:06     ` Andrea Arcangeli
  0 siblings, 2 replies; 113+ messages in thread
From: Andrew Morton @ 2007-09-12 12:30 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: linux-mm, David Rientjes

On Wed, 22 Aug 2007 14:48:56 +0200 Andrea Arcangeli <andrea@suse.de> wrote:

> # HG changeset patch
> # User Andrea Arcangeli <andrea@suse.de>
> # Date 1187778125 -7200
> # Node ID 9bf6a66eab3c52327daa831ef101d7802bc71791
> # Parent  ffdc30241856d7155ceedd4132eef684f7cc7059
> fallback killing more tasks if tif-memdie doesn't go away
> 
> Waiting indefinitely for a TIF_MEMDIE task to go away will deadlock. Two
> tasks reading from the same inode at the same time, both going out of
> memory inside a read(largebuffer) syscall, can even deadlock through
> contention on the PG_locked bitflag. The task holding the semaphore
> detects oom, but the oom killer decides to kill the task blocked in
> wait_on_page_locked(). The task holding the semaphore will hang inside
> alloc_pages, which will never return because it waits for the TIF_MEMDIE
> task to go away, but the TIF_MEMDIE task can't go away until the task
> holding the semaphore is killed in the first place.

hrm, OK, that's not nice

> It's quite impractical to teach the oom killer the locking dependencies
> across running tasks, so the feasible fix is to add logic that, after
> waiting a long time for a TIF_MEMDIE task to go away, falls back to
> killing one more task. This also eliminates the possibility of
> spurious oom killing (i.e. two tasks killed when only one had to be
> killed). It's not a mathematical guarantee, because we can't demonstrate
> that if a TIF_MEMDIE SIGKILLed task didn't manage to complete do_exit
> within 10sec, it never will. But the current probability of spurious
> oom killing is surely much higher than the probability of spurious oom
> killing with this patch applied.
> 
> All the locking revolves around the tasklist_lock. On one side do_exit
> reads TIF_MEMDIE and clears VM_is_OOM under the lock; on the other side
> the oom killer accesses VM_is_OOM and TIF_MEMDIE under the lock. This is
> a read_lock in the oom killer, but it's effectively a write lock thanks
> to the OOM_lock semaphore allowing only one oom killer at a time (the
> locking rule is: either use write_lock_irq, or read_lock+OOM_lock).
> 

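(Spelling that rule out: a VM_is_OOM writer does

	write_lock_irq(&tasklist_lock);
	/* ... read TIF_MEMDIE, clear VM_is_OOM ... */
	write_unlock_irq(&tasklist_lock);

while the oom killer does

	if (down_trylock(&OOM_lock))
		return;		/* someone else is oom killing */
	read_lock(&tasklist_lock);
	/* ... test and set VM_is_OOM ... */
	read_unlock(&tasklist_lock);
	up(&OOM_lock);

and since OOM_lock admits only one oom killer at a time, the
read_lock+OOM_lock pair excludes every other VM_is_OOM accessor just
as a write_lock would.)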

> 
> diff --git a/kernel/exit.c b/kernel/exit.c
> --- a/kernel/exit.c
> +++ b/kernel/exit.c
> @@ -849,6 +849,15 @@ static void exit_notify(struct task_stru
>  	if (tsk->exit_signal == -1 && likely(!tsk->ptrace))
>  		state = EXIT_DEAD;
>  	tsk->exit_state = state;
> +
> +	/*
> +	 * Read TIF_MEMDIE and set VM_is_OOM to 0 atomically inside
> +	 * the tasklist_lock_lock.
> +	 */
> +	if (unlikely(test_tsk_thread_flag(tsk, TIF_MEMDIE))) {
> +		extern unsigned long VM_is_OOM;
> +		clear_bit(0, &VM_is_OOM);
> +	}

Please, no externs-in-C, ever.

>  	write_unlock_irq(&tasklist_lock);
>  
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -29,6 +29,9 @@ int sysctl_panic_on_oom;
>  int sysctl_panic_on_oom;
>  /* #define DEBUG */
>  
> +unsigned long VM_is_OOM;

what's with the studlycaps, btw?

> +static unsigned long last_tif_memdie_jiffies;
> +
>  /**
>   * badness - calculate a numeric value for how bad this task has been
>   * @p: task struct of which task we should calculate
> @@ -226,21 +229,14 @@ static struct task_struct *select_bad_pr
>  		if (is_init(p))
>  			continue;
>  
> -		/*
> -		 * This task already has access to memory reserves and is
> -		 * being killed. Don't allow any other task access to the
> -		 * memory reserve.
> -		 *
> -		 * Note: this may have a chance of deadlock if it gets
> -		 * blocked waiting for another task which itself is waiting
> -		 * for memory. Is there a better alternative?
> -		 *
> -		 * Better not to skip PF_EXITING tasks, since they
> -		 * don't have access to the PF_MEMALLOC pool until
> -		 * we select them here first.
> -		 */
> -		if (test_tsk_thread_flag(p, TIF_MEMDIE))
> -			return ERR_PTR(-1UL);
> +		if (unlikely(test_tsk_thread_flag(p, TIF_MEMDIE))) {
> +			/*
> +			 * Either we already waited long enough,
> +			 * or exit_mm already run, so we must
> +			 * try to kill another task.
> +			 */
> +			continue;
> +		}
>  
>  		if (p->oomkilladj == OOM_DISABLE)
>  			continue;
> @@ -277,13 +273,16 @@ static void __oom_kill_task(struct task_
>  	if (verbose)
>  		printk(KERN_ERR "Killed process %d (%s)\n", p->pid, p->comm);
>  
> +	if (!test_and_set_tsk_thread_flag(p, TIF_MEMDIE)) {
> +		last_tif_memdie_jiffies = jiffies;
> +		set_bit(0, &VM_is_OOM);
> +	}

Do we actually need the bitops?  Wouldn't a simple old foo=0 suffice here??
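i.e., given that all VM_is_OOM accesses are claimed to be serialized
by tasklist_lock anyway, something like this should do (untested):

	if (!test_and_set_tsk_thread_flag(p, TIF_MEMDIE)) {
		last_tif_memdie_jiffies = jiffies;
		VM_is_OOM = 1;	/* plain store, under tasklist_lock */
	}

with a matching plain VM_is_OOM = 0 in exit_notify().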

>  	/*
>  	 * We give our sacrificial lamb high priority and access to
>  	 * all the memory it needs. That way it should be able to
>  	 * exit() and clear out its resources quickly...
>  	 */
>  	p->time_slice = HZ;
> -	set_tsk_thread_flag(p, TIF_MEMDIE);
>  
>  	force_sig(SIGKILL, p);
>  }
> @@ -420,6 +419,18 @@ void out_of_memory(struct zonelist *zone
>  	constraint = constrained_alloc(zonelist, gfp_mask);
>  	cpuset_lock();
>  	read_lock(&tasklist_lock);
> +
> +	/*
> +	 * This holds the down(OOM_lock)+read_lock(tasklist_lock), so it's
> +	 * equivalent to write_lock_irq(tasklist_lock) as far as VM_is_OOM
> +	 * is concerned.
> +	 */
> +	if (unlikely(test_bit(0, &VM_is_OOM))) {
> +		if (time_before(jiffies, last_tif_memdie_jiffies + 10*HZ))
> +			goto out;
> +		printk("detected probable OOM deadlock, so killing another task\n");

Please include a facility level in all printks

> +		last_tif_memdie_jiffies = jiffies;
> +	}
>  
>  	switch (constraint) {
>  	case CONSTRAINT_MEMORY_POLICY:
> @@ -441,10 +452,6 @@ retry:
>  		 * issues we may have.
>  		 */
>  		p = select_bad_process(&points);
> -
> -		if (PTR_ERR(p) == -1UL)
> -			goto out;
>  		/* Found nothing?!?! Either we hang forever, or we panic. */
>  		if (!p) {
>  			read_unlock(&tasklist_lock);

Something like this...


 include/linux/swap.h |    1 +
 kernel/exit.c        |   11 +++++------
 mm/oom_kill.c        |   11 ++++++-----
 3 files changed, 12 insertions(+), 11 deletions(-)

diff -puN kernel/exit.c~oom-handling-fallback-killing-more-tasks-if-tif-memdie-doesnt-go-away-fix kernel/exit.c
--- a/kernel/exit.c~oom-handling-fallback-killing-more-tasks-if-tif-memdie-doesnt-go-away-fix
+++ a/kernel/exit.c
@@ -30,6 +30,7 @@
 #include <linux/kthread.h>
 #include <linux/mempolicy.h>
 #include <linux/taskstats_kern.h>
+#include <linux/swap.h>
 #include <linux/delayacct.h>
 #include <linux/freezer.h>
 #include <linux/cpuset.h>
@@ -851,13 +852,11 @@ static void exit_notify(struct task_stru
 	tsk->exit_state = state;
 
 	/*
-	 * Read TIF_MEMDIE and set VM_is_OOM to 0 atomically inside
-	 * the tasklist_lock_lock.
+	 * Read TIF_MEMDIE and set vm_is_oom to 0 atomically inside
+	 * the tasklist_lock.
 	 */
-	if (unlikely(test_tsk_thread_flag(tsk, TIF_MEMDIE))) {
-		extern unsigned long VM_is_OOM;
-		clear_bit(0, &VM_is_OOM);
-	}
+	if (unlikely(test_tsk_thread_flag(tsk, TIF_MEMDIE)))
+		clear_bit(0, &vm_is_oom);
 
 	write_unlock_irq(&tasklist_lock);
 
diff -puN mm/oom_kill.c~oom-handling-fallback-killing-more-tasks-if-tif-memdie-doesnt-go-away-fix mm/oom_kill.c
--- a/mm/oom_kill.c~oom-handling-fallback-killing-more-tasks-if-tif-memdie-doesnt-go-away-fix
+++ a/mm/oom_kill.c
@@ -29,7 +29,7 @@
 int sysctl_panic_on_oom;
 /* #define DEBUG */
 
-unsigned long VM_is_OOM;
+unsigned long vm_is_oom;
 static unsigned long last_tif_memdie_jiffies;
 
 /**
@@ -268,7 +268,7 @@ static void __oom_kill_task(struct task_
 
 	if (!test_and_set_tsk_thread_flag(p, TIF_MEMDIE)) {
 		last_tif_memdie_jiffies = jiffies;
-		set_bit(0, &VM_is_OOM);
+		set_bit(0, &vm_is_oom);
 	}
 	/*
 	 * We give our sacrificial lamb high priority and access to
@@ -415,13 +415,14 @@ void out_of_memory(struct zonelist *zone
 
 	/*
 	 * This holds the down(OOM_lock)+read_lock(tasklist_lock), so it's
-	 * equivalent to write_lock_irq(tasklist_lock) as far as VM_is_OOM
+	 * equivalent to write_lock_irq(tasklist_lock) as far as vm_is_oom
 	 * is concerned.
 	 */
-	if (unlikely(test_bit(0, &VM_is_OOM))) {
+	if (unlikely(test_bit(0, &vm_is_oom))) {
 		if (time_before(jiffies, last_tif_memdie_jiffies + 10*HZ))
 			goto out;
-		printk("detected probable OOM deadlock, so killing another task\n");
+		printk(KERN_ERR "detected probable OOM deadlock, so killing "
+				"another task\n");
 		last_tif_memdie_jiffies = jiffies;
 	}
 
diff -puN include/linux/swap.h~oom-handling-fallback-killing-more-tasks-if-tif-memdie-doesnt-go-away-fix include/linux/swap.h
--- a/include/linux/swap.h~oom-handling-fallback-killing-more-tasks-if-tif-memdie-doesnt-go-away-fix
+++ a/include/linux/swap.h
@@ -209,6 +209,7 @@ static inline int zone_reclaim(struct zo
 #endif
 
 extern int kswapd_run(int nid);
+extern unsigned long vm_is_oom;
 
 #ifdef CONFIG_MMU
 /* linux/mm/shmem.c */
_

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 09 of 24] fallback killing more tasks if tif-memdie doesn't go away
  2007-09-12 12:30   ` Andrew Morton
@ 2007-09-12 12:34     ` Andrew Morton
  2008-01-03  1:06     ` Andrea Arcangeli
  1 sibling, 0 replies; 113+ messages in thread
From: Andrew Morton @ 2007-09-12 12:34 UTC (permalink / raw)
  To: Andrea Arcangeli, linux-mm, David Rientjes

On Wed, 12 Sep 2007 05:30:22 -0700 Andrew Morton <akpm@linux-foundation.org> wrote:

> Something like this...
> 
> 
>  include/linux/swap.h |    1 +
>  kernel/exit.c        |   11 +++++------
>  mm/oom_kill.c        |   11 ++++++-----
>  3 files changed, 12 insertions(+), 11 deletions(-)

urgh, that caused a great mess in later patches which I can't be assed fixing
up right now.

It's really much much easier if people just get trivial stuff like this right
first time :(

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 10 of 24] stop useless vm trashing while we wait the TIF_MEMDIE task to exit
  2007-08-22 12:48 ` [PATCH 10 of 24] stop useless vm trashing while we wait the TIF_MEMDIE task to exit Andrea Arcangeli
@ 2007-09-12 12:42   ` Andrew Morton
  2007-09-13  0:36     ` Christoph Lameter
  2007-09-21 19:10   ` David Rientjes
  1 sibling, 1 reply; 113+ messages in thread
From: Andrew Morton @ 2007-09-12 12:42 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: linux-mm, David Rientjes

On Wed, 22 Aug 2007 14:48:57 +0200 Andrea Arcangeli <andrea@suse.de> wrote:

> # HG changeset patch
> # User Andrea Arcangeli <andrea@suse.de>
> # Date 1187778125 -7200
> # Node ID edb3af3e0d4f2c083c8ddd9857073a3c8393ab8e
> # Parent  9bf6a66eab3c52327daa831ef101d7802bc71791
> stop useless vm trashing while we wait the TIF_MEMDIE task to exit
> 
> There's no point in trying to free memory if we're oom.
> 
> Signed-off-by: Andrea Arcangeli <andrea@suse.de>
> 
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -159,6 +159,8 @@ struct swap_list_t {
>  #define vm_swap_full() (nr_swap_pages*2 < total_swap_pages)
>  
>  /* linux/mm/oom_kill.c */
> +extern unsigned long VM_is_OOM;
> +#define is_VM_OOM() unlikely(test_bit(0, &VM_is_OOM))

argh!  Why didn't the first patch do this?

Now we have open-coded test_bit(&VM_is_OOM) calls in exit.c and oom_kill.c
which could use this "function".

Please prefer to use inline C functions where possible.  I think it's
possible here...
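Something like this in swap.h, say (sketch):

	static inline int is_vm_oom(void)
	{
		return unlikely(test_bit(0, &VM_is_OOM));
	}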

>  extern void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, int order);
>  extern int register_oom_notifier(struct notifier_block *nb);
>  extern int unregister_oom_notifier(struct notifier_block *nb);
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1028,6 +1028,8 @@ static unsigned long shrink_zone(int pri
>  		nr_inactive = 0;
>  
>  	while (nr_active || nr_inactive) {
> +		if (is_VM_OOM())
> +			break;
>  		if (nr_active) {
>  			nr_to_scan = min(nr_active,
>  					(unsigned long)sc->swap_cluster_max);
> @@ -1138,6 +1140,17 @@ unsigned long try_to_free_pages(struct z
>  	}
>  
>  	for (priority = DEF_PRIORITY; priority >= 0; priority--) {
> +		if (is_VM_OOM()) {
> +			if (!test_thread_flag(TIF_MEMDIE)) {
> +				/* get out of the way */
> +				schedule_timeout_interruptible(1);

If the calling task has signal_pending(), this sleep won't do anything.
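An uninterruptible sleep would be immune to that:

		schedule_timeout_uninterruptible(1);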

> +				/* don't waste cpu if we're still oom */
> +				if (is_VM_OOM())
> +					goto out;
> +			} else
> +				goto out;
> +		}
> +

The change kinda makes sense, but what if, say, a great bunch of writes
just completed?  Then memory becomes reclaimable.

Also, what if the oom-killing was due to a shortage in a particular zone,
but there's plenty of reclaimable memory in other zones, memory which this
task can use?

Also, the oom-killer is cpuset aware.  Won't this change cause an
oom-killing in cpuset A to needlessly disrupt processes running in cpuset
B?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 11 of 24] the oom schedule timeout isn't needed with the VM_is_OOM logic
  2007-08-22 12:48 ` [PATCH 11 of 24] the oom schedule timeout isn't needed with the VM_is_OOM logic Andrea Arcangeli
@ 2007-09-12 12:44   ` Andrew Morton
  0 siblings, 0 replies; 113+ messages in thread
From: Andrew Morton @ 2007-09-12 12:44 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: linux-mm, David Rientjes

On Wed, 22 Aug 2007 14:48:58 +0200 Andrea Arcangeli <andrea@suse.de> wrote:

> # HG changeset patch
> # User Andrea Arcangeli <andrea@suse.de>
> # Date 1187778125 -7200
> # Node ID adf88d0ba0d17beaceee47f7b8e0acbd97ddc320
> # Parent  edb3af3e0d4f2c083c8ddd9857073a3c8393ab8e
> the oom schedule timeout isn't needed with the VM_is_OOM logic
> 
> The whole point of VM_is_OOM is to give the TIF_MEMDIE task the time it
> needs to exit.
> 
> Signed-off-by: Andrea Arcangeli <andrea@suse.de>
> 
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -469,12 +469,5 @@ out:

It's a shame that mercurial has inherited `diff -p's stupid handling of labels.
I guess it uses diff directly.  ho hum.

>  	read_unlock(&tasklist_lock);
>  	cpuset_unlock();
>  
> -	/*
> -	 * Give "p" a good chance of killing itself before we
> -	 * retry to allocate memory unless "p" is current
> -	 */
> -	if (!test_thread_flag(TIF_MEMDIE))
> -		schedule_timeout_uninterruptible(1);
> -
>  	up(&OOM_lock);
>  }
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 12 of 24] show mem information only when a task is actually being killed
  2007-08-22 12:48 ` [PATCH 12 of 24] show mem information only when a task is actually being killed Andrea Arcangeli
@ 2007-09-12 12:49   ` Andrew Morton
  0 siblings, 0 replies; 113+ messages in thread
From: Andrew Morton @ 2007-09-12 12:49 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: linux-mm, David Rientjes

On Wed, 22 Aug 2007 14:48:59 +0200 Andrea Arcangeli <andrea@suse.de> wrote:

> # HG changeset patch
> # User Andrea Arcangeli <andrea@suse.de>
> # Date 1187778125 -7200
> # Node ID 1473d573b9ba8a913bafa42da2cac5dcca274204
> # Parent  adf88d0ba0d17beaceee47f7b8e0acbd97ddc320
> show mem information only when a task is actually being killed
> 
> Don't show garbage while VM_is_OOM is set and the timeout hasn't triggered.
> 

whoa, now that's weird.

The diff you sent has:

 oom_kill.c |  184 ++++++++++++++++++++++++++++++-------------------------------
 1 file changed, 92 insertions(+), 92 deletions(-)

but when I apply it and rediff it, I get

 oom_kill.c |   29 +++++++++++++++--------------
 1 file changed, 15 insertions(+), 14 deletions(-)

which is below.  It's the same change, only the diff came out much better.

_does_ mercurial have its own diff?



> diff -puN mm/oom_kill.c~oom-handling-show-mem-information-only-when-a-task-is-actually-being-killed mm/oom_kill.c
> --- a/mm/oom_kill.c~oom-handling-show-mem-information-only-when-a-task-is-actually-being-killed
> +++ a/mm/oom_kill.c
> @@ -280,7 +280,7 @@ static void __oom_kill_task(struct task_
>  	force_sig(SIGKILL, p);
>  }
>  
> -static int oom_kill_task(struct task_struct *p)
> +static int oom_kill_task(struct task_struct *p, gfp_t gfp_mask, int order)
>  {
>  	struct mm_struct *mm;
>  	struct task_struct *g, *q;
> @@ -307,6 +307,14 @@ static int oom_kill_task(struct task_str
>  			return 1;
>  	} while_each_thread(g, q);
>  
> +	if (printk_ratelimit()) {
> +		printk(KERN_WARNING "%s invoked oom-killer: "
> +			"gfp_mask=0x%x, order=%d, oomkilladj=%d\n",
> +			current->comm, gfp_mask, order, current->oomkilladj);
> +		dump_stack();
> +		show_mem();
> +	}
> +
>  	__oom_kill_task(p, 1);
>  
>  	/*
> @@ -323,7 +331,7 @@ static int oom_kill_task(struct task_str
>  }
>  
>  static int oom_kill_process(struct task_struct *p, unsigned long points,
> -		const char *message)
> +			    const char *message, gfp_t gfp_mask, int order)
>  {
>  	struct task_struct *c;
>  	struct list_head *tsk;
> @@ -351,10 +359,10 @@ static int oom_kill_process(struct task_
>  		 */
>  		if (unlikely(test_tsk_thread_flag(c, TIF_MEMDIE)))
>  			continue;
> -		if (!oom_kill_task(c))
> +		if (!oom_kill_task(c, gfp_mask, order))
>  			return 0;
>  	}
> -	return oom_kill_task(p);
> +	return oom_kill_task(p, gfp_mask, order);
>  }
>  
>  static BLOCKING_NOTIFIER_HEAD(oom_notify_list);
> @@ -394,13 +402,6 @@ void out_of_memory(struct zonelist *zone
>  
>  	if (down_trylock(&OOM_lock))
>  		return;
> -	if (printk_ratelimit()) {
> -		printk(KERN_WARNING "%s invoked oom-killer: "
> -			"gfp_mask=0x%x, order=%d, oomkilladj=%d\n",
> -			current->comm, gfp_mask, order, current->oomkilladj);
> -		dump_stack();
> -		show_mem();
> -	}
>  
>  	if (sysctl_panic_on_oom == 2)
>  		panic("out of memory. Compulsory panic_on_oom is selected.\n");
> @@ -428,12 +429,12 @@ void out_of_memory(struct zonelist *zone
>  	switch (constraint) {
>  	case CONSTRAINT_MEMORY_POLICY:
>  		oom_kill_process(current, points,
> -				"No available memory (MPOL_BIND)");
> +				 "No available memory (MPOL_BIND)", gfp_mask, order);
>  		break;
>  
>  	case CONSTRAINT_CPUSET:
>  		oom_kill_process(current, points,
> -				"No available memory in cpuset");
> +				 "No available memory in cpuset", gfp_mask, order);
>  		break;
>  
>  	case CONSTRAINT_NONE:
> @@ -452,7 +453,7 @@ retry:
>  			panic("Out of memory and no killable processes...\n");
>  		}
>  
> -		if (oom_kill_process(p, points, "Out of memory"))
> +		if (oom_kill_process(p, points, "Out of memory", gfp_mask, order))
>  			goto retry;
>  
>  		break;

I don't really understand this change.  A better changelog which more fully
describes the problem which is being addressed would help, please.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 13 of 24] simplify oom heuristics
  2007-08-22 12:49 ` [PATCH 13 of 24] simplify oom heuristics Andrea Arcangeli
@ 2007-09-12 12:52   ` Andrew Morton
  2007-09-12 13:40     ` Andrea Arcangeli
  0 siblings, 1 reply; 113+ messages in thread
From: Andrew Morton @ 2007-09-12 12:52 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: linux-mm, David Rientjes

On Wed, 22 Aug 2007 14:49:00 +0200 Andrea Arcangeli <andrea@suse.de> wrote:

> # HG changeset patch
> # User Andrea Arcangeli <andrea@suse.de>
> # Date 1187778125 -7200
> # Node ID cd70d64570b9add8072f7abe952b34fe57c60086
> # Parent  1473d573b9ba8a913bafa42da2cac5dcca274204
> simplify oom heuristics
> 
> Over time somebody had the good idea to remove the rcvd_sigterm points;
> this removes more of them. The selected task should be the one that, if
> we don't kill it, will turn the system oom again sooner rather than
> later. These values tell us nothing about which task is best to kill,
> so they should be removed.
> 
> Signed-off-by: Andrea Arcangeli <andrea@suse.de>
> 
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -53,7 +53,7 @@ static unsigned long last_tif_memdie_jif
>  
>  unsigned long badness(struct task_struct *p, unsigned long uptime)
>  {
> -	unsigned long points, cpu_time, run_time, s;
> +	unsigned long points;
>  	struct mm_struct *mm;
>  	struct task_struct *child;
>  
> @@ -94,26 +94,6 @@ unsigned long badness(struct task_struct
>  			points += child->mm->total_vm/2 + 1;
>  		task_unlock(child);
>  	}
> -
> -	/*
> -	 * CPU time is in tens of seconds and run time is in thousands
> -         * of seconds. There is no particular reason for this other than
> -         * that it turned out to work very well in practice.
> -	 */
> -	cpu_time = (cputime_to_jiffies(p->utime) + cputime_to_jiffies(p->stime))
> -		>> (SHIFT_HZ + 3);
> -
> -	if (uptime >= p->start_time.tv_sec)
> -		run_time = (uptime - p->start_time.tv_sec) >> 10;
> -	else
> -		run_time = 0;
> -
> -	s = int_sqrt(cpu_time);
> -	if (s)
> -		points /= s;
> -	s = int_sqrt(int_sqrt(run_time));
> -	if (s)
> -		points /= s;
>  
>  	/*
>  	 * Niced processes are most likely less important, so double
> 

I think the idea behind the code which you're removing is to avoid killing
a computationally-expensive task which we've already invested a lot of CPU
time in.  IOW, kill the job which has been running for three seconds in
preference to the one which has been running three weeks.

That seems like a good strategy to me.
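
(To put numbers on it, using the code above: a job which has been
running for three weeks has run_time = 1814400/1024 ~= 1771, and
int_sqrt(int_sqrt(1771)) == 6, so its badness points get divided by
six; a three-second-old job has run_time == 0 and gets no discount at
all.)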

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 15 of 24] limit reclaim if enough pages have been freed
  2007-08-22 12:49 ` [PATCH 15 of 24] limit reclaim if enough pages have been freed Andrea Arcangeli
@ 2007-09-12 12:57   ` Andrew Morton
  2008-01-03  1:12     ` Andrea Arcangeli
  2007-09-12 12:58   ` Andrew Morton
  1 sibling, 1 reply; 113+ messages in thread
From: Andrew Morton @ 2007-09-12 12:57 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: linux-mm, David Rientjes

On Wed, 22 Aug 2007 14:49:02 +0200 Andrea Arcangeli <andrea@suse.de> wrote:

> # HG changeset patch
> # User Andrea Arcangeli <andrea@suse.de>
> # Date 1187778125 -7200
> # Node ID 94686cfcd27347e83a6aa145c77457ca6455366d
> # Parent  dde19626aa495cd8a6fa6b14a4f195438c2039ba
> limit reclaim if enough pages have been freed
> 
> No need to wipe out a huge chunk of the cache.
> 
> Signed-off-by: Andrea Arcangeli <andrea@suse.de>
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1043,6 +1043,8 @@ static unsigned long shrink_zone(int pri
>  			nr_inactive -= nr_to_scan;
>  			nr_reclaimed += shrink_inactive_list(nr_to_scan, zone,
>  								sc);
> +			if (nr_reclaimed >= sc->swap_cluster_max)
> +				break;
>  		}
>  	}

whoa, that's a huge change to the scanning logic.  Suppose we've decided to
scan 1,000,000 active pages and 42 inactive pages.  With this change we'll
bale out after scanning the 42 inactive pages.  The change to the
inactive/active balancing logic is potentially large.

Will need more than a one-line changelog, that one will ;)

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 15 of 24] limit reclaim if enough pages have been freed
  2007-08-22 12:49 ` [PATCH 15 of 24] limit reclaim if enough pages have been freed Andrea Arcangeli
  2007-09-12 12:57   ` Andrew Morton
@ 2007-09-12 12:58   ` Andrew Morton
  2007-09-12 13:38     ` Andrea Arcangeli
  1 sibling, 1 reply; 113+ messages in thread
From: Andrew Morton @ 2007-09-12 12:58 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: linux-mm, David Rientjes

On Wed, 22 Aug 2007 14:49:02 +0200 Andrea Arcangeli <andrea@suse.de> wrote:

> # HG changeset patch
> # User Andrea Arcangeli <andrea@suse.de>
> # Date 1187778125 -7200
> # Node ID 94686cfcd27347e83a6aa145c77457ca6455366d
> # Parent  dde19626aa495cd8a6fa6b14a4f195438c2039ba
> limit reclaim if enough pages have been freed
> 
> No need to wipe out an huge chunk of the cache.
> 
> Signed-off-by: Andrea Arcangeli <andrea@suse.de>
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1043,6 +1043,8 @@ static unsigned long shrink_zone(int pri
>  			nr_inactive -= nr_to_scan;
>  			nr_reclaimed += shrink_inactive_list(nr_to_scan, zone,
>  								sc);
> +			if (nr_reclaimed >= sc->swap_cluster_max)
> +				break;
>  		}
>  	}

Also, this has nothing to do with oom-killing, which is the subject of this
patch series?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 16 of 24] avoid some lock operation in vm fast path
  2007-08-22 12:49 ` [PATCH 16 of 24] avoid some lock operation in vm fast path Andrea Arcangeli
@ 2007-09-12 12:59   ` Andrew Morton
  2007-09-13  0:49     ` Christoph Lameter
  0 siblings, 1 reply; 113+ messages in thread
From: Andrew Morton @ 2007-09-12 12:59 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: linux-mm, David Rientjes

On Wed, 22 Aug 2007 14:49:03 +0200 Andrea Arcangeli <andrea@suse.de> wrote:

> # HG changeset patch
> # User Andrea Arcangeli <andrea@suse.de>
> # Date 1187778125 -7200
> # Node ID b343d1056f356d60de868bd92422b33290e3c514
> # Parent  94686cfcd27347e83a6aa145c77457ca6455366d
> avoid some lock operation in vm fast path
> 
> Let's not bloat the kernel for numa. Not nice, but at least this way
> perhaps somebody will clean it up instead of hiding the inefficiency in
> there.
> 
> Signed-off-by: Andrea Arcangeli <andrea@suse.de>
> 
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -232,8 +232,10 @@ struct zone {
>  	unsigned long		pages_scanned;	   /* since last reclaim */
>  	int			all_unreclaimable; /* All pages pinned */
>  
> +#ifdef CONFIG_NUMA
>  	/* A count of how many reclaimers are scanning this zone */
>  	atomic_t		reclaim_in_progress;
> +#endif
>  
>  	/* Zone statistics */
>  	atomic_long_t		vm_stat[NR_VM_ZONE_STAT_ITEMS];
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2960,7 +2960,9 @@ static void __meminit free_area_init_cor
>  		INIT_LIST_HEAD(&zone->active_list);
>  		INIT_LIST_HEAD(&zone->inactive_list);
>  		zap_zone_vm_stats(zone);
> +#ifdef CONFIG_NUMA
>  		atomic_set(&zone->reclaim_in_progress, 0);
> +#endif
>  		if (!size)
>  			continue;
>  
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1014,7 +1014,9 @@ static unsigned long shrink_zone(int pri
>  	unsigned long nr_to_scan;
>  	unsigned long nr_reclaimed = 0;
>  
> +#ifdef CONFIG_NUMA
>  	atomic_inc(&zone->reclaim_in_progress);
> +#endif
>  
>  	/*
>  	 * Add one to `nr_to_scan' just to make sure that the kernel will
> @@ -1050,7 +1052,9 @@ static unsigned long shrink_zone(int pri
>  
>  	throttle_vm_writeout(sc->gfp_mask);
>  
> +#ifdef CONFIG_NUMA
>  	atomic_dec(&zone->reclaim_in_progress);
> +#endif
>  	return nr_reclaimed;
>  }

OK, but we'd normally do this via some little wrapper functions which are
empty-if-not-numa.
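
Something like this, say (untested sketch, names made up):

#ifdef CONFIG_NUMA
static inline void zone_reclaim_in_progress_inc(struct zone *zone)
{
	atomic_inc(&zone->reclaim_in_progress);
}

static inline void zone_reclaim_in_progress_dec(struct zone *zone)
{
	atomic_dec(&zone->reclaim_in_progress);
}
#else
static inline void zone_reclaim_in_progress_inc(struct zone *zone) {}
static inline void zone_reclaim_in_progress_dec(struct zone *zone) {}
#endif

so that shrink_zone() itself stays ifdef-free.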

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 17 of 24] apply the anti deadlock features only to global oom
  2007-08-22 12:49 ` [PATCH 17 of 24] apply the anti deadlock features only to global oom Andrea Arcangeli
@ 2007-09-12 13:02   ` Andrew Morton
  2007-09-13  0:53     ` Christoph Lameter
  2007-09-13  0:52   ` Christoph Lameter
  1 sibling, 1 reply; 113+ messages in thread
From: Andrew Morton @ 2007-09-12 13:02 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: linux-mm, David Rientjes

On Wed, 22 Aug 2007 14:49:04 +0200 Andrea Arcangeli <andrea@suse.de> wrote:

> # HG changeset patch
> # User Andrea Arcangeli <andrea@suse.de>
> # Date 1187778125 -7200
> # Node ID efd1da1efb392cc4e015740d088ea9c6235901e0
> # Parent  b343d1056f356d60de868bd92422b33290e3c514
> apply the anti deadlock features only to global oom
> 
> Cc: Christoph Lameter <clameter@sgi.com>
> The local numa oom will keep killing the current task hoping that it's
> not an innocent task and it won't alter the behavior of the rest of the
> VM. The global oom will not wait for TIF_MEMDIE tasks anymore, so this
> will be a really local event, not like before when the local-TIF_MEMDIE
> was effectively a global flag that the global oom would depend on too.
> 

ok, I'm starting to get lost here.  Let's apply it unreviewed and if it
breaks, that'll teach the numa weenies about the value of code review ;)

> 
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -387,9 +387,6 @@ void out_of_memory(struct zonelist *zone
>  		/* Got some memory back in the last second. */
>  		return;
>  
> -	if (down_trylock(&OOM_lock))
> -		return;
> -
>  	if (sysctl_panic_on_oom == 2)
>  		panic("out of memory. Compulsory panic_on_oom is selected.\n");
>  
> @@ -399,32 +396,39 @@ void out_of_memory(struct zonelist *zone
>  	 */
>  	constraint = constrained_alloc(zonelist, gfp_mask);
>  	cpuset_lock();
> -	read_lock(&tasklist_lock);
> -
> -	/*
> -	 * This holds the down(OOM_lock)+read_lock(tasklist_lock), so it's
> -	 * equivalent to write_lock_irq(tasklist_lock) as far as VM_is_OOM
> -	 * is concerned.
> -	 */
> -	if (unlikely(test_bit(0, &VM_is_OOM))) {
> -		if (time_before(jiffies, last_tif_memdie_jiffies + 10*HZ))
> -			goto out;
> -		printk("detected probable OOM deadlock, so killing another task\n");
> -		last_tif_memdie_jiffies = jiffies;
> -	}
>  
>  	switch (constraint) {
>  	case CONSTRAINT_MEMORY_POLICY:
> +		read_lock(&tasklist_lock);
>  		oom_kill_process(current, points,
>  				 "No available memory (MPOL_BIND)", gfp_mask, order);
> +		read_unlock(&tasklist_lock);
>  		break;
>  
>  	case CONSTRAINT_CPUSET:
> +		read_lock(&tasklist_lock);
>  		oom_kill_process(current, points,
>  				 "No available memory in cpuset", gfp_mask, order);
> +		read_unlock(&tasklist_lock);
>  		break;
>  
>  	case CONSTRAINT_NONE:
> +		if (down_trylock(&OOM_lock))
> +			break;
> +		read_lock(&tasklist_lock);
> +
> +		/*
> +		 * This holds the down(OOM_lock)+read_lock(tasklist_lock),
> +		 * so it's equivalent to write_lock_irq(tasklist_lock) as
> +		 * far as VM_is_OOM is concerned.
> +		 */
> +		if (unlikely(test_bit(0, &VM_is_OOM))) {

We have a helper macro-should-be-function for that.

> +			if (time_before(jiffies, last_tif_memdie_jiffies + 10*HZ))
> +				goto out;
> +			printk("detected probable OOM deadlock, so killing another task\n");
> +			last_tif_memdie_jiffies = jiffies;
> +		}
> +
>  		if (sysctl_panic_on_oom)
>  			panic("out of memory. panic_on_oom is selected\n");
>  retry:
> @@ -443,12 +447,11 @@ retry:
>  		if (oom_kill_process(p, points, "Out of memory", gfp_mask, order))
>  			goto retry;
>  
> +	out:
> +		read_unlock(&tasklist_lock);
> +		up(&OOM_lock);
>  		break;
>  	}
>  
> -out:
> -	read_unlock(&tasklist_lock);
>  	cpuset_unlock();
> -
> -	up(&OOM_lock);
> -}
> +}

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 19 of 24] cacheline align VM_is_OOM to prevent false sharing
  2007-08-22 12:49 ` [PATCH 19 of 24] cacheline align VM_is_OOM to prevent false sharing Andrea Arcangeli
@ 2007-09-12 13:02   ` Andrew Morton
  2007-09-12 13:36     ` Andrea Arcangeli
  0 siblings, 1 reply; 113+ messages in thread
From: Andrew Morton @ 2007-09-12 13:02 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: linux-mm, David Rientjes

On Wed, 22 Aug 2007 14:49:06 +0200 Andrea Arcangeli <andrea@suse.de> wrote:

> # HG changeset patch
> # User Andrea Arcangeli <andrea@suse.de>
> # Date 1187778125 -7200
> # Node ID be2fc447cec06990a2a31658b166f0c909777260
> # Parent  040cab5c8aafe1efcb6fc21d1f268c11202dac02
> cacheline align VM_is_OOM to prevent false sharing
> 
> This is better to be cacheline aligned in smp kernels just in case.
> 
> Signed-off-by: Andrea Arcangeli <andrea@suse.de>
> 
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -29,7 +29,7 @@ int sysctl_panic_on_oom;
>  int sysctl_panic_on_oom;
>  /* #define DEBUG */
>  
> -unsigned long VM_is_OOM;
> +unsigned long VM_is_OOM __cacheline_aligned_in_smp;
>  static unsigned long last_tif_memdie_jiffies;
>  

I'd suggest __read_mostly.
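
i.e.

	unsigned long VM_is_OOM __read_mostly;

which moves it into the read-mostly section, away from
frequently-written data, rather than burning a whole cacheline on
padding.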

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 21 of 24] select process to kill for cpusets
  2007-08-22 12:49 ` [PATCH 21 of 24] select process to kill for cpusets Andrea Arcangeli
@ 2007-09-12 13:05   ` Andrew Morton
  2007-09-13  0:59     ` Christoph Lameter
  0 siblings, 1 reply; 113+ messages in thread
From: Andrew Morton @ 2007-09-12 13:05 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-mm, David Rientjes, Christoph Lameter, Paul Jackson

On Wed, 22 Aug 2007 14:49:08 +0200 Andrea Arcangeli <andrea@suse.de> wrote:

> # HG changeset patch
> # User David Rientjes <rientjes@google.com>
> # Date 1187778125 -7200
> # Node ID 855dc37d74ab151d7a0c640d687b34ee05996235
> # Parent  2c9417ab4c1ff81a77bca4767207338e43b5cd69
> select process to kill for cpusets
> 
> Passes the memory allocation constraint into select_bad_process() so
> that, in the CONSTRAINT_CPUSET case, we can exclude tasks that do not
> overlap nodes with the triggering task's cpuset.
> 
> The OOM killer now invokes select_bad_process() even in the cpuset case
> to select a rogue task to kill instead of simply using current.  Although
> killing current is guaranteed to help alleviate the OOM condition, it is
> by no means guaranteed to be the "best" process to kill.  The
> select_bad_process() heuristics will do a much better job of determining
> that.
> 
> As an added bonus, this also addresses an issue where current could be
> set to OOM_DISABLE yet not be respected for the CONSTRAINT_CPUSET case.
> Currently we loop back out to __alloc_pages() waiting for another cpuset
> task to trigger the OOM killer that hopefully won't be OOM_DISABLE.  With
> this patch, we're guaranteed to find a task to kill that is not
> OOM_DISABLE if it matches our eligibility requirements the first time.
> 
> If we cannot find any tasks to kill in the cpuset case, we simply make
> the entire OOM killer a no-op since it's better for one cpuset to fail
> memory allocations repeatedly than to panic the entire system.
> 
> Cc: Andrea Arcangeli <andrea@suse.de>
> Cc: Christoph Lameter <clameter@sgi.com>
> Signed-off-by: David Rientjes <rientjes@google.com>
> ---
>  mm/oom_kill.c |   25 +++++++++++++++++--------
>  1 files changed, 17 insertions(+), 8 deletions(-)
> 
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -188,9 +188,13 @@ static inline int constrained_alloc(stru
>   * Simple selection loop. We chose the process with the highest
>   * number of 'points'. We expect the caller will lock the tasklist.
>   *
> + * If constraint is CONSTRAINT_CPUSET, then only choose a task that overlaps
> + * the nodes of the task that triggered the OOM killer.
> + *
>   * (not docbooked, we don't want this one cluttering up the manual)
>   */
> -static struct task_struct *select_bad_process(unsigned long *ppoints)
> +static struct task_struct *select_bad_process(unsigned long *ppoints,
> +					      int constraint)
>  {
>  	struct task_struct *g, *p;
>  	struct task_struct *chosen = NULL;
> @@ -221,6 +225,9 @@ static struct task_struct *select_bad_pr
>  		}
>  
>  		if (p->oomkilladj == OOM_DISABLE)
> +			continue;
> +		if (constraint == CONSTRAINT_CPUSET &&
> +		    !cpuset_excl_nodes_overlap(p))
>  			continue;
>  
>  		points = badness(p, uptime.tv_sec);
> @@ -424,12 +431,6 @@ void out_of_memory(struct zonelist *zone
>  		break;
>  
>  	case CONSTRAINT_CPUSET:
> -		read_lock(&tasklist_lock);
> -		oom_kill_process(current, points,
> -				 "No available memory in cpuset", gfp_mask, order);
> -		read_unlock(&tasklist_lock);
> -		break;
> -
>  	case CONSTRAINT_NONE:
>  		if (down_trylock(&OOM_lock))
>  			break;
> @@ -454,9 +455,17 @@ retry:
>  		 * Rambo mode: Shoot down a process and hope it solves whatever
>  		 * issues we may have.
>  		 */
> -		p = select_bad_process(&points);
> +		p = select_bad_process(&points, constraint);
>  		/* Found nothing?!?! Either we hang forever, or we panic. */
>  		if (unlikely(!p)) {
> +			/*
> +			 * We shouldn't panic the entire system if we can't
> +			 * find any eligible tasks to kill in a
> +			 * cpuset-constrained OOM condition.  Instead, we do
> +			 * nothing and allow other cpusets to continue.
> +			 */
> +			if (constraint == CONSTRAINT_CPUSET)
> +				goto out;
>  			read_unlock(&tasklist_lock);
>  			cpuset_unlock();
>  			panic("Out of memory and no killable processes...\n");

Seems sensible, but it would be nice to get some thought cycles from pj &
Christoph, please.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 23 of 24] serialize for cpusets
  2007-08-22 12:49 ` [PATCH 23 of 24] serialize for cpusets Andrea Arcangeli
@ 2007-09-12 13:10   ` Andrew Morton
  2007-09-12 13:34     ` Andrea Arcangeli
                       ` (2 more replies)
  0 siblings, 3 replies; 113+ messages in thread
From: Andrew Morton @ 2007-09-12 13:10 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-mm, David Rientjes, Christoph Lameter, Paul Jackson

On Wed, 22 Aug 2007 14:49:10 +0200 Andrea Arcangeli <andrea@suse.de> wrote:

> # HG changeset patch
> # User David Rientjes <rientjes@google.com>
> # Date 1187778125 -7200
> # Node ID a3d679df54ebb1f977b97ab6b3e501134bf9e7ef
> # Parent  8807a4d14b241b2d1132fde7f83834603b6cf093
> serialize for cpusets
> 
> Adds a last_tif_memdie_jiffies field to struct cpuset to store the
> jiffies value at the last OOM kill.  This will detect deadlocks in the
> CONSTRAINT_CPUSET case and kill another task if one is detected.
> 
> Adds a CS_OOM bit to struct cpuset's flags field.  This will be tested,
> set, and cleared atomically to denote a cpuset that currently has an
> attached task exiting as a result of the OOM killer.  We are required to
> take p->alloc_lock to dereference p->cpuset so this cannot be implemented
> as a simple trylock.
> 
> As a result, we cannot allow the detachment of a task from a cpuset that
> is currently OOM killing one of its tasks.  If we did, we would end up
> clearing the CS_OOM bit in the wrong cpuset upon that task's exit.
> 
> sysctl's panic_on_oom now only takes effect in the non-cpuset-constrained
> case.
> 
> Cc: Andrea Arcangeli <andrea@suse.de>
> Cc: Christoph Lameter <clameter@sgi.com>
> Signed-off-by: David Rientjes <rientjes@google.com>

I understand that SGI's HPC customers care rather a lot about oom handling
in cpusets.  It'd be nice if people@sgi could carefully review-and-test
changes such as this before we go and break stuff for them, please.


>  
> +int cpuset_get_last_tif_memdie(struct task_struct *task)
> +{
> +	unsigned long ret;
> +	task_lock(task);
> +	ret = task->cpuset->last_tif_memdie_jiffies;
> +	task_unlock(task);
> +	return ret;
> +}
> +
> +void cpuset_set_last_tif_memdie(struct task_struct *task,
> +				unsigned long last_tif_memdie)
> +{
> +	task_lock(task);
> +	task->cpuset->last_tif_memdie_jiffies = last_tif_memdie;
> +	task_unlock(task);
> +}
> +
> +int cpuset_set_oom(struct task_struct *task)
> +{
> +	int ret;
> +	task_lock(task);
> +	ret = test_and_set_bit(CS_OOM, &task->cpuset->flags);
> +	task_unlock(task);
> +	return ret;
> +}
> +
> +void cpuset_clear_oom(struct task_struct *task)
> +{
> +	task_lock(task);
> +	clear_bit(CS_OOM, &task->cpuset->flags);
> +	task_unlock(task);
> +}

Seems strange to do a spinlock around a single already-atomic bitop?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 24 of 24] add oom_kill_asking_task flag
  2007-08-22 12:49 ` [PATCH 24 of 24] add oom_kill_asking_task flag Andrea Arcangeli
@ 2007-09-12 13:11   ` Andrew Morton
  0 siblings, 0 replies; 113+ messages in thread
From: Andrew Morton @ 2007-09-12 13:11 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-mm, David Rientjes, Christoph Lameter, Paul Jackson

On Wed, 22 Aug 2007 14:49:11 +0200 Andrea Arcangeli <andrea@suse.de> wrote:

> # HG changeset patch
> # User David Rientjes <rientjes@google.com>
> # Date 1187778125 -7200
> # Node ID 96b5899e730ecaa2078883f75e86765fa1a36431
> # Parent  a3d679df54ebb1f977b97ab6b3e501134bf9e7ef
> add oom_kill_asking_task flag
> 
> Adds an oom_kill_asking_task flag to cpusets.  If unset (by default), we
> iterate through the task list via select_bad_process() during a
> cpuset-constrained OOM to find the best candidate task to kill.  If set,
> we simply kill current to avoid that overhead; this is needed for some
> customers with a large number of threads or a heavy workload.
> 

another sgi prod, please.

> 
> diff --git a/Documentation/cpusets.txt b/Documentation/cpusets.txt
> --- a/Documentation/cpusets.txt
> +++ b/Documentation/cpusets.txt
> @@ -181,6 +181,9 @@ containing the following files describin
>   - tasks: list of tasks (by pid) attached to that cpuset
>   - notify_on_release flag: run /sbin/cpuset_release_agent on exit?
>   - memory_pressure: measure of how much paging pressure in cpuset
> + - oom_kill_asking_task flag: when this cpuset OOM's, should we kill
> +	the task that asked for the memory or should we iterate through
> +	the task list to find the best task to kill (can be expensive)?
>  
>  In addition, the root cpuset only has the following file:
>   - memory_pressure_enabled flag: compute memory_pressure?
> diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
> --- a/include/linux/cpuset.h
> +++ b/include/linux/cpuset.h
> @@ -52,6 +52,7 @@ extern void cpuset_set_last_tif_memdie(s
>  				       unsigned long last_tif_memdie);
>  extern int cpuset_set_oom(struct task_struct *task);
>  extern void cpuset_clear_oom(struct task_struct *task);
> +extern int cpuset_oom_kill_asking_task(struct task_struct *task);
>  
>  #define cpuset_memory_pressure_bump() 				\
>  	do {							\
> @@ -136,6 +137,10 @@ static inline int cpuset_set_oom(struct 
>  	return 0;
>  }
>  static inline void cpuset_clear_oom(struct task_struct *task) {}
> +static inline int cpuset_oom_kill_asking_task(struct task_struct *task)
> +{
> +	return 0;
> +}
>  
>  static inline void cpuset_memory_pressure_bump(void) {}
>  
> diff --git a/kernel/cpuset.c b/kernel/cpuset.c
> --- a/kernel/cpuset.c
> +++ b/kernel/cpuset.c
> @@ -116,6 +116,7 @@ typedef enum {
>  	CS_SPREAD_PAGE,
>  	CS_SPREAD_SLAB,
>  	CS_OOM,
> +	CS_OOM_KILL_ASKING_TASK,
>  } cpuset_flagbits_t;
>  
>  /* convenient tests for these bits */
> @@ -157,6 +158,11 @@ static inline int is_oom(const struct cp
>  static inline int is_oom(const struct cpuset *cs)
>  {
>  	return test_bit(CS_OOM, &cs->flags);
> +}
> +
> +static inline int is_oom_kill_asking_task(const struct cpuset *cs)
> +{
> +	return test_bit(CS_OOM_KILL_ASKING_TASK, &cs->flags);
>  }
>  
>  /*
> @@ -1068,7 +1074,8 @@ static int update_memory_pressure_enable
>   * update_flag - read a 0 or a 1 in a file and update associated flag
>   * bit:	the bit to update (CS_CPU_EXCLUSIVE, CS_MEM_EXCLUSIVE,
>   *				CS_NOTIFY_ON_RELEASE, CS_MEMORY_MIGRATE,
> - *				CS_SPREAD_PAGE, CS_SPREAD_SLAB)
> + *				CS_SPREAD_PAGE, CS_SPREAD_SLAB,
> + *				CS_OOM_KILL_ASKING_TASK)
>   * cs:	the cpuset to update
>   * buf:	the buffer where we read the 0 or 1
>   *
> @@ -1320,6 +1327,7 @@ typedef enum {
>  	FILE_NOTIFY_ON_RELEASE,
>  	FILE_MEMORY_PRESSURE_ENABLED,
>  	FILE_MEMORY_PRESSURE,
> +	FILE_OOM_KILL_ASKING_TASK,
>  	FILE_SPREAD_PAGE,
>  	FILE_SPREAD_SLAB,
>  	FILE_TASKLIST,
> @@ -1382,6 +1390,9 @@ static ssize_t cpuset_common_file_write(
>  	case FILE_MEMORY_PRESSURE:
>  		retval = -EACCES;
>  		break;
> +	case FILE_OOM_KILL_ASKING_TASK:
> +		retval = update_flag(CS_OOM_KILL_ASKING_TASK, cs, buffer);
> +		break;
>  	case FILE_SPREAD_PAGE:
>  		retval = update_flag(CS_SPREAD_PAGE, cs, buffer);
>  		cs->mems_generation = cpuset_mems_generation++;
> @@ -1499,6 +1510,9 @@ static ssize_t cpuset_common_file_read(s
>  	case FILE_MEMORY_PRESSURE:
>  		s += sprintf(s, "%d", fmeter_getrate(&cs->fmeter));
>  		break;
> +	case FILE_OOM_KILL_ASKING_TASK:
> +		*s++ = is_oom_kill_asking_task(cs) ? '1' : '0';
> +		break;
>  	case FILE_SPREAD_PAGE:
>  		*s++ = is_spread_page(cs) ? '1' : '0';
>  		break;
> @@ -1861,6 +1875,11 @@ static struct cftype cft_memory_pressure
>  static struct cftype cft_memory_pressure = {
>  	.name = "memory_pressure",
>  	.private = FILE_MEMORY_PRESSURE,
> +};
> +
> +static struct cftype cft_oom_kill_asking_task = {
> +	.name = "oom_kill_asking_task",
> +	.private = FILE_OOM_KILL_ASKING_TASK,
>  };
>  
>  static struct cftype cft_spread_page = {
> @@ -1891,6 +1910,8 @@ static int cpuset_populate_dir(struct de
>  		return err;
>  	if ((err = cpuset_add_file(cs_dentry, &cft_memory_pressure)) < 0)
>  		return err;
> +	if ((err = cpuset_add_file(cs_dentry, &cft_oom_kill_asking_task)) < 0)
> +		return err;
>  	if ((err = cpuset_add_file(cs_dentry, &cft_spread_page)) < 0)
>  		return err;
>  	if ((err = cpuset_add_file(cs_dentry, &cft_spread_slab)) < 0)
> @@ -1923,6 +1944,8 @@ static long cpuset_create(struct cpuset 
>  	cs->flags = 0;
>  	if (notify_on_release(parent))
>  		set_bit(CS_NOTIFY_ON_RELEASE, &cs->flags);
> +	if (is_oom_kill_asking_task(parent))
> +		set_bit(CS_OOM_KILL_ASKING_TASK, &cs->flags);
>  	if (is_spread_page(parent))
>  		set_bit(CS_SPREAD_PAGE, &cs->flags);
>  	if (is_spread_slab(parent))
> @@ -2661,6 +2684,20 @@ void cpuset_clear_oom(struct task_struct
>  }
>  
>  /*
> + * Returns 1 if current should simply be killed when a cpuset-constrained OOM
> + * occurs.  Otherwise, we iterate through the task list and select the best
> + * candidate we can find.
> + */
> +int cpuset_oom_kill_asking_task(struct task_struct *task)
> +{
> +	int ret;
> +	task_lock(task);
> +	ret = is_oom_kill_asking_task(task->cpuset);
> +	task_unlock(task);
> +	return ret;
> +}
> +
> +/*
>   * Collection of memory_pressure is suppressed unless
>   * this flag is enabled by writing "1" to the special
>   * cpuset file 'memory_pressure_enabled' in the root cpuset.
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -457,6 +457,13 @@ void out_of_memory(struct zonelist *zone
>  
>  	case CONSTRAINT_CPUSET:
>  		read_lock(&tasklist_lock);
> +		if (cpuset_oom_kill_asking_task(current)) {
> +			oom_kill_process(current, 0,
> +					 "No available memory in cpuset", gfp_mask,
> +					 order);
> +			goto out_cpuset;
> +		}
> +
>  		last_tif_memdie = cpuset_get_last_tif_memdie(current);
>  		/*
>  		 * If current's cpuset is already in the OOM killer or its killed
> 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 23 of 24] serialize for cpusets
  2007-09-12 13:10   ` Andrew Morton
@ 2007-09-12 13:34     ` Andrea Arcangeli
  2007-09-12 19:08     ` David Rientjes
  2007-09-13  1:02     ` Christoph Lameter
  2 siblings, 0 replies; 113+ messages in thread
From: Andrea Arcangeli @ 2007-09-12 13:34 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-mm, David Rientjes, Christoph Lameter, Paul Jackson

On Wed, Sep 12, 2007 at 06:10:03AM -0700, Andrew Morton wrote:
> > +void cpuset_clear_oom(struct task_struct *task)
> > +{
> > +	task_lock(task);
> > +	clear_bit(CS_OOM, &task->cpuset->flags);
> > +	task_unlock(task);
> > +}
> 
> Seems strange to do a spinlock around a single already-atomic bitop?

For us the CS_OOM information is serialized by the task_lock. But I
assume the flags can also change outside of the task_lock for other
usages, hence the need for clear_bit.
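
To illustrate (sketch): if the clear were a plain non-atomic RMW,

	task->cpuset->flags &= ~(1UL << CS_OOM);

a concurrent set_bit() on some other bit of cs->flags done without
task_lock could be lost in the read-modify-write race.  clear_bit()
keeps the single-bit update atomic, while task_lock serializes only
the CS_OOM logic itself.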

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 19 of 24] cacheline align VM_is_OOM to prevent false sharing
  2007-09-12 13:02   ` Andrew Morton
@ 2007-09-12 13:36     ` Andrea Arcangeli
  2007-09-13  0:55       ` Christoph Lameter
  0 siblings, 1 reply; 113+ messages in thread
From: Andrea Arcangeli @ 2007-09-12 13:36 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-mm, David Rientjes

On Wed, Sep 12, 2007 at 06:02:55AM -0700, Andrew Morton wrote:
> I'd suggest __read_mostly.

Agreed.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 15 of 24] limit reclaim if enough pages have been freed
  2007-09-12 12:58   ` Andrew Morton
@ 2007-09-12 13:38     ` Andrea Arcangeli
  0 siblings, 0 replies; 113+ messages in thread
From: Andrea Arcangeli @ 2007-09-12 13:38 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-mm, David Rientjes

On Wed, Sep 12, 2007 at 05:58:00AM -0700, Andrew Morton wrote:
> Also, this has nothing to do with oom-killing, which is the subject of this
> patch series?

Yes, but at least I kept this in a separate patch ;). Most of the VM
changes were strictly OOM related; that's the reason for the subject.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 13 of 24] simplify oom heuristics
  2007-09-12 12:52   ` Andrew Morton
@ 2007-09-12 13:40     ` Andrea Arcangeli
  2007-09-12 20:52       ` Andrew Morton
  0 siblings, 1 reply; 113+ messages in thread
From: Andrea Arcangeli @ 2007-09-12 13:40 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-mm, David Rientjes

On Wed, Sep 12, 2007 at 05:52:40AM -0700, Andrew Morton wrote:
> I think the idea behind the code which you're removing is to avoid killing
> a computationally-expensive task which we've already invested a lot of CPU
> time in.  IOW, kill the job which has been running for three seconds in
> preference to the one which has been running three weeks.
> 
> That seems like a good strategy to me.

I know... but for certain apps like simulations, the task that goes
oom is one of the longest running ones.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 23 of 24] serialize for cpusets
  2007-09-12 13:10   ` Andrew Morton
  2007-09-12 13:34     ` Andrea Arcangeli
@ 2007-09-12 19:08     ` David Rientjes
  2007-09-13  1:02     ` Christoph Lameter
  2 siblings, 0 replies; 113+ messages in thread
From: David Rientjes @ 2007-09-12 19:08 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Andrea Arcangeli, linux-mm, Christoph Lameter, Paul Jackson

On Wed, 12 Sep 2007, Andrew Morton wrote:

> > # HG changeset patch
> > # User David Rientjes <rientjes@google.com>
> > # Date 1187778125 -7200
> > # Node ID a3d679df54ebb1f977b97ab6b3e501134bf9e7ef
> > # Parent  8807a4d14b241b2d1132fde7f83834603b6cf093
> > serialize for cpusets
> > 
> > Adds a last_tif_memdie_jiffies field to struct cpuset to store the
> > jiffies value at the last OOM kill.  This will detect deadlocks in the
> > CONSTRAINT_CPUSET case and kill another task if one is detected.
> > 
> > Adds a CS_OOM bit to struct cpuset's flags field.  This will be tested,
> > set, and cleared atomically to denote a cpuset that currently has an
> > attached task exiting as a result of the OOM killer.  We are required to
> > take p->alloc_lock to dereference p->cpuset so this cannot be implemented
> > as a simple trylock.
> > 
> > As a result, we cannot allow the detachment of a task from a cpuset that
> > is currently OOM killing one of its tasks.  If we did, we would end up
> > clearing the CS_OOM bit in the wrong cpuset upon that task's exit.
> > 
> > sysctl's panic_on_oom now only takes effect in the non-cpuset-constrained
> > case.
> > 
> > Cc: Andrea Arcangeli <andrea@suse.de>
> > Cc: Christoph Lameter <clameter@sgi.com>
> > Signed-off-by: David Rientjes <rientjes@google.com>
> 
> I understand that SGI's HPC customers care rather a lot about oom handling
> in cpusets.  It'd be nice if people@sgi could carefully review-and-test
> changes such as this before we go and break stuff for them, please.
> 

During the initial review of this change, Paul Jackson suggested adding
oom_kill_asking_task to switch this on and off, which the next patch in
this series does.


* Re: [PATCH 13 of 24] simplify oom heuristics
  2007-09-12 13:40     ` Andrea Arcangeli
@ 2007-09-12 20:52       ` Andrew Morton
  0 siblings, 0 replies; 113+ messages in thread
From: Andrew Morton @ 2007-09-12 20:52 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: linux-mm, David Rientjes

On Wed, 12 Sep 2007 15:40:12 +0200 Andrea Arcangeli <andrea@suse.de> wrote:

> On Wed, Sep 12, 2007 at 05:52:40AM -0700, Andrew Morton wrote:
> > I think the idea behind the code which you're removing is to avoid killing
> > a computationally-expensive task which we've already invested a lot of CPU
> > time in.  IOW, kill the job which has been running for three seconds in
> > preference to the one which has been running three weeks.
> > 
> > That seems like a good strategy to me.
> 
> I know... but for certain apps like simulations, the task that goes
> oom is one of the longest-running ones.

hmm.  There are ways in which operators can tweak this manually, aren't there?
I'd expect that owners of large, computationally expensive tasks which tend to go
oom are the sorts of people who would actually bother to learn about and alter
the kernel defaults.

Perhaps we aren't giving them sufficient controls at present?
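
For reference, the knob that exists today is /proc/<pid>/oom_adj
(range -16..15, with -17 disabling OOM killing for the task), which
biases the badness score. A sketch of a job opting itself in as the
preferred victim:

#include <stdio.h>

int main(void)
{
	/* +15 makes this task the most preferred OOM victim. */
	FILE *f = fopen("/proc/self/oom_adj", "w");

	if (!f) {
		perror("oom_adj");
		return 1;
	}
	fprintf(f, "15\n");
	fclose(f);
	/* ... exec the real job here ... */
	return 0;
}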


* Re: [PATCH 02 of 24] avoid oom deadlock in nfs_create_request
  2007-08-22 12:48 ` [PATCH 02 of 24] avoid oom deadlock in nfs_create_request Andrea Arcangeli
@ 2007-09-12 23:54   ` Christoph Lameter
  0 siblings, 0 replies; 113+ messages in thread
From: Christoph Lameter @ 2007-09-12 23:54 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: linux-mm, David Rientjes

On Wed, 22 Aug 2007, Andrea Arcangeli wrote:

> +	/* try to allocate the request struct */
> +	req = nfs_page_alloc();
> +	if (unlikely(!req)) {
> +		/*
> +		 * -ENOMEM will be returned only when TIF_MEMDIE is set
> +		 * so userland shouldn't risk to get confused by a new
> +		 * unhandled ENOMEM errno.
> +		 */
> +		WARN_ON(!test_thread_flag(TIF_MEMDIE));
> +		return ERR_PTR(-ENOMEM);

The comment does not match what is actually occurring. We unconditionally
return -ENOMEM. Debug leftover?



* Re: [PATCH 04 of 24] serialize oom killer
  2007-08-22 12:48 ` [PATCH 04 of 24] serialize oom killer Andrea Arcangeli
  2007-09-12 12:02   ` Andrew Morton
@ 2007-09-13  0:09   ` Christoph Lameter
  2007-09-13 18:32     ` David Rientjes
  1 sibling, 1 reply; 113+ messages in thread
From: Christoph Lameter @ 2007-09-13  0:09 UTC (permalink / raw)
  To: pj; +Cc: Andrea Arcangeli, linux-mm, David Rientjes

On Wed, 22 Aug 2007, Andrea Arcangeli wrote:

> It's risky and useless to run two oom killers in parallel, let serialize it to
> reduce the probability of spurious oom-killage.

Unless it is an OOM because of a constrained allocation. Then we will kill
the current process anyway, so it's okay to have that run in multiple
cpusets. That seems to have been the key thought when doing the locking
here.

We are already serializing on the cpuset lock. cpuset_lock() takes a per-cpuset
mutex! So OOM killing is already serialized per cpuset, as it should be.

So for NUMA this is a useless duplication of a lock that needlessly
adds additional global serialization. What is missing here for you is
serialization for the !NUMA case.

cpuset_lock() falls back to no lock at all if !CONFIG_CPUSETS. Paul: would
it make sense to make the fallback for cpuset_lock() take a global mutex?
If someone wants to lock a cpuset and the cpuset is the whole machine,
then a global lock should be taken, right?

If we fixed cpusets like that, then this patch would no longer be
necessary.


* Re: [PATCH 05 of 24] avoid selecting already killed tasks
  2007-08-22 12:48 ` [PATCH 05 of 24] avoid selecting already killed tasks Andrea Arcangeli
@ 2007-09-13  0:13   ` Christoph Lameter
  0 siblings, 0 replies; 113+ messages in thread
From: Christoph Lameter @ 2007-09-13  0:13 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: linux-mm, David Rientjes


* Re: [PATCH 07 of 24] balance_pgdat doesn't return the number of pages freed
  2007-09-12 12:18   ` Andrew Morton
@ 2007-09-13  0:26     ` Christoph Lameter
  0 siblings, 0 replies; 113+ messages in thread
From: Christoph Lameter @ 2007-09-13  0:26 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Andrea Arcangeli, linux-mm, David Rientjes

On Wed, 12 Sep 2007, Andrew Morton wrote:

> I'll skip this due to its dependency on
> [PATCH 06 of 24] reduce the probability of an OOM livelock

The return value of balance_pgdat() is never used, regardless of the
prior patch.

The only user of balance_pgdat() is kswapd():


	finish_wait(&pgdat->kswapd_wait, &wait);
	if (!try_to_freeze()) {
		/* We can speed up thawing tasks if we don't call
		 * balance_pgdat after returning from the refrigerator
		 */
		balance_pgdat(pgdat, order);
	}
}


* Re: [PATCH 10 of 24] stop useless vm trashing while we wait the TIF_MEMDIE task to exit
  2007-09-12 12:42   ` Andrew Morton
@ 2007-09-13  0:36     ` Christoph Lameter
  0 siblings, 0 replies; 113+ messages in thread
From: Christoph Lameter @ 2007-09-13  0:36 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Andrea Arcangeli, linux-mm, David Rientjes, pj

On Wed, 12 Sep 2007, Andrew Morton wrote:

> Also, the oom-killer is cpuset aware.  Won't this change cause an
> oom-killing in cpuset A to needlessly disrupt processes running in cpuset
> B?

Right. I remember reviewing this before. One could maybe set an OOM flag
per cpuset? But then OOM conditions can also be specific to a memory 
policy (MPOL_BIND) or to a particular node (GFP_THISNODE).

Maybe the best solution would be to set a per zone OOM flag?


* Re: [PATCH 14 of 24] oom select should only take rss into account
  2007-08-22 12:49 ` [PATCH 14 of 24] oom select should only take rss into account Andrea Arcangeli
@ 2007-09-13  0:43   ` Christoph Lameter
  0 siblings, 0 replies; 113+ messages in thread
From: Christoph Lameter @ 2007-09-13  0:43 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: linux-mm, akpm, David Rientjes



* Re: [PATCH 16 of 24] avoid some lock operation in vm fast path
  2007-09-12 12:59   ` Andrew Morton
@ 2007-09-13  0:49     ` Christoph Lameter
  2007-09-13  1:16       ` Andrew Morton
  0 siblings, 1 reply; 113+ messages in thread
From: Christoph Lameter @ 2007-09-13  0:49 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Andrea Arcangeli, linux-mm, David Rientjes

On Wed, 12 Sep 2007, Andrew Morton wrote:

> OK, but we'd normally do this via some little wrapper functions which are
> empty-if-not-numa.

The only leftover function of reclaim_in_progress is to ensure that
zone_reclaim() does not run concurrently. Maybe that can be accomplished
in a different way?

On the other hand: maybe we would like to limit concurrent reclaim even
for direct reclaim. We have some livelock issues because of zone lock
contention on large boxes that might improve if we simply let one
processor do the freeing job.
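
A sketch of that idea (hypothetical, not part of this patchset):

static DEFINE_MUTEX(direct_reclaim_mutex);

/*
 * Let one CPU at a time do the expensive freeing; everybody else naps
 * briefly and retries the allocation, hopefully finding the pages the
 * winner just freed.
 */
static unsigned long serialized_direct_reclaim(struct zone **zones,
					       int order, gfp_t gfp_mask)
{
	unsigned long nr_reclaimed;

	if (!mutex_trylock(&direct_reclaim_mutex)) {
		schedule_timeout_uninterruptible(1);
		return 0;
	}
	nr_reclaimed = try_to_free_pages(zones, order, gfp_mask);
	mutex_unlock(&direct_reclaim_mutex);
	return nr_reclaimed;
}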


* Re: [PATCH 17 of 24] apply the anti deadlock features only to global oom
  2007-08-22 12:49 ` [PATCH 17 of 24] apply the anti deadlock features only to global oom Andrea Arcangeli
  2007-09-12 13:02   ` Andrew Morton
@ 2007-09-13  0:52   ` Christoph Lameter
  1 sibling, 0 replies; 113+ messages in thread
From: Christoph Lameter @ 2007-09-13  0:52 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: linux-mm, David Rientjes

On Wed, 22 Aug 2007, Andrea Arcangeli wrote:

>  	switch (constraint) {
>  	case CONSTRAINT_MEMORY_POLICY:
> +		read_lock(&tasklist_lock);
>  		oom_kill_process(current, points,
>  				 "No available memory (MPOL_BIND)", gfp_mask, order);
> +		read_unlock(&tasklist_lock);
>  		break;
>  
>  	case CONSTRAINT_CPUSET:
> +		read_lock(&tasklist_lock);
>  		oom_kill_process(current, points,
>  				 "No available memory in cpuset", gfp_mask, order);
> +		read_unlock(&tasklist_lock);
>  		break;
>  
>  	case CONSTRAINT_NONE:
> +		if (down_trylock(&OOM_lock))
> +			break;
> +		read_lock(&tasklist_lock);

Hmmmm... The point is to take the OOM lock later to leave the NUMA
stuff out. However, there is already a per-cpuset lock being taken that
could also be useful as a global lock if cpusets are off.


* Re: [PATCH 17 of 24] apply the anti deadlock features only to global oom
  2007-09-12 13:02   ` Andrew Morton
@ 2007-09-13  0:53     ` Christoph Lameter
  0 siblings, 0 replies; 113+ messages in thread
From: Christoph Lameter @ 2007-09-13  0:53 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Andrea Arcangeli, linux-mm, David Rientjes

On Wed, 12 Sep 2007, Andrew Morton wrote:

> ok, I'm starting to get lost here.  Let's apply it unreviewed and if it
> breaks, that'll teach the numa weenies about the value of code review ;)

Nack. We should really try to consolidate the locking consistently. The
cpuset lock and the OOM_lock are duplicating things.


* Re: [PATCH 18 of 24] run panic the same way in both places
  2007-08-22 12:49 ` [PATCH 18 of 24] run panic the same way in both places Andrea Arcangeli
@ 2007-09-13  0:54   ` Christoph Lameter
  0 siblings, 0 replies; 113+ messages in thread
From: Christoph Lameter @ 2007-09-13  0:54 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: linux-mm, David Rientjes

On Wed, 22 Aug 2007, Andrea Arcangeli wrote:

> The other panic is called after releasing some core global lock; that
> sounds safe to have for both panics (just in case panic tries to do
> anything more than oops does).

Extract a common function for panicking instead? That way we have only one
place where we can mess things up.
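
E.g. (hypothetical helper; the unlock sequence is the one already
sitting in front of the existing panic call):

static void oom_panic(const char *reason)
{
	/* Drop the locks in one place, then die the same way everywhere. */
	read_unlock(&tasklist_lock);
	cpuset_unlock();
	panic("%s", reason);
}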


* Re: [PATCH 19 of 24] cacheline align VM_is_OOM to prevent false sharing
  2007-09-12 13:36     ` Andrea Arcangeli
@ 2007-09-13  0:55       ` Christoph Lameter
  0 siblings, 0 replies; 113+ messages in thread
From: Christoph Lameter @ 2007-09-13  0:55 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Andrew Morton, linux-mm, David Rientjes

On Wed, 12 Sep 2007, Andrea Arcangeli wrote:

> On Wed, Sep 12, 2007 at 06:02:55AM -0700, Andrew Morton wrote:
> > I'd suggest __read_mostly.
> 
> Agreed.

It's a global OOM condition that will kill allocations in cpusets that are
not OOM. Nack.


* Re: [PATCH 21 of 24] select process to kill for cpusets
  2007-09-12 13:05   ` Andrew Morton
@ 2007-09-13  0:59     ` Christoph Lameter
  2007-09-13  5:13       ` David Rientjes
  0 siblings, 1 reply; 113+ messages in thread
From: Christoph Lameter @ 2007-09-13  0:59 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Andrea Arcangeli, linux-mm, David Rientjes, Paul Jackson

On Wed, 12 Sep 2007, Andrew Morton wrote:

> > +			 * nothing and allow other cpusets to continue.
> > +			 */
> > +			if (constraint == CONSTRAINT_CPUSET)
> > +				goto out;
> >  			read_unlock(&tasklist_lock);
> >  			cpuset_unlock();
> >  			panic("Out of memory and no killable processes...\n");
> 
> Seems sensible, but it would be nice to get some thought cycles from pj &
> Christoph, please.

The reason that we do not scan the tasklist but kill the current process
is also that scanning the tasklist on large systems is very expensive.
Concurrent OOM killers may hold up the system for a long time. So we need
the kill without going through the tasklist.


* Re: [PATCH 23 of 24] serialize for cpusets
  2007-09-12 13:10   ` Andrew Morton
  2007-09-12 13:34     ` Andrea Arcangeli
  2007-09-12 19:08     ` David Rientjes
@ 2007-09-13  1:02     ` Christoph Lameter
  2 siblings, 0 replies; 113+ messages in thread
From: Christoph Lameter @ 2007-09-13  1:02 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Andrea Arcangeli, linux-mm, David Rientjes, Paul Jackson

On Wed, 12 Sep 2007, Andrew Morton wrote:

> I understand that SGI's HPC customers care rather a lot about oom handling
> in cpusets.  It'd be nice if people@sgi could carefully review-and-test
> changes such as this before we go and break stuff for them, please.

Is there some way that we can consolidate the cpuset and the !cpuset case? 
We have a cpuset_lock() for the cpuset case and now also the OOM bit. If 
both fall back to global in case of !CPUSET then we may be able to clean 
this up.


* Re: [PATCH 16 of 24] avoid some lock operation in vm fast path
  2007-09-13  0:49     ` Christoph Lameter
@ 2007-09-13  1:16       ` Andrew Morton
  2007-09-13  1:33         ` Christoph Lameter
  0 siblings, 1 reply; 113+ messages in thread
From: Andrew Morton @ 2007-09-13  1:16 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Andrea Arcangeli, linux-mm, David Rientjes

On Wed, 12 Sep 2007 17:49:23 -0700 (PDT) Christoph Lameter <clameter@sgi.com> wrote:

> On Wed, 12 Sep 2007, Andrew Morton wrote:
> 
> > OK, but we'd normally do this via some little wrapper functions which are
> > empty-if-not-numa.
> 
> The only leftover function of reclaim_in_progress is to ensure that 
> zone_reclaim() does not run concurrently. Maybe that can be accomplished 
> in a different way?

We could replace all_unreclaimable with `unsigned long flags' and do bitops
on it.
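
Something like this (names are hypothetical, and it assumes a new
unsigned long flags word in struct zone):

enum zone_flags {
	ZONE_ALL_UNRECLAIMABLE,		/* was: int all_unreclaimable */
	ZONE_RECLAIM_LOCKED,		/* was: reclaim_in_progress */
};

static inline int zone_reclaim_trylock(struct zone *zone)
{
	/* Returns 1 if we now own zone_reclaim() for this zone. */
	return !test_and_set_bit(ZONE_RECLAIM_LOCKED, &zone->flags);
}

static inline void zone_reclaim_unlock(struct zone *zone)
{
	clear_bit(ZONE_RECLAIM_LOCKED, &zone->flags);
}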

> On the other hand: maybe we would like to limit concurrent reclaim even 
> for direct reclaim. We have some livelock issues because of zone lock 
> contention on large boxes that might improve if we simply let one 
> processor do the freeing job.

There might be problems if the task which has the lock is using GFP_NOIO
and the one which failed to get the lock could have used GFP_KERNEL.


We should be able to directly decrease lock contention in there by chewing
on larger hunks: make scan_control.swap_cluster_max larger.  Did anyone try
that?

I guess we should stop calling that thing swap_cluster_max, really. 
swap_cluster_max is amount-of-stuff-to-write-to-swap for IO clustering. 
That's unrelated to amount-of-stuff-to-batch-in-page-reclaim for lock
contention reduction.  My fault.


* Re: [PATCH 16 of 24] avoid some lock operation in vm fast path
  2007-09-13  1:16       ` Andrew Morton
@ 2007-09-13  1:33         ` Christoph Lameter
  2007-09-13  1:41           ` KAMEZAWA Hiroyuki
  2007-09-13  1:44           ` Andrew Morton
  0 siblings, 2 replies; 113+ messages in thread
From: Christoph Lameter @ 2007-09-13  1:33 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Andrea Arcangeli, linux-mm, David Rientjes

On Wed, 12 Sep 2007, Andrew Morton wrote:

> We should be able to directly decrease lock contention in there by chewing
> on larger hunks: make scan_control.swap_cluster_max larger.  Did anyone try
> that?
> 
> I guess we should stop calling that thing swap_cluster_max, really. 
> swap_cluster_max is amount-of-stuff-to-write-to-swap for IO clustering. 
> That's unrelated to amount-of-stuff-to-batch-in-page-reclaim for lock
> contention reduction.  My fault.

So we need it configurable? Something like this?




Add /proc/sys/vm/reclaim_batch to configure the reclaim_batch size

Add a new proc variable to configure the reclaim batch size.

Signed-off-by: Christoph Lameter <clameter@sgi.com>

---
 include/linux/mmzone.h |    1 +
 kernel/sysctl.c        |    8 ++++++++
 mm/vmscan.c            |   41 +++++++++++++++++++++--------------------
 3 files changed, 30 insertions(+), 20 deletions(-)

Index: linux-2.6/mm/vmscan.c
===================================================================
--- linux-2.6.orig/mm/vmscan.c	2007-09-12 18:21:28.000000000 -0700
+++ linux-2.6/mm/vmscan.c	2007-09-12 18:31:13.000000000 -0700
@@ -57,11 +57,11 @@ struct scan_control {
 	/* Can pages be swapped as part of reclaim? */
 	int may_swap;
 
-	/* This context's SWAP_CLUSTER_MAX. If freeing memory for
-	 * suspend, we effectively ignore SWAP_CLUSTER_MAX.
+	/* This context's  reclaim batch size. If freeing memory for
+	 * suspend, we effectively ignore reclaim_batch.
 	 * In this context, it doesn't matter that we scan the
 	 * whole list at once. */
-	int swap_cluster_max;
+	int reclaim_batch;
 
 	int swappiness;
 
@@ -105,6 +105,7 @@ struct scan_control {
  */
 int vm_swappiness = 60;
 long vm_total_pages;	/* The total number of pages which the VM controls */
+int sysctl_reclaim_batch = SWAP_CLUSTER_MAX;
 
 static LIST_HEAD(shrinker_list);
 static DECLARE_RWSEM(shrinker_rwsem);
@@ -159,7 +160,7 @@ unsigned long shrink_slab(unsigned long 
 	unsigned long ret = 0;
 
 	if (scanned == 0)
-		scanned = SWAP_CLUSTER_MAX;
+		scanned = sysctl_reclaim_batch;
 
 	if (!down_read_trylock(&shrinker_rwsem))
 		return 1;	/* Assume we'll be able to shrink next time */
@@ -338,7 +339,7 @@ static pageout_t pageout(struct page *pa
 		int res;
 		struct writeback_control wbc = {
 			.sync_mode = WB_SYNC_NONE,
-			.nr_to_write = SWAP_CLUSTER_MAX,
+			.nr_to_write = sysctl_reclaim_batch,
 			.range_start = 0,
 			.range_end = LLONG_MAX,
 			.nonblocking = 1,
@@ -801,7 +802,7 @@ static unsigned long shrink_inactive_lis
 		unsigned long nr_freed;
 		unsigned long nr_active;
 
-		nr_taken = isolate_lru_pages(sc->swap_cluster_max,
+		nr_taken = isolate_lru_pages(sc->reclaim_batch,
 			     &zone->inactive_list,
 			     &page_list, &nr_scan, sc->order,
 			     (sc->order > PAGE_ALLOC_COSTLY_ORDER)?
@@ -1076,7 +1077,7 @@ static unsigned long shrink_zone(int pri
 	zone->nr_scan_active +=
 		(zone_page_state(zone, NR_ACTIVE) >> priority) + 1;
 	nr_active = zone->nr_scan_active;
-	if (nr_active >= sc->swap_cluster_max)
+	if (nr_active >= sc->reclaim_batch)
 		zone->nr_scan_active = 0;
 	else
 		nr_active = 0;
@@ -1084,7 +1085,7 @@ static unsigned long shrink_zone(int pri
 	zone->nr_scan_inactive +=
 		(zone_page_state(zone, NR_INACTIVE) >> priority) + 1;
 	nr_inactive = zone->nr_scan_inactive;
-	if (nr_inactive >= sc->swap_cluster_max)
+	if (nr_inactive >= sc->reclaim_batch)
 		zone->nr_scan_inactive = 0;
 	else
 		nr_inactive = 0;
@@ -1092,14 +1093,14 @@ static unsigned long shrink_zone(int pri
 	while (nr_active || nr_inactive) {
 		if (nr_active) {
 			nr_to_scan = min(nr_active,
-					(unsigned long)sc->swap_cluster_max);
+					(unsigned long)sc->reclaim_batch);
 			nr_active -= nr_to_scan;
 			shrink_active_list(nr_to_scan, zone, sc, priority);
 		}
 
 		if (nr_inactive) {
 			nr_to_scan = min(nr_inactive,
-					(unsigned long)sc->swap_cluster_max);
+					(unsigned long)sc->reclaim_batch);
 			nr_inactive -= nr_to_scan;
 			nr_reclaimed += shrink_inactive_list(nr_to_scan, zone,
 								sc);
@@ -1181,7 +1182,7 @@ unsigned long try_to_free_pages(struct z
 	struct scan_control sc = {
 		.gfp_mask = gfp_mask,
 		.may_writepage = !laptop_mode,
-		.swap_cluster_max = SWAP_CLUSTER_MAX,
+		.reclaim_batch = sysctl_reclaim_batch,
 		.may_swap = 1,
 		.swappiness = vm_swappiness,
 		.order = order,
@@ -1210,7 +1211,7 @@ unsigned long try_to_free_pages(struct z
 			reclaim_state->reclaimed_slab = 0;
 		}
 		total_scanned += sc.nr_scanned;
-		if (nr_reclaimed >= sc.swap_cluster_max) {
+		if (nr_reclaimed >= sc.reclaim_batch) {
 			ret = 1;
 			goto out;
 		}
@@ -1222,8 +1223,8 @@ unsigned long try_to_free_pages(struct z
 		 * that's undesirable in laptop mode, where we *want* lumpy
 		 * writeout.  So in laptop mode, write out the whole world.
 		 */
-		if (total_scanned > sc.swap_cluster_max +
-					sc.swap_cluster_max / 2) {
+		if (total_scanned > sc.reclaim_batch +
+					sc.reclaim_batch / 2) {
 			wakeup_pdflush(laptop_mode ? 0 : total_scanned);
 			sc.may_writepage = 1;
 		}
@@ -1288,7 +1289,7 @@ static unsigned long balance_pgdat(pg_da
 	struct scan_control sc = {
 		.gfp_mask = GFP_KERNEL,
 		.may_swap = 1,
-		.swap_cluster_max = SWAP_CLUSTER_MAX,
+		.reclaim_batch = sysctl_reclaim_batch,
 		.swappiness = vm_swappiness,
 		.order = order,
 	};
@@ -1388,7 +1389,7 @@ loop_again:
 			 * the reclaim ratio is low, start doing writepage
 			 * even in laptop mode
 			 */
-			if (total_scanned > SWAP_CLUSTER_MAX * 2 &&
+			if (total_scanned > sysctl_reclaim_batch * 2 &&
 			    total_scanned > nr_reclaimed + nr_reclaimed / 2)
 				sc.may_writepage = 1;
 		}
@@ -1407,7 +1408,7 @@ loop_again:
 		 * matches the direct reclaim path behaviour in terms of impact
 		 * on zone->*_priority.
 		 */
-		if (nr_reclaimed >= SWAP_CLUSTER_MAX)
+		if (nr_reclaimed >= sysctl_reclaim_batch)
 			break;
 	}
 out:
@@ -1600,7 +1601,7 @@ unsigned long shrink_all_memory(unsigned
 	struct scan_control sc = {
 		.gfp_mask = GFP_KERNEL,
 		.may_swap = 0,
-		.swap_cluster_max = nr_pages,
+		.reclaim_batch = nr_pages,
 		.may_writepage = 1,
 		.swappiness = vm_swappiness,
 	};
@@ -1782,8 +1783,8 @@ static int __zone_reclaim(struct zone *z
 	struct scan_control sc = {
 		.may_writepage = !!(zone_reclaim_mode & RECLAIM_WRITE),
 		.may_swap = !!(zone_reclaim_mode & RECLAIM_SWAP),
-		.swap_cluster_max = max_t(unsigned long, nr_pages,
-					SWAP_CLUSTER_MAX),
+		.reclaim_batch = max_t(unsigned long, nr_pages,
+					sysctl_reclaim_batch),
 		.gfp_mask = gfp_mask,
 		.swappiness = vm_swappiness,
 	};
Index: linux-2.6/include/linux/mmzone.h
===================================================================
--- linux-2.6.orig/include/linux/mmzone.h	2007-09-12 18:28:58.000000000 -0700
+++ linux-2.6/include/linux/mmzone.h	2007-09-12 18:29:42.000000000 -0700
@@ -607,6 +607,7 @@ int sysctl_min_unmapped_ratio_sysctl_han
 int sysctl_min_slab_ratio_sysctl_handler(struct ctl_table *, int,
 			struct file *, void __user *, size_t *, loff_t *);
 
+extern int sysctl_reclaim_batch;
 extern int numa_zonelist_order_handler(struct ctl_table *, int,
 			struct file *, void __user *, size_t *, loff_t *);
 extern char numa_zonelist_order[];
Index: linux-2.6/kernel/sysctl.c
===================================================================
--- linux-2.6.orig/kernel/sysctl.c	2007-09-12 18:27:12.000000000 -0700
+++ linux-2.6/kernel/sysctl.c	2007-09-12 18:28:48.000000000 -0700
@@ -900,6 +900,14 @@ static ctl_table vm_table[] = {
 		.strategy	= &sysctl_intvec,
 	},
 	{
+		.ctl_name	= CTL_UNNUMBERED,
+		.procname	= "reclaim_batch",
+		.data		= &sysctl_reclaim_batch,
+		.maxlen		= sizeof(sysctl_reclaim_batch),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec,
+	},
+	{
 		.ctl_name	= VM_DROP_PAGECACHE,
 		.procname	= "drop_caches",
 		.data		= &sysctl_drop_caches,


* Re: [PATCH 16 of 24] avoid some lock operation in vm fast path
  2007-09-13  1:33         ` Christoph Lameter
@ 2007-09-13  1:41           ` KAMEZAWA Hiroyuki
  2007-09-13  1:44           ` Andrew Morton
  1 sibling, 0 replies; 113+ messages in thread
From: KAMEZAWA Hiroyuki @ 2007-09-13  1:41 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrew Morton, Andrea Arcangeli, linux-mm, David Rientjes

On Wed, 12 Sep 2007 18:33:48 -0700 (PDT)
Christoph Lameter <clameter@sgi.com> wrote:

>
> +int sysctl_reclaim_batch = SWAP_CLUSTER_MAX;
>  
nitpick...

should be __read_mostly ?

Thanks,
-Kame


* Re: [PATCH 16 of 24] avoid some lock operation in vm fast path
  2007-09-13  1:33         ` Christoph Lameter
  2007-09-13  1:41           ` KAMEZAWA Hiroyuki
@ 2007-09-13  1:44           ` Andrew Morton
  1 sibling, 0 replies; 113+ messages in thread
From: Andrew Morton @ 2007-09-13  1:44 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Andrea Arcangeli, linux-mm, David Rientjes

On Wed, 12 Sep 2007 18:33:48 -0700 (PDT) Christoph Lameter <clameter@sgi.com> wrote:

> > We should be able to directly decrease lock contention in there by chewing
> > on larger hunks: make scan_control.swap_cluster_max larger.  Did anyone try
> > that?
> > 
> > I guess we should stop calling that thing swap_cluster_max, really. 
> > swap_cluster_max is amount-of-stuff-to-write-to-swap for IO clustering. 
> > That's unrelated to amount-of-stuff-to-batch-in-page-reclaim for lock
> > contention reduction.  My fault.
> 
> So we need it configurable? Something like this?
> 
> 
> 
> 
> Add /proc/sys/vm/reclaim_batch to configure the reclaim_batch size
> 
> Add a new proc variable to configure the reclaim batch size.

That's a suitable start for someone to do a bit of performance testing.  If
it turns out to be worthwhile then perhaps we might decide to make it a
per-zone ratio based on present_pages or something, and to make the initial
defaults something more appropriate than SWAP_CLUSTER_MAX.

Also there might be tradeoffs between the size of this thing and the number
of cpus (per node?).

Dunno.  It all depends whether there's significant benefit to be had here. 
If there is, some additional testing and tuning would be needed.


* Re: [PATCH 21 of 24] select process to kill for cpusets
  2007-09-13  0:59     ` Christoph Lameter
@ 2007-09-13  5:13       ` David Rientjes
  2007-09-13 17:55         ` Christoph Lameter
  0 siblings, 1 reply; 113+ messages in thread
From: David Rientjes @ 2007-09-13  5:13 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Andrew Morton, Andrea Arcangeli, linux-mm, Paul Jackson

On Wed, 12 Sep 2007, Christoph Lameter wrote:

> The reason that we do not scan the tasklist but kill the current process 
> is also that scanning the tasklist on large systems is very expensive. 
> Concurrent OOM killers may hold up the system for a long time. So we need
> the kill without going through the tasklist.
> 

And that's why oom_kill_asking_task is added in the final patch of the 
series.


* Re: [PATCH 21 of 24] select process to kill for cpusets
  2007-09-13  5:13       ` David Rientjes
@ 2007-09-13 17:55         ` Christoph Lameter
  0 siblings, 0 replies; 113+ messages in thread
From: Christoph Lameter @ 2007-09-13 17:55 UTC (permalink / raw)
  To: David Rientjes; +Cc: Andrew Morton, Andrea Arcangeli, linux-mm, Paul Jackson

On Wed, 12 Sep 2007, David Rientjes wrote:

> On Wed, 12 Sep 2007, Christoph Lameter wrote:
> 
> > The reason that we do not scan the tasklist but kill the current process 
> > is also that scanning the tasklist on large systems is very expensive. 
> > Concurrent OOM killers may hold up the system for a long time. So we need
> > the kill without going through the tasklist.
> > 
> 
> And that's why oom_kill_asking_task is added in the final patch of the 
> series.

Yeah, the patchset is the original one that I reviewed before, with
fixups attached at the end.


* Re: [PATCH 04 of 24] serialize oom killer
  2007-09-13  0:09   ` Christoph Lameter
@ 2007-09-13 18:32     ` David Rientjes
  2007-09-13 18:37       ` Christoph Lameter
  0 siblings, 1 reply; 113+ messages in thread
From: David Rientjes @ 2007-09-13 18:32 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: pj, Andrea Arcangeli, linux-mm

On Wed, 12 Sep 2007, Christoph Lameter wrote:

> We are already serializing on the cpuset lock. cpuset_lock() takes a per-cpuset 
> mutex! So OOM killing is already serialized per cpuset, as it should be.
> 

The problem is that cpuset_lock() is a mutex and doesn't exit the OOM 
killer immediately if it can't be locked.  This is a problem that we've 
encountered before where multiple tasks enter the OOM killer and sleep 
waiting for the lock.  Then one instance of the OOM killer kills current 
and the cpuset is no longer OOM, but the other threads waiting on the 
mutex will still kill tasks unnecessarily after taking cpuset_lock().

		David


* Re: [PATCH 04 of 24] serialize oom killer
  2007-09-13 18:32     ` David Rientjes
@ 2007-09-13 18:37       ` Christoph Lameter
  2007-09-13 18:46         ` David Rientjes
  0 siblings, 1 reply; 113+ messages in thread
From: Christoph Lameter @ 2007-09-13 18:37 UTC (permalink / raw)
  To: David Rientjes; +Cc: pj, Andrea Arcangeli, linux-mm

On Thu, 13 Sep 2007, David Rientjes wrote:

> On Wed, 12 Sep 2007, Christoph Lameter wrote:
> 
> > We are already serializing on the cpuset lock. cpuset_lock() takes a per-cpuset 
> > mutex! So OOM killing is already serialized per cpuset, as it should be.
> > 
> 
> The problem is that cpuset_lock() is a mutex and doesn't exit the OOM 
> killer immediately if it can't be locked.  This is a problem that we've 
> encountered before where multiple tasks enter the OOM killer and sleep 
> waiting for the lock.  Then one instance of the OOM killer kills current 
> and the cpuset is no longer OOM, but the other threads waiting on the 
> mutex will still kill tasks unnecessarily after taking cpuset_lock().

Ok then that needs to be changed. We need to do a cpuset_try_lock there?
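
A sketch, assuming cpuset_lock() stays backed by the global
callback_mutex in kernel/cpuset.c:

int cpuset_try_lock(void)
{
	/* Non-blocking cpuset_lock(); returns 1 if the mutex was taken. */
	return mutex_trylock(&callback_mutex);
}

The OOM path would then bail out instead of sleeping behind an
in-flight kill:

	if (!cpuset_try_lock())
		return;		/* someone is already OOM killing */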


* Re: [PATCH 04 of 24] serialize oom killer
  2007-09-13 18:37       ` Christoph Lameter
@ 2007-09-13 18:46         ` David Rientjes
  2007-09-13 18:53           ` Christoph Lameter
  0 siblings, 1 reply; 113+ messages in thread
From: David Rientjes @ 2007-09-13 18:46 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: pj, Andrea Arcangeli, linux-mm

On Thu, 13 Sep 2007, Christoph Lameter wrote:

> > The problem is that cpuset_lock() is a mutex and doesn't exit the OOM 
> > killer immediately if it can't be locked.  This is a problem that we've 
> > encountered before where multiple tasks enter the OOM killer and sleep 
> > waiting for the lock.  Then one instance of the OOM killer kills current 
> > and the cpuset is no longer OOM, but the other threads waiting on the 
> > mutex will still kill tasks unnecessarily after taking cpuset_lock().
> 
> Ok then that needs to be changed. We need to do a cpuset_try_lock there?
> 

It's easier to serialize it outside of out_of_memory() instead, since it 
only has a single caller and we don't need to serialize for sysrq.

This seems like it would collapse down nicely to a global or per-cpuset 
serialization with an added helper function implemented partially in 
kernel/cpuset.c for the CONFIG_CPUSETS case.

Then, in __alloc_pages(), we test for either a global or per-cpuset 
spin_trylock() and, if we acquire it, call out_of_memory() and goto 
restart as we currently do.  If it's contended, we reschedule ourselves
and goto restart when we awaken.

		David


* Re: [PATCH 04 of 24] serialize oom killer
  2007-09-13 18:46         ` David Rientjes
@ 2007-09-13 18:53           ` Christoph Lameter
  2007-09-14  0:36             ` David Rientjes
  0 siblings, 1 reply; 113+ messages in thread
From: Christoph Lameter @ 2007-09-13 18:53 UTC (permalink / raw)
  To: David Rientjes; +Cc: pj, Andrea Arcangeli, linux-mm

On Thu, 13 Sep 2007, David Rientjes wrote:

> > Ok then that needs to be changed. We need to do a cpuset_try_lock there?
> 
> It's easier to serialize it outside of out_of_memory() instead, since it 
> only has a single caller and we don't need to serialize for sysrq.
> 
> This seems like it would collapse down nicely to a global or per-cpuset 
> serialization with an added helper function implemented partially in 
> kernel/cpuset.c for the CONFIG_CPUSETS case.
> 
> Then, in __alloc_pages(), we test for either a global or per-cpuset 
> spin_trylock() and, if we acquire it, call out_of_memory() and goto 
> restart as we currently do.  If it's contended, we reschedule ourselves
> and goto restart when we awaken.

Could you rephrase that in patch form? ;-)


* Re: [PATCH 04 of 24] serialize oom killer
  2007-09-13 18:53           ` Christoph Lameter
@ 2007-09-14  0:36             ` David Rientjes
  2007-09-14  2:31               ` Christoph Lameter
  0 siblings, 1 reply; 113+ messages in thread
From: David Rientjes @ 2007-09-14  0:36 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Paul Jackson, Andrea Arcangeli, linux-mm

On Thu, 13 Sep 2007, Christoph Lameter wrote:

> > It's easier to serialize it outside of out_of_memory() instead, since it 
> > only has a single caller and we don't need to serialize for sysrq.
> > 
> > This seems like it would collapse down nicely to a global or per-cpuset 
> > serialization with an added helper function implemented partially in 
> > kernel/cpuset.c for the CONFIG_CPUSETS case.
> > 
> > Then, in __alloc_pages(), we test for either a global or per-cpuset 
> > spin_trylock() and, if we acquire it, call out_of_memory() and goto 
> > restart as we currently do.  If it's contended, we reschedule ourselves
> > and goto restart when we awaken.
> 
> Could you rephrase that in patch form? ;-)
> 

Yeah, it turned out to be a little more invasive than I thought, but it
appears to be the cleanest solution for both the general CONSTRAINT_NONE
and the per-cpuset CONSTRAINT_CPUSET cases.

I've been trying to keep score at home, but I've lost track of which
patches from the series we're keeping, so this is against HEAD.




serialize oom killer

Serializes the OOM killer both globally and per-cpuset, depending on the
system configuration.

A new spinlock, oom_lock, is introduced for the global case.  It
serializes the OOM killer for systems that are not using cpusets.  Only
one system task may enter the OOM killer at a time to prevent
unnecessarily killing others.

A per-cpuset flag, CS_OOM, is introduced in the flags field of struct
cpuset.  It serializes the OOM killer only for hardwall allocations
targeted at that cpuset.  Only one task for each cpuset may enter the
OOM killer at a time to prevent unnecessarily killing others.  When a
per-cpuset OOM killing is taking place, the global spinlock is also
locked since we'll be alleviating that condition at the same time.

Regardless of the synchronization primitive used, if a task cannot
acquire the OOM lock, it is put to sleep before retrying the triggering
allocation so that the OOM killer may finish and free some memory.

We acquire either lock before attempting one last try at
get_page_from_freelist() with a very high watermark; otherwise we could
invoke the OOM killer needlessly if another thread reschedules between
this allocation attempt and trying to take the OOM lock.

Also converts the CONSTRAINT_{NONE,CPUSET,MEMORY_POLICY} defines to an
enum and moves them to include/linux/swap.h.  We're going to need an
include/linux/oom_kill.h soon, probably.

Cc: Andrea Arcangeli <andrea@suse.de>
Cc: Christoph Lameter <clameter@sgi.com>
Signed-off-by: David Rientjes <rientjes@google.com>
---
 drivers/char/sysrq.c   |    3 +-
 include/linux/cpuset.h |   13 ++++++++++-
 include/linux/swap.h   |   14 ++++++++++-
 kernel/cpuset.c        |   16 +++++++++++++
 mm/oom_kill.c          |   58 ++++++++++++++++++++++++++++++++++++-----------
 mm/page_alloc.c        |   42 +++++++++++++++++++++++-----------
 6 files changed, 114 insertions(+), 32 deletions(-)

diff --git a/drivers/char/sysrq.c b/drivers/char/sysrq.c
--- a/drivers/char/sysrq.c
+++ b/drivers/char/sysrq.c
@@ -270,8 +270,7 @@ static struct sysrq_key_op sysrq_term_op = {
 
 static void moom_callback(struct work_struct *ignored)
 {
-	out_of_memory(&NODE_DATA(0)->node_zonelists[ZONE_NORMAL],
-			GFP_KERNEL, 0);
+	out_of_memory(GFP_KERNEL, 0, CONSTRAINT_NONE);
 }
 
 static DECLARE_WORK(moom_work, moom_callback);
diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -60,7 +60,8 @@ extern char *cpuset_task_status_allowed(struct task_struct *task, char *buffer);
 
 extern void cpuset_lock(void);
 extern void cpuset_unlock(void);
-
+extern int cpuset_oom_test_and_set_lock(void);
+extern int cpuset_oom_unlock(void);
 extern int cpuset_mem_spread_node(void);
 
 static inline int cpuset_do_page_mem_spread(void)
@@ -129,6 +130,16 @@ static inline char *cpuset_task_status_allowed(struct task_struct *task,
 static inline void cpuset_lock(void) {}
 static inline void cpuset_unlock(void) {}
 
+static inline int cpuset_oom_test_and_set_lock(void)
+{
+	return -1;
+}
+
+static inline int cpuset_oom_unlock(void)
+{
+	return 0;
+}
+
 static inline int cpuset_mem_spread_node(void)
 {
 	return 0;
diff --git a/include/linux/swap.h b/include/linux/swap.h
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -159,9 +159,21 @@ struct swap_list_t {
 #define vm_swap_full() (nr_swap_pages*2 < total_swap_pages)
 
 /* linux/mm/oom_kill.c */
-extern void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, int order);
+/*
+ * Types of limitations to the nodes from which allocations may occur
+ */
+enum oom_constraint {
+	CONSTRAINT_NONE,
+	CONSTRAINT_CPUSET,
+	CONSTRAINT_MEMORY_POLICY,
+};
+extern void out_of_memory(gfp_t gfp_mask, int order,
+			  enum oom_constraint constraint);
 extern int register_oom_notifier(struct notifier_block *nb);
 extern int unregister_oom_notifier(struct notifier_block *nb);
+extern int oom_test_and_set_lock(struct zonelist *zonelist, gfp_t gfp_mask,
+				 enum oom_constraint *constraint);
+extern void oom_unlock(enum oom_constraint constraint);
 
 /* linux/mm/memory.c */
 extern void swapin_readahead(swp_entry_t, unsigned long, struct vm_area_struct *);
diff --git a/kernel/cpuset.c b/kernel/cpuset.c
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -109,6 +109,7 @@ typedef enum {
 	CS_NOTIFY_ON_RELEASE,
 	CS_SPREAD_PAGE,
 	CS_SPREAD_SLAB,
+	CS_IS_OOM,
 } cpuset_flagbits_t;
 
 /* convenient tests for these bits */
@@ -147,6 +148,11 @@ static inline int is_spread_slab(const struct cpuset *cs)
 	return test_bit(CS_SPREAD_SLAB, &cs->flags);
 }
 
+static inline int is_oom(const struct cpuset *cs)
+{
+	return test_bit(CS_IS_OOM, &cs->flags);
+}
+
 /*
  * Increment this integer everytime any cpuset changes its
  * mems_allowed value.  Users of cpusets can track this generation
@@ -2527,6 +2533,16 @@ void cpuset_unlock(void)
 	mutex_unlock(&callback_mutex);
 }
 
+int cpuset_oom_test_and_set_lock(void)
+{
+	return test_and_set_bit(CS_IS_OOM, &current->cpuset->flags);
+}
+
+int cpuset_oom_unlock(void)
+{
+	return test_and_clear_bit(CS_IS_OOM, &current->cpuset->flags);
+}
+
 /**
  * cpuset_mem_spread_node() - On which node to begin search for a page
  *
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -27,6 +27,7 @@
 #include <linux/notifier.h>
 
 int sysctl_panic_on_oom;
+static DEFINE_SPINLOCK(oom_lock);
 /* #define DEBUG */
 
 /**
@@ -164,13 +165,6 @@ unsigned long badness(struct task_struct *p, unsigned long uptime)
 }
 
 /*
- * Types of limitations to the nodes from which allocations may occur
- */
-#define CONSTRAINT_NONE 1
-#define CONSTRAINT_MEMORY_POLICY 2
-#define CONSTRAINT_CPUSET 3
-
-/*
  * Determine the type of allocation constraint.
  */
 static inline int constrained_alloc(struct zonelist *zonelist, gfp_t gfp_mask)
@@ -387,6 +381,48 @@ int unregister_oom_notifier(struct notifier_block *nb)
 }
 EXPORT_SYMBOL_GPL(unregister_oom_notifier);
 
+/*
+ * If using cpusets, try to lock task's per-cpuset OOM lock; otherwise, try to
+ * lock the global OOM spinlock.  Returns non-zero if the lock is contended or
+ * zero if acquired.
+ */
+int oom_test_and_set_lock(struct zonelist *zonelist, gfp_t gfp_mask,
+			  enum oom_constraint *constraint)
+{
+	int ret;
+
+	*constraint = constrained_alloc(zonelist, gfp_mask);
+	switch (*constraint) {
+	case CONSTRAINT_CPUSET:
+		ret = cpuset_oom_test_and_set_lock();
+		if (!ret)
+			spin_trylock(&oom_lock);
+		break;
+	default:
+		ret = spin_trylock(&oom_lock);
+		break;
+	}
+	return ret;
+}
+
+/*
+ * If using cpusets, unlock task's per-cpuset OOM lock; otherwise, unlock the
+ * global OOM spinlock.
+ */
+void oom_unlock(enum oom_constraint constraint)
+{
+	switch (constraint) {
+	case CONSTRAINT_CPUSET:
+		if (likely(spin_is_locked(&oom_lock)))
+			spin_unlock(&oom_lock);
+		cpuset_oom_unlock();
+		break;
+	default:
+		spin_unlock(&oom_lock);
+		break;
+	}
+}
+
 /**
  * out_of_memory - kill the "best" process when we run out of memory
  *
@@ -395,12 +431,11 @@ EXPORT_SYMBOL_GPL(unregister_oom_notifier);
  * OR try to be smart about which process to kill. Note that we
  * don't have to be perfect here, we just have to be good.
  */
-void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, int order)
+void out_of_memory(gfp_t gfp_mask, int order, enum oom_constraint constraint)
 {
 	struct task_struct *p;
 	unsigned long points = 0;
 	unsigned long freed = 0;
-	int constraint;
 
 	blocking_notifier_call_chain(&oom_notify_list, 0, &freed);
 	if (freed > 0)
@@ -418,11 +453,6 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, int order)
 	if (sysctl_panic_on_oom == 2)
 		panic("out of memory. Compulsory panic_on_oom is selected.\n");
 
-	/*
-	 * Check if there were limitations on the allocation (only relevant for
-	 * NUMA) that may require different handling.
-	 */
-	constraint = constrained_alloc(zonelist, gfp_mask);
 	cpuset_lock();
 	read_lock(&tasklist_lock);
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1352,22 +1352,36 @@ nofail_alloc:
 		if (page)
 			goto got_pg;
 	} else if ((gfp_mask & __GFP_FS) && !(gfp_mask & __GFP_NORETRY)) {
-		/*
-		 * Go through the zonelist yet one more time, keep
-		 * very high watermark here, this is only to catch
-		 * a parallel oom killing, we must fail if we're still
-		 * under heavy pressure.
-		 */
-		page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, order,
-				zonelist, ALLOC_WMARK_HIGH|ALLOC_CPUSET);
-		if (page)
-			goto got_pg;
+		enum oom_constraint constraint = CONSTRAINT_NONE;
 
-		/* The OOM killer will not help higher order allocs so fail */
-		if (order > PAGE_ALLOC_COSTLY_ORDER)
-			goto nopage;
+		if (!oom_test_and_set_lock(zonelist, gfp_mask, &constraint)) {
+			/*
+			 * Go through the zonelist yet one more time, keep
+			 * very high watermark here, this is only to catch
+			 * a previous oom killing, we must fail if we're still
+			 * under heavy pressure.
+			 */
+			page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL,
+					order, zonelist,
+					ALLOC_WMARK_HIGH|ALLOC_CPUSET);
+			if (page) {
+				oom_unlock(constraint);
+				goto got_pg;
+			}
+
+			/*
+			 * The OOM killer will not help higher order allocs so
+			 * fail
+			 */
+			if (order > PAGE_ALLOC_COSTLY_ORDER) {
+				oom_unlock(constraint);
+				goto nopage;
+			}
 
-		out_of_memory(zonelist, gfp_mask, order);
+			out_of_memory(gfp_mask, order, constraint);
+			oom_unlock(constraint);
+		} else
+			schedule_timeout_uninterruptible(1);
 		goto restart;
 	}
 


* Re: [PATCH 04 of 24] serialize oom killer
  2007-09-14  0:36             ` David Rientjes
@ 2007-09-14  2:31               ` Christoph Lameter
  2007-09-14  3:33                 ` David Rientjes
  0 siblings, 1 reply; 113+ messages in thread
From: Christoph Lameter @ 2007-09-14  2:31 UTC (permalink / raw)
  To: David Rientjes; +Cc: Paul Jackson, Andrea Arcangeli, linux-mm

On Thu, 13 Sep 2007, David Rientjes wrote:

> serialize oom killer
> 
> Serializes the OOM killer both globally and per-cpuset, depending on the
> system configuration.
> 
> A new spinlock, oom_lock, is introduced for the global case.  It
> serializes the OOM killer for systems that are not using cpusets.  Only
> one system task may enter the OOM killer at a time to prevent
> unnecessarily killing others.

That oom_lock seems to be handled strangely. There is already a global
cpuset with the per-cpuset locks. If those locks were available in a
static structure in the !CPUSET case, then I think we could avoid the
oom_lock weirdness.

> A per-cpuset flag, CS_OOM, is introduced in the flags field of struct
> cpuset.  It serializes the OOM killer only for hardwall allocations
> targeted at that cpuset.  Only one task for each cpuset may enter the
> OOM killer at a time to prevent unnecessarily killing others.  When a
> per-cpuset OOM killing is taking place, the global spinlock is also
> locked since we'll be alleviating that condition at the same time.

Hummm... If the global lock is taken then we can only run one OOM killer
at a time, right?

> Also converts the CONSTRAINT_{NONE,CPUSET,MEMORY_POLICY} defines to an
> enum and moves them to include/linux/swap.h.  We're going to need an
> include/linux/oom_kill.h soon, probably.

Sounds good.

> +/*
> + * If using cpusets, try to lock task's per-cpuset OOM lock; otherwise, try to
> + * lock the global OOM spinlock.  Returns non-zero if the lock is contended or
> + * zero if acquired.
> + */
> +int oom_test_and_set_lock(struct zonelist *zonelist, gfp_t gfp_mask,
> +			  enum oom_constraint *constraint)
> +{
> +	int ret;
> +
> +	*constraint = constrained_alloc(zonelist, gfp_mask);
> +	switch (*constraint) {
> +	case CONSTRAINT_CPUSET:
> +		ret = cpuset_oom_test_and_set_lock();
> +		if (!ret)
> +			spin_trylock(&oom_lock);

Ummm... If we cannot take the cpuset lock then we just casually try the 
oom_lock and do not care about the result?

> +		break;
> +	default:
> +		ret = spin_trylock(&oom_lock);
> +		break;
> +	}

So we take the global lock if we run out of memory under an allocation
restriction using MPOL_BIND?

> +	return ret;
> +}
> +
> +/*
> + * If using cpusets, unlock task's per-cpuset OOM lock; otherwise, unlock the
> + * global OOM spinlock.
> + */
> +void oom_unlock(enum oom_constraint constraint)
> +{
> +	switch (constraint) {
> +	case CONSTRAINT_CPUSET:
> +		if (likely(spin_is_locked(&oom_lock)))
> +			spin_unlock(&oom_lock);

That looks a bit strange too.


* Re: [PATCH 04 of 24] serialize oom killer
  2007-09-14  2:31               ` Christoph Lameter
@ 2007-09-14  3:33                 ` David Rientjes
  2007-09-18 16:44                   ` David Rientjes
  0 siblings, 1 reply; 113+ messages in thread
From: David Rientjes @ 2007-09-14  3:33 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Paul Jackson, Andrea Arcangeli, linux-mm

On Thu, 13 Sep 2007, Christoph Lameter wrote:

> > A new spinlock, oom_lock, is introduced for the global case.  It
> > serializes the OOM killer for systems that are not using cpusets.  Only
> > one system task may enter the OOM killer at a time to prevent
> > unnecessarily killing others.
> 
> That oom_lock seems to be handled strangely. There is already a global 
> cpuset with the per-cpuset locks. If those locks were available in a 
> static structure in the !CPUSET case, then I think we could avoid the
> oom_lock weirdness.
> 

Sure, but such a static structure doesn't exist when CONFIG_CPUSETS isn't 
defined and there's no reason to create one just for the OOM killer.  That 
would require declaring the cpuset pointer in each task_struct even when 
we haven't enabled cpusets.  The OOM killer should be aware of cpuset-
constrained allocations, but not be dependent upon the subsystem.

> > A per-cpuset flag, CS_OOM, is introduced in the flags field of struct
> > cpuset.  It serializes the OOM killer only for hardwall allocations
> > targeted at that cpuset.  Only one task for each cpuset may enter the
> > OOM killer at a time to prevent unnecessarily killing others.  When a
> > per-cpuset OOM killing is taking place, the global spinlock is also
> > locked since we'll be alleviating that condition at the same time.
> 
> Hummm... If the global lock is taken then we can only run one OOM killer 
> at a time, right?
> 

Yes, and that would happen if we didn't compile with CONFIG_CPUSETS or if
constrained_alloc() returned CONSTRAINT_NONE before we called
out_of_memory() because the entire system is OOM.

> > + * If using cpusets, try to lock task's per-cpuset OOM lock; otherwise, try to
> > + * lock the global OOM spinlock.  Returns non-zero if the lock is contended or
> > + * zero if acquired.
> > + */
> > +int oom_test_and_set_lock(struct zonelist *zonelist, gfp_t gfp_mask,
> > +			  enum oom_constraint *constraint)
> > +{
> > +	int ret;
> > +
> > +	*constraint = constrained_alloc(zonelist, gfp_mask);
> > +	switch (*constraint) {
> > +	case CONSTRAINT_CPUSET:
> > +		ret = cpuset_oom_test_and_set_lock();
> > +		if (!ret)
> > +			spin_trylock(&oom_lock);
> 
> Ummm... If we cannot take the cpuset lock then we just casually try the 
> oom_lock and do not care about the result?
> 

We did take the cpuset lock.

We're testing and setting the CS_OOM bit in current->cpuset->flags.  If it 
is 0, meaning we have acquired the lock, we also lock the global lock 
since, by definition, any cpuset-constrained OOM killing will also help 
alleviate a system-wide OOM condition.  If the cpuset lock was contended, 
we don't lock the global lock, the function above returns 1, and we sleep 
when we return to __alloc_pages() before retrying.
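
In other words, the caller in __alloc_pages() ends up looking roughly like
this (a sketch reusing the names from the hunks above, not the literal
patch):

	enum oom_constraint constraint;

	if (oom_test_and_set_lock(zonelist, gfp_mask, &constraint)) {
		/* contended: another task is already OOM killing */
		schedule_timeout_uninterruptible(1);
		goto restart;
	}
	out_of_memory(zonelist, gfp_mask, order);
	oom_unlock(constraint);
	goto restart;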

> > +		break;
> > +	default:
> > +		ret = spin_trylock(&oom_lock);
> > +		break;
> > +	}
> 
> So we take the global lock if we run out of memory in an allocation 
> restricted by MPOL_BIND?
> 

Hmm, looks like we have another opportunity for an improvement here.

We have no way of locking only the nodes in the MPOL_BIND memory policy 
like we do at cpuset granularity.  That would require a spinlock in 
each node, which would work fine if we alter the CONSTRAINT_CPUSET case to 
lock each node in current->cpuset->mems_allowed.  We could do that if we 
add a task_lock(current) before trying oom_test_and_set_lock() in 
__alloc_pages().

There's also no OOM locking at the zone level for GFP_DMA-constrained 
allocations, so perhaps locking should be done at the zone level.

> > +/*
> > + * If using cpusets, unlock task's per-cpuset OOM lock; otherwise, unlock the
> > + * global OOM spinlock.
> > + */
> > +void oom_unlock(enum oom_constraint constraint)
> > +{
> > +	switch (constraint) {
> > +	case CONSTRAINT_CPUSET:
> > +		if (likely(spin_is_locked(&oom_lock)))
> > +			spin_unlock(&oom_lock);
> 
> That looks a bit strange too.
> 

It looks strange and is open to a race, but it does what we want.  We 
take both the per-cpuset lock and the global lock whenever we are in a 
CONSTRAINT_CPUSET scenario, so we need to unlock the global lock here too.  The race 
isn't in this snippet of code because we're protected by the per-cpuset 
lock, but it's in oom_test_and_set_lock() where we lock both:

	CPU #1				CPU #2
	constrained_alloc() ==		constrained_alloc() ==
		CONSTRAINT_CPUSET		CONSTRAINT_NONE
	test_and_set_bit(CS_OOM, ...);	...
	...				spin_trylock(&oom_lock);
	...				out_of_memory();
	spin_trylock(&oom_lock);	...
	out_of_memory();		...
	spin_unlock(&oom_lock);		...

In that case, CPU #2 would not unlock &oom_lock because of the conditional 
you quoted above.

This scenario doesn't look much like serialization but that's completely 
intended.  We went OOM in a cpuset and then we went OOM in the whole 
system, so something independent of the tasks bound to that cpuset caused 
the second OOM.  Killing current for the CONSTRAINT_CPUSET case probably 
won't help that condition since the two occurred independently of each 
other.  What if they didn't?  Then the tasklist scan in out_of_memory() 
will find the PF_EXITING task, because it's a candidate for killing as 
well, and the entire OOM killer will become a no-op for CPU #2.
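
For reference, this is roughly the check in the out_of_memory() tasklist
scan that makes the second invocation a no-op (simplified from the
mm/oom_kill.c of this era, not the literal code):

	do_each_thread(g, p) {
		/*
		 * A task that is exiting or already has access to the
		 * memory reserves will free memory soon, so back off
		 * instead of killing something else.
		 */
		if (test_tsk_thread_flag(p, TIF_MEMDIE) ||
		    (p->flags & PF_EXITING))
			return ERR_PTR(-1UL);
		/* otherwise compute badness() and track the best victim */
	} while_each_thread(g, p);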

		David


* Re: [PATCH 04 of 24] serialize oom killer
  2007-09-14  3:33                 ` David Rientjes
@ 2007-09-18 16:44                   ` David Rientjes
  2007-09-18 16:44                     ` [patch 1/4] oom: move prototypes to appropriate header file David Rientjes
  0 siblings, 1 reply; 113+ messages in thread
From: David Rientjes @ 2007-09-18 16:44 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Andrea Arcangeli, linux-mm

On Thu, 13 Sep 2007, David Rientjes wrote:

> We have no way of locking only the nodes in the MPOL_BIND memory policy 
> like we do at cpuset granularity.  That would require a spinlock in 
> each node, which would work fine if we alter the CONSTRAINT_CPUSET case to 
> lock each node in current->cpuset->mems_allowed.  We could do that if we 
> add a task_lock(current) before trying oom_test_and_set_lock() in 
> __alloc_pages().
> 
> There's also no OOM locking at the zone level for GFP_DMA-constrained 
> allocations, so perhaps locking should be done at the zone level.
> 

There's a way to get around adding a spinlock to struct zone: just save 
the pointers of the zonelists passed to __alloc_pages() when the OOM 
killer is invoked.  Then, on subsequent calls to out_of_memory(), it is 
possible to check the new zonelist against the saved ones and detect any 
zone that already has a failed allocation sitting in the OOM killer.  
Hopefully the OOM killer will kill a memory-hogging task, which the 
heuristics are pretty good at, and free up some space in those zones.  
Thus, we should indeed be serializing at the zone level instead of the 
node or cpuset level.

		David


* [patch 1/4] oom: move prototypes to appropriate header file
  2007-09-18 16:44                   ` David Rientjes
@ 2007-09-18 16:44                     ` David Rientjes
  2007-09-18 16:44                       ` [patch 2/4] oom: move constraints to enum David Rientjes
  0 siblings, 1 reply; 113+ messages in thread
From: David Rientjes @ 2007-09-18 16:44 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Andrea Arcangeli, Christoph Lameter, linux-mm

Move the OOM killer's extern function prototypes to include/linux/oom.h
and include it where necessary.

Cc: Andrea Arcangeli <andrea@suse.de>
Cc: Christoph Lameter <clameter@sgi.com>
Signed-off-by: David Rientjes <rientjes@google.com>
---
 drivers/char/sysrq.c |    1 +
 include/linux/oom.h  |   11 ++++++++++-
 include/linux/swap.h |    5 -----
 mm/page_alloc.c      |    1 +
 4 files changed, 12 insertions(+), 6 deletions(-)

diff --git a/drivers/char/sysrq.c b/drivers/char/sysrq.c
--- a/drivers/char/sysrq.c
+++ b/drivers/char/sysrq.c
@@ -36,6 +36,7 @@
 #include <linux/kexec.h>
 #include <linux/irq.h>
 #include <linux/hrtimer.h>
+#include <linux/oom.h>
 
 #include <asm/ptrace.h>
 #include <asm/irq_regs.h>
diff --git a/include/linux/oom.h b/include/linux/oom.h
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -1,10 +1,19 @@
 #ifndef __INCLUDE_LINUX_OOM_H
 #define __INCLUDE_LINUX_OOM_H
 
+#include <linux/sched.h>
+
 /* /proc/<pid>/oom_adj set to -17 protects from the oom-killer */
 #define OOM_DISABLE (-17)
 /* inclusive */
 #define OOM_ADJUST_MIN (-16)
 #define OOM_ADJUST_MAX 15
 
-#endif
+#ifdef __KERNEL__
+
+extern void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, int order);
+extern int register_oom_notifier(struct notifier_block *nb);
+extern int unregister_oom_notifier(struct notifier_block *nb);
+
+#endif /* __KERNEL__ */
+#endif /* __INCLUDE_LINUX_OOM_H */
diff --git a/include/linux/swap.h b/include/linux/swap.h
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -158,11 +158,6 @@ struct swap_list_t {
 /* Swap 50% full? Release swapcache more aggressively.. */
 #define vm_swap_full() (nr_swap_pages*2 < total_swap_pages)
 
-/* linux/mm/oom_kill.c */
-extern void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, int order);
-extern int register_oom_notifier(struct notifier_block *nb);
-extern int unregister_oom_notifier(struct notifier_block *nb);
-
 /* linux/mm/memory.c */
 extern void swapin_readahead(swp_entry_t, unsigned long, struct vm_area_struct *);
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -41,6 +41,7 @@
 #include <linux/pfn.h>
 #include <linux/backing-dev.h>
 #include <linux/fault-inject.h>
+#include <linux/oom.h>
 
 #include <asm/tlbflush.h>
 #include <asm/div64.h>


* [patch 2/4] oom: move constraints to enum
  2007-09-18 16:44                     ` [patch 1/4] oom: move prototypes to appropriate header file David Rientjes
@ 2007-09-18 16:44                       ` David Rientjes
  2007-09-18 16:44                         ` [patch 3/4] oom: save zonelist pointer for oom killer calls David Rientjes
  2007-09-18 19:55                         ` [patch 2/4] oom: move constraints to enum Christoph Lameter
  0 siblings, 2 replies; 113+ messages in thread
From: David Rientjes @ 2007-09-18 16:44 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Andrea Arcangeli, Christoph Lameter, linux-mm

The OOM killer's CONSTRAINT definitions are really more appropriate in an
enum, so define them in include/linux/oom.h.

Cc: Andrea Arcangeli <andrea@suse.de>
Cc: Christoph Lameter <clameter@sgi.com>
Signed-off-by: David Rientjes <rientjes@google.com>
---
 include/linux/oom.h |    9 +++++++++
 mm/oom_kill.c       |   12 +++---------
 2 files changed, 12 insertions(+), 9 deletions(-)

diff --git a/include/linux/oom.h b/include/linux/oom.h
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -11,6 +11,15 @@
 
 #ifdef __KERNEL__
 
+/*
+ * Types of limitations to the nodes from which allocations may occur
+ */
+enum oom_constraint {
+	CONSTRAINT_NONE,
+	CONSTRAINT_CPUSET,
+	CONSTRAINT_MEMORY_POLICY,
+};
+
 extern void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, int order);
 extern int register_oom_notifier(struct notifier_block *nb);
 extern int unregister_oom_notifier(struct notifier_block *nb);
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -164,16 +164,10 @@ unsigned long badness(struct task_struct *p, unsigned long uptime)
 }
 
 /*
- * Types of limitations to the nodes from which allocations may occur
- */
-#define CONSTRAINT_NONE 1
-#define CONSTRAINT_MEMORY_POLICY 2
-#define CONSTRAINT_CPUSET 3
-
-/*
  * Determine the type of allocation constraint.
  */
-static inline int constrained_alloc(struct zonelist *zonelist, gfp_t gfp_mask)
+static inline enum oom_constraint constrained_alloc(struct zonelist *zonelist,
+						    gfp_t gfp_mask)
 {
 #ifdef CONFIG_NUMA
 	struct zone **z;
@@ -400,7 +394,7 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, int order)
 	struct task_struct *p;
 	unsigned long points = 0;
 	unsigned long freed = 0;
-	int constraint;
+	enum oom_constraint constraint;
 
 	blocking_notifier_call_chain(&oom_notify_list, 0, &freed);
 	if (freed > 0)


* [patch 3/4] oom: save zonelist pointer for oom killer calls
  2007-09-18 16:44                       ` [patch 2/4] oom: move constraints to enum David Rientjes
@ 2007-09-18 16:44                         ` David Rientjes
  2007-09-18 16:44                           ` [patch 4/4] oom: serialize out of memory calls David Rientjes
  2007-09-18 19:57                           ` [patch 3/4] oom: save zonelist pointer for oom killer calls Christoph Lameter
  2007-09-18 19:55                         ` [patch 2/4] oom: move constraints to enum Christoph Lameter
  1 sibling, 2 replies; 113+ messages in thread
From: David Rientjes @ 2007-09-18 16:44 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Andrea Arcangeli, Christoph Lameter, linux-mm

OOM killer synchronization should be done with zone granularity so that
memory policy and cpuset allocations may have their corresponding zones
locked and allow parallel kills for other OOM conditions that may exist
elsewhere in the system.  DMA allocations can be targeted at the zone
level, which would not be possible if locking were done at the node level
or globally.

A pointer to the OOM-triggering zonelist is saved in a linked list.  Any
time there is an OOM condition, all zones in the zonelist are checked
against the zonelists stored in the OOM killer lists.  If the OOM killer
has already been called for an allocation that includes one of these
zones, the "trylock" fails and returns non-zero.

Cc: Andrea Arcangeli <andrea@suse.de>
Cc: Christoph Lameter <clameter@sgi.com>
Signed-off-by: David Rientjes <rientjes@google.com>
---
 include/linux/oom.h |    3 ++
 mm/oom_kill.c       |   69 +++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 72 insertions(+), 0 deletions(-)

diff --git a/include/linux/oom.h b/include/linux/oom.h
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -20,6 +20,9 @@ enum oom_constraint {
 	CONSTRAINT_MEMORY_POLICY,
 };
 
+extern int oom_killer_trylock(struct zonelist *zonelist);
+extern void oom_killer_unlock(const struct zonelist *zonelist);
+
 extern void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, int order);
 extern int register_oom_notifier(struct notifier_block *nb);
 extern int unregister_oom_notifier(struct notifier_block *nb);
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -26,6 +26,13 @@
 #include <linux/module.h>
 #include <linux/notifier.h>
 
+struct oom_zonelist {
+	struct zonelist *zonelist;
+	struct list_head list;
+};
+static LIST_HEAD(zonelists);
+static DEFINE_MUTEX(oom_zonelist_mutex);
+
 int sysctl_panic_on_oom;
 /* #define DEBUG */
 
@@ -381,6 +388,68 @@ int unregister_oom_notifier(struct notifier_block *nb)
 }
 EXPORT_SYMBOL_GPL(unregister_oom_notifier);
 
+/*
+ * Call with oom_zonelist_mutex held.
+ */
+static int is_zone_locked(const struct zone *zone)
+{
+	struct oom_zonelist *oom_zl;
+	int i;
+
+	list_for_each_entry(oom_zl, &zonelists, list)	
+		for (i = 0; oom_zl->zonelist->zones[i]; i++)
+			if (zone == oom_zl->zonelist->zones[i])
+				return 1;
+	return 0;
+}
+
+/*
+ * Try to acquire the OOM killer lock for the zones in zonelist.  Returns
+ * non-zero if a parallel OOM killing is already taking place that includes a
+ * zone in the zonelist.
+ */
+int oom_killer_trylock(struct zonelist *zonelist)
+{
+	struct oom_zonelist *oom_zl;
+	int ret = 0;
+	int i;
+
+	mutex_lock(&oom_zonelist_mutex);
+	for (i = 0; zonelist->zones[i]; i++)
+		if (is_zone_locked(zonelist->zones[i])) {
+			ret = 1;
+			goto out;
+		}
+
+	oom_zl = kzalloc(sizeof(*oom_zl), GFP_KERNEL);
+	if (!oom_zl)
+		goto out;
+
+	oom_zl->zonelist = zonelist;
+	list_add(&oom_zl->list, &zonelists);
+out:
+	mutex_unlock(&oom_zonelist_mutex);
+	return ret;
+}
+
+/*
+ * Removes the zonelist from the list so that future allocations that include
+ * its zones can successfully call the OOM killer.
+ */
+void oom_killer_unlock(const struct zonelist *zonelist)
+{
+	struct oom_zonelist *oom_zl;
+
+	mutex_lock(&oom_zonelist_mutex);
+	list_for_each_entry(oom_zl, &zonelists, list)
+		if (zonelist == oom_zl->zonelist) {
+			list_del(&oom_zl->list);
+			break;
+		}
+	mutex_unlock(&oom_zonelist_mutex);
+	kfree(oom_zl);
+}
+
 /**
  * out_of_memory - kill the "best" process when we run out of memory
  *


* [patch 4/4] oom: serialize out of memory calls
  2007-09-18 16:44                         ` [patch 3/4] oom: save zonelist pointer for oom killer calls David Rientjes
@ 2007-09-18 16:44                           ` David Rientjes
  2007-09-18 19:54                             ` Christoph Lameter
  2007-09-18 19:57                           ` [patch 3/4] oom: save zonelist pointer for oom killer calls Christoph Lameter
  1 sibling, 1 reply; 113+ messages in thread
From: David Rientjes @ 2007-09-18 16:44 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Andrea Arcangeli, Christoph Lameter, linux-mm

Before invoking the OOM killer, a final allocation attempt with a very
high watermark is made.  Serialization needs to occur before this attempt
because a parallel OOM kill may have freed memory, in which case the
allocation can succeed right after acquiring the lock.  If the lock is
contended, the task is put to sleep and the allocation attempt is retried
when rescheduled.

Cc: Andrea Arcangeli <andrea@suse.de>
Cc: Christoph Lameter <clameter@sgi.com>
Signed-off-by: David Rientjes <rientjes@google.com>
---
 mm/page_alloc.c |   14 ++++++++++++--
 1 files changed, 12 insertions(+), 2 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1353,6 +1353,11 @@ nofail_alloc:
 		if (page)
 			goto got_pg;
 	} else if ((gfp_mask & __GFP_FS) && !(gfp_mask & __GFP_NORETRY)) {
+		if (oom_killer_trylock(zonelist)) {
+			schedule_timeout_uninterruptible(1);
+			goto restart;
+		}
+
 		/*
 		 * Go through the zonelist yet one more time, keep
 		 * very high watermark here, this is only to catch
@@ -1361,14 +1366,19 @@ nofail_alloc:
 		 */
 		page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, order,
 				zonelist, ALLOC_WMARK_HIGH|ALLOC_CPUSET);
-		if (page)
+		if (page) {
+			oom_killer_unlock(zonelist);
 			goto got_pg;
+		}
 
 		/* The OOM killer will not help higher order allocs so fail */
-		if (order > PAGE_ALLOC_COSTLY_ORDER)
+		if (order > PAGE_ALLOC_COSTLY_ORDER) {
+			oom_killer_unlock(zonelist);
 			goto nopage;
+		}
 
 		out_of_memory(zonelist, gfp_mask, order);
+		oom_killer_unlock(zonelist);
 		goto restart;
 	}
 


* Re: [patch 4/4] oom: serialize out of memory calls
  2007-09-18 16:44                           ` [patch 4/4] oom: serialize out of memory calls David Rientjes
@ 2007-09-18 19:54                             ` Christoph Lameter
  2007-09-18 19:56                               ` David Rientjes
  0 siblings, 1 reply; 113+ messages in thread
From: Christoph Lameter @ 2007-09-18 19:54 UTC (permalink / raw)
  To: David Rientjes; +Cc: Andrew Morton, Andrea Arcangeli, linux-mm

On Tue, 18 Sep 2007, David Rientjes wrote:

>  			goto got_pg;
>  	} else if ((gfp_mask & __GFP_FS) && !(gfp_mask & __GFP_NORETRY)) {
> +		if (oom_killer_trylock(zonelist)) {

Condition reversed? We want to restart if the lock is taken, right?


* Re: [patch 2/4] oom: move constraints to enum
  2007-09-18 16:44                       ` [patch 2/4] oom: move constraints to enum David Rientjes
  2007-09-18 16:44                         ` [patch 3/4] oom: save zonelist pointer for oom killer calls David Rientjes
@ 2007-09-18 19:55                         ` Christoph Lameter
  1 sibling, 0 replies; 113+ messages in thread
From: Christoph Lameter @ 2007-09-18 19:55 UTC (permalink / raw)
  To: David Rientjes; +Cc: Andrew Morton, Andrea Arcangeli, linux-mm

On Tue, 18 Sep 2007, David Rientjes wrote:

> The OOM killer's CONSTRAINT definitions are really more appropriate in an
> enum, so define them in include/linux/oom.h.

Acked-by: Christoph Lameter <clameter@sgi.com>


* Re: [patch 4/4] oom: serialize out of memory calls
  2007-09-18 19:54                             ` Christoph Lameter
@ 2007-09-18 19:56                               ` David Rientjes
  2007-09-18 20:01                                 ` Christoph Lameter
  0 siblings, 1 reply; 113+ messages in thread
From: David Rientjes @ 2007-09-18 19:56 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Andrew Morton, Andrea Arcangeli, linux-mm

On Tue, 18 Sep 2007, Christoph Lameter wrote:

> On Tue, 18 Sep 2007, David Rientjes wrote:
> 
> >  			goto got_pg;
> >  	} else if ((gfp_mask & __GFP_FS) && !(gfp_mask & __GFP_NORETRY)) {
> > +		if (oom_killer_trylock(zonelist)) {
> 
> Condition reversed? We want to restart if the lock is taken, right?
> 

All trylocks return non-zero if they are contended, so the conditional is 
correct as written.


* Re: [patch 3/4] oom: save zonelist pointer for oom killer calls
  2007-09-18 16:44                         ` [patch 3/4] oom: save zonelist pointer for oom killer calls David Rientjes
  2007-09-18 16:44                           ` [patch 4/4] oom: serialize out of memory calls David Rientjes
@ 2007-09-18 19:57                           ` Christoph Lameter
  2007-09-18 20:13                             ` David Rientjes
  1 sibling, 1 reply; 113+ messages in thread
From: Christoph Lameter @ 2007-09-18 19:57 UTC (permalink / raw)
  To: David Rientjes; +Cc: Andrew Morton, Andrea Arcangeli, linux-mm

On Tue, 18 Sep 2007, David Rientjes wrote:

> +
> +	oom_zl = kzalloc(sizeof(*oom_zl), GFP_KERNEL);
> +	if (!oom_zl)
> +		goto out;

An allocation in the oom killer? This could in turn trigger more 
problems. Maybe its best to put a list head into the zone?


* Re: [patch 4/4] oom: serialize out of memory calls
  2007-09-18 19:56                               ` David Rientjes
@ 2007-09-18 20:01                                 ` Christoph Lameter
  2007-09-18 20:06                                   ` David Rientjes
  0 siblings, 1 reply; 113+ messages in thread
From: Christoph Lameter @ 2007-09-18 20:01 UTC (permalink / raw)
  To: David Rientjes; +Cc: Andrew Morton, Andrea Arcangeli, linux-mm

On Tue, 18 Sep 2007, David Rientjes wrote:

> On Tue, 18 Sep 2007, Christoph Lameter wrote:
> 
> > On Tue, 18 Sep 2007, David Rientjes wrote:
> > 
> > >  			goto got_pg;
> > >  	} else if ((gfp_mask & __GFP_FS) && !(gfp_mask & __GFP_NORETRY)) {
> > > +		if (oom_killer_trylock(zonelist)) {
> > 
> > Condition reversed? We want to restart if the lock is taken, right?
> > 
> 
> All trylocks return non-zero if they are contended, so the conditional is 
> correct as written.

trylocks return 1 (true) if the lock was acquired, 0 (false) if not.

F.e.

#define __raw_spin_trylock(x)	(cmpxchg_acq(&(x)->lock, 0, 1) == 0)
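
So, per that convention, a minimal caller looks like this (do_something()
is just a placeholder):

	if (spin_trylock(&lock)) {
		/* returned non-zero: we now own the lock */
		do_something();
		spin_unlock(&lock);
	} else {
		/* returned zero: the lock is contended, back off */
	}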


* Re: [patch 4/4] oom: serialize out of memory calls
  2007-09-18 20:01                                 ` Christoph Lameter
@ 2007-09-18 20:06                                   ` David Rientjes
  2007-09-18 20:23                                     ` [patch 5/4] oom: rename serialization helper functions David Rientjes
  0 siblings, 1 reply; 113+ messages in thread
From: David Rientjes @ 2007-09-18 20:06 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Andrew Morton, Andrea Arcangeli, linux-mm

On Tue, 18 Sep 2007, Christoph Lameter wrote:

> trylocks return 1 (true) if the lock was acquired, 0 (false) if not.
> 
> F.e.
> 
> #define __raw_spin_trylock(x)	(cmpxchg_acq(&(x)->lock, 0, 1) == 0)
> 

Yes, but this would require a change in oom_killer_trylock() since it is 
coded to return non-zero if the OOM killer has already been invoked for at 
least one of the zones.  The use of "trylock" here is being abused anyway 
since there are actually no locks involved, so maybe the function pair 
should simply be renamed to zone_in_oom() and zonelist_clear_oom().  I'll 
make the change, thanks for keeping it consistent.


* Re: [patch 3/4] oom: save zonelist pointer for oom killer calls
  2007-09-18 19:57                           ` [patch 3/4] oom: save zonelist pointer for oom killer calls Christoph Lameter
@ 2007-09-18 20:13                             ` David Rientjes
  2007-09-18 20:16                               ` Christoph Lameter
  0 siblings, 1 reply; 113+ messages in thread
From: David Rientjes @ 2007-09-18 20:13 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Andrew Morton, Andrea Arcangeli, linux-mm

On Tue, 18 Sep 2007, Christoph Lameter wrote:

> On Tue, 18 Sep 2007, David Rientjes wrote:
> 
> > +
> > +	oom_zl = kzalloc(sizeof(*oom_zl), GFP_KERNEL);
> > +	if (!oom_zl)
> > +		goto out;
> 
> An allocation in the oom killer? This could in turn trigger more 
> problems. Maybe its best to put a list head into the zone?
> 

I thought about doing that as well as statically allocating

	#define MAX_OOM_THREADS		4
	static struct zonelist *zonelists[MAX_OOM_THREADS];

and using semaphores.  But in my testing of this patchset and experience 
in working with the watermarks used in __alloc_pages(), we should never 
actually encounter a condition where we can't find
sizeof(struct oom_zonelist) of memory.  The amount needed is on the order 
of how many concurrent invocations of the OOM killer you have, and I don't 
think you'll have many with completely disjoint sets of zones in their 
zonelists.  Watermarks usually do the trick (and are the only reason 
TIF_MEMDIE works, by the way).
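
The reserve path referred to here looks roughly like this in
__alloc_pages() (simplified, not the literal code), which is why an
OOM-killed task can still allocate the little it needs to exit:

	/* PF_MEMALLOC and TIF_MEMDIE tasks may dip below all watermarks */
	if (((p->flags & PF_MEMALLOC) ||
	     unlikely(test_thread_flag(TIF_MEMDIE))) && !in_interrupt()) {
		page = get_page_from_freelist(gfp_mask, order, zonelist,
						ALLOC_NO_WATERMARKS);
		if (page)
			goto got_pg;
	}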

I'm not sure how embedding a list_head in struct zone would work even 
though we're adding the premise that a single zone can only be in the OOM 
killer once.  You'd have to recreate the zonelist by stringing together 
these heads in the zone but the whole concept relies upon finding a 
pointer to an already existing struct zonelist.  It works nicely as is 
because the struct zonelist is persistent in __alloc_pages() so it is easy 
to pass it to both zone_in_oom() and zonelist_clear_oom().


* Re: [patch 3/4] oom: save zonelist pointer for oom killer calls
  2007-09-18 20:13                             ` David Rientjes
@ 2007-09-18 20:16                               ` Christoph Lameter
  2007-09-18 20:47                                 ` [patch 6/4] oom: pass null to kfree if zonelist is not cleared David Rientjes
  0 siblings, 1 reply; 113+ messages in thread
From: Christoph Lameter @ 2007-09-18 20:16 UTC (permalink / raw)
  To: David Rientjes; +Cc: Andrew Morton, Andrea Arcangeli, linux-mm

On Tue, 18 Sep 2007, David Rientjes wrote:

> On Tue, 18 Sep 2007, Christoph Lameter wrote:
> 
> > On Tue, 18 Sep 2007, David Rientjes wrote:
> > 
> > > +
> > > +	oom_zl = kzalloc(sizeof(*oom_zl), GFP_KERNEL);
> > > +	if (!oom_zl)
> > > +		goto out;
> > 
> > An allocation in the oom killer? This could in turn trigger more 
> > problems. Maybe its best to put a list head into the zone?
> > 
> 
> I thought about doing that as well as statically allocating
> 
> 	#define MAX_OOM_THREADS		4
> 	static struct zonelist *zonelists[MAX_OOM_THREADS];
> 
> and using semaphores.  But in my testing of this patchset and experience 
> in working with the watermarks used in __alloc_pages(), we should never 
> actually encounter a condition where we can't find
> sizeof(struct oom_zonelist) of memory.  That's on the order of how many 
> invocations of the OOM killer you have, but I don't actually think you'll 
> have many that have a completely exclusive set of zones in the zonelist.  
> Watermarks usually do the trick (and is the only reason TIF_MEMDIE works, 
> by the way).

You are playing with fire here. The slab queues *may* have enough memory 
to satisfy that request, but if not, then we may recursively call into the 
page allocator to get a page or pages. Sounds dangerous to me.
 
> I'm not sure how embedding a list_head in struct zone would work even 
> though we're adding the premise that a single zone can only be in the OOM 
> killer once.  You'd have to recreate the zonelist by stringing together 
> these heads in the zone but the whole concept relies upon finding a 
> pointer to an already existing struct zonelist.  It works nicely as is 
> because the struct zonelist is persistent in __alloc_pages() so it is easy 
> to pass it to both zone_in_oom() and zonelist_clear_oom().

Then add a flag?  Andrew and I talked about switching the all_unreclaimable 
field to a bitmask.  Do the conversion there and then we can add an 
OOM-kill-active state to each zone.
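
Something like the following sketch, assuming struct zone grows an
unsigned long flags field (all names here are illustrative):

	enum zone_flags {
		ZONE_ALL_UNRECLAIMABLE,	/* replaces all_unreclaimable */
		ZONE_OOM_LOCKED,	/* an OOM kill covers this zone */
	};

	static inline int zone_try_set_oom(struct zone *zone)
	{
		/* trylock semantics: non-zero means we set the bit */
		return !test_and_set_bit(ZONE_OOM_LOCKED, &zone->flags);
	}

	static inline void zone_clear_oom(struct zone *zone)
	{
		clear_bit(ZONE_OOM_LOCKED, &zone->flags);
	}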


* [patch 5/4] oom: rename serialization helper functions
  2007-09-18 20:06                                   ` David Rientjes
@ 2007-09-18 20:23                                     ` David Rientjes
  2007-09-18 20:26                                       ` Christoph Lameter
  0 siblings, 1 reply; 113+ messages in thread
From: David Rientjes @ 2007-09-18 20:23 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Andrea Arcangeli, Christoph Lameter, linux-mm

On Tue, 18 Sep 2007, David Rientjes wrote:

> Yes, but this would require a change in oom_killer_trylock() since it is 
> coded to return non-zero if the OOM killer has already been invoked for at 
> least one of the zones.  The use of "trylock" here is being abused anyway 
> since there are actually no locks involved, so maybe the function pair 
> should simply be renamed to zone_in_oom() and zonelist_clear_oom().  I'll 
> make the change, thanks for keeping it consistent.
> 

oom: rename serialization helper functions

Rename oom_killer_trylock() and oom_killer_unlock() to zone_in_oom() and
zonelist_clear_oom(), respectively.

Cc: Andrea Arcangeli <andrea@suse.de>
Cc: Christoph Lameter <clameter@sgi.com>
Signed-off-by: David Rientjes <rientjes@google.com>
---
 include/linux/oom.h |    4 ++--
 mm/oom_kill.c       |    4 ++--
 mm/page_alloc.c     |    8 ++++----
 3 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/include/linux/oom.h b/include/linux/oom.h
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -20,8 +20,8 @@ enum oom_constraint {
 	CONSTRAINT_MEMORY_POLICY,
 };
 
-extern int oom_killer_trylock(struct zonelist *zonelist);
-extern void oom_killer_unlock(const struct zonelist *zonelist);
+extern int zone_in_oom(struct zonelist *zonelist);
+extern void zonelist_clear_oom(const struct zonelist *zonelist);
 
 extern void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, int order);
 extern int register_oom_notifier(struct notifier_block *nb);
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -408,7 +408,7 @@ static int is_zone_locked(const struct zone *zone)
  * non-zero if a parallel OOM killing is already taking place that includes a
  * zone in the zonelist.
  */
-int oom_killer_trylock(struct zonelist *zonelist)
+int zone_in_oom(struct zonelist *zonelist)
 {
 	struct oom_zonelist *oom_zl;
 	int ret = 0;
@@ -436,7 +436,7 @@ out:
  * Removes the zonelist from the list so that future allocations that include
  * its zones can successfully call the OOM killer.
  */
-void oom_killer_unlock(const struct zonelist *zonelist)
+void zonelist_clear_oom(const struct zonelist *zonelist)
 {
 	struct oom_zonelist *oom_zl;
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1353,7 +1353,7 @@ nofail_alloc:
 		if (page)
 			goto got_pg;
 	} else if ((gfp_mask & __GFP_FS) && !(gfp_mask & __GFP_NORETRY)) {
-		if (oom_killer_trylock(zonelist)) {
+		if (zone_in_oom(zonelist)) {
 			schedule_timeout_uninterruptible(1);
 			goto restart;
 		}
@@ -1367,18 +1367,18 @@ nofail_alloc:
 		page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, order,
 				zonelist, ALLOC_WMARK_HIGH|ALLOC_CPUSET);
 		if (page) {
-			oom_killer_unlock(zonelist);
+			zonelist_clear_oom(zonelist);
 			goto got_pg;
 		}
 
 		/* The OOM killer will not help higher order allocs so fail */
 		if (order > PAGE_ALLOC_COSTLY_ORDER) {
-			oom_killer_unlock(zonelist);
+			zonelist_clear_oom(zonelist);
 			goto nopage;
 		}
 
 		out_of_memory(zonelist, gfp_mask, order);
-		oom_killer_unlock(zonelist);
+		zonelist_clear_oom(zonelist);
 		goto restart;
 	}
 


* Re: [patch 5/4] oom: rename serialization helper functions
  2007-09-18 20:23                                     ` [patch 5/4] oom: rename serialization helper functions David Rientjes
@ 2007-09-18 20:26                                       ` Christoph Lameter
  2007-09-18 20:39                                         ` [patch 5/4 v2] " David Rientjes
  0 siblings, 1 reply; 113+ messages in thread
From: Christoph Lameter @ 2007-09-18 20:26 UTC (permalink / raw)
  To: David Rientjes; +Cc: Andrew Morton, Andrea Arcangeli, linux-mm

On Tue, 18 Sep 2007, David Rientjes wrote:

> -		if (oom_killer_trylock(zonelist)) {
> +		if (zone_in_oom(zonelist)) {

The name is confusing.  It looks like we are just checking a bit, whereas 
we actually attempt to set the zone to OOM.  How about try_set_zone_oom() 
with the correct trylock semantics?


* [patch 5/4 v2] oom: rename serialization helper functions
  2007-09-18 20:26                                       ` Christoph Lameter
@ 2007-09-18 20:39                                         ` David Rientjes
  2007-09-18 20:59                                           ` Christoph Lameter
  0 siblings, 1 reply; 113+ messages in thread
From: David Rientjes @ 2007-09-18 20:39 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Andrea Arcangeli, Christoph Lameter, linux-mm

On Tue, 18 Sep 2007, Christoph Lameter wrote:

> On Tue, 18 Sep 2007, David Rientjes wrote:
> 
> > -		if (oom_killer_trylock(zonelist)) {
> > +		if (zone_in_oom(zonelist)) {
> 
> The name is confusing.  It looks like we are just checking a bit, whereas 
> we actually attempt to set the zone to OOM.  How about try_set_zone_oom() 
> with the correct trylock semantics?
> 

oom: rename serialization helper functions

Rename oom_killer_trylock() and oom_killer_unlock() to try_set_zone_oom()
and clear_zonelist_oom(), respectively.  Reverses the logic of
try_set_zone_oom() so that it returns zero if the zone is already found
in the OOM killer, similar to trylock semantics.

Cc: Andrea Arcangeli <andrea@suse.de>
Cc: Christoph Lameter <clameter@sgi.com>
Signed-off-by: David Rientjes <rientjes@google.com>
---
 Replaces original oom-rename-serialization-helper-functions.patch.

 include/linux/oom.h |    4 ++--
 mm/oom_kill.c       |   14 +++++++-------
 mm/page_alloc.c     |    8 ++++----
 3 files changed, 13 insertions(+), 13 deletions(-)

diff --git a/include/linux/oom.h b/include/linux/oom.h
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -20,8 +20,8 @@ enum oom_constraint {
 	CONSTRAINT_MEMORY_POLICY,
 };
 
-extern int oom_killer_trylock(struct zonelist *zonelist);
-extern void oom_killer_unlock(const struct zonelist *zonelist);
+extern int try_set_zone_oom(struct zonelist *zonelist);
+extern void clear_zonelist_oom(const struct zonelist *zonelist);
 
 extern void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, int order);
 extern int register_oom_notifier(struct notifier_block *nb);
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -404,20 +404,20 @@ static int is_zone_locked(const struct zone *zone)
 }
 
 /*
- * Try to acquire the OOM killer lock for the zones in zonelist.  Returns
- * non-zero if a parallel OOM killing is already taking place that includes a
- * zone in the zonelist.
+ * Try to acquire the OOM killer lock for the zones in zonelist.  Returns zero
+ * if a parallel OOM killing is already taking place that includes a zone in
+ * the zonelist.
  */
-int oom_killer_trylock(struct zonelist *zonelist)
+int try_set_zone_oom(struct zonelist *zonelist)
 {
 	struct oom_zonelist *oom_zl;
-	int ret = 0;
+	int ret = 1;
 	int i;
 
 	mutex_lock(&oom_zonelist_mutex);
 	for (i = 0; zonelist->zones[i]; i++)
 		if (is_zone_locked(zonelist->zones[i])) {
-			ret = 1;
+			ret = 0;
 			goto out;
 		}
 
@@ -436,7 +436,7 @@ out:
  * Removes the zonelist from the list so that future allocations that include
  * its zones can successfully call the OOM killer.
  */
-void oom_killer_unlock(const struct zonelist *zonelist)
+void clear_zonelist_oom(const struct zonelist *zonelist)
 {
 	struct oom_zonelist *oom_zl;
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1353,7 +1353,7 @@ nofail_alloc:
 		if (page)
 			goto got_pg;
 	} else if ((gfp_mask & __GFP_FS) && !(gfp_mask & __GFP_NORETRY)) {
-		if (oom_killer_trylock(zonelist)) {
+		if (!try_set_zone_oom(zonelist)) {
 			schedule_timeout_uninterruptible(1);
 			goto restart;
 		}
@@ -1367,18 +1367,18 @@ nofail_alloc:
 		page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, order,
 				zonelist, ALLOC_WMARK_HIGH|ALLOC_CPUSET);
 		if (page) {
-			oom_killer_unlock(zonelist);
+			clear_zonelist_oom(zonelist);
 			goto got_pg;
 		}
 
 		/* The OOM killer will not help higher order allocs so fail */
 		if (order > PAGE_ALLOC_COSTLY_ORDER) {
-			oom_killer_unlock(zonelist);
+			clear_zonelist_oom(zonelist);
 			goto nopage;
 		}
 
 		out_of_memory(zonelist, gfp_mask, order);
-		oom_killer_unlock(zonelist);
+		clear_zonelist_oom(zonelist);
 		goto restart;
 	}
 


* [patch 6/4] oom: pass null to kfree if zonelist is not cleared
  2007-09-18 20:16                               ` Christoph Lameter
@ 2007-09-18 20:47                                 ` David Rientjes
  2007-09-18 21:01                                   ` Christoph Lameter
  0 siblings, 1 reply; 113+ messages in thread
From: David Rientjes @ 2007-09-18 20:47 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Andrea Arcangeli, Christoph Lameter, linux-mm

On Tue, 18 Sep 2007, Christoph Lameter wrote:

> > I thought about doing that as well as statically allocating
> > 
> > 	#define MAX_OOM_THREADS		4
> > 	static struct zonelist *zonelists[MAX_OOM_THREADS];
> > 
> > and using semaphores.  But in my testing of this patchset and experience 
> > in working with the watermarks used in __alloc_pages(), we should never 
> > actually encounter a condition where we can't find
> > sizeof(struct oom_zonelist) of memory.  The amount needed is on the order 
> > of how many concurrent invocations of the OOM killer you have, and I don't 
> > think you'll have many with completely disjoint sets of zones in their 
> > zonelists.  Watermarks usually do the trick (and are the only reason 
> > TIF_MEMDIE works, by the way).
> 
> You are playing with fire here. The slab queues *may* have enough memory 
> to satisfy that request, but if not, then we may recursively call into the 
> page allocator to get a page or pages. Sounds dangerous to me.
>  

Wrong.  Notice what the newly-named try_set_zone_oom() function returns if 
the kzalloc() fails; this was a specific design decision.  It returns 1, 
so the conditional in __alloc_pages() fails and the OOM killer progresses 
as normal.

Thanks for reminding me about that, though, because the following will be 
needed if that indeed happens.



oom: pass null to kfree if zonelist is not cleared

If a zonelist pointer cannot be found in the linked list, kfree() must be
called with NULL instead.  Since list_for_each_entry() leaves its cursor
pointing at the head's container (not NULL) after a full traversal, track
the match in a separate pointer that starts out NULL.

Cc: Andrea Arcangeli <andrea@suse.de>
Cc: Christoph Lameter <clameter@sgi.com>
Signed-off-by: David Rientjes <rientjes@google.com>
---
 mm/oom_kill.c |    6 ++++--
 1 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -438,14 +438,16 @@ out:
  */
 void clear_zonelist_oom(const struct zonelist *zonelist)
 {
 	struct oom_zonelist *oom_zl;
+	struct oom_zonelist *found = NULL;
 
 	mutex_lock(&oom_zonelist_mutex);
 	list_for_each_entry(oom_zl, &zonelists, list)
 		if (zonelist == oom_zl->zonelist) {
-			list_del(&oom_zl->list);
+			found = oom_zl;
+			list_del(&found->list);
 			break;
 		}
 	mutex_unlock(&oom_zonelist_mutex);
-	kfree(oom_zl);
+	kfree(found);
 }


* Re: [patch 5/4 v2] oom: rename serialization helper functions
  2007-09-18 20:39                                         ` [patch 5/4 v2] " David Rientjes
@ 2007-09-18 20:59                                           ` Christoph Lameter
  0 siblings, 0 replies; 113+ messages in thread
From: Christoph Lameter @ 2007-09-18 20:59 UTC (permalink / raw)
  To: David Rientjes; +Cc: Andrew Morton, Andrea Arcangeli, linux-mm

On Tue, 18 Sep 2007, David Rientjes wrote:

> oom: rename serialization helper functions

Acked-by: Christoph Lameter <clameter@sgi.com>


* Re: [patch 6/4] oom: pass null to kfree if zonelist is not cleared
  2007-09-18 20:47                                 ` [patch 6/4] oom: pass null to kfree if zonelist is not cleared David Rientjes
@ 2007-09-18 21:01                                   ` Christoph Lameter
  2007-09-18 21:13                                     ` David Rientjes
  0 siblings, 1 reply; 113+ messages in thread
From: Christoph Lameter @ 2007-09-18 21:01 UTC (permalink / raw)
  To: David Rientjes; +Cc: Andrew Morton, Andrea Arcangeli, linux-mm

On Tue, 18 Sep 2007, David Rientjes wrote:

> Wrong.  Notice what the newly-named try_set_zone_oom() function returns if 
> the kzalloc() fails; this was a specific design decision.  It returns 1, 
> so the conditional in __alloc_pages() fails and the OOM killer progresses 
> as normal.

So if kzalloc fails then we think that the zone is already running an oom 
killer while it may only be active on other zones? Doesn't that create more 
trouble?


* Re: [patch 6/4] oom: pass null to kfree if zonelist is not cleared
  2007-09-18 21:01                                   ` Christoph Lameter
@ 2007-09-18 21:13                                     ` David Rientjes
  2007-09-18 21:25                                       ` Christoph Lameter
  0 siblings, 1 reply; 113+ messages in thread
From: David Rientjes @ 2007-09-18 21:13 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Andrew Morton, Andrea Arcangeli, linux-mm

On Tue, 18 Sep 2007, Christoph Lameter wrote:

> > Wrong.  Notice what the newly-named try_set_zone_oom() function returns if 
> > the kzalloc() fails; this was a specific design decision.  It returns 1, 
> > so the conditional in __alloc_pages() fails and the OOM killer progresses 
> > as normal.
> 
> So if kzalloc fails then we think that the zone is already running an oom 
> killer while it may only be active on other zones? Doesn't that create more 
> trouble?
> 

If the kzalloc fails, we're in a system-wide OOM state that isn't 
constrained by anything so we allow the OOM killer to be invoked just 
like this patchset was never applied.  We make no inference that it has 
already been invoked, there is nothing to suggest that it has.  All we 
know is that none of the zones in the zonelist from __alloc_pages() are 
currently in the OOM killer.

So we allow the OOM killer to proceed and trust that its heuristics will 
indeed kill a memory-hogging task and free up memory so we can at least 
start kmalloc'ing memory again.  The kernel seems to like that ability.

So the bottom line is that if the kzalloc fails, this entire patchset 
becomes a no-op for that OOM killer invocation; we allow out_of_memory() 
to be called and don't save the zonelist pointer.  I think you'll agree 
that if a kzalloc fails for such a small amount of memory, serialization 
of the OOM killer is the last thing we need to be concerned about.

		David


* Re: [patch 6/4] oom: pass null to kfree if zonelist is not cleared
  2007-09-18 21:13                                     ` David Rientjes
@ 2007-09-18 21:25                                       ` Christoph Lameter
  2007-09-18 22:16                                         ` David Rientjes
  0 siblings, 1 reply; 113+ messages in thread
From: Christoph Lameter @ 2007-09-18 21:25 UTC (permalink / raw)
  To: David Rientjes; +Cc: Andrew Morton, Andrea Arcangeli, linux-mm

On Tue, 18 Sep 2007, David Rientjes wrote:

> If the kzalloc fails, we're in a system-wide OOM state that isn't 
> constrained by anything so we allow the OOM killer to be invoked just 
> like this patchset was never applied.  We make no inference that it has 
> already been invoked, there is nothing to suggest that it has.  All we 
> know is that none of the zones in the zonelist from __alloc_pages() are 
> currently in the OOM killer.

kzalloc can be restricted by the cpuset / mempolicy context and the 
GFP_THISNODE flag. It may fail for other reasons too. Maybe setting a task 
flag like PF_MEMALLOC around the kzalloc will make it ignore those limits?


* Re: [patch 6/4] oom: pass null to kfree if zonelist is not cleared
  2007-09-18 21:25                                       ` Christoph Lameter
@ 2007-09-18 22:16                                         ` David Rientjes
  2007-09-19 17:09                                           ` Paul Jackson
  0 siblings, 1 reply; 113+ messages in thread
From: David Rientjes @ 2007-09-18 22:16 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Andrew Morton, Andrea Arcangeli, linux-mm

On Tue, 18 Sep 2007, Christoph Lameter wrote:

> > If the kzalloc fails, we're in a system-wide OOM state that isn't 
> > constrained by anything so we allow the OOM killer to be invoked just 
> > like this patchset was never applied.  We make no inference that it has 
> > already been invoked, there is nothing to suggest that it has.  All we 
> > know is that none of the zones in the zonelist from __alloc_pages() are 
> > currently in the OOM killer.
> 
> kzalloc can be restricted by the cpuset / mempolicy context and the 
> GFP_THISNODE flag. It may fail for other reasons too. Maybe setting a task 
> flag like PF_MEMALLOC around the kzalloc will make it ignore those limits?
> 

Why would it be constrained by the cpuset policy if there is no 
__GFP_HARDWALL?

We could do

	current->flags |= PF_MEMALLOC;
	oom_zl = kzalloc(sizeof(*oom_zl), GFP_KERNEL);
	current->flags &= ~PF_MEMALLOC;
	if (!oom_zl)
		...

since we already know that the zonelist zones don't match any of those 
currently in the OOM killer and PF_MEMALLOC will allow for future memory 
freeing.  That would try the allocation with no watermarks and seems like 
it would help.
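
If we go that route, a small helper that also preserves a pre-existing
PF_MEMALLOC would be safer; a sketch (alloc_oom_zl() and the save/restore
are illustrative, not part of the patchset):

	static struct oom_zonelist *alloc_oom_zl(void)
	{
		struct oom_zonelist *oom_zl;
		unsigned long pflags = current->flags;

		current->flags |= PF_MEMALLOC;
		oom_zl = kzalloc(sizeof(*oom_zl), GFP_KERNEL);
		/* don't clear PF_MEMALLOC if the caller already had it */
		if (!(pflags & PF_MEMALLOC))
			current->flags &= ~PF_MEMALLOC;
		return oom_zl;
	}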


* Re: [patch 6/4] oom: pass null to kfree if zonelist is not cleared
  2007-09-18 22:16                                         ` David Rientjes
@ 2007-09-19 17:09                                           ` Paul Jackson
  2007-09-19 18:21                                             ` David Rientjes
  0 siblings, 1 reply; 113+ messages in thread
From: Paul Jackson @ 2007-09-19 17:09 UTC (permalink / raw)
  To: David Rientjes; +Cc: clameter, akpm, andrea, linux-mm

David wrote:
> Why would it be constrained by the cpuset policy if there is no 
> __GFP_HARDWALL?

Er eh ... because it is ;)

With or without GFP_HARDWALL, allocations are constrained by cpuset
policy.

It's just a different policy (the nearest ancestor cpuset marked
mem_exclusive) without GFP_HARDWALL, rather than the current cpuset.

Cpuset constraints are ignored if in_interrupt, GFP_ATOMIC or
the thread flag TIF_MEMDIE is set.  Grep for "GFP_HARDWALL"
and read its comments (mostly in kernel/cpuset.c) and associated
code to see how these flags impact cpuset placement policy.
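
A rough model of that policy (illustrative only; the real checks live in
kernel/cpuset.c, nearest_exclusive_ancestor_allows() is a made-up
stand-in for the ancestor walk, and the GFP_ATOMIC case is omitted):

	static int cpuset_allows(int nid, gfp_t gfp_mask)
	{
		/* constraints ignored for interrupts and dying tasks */
		if (in_interrupt() || test_thread_flag(TIF_MEMDIE))
			return 1;
		if (gfp_mask & __GFP_HARDWALL)
			/* hardwall: only the current cpuset's nodes */
			return node_isset(nid, current->cpuset->mems_allowed);
		/* softwall: nearest mem_exclusive ancestor's nodes */
		return nearest_exclusive_ancestor_allows(nid);
	}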

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401


* Re: [patch 6/4] oom: pass null to kfree if zonelist is not cleared
  2007-09-19 17:09                                           ` Paul Jackson
@ 2007-09-19 18:21                                             ` David Rientjes
  0 siblings, 0 replies; 113+ messages in thread
From: David Rientjes @ 2007-09-19 18:21 UTC (permalink / raw)
  To: Paul Jackson; +Cc: clameter, akpm, andrea, linux-mm

On Wed, 19 Sep 2007, Paul Jackson wrote:

> David wrote:
> > Why would it be constrained by the cpuset policy if there is no 
> > __GFP_HARDWALL?
> 
> Er eh ... because it is ;)
> 
> With or without GFP_HARDWALL, allocations are constrained by cpuset
> policy.
> 
> It's just a different policy (the nearest ancestor cpuset marked
> mem_exclusive) without GFP_HARDWALL, rather than the current cpuset.
> 

The question is: why do we care?  I don't understand why it makes so much 
of a difference if the kzalloc fails and we fall back to non-serialized 
behavior, even though the updated patchset sets PF_MEMALLOC in current to 
avoid watermarks in its allocation.

We could set TIF_MEMDIE in current momentarily only for the kzalloc, but I 
think it's unnecessary and possibly troublesome because that task can be 
detected in parallel OOM killings and it suddenly becomes a no-op.  Even 
if we aren't serialized, the parallel OOM-killed task will be marked 
TIF_MEMDIE and we'll detect that and not kill anything because we've 
serialized on callback_mutex.
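
That is, something like the following around the allocation, shown only
to illustrate what the above argues against:

	set_thread_flag(TIF_MEMDIE);
	oom_zl = kzalloc(sizeof(*oom_zl), GFP_KERNEL);
	clear_thread_flag(TIF_MEMDIE);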

		David


* Re: [PATCH 10 of 24] stop useless vm trashing while we wait the TIF_MEMDIE task to exit
  2007-08-22 12:48 ` [PATCH 10 of 24] stop useless vm trashing while we wait the TIF_MEMDIE task to exit Andrea Arcangeli
  2007-09-12 12:42   ` Andrew Morton
@ 2007-09-21 19:10   ` David Rientjes
  2008-01-03  1:08     ` Andrea Arcangeli
  1 sibling, 1 reply; 113+ messages in thread
From: David Rientjes @ 2007-09-21 19:10 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: linux-mm

On Wed, 22 Aug 2007, Andrea Arcangeli wrote:

> diff --git a/mm/vmscan.c b/mm/vmscan.c
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1028,6 +1028,8 @@ static unsigned long shrink_zone(int pri
>  		nr_inactive = 0;
>  
>  	while (nr_active || nr_inactive) {
> +		if (is_VM_OOM())
> +			break;
>  		if (nr_active) {
>  			nr_to_scan = min(nr_active,
>  					(unsigned long)sc->swap_cluster_max);

This will need to use the new OOM zone-locking interface.  shrink_zones() 
accepts struct zone** as one of its formals, so while traversing each zone 
this would simply become a test of zone_is_oom_locked(*z).

> @@ -1138,6 +1140,17 @@ unsigned long try_to_free_pages(struct z
>  	}
>  
>  	for (priority = DEF_PRIORITY; priority >= 0; priority--) {
> +		if (is_VM_OOM()) {
> +			if (!test_thread_flag(TIF_MEMDIE)) {
> +				/* get out of the way */
> +				schedule_timeout_interruptible(1);
> +				/* don't waste cpu if we're still oom */
> +				if (is_VM_OOM())
> +					goto out;
> +			} else
> +				goto out;
> +		}
> +
>  		sc.nr_scanned = 0;
>  		if (!priority)
>  			disable_swap_token();
> 

Same as above, and it becomes trivial since try_to_free_pages() also 
accepts a struct zone** formal.
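
That is, both call sites reduce to something like this sketch, assuming
the zone_is_oom_locked() helper built on the saved-zonelist tracking:

	for (i = 0; zones[i] != NULL; i++) {
		struct zone *zone = zones[i];

		/* a parallel OOM kill already covers this zone; don't
		 * waste CPU trashing its LRU lists */
		if (zone_is_oom_locked(zone))
			continue;
		/* ... otherwise shrink this zone as usual ... */
	}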


* Re: [PATCH 01 of 24] remove nr_scan_inactive/active
  2007-09-12 11:44   ` Andrew Morton
@ 2008-01-02 17:50     ` Andrea Arcangeli
  0 siblings, 0 replies; 113+ messages in thread
From: Andrea Arcangeli @ 2008-01-02 17:50 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-mm, David Rientjes

Hi Andrew,

On Wed, Sep 12, 2007 at 04:44:50AM -0700, Andrew Morton wrote:
> Does that above text describe something which you've observed and measured
> in practice, or is it theoretical-from-code-inspection?

It's hard to tell why oom handling takes so long while scanning the
LRUs, so I tried to cut the useless work in places that could
generate overwork in that area. It's mostly theoretical though.

> The old code took care of the situation where zone_page_state(zone,
> NR_ACTIVE) is smaller than (1 << priority): do a bit of reclaim in that
> case anyway.  This is a minor issue, as we'll at least perform some
> scanning when priority is low.  But you should have deleted the now-wrong
> comment.

I see what you mean.

> Your change breaks that logic and there is potential that a small LRU will
> be underscanned, especially when reclaim is not under distress.

When the race triggers it may be underscanned anyway, so we can't
depend on that logic for correct operation; but most of the time it can
help, and removing the code like I did will surely scan less in your
small-lru scenario.

> According to the above-described logic, one would think that it would be
> more accurate to replace the existing
> 
> 	if (nr_active >= sc->swap_cluster_max)
> 		zone->nr_scan_active = 0;
> 
> with
> 
> 	if (nr_active >= sc->swap_cluster_max)
> 		zone->nr_scan_active -= sc->swap_cluster_max;

Not sure I follow why: this will underscan if it's the only change,
and it will make the race condition even more dangerous.

> Yet another alternative approach would be to remove the batching
> altogether.  If (zone_page_state(zone, NR_ACTIVE) >> priority) evaluates to
> "3", well, just go in and scan three pages.  That should address any
> accuracy problems and it will address the problem which you're addressing,
> but it will add unknown-but-probably-small computational cost.

It's simpler than that. All I care about is that nr_scan_*active
doesn't grow to insane levels without any good reason on bigsmp, as it
can now.

I thought this racy code didn't deserve to exist, but that's not my
priority; my priority is to avoid huge nr_scan_*active values,
especially with priorities going down to zero during oom, and that's
easy enough to achieve like this (mostly untested):

# HG changeset patch
# User Andrea Arcangeli <andrea@suse.de>
# Date 1199294746 -3600
# Node ID bc803863094aaef8a03dbec584370fb2b68b17d0
# Parent  e28e1be3fae5183e3e36e32e3feb9a59ec59c825
limit shrink zone scanning

Assume two tasks add to nr_scan_*active at the same time (first line of the
old buggy code): they'll effectively double their scan rate for no good
reason. Instead of scanning nr_entries each, they'll end up scanning
nr_entries*2 each. The more CPUs, the bigger the race, the higher the
multiplication effect, and the harder it becomes to detect oom. This puts a
cap on the amount of work it makes sense to do in case the race triggers.

Signed-off-by: Andrea Arcangeli <andrea@suse.de>

diff --git a/mm/vmscan.c b/mm/vmscan.c
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1114,7 +1114,7 @@ static unsigned long shrink_zone(int pri
 	 */
 	zone->nr_scan_active +=
 		(zone_page_state(zone, NR_ACTIVE) >> priority) + 1;
-	nr_active = zone->nr_scan_active;
+	nr_active = min(zone->nr_scan_active, zone_page_state(zone, NR_ACTIVE));
 	if (nr_active >= sc->swap_cluster_max)
 		zone->nr_scan_active = 0;
 	else
@@ -1122,7 +1122,7 @@ static unsigned long shrink_zone(int pri
 
 	zone->nr_scan_inactive +=
 		(zone_page_state(zone, NR_INACTIVE) >> priority) + 1;
-	nr_inactive = zone->nr_scan_inactive;
+	nr_inactive = min(zone->nr_scan_inactive, zone_page_state(zone, NR_INACTIVE));
 	if (nr_inactive >= sc->swap_cluster_max)
 		zone->nr_scan_inactive = 0;
 	else


* Re: [PATCH 03 of 24] prevent oom deadlocks during read/write operations
  2007-09-12 11:56   ` Andrew Morton
  2007-09-12  2:18     ` Nick Piggin
@ 2008-01-03  0:53     ` Andrea Arcangeli
  1 sibling, 0 replies; 113+ messages in thread
From: Andrea Arcangeli @ 2008-01-03  0:53 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-mm, David Rientjes, Nick Piggin

On Wed, Sep 12, 2007 at 04:56:59AM -0700, Andrew Morton wrote:
> The patch adds sixty-odd bytes of text to some of the most-used code in the
> kernel.  Based on the above problem description I'm doubting that this is
> justified.  Please tell us more?

It's quite simple: malloc(1G) from 100 tasks, then read(1G) from nfs
in the same 100 tasks at the same time, and they all go oom
simultaneously. Without the sigkill check the oom killer is not very
useful and things simply hang for a long time.
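
For concreteness, a hypothetical reproducer of that workload could look
like this (task count, buffer size and nfs path are made up; the actual
test program isn't part of this thread):

	#include <fcntl.h>
	#include <stdlib.h>
	#include <string.h>
	#include <unistd.h>
	#include <sys/wait.h>

	#define NTASKS	100
	#define ALLOC	(1UL << 30)	/* ~1G of anonymous memory per task */
	#define BUFSZ	(1UL << 20)	/* big read buffers */

	int main(void)
	{
		int i;

		for (i = 0; i < NTASKS; i++) {
			if (fork() == 0) {
				char *mem = malloc(ALLOC);
				char *buf = malloc(BUFSZ);
				int fd = open("/mnt/nfs/bigfile", O_RDONLY);

				if (mem)
					memset(mem, 1, ALLOC); /* force the pages in */
				/* stream the file until EOF or until oom killed */
				while (fd >= 0 && buf &&
				       read(fd, buf, BUFSZ) > 0)
					;
				_exit(0);
			}
		}
		while (wait(NULL) > 0)
			;
		return 0;
	}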

> diff -puN mm/filemap.c~oom-handling-prevent-oom-deadlocks-during-read-write-operations mm/filemap.c
> --- a/mm/filemap.c~oom-handling-prevent-oom-deadlocks-during-read-write-operations
> +++ a/mm/filemap.c
> @@ -916,6 +916,15 @@ page_ok:
>  			goto out;
>  		}
>  
> +		if (unlikely(sigismember(&current->pending.signal, SIGKILL))) {
> +			/*
> +			 * Must not hang almost forever in D state in presence
> +			 * of sigkill and lots of ram/swap (think during OOM).
> +			 */
> +			page_cache_release(page);
> +			goto out;
> +		}
> +
>  		/* nr is the maximum number of bytes to copy from this page */
>  		nr = PAGE_CACHE_SIZE;
>  		if (index == end_index) {
> @@ -2050,6 +2059,15 @@ static ssize_t generic_perform_write_2co
>  			break;
>  		}
>  
> +		if (unlikely(sigismember(&current->pending.signal, SIGKILL))) {
> +			/*
> +			 * Must not hang almost forever in D state in presence
> +			 * of sigkill and lots of ram/swap (think during OOM).
> +			 */
> +			status = -ENOMEM;
> +			break;
> +		}
> +
>  		page = __grab_cache_page(mapping, index);
>  		if (!page) {
>  			status = -ENOMEM;
> @@ -2220,6 +2238,15 @@ again:
>  			break;
>  		}
>  
> +		if (unlikely(sigismember(&current->pending.signal, SIGKILL))) {
> +			/*
> +			 * Must not hang almost forever in D state in presence
> +			 * of sigkill and lots of ram/swap (think during OOM).
> +			 */
> +			status = -ENOMEM;
> +			break;
> +		}
> +
>  		status = a_ops->write_begin(file, mapping, pos, bytes, flags,
>  						&page, &fsdata);
>  		if (unlikely(status))

Was there another approach for this? I merged your version anyway in
the meantime.


* Re: [PATCH 04 of 24] serialize oom killer
  2007-09-12 12:02   ` Andrew Morton
  2007-09-12 12:04     ` Andrew Morton
@ 2008-01-03  0:55     ` Andrea Arcangeli
  1 sibling, 0 replies; 113+ messages in thread
From: Andrea Arcangeli @ 2008-01-03  0:55 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-mm, David Rientjes

On Wed, Sep 12, 2007 at 05:02:05AM -0700, Andrew Morton wrote:
> Please use mutexes, not semaphores.  I'll make this change.
> 
> I think this patch needs more explanation/justification.

It's probably moot to discuss this now, as the zone-oom-lock mostly
obsoletes this patch.


* Re: [PATCH 08 of 24] don't depend on PF_EXITING tasks to go away
  2007-09-12 12:20   ` Andrew Morton
@ 2008-01-03  0:56     ` Andrea Arcangeli
  0 siblings, 0 replies; 113+ messages in thread
From: Andrea Arcangeli @ 2008-01-03  0:56 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-mm, David Rientjes

On Wed, Sep 12, 2007 at 05:20:32AM -0700, Andrew Morton wrote:
> On Wed, 22 Aug 2007 14:48:55 +0200 Andrea Arcangeli <andrea@suse.de> wrote:
> 
> > # HG changeset patch
> > # User Andrea Arcangeli <andrea@suse.de>
> > # Date 1187778125 -7200
> > # Node ID ffdc30241856d7155ceedd4132eef684f7cc7059
> > # Parent  b66d8470c04ed836787f69c7578d5fea4f18c322
> > don't depend on PF_EXITING tasks to go away
> > 
> > A PF_EXITING task doesn't have TIF_MEMDIE set, so it might get stuck in
> > memory allocations without access to the PF_MEMALLOC pool (that said,
> > ideally do_exit had better not require memory allocations, especially
> > not before calling exit_mm). The same way we raise its privilege to
> > TIF_MEMDIE when it's the current task, we should do so even when it's
> > not the current task, to speed up oom killing.
> > 
> > Signed-off-by: Andrea Arcangeli <andrea@suse.de>
> > 
> > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -234,27 +234,13 @@ static struct task_struct *select_bad_pr
> >  		 * Note: this may have a chance of deadlock if it gets
> >  		 * blocked waiting for another task which itself is waiting
> >  		 * for memory. Is there a better alternative?
> > +		 *
> > +		 * Better not to skip PF_EXITING tasks, since they
> > +		 * don't have access to the PF_MEMALLOC pool until
> > +		 * we select them here first.
> >  		 */
> >  		if (test_tsk_thread_flag(p, TIF_MEMDIE))
> >  			return ERR_PTR(-1UL);
> > -
> > -		/*
> > -		 * This is in the process of releasing memory so wait for it
> > -		 * to finish before killing some other task by mistake.
> > -		 *
> > -		 * However, if p is the current task, we allow the 'kill' to
> > -		 * go ahead if it is exiting: this will simply set TIF_MEMDIE,
> > -		 * which will allow it to gain access to memory reserves in
> > -		 * the process of exiting and releasing its resources.
> > -		 * Otherwise we could get an easy OOM deadlock.
> > -		 */
> > -		if (p->flags & PF_EXITING) {
> > -			if (p != current)
> > -				return ERR_PTR(-1UL);
> > -
> > -			chosen = p;
> > -			*ppoints = ULONG_MAX;
> > -		}
> >  
> >  		if (p->oomkilladj == OOM_DISABLE)
> >  			continue;
> > 
> 
> hm, I'll believe you.
> 
> Does this address any problem which was actually observed in real life?

From memory, yes.


* Re: [PATCH 06 of 24] reduce the probability of an OOM livelock
  2007-09-12 12:17   ` Andrew Morton
@ 2008-01-03  1:03     ` Andrea Arcangeli
  0 siblings, 0 replies; 113+ messages in thread
From: Andrea Arcangeli @ 2008-01-03  1:03 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-mm, David Rientjes

On Wed, Sep 12, 2007 at 05:17:30AM -0700, Andrew Morton wrote:
> I don't get it.  This code changes try_to_free_pages() so that it will only
> bale out when a single scan of the zone at a particular priority reclaimed
> more than swap_cluster_max pages.  Previously we'd include the results of all the
> lower-priority scanning in that comparison too.
> 
> So this patch will make try_to_free_pages() do _more_ scanning than it used
> to, in some situations.  Which seems opposite to what you're trying to do
> here.

It will do more scanning per pass, but it will think it's oom sooner!
And oom-sooner = less scanning overall. My objective is to go oom
sooner, not after a zillion lru passes. A single failing pass at
priority 0 is now enough to declare oom; previously all the earlier
passes had to fail too.
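
In code terms, the success test moves inside the priority loop and judges
each pass on its own. A simplified sketch (not the literal patch; zone
iteration and slab shrinking elided):

	for (priority = DEF_PRIORITY; priority >= 0; priority--) {
		unsigned long nr_reclaimed;

		sc.nr_scanned = 0;
		nr_reclaimed = shrink_zones(priority, zones, &sc);
		if (nr_reclaimed >= sc.swap_cluster_max) {
			/* this single pass made enough progress */
			ret = 1;
			goto out;
		}
	}
	/* even the priority-0 pass failed by itself: declare oom */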

> A similar situation exists with this change.

Yes.

> Your changelog made no mention of the change to balance_pgdat() and I'm
> struggling a bit to see what it's doing in there.

I thought it better to have it work the same in both places.

> In both places, the definition of local variable nr_reclaimed can be moved
> into a more inner scope.  This makes the code easier to follow.  Please
> watch out for cleanup opportunities like that.

Cleaned up.


* Re: [PATCH 09 of 24] fallback killing more tasks if tif-memdie doesn't go away
  2007-09-12 12:30   ` Andrew Morton
  2007-09-12 12:34     ` Andrew Morton
@ 2008-01-03  1:06     ` Andrea Arcangeli
  1 sibling, 0 replies; 113+ messages in thread
From: Andrea Arcangeli @ 2008-01-03  1:06 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-mm, David Rientjes

On Wed, Sep 12, 2007 at 05:30:22AM -0700, Andrew Morton wrote:
> On Wed, 22 Aug 2007 14:48:56 +0200 Andrea Arcangeli <andrea@suse.de> wrote:
> 
> > # HG changeset patch
> > # User Andrea Arcangeli <andrea@suse.de>
> > # Date 1187778125 -7200
> > # Node ID 9bf6a66eab3c52327daa831ef101d7802bc71791
> > # Parent  ffdc30241856d7155ceedd4132eef684f7cc7059
> > fallback killing more tasks if tif-memdie doesn't go away
> > 
> > Waiting indefinitely for a TIF_MEMDIE task to go away will deadlock. Two
> > tasks reading from the same inode at the same time, both going out of
> > memory inside a read(largebuffer) syscall, will even deadlock through
> > contention over the PG_locked bitflag. The task holding the semaphore
> > detects oom, but the oom killer decides to kill the task blocked in
> > wait_on_page_locked(). The task holding the semaphore will hang inside
> > alloc_pages, which will never return because it waits for the TIF_MEMDIE
> > task to go away, but the TIF_MEMDIE task can't go away until the task
> > holding the semaphore is killed in the first place.
> 
> hrm, OK, that's not nice
> 
> > It's quite impractical to teach the oom killer the locking dependencies
> > across running tasks, so the feasible fix is to develop a logic that,
> > after waiting a long time for a TIF_MEMDIE task to go away, falls back
> > to killing one more task. This also mostly eliminates the possibility of
> > spurious oom killage (i.e. two tasks killed when only one had to be
> > killed). It's not a math guarantee, because we can't demonstrate that if
> > a TIF_MEMDIE SIGKILLED task didn't manage to complete do_exit within
> > 10sec, it never will. But the current probability of spurious oom
> > killing is surely much higher than the probability of spurious oom
> > killing with this patch applied.
> > 
> > The whole locking is around the tasklist_lock. On one side do_exit reads
> > TIF_MEMDIE and clears VM_is_OOM under the lock; on the other side the
> > oom killer accesses VM_is_OOM and TIF_MEMDIE under the lock. This is a
> > read_lock in the oom killer, but it's effectively a write lock thanks to
> > the OOM_lock semaphore running only one oom killer at a time (the
> > locking rule is: either use write_lock_irq, or read_lock+OOM_lock).
> > 
> 
> 
> > 
> > diff --git a/kernel/exit.c b/kernel/exit.c
> > --- a/kernel/exit.c
> > +++ b/kernel/exit.c
> > @@ -849,6 +849,15 @@ static void exit_notify(struct task_stru
> >  	if (tsk->exit_signal == -1 && likely(!tsk->ptrace))
> >  		state = EXIT_DEAD;
> >  	tsk->exit_state = state;
> > +
> > +	/*
> > +	 * Read TIF_MEMDIE and set VM_is_OOM to 0 atomically inside
> > +	 * the tasklist_lock.
> > +	 */
> > +	if (unlikely(test_tsk_thread_flag(tsk, TIF_MEMDIE))) {
> > +		extern unsigned long VM_is_OOM;
> > +		clear_bit(0, &VM_is_OOM);
> > +	}
> 
> Please, no externs-in-C, ever.

You mean in .c ;).

Anyway I dropped VM_is_OOM for now so the critical fixes will be
easier to merge. There are downsides to that (spurious oom killing
isn't impossible anymore), but the other fixes have much higher
priority.
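
For reference, the locking rule from the changelog ("either write_lock_irq
or read_lock+OOM_lock") amounted to roughly this on the oom killer side: a
sketch of the now-dropped VM_is_OOM scheme, with only select_bad_process()
taken from the quoted patches and the rest schematic:

	mutex_lock(&OOM_lock);		/* run one oom killer at a time */
	read_lock(&tasklist_lock);	/* read_lock + OOM_lock is exclusive
					 * against exit_notify's
					 * write_lock_irq */
	if (!test_bit(0, &VM_is_OOM)) {
		p = select_bad_process(&points);
		if (p && !IS_ERR(p)) {
			set_bit(0, &VM_is_OOM);	/* exit_notify() clears it */
			/* ... send SIGKILL and set TIF_MEMDIE on p ... */
		}
	}
	read_unlock(&tasklist_lock);
	mutex_unlock(&OOM_lock);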


* Re: [PATCH 10 of 24] stop useless vm trashing while we wait the TIF_MEMDIE task to exit
  2007-09-21 19:10   ` David Rientjes
@ 2008-01-03  1:08     ` Andrea Arcangeli
  0 siblings, 0 replies; 113+ messages in thread
From: Andrea Arcangeli @ 2008-01-03  1:08 UTC (permalink / raw)
  To: David Rientjes; +Cc: linux-mm, Andrew Morton

On Fri, Sep 21, 2007 at 12:10:23PM -0700, David Rientjes wrote:
> On Wed, 22 Aug 2007, Andrea Arcangeli wrote:
> 
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -1028,6 +1028,8 @@ static unsigned long shrink_zone(int pri
> >  		nr_inactive = 0;
> >  
> >  	while (nr_active || nr_inactive) {
> > +		if (is_VM_OOM())
> > +			break;
> >  		if (nr_active) {
> >  			nr_to_scan = min(nr_active,
> >  					(unsigned long)sc->swap_cluster_max);
> 
> This will need to use the new OOM zone-locking interface.  shrink_zones() 
> accepts struct zone** as one of its formals so while traversing each zone 
> this would simply become a test of zone_is_oom_locked(*z).

Yes, I changed this to use zone_is_oom_locked. Same logic as before:
spend the time in schedule_timeout while the system tries to resolve
the oom condition, instead of thrashing the whole cpu caches over the
lru.

> 
> > @@ -1138,6 +1140,17 @@ unsigned long try_to_free_pages(struct z
> >  	}
> >  
> >  	for (priority = DEF_PRIORITY; priority >= 0; priority--) {
> > +		if (is_VM_OOM()) {
> > +			if (!test_thread_flag(TIF_MEMDIE)) {
> > +				/* get out of the way */
> > +				schedule_timeout_interruptible(1);
> > +				/* don't waste cpu if we're still oom */
> > +				if (is_VM_OOM())
> > +					goto out;
> > +			} else
> > +				goto out;
> > +		}
> > +
> >  		sc.nr_scanned = 0;
> >  		if (!priority)
> >  			disable_swap_token();
> > 
> 
> Same as above, and it becomes trivial since try_to_free_pages() also 
> accepts a struct zone** formal.

Yes, converted this too.
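
A sketch of how the converted check could read, using David's interface;
zonelist_is_oom_locked() here is a hypothetical wrapper that runs
zone_is_oom_locked() over each zone of the zonelist, since the merged code
isn't quoted in this thread:

	for (priority = DEF_PRIORITY; priority >= 0; priority--) {
		if (zonelist_is_oom_locked(zones)) {
			if (!test_thread_flag(TIF_MEMDIE)) {
				/* get out of the way */
				schedule_timeout_interruptible(1);
				/* don't waste cpu if we're still oom */
				if (zonelist_is_oom_locked(zones))
					goto out;
			} else
				goto out;
		}

		sc.nr_scanned = 0;
		if (!priority)
			disable_swap_token();
		/* ... rest of the reclaim pass unchanged ... */
	}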


* Re: [PATCH 15 of 24] limit reclaim if enough pages have been freed
  2007-09-12 12:57   ` Andrew Morton
@ 2008-01-03  1:12     ` Andrea Arcangeli
  0 siblings, 0 replies; 113+ messages in thread
From: Andrea Arcangeli @ 2008-01-03  1:12 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-mm, David Rientjes

On Wed, Sep 12, 2007 at 05:57:23AM -0700, Andrew Morton wrote:
> whoa, that's a huge change to the scanning logic.  Suppose we've decided to
> scan 1,000,000 active pages and 42 inactive pages.  With this change we'll
> bale out after scanning the 42 inactive pages.  The change to the
> inactive/active balancing logic is potentially large.

Could be, but I don't think it's good to do that much overwork on
large-ram systems when freeing swap_cluster_max pages is enough to
guarantee we're not getting a spurious oom. It's only a latency issue
here (not RT at all, but still a latency issue). Anyway, feel free to
keep this one out; it's mostly independent from the rest.
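
The change under discussion amounts to a bail-out along these lines inside
the shrink loop (placement and naming approximate, reconstructed from the
discussion rather than copied from the patch):

	while (nr_active || nr_inactive) {
		/* ... isolate and shrink batches as before ... */

		/*
		 * Enough pages freed to satisfy the allocation: stop here
		 * instead of completing the full active/inactive quota.
		 */
		if (nr_reclaimed >= sc->swap_cluster_max)
			break;
	}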


end of thread

Thread overview: 113+ messages
2007-08-22 12:48 [PATCH 00 of 24] OOM related fixes Andrea Arcangeli
2007-08-22 12:48 ` [PATCH 01 of 24] remove nr_scan_inactive/active Andrea Arcangeli
2007-09-12 11:44   ` Andrew Morton
2008-01-02 17:50     ` Andrea Arcangeli
2007-08-22 12:48 ` [PATCH 02 of 24] avoid oom deadlock in nfs_create_request Andrea Arcangeli
2007-09-12 23:54   ` Christoph Lameter
2007-08-22 12:48 ` [PATCH 03 of 24] prevent oom deadlocks during read/write operations Andrea Arcangeli
2007-09-12 11:56   ` Andrew Morton
2007-09-12  2:18     ` Nick Piggin
2008-01-03  0:53     ` Andrea Arcangeli
2007-08-22 12:48 ` [PATCH 04 of 24] serialize oom killer Andrea Arcangeli
2007-09-12 12:02   ` Andrew Morton
2007-09-12 12:04     ` Andrew Morton
2007-09-12 12:11       ` Andrea Arcangeli
2008-01-03  0:55     ` Andrea Arcangeli
2007-09-13  0:09   ` Christoph Lameter
2007-09-13 18:32     ` David Rientjes
2007-09-13 18:37       ` Christoph Lameter
2007-09-13 18:46         ` David Rientjes
2007-09-13 18:53           ` Christoph Lameter
2007-09-14  0:36             ` David Rientjes
2007-09-14  2:31               ` Christoph Lameter
2007-09-14  3:33                 ` David Rientjes
2007-09-18 16:44                   ` David Rientjes
2007-09-18 16:44                     ` [patch 1/4] oom: move prototypes to appropriate header file David Rientjes
2007-09-18 16:44                       ` [patch 2/4] oom: move constraints to enum David Rientjes
2007-09-18 16:44                         ` [patch 3/4] oom: save zonelist pointer for oom killer calls David Rientjes
2007-09-18 16:44                           ` [patch 4/4] oom: serialize out of memory calls David Rientjes
2007-09-18 19:54                             ` Christoph Lameter
2007-09-18 19:56                               ` David Rientjes
2007-09-18 20:01                                 ` Christoph Lameter
2007-09-18 20:06                                   ` David Rientjes
2007-09-18 20:23                                     ` [patch 5/4] oom: rename serialization helper functions David Rientjes
2007-09-18 20:26                                       ` Christoph Lameter
2007-09-18 20:39                                         ` [patch 5/4 v2] " David Rientjes
2007-09-18 20:59                                           ` Christoph Lameter
2007-09-18 19:57                           ` [patch 3/4] oom: save zonelist pointer for oom killer calls Christoph Lameter
2007-09-18 20:13                             ` David Rientjes
2007-09-18 20:16                               ` Christoph Lameter
2007-09-18 20:47                                 ` [patch 6/4] oom: pass null to kfree if zonelist is not cleared David Rientjes
2007-09-18 21:01                                   ` Christoph Lameter
2007-09-18 21:13                                     ` David Rientjes
2007-09-18 21:25                                       ` Christoph Lameter
2007-09-18 22:16                                         ` David Rientjes
2007-09-19 17:09                                           ` Paul Jackson
2007-09-19 18:21                                             ` David Rientjes
2007-09-18 19:55                         ` [patch 2/4] oom: move constraints to enum Christoph Lameter
2007-08-22 12:48 ` [PATCH 05 of 24] avoid selecting already killed tasks Andrea Arcangeli
2007-09-13  0:13   ` Christoph Lameter
2007-08-22 12:48 ` [PATCH 06 of 24] reduce the probability of an OOM livelock Andrea Arcangeli
2007-09-12 12:17   ` Andrew Morton
2008-01-03  1:03     ` Andrea Arcangeli
2007-08-22 12:48 ` [PATCH 07 of 24] balance_pgdat doesn't return the number of pages freed Andrea Arcangeli
2007-09-12 12:18   ` Andrew Morton
2007-09-13  0:26     ` Christoph Lameter
2007-08-22 12:48 ` [PATCH 08 of 24] don't depend on PF_EXITING tasks to go away Andrea Arcangeli
2007-09-12 12:20   ` Andrew Morton
2008-01-03  0:56     ` Andrea Arcangeli
2007-08-22 12:48 ` [PATCH 09 of 24] fallback killing more tasks if tif-memdie doesn't " Andrea Arcangeli
2007-09-12 12:30   ` Andrew Morton
2007-09-12 12:34     ` Andrew Morton
2008-01-03  1:06     ` Andrea Arcangeli
2007-08-22 12:48 ` [PATCH 10 of 24] stop useless vm trashing while we wait the TIF_MEMDIE task to exit Andrea Arcangeli
2007-09-12 12:42   ` Andrew Morton
2007-09-13  0:36     ` Christoph Lameter
2007-09-21 19:10   ` David Rientjes
2008-01-03  1:08     ` Andrea Arcangeli
2007-08-22 12:48 ` [PATCH 11 of 24] the oom schedule timeout isn't needed with the VM_is_OOM logic Andrea Arcangeli
2007-09-12 12:44   ` Andrew Morton
2007-08-22 12:48 ` [PATCH 12 of 24] show mem information only when a task is actually being killed Andrea Arcangeli
2007-09-12 12:49   ` Andrew Morton
2007-08-22 12:49 ` [PATCH 13 of 24] simplify oom heuristics Andrea Arcangeli
2007-09-12 12:52   ` Andrew Morton
2007-09-12 13:40     ` Andrea Arcangeli
2007-09-12 20:52       ` Andrew Morton
2007-08-22 12:49 ` [PATCH 14 of 24] oom select should only take rss into account Andrea Arcangeli
2007-09-13  0:43   ` Christoph Lameter
2007-08-22 12:49 ` [PATCH 15 of 24] limit reclaim if enough pages have been freed Andrea Arcangeli
2007-09-12 12:57   ` Andrew Morton
2008-01-03  1:12     ` Andrea Arcangeli
2007-09-12 12:58   ` Andrew Morton
2007-09-12 13:38     ` Andrea Arcangeli
2007-08-22 12:49 ` [PATCH 16 of 24] avoid some lock operation in vm fast path Andrea Arcangeli
2007-09-12 12:59   ` Andrew Morton
2007-09-13  0:49     ` Christoph Lameter
2007-09-13  1:16       ` Andrew Morton
2007-09-13  1:33         ` Christoph Lameter
2007-09-13  1:41           ` KAMEZAWA Hiroyuki
2007-09-13  1:44           ` Andrew Morton
2007-08-22 12:49 ` [PATCH 17 of 24] apply the anti deadlock features only to global oom Andrea Arcangeli
2007-09-12 13:02   ` Andrew Morton
2007-09-13  0:53     ` Christoph Lameter
2007-09-13  0:52   ` Christoph Lameter
2007-08-22 12:49 ` [PATCH 18 of 24] run panic the same way in both places Andrea Arcangeli
2007-09-13  0:54   ` Christoph Lameter
2007-08-22 12:49 ` [PATCH 19 of 24] cacheline align VM_is_OOM to prevent false sharing Andrea Arcangeli
2007-09-12 13:02   ` Andrew Morton
2007-09-12 13:36     ` Andrea Arcangeli
2007-09-13  0:55       ` Christoph Lameter
2007-08-22 12:49 ` [PATCH 20 of 24] extract deadlock helper function Andrea Arcangeli
2007-08-22 12:49 ` [PATCH 21 of 24] select process to kill for cpusets Andrea Arcangeli
2007-09-12 13:05   ` Andrew Morton
2007-09-13  0:59     ` Christoph Lameter
2007-09-13  5:13       ` David Rientjes
2007-09-13 17:55         ` Christoph Lameter
2007-08-22 12:49 ` [PATCH 22 of 24] extract select helper function Andrea Arcangeli
2007-08-22 12:49 ` [PATCH 23 of 24] serialize for cpusets Andrea Arcangeli
2007-09-12 13:10   ` Andrew Morton
2007-09-12 13:34     ` Andrea Arcangeli
2007-09-12 19:08     ` David Rientjes
2007-09-13  1:02     ` Christoph Lameter
2007-08-22 12:49 ` [PATCH 24 of 24] add oom_kill_asking_task flag Andrea Arcangeli
2007-09-12 13:11   ` Andrew Morton
