linux-mm.kvack.org archive mirror
* [RFC 0/9] Reclaim during GFP_ATOMIC allocs
@ 2007-08-14 15:30 Christoph Lameter
  2007-08-14 15:30 ` [RFC 1/9] Allow reclaim via __GFP_NOMEMALLOC reclaim Christoph Lameter
                   ` (9 more replies)
  0 siblings, 10 replies; 46+ messages in thread
From: Christoph Lameter @ 2007-08-14 15:30 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, akpm, dkegel, Peter Zijlstra, David Miller,
	Nick Piggin

This is the extended version of the reclaim patchset. It enables reclaim of
clean file-backed pages during GFP_ATOMIC allocations. A bit invasive, since
many locks must now be taken with the irq flags saved. But it works.

Tested by repeatedly allocating 12MB of memory from the timer interrupt.
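
For reference, the core of the test (condensed from patch 9/9 in this
series; NR_TEST is 3000 pages, i.e. 12MB) runs from the timer tick roughly
like this:

	struct page **base;
	int i;

	/* Force memory to become exhausted, then release it again */
	base = kzalloc(NR_TEST * sizeof(void *), GFP_ATOMIC);
	if (!base)
		return;
	for (i = 0; i < NR_TEST; i++)
		if (!(base[i] = alloc_page(GFP_ATOMIC)))
			break;
	for (i = 0; i < NR_TEST; i++)
		if (base[i])
			put_page(base[i]);
	kfree(base);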


* [RFC 1/9] Allow reclaim via __GFP_NOMEMALLOC reclaim
  2007-08-14 15:30 [RFC 0/9] Reclaim during GFP_ATOMIC allocs Christoph Lameter
@ 2007-08-14 15:30 ` Christoph Lameter
  2007-08-14 15:30 ` [RFC 2/9] Use NOMEMALLOC reclaim to allow reclaim if PF_MEMALLOC is set Christoph Lameter
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 46+ messages in thread
From: Christoph Lameter @ 2007-08-14 15:30 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, akpm, dkegel, Peter Zijlstra, David Miller,
	Nick Piggin

[-- Attachment #1: vmscan_nomemalloc --]
[-- Type: text/plain, Size: 3168 bytes --]

Make try_to_free_pages() not perform any allocations if __GFP_NOMEMALLOC
is set.

We can avoid allocations by not writing pages out or swapping. So on entry
to try_to_free_pages() we check for __GFP_NOMEMALLOC. If it is set
then sc.may_writepage and sc.may_swap are switched off and we short-circuit
the writeout handling.

The throttling of VM writeout is also bypassed since there is no
writeout occurring.

It is likely difficult to make sure that the slab shrinkers do
not perform any allocations. So we simply do not shrink slabs.

The types of pages that can be reclaimed by a call to try_to_free_pages()
with __GFP_NOMEMALLOC set are:

- Unmapped clean page cache pages.
- Mapped clean pages.
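
As a sketch of the intended use (later patches in this series call it from
the allocator slow path exactly like this, see 2/9 and 8/9):

	/* Reclaim without allocating or writing anything out */
	if (try_to_free_pages(zonelist->zones, order,
				gfp_mask | __GFP_NOMEMALLOC))
		/* Freed some clean pages, retry the allocation */
		goto restart;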

Signed-off-by: Christoph Lameter <clameter@sgi.com>


---
 mm/vmscan.c |   25 ++++++++++++++++++++++---
 1 file changed, 22 insertions(+), 3 deletions(-)

Index: linux-2.6/mm/vmscan.c
===================================================================
--- linux-2.6.orig/mm/vmscan.c	2007-08-13 23:43:45.000000000 -0700
+++ linux-2.6/mm/vmscan.c	2007-08-13 23:51:05.000000000 -0700
@@ -161,6 +161,13 @@ unsigned long shrink_slab(unsigned long 
 	if (scanned == 0)
 		scanned = SWAP_CLUSTER_MAX;
 
+	/*
+	 * Not sure if we can keep this clean of allocs.
+	 * Better leave it off for now
+	 */
+	if (gfp_mask & __GFP_NOMEMALLOC)
+		return 1;
+
 	if (!down_read_trylock(&shrinker_rwsem))
 		return 1;	/* Assume we'll be able to shrink next time */
 
@@ -1053,7 +1060,8 @@ static unsigned long shrink_zone(int pri
 		}
 	}
 
-	throttle_vm_writeout(sc->gfp_mask);
+	if (!(sc->gfp_mask & __GFP_NOMEMALLOC))
+		throttle_vm_writeout(sc->gfp_mask);
 
 	atomic_dec(&zone->reclaim_in_progress);
 	return nr_reclaimed;
@@ -1115,6 +1123,9 @@ static unsigned long shrink_zones(int pr
  * hope that some of these pages can be written.  But if the allocating task
  * holds filesystem locks which prevent writeout this might not work, and the
  * allocation attempt will fail.
+ *
+ * The __GFP_NOMEMALLOC flag has a special role. If it is set then no memory
+ * allocations or writeout will occur.
  */
 unsigned long try_to_free_pages(struct zone **zones, int order, gfp_t gfp_mask)
 {
@@ -1127,15 +1138,17 @@ unsigned long try_to_free_pages(struct z
 	int i;
 	struct scan_control sc = {
 		.gfp_mask = gfp_mask,
-		.may_writepage = !laptop_mode,
 		.swap_cluster_max = SWAP_CLUSTER_MAX,
-		.may_swap = 1,
 		.swappiness = vm_swappiness,
 		.order = order,
 	};
 
 	count_vm_event(ALLOCSTALL);
 
+	if (!(gfp_mask & __GFP_NOMEMALLOC)) {
+		sc.may_writepage = !laptop_mode;
+		sc.may_swap = 1;
+	}
 	for (i = 0; zones[i] != NULL; i++) {
 		struct zone *zone = zones[i];
 
@@ -1162,6 +1175,9 @@ unsigned long try_to_free_pages(struct z
 			goto out;
 		}
 
+		if (gfp_mask & __GFP_NOMEMALLOC)
+			continue;
+
 		/*
 		 * Try to write back as many pages as we just scanned.  This
 		 * tends to cause slow streaming writers to write data to the


* [RFC 2/9] Use NOMEMALLOC reclaim to allow reclaim if PF_MEMALLOC is set
  2007-08-14 15:30 [RFC 0/9] Reclaim during GFP_ATOMIC allocs Christoph Lameter
  2007-08-14 15:30 ` [RFC 1/9] Allow reclaim via __GFP_NOMEMALLOC reclaim Christoph Lameter
@ 2007-08-14 15:30 ` Christoph Lameter
  2007-08-18  7:10   ` Pavel Machek
  2007-08-14 15:30 ` [RFC 3/9] Make cond_rescheds conditional on __GFP_WAIT Christoph Lameter
                   ` (7 subsequent siblings)
  9 siblings, 1 reply; 46+ messages in thread
From: Christoph Lameter @ 2007-08-14 15:30 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, akpm, dkegel, Peter Zijlstra, David Miller,
	Nick Piggin

[-- Attachment #1: reclaim_nomemalloc --]
[-- Type: text/plain, Size: 1858 bytes --]

If we exhaust the reserves in the page allocator when PF_MEMALLOC is set
then do not give up. Instead call into reclaim with __GFP_NOMEMALLOC set.

This is in essence a recursive call back into page reclaim with another
allocation flag (__GFP_NOMEMALLOC) set. The recursion is bounded since
potential allocations with __GFP_NOMEMALLOC set will not enter that branch
again.

This means that allocations under PF_MEMALLOC will no longer run out of
memory. Allocations under PF_MEMALLOC will do a limited form of reclaim
instead.

The reclaim is of particular importance to stacked filesystems that may
do a lot of allocations in the write path. Reclaim will keep working
as long as there are clean file-backed pages to reclaim.

Signed-off-by: Christoph Lameter <clameter@sgi.com>

---
 mm/page_alloc.c |   11 +++++++++++
 1 file changed, 11 insertions(+)

Index: linux-2.6/mm/page_alloc.c
===================================================================
--- linux-2.6.orig/mm/page_alloc.c	2007-08-13 23:50:01.000000000 -0700
+++ linux-2.6/mm/page_alloc.c	2007-08-13 23:58:43.000000000 -0700
@@ -1306,6 +1306,17 @@ nofail_alloc:
 				zonelist, ALLOC_NO_WATERMARKS);
 			if (page)
 				goto got_pg;
+			/*
+			 * If we are already in reclaim then the environment
+			 * is already setup. We can simply call
+			 * try_to_free_pages(). Just make sure that
+			 * we do not allocate anything.
+			 */
+			if (p->flags & PF_MEMALLOC && wait &&
+				try_to_free_pages(zonelist->zones, order,
+						gfp_mask | __GFP_NOMEMALLOC))
+				goto restart;
+
 			if (gfp_mask & __GFP_NOFAIL) {
 				congestion_wait(WRITE, HZ/50);
 				goto nofail_alloc;


* [RFC 3/9] Make cond_rescheds conditional on __GFP_WAIT
  2007-08-14 15:30 [RFC 0/9] Reclaim during GFP_ATOMIC allocs Christoph Lameter
  2007-08-14 15:30 ` [RFC 1/9] Allow reclaim via __GFP_NOMEMALLOC reclaim Christoph Lameter
  2007-08-14 15:30 ` [RFC 2/9] Use NOMEMALLOC reclaim to allow reclaim if PF_MEMALLOC is set Christoph Lameter
@ 2007-08-14 15:30 ` Christoph Lameter
  2007-08-14 15:30 ` [RFC 4/9] Atomic reclaim: Save irq flags in vmscan.c Christoph Lameter
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 46+ messages in thread
From: Christoph Lameter @ 2007-08-14 15:30 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, akpm, dkegel, Peter Zijlstra, David Miller,
	Nick Piggin

[-- Attachment #1: vmscan_reclaim_resched --]
[-- Type: text/plain, Size: 1820 bytes --]

We cannot reschedule if we are in atomic reclaim. So make
the calls to cond_resched() conditional on the __GFP_WAIT flag.

Signed-off-by: Christoph Lameter <clameter@sgi.com>

---
 mm/vmscan.c |   15 ++++++++++++---
 1 file changed, 12 insertions(+), 3 deletions(-)

Index: linux-2.6/mm/vmscan.c
===================================================================
--- linux-2.6.orig/mm/vmscan.c	2007-08-14 07:34:18.000000000 -0700
+++ linux-2.6/mm/vmscan.c	2007-08-14 07:34:25.000000000 -0700
@@ -258,6 +258,15 @@ static int may_write_to_queue(struct bac
 }
 
 /*
+ * Reschedule if we are not in atomic mode
+ */
+static void reclaim_resched(struct scan_control *sc)
+{
+	if (sc->gfp_mask & __GFP_WAIT)
+		cond_resched();
+}
+
+/*
  * We detected a synchronous write error writing a page out.  Probably
  * -ENOSPC.  We need to propagate that into the address_space for a subsequent
  * fsync(), msync() or close().
@@ -437,7 +446,7 @@ static unsigned long shrink_page_list(st
 	int pgactivate = 0;
 	unsigned long nr_reclaimed = 0;
 
-	cond_resched();
+	reclaim_resched(sc);
 
 	pagevec_init(&freed_pvec, 1);
 	while (!list_empty(page_list)) {
@@ -446,7 +455,7 @@ static unsigned long shrink_page_list(st
 		int may_enter_fs;
 		int referenced;
 
-		cond_resched();
+		reclaim_resched(sc);
 
 		page = lru_to_page(page_list);
 		list_del(&page->lru);
@@ -938,7 +947,7 @@ force_reclaim_mapped:
 	spin_unlock_irq(&zone->lru_lock);
 
 	while (!list_empty(&l_hold)) {
-		cond_resched();
+		reclaim_resched(sc);
 		page = lru_to_page(&l_hold);
 		list_del(&page->lru);
 		if (page_mapped(page)) {


* [RFC 4/9] Atomic reclaim: Save irq flags in vmscan.c
  2007-08-14 15:30 [RFC 0/9] Reclaim during GFP_ATOMIC allocs Christoph Lameter
                   ` (2 preceding siblings ...)
  2007-08-14 15:30 ` [RFC 3/9] Make cond_rescheds conditional on __GFP_WAIT Christoph Lameter
@ 2007-08-14 15:30 ` Christoph Lameter
  2007-08-14 20:02   ` Andi Kleen
  2007-08-14 15:30 ` [RFC 5/9] Save irqflags on taking the mapping lock Christoph Lameter
                   ` (5 subsequent siblings)
  9 siblings, 1 reply; 46+ messages in thread
From: Christoph Lameter @ 2007-08-14 15:30 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, akpm, dkegel, Peter Zijlstra, David Miller,
	Nick Piggin

[-- Attachment #1: vmscan_irqsave --]
[-- Type: text/plain, Size: 4697 bytes --]

Reclaim can be called with interrupts disabled in atomic reclaim.
vmscan.c is currently using spin_lock_irq(). Switch to spin_lock_irqsave().
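
To illustrate why (a sketch, not part of the patch): spin_unlock_irq()
unconditionally re-enables interrupts, which would be wrong if reclaim was
entered from a context that already had them disabled. The irqsave variants
restore the caller's interrupt state instead:

	unsigned long flags;

	spin_lock_irq(&zone->lru_lock);
	/* ... */
	spin_unlock_irq(&zone->lru_lock);	/* irqs on now, even if the
						   caller had them disabled */

	spin_lock_irqsave(&zone->lru_lock, flags);
	/* ... */
	spin_unlock_irqrestore(&zone->lru_lock, flags);	/* state preserved */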

Signed-off-by: Christoph Lameter <clameter@sgi.com>

---
 mm/vmscan.c |   36 +++++++++++++++++++-----------------
 1 file changed, 19 insertions(+), 17 deletions(-)

Index: linux-2.6/mm/vmscan.c
===================================================================
--- linux-2.6.orig/mm/vmscan.c	2007-08-14 07:34:25.000000000 -0700
+++ linux-2.6/mm/vmscan.c	2007-08-14 07:34:55.000000000 -0700
@@ -775,11 +775,12 @@ static unsigned long shrink_inactive_lis
 	struct pagevec pvec;
 	unsigned long nr_scanned = 0;
 	unsigned long nr_reclaimed = 0;
+	unsigned long flags;
 
 	pagevec_init(&pvec, 1);
 
 	lru_add_drain();
-	spin_lock_irq(&zone->lru_lock);
+	spin_lock_irqsave(&zone->lru_lock, flags);
 	do {
 		struct page *page;
 		unsigned long nr_taken;
@@ -798,12 +799,12 @@ static unsigned long shrink_inactive_lis
 		__mod_zone_page_state(zone, NR_INACTIVE,
 						-(nr_taken - nr_active));
 		zone->pages_scanned += nr_scan;
-		spin_unlock_irq(&zone->lru_lock);
+		spin_unlock_irqrestore(&zone->lru_lock, flags);
 
 		nr_scanned += nr_scan;
 		nr_freed = shrink_page_list(&page_list, sc);
 		nr_reclaimed += nr_freed;
-		local_irq_disable();
+		local_irq_save(flags);
 		if (current_is_kswapd()) {
 			__count_zone_vm_events(PGSCAN_KSWAPD, zone, nr_scan);
 			__count_vm_events(KSWAPD_STEAL, nr_freed);
@@ -828,15 +829,15 @@ static unsigned long shrink_inactive_lis
 			else
 				add_page_to_inactive_list(zone, page);
 			if (!pagevec_add(&pvec, page)) {
-				spin_unlock_irq(&zone->lru_lock);
+				spin_unlock_irqrestore(&zone->lru_lock, flags);
 				__pagevec_release(&pvec);
-				spin_lock_irq(&zone->lru_lock);
+				spin_lock_irqsave(&zone->lru_lock, flags);
 			}
 		}
   	} while (nr_scanned < max_scan);
-	spin_unlock(&zone->lru_lock);
+	spin_unlock_irqrestore(&zone->lru_lock, flags);
 done:
-	local_irq_enable();
+	local_irq_restore(flags);
 	pagevec_release(&pvec);
 	return nr_reclaimed;
 }
@@ -890,6 +891,7 @@ static void shrink_active_list(unsigned 
 	struct page *page;
 	struct pagevec pvec;
 	int reclaim_mapped = 0;
+	unsigned long flags;
 
 	if (sc->may_swap) {
 		long mapped_ratio;
@@ -939,12 +941,12 @@ force_reclaim_mapped:
 	}
 
 	lru_add_drain();
-	spin_lock_irq(&zone->lru_lock);
+	spin_lock_irqsave(&zone->lru_lock, flags);
 	pgmoved = isolate_lru_pages(nr_pages, &zone->active_list,
 			    &l_hold, &pgscanned, sc->order, ISOLATE_ACTIVE);
 	zone->pages_scanned += pgscanned;
 	__mod_zone_page_state(zone, NR_ACTIVE, -pgmoved);
-	spin_unlock_irq(&zone->lru_lock);
+	spin_unlock_irqrestore(&zone->lru_lock, flags);
 
 	while (!list_empty(&l_hold)) {
 		reclaim_resched(sc);
@@ -963,7 +965,7 @@ force_reclaim_mapped:
 
 	pagevec_init(&pvec, 1);
 	pgmoved = 0;
-	spin_lock_irq(&zone->lru_lock);
+	spin_lock_irqsave(&zone->lru_lock, flags);
 	while (!list_empty(&l_inactive)) {
 		page = lru_to_page(&l_inactive);
 		prefetchw_prev_lru_page(page, &l_inactive, flags);
@@ -976,21 +978,21 @@ force_reclaim_mapped:
 		pgmoved++;
 		if (!pagevec_add(&pvec, page)) {
 			__mod_zone_page_state(zone, NR_INACTIVE, pgmoved);
-			spin_unlock_irq(&zone->lru_lock);
+			spin_unlock_irqrestore(&zone->lru_lock, flags);
 			pgdeactivate += pgmoved;
 			pgmoved = 0;
 			if (buffer_heads_over_limit)
 				pagevec_strip(&pvec);
 			__pagevec_release(&pvec);
-			spin_lock_irq(&zone->lru_lock);
+			spin_lock_irqsave(&zone->lru_lock, flags);
 		}
 	}
 	__mod_zone_page_state(zone, NR_INACTIVE, pgmoved);
 	pgdeactivate += pgmoved;
 	if (buffer_heads_over_limit) {
-		spin_unlock_irq(&zone->lru_lock);
+		spin_unlock_irqrestore(&zone->lru_lock, flags);
 		pagevec_strip(&pvec);
-		spin_lock_irq(&zone->lru_lock);
+		spin_lock_irqsave(&zone->lru_lock, flags);
 	}
 
 	pgmoved = 0;
@@ -1005,16 +1007,16 @@ force_reclaim_mapped:
 		if (!pagevec_add(&pvec, page)) {
 			__mod_zone_page_state(zone, NR_ACTIVE, pgmoved);
 			pgmoved = 0;
-			spin_unlock_irq(&zone->lru_lock);
+			spin_unlock_irqrestore(&zone->lru_lock, flags);
 			__pagevec_release(&pvec);
-			spin_lock_irq(&zone->lru_lock);
+			spin_lock_irqsave(&zone->lru_lock, flags);
 		}
 	}
 	__mod_zone_page_state(zone, NR_ACTIVE, pgmoved);
 
 	__count_zone_vm_events(PGREFILL, zone, pgscanned);
 	__count_vm_events(PGDEACTIVATE, pgdeactivate);
-	spin_unlock_irq(&zone->lru_lock);
+	spin_unlock_irqrestore(&zone->lru_lock, flags);
 
 	pagevec_release(&pvec);
 }


* [RFC 5/9] Save irqflags on taking the mapping lock
  2007-08-14 15:30 [RFC 0/9] Reclaim during GFP_ATOMIC allocs Christoph Lameter
                   ` (3 preceding siblings ...)
  2007-08-14 15:30 ` [RFC 4/9] Atomic reclaim: Save irq flags in vmscan.c Christoph Lameter
@ 2007-08-14 15:30 ` Christoph Lameter
  2007-08-14 15:30 ` [RFC 6/9] Disable irqs on taking the private_lock Christoph Lameter
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 46+ messages in thread
From: Christoph Lameter @ 2007-08-14 15:30 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, akpm, dkegel, Peter Zijlstra, David Miller,
	Nick Piggin

[-- Attachment #1: vmscan_mapping_irqsave --]
[-- Type: text/plain, Size: 1576 bytes --]

Reclaim may be entered with interrupts already disabled, so the mapping's
tree_lock in remove_mapping() must be taken with write_lock_irqsave()
instead of write_lock_irq().

---
 mm/vmscan.c |   13 +++++++++----
 1 file changed, 9 insertions(+), 4 deletions(-)

Index: linux-2.6/mm/vmscan.c
===================================================================
--- linux-2.6.orig/mm/vmscan.c	2007-08-14 07:34:55.000000000 -0700
+++ linux-2.6/mm/vmscan.c	2007-08-14 08:21:58.000000000 -0700
@@ -381,10 +381,12 @@ static pageout_t pageout(struct page *pa
  */
 int remove_mapping(struct address_space *mapping, struct page *page)
 {
+	unsigned long flags;
+
 	BUG_ON(!PageLocked(page));
 	BUG_ON(mapping != page_mapping(page));
 
-	write_lock_irq(&mapping->tree_lock);
+	write_lock_irqsave(&mapping->tree_lock, flags);
 	/*
 	 * The non racy check for a busy page.
 	 *
@@ -419,19 +421,19 @@ int remove_mapping(struct address_space 
 	if (PageSwapCache(page)) {
 		swp_entry_t swap = { .val = page_private(page) };
 		__delete_from_swap_cache(page);
-		write_unlock_irq(&mapping->tree_lock);
+		write_unlock_irqrestore(&mapping->tree_lock, flags);
 		swap_free(swap);
 		__put_page(page);	/* The pagecache ref */
 		return 1;
 	}
 
 	__remove_from_page_cache(page);
-	write_unlock_irq(&mapping->tree_lock);
+	write_unlock_irqrestore(&mapping->tree_lock, flags);
 	__put_page(page);
 	return 1;
 
 cannot_free:
-	write_unlock_irq(&mapping->tree_lock);
+	write_unlock_irqrestore(&mapping->tree_lock, flags);
 	return 0;
 }
 


* [RFC 6/9] Disable irqs on taking the private_lock
  2007-08-14 15:30 [RFC 0/9] Reclaim during GFP_ATOMIC allocs Christoph Lameter
                   ` (4 preceding siblings ...)
  2007-08-14 15:30 ` [RFC 5/9] Save irqflags on taking the mapping lock Christoph Lameter
@ 2007-08-14 15:30 ` Christoph Lameter
  2007-08-14 15:30 ` [RFC 7/9] Save flags in swap.c Christoph Lameter
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 46+ messages in thread
From: Christoph Lameter @ 2007-08-14 15:30 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, akpm, dkegel, Peter Zijlstra, David Miller,
	Nick Piggin

[-- Attachment #1: vmscan_private_lock_irqsave --]
[-- Type: text/plain, Size: 7795 bytes --]

It is necessary to take the private_lock in the reclaim path to
be able to unmap pages. For atomic reclaim we must consistently
disable interrupts.

There is still a FIXME in sync_mapping_buffers(): the private_lock
is passed there as a parameter and the function does not do
the required disabling of irqs and saving of flags.

Signed-off-by: Christoph Lameter <clameter@sgi.com>

---
 fs/buffer.c |   50 ++++++++++++++++++++++++++++++++------------------
 1 file changed, 32 insertions(+), 18 deletions(-)

Index: linux-2.6/fs/buffer.c
===================================================================
--- linux-2.6.orig/fs/buffer.c	2007-08-13 22:43:03.000000000 -0700
+++ linux-2.6/fs/buffer.c	2007-08-14 07:39:22.000000000 -0700
@@ -256,13 +256,14 @@ __find_get_block_slow(struct block_devic
 	struct buffer_head *head;
 	struct page *page;
 	int all_mapped = 1;
+	unsigned long flags;
 
 	index = block >> (PAGE_CACHE_SHIFT - bd_inode->i_blkbits);
 	page = find_get_page(bd_mapping, index);
 	if (!page)
 		goto out;
 
-	spin_lock(&bd_mapping->private_lock);
+	spin_lock_irqsave(&bd_mapping->private_lock, flags);
 	if (!page_has_buffers(page))
 		goto out_unlock;
 	head = page_buffers(page);
@@ -293,7 +294,7 @@ __find_get_block_slow(struct block_devic
 		printk("device blocksize: %d\n", 1 << bd_inode->i_blkbits);
 	}
 out_unlock:
-	spin_unlock(&bd_mapping->private_lock);
+	spin_unlock_irqrestore(&bd_mapping->private_lock, flags);
 	page_cache_release(page);
 out:
 	return ret;
@@ -632,6 +633,11 @@ int sync_mapping_buffers(struct address_
 	if (buffer_mapping == NULL || list_empty(&mapping->private_list))
 		return 0;
 
+	/*
+	 * FIXME: Ugly situation for ATOMIC reclaim. The private_lock
+	 * requires spin_lock_irqsave but we only do a spin_lock in
+	 * fsync_buffers_list!
+	 */
 	return fsync_buffers_list(&buffer_mapping->private_lock,
 					&mapping->private_list);
 }
@@ -658,6 +664,7 @@ void mark_buffer_dirty_inode(struct buff
 {
 	struct address_space *mapping = inode->i_mapping;
 	struct address_space *buffer_mapping = bh->b_page->mapping;
+	unsigned long flags;
 
 	mark_buffer_dirty(bh);
 	if (!mapping->assoc_mapping) {
@@ -666,11 +673,11 @@ void mark_buffer_dirty_inode(struct buff
 		BUG_ON(mapping->assoc_mapping != buffer_mapping);
 	}
 	if (list_empty(&bh->b_assoc_buffers)) {
-		spin_lock(&buffer_mapping->private_lock);
+		spin_lock_irqsave(&buffer_mapping->private_lock, flags);
 		list_move_tail(&bh->b_assoc_buffers,
 				&mapping->private_list);
 		bh->b_assoc_map = mapping;
-		spin_unlock(&buffer_mapping->private_lock);
+		spin_unlock_irqrestore(&buffer_mapping->private_lock, flags);
 	}
 }
 EXPORT_SYMBOL(mark_buffer_dirty_inode);
@@ -736,11 +743,12 @@ static int __set_page_dirty(struct page 
 int __set_page_dirty_buffers(struct page *page)
 {
 	struct address_space *mapping = page_mapping(page);
+	unsigned long flags;
 
 	if (unlikely(!mapping))
 		return !TestSetPageDirty(page);
 
-	spin_lock(&mapping->private_lock);
+	spin_lock_irqsave(&mapping->private_lock, flags);
 	if (page_has_buffers(page)) {
 		struct buffer_head *head = page_buffers(page);
 		struct buffer_head *bh = head;
@@ -750,7 +758,7 @@ int __set_page_dirty_buffers(struct page
 			bh = bh->b_this_page;
 		} while (bh != head);
 	}
-	spin_unlock(&mapping->private_lock);
+	spin_unlock_irqrestore(&mapping->private_lock, flags);
 
 	return __set_page_dirty(page, mapping, 1);
 }
@@ -840,11 +848,12 @@ void invalidate_inode_buffers(struct ino
 		struct address_space *mapping = &inode->i_data;
 		struct list_head *list = &mapping->private_list;
 		struct address_space *buffer_mapping = mapping->assoc_mapping;
+		unsigned long flags;
 
-		spin_lock(&buffer_mapping->private_lock);
+		spin_lock_irqsave(&buffer_mapping->private_lock, flags);
 		while (!list_empty(list))
 			__remove_assoc_queue(BH_ENTRY(list->next));
-		spin_unlock(&buffer_mapping->private_lock);
+		spin_unlock_irqrestore(&buffer_mapping->private_lock, flags);
 	}
 }
 
@@ -862,8 +871,9 @@ int remove_inode_buffers(struct inode *i
 		struct address_space *mapping = &inode->i_data;
 		struct list_head *list = &mapping->private_list;
 		struct address_space *buffer_mapping = mapping->assoc_mapping;
+		unsigned long flags;
 
-		spin_lock(&buffer_mapping->private_lock);
+		spin_lock_irqsave(&buffer_mapping->private_lock, flags);
 		while (!list_empty(list)) {
 			struct buffer_head *bh = BH_ENTRY(list->next);
 			if (buffer_dirty(bh)) {
@@ -872,7 +882,7 @@ int remove_inode_buffers(struct inode *i
 			}
 			__remove_assoc_queue(bh);
 		}
-		spin_unlock(&buffer_mapping->private_lock);
+		spin_unlock_irqrestore(&buffer_mapping->private_lock, flags);
 	}
 	return ret;
 }
@@ -999,6 +1009,7 @@ grow_dev_page(struct block_device *bdev,
 	struct inode *inode = bdev->bd_inode;
 	struct page *page;
 	struct buffer_head *bh;
+	unsigned long flags;
 
 	page = find_or_create_page(inode->i_mapping, index,
 		(mapping_gfp_mask(inode->i_mapping) & ~__GFP_FS)|__GFP_MOVABLE);
@@ -1029,10 +1040,10 @@ grow_dev_page(struct block_device *bdev,
 	 * lock to be atomic wrt __find_get_block(), which does not
 	 * run under the page lock.
 	 */
-	spin_lock(&inode->i_mapping->private_lock);
+	spin_lock_irqsave(&inode->i_mapping->private_lock, flags);
 	link_dev_buffers(page, bh);
 	init_page_buffers(page, bdev, block, size);
-	spin_unlock(&inode->i_mapping->private_lock);
+	spin_unlock_irqrestore(&inode->i_mapping->private_lock, flags);
 	return page;
 
 failed:
@@ -1182,11 +1193,12 @@ void __bforget(struct buffer_head *bh)
 	clear_buffer_dirty(bh);
 	if (!list_empty(&bh->b_assoc_buffers)) {
 		struct address_space *buffer_mapping = bh->b_page->mapping;
+		unsigned long flags;
 
-		spin_lock(&buffer_mapping->private_lock);
+		spin_lock_irqsave(&buffer_mapping->private_lock, flags);
 		list_del_init(&bh->b_assoc_buffers);
 		bh->b_assoc_map = NULL;
-		spin_unlock(&buffer_mapping->private_lock);
+		spin_unlock_irqrestore(&buffer_mapping->private_lock, flags);
 	}
 	__brelse(bh);
 }
@@ -1513,6 +1525,7 @@ void create_empty_buffers(struct page *p
 			unsigned long blocksize, unsigned long b_state)
 {
 	struct buffer_head *bh, *head, *tail;
+	unsigned long flags;
 
 	head = alloc_page_buffers(page, blocksize, 1);
 	bh = head;
@@ -1523,7 +1536,7 @@ void create_empty_buffers(struct page *p
 	} while (bh);
 	tail->b_this_page = head;
 
-	spin_lock(&page->mapping->private_lock);
+	spin_lock_irqsave(&page->mapping->private_lock, flags);
 	if (PageUptodate(page) || PageDirty(page)) {
 		bh = head;
 		do {
@@ -1535,7 +1548,7 @@ void create_empty_buffers(struct page *p
 		} while (bh != head);
 	}
 	attach_page_buffers(page, head);
-	spin_unlock(&page->mapping->private_lock);
+	spin_unlock_irqrestore(&page->mapping->private_lock, flags);
 }
 EXPORT_SYMBOL(create_empty_buffers);
 
@@ -2844,6 +2857,7 @@ int try_to_free_buffers(struct page *pag
 	struct address_space * const mapping = page->mapping;
 	struct buffer_head *buffers_to_free = NULL;
 	int ret = 0;
+	unsigned long flags;
 
 	BUG_ON(!PageLocked(page));
 	if (PageWriteback(page))
@@ -2854,7 +2868,7 @@ int try_to_free_buffers(struct page *pag
 		goto out;
 	}
 
-	spin_lock(&mapping->private_lock);
+	spin_lock_irqsave(&mapping->private_lock, flags);
 	ret = drop_buffers(page, &buffers_to_free);
 
 	/*
@@ -2873,7 +2887,7 @@ int try_to_free_buffers(struct page *pag
 	 */
 	if (ret)
 		cancel_dirty_page(page, PAGE_CACHE_SIZE);
-	spin_unlock(&mapping->private_lock);
+	spin_unlock_irqrestore(&mapping->private_lock, flags);
 out:
 	if (buffers_to_free) {
 		struct buffer_head *bh = buffers_to_free;


* [RFC 7/9] Save flags in swap.c
  2007-08-14 15:30 [RFC 0/9] Reclaim during GFP_ATOMIC allocs Christoph Lameter
                   ` (5 preceding siblings ...)
  2007-08-14 15:30 ` [RFC 6/9] Disable irqs on taking the private_lock Christoph Lameter
@ 2007-08-14 15:30 ` Christoph Lameter
  2007-08-14 15:30 ` [RFC 8/9] Reclaim on an atomic allocation if necessary Christoph Lameter
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 46+ messages in thread
From: Christoph Lameter @ 2007-08-14 15:30 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, akpm, dkegel, Peter Zijlstra, David Miller,
	Nick Piggin

[-- Attachment #1: vmscan_swap_lock_irqsave --]
[-- Type: text/plain, Size: 3548 bytes --]

We need to call various LRU management functions with interrupts
disabled for atomic reclaim. Make them save flags.

Signed-off-by: Christoph Lameter <clameter@sgi.com>

---
 mm/swap.c |   26 +++++++++++++++-----------
 1 file changed, 15 insertions(+), 11 deletions(-)

Index: linux-2.6/mm/swap.c
===================================================================
--- linux-2.6.orig/mm/swap.c	2007-08-14 08:00:57.000000000 -0700
+++ linux-2.6/mm/swap.c	2007-08-14 08:03:50.000000000 -0700
@@ -140,15 +140,16 @@ int rotate_reclaimable_page(struct page 
 void fastcall activate_page(struct page *page)
 {
 	struct zone *zone = page_zone(page);
+	unsigned long flags;
 
-	spin_lock_irq(&zone->lru_lock);
+	spin_lock_irqsave(&zone->lru_lock, flags);
 	if (PageLRU(page) && !PageActive(page)) {
 		del_page_from_inactive_list(zone, page);
 		SetPageActive(page);
 		add_page_to_active_list(zone, page);
 		__count_vm_event(PGACTIVATE);
 	}
-	spin_unlock_irq(&zone->lru_lock);
+	spin_unlock_irqrestore(&zone->lru_lock, flags);
 }
 
 /*
@@ -258,6 +259,7 @@ void release_pages(struct page **pages, 
 	int i;
 	struct pagevec pages_to_free;
 	struct zone *zone = NULL;
+	unsigned long flags = 0;
 
 	pagevec_init(&pages_to_free, cold);
 	for (i = 0; i < nr; i++) {
@@ -281,7 +283,7 @@ void release_pages(struct page **pages, 
 				if (zone)
-					spin_unlock_irq(&zone->lru_lock);
+					spin_unlock_irqrestore(&zone->lru_lock, flags);
 				zone = pagezone;
-				spin_lock_irq(&zone->lru_lock);
+				spin_lock_irqsave(&zone->lru_lock, flags);
 			}
 			VM_BUG_ON(!PageLRU(page));
 			__ClearPageLRU(page);
@@ -298,7 +300,7 @@ void release_pages(struct page **pages, 
   		}
 	}
 	if (zone)
-		spin_unlock_irq(&zone->lru_lock);
+		spin_unlock_irqrestore(&zone->lru_lock, flags);
 
 	pagevec_free(&pages_to_free);
 }
@@ -352,6 +354,7 @@ void __pagevec_lru_add(struct pagevec *p
 {
 	int i;
 	struct zone *zone = NULL;
+	unsigned long flags = 0;
 
 	for (i = 0; i < pagevec_count(pvec); i++) {
 		struct page *page = pvec->pages[i];
@@ -361,14 +364,14 @@ void __pagevec_lru_add(struct pagevec *p
 			if (zone)
-				spin_unlock_irq(&zone->lru_lock);
+				spin_unlock_irqrestore(&zone->lru_lock, flags);
 			zone = pagezone;
-			spin_lock_irq(&zone->lru_lock);
+			spin_lock_irqsave(&zone->lru_lock, flags);
 		}
 		VM_BUG_ON(PageLRU(page));
 		SetPageLRU(page);
 		add_page_to_inactive_list(zone, page);
 	}
 	if (zone)
-		spin_unlock_irq(&zone->lru_lock);
+		spin_unlock_irqrestore(&zone->lru_lock, flags);
 	release_pages(pvec->pages, pvec->nr, pvec->cold);
 	pagevec_reinit(pvec);
 }
@@ -379,6 +382,7 @@ void __pagevec_lru_add_active(struct pag
 {
 	int i;
 	struct zone *zone = NULL;
+	unsigned long flags = 0;
 
 	for (i = 0; i < pagevec_count(pvec); i++) {
 		struct page *page = pvec->pages[i];
@@ -386,9 +390,9 @@ void __pagevec_lru_add_active(struct pag
 
 		if (pagezone != zone) {
 			if (zone)
-				spin_unlock_irq(&zone->lru_lock);
+				spin_unlock_irqrestore(&zone->lru_lock, flags);
 			zone = pagezone;
-			spin_lock_irq(&zone->lru_lock);
+			spin_lock_irqsave(&zone->lru_lock, flags);
 		}
 		VM_BUG_ON(PageLRU(page));
 		SetPageLRU(page);
@@ -397,7 +401,7 @@ void __pagevec_lru_add_active(struct pag
 		add_page_to_active_list(zone, page);
 	}
 	if (zone)
-		spin_unlock_irq(&zone->lru_lock);
+		spin_unlock_irqrestore(&zone->lru_lock, flags);
 	release_pages(pvec->pages, pvec->nr, pvec->cold);
 	pagevec_reinit(pvec);
 }


* [RFC 8/9] Reclaim on an atomic allocation if necessary
  2007-08-14 15:30 [RFC 0/9] Reclaim during GFP_ATOMIC allocs Christoph Lameter
                   ` (6 preceding siblings ...)
  2007-08-14 15:30 ` [RFC 7/9] Save flags in swap.c Christoph Lameter
@ 2007-08-14 15:30 ` Christoph Lameter
  2007-08-14 15:30 ` [RFC 9/9] Testing: Perform GFP_ATOMIC overallocation Christoph Lameter
  2007-08-16  2:49 ` [RFC 0/9] Reclaim during GFP_ATOMIC allocs Nick Piggin
  9 siblings, 0 replies; 46+ messages in thread
From: Christoph Lameter @ 2007-08-14 15:30 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, akpm, dkegel, Peter Zijlstra, David Miller,
	Nick Piggin

[-- Attachment #1: reclaim_on_atomic_alloc --]
[-- Type: text/plain, Size: 1260 bytes --]

Simply call reclaim if we get to a point where we cannot perform
the desired atomic allocation. If the reclaim is successful then
restart the allocation.

This will allow atomic allocations to avoid running out of memory; we
reclaim clean pages instead. If we are in an interrupt then the interrupt
holdoff will be long, since reclaim processing is intensive. However, we
will no longer OOM.

Signed-off-by: Christoph Lameter <clameter@sgi.com>

---
 mm/page_alloc.c |    6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

Index: linux-2.6/mm/page_alloc.c
===================================================================
--- linux-2.6.orig/mm/page_alloc.c	2007-08-14 07:42:09.000000000 -0700
+++ linux-2.6/mm/page_alloc.c	2007-08-14 07:53:34.000000000 -0700
@@ -1326,8 +1326,12 @@ nofail_alloc:
 	}
 
 	/* Atomic allocations - we can't balance anything */
-	if (!wait)
+	if (!wait) {
+		if (try_to_free_pages(zonelist->zones, order, gfp_mask
+							| __GFP_NOMEMALLOC))
+			goto restart;
 		goto nopage;
+	}
 
 	cond_resched();
 


* [RFC 9/9] Testing: Perform GFP_ATOMIC overallocation
  2007-08-14 15:30 [RFC 0/9] Reclaim during GFP_ATOMIC allocs Christoph Lameter
                   ` (7 preceding siblings ...)
  2007-08-14 15:30 ` [RFC 8/9] Reclaim on an atomic allocation if necessary Christoph Lameter
@ 2007-08-14 15:30 ` Christoph Lameter
  2007-08-16  2:49 ` [RFC 0/9] Reclaim during GFP_ATOMIC allocs Nick Piggin
  9 siblings, 0 replies; 46+ messages in thread
From: Christoph Lameter @ 2007-08-14 15:30 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, akpm, dkegel, Peter Zijlstra, David Miller,
	Nick Piggin

[-- Attachment #1: test_timer --]
[-- Type: text/plain, Size: 3381 bytes --]

Trigger a failure or reclaim by allocating large amounts of memory from the
timer interrupt.

This will produce a log of what happened. For example:

Timer: Excessive Atomic allocs
Atomically reclaimed 64 pages
Atomically reclaimed 64 pages
Atomically reclaimed 64 pages
Atomically reclaimed 64 pages
Atomically reclaimed 64 pages
Atomically reclaimed 64 pages
Atomically reclaimed 64 pages
Atomically reclaimed 64 pages
Atomically reclaimed 64 pages
Atomically reclaimed 64 pages
Atomically reclaimed 64 pages
Atomically reclaimed 96 pages
Atomically reclaimed 64 pages
Atomically reclaimed 64 pages
Atomically reclaimed 64 pages
Atomically reclaimed 64 pages
Atomically reclaimed 64 pages
Atomically reclaimed 64 pages
Atomically reclaimed 64 pages
Atomically reclaimed 64 pages
Atomically reclaimed 64 pages
Atomically reclaimed 64 pages
Atomically reclaimed 64 pages
Atomically reclaimed 64 pages
Atomically reclaimed 64 pages
Timer: Memory freed

Signed-off-by: Christoph Lameter <clameter@sgi.com>

---
 kernel/timer.c |   32 ++++++++++++++++++++++++++++++++
 mm/vmscan.c    |    4 ++++
 2 files changed, 36 insertions(+)

Index: linux-2.6/kernel/timer.c
===================================================================
--- linux-2.6.orig/kernel/timer.c	2007-08-14 07:43:21.000000000 -0700
+++ linux-2.6/kernel/timer.c	2007-08-14 07:43:22.000000000 -0700
@@ -817,6 +817,12 @@ unsigned long next_timer_interrupt(void)
 #endif
 
 /*
+ * min_free_kbytes is 2M. 3000 pages give us 12M, which is
+ * enough to exhaust the reserves.
+ */
+#define NR_TEST 3000
+
+/*
  * Called from the timer interrupt handler to charge one tick to the current 
  * process.  user_tick is 1 if the tick is user time, 0 for system.
  */
@@ -824,6 +830,9 @@ void update_process_times(int user_tick)
 {
 	struct task_struct *p = current;
 	int cpu = smp_processor_id();
+	struct page **base;
+	int i;
+	static unsigned long lasttime = 0;
 
 	/* Note: this timer irq context must be accounted for as well. */
 	if (user_tick)
@@ -835,6 +844,29 @@ void update_process_times(int user_tick)
 		rcu_check_callbacks(cpu, user_tick);
 	scheduler_tick();
 	run_posix_cpu_timers(p);
+
+	/* Every 2 minutes */
+	if (jiffies % (120 * HZ) == 0 && time_after(jiffies, lasttime)) {
+		printk(KERN_CRIT "Timer: Excessive Atomic allocs\n");
+		/* Force memory to become exhausted */
+		base = kzalloc(NR_TEST * sizeof(void *), GFP_ATOMIC);
+		if (!base)
+			return;
+
+		for (i = 0; i < NR_TEST; i++) {
+			base[i] = alloc_page(GFP_ATOMIC);
+			if (!base[i]) {
+				printk("Alloc failed at %d\n", i);
+				break;
+			}
+		}
+		for (i = 0; i < NR_TEST; i++)
+			if (base[i])
+				put_page(base[i]);
+		kfree(base);
+		printk(KERN_CRIT "Timer: Memory freed\n");
+		lasttime = jiffies;
+	}
 }
 
 /*
Index: linux-2.6/mm/vmscan.c
===================================================================
--- linux-2.6.orig/mm/vmscan.c	2007-08-14 07:53:17.000000000 -0700
+++ linux-2.6/mm/vmscan.c	2007-08-14 08:09:12.000000000 -0700
@@ -1232,6 +1232,10 @@ out:
 
 		zone->prev_priority = priority;
 	}
+
+	if (!(gfp_mask & __GFP_WAIT))
+		printk(KERN_WARNING "Atomically reclaimed %lu pages\n", nr_reclaimed);
+
 	return ret;
 }
 


* Re: [RFC 4/9] Atomic reclaim: Save irq flags in vmscan.c
  2007-08-14 20:02   ` Andi Kleen
@ 2007-08-14 19:12     ` Christoph Lameter
  2007-08-14 20:05       ` Peter Zijlstra
  2007-08-14 20:33       ` Andi Kleen
  0 siblings, 2 replies; 46+ messages in thread
From: Christoph Lameter @ 2007-08-14 19:12 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-mm, linux-kernel

On Tue, 14 Aug 2007, Andi Kleen wrote:

> Christoph Lameter <clameter@sgi.com> writes:
> 
> > Reclaim can be called with interrupts disabled in atomic reclaim.
> > vmscan.c is currently using spin_lock_irq(). Switch to spin_lock_irqsave().
> 
> I like the idea in principle. If this fully works out we could
> potentially keep less memory free by default which would be a good
> thing in general: free memory is bad memory.

Right.
 
> But would be interesting to measure what the lock
> changes do to interrupt latency. Probably nothing good.

Yup.
 
> A more benign alternative might be to just set a per CPU flag during
> these critical sections and then only do atomic reclaim on a local
> interrupt when the flag is not set.  That would make it a little less
> reliable, but much less intrusive and with some luck still give many
> of the benefits.

There are other lock interactions that may cause problems. If we do not 
switch to the saving of irq flags then all involved spinlocks must become 
trylocks because the interrupt could have happened while the spinlock is 
held. So interrupts must be disabled on locks acquired during an 
interrupt.
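
I.e. something like this for every lock that reclaim may take from an
interrupt (hypothetical sketch, not proposed code):

	if (!spin_trylock(&zone->lru_lock))
		/* The lock may be held by the context we interrupted
		 * on this cpu, so we cannot spin on it. Give up. */
		return 0;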


* Re: [RFC 4/9] Atomic reclaim: Save irq flags in vmscan.c
  2007-08-14 15:30 ` [RFC 4/9] Atomic reclaim: Save irq flags in vmscan.c Christoph Lameter
@ 2007-08-14 20:02   ` Andi Kleen
  2007-08-14 19:12     ` Christoph Lameter
  0 siblings, 1 reply; 46+ messages in thread
From: Andi Kleen @ 2007-08-14 20:02 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: linux-mm, linux-kernel

Christoph Lameter <clameter@sgi.com> writes:

> Reclaim can be called with interrupts disabled in atomic reclaim.
> vmscan.c is currently using spin_lock_irq(). Switch to spin_lock_irqsave().

I like the idea in principle. If this fully works out we could
potentially keep less memory free by default which would be a good
thing in general: free memory is bad memory.

But would be interesting to measure what the lock
changes do to interrupt latency. Probably nothing good.

A more benign alternative might be to just set a per CPU flag during
these critical sections and then only do atomic reclaim on a local
interrupt when the flag is not set.  That would make it a little less
reliable, but much less intrusive and with some luck still give many
of the benefits.

-Andi


* Re: [RFC 4/9] Atomic reclaim: Save irq flags in vmscan.c
  2007-08-14 19:12     ` Christoph Lameter
@ 2007-08-14 20:05       ` Peter Zijlstra
  2007-08-14 20:34         ` Andi Kleen
  2007-08-14 20:33       ` Andi Kleen
  1 sibling, 1 reply; 46+ messages in thread
From: Peter Zijlstra @ 2007-08-14 20:05 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Andi Kleen, linux-mm, linux-kernel, Ingo Molnar

On Tue, 2007-08-14 at 12:12 -0700, Christoph Lameter wrote:
> On Tue, 14 Aug 2007, Andi Kleen wrote:
> 
> > Christoph Lameter <clameter@sgi.com> writes:
> > 
> > > Reclaim can be called with interrupts disabled in atomic reclaim.
> > > vmscan.c is currently using spin_lock_irq(). Switch to spin_lock_irqsave().
> > 
> > I like the idea in principle. If this fully works out we could
> > potentially keep less memory free by default which would be a good
> > thing in general: free memory is bad memory.
> 
> Right.
>  
> > But would be interesting to measure what the lock
> > changes do to interrupt latency. Probably nothing good.
> 
> Yup.
>  
> > A more benign alternative might be to just set a per CPU flag during
> > these critical sections and then only do atomic reclaim on a local
> > interrupt when the flag is not set.  That would make it a little less
> > reliable, but much less intrusive and with some luck still give many
> > of the benefits.
> 
> There are other lock interactions that may cause problems. If we do not 
> switch to the saving of irq flags then all involved spinlocks must become 
> trylocks because the interrupt could have happened while the spinlock is 
> held. So interrupts must be disabled on locks acquired during an 
> interrupt.

A much simpler approach to this seems to use threaded interrupts like
-rt does.


* Re: [RFC 4/9] Atomic reclaim: Save irq flags in vmscan.c
  2007-08-14 19:12     ` Christoph Lameter
  2007-08-14 20:05       ` Peter Zijlstra
@ 2007-08-14 20:33       ` Andi Kleen
  2007-08-14 20:42         ` Christoph Lameter
  1 sibling, 1 reply; 46+ messages in thread
From: Andi Kleen @ 2007-08-14 20:33 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Andi Kleen, linux-mm, linux-kernel

> There are other lock interactions that may cause problems. If we do not 
> switch to the saving of irq flags then all involved spinlocks must become 
> trylocks because the interrupt could have happened while the spinlock is 
> held. So interrupts must be disabled on locks acquired during an 
> interrupt.

I was thinking of a per cpu flag that is set before and unset after
taking the lock in process context. If the flag is set the interrupt
will never try to take the spinlock and return NULL instead. 
That should be equivalent to cli/sti for this special case.

-Andi


* Re: [RFC 4/9] Atomic reclaim: Save irq flags in vmscan.c
  2007-08-14 20:05       ` Peter Zijlstra
@ 2007-08-14 20:34         ` Andi Kleen
  0 siblings, 0 replies; 46+ messages in thread
From: Andi Kleen @ 2007-08-14 20:34 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Christoph Lameter, Andi Kleen, linux-mm, linux-kernel,
	Ingo Molnar

> A much simpler approach to this seems to use threaded interrupts like
> -rt does.

Then the interrupt could potentially stay blocked for a very long time,
waiting for process context to finish its work. Also not good.
Essentially it would be equivalent to cli/sti for interrupts
that need to free memory.

-Andi


* Re: [RFC 4/9] Atomic reclaim: Save irq flags in vmscan.c
  2007-08-14 20:33       ` Andi Kleen
@ 2007-08-14 20:42         ` Christoph Lameter
  2007-08-14 20:44           ` Andi Kleen
  0 siblings, 1 reply; 46+ messages in thread
From: Christoph Lameter @ 2007-08-14 20:42 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-mm, linux-kernel

On Tue, 14 Aug 2007, Andi Kleen wrote:

> > There are other lock interactions that may cause problems. If we do not 
> > switch to the saving of irq flags then all involved spinlocks must become 
> > trylocks because the interrupt could have happened while the spinlock is 
> > held. So interrupts must be disabled on locks acquired during an 
> > interrupt.
> 
> I was thinking of a per cpu flag that is set before and unset after
> taking the lock in process context. If the flag is set the interrupt
> will never try to take the spinlock and return NULL instead. 
> That should be equivalent to cli/sti for this special case.

Hmmmm... The spinlock is its own flag. If the lock is taken then the flag 
is set. So if we check all relevant spinlocks before going into reclaim 
then we could return NULL.


* Re: [RFC 4/9] Atomic reclaim: Save irq flags in vmscan.c
  2007-08-14 20:42         ` Christoph Lameter
@ 2007-08-14 20:44           ` Andi Kleen
  2007-08-14 21:15             ` Christoph Lameter
  0 siblings, 1 reply; 46+ messages in thread
From: Andi Kleen @ 2007-08-14 20:44 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Andi Kleen, linux-mm, linux-kernel

> Hmmmm... The spinlock is its own flag.

Yes, but it's not CPU local. Taking the spinlock from another CPU's
interrupt handler is perfectly safe, just not from the local CPU.
If you use the spinlock as flag you would need to lock out everybody.

-Andi


* Re: [RFC 4/9] Atomic reclaim: Save irq flags in vmscan.c
  2007-08-14 20:44           ` Andi Kleen
@ 2007-08-14 21:15             ` Christoph Lameter
  2007-08-14 21:23               ` Andi Kleen
  0 siblings, 1 reply; 46+ messages in thread
From: Christoph Lameter @ 2007-08-14 21:15 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-mm, linux-kernel

On Tue, 14 Aug 2007, Andi Kleen wrote:

> > Hmmmm... The spinlock is its own flag.
> 
> Yes, but it's not CPU local. Taking the spinlock from another CPU's
> interrupt handler is perfectly safe, just not from the local CPU.
> If you use the spinlock as flag you would need to lock out everybody.

So every spinlock would have an array of chars sized to NR_CPUS and set 
the flag when the lock is taken?


* Re: [RFC 4/9] Atomic reclaim: Save irq flags in vmscan.c
  2007-08-14 21:15             ` Christoph Lameter
@ 2007-08-14 21:23               ` Andi Kleen
  2007-08-14 21:26                 ` Christoph Lameter
  0 siblings, 1 reply; 46+ messages in thread
From: Andi Kleen @ 2007-08-14 21:23 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Andi Kleen, linux-mm, linux-kernel

On Tue, Aug 14, 2007 at 02:15:12PM -0700, Christoph Lameter wrote:
> On Tue, 14 Aug 2007, Andi Kleen wrote:
> 
> > > Hmmmm... The spinlock is its own flag.
> > 
> > Yes, but it's not CPU local. Taking the spinlock from another CPU's
> > interrupt handler is perfectly safe, just not from the local CPU.
> > If you use the spinlock as flag you would need to lock out everybody.
> 
> So every spinlock would have an array of chars sized to NR_CPUS and set 
> the flag when the lock is taken?

I was more thinking of a single per cpu flag for all of page reclaim
That keeps it also cache local.

-Andi


* Re: [RFC 4/9] Atomic reclaim: Save irq flags in vmscan.c
  2007-08-14 21:23               ` Andi Kleen
@ 2007-08-14 21:26                 ` Christoph Lameter
  2007-08-14 21:29                   ` Andi Kleen
  0 siblings, 1 reply; 46+ messages in thread
From: Christoph Lameter @ 2007-08-14 21:26 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-mm, linux-kernel

On Tue, 14 Aug 2007, Andi Kleen wrote:

> > So every spinlock would have an array of chars sized to NR_CPUS and set 
> > the flag when the lock is taken?
> 
> I was more thinking of a single per cpu flag for all of page reclaim
> That keeps it also cache local.

We already have such a flag in the zone structure

        /* A count of how many reclaimers are scanning this zone */
        atomic_t                reclaim_in_progress;


The problem is that the LRU lock etc. is also taken outside of reclaim. In
order for the flag to work we would have to set it everywhere the lru lock
etc. is taken.
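
As a sketch (purely hypothetical reuse of that counter), the interrupt
path could then do:

	if (atomic_read(&zone->reclaim_in_progress))
		/* Reclaim or LRU manipulation may be running on this
		 * cpu already. Skip atomic reclaim. */
		return NULL;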


* Re: [RFC 4/9] Atomic reclaim: Save irq flags in vmscan.c
  2007-08-14 21:26                 ` Christoph Lameter
@ 2007-08-14 21:29                   ` Andi Kleen
  2007-08-14 21:37                     ` Christoph Lameter
  0 siblings, 1 reply; 46+ messages in thread
From: Andi Kleen @ 2007-08-14 21:29 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Andi Kleen, linux-mm, linux-kernel

On Tue, Aug 14, 2007 at 02:26:43PM -0700, Christoph Lameter wrote:
> On Tue, 14 Aug 2007, Andi Kleen wrote:
> 
> > > So every spinlock would have an array of chars sized to NR_CPUS and set 
> > > the flag when the lock is taken?
> > 
> > I was more thinking of a single per cpu flag for all of page reclaim
> > That keeps it also cache local.
> 
> We already have such a flag in the zone structure

Zone structure is not strictly CPU local so it's broader
than needed. But it might work.

-Andi


* Re: [RFC 4/9] Atomic reclaim: Save irq flags in vmscan.c
  2007-08-14 21:29                   ` Andi Kleen
@ 2007-08-14 21:37                     ` Christoph Lameter
  2007-08-14 21:44                       ` Andi Kleen
  0 siblings, 1 reply; 46+ messages in thread
From: Christoph Lameter @ 2007-08-14 21:37 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-mm, linux-kernel

On Tue, 14 Aug 2007, Andi Kleen wrote:

> > We already have such a flag in the zone structure
> 
> Zone structure is not strictly CPU local so it's broader
> than needed. But it might work.

We could convert this into a per cpu array?

But that still creates lots of overhead each time we take the lru lock!


* Re: [RFC 4/9] Atomic reclaim: Save irq flags in vmscan.c
  2007-08-14 21:37                     ` Christoph Lameter
@ 2007-08-14 21:44                       ` Andi Kleen
  2007-08-14 21:48                         ` Christoph Lameter
  0 siblings, 1 reply; 46+ messages in thread
From: Andi Kleen @ 2007-08-14 21:44 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Andi Kleen, linux-mm, linux-kernel

On Tue, Aug 14, 2007 at 02:37:27PM -0700, Christoph Lameter wrote:
> On Tue, 14 Aug 2007, Andi Kleen wrote:
> 
> > > We already have such a flag in the zone structure
> > 
> > Zone structure is not strictly CPU local so it's broader
> > than needed. But it might work.
> 
> We could convert this into a per cpu array?

Perhaps. That would make it more expensive to read for
its current users though. 

> But that still creates lots of overhead each time we take the lru lock!

A lot of overhead in what way? Setting a flag in a cache hot
per CPU data variable shouldn't be more than a few cycles.

-Andi


* Re: [RFC 4/9] Atomic reclaim: Save irq flags in vmscan.c
  2007-08-14 21:44                       ` Andi Kleen
@ 2007-08-14 21:48                         ` Christoph Lameter
  2007-08-14 21:56                           ` Andi Kleen
  0 siblings, 1 reply; 46+ messages in thread
From: Christoph Lameter @ 2007-08-14 21:48 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-mm, linux-kernel

On Tue, 14 Aug 2007, Andi Kleen wrote:

> > But that still creates lots of overhead each time we take the lru lock!
> 
> A lot of overhead in what way? Setting a flag in a cache hot
> per CPU data variable shouldn't be more than a few cycles.

Could you be a bit more specific? Where do you want to place the data?

What we are talking about is

atomic_inc(&zone->reclaim_cpu[smp_processor_id()]);
smp_wmb();
spin_lock(&zone->lru_lock);

....

spin_unlock(&zone->lru_lock);
smp_wmb();
atomic_dec(&zone->reclaim_cpu[smp_processor_id()]);

That is not light weight.


* Re: [RFC 4/9] Atomic reclaim: Save irq flags in vmscan.c
  2007-08-14 21:48                         ` Christoph Lameter
@ 2007-08-14 21:56                           ` Andi Kleen
  2007-08-14 22:07                             ` Christoph Lameter
  0 siblings, 1 reply; 46+ messages in thread
From: Andi Kleen @ 2007-08-14 21:56 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Andi Kleen, linux-mm, linux-kernel

On Tue, Aug 14, 2007 at 02:48:31PM -0700, Christoph Lameter wrote:
> On Tue, 14 Aug 2007, Andi Kleen wrote:
> 
> > > But that still creates lots of overhead each time we take the lru lock!
> > 
> > A lot of overhead in what way? Setting a flag in a cache hot
> > per CPU data variable shouldn't be more than a few cycles.
> 
> Could you be a bit more specific? Where do you want to place the data?

DEFINE_PER_CPU(int, zone_flag);

	get_cpu();	// likely already true and then not needed
	__get_cpu_var(zone_flag) = 1;
	/* wmb is implied in spin_lock I think */
	spin_lock(&zone->lru_lock);
	...
	spin_unlock(&zone->lru_lock);
	__get_cpu_var(zone_flag) = 0;
	put_cpu();

Interrupt handler

	if (!__get_cpu_var(zone_flag)) {
		do things with zone locks 
	}

The interrupt handler shouldn't touch zone_flag. If it wants
to it would need to be converted to a local_t and incremented/decremented
(should be about the same cost at least on architectures with sane
local_t implementation) 
		
-Andi


* Re: [RFC 4/9] Atomic reclaim: Save irq flags in vmscan.c
  2007-08-14 21:56                           ` Andi Kleen
@ 2007-08-14 22:07                             ` Christoph Lameter
  2007-08-14 22:16                               ` Andi Kleen
  0 siblings, 1 reply; 46+ messages in thread
From: Christoph Lameter @ 2007-08-14 22:07 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-mm, linux-kernel

On Tue, 14 Aug 2007, Andi Kleen wrote:

> > Could you be a bit more specific? Where do you want to place the data?
> 
> DEFINE_PER_CPU(int, zone_flag);
> 
> 	get_cpu();	// preemption is likely already disabled, making this unneeded
> 	__get_cpu_var(zone_flag) = 1;
> 	/* wmb is implied in spin_lock I think */

No, it's not. Only on x64, which has implicit write ordering.

> 	spin_lock(&zone->lru_lock);
> 	...
> 	spin_unlock(&zone->lru_lock);
> 	__get_cpu_var(zone_flag) = 0;
> 	put_cpu();
> 
> Interrupt handler
> 
> 	if (!__get_cpu_var(zone_flag)) {

There are more spinlocks needed. So we would just check the whole bunch 
and fail if any of them are used?

> 		do things with zone locks 
> 	}
> 
> The interrupt handler shouldn't touch zone_flag. If it wants
> to, it would need to be converted to a local_t and incremented/decremented
> (should be about the same cost, at least on architectures with a sane
> local_t implementation).

That would mean we need to fork the code for reclaim?


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC 4/9] Atomic reclaim: Save irq flags in vmscan.c
  2007-08-14 22:07                             ` Christoph Lameter
@ 2007-08-14 22:16                               ` Andi Kleen
  2007-08-14 22:20                                 ` Christoph Lameter
  0 siblings, 1 reply; 46+ messages in thread
From: Andi Kleen @ 2007-08-14 22:16 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Andi Kleen, linux-mm, linux-kernel

On Tue, Aug 14, 2007 at 03:07:10PM -0700, Christoph Lameter wrote:
> There are more spinlocks needed. So we would just check the whole bunch 
> and fail if any of them are used?

Yes zone_flag would apply to all of them.

> 
> > 		do things with zone locks 
> > 	}
> > 
> > The interrupt handler shouldn't touch zone_flag. If it wants
> > to, it would need to be converted to a local_t and incremented/decremented
> > (should be about the same cost, at least on architectures with a sane
> > local_t implementation).
> 
> That would mean we need to fork the code for reclaim?

Not with the local_t increment.

-Andi


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC 4/9] Atomic reclaim: Save irq flags in vmscan.c
  2007-08-14 22:16                               ` Andi Kleen
@ 2007-08-14 22:20                                 ` Christoph Lameter
  2007-08-14 22:21                                   ` Andi Kleen
  0 siblings, 1 reply; 46+ messages in thread
From: Christoph Lameter @ 2007-08-14 22:20 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-mm, linux-kernel

On Wed, 15 Aug 2007, Andi Kleen wrote:

> > > The interrupt handler shouldn't touch zone_flag. If it wants
> > > to, it would need to be converted to a local_t and incremented/decremented
> > > (should be about the same cost, at least on architectures with a sane
> > > local_t implementation).
> > 
> > That would mean we need to fork the code for reclaim?
> 
> Not with the local_t increment.

Ok, I have a vague idea of how this could work, but it's likely that the
changes make things worse rather than better: an additional reference to a
new cacheline (per-CPU, but still), preempt disable, lots of code at all
call sites. Interrupt enable/disable is quite efficient in recent
processors.



^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC 4/9] Atomic reclaim: Save irq flags in vmscan.c
  2007-08-14 22:20                                 ` Christoph Lameter
@ 2007-08-14 22:21                                   ` Andi Kleen
  2007-08-14 22:41                                     ` Christoph Lameter
  0 siblings, 1 reply; 46+ messages in thread
From: Andi Kleen @ 2007-08-14 22:21 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Andi Kleen, linux-mm, linux-kernel

> Ok, I have a vague idea of how this could work, but it's likely that the
> changes make things worse rather than better: an additional reference to a
> new cacheline (per-CPU, but still), preempt disable, lots of code at all
> call sites. Interrupt enable/disable is quite efficient in recent
> processors.

The goal of this was not to be faster than interrupt disable,
but to avoid the interrupt latency impact. This might be a problem
when spending a lot of time inside the locks.

-Andi


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC 4/9] Atomic reclaim: Save irq flags in vmscan.c
  2007-08-14 22:21                                   ` Andi Kleen
@ 2007-08-14 22:41                                     ` Christoph Lameter
  0 siblings, 0 replies; 46+ messages in thread
From: Christoph Lameter @ 2007-08-14 22:41 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-mm, linux-kernel

On Wed, 15 Aug 2007, Andi Kleen wrote:

> > Ok, I have a vague idea of how this could work, but it's likely that the
> > changes make things worse rather than better: an additional reference to a
> > new cacheline (per-CPU, but still), preempt disable, lots of code at all
> > call sites. Interrupt enable/disable is quite efficient in recent
> > processors.
> 
> The goal of this was not to be faster than interrupt disable,
> but to avoid the interrupt latency impact. This might be a problem
> when spending a lot of time inside the locks.

Both. They need to be fast too, and not complicate the kernel too much. I
have not seen a serious holdoff case. The biggest issue is still the
zone->lru_lock, but interrupts are always disabled for that one already.



^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC 0/9] Reclaim during GFP_ATOMIC allocs
  2007-08-14 15:30 [RFC 0/9] Reclaim during GFP_ATOMIC allocs Christoph Lameter
                   ` (8 preceding siblings ...)
  2007-08-14 15:30 ` [RFC 9/9] Testing: Perform GFP_ATOMIC overallocation Christoph Lameter
@ 2007-08-16  2:49 ` Nick Piggin
  2007-08-16 20:24   ` Christoph Lameter
  9 siblings, 1 reply; 46+ messages in thread
From: Nick Piggin @ 2007-08-16  2:49 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: linux-mm, linux-kernel, akpm, dkegel, Peter Zijlstra,
	David Miller

On Tue, Aug 14, 2007 at 08:30:21AM -0700, Christoph Lameter wrote:
> This is the extended version of the reclaim patchset. It enables reclaim from
> clean file backed pages during GFP_ATOMIC allocs. A bit invasive since
> many locks must now be taken with saving flags. But it works.
> 
> Tested by repeatedly allocating 12MB of memory from the timer interrupt.
> 
> -- 

Just to clarify... I can see how recursive reclaim can prevent memory getting
eaten up by reclaim (which thus causes allocations from interrupt handlers to
fail)...

But I don't see how this patchset will do anything to prevent reclaim
deadlocks, right? (because if there is reclaimable memory at hand, then
kswapd should eventually reclaim it).


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC 0/9] Reclaim during GFP_ATOMIC allocs
  2007-08-16  2:49 ` [RFC 0/9] Reclaim during GFP_ATOMIC allocs Nick Piggin
@ 2007-08-16 20:24   ` Christoph Lameter
  0 siblings, 0 replies; 46+ messages in thread
From: Christoph Lameter @ 2007-08-16 20:24 UTC (permalink / raw)
  To: Nick Piggin
  Cc: linux-mm, linux-kernel, akpm, dkegel, Peter Zijlstra,
	David Miller

On Thu, 16 Aug 2007, Nick Piggin wrote:

> Just to clarify... I can see how recursive reclaim can prevent memory getting
> eaten up by reclaim (which thus causes allocations from interrupt handlers to
> fail)...
> 
> But I don't see how this patchset will do anything to prevent reclaim
> deadlocks, right? (because if there is reclaimable memory at hand, then
> kswapd should eventually reclaim it).

What deadlocks are you thinking about? Reclaim can be run concurrently 
right now.



^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC 2/9] Use NOMEMALLOC reclaim to allow reclaim if PF_MEMALLOC is set
  2007-08-14 15:30 ` [RFC 2/9] Use NOMEMALLOC reclaim to allow reclaim if PF_MEMALLOC is set Christoph Lameter
@ 2007-08-18  7:10   ` Pavel Machek
  2007-08-20 19:00     ` Christoph Lameter
  0 siblings, 1 reply; 46+ messages in thread
From: Pavel Machek @ 2007-08-18  7:10 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: linux-mm, linux-kernel, akpm, dkegel, Peter Zijlstra,
	David Miller, Nick Piggin

Hi!

> If we exhaust the reserves in the page allocator when PF_MEMALLOC is set
> then we no longer give up, but call into reclaim with PF_MEMALLOC set.
> 
> This is in essence a recursive call back into page reclaim with an
> additional gfp flag (__GFP_NOMEMALLOC) set. The recursion is bounded since
> potential allocations with __GFP_NOMEMALLOC set will not enter that branch
> again.
> 
> This means that allocation under PF_MEMALLOC will no longer run out of
> memory. Allocations under PF_MEMALLOC will do a limited form of reclaim
> instead.
> 
> The reclaim is of particular importance to stacked filesystems that may
> do a lot of allocations in the write path. Reclaim will be working
> as long as there are clean file backed pages to reclaim.
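
(To illustrate: the branch described above would amount to roughly the
following in the allocator's slow path, where p is current. This is a
sketch written from the description; the guard and the
try_to_free_pages() call are assumptions, not lines copied from the
patch.)

	/* A PF_MEMALLOC allocation found the reserves exhausted:
	 * recurse into reclaim, but forbid reclaim itself from using
	 * the reserves.  __GFP_NOMEMALLOC bounds the recursion since
	 * the nested attempt skips this branch. */
	if ((p->flags & PF_MEMALLOC) && !(gfp_mask & __GFP_NOMEMALLOC)) {
		try_to_free_pages(zonelist->zones, order,
					gfp_mask | __GFP_NOMEMALLOC);
		goto restart;
	}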

I don't get it. Let's say that we have a stacked filesystem that needs
it. That filesystem is broken today.

Now you give it a second chance by reclaiming clean pages, but there are
no guarantees that we have any.... so that filesystem is still broken
with your patch...?

Should we fix the filesystem instead?
							Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC 2/9] Use NOMEMALLOC reclaim to allow reclaim if PF_MEMALLOC is set
  2007-08-18  7:10   ` Pavel Machek
@ 2007-08-20 19:00     ` Christoph Lameter
  2007-08-20 20:17       ` Peter Zijlstra
  0 siblings, 1 reply; 46+ messages in thread
From: Christoph Lameter @ 2007-08-20 19:00 UTC (permalink / raw)
  To: Pavel Machek
  Cc: linux-mm, linux-kernel, akpm, dkegel, Peter Zijlstra,
	David Miller, Nick Piggin

On Sat, 18 Aug 2007, Pavel Machek wrote:

> > The reclaim is of particular importance to stacked filesystems that may
> > do a lot of allocations in the write path. Reclaim will be working
> > as long as there are clean file backed pages to reclaim.
> 
> I don't get it. Let's say that we have a stacked filesystem that needs
> it. That filesystem is broken today.
> 
> Now you give it a second chance by reclaiming clean pages, but there are
> no guarantees that we have any.... so that filesystem is still broken
> with your patch...?

There is a guarantee that we have some because the user space program is 
executing. Meaning the executable pages can be retrieved. The amount of
dirty memory in the system is limited by the dirty_ratio. So the VM can 
only get into trouble if there is a sufficient amount of anonymous pages 
and all executables have been reclaimed. That is pretty rare.

Plus the same issue can happen today. Writes are usually not completed 
during reclaim. If the writes are sufficiently deferred then you have the 
same issue now.


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC 2/9] Use NOMEMALLOC reclaim to allow reclaim if PF_MEMALLOC is set
  2007-08-20 19:00     ` Christoph Lameter
@ 2007-08-20 20:17       ` Peter Zijlstra
  2007-08-20 20:27         ` Christoph Lameter
  0 siblings, 1 reply; 46+ messages in thread
From: Peter Zijlstra @ 2007-08-20 20:17 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Pavel Machek, linux-mm, linux-kernel, akpm, dkegel, David Miller,
	Nick Piggin

On Mon, 2007-08-20 at 12:00 -0700, Christoph Lameter wrote:
> On Sat, 18 Aug 2007, Pavel Machek wrote:
> 
> > > The reclaim is of particular importance to stacked filesystems that may
> > > do a lot of allocations in the write path. Reclaim will be working
> > > as long as there are clean file backed pages to reclaim.
> > 
> > I don't get it. Let's say that we have a stacked filesystem that needs
> > it. That filesystem is broken today.
> > 
> > Now you give it a second chance by reclaiming clean pages, but there are
> > no guarantees that we have any.... so that filesystem is still broken
> > with your patch...?
> 
> There is a guarantee that we have some because the user space program is 
> executing. Meaning the executable pages can be retrieved. The amount of
> dirty memory in the system is limited by the dirty_ratio. So the VM can 
> only get into trouble if there is a sufficient amount of anonymous pages 
> and all executables have been reclaimed. That is pretty rare.
> 
> Plus the same issue can happen today. Writes are usually not completed 
> during reclaim. If the writes are sufficiently deferred then you have the 
> same issue now.

Once we have initiated (disk) writeout we do not need more memory to
complete it, all we need to do is wait for the completion interrupt.

Networking is different here in that an unbounded amount of net traffic
needs to be processed in order to find the completion event.


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC 2/9] Use NOMEMALLOC reclaim to allow reclaim if PF_MEMALLOC is set
  2007-08-20 20:17       ` Peter Zijlstra
@ 2007-08-20 20:27         ` Christoph Lameter
  2007-08-20 21:14           ` Peter Zijlstra
  0 siblings, 1 reply; 46+ messages in thread
From: Christoph Lameter @ 2007-08-20 20:27 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Pavel Machek, linux-mm, linux-kernel, akpm, dkegel, David Miller,
	Nick Piggin

On Mon, 20 Aug 2007, Peter Zijlstra wrote:

> > Plus the same issue can happen today. Writes are usually not completed 
> > during reclaim. If the writes are sufficiently deferred then you have the 
> > same issue now.
> 
> Once we have initiated (disk) writeout we do not need more memory to
> complete it, all we need to do is wait for the completion interrupt.

We cannot reclaim the page as long as the I/O is not complete. If you 
have too many anonymous pages and the rest of memory is dirty then you can 
get into OOM scenarios even without this patch.

> Networking is different here in that an unbounded amount of net traffic
> needs to be processed in order to find the completion event.

It's not that different. Pages are pinned during writeout from reclaim and
it is not clear when the write will complete. There are no bounds that I
know of in reclaim for the writeback of dirty anonymous pages.

But some throttling function, like the one for dirty pages, is likely
needed for network traffic.



^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC 2/9] Use NOMEMALLOC reclaim to allow reclaim if PF_MEMALLOC is set
  2007-08-20 20:27         ` Christoph Lameter
@ 2007-08-20 21:14           ` Peter Zijlstra
  2007-08-20 21:17             ` Christoph Lameter
  2007-08-21  0:39             ` Nick Piggin
  0 siblings, 2 replies; 46+ messages in thread
From: Peter Zijlstra @ 2007-08-20 21:14 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Pavel Machek, linux-mm, linux-kernel, akpm, dkegel, David Miller,
	Nick Piggin

On Mon, 2007-08-20 at 13:27 -0700, Christoph Lameter wrote:
> On Mon, 20 Aug 2007, Peter Zijlstra wrote:
> 
> > > Plus the same issue can happen today. Writes are usually not completed 
> > > during reclaim. If the writes are sufficiently deferred then you have the 
> > > same issue now.
> > 
> > Once we have initiated (disk) writeout we do not need more memory to
> > complete it, all we need to do is wait for the completion interrupt.
> 
> We cannot reclaim the page as long as the I/O is not complete. If you 
> have too many anonymous pages and the rest of memory is dirty then you can 
> get into OOM scenarios even without this patch.

As long as the reserve is large enough to completely initialize writeout
of a single page we can make progress. Once writeout is initialized the
completion interrupt is guaranteed to happen (assuming working
hardware).

This is why I can happily run a 256M anonymous workload on a machine
with only 128M of memory.

> > Networking is different here in that an unbounded amount of net traffic
> > needs to be processed in order to find the completion event.
> 
> > It's not that different.

Yes it is: disk-based completion does not require memory; network-based
completion requires unbounded memory.

>  Pages are pinned during writeout from reclaim and 
> it is not clear when the write will complete. 

For disk based writeback you do not know when it comes, but you need
only passively wait for it. 

For networked writeback you need to receive all packets that happen to
be targeted at your machine and inspect them - and toss some away
because you cannot keep everything, memory is limited.

> There are no bounds that I
> know of in reclaim for the writeback of dirty anonymous pages.

throttle_vm_writeout() does sort-of.

> But some throttling function, like the one for dirty pages, is likely
> needed for network traffic.

Yes, Daniel is working on writeout throttling.


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC 2/9] Use NOMEMALLOC reclaim to allow reclaim if PF_MEMALLOC is set
  2007-08-20 21:14           ` Peter Zijlstra
@ 2007-08-20 21:17             ` Christoph Lameter
  2007-08-21 14:07               ` Peter Zijlstra
  2007-08-21  0:39             ` Nick Piggin
  1 sibling, 1 reply; 46+ messages in thread
From: Christoph Lameter @ 2007-08-20 21:17 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Pavel Machek, linux-mm, linux-kernel, akpm, dkegel, David Miller,
	Nick Piggin

On Mon, 20 Aug 2007, Peter Zijlstra wrote:

> > It's not that different.
> 
> > Yes it is: disk-based completion does not require memory; network-based
> > completion requires unbounded memory.

Disk-based completion only requires no memory if it's not on a stack of
other devices and if the interrupt handler is appropriately shaped. If
there are multiple levels below, or there is some sort of complex
completion handling, then this may also require memory.


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC 2/9] Use NOMEMALLOC reclaim to allow reclaim if PF_MEMALLOC is set
  2007-08-20 21:14           ` Peter Zijlstra
  2007-08-20 21:17             ` Christoph Lameter
@ 2007-08-21  0:39             ` Nick Piggin
  2007-08-21 14:07               ` Peter Zijlstra
  1 sibling, 1 reply; 46+ messages in thread
From: Nick Piggin @ 2007-08-21  0:39 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Christoph Lameter, Pavel Machek, linux-mm, linux-kernel, akpm,
	dkegel, David Miller

On Mon, Aug 20, 2007 at 11:14:08PM +0200, Peter Zijlstra wrote:
> On Mon, 2007-08-20 at 13:27 -0700, Christoph Lameter wrote:
> > On Mon, 20 Aug 2007, Peter Zijlstra wrote:
> > 
> > > > Plus the same issue can happen today. Writes are usually not completed 
> > > > during reclaim. If the writes are sufficiently deferred then you have the 
> > > > same issue now.
> > > 
> > > Once we have initiated (disk) writeout we do not need more memory to
> > > complete it, all we need to do is wait for the completion interrupt.
> > 
> > We cannot reclaim the page as long as the I/O is not complete. If you 
> > have too many anonymous pages and the rest of memory is dirty then you can 
> > get into OOM scenarios even without this patch.
> 
> As long as the reserve is large enough to completely initialize writeout
> of a single page we can make progress. Once writeout is initialized the
> completion interrupt is guaranteed to happen (assuming working
> hardware).

Although interestingly, we are not guaranteed to have enough memory to
completely initialise writeout of a single page.

The buffer layer doesn't require disk blocks to be allocated at page
dirty-time. Allocating disk blocks can require complex filesystem operations
and read-in of buffer cache pages. The buffer_head structures themselves may
not even be present and must be allocated :P

In _practice_, this isn't such a problem because we have dirty limits, and
we're almost guaranteed to have some clean pages to be reclaimed. In this
same way, networked filesystems are not a problem in practice. However
network swap, because there are no dirty limits on swap, can actually see
the deadlock problems.


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC 2/9] Use NOMEMALLOC reclaim to allow reclaim if PF_MEMALLOC is set
  2007-08-20 21:17             ` Christoph Lameter
@ 2007-08-21 14:07               ` Peter Zijlstra
  0 siblings, 0 replies; 46+ messages in thread
From: Peter Zijlstra @ 2007-08-21 14:07 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Pavel Machek, linux-mm, linux-kernel, akpm, dkegel, David Miller,
	Nick Piggin

[-- Attachment #1: Type: text/plain, Size: 823 bytes --]

On Mon, 2007-08-20 at 14:17 -0700, Christoph Lameter wrote:
> On Mon, 20 Aug 2007, Peter Zijlstra wrote:
> 
> > > Its not that different.
> > 
> > Yes it is: disk-based completion does not require memory; network-based
> > completion requires unbounded memory.
> 
> Disk-based completion only requires no memory if it's not on a stack of
> other devices and if the interrupt handler is appropriately shaped. If
> there are multiple levels below, or there is some sort of complex
> completion handling, then this may also require memory.

I'm not aware of such a scenario - but it could well be. Still, if it
did, it would take a _bounded_ amount of memory per page.

Network would still differ in that it requires an _unbounded_ amount of
packets to receive and process in order to receive that completion.


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC 2/9] Use NOMEMALLOC reclaim to allow reclaim if PF_MEMALLOC is set
  2007-08-21  0:39             ` Nick Piggin
@ 2007-08-21 14:07               ` Peter Zijlstra
  2007-08-23  3:38                 ` Nick Piggin
  0 siblings, 1 reply; 46+ messages in thread
From: Peter Zijlstra @ 2007-08-21 14:07 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Christoph Lameter, Pavel Machek, linux-mm, linux-kernel, akpm,
	dkegel, David Miller

[-- Attachment #1: Type: text/plain, Size: 2931 bytes --]

On Tue, 2007-08-21 at 02:39 +0200, Nick Piggin wrote:
> On Mon, Aug 20, 2007 at 11:14:08PM +0200, Peter Zijlstra wrote:
> > On Mon, 2007-08-20 at 13:27 -0700, Christoph Lameter wrote:
> > > On Mon, 20 Aug 2007, Peter Zijlstra wrote:
> > > 
> > > > > Plus the same issue can happen today. Writes are usually not completed 
> > > > > during reclaim. If the writes are sufficiently deferred then you have the 
> > > > > same issue now.
> > > > 
> > > > Once we have initiated (disk) writeout we do not need more memory to
> > > > complete it, all we need to do is wait for the completion interrupt.
> > > 
> > > We cannot reclaim the page as long as the I/O is not complete. If you 
> > > have too many anonymous pages and the rest of memory is dirty then you can 
> > > get into OOM scenarios even without this patch.
> > 
> > As long as the reserve is large enough to completely initialize writeout
> > of a single page we can make progress. Once writeout is initialized the
> > completion interrupt is guaranteed to happen (assuming working
> > hardware).
> 
> Although interestingly, we are not guaranteed to have enough memory to
> completely initialise writeout of a single page.

Yes, that is due to the unbounded nature of direct reclaim, no?

I've been meaning to write some patches to address this problem in a way
that does not introduce the hard wall Linus objects to. If only I had
this extra day in the week :-/

And then there is the deadlock in add_to_swap() that I still have to
look into, I hope it can eventually be solved using reserve based
allocation.

> The buffer layer doesn't require disk blocks to be allocated at page
> dirty-time. Allocating disk blocks can require complex filesystem operations
> > and read-in of buffer cache pages. The buffer_head structures themselves may
> not even be present and must be allocated :P
> 
> In _practice_, this isn't such a problem because we have dirty limits, and
> we're almost guaranteed to have some clean pages to be reclaimed. In this
> same way, networked filesystems are not a problem in practice. However
> > network swap, because there are no dirty limits on swap, can actually see
> the deadlock problems.

The main problem with networked swap is not so much sending out the
pages (this has similar problems to the filesystems but is all bounded
in its memory use).

The biggest issue is receiving the completion notification. Network
needs to fall back to a state where it does not blindly consume memory
or drop _all_ packets. An intermediate state is required, one where we
can receive and inspect incoming packets but commit to very few.

In order to create such a network state and for it to be stable, a
certain amount of memory needs to be available and an external trigger
is needed to enter and leave this state - currently provided by there
being more memory available than needed or not.


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC 2/9] Use NOMEMALLOC reclaim to allow reclaim if PF_MEMALLOC is set
  2007-08-21 14:07               ` Peter Zijlstra
@ 2007-08-23  3:38                 ` Nick Piggin
  2007-08-23  9:26                   ` Peter Zijlstra
  0 siblings, 1 reply; 46+ messages in thread
From: Nick Piggin @ 2007-08-23  3:38 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Christoph Lameter, Pavel Machek, linux-mm, linux-kernel, akpm,
	dkegel, David Miller

On Tue, Aug 21, 2007 at 04:07:15PM +0200, Peter Zijlstra wrote:
> On Tue, 2007-08-21 at 02:39 +0200, Nick Piggin wrote:
> > 
> > Although interestingly, we are not guaranteed to have enough memory to
> > completely initialise writeout of a single page.
> 
> Yes, that is due to the unbounded nature of direct reclaim, no?
 
Even writing out a single page to a plain old block backed filesystem
can take a fair chunk of memory. I'm not really sure how problematic
this is with a "real" filesystem, but even with something pretty simple,
you might have to do block allocation, which itself might have to do
indirect block allocation (which itself can be 3 or 4 levels), all of
which have to actually update block bitmaps (which themselves may be
many pages big). Then you also may have to even just allocate the
buffer_head structure itself. And that's just to write out a single
buffer in the page (on a 64K page system, there might be 64 of these).

Unbounded direct reclaim surely doesn't help either :P


> I've been meaning to write some patches to address this problem in a way
> that does not introduce the hard wall Linus objects to. If only I had
> this extra day in the week :-/

For this problem I think the right way to go is to ensure everything
is allocated to do writeout at page-dirty-time. This is what fsblock
does (or at least _allows_ for: filesystems that do journalling or
delayed allocation etc. themselves will have to ensure they have
sufficient preallocations to do the manipulations they need at writeout
time).

But again, on the pragmatic side, the best behaviour I think is just
to have writeouts not allocate from reserves without first trying to
reclaim some clean memory, and also limit the number of users of the
reserve. We want this anyway, right, because we don't want regular
reclaim to start causing things like atomic allocation failures when
load goes up.


> And then there is the deadlock in add_to_swap() that I still have to
> look into, I hope it can eventually be solved using reserve based
> allocation.

Yes it should have a reserve. It wouldn't be hard, all you need is
enough memory to be able to swap out a single page I would think (ie.
one preload's worth).
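
One preload's worth meaning roughly the following - a sketch of the
swap cache insertion path, shown only to make the point that a single
radix_tree_preload() covers the one allocation that must not fail:

	/* reserve radix tree nodes for one insertion */
	error = radix_tree_preload(gfp_mask & ~__GFP_HIGHMEM);
	if (!error) {
		write_lock_irq(&swapper_space.tree_lock);
		error = radix_tree_insert(&swapper_space.page_tree,
					  entry.val, page);
		write_unlock_irq(&swapper_space.tree_lock);
		radix_tree_preload_end();
	}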

 
> > The buffer layer doesn't require disk blocks to be allocated at page
> > dirty-time. Allocating disk blocks can require complex filesystem operations
> > and read-in of buffer cache pages. The buffer_head structures themselves may
> > not even be present and must be allocated :P
> > 
> > In _practice_, this isn't such a problem because we have dirty limits, and
> > we're almost guaranteed to have some clean pages to be reclaimed. In this
> > same way, networked filesystems are not a problem in practice. However
> > network swap, because there are no dirty limits on swap, can actually see
> > the deadlock problems.
> 
> The main problem with networked swap is not so much sending out the
> > pages (this has similar problems to the filesystems but is all bounded
> in its memory use).
> 
> The biggest issue is receiving the completion notification. Network
> > needs to fall back to a state where it does not blindly consume memory
> > or drop _all_ packets. An intermediate state is required, one where we
> can receive and inspect incoming packets but commit to very few.
 
Yes, I understand this is the main problem. But it is not _helped_ by
the fact that reclaim reserves include the atomic allocation reserves.
I haven't run into this problem for a long time, but I'd venture to guess the
_main_ reason the deadlock is hit is not because of networking allocating
a lot of other irrelevant data, but because of reclaim using up most of
the atomic allocation reserves.

And this observation is not tied to recursive reclaim: if we somehow had
a reserve for atomic allocations that was aside from the reclaim reserve,
I think such a system would be practically free of deadlock for more
anonymous-intensive workloads too.


> In order to create such a network state and for it to be stable, a
> certain amount of memory needs to be available and an external trigger
> is needed to enter and leave this state - currently provided by there
> being more memory available than needed or not.

I do appreciate the deadlock and solution.  I'm puzzled by your last line
though? Currently we do not provide the required reserves in the network
layer, *at all*, right?


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC 2/9] Use NOMEMALLOC reclaim to allow reclaim if PF_MEMALLOC is set
  2007-08-23  3:38                 ` Nick Piggin
@ 2007-08-23  9:26                   ` Peter Zijlstra
  2007-08-23 10:11                     ` Nikita Danilov
  2007-08-24  4:00                     ` Nick Piggin
  0 siblings, 2 replies; 46+ messages in thread
From: Peter Zijlstra @ 2007-08-23  9:26 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Christoph Lameter, Pavel Machek, linux-mm, linux-kernel, akpm,
	dkegel, David Miller, Nikita Danilov

[-- Attachment #1: Type: text/plain, Size: 6354 bytes --]

On Thu, 2007-08-23 at 05:38 +0200, Nick Piggin wrote:
> On Tue, Aug 21, 2007 at 04:07:15PM +0200, Peter Zijlstra wrote:
> > On Tue, 2007-08-21 at 02:39 +0200, Nick Piggin wrote:
> > > 
> > > Although interestingly, we are not guaranteed to have enough memory to
> > > completely initialise writeout of a single page.
> > 
> > Yes, that is due to the unbounded nature of direct reclaim, no?
>  
> Even writing out a single page to a plain old block backed filesystem
> can take a fair chunk of memory. I'm not really sure how problematic
> this is with a "real" filesystem, but even with something pretty simple,
> you might have to do block allocation, which itself might have to do
> indirect block allocation (which itself can be 3 or 4 levels), all of
> which have to actually update block bitmaps (which themselves may be
> many pages big). Then you also may have to even just allocate the
> buffer_head structure itself. And that's just to write out a single
> buffer in the page (on a 64K page system, there might be 64 of these).

Right, Nikita once talked me through all that when we talked about
clustered writeout.

IIRC filesystems were supposed to keep mempools big enough to do this
for a single writepage at a time. Not sure it's actually done though.
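
If a filesystem did, a minimal pool could look something like this (a
sketch: MAX_BUF_PER_PAGE is real, the pool and its use are invented
here, and real buffer_heads come from their own slab cache rather than
kmalloc):

	static mempool_t *writepage_bh_pool;

	/* at init time: enough buffer_heads to write out one page */
	writepage_bh_pool = mempool_create_kmalloc_pool(MAX_BUF_PER_PAGE,
					sizeof(struct buffer_head));

	/* writeout path: with __GFP_WAIT set this cannot fail, it
	 * sleeps until an element is returned to the pool */
	struct buffer_head *bh = mempool_alloc(writepage_bh_pool, GFP_NOFS);
	...
	mempool_free(bh, writepage_bh_pool);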

One advantage here is that swap writeout is very simple, so for
swap_writepage() the overhead is minimal, and we can free up space to
make progress with the fs writeout. And if there is little anonymous
memory in the system it must have a lot of clean memory because of the
dirty limit.

But yeah, there are some nasty details left here.

> > I've been meaning to write some patches to address this problem in a way
> > that does not introduce the hard wall Linus objects to. If only I had
> > this extra day in the week :-/
> 
> For this problem I think the right way to go is to ensure everything
> is allocated to do writeout at page-dirty-time. This is what fsblock
> does (or at least _allows_ for: filesystems that do journalling or
> delayed allocation etc. themselves will have to ensure they have
> sufficient preallocations to do the manipulations they need at writeout
> time).
> 
> But again, on the pragmatic side, the best behaviour I think is just
> to have writeouts not allocate from reserves without first trying to
> reclaim some clean memory, and also limit the number of users of the
> reserve. We want this anyway, right, because we don't want regular
> reclaim to start causing things like atomic allocation failures when
> load goes up.

My idea is to extend kswapd, run cpus_per_node instances of kswapd per
node for each of GFP_KERNEL, GFP_NOFS, GFP_NOIO. (basically 3 kswapds
per cpu)

whenever we would hit direct reclaim, add ourselves to a special
waitqueue corresponding to the type of GFP and kick all the
corresponding kswapds.

Now Linus' big objection is that all these processes would hit a wall
and not progress until the watermarks are high again.

Here is where the 'special' part of the waitqueue comes into play.

Instead of freeing pages to the page allocator, these kswapds would hand
out pages to the waiting processes in a round robin fashion. Only if
there are no more waiting processes left, would the page go to the buddy
system.
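
In code the hand-off might look like this (every identifier below is
invented for the sketch - it shows the idea, it is not a patch):

	struct reclaim_waiter {
		struct list_head	list;
		struct task_struct	*task;
		struct page		*page;
	};

	static void hand_out_or_free(struct zone *zone, struct page *page)
	{
		struct reclaim_waiter *w;

		spin_lock(&zone->reclaim_wait_lock);
		if (!list_empty(&zone->reclaim_waiters)) {
			/* round robin: the first waiter gets the page */
			w = list_entry(zone->reclaim_waiters.next,
					struct reclaim_waiter, list);
			list_del_init(&w->list);
			w->page = page;
			wake_up_process(w->task);
			page = NULL;
		}
		spin_unlock(&zone->reclaim_wait_lock);

		if (page)
			/* nobody waiting: release to the buddy system */
			__free_page(page);
	}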

> > And then there is the deadlock in add_to_swap() that I still have to
> > look into, I hope it can eventually be solved using reserve based
> > allocation.
> 
> Yes it should have a reserve. It wouldn't be hard, all you need is
> enough memory to be able to swap out a single page I would think (ie.
> one preload's worth).

Yeah, just need to look at the locking and batching, and ensure it has
enough preload to survive one batch; once all the locks are dropped it
can breathe again :-)
 
> > > The buffer layer doesn't require disk blocks to be allocated at page
> > > dirty-time. Allocating disk blocks can require complex filesystem operations
> > > and read-in of buffer cache pages. The buffer_head structures themselves may
> > > not even be present and must be allocated :P
> > > 
> > > In _practice_, this isn't such a problem because we have dirty limits, and
> > > we're almost guaranteed to have some clean pages to be reclaimed. In this
> > > same way, networked filesystems are not a problem in practice. However
> > > network swap, because there are no dirty limits on swap, can actually see
> > > the deadlock problems.
> > 
> > The main problem with networked swap is not so much sending out the
> > > pages (this has similar problems to the filesystems but is all bounded
> > in its memory use).
> > 
> > The biggest issue is receiving the completion notification. Network
> > > needs to fall back to a state where it does not blindly consume memory
> > > or drop _all_ packets. An intermediate state is required, one where we
> > can receive and inspect incoming packets but commit to very few.
>  
> Yes, I understand this is the main problem. But it is not _helped_ by
> the fact that reclaim reserves include the atomic allocation reserves.
> I haven't run into this problem for a long time, but I'd venture to guess the
> _main_ reason the deadlock is hit is not because of networking allocating
> a lot of other irrelevant data, but because of reclaim using up most of
> the atomic allocation reserves.

Ah, interesting notion.

> And this observation is not tied to recursive reclaim: if we somehow had
> a reserve for atomic allocations that was aside from the reclaim reserve,
> I think such a system would be practically free of deadlock for more
> anonymous-intensive workloads too.

One could get quite far, however the scenario of shutting down the
remote swap server while other network traffic is present will surely
still deadlock.

> > In order to create such a network state and for it to be stable, a
> > certain amount of memory needs to be available and an external trigger
> > is needed to enter and leave this state - currently provided by there
> > being more memory available than needed or not.
> 
> I do appreciate the deadlock and solution.  I'm puzzled by your last line
> though? Currently we do not provide the required reserves in the network
> layer, *at all*, right?

Right, I was speaking of a kernel with my patches applied. Sorry for the
confusion.


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC 2/9] Use NOMEMALLOC reclaim to allow reclaim if PF_MEMALLOC is set
  2007-08-23  9:26                   ` Peter Zijlstra
@ 2007-08-23 10:11                     ` Nikita Danilov
  2007-08-23 13:58                       ` Peter Zijlstra
  2007-08-24  4:00                     ` Nick Piggin
  1 sibling, 1 reply; 46+ messages in thread
From: Nikita Danilov @ 2007-08-23 10:11 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Nick Piggin, Christoph Lameter, Pavel Machek, linux-mm,
	linux-kernel, akpm, dkegel, David Miller

Peter Zijlstra writes:

[...]

 > My idea is to extend kswapd, run cpus_per_node instances of kswapd per
 > node for each of GFP_KERNEL, GFP_NOFS, GFP_NOIO. (basically 3 kswapds
 > per cpu)
 > 
 > whenever we would hit direct reclaim, add ourselves to a special
 > waitqueue corresponding to the type of GFP and kick all the
 > corresponding kswapds.

There are two standard objections to this:

    - direct reclaim was introduced to reduce memory allocation latency,
      and going to the scheduler kills this. But more importantly,

    - it might so happen that _all_ per-cpu kswapd instances are
      blocked, e.g., waiting for IO on indirect blocks, or queue
      congestion. In that case the whole system stops, waiting for IO to
      complete. In the direct reclaim case, other threads can continue
      zone scanning.

Nikita.


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC 2/9] Use NOMEMALLOC reclaim to allow reclaim if PF_MEMALLOC is set
  2007-08-23 10:11                     ` Nikita Danilov
@ 2007-08-23 13:58                       ` Peter Zijlstra
  0 siblings, 0 replies; 46+ messages in thread
From: Peter Zijlstra @ 2007-08-23 13:58 UTC (permalink / raw)
  To: Nikita Danilov
  Cc: Nick Piggin, Christoph Lameter, Pavel Machek, linux-mm,
	linux-kernel, akpm, dkegel, David Miller

[-- Attachment #1: Type: text/plain, Size: 1953 bytes --]

On Thu, 2007-08-23 at 14:11 +0400, Nikita Danilov wrote:
> Peter Zijlstra writes:
> 
> [...]
> 
>  > My idea is to extend kswapd, run cpus_per_node instances of kswapd per
>  > node for each of GFP_KERNEL, GFP_NOFS, GFP_NOIO. (basically 3 kswapds
>  > per cpu)
>  > 
>  > whenever we would hit direct reclaim, add ourselves to a special
>  > waitqueue corresponding to the type of GFP and kick all the
>  > corresponding kswapds.
> 
> There are two standard objections to this:
> 
>     - direct reclaim was introduced to reduce memory allocation latency,
>       and going to the scheduler kills this. But more importantly,

The part you snipped:

> > Here is where the 'special' part of the waitqueue comes into play.
> > 
> > Instead of freeing pages to the page allocator, these kswapds would hand
> > out pages to the waiting processes in a round robin fashion. Only if
> > there are no more waiting processes left, would the page go to the buddy
> > system.

should deal with that: it allows processes to quickly get some memory.

>     - it might so happen that _all_ per-cpu kswapd instances are
>       blocked, e.g., waiting for IO on indirect blocks, or queue
>       congestion. In that case the whole system stops, waiting for IO to
>       complete. In the direct reclaim case, other threads can continue
>       zone scanning.

By running separate GFP_KERNEL, GFP_NOFS and GFP_NOIO kswapds this
should not occur, much like it does not occur now.

This approach would make it work pretty much like it does now. But
instead of letting each separate context run into reclaim we then have a
fixed set of reclaim contexts which evenly distribute their resulting
free pages.

The possible downsides are:

 - more schedule()s, but I don't think these will matter when we're that
deep into reclaim
 - less concurrency - but I hope 1 set per cpu is enough, we could up
this if it turns out to really help.


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC 2/9] Use NOMEMALLOC reclaim to allow reclaim if PF_MEMALLOC is set
  2007-08-23  9:26                   ` Peter Zijlstra
  2007-08-23 10:11                     ` Nikita Danilov
@ 2007-08-24  4:00                     ` Nick Piggin
  1 sibling, 0 replies; 46+ messages in thread
From: Nick Piggin @ 2007-08-24  4:00 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Christoph Lameter, Pavel Machek, linux-mm, linux-kernel, akpm,
	dkegel, David Miller, Nikita Danilov

On Thu, Aug 23, 2007 at 11:26:48AM +0200, Peter Zijlstra wrote:
> On Thu, 2007-08-23 at 05:38 +0200, Nick Piggin wrote:
> > On Tue, Aug 21, 2007 at 04:07:15PM +0200, Peter Zijlstra wrote:
> > > On Tue, 2007-08-21 at 02:39 +0200, Nick Piggin wrote:
> > > > 
> > > > Although interestingly, we are not guaranteed to have enough memory to
> > > > completely initialise writeout of a single page.
> > > 
> > > Yes, that is due to the unbounded nature of direct reclaim, no?
> >  
> > Even writing out a single page to a plain old block backed filesystem
> > can take a fair chunk of memory. I'm not really sure how problematic
> > this is with a "real" filesystem, but even with something pretty simple,
> > you might have to do block allocation, which itself might have to do
> > indirect block allocation (which itself can be 3 or 4 levels), all of
> > which have to actually update block bitmaps (which themselves may be
> > many pages big). Then you also may have to even just allocate the
> > buffer_head structure itself. And that's just to write out a single
> > buffer in the page (on a 64K page system, there might be 64 of these).
> 
> Right, Nikita once talked me through all that when we talked about
> clustered writeout.
> 
> IIRC filesystems were supposed to keep mempools big enough to do this
> for a single writepage at a time. Not sure it's actually done though.
 
It isn't ;) At least I don't think so for the minix-derived ones
I've seen. But no matter, this is going a bit off topic anyway.


> > But again, on the pragmatic side, the best behaviour I think is just
> > to have writeouts not allocate from reserves without first trying to
> > reclaim some clean memory, and also limit the number of users of the
> > reserve. We want this anyway, right, because we don't want regular
> > reclaim to start causing things like atomic allocation failures when
> > load goes up.
> 
> My idea is to extend kswapd, run cpus_per_node instances of kswapd per
> node for each of GFP_KERNEL, GFP_NOFS, GFP_NOIO. (basically 3 kswapds
> per cpu)
> 
> whenever we would hit direct reclaim, add ourselves to a special
> waitqueue corresponding to the type of GFP and kick all the
> corresponding kswapds.

I don't know what this is solving. You don't need to run all reclaim
from a kswapd process in order to limit concurrency. Just explicitly
limit it when a process applies for PF_MEMALLOC reserves. I had a
patch to do this at one point, but it never got much testing -- I
think there were other problems with a single process able to do
unbounded writeout and such anyway. But yeah, I don't think getting
rid of direct reclaim will do anything magical.
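
The explicit limit needs nothing fancy. Something like this at the
point where a task applies for the PF_MEMALLOC reserves would do (a
sketch; the counter, the bound and the waiting policy are all
invented):

	static atomic_t memalloc_reclaimers = ATOMIC_INIT(0);
	#define MAX_MEMALLOC_RECLAIMERS	8	/* tunable */

	struct task_struct *p = current;

	/* allocator slow path, before entering direct reclaim */
	while (!atomic_add_unless(&memalloc_reclaimers, 1,
				  MAX_MEMALLOC_RECLAIMERS))
		/* enough tasks in reclaim already: wait briefly
		 * rather than pile onto the reserves */
		congestion_wait(WRITE, HZ/50);

	p->flags |= PF_MEMALLOC;
	did_some_progress = try_to_free_pages(zonelist->zones, order,
						gfp_mask);
	p->flags &= ~PF_MEMALLOC;
	atomic_dec(&memalloc_reclaimers);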

 
> Now Linus' big objection is that all these processes would hit a wall
> and not progress until the watermarks are high again.
> 
> Here is where the 'special' part of the waitqueue comes into play.
> 
> Instead of freeing pages to the page allocator, these kswapds would hand
> out pages to the waiting processes in a round robin fashion. Only if
> there are no more waiting processes left, would the page go to the buddy
> system.

Directly getting back pages (and having more than 1 kswapd per node)
may be things worth exploring at some point. But I don't see how much
bearing they have on any deadlock problems.


> > > And then there is the deadlock in add_to_swap() that I still have to
> > > look into, I hope it can eventually be solved using reserve based
> > > allocation.
> > 
> > Yes it should have a reserve. It wouldn't be hard, all you need is
> > enough memory to be able to swap out a single page I would think (ie.
> > one preload's worth).
> 
> Yeah, just need to look at the locking and batching, and ensure it has
> enough preload to survive one batch; once all the locks are dropped it
> can breathe again :-)

I don't think you'd need to do anything remotely fancy ;) Just so long
as it can allocate a swapcache entry for a single page to write out, that
page will be written and eventually reclaimed, along with its radix tree
nodes.  


> > > The biggest issue is receiving the completion notification. Network
> > > needs to fall back to a state where it does not blindly consume memory
> > > or drop _all_ packets. An intermediate state is required, one where we
> > > can receive and inspect incoming packets but commit to very few.
> >  
> > Yes, I understand this is the main problem. But it is not _helped_ by
> > the fact that reclaim reserves include the atomic allocation reserves.
> > I haven't run into this problem for a long time, but I'd venture to guess the
> > _main_ reason the deadlock is hit is not because of networking allocating
> > a lot of other irrelevant data, but because of reclaim using up most of
> > the atomic allocation reserves.
> 
> Ah, interesting notion.
> 
> > And this observation is not tied to recursive reclaim: if we somehow had
> > a reserve for atomic allocations that was aside from the reclaim reserve,
> > I think such a system would be practically free of deadlock for more
> > anonymous-intensive workloads too.
> 
> One could get quite far, however the scenario of shutting down the
> remote swap server while other network traffic is present will surely
> still deadlock.

I guess it would still have all the same theoretical holes, and some
could surely still be tickled, yes ;)


^ permalink raw reply	[flat|nested] 46+ messages in thread

end of thread, other threads:[~2007-08-24  4:00 UTC | newest]

Thread overview: 46+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2007-08-14 15:30 [RFC 0/9] Reclaim during GFP_ATOMIC allocs Christoph Lameter
2007-08-14 15:30 ` [RFC 1/9] Allow reclaim via __GFP_NOMEMALLOC reclaim Christoph Lameter
2007-08-14 15:30 ` [RFC 2/9] Use NOMEMALLOC reclaim to allow reclaim if PF_MEMALLOC is set Christoph Lameter
2007-08-18  7:10   ` Pavel Machek
2007-08-20 19:00     ` Christoph Lameter
2007-08-20 20:17       ` Peter Zijlstra
2007-08-20 20:27         ` Christoph Lameter
2007-08-20 21:14           ` Peter Zijlstra
2007-08-20 21:17             ` Christoph Lameter
2007-08-21 14:07               ` Peter Zijlstra
2007-08-21  0:39             ` Nick Piggin
2007-08-21 14:07               ` Peter Zijlstra
2007-08-23  3:38                 ` Nick Piggin
2007-08-23  9:26                   ` Peter Zijlstra
2007-08-23 10:11                     ` Nikita Danilov
2007-08-23 13:58                       ` Peter Zijlstra
2007-08-24  4:00                     ` Nick Piggin
2007-08-14 15:30 ` [RFC 3/9] Make cond_rescheds conditional on __GFP_WAIT Christoph Lameter
2007-08-14 15:30 ` [RFC 4/9] Atomic reclaim: Save irq flags in vmscan.c Christoph Lameter
2007-08-14 20:02   ` Andi Kleen
2007-08-14 19:12     ` Christoph Lameter
2007-08-14 20:05       ` Peter Zijlstra
2007-08-14 20:34         ` Andi Kleen
2007-08-14 20:33       ` Andi Kleen
2007-08-14 20:42         ` Christoph Lameter
2007-08-14 20:44           ` Andi Kleen
2007-08-14 21:15             ` Christoph Lameter
2007-08-14 21:23               ` Andi Kleen
2007-08-14 21:26                 ` Christoph Lameter
2007-08-14 21:29                   ` Andi Kleen
2007-08-14 21:37                     ` Christoph Lameter
2007-08-14 21:44                       ` Andi Kleen
2007-08-14 21:48                         ` Christoph Lameter
2007-08-14 21:56                           ` Andi Kleen
2007-08-14 22:07                             ` Christoph Lameter
2007-08-14 22:16                               ` Andi Kleen
2007-08-14 22:20                                 ` Christoph Lameter
2007-08-14 22:21                                   ` Andi Kleen
2007-08-14 22:41                                     ` Christoph Lameter
2007-08-14 15:30 ` [RFC 5/9] Save irqflags on taking the mapping lock Christoph Lameter
2007-08-14 15:30 ` [RFC 6/9] Disable irqs on taking the private_lock Christoph Lameter
2007-08-14 15:30 ` [RFC 7/9] Save flags in swap.c Christoph Lameter
2007-08-14 15:30 ` [RFC 8/9] Reclaim on an atomic allocation if necessary Christoph Lameter
2007-08-14 15:30 ` [RFC 9/9] Testing: Perform GFP_ATOMIC overallocation Christoph Lameter
2007-08-16  2:49 ` [RFC 0/9] Reclaim during GFP_ATOMIC allocs Nick Piggin
2007-08-16 20:24   ` Christoph Lameter

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).