* [PATCH 00/33] Swap over NFS -v14
@ 2007-10-30 16:04 Peter Zijlstra
2007-10-30 16:04 ` [PATCH 01/33] mm: gfp_to_alloc_flags() Peter Zijlstra
` (33 more replies)
0 siblings, 34 replies; 72+ messages in thread
From: Peter Zijlstra @ 2007-10-30 16:04 UTC (permalink / raw)
To: Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, netdev,
trond.myklebust
Cc: Peter Zijlstra
Hi,
Another posting of the full swap over NFS series.
[ I tried posting just the first part last time around, but
that only caused more confusion for lack of the general picture ]
[ patches against 2.6.23-mm1, also to be found online at:
http://programming.kicks-ass.net/kernel-patches/vm_deadlock/v2.6.23-mm1/ ]
The patch-set can be split into roughly 5 parts, for each of which I shall give
a description.
Part 1, patches 1-12
The problem with swap over network is the generic swap problem: needing memory
to free memory. Normally this is solved using mempools, as can be seen in the
BIO layer.
Swap over network has the problem that the network subsystem does not use
fixed-size allocations, but heavily relies on kmalloc(). This makes mempools
unusable.
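[ illustrative sketch, not from this series; names approximate the BIO code ]

	/* the BIO layer can pre-size its pool - every element is alike: */
	pool = mempool_create(BIO_POOL_SIZE, mempool_alloc_slab,
			      mempool_free_slab, bio_slab);
	bio = mempool_alloc(pool, GFP_NOIO);	/* guaranteed to make progress */

	/* the network stack kmalloc()s packet-dependent sizes: */
	data = kmalloc(size, GFP_ATOMIC);	/* no single pool fits all */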
This first part provides a generic reserve framework.
Care is taken to only affect the slow paths - when we're low on memory.
Caveat: it is currently SLUB only.
1 - mm: gfp_to_alloc_flags()
2 - mm: tag reserve pages
3 - mm: slub: add knowledge of reserve pages
4 - mm: allow mempool to fall back to memalloc reserves
5 - mm: kmem_estimate_pages()
6 - mm: allow PF_MEMALLOC from softirq context
7 - mm: serialize access to min_free_kbytes
8 - mm: emergency pool
9 - mm: system wide ALLOC_NO_WATERMARK
10 - mm: __GFP_MEMALLOC
11 - mm: memory reserve management
12 - selinux: tag avc cache alloc as non-critical
Part 2, patches 13-15
Provide some generic network infrastructure needed later on.
13 - net: wrap sk->sk_backlog_rcv()
14 - net: packet split receive api
15 - net: sk_allocation() - concentrate socket related allocations
Part 3, patches 16-23
Now that we have a generic memory reserve system, use it in the network stack.
The thing that makes this interesting is that, unlike the BIO layer, both the
transmit and receive paths require memory allocations.
That is, in the BIO layer writeback completion is usually just an ISR flipping
a bit and waking stuff up. Network writeback completion involves receiving
packets, which, when there is no memory, is rather hard. And even when there is
memory, there is no guarantee that the required packet arrives within the
window that memory buys us.
The solution to this problem lies in the fact that the network is to be assumed
lossy. Even now, when there is no memory to receive packets, the network card
has to discard packets. What we do is move this decision into the network stack.
So we reserve a small pool to act as a receive buffer; this allows us to
inspect packets before tossing them. This way, we can filter out the packets
that ensure progress (writeback completion) and disregard the others (which
would have been dropped anyway). [ NOTE: this is a stable mode of operation
with bounded memory usage, exactly the kind of thing we need ]
Again, care is taken to confine the overhead to the slow path: only packets
allocated from the reserves incur the extra atomic overhead needed for
accounting.
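Conceptually the receive side then does something like this (a hedged sketch
only; skb_emergency() and sk_has_memalloc() name the idea here, the actual
code lives in patches 19, 20 and 23):

	if (skb_emergency(skb)) {
		/* skb was allocated from the reserves: only sockets that
		 * help clean memory (swap traffic) may consume it */
		if (!sk_has_memalloc(sk)) {
			kfree_skb(skb);		/* no worse than a NIC drop */
			return;
		}
	}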
16 - netvm: network reserve infrastructure
17 - sysctl: propagate conv errors
18 - netvm: INET reserves.
19 - netvm: hook skb allocation to reserves
20 - netvm: filter emergency skbs.
21 - netvm: prevent a TCP specific deadlock
22 - netfilter: NF_QUEUE vs emergency skbs
23 - netvm: skb processing
Part 4, patches 24-26
Generic VM infrastructure to handle swapping to a filesystem instead of a block
device. The approach taken here has been questioned; people would like to see
a less invasive approach.
One suggestion is to create and use a_ops->swap_{in,out}().
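Such an approach might look roughly like this (hypothetical shape, not
implemented by this series):

	/* possible additions to struct address_space_operations: */
	int (*swap_out)(struct file *file, struct page *page,
			struct writeback_control *wbc);
	int (*swap_in)(struct file *file, struct page *page);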
24 - mm: prepare swap entry methods for use in page methods
25 - mm: add support for non block device backed swap files
26 - mm: methods for teaching filesystems about PG_swapcache pages
Part 5, patches 27-33
Finally, convert NFS to make use of the new network and vm infrastructure to
provide swap over NFS.
27 - nfs: remove mempools
28 - nfs: teach the NFS client how to treat PG_swapcache pages
29 - nfs: disable data cache revalidation for swapfiles
30 - nfs: swap vs nfs_writepage
31 - nfs: enable swap on NFS
32 - nfs: fix various memory recursions possible with swap over NFS.
33 - nfs: do not warn on radix tree node allocation failures
--
* [PATCH 01/33] mm: gfp_to_alloc_flags()
2007-10-30 16:04 [PATCH 00/33] Swap over NFS -v14 Peter Zijlstra
@ 2007-10-30 16:04 ` Peter Zijlstra
2007-10-30 16:04 ` [PATCH 02/33] mm: tag reserve pages Peter Zijlstra
` (32 subsequent siblings)
33 siblings, 0 replies; 72+ messages in thread
From: Peter Zijlstra @ 2007-10-30 16:04 UTC (permalink / raw)
To: Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, netdev,
trond.myklebust
Cc: Peter Zijlstra
[-- Attachment #1: mm-gfp-to-alloc_flags.patch --]
[-- Type: text/plain, Size: 5781 bytes --]
Factor out the gfp to alloc_flags mapping so it can be used in other places.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
mm/internal.h | 11 ++++++
mm/page_alloc.c | 98 ++++++++++++++++++++++++++++++++------------------------
2 files changed, 67 insertions(+), 42 deletions(-)
Index: linux-2.6/mm/internal.h
===================================================================
--- linux-2.6.orig/mm/internal.h
+++ linux-2.6/mm/internal.h
@@ -47,4 +47,15 @@ static inline unsigned long page_order(s
VM_BUG_ON(!PageBuddy(page));
return page_private(page);
}
+
+#define ALLOC_HARDER 0x01 /* try to alloc harder */
+#define ALLOC_HIGH 0x02 /* __GFP_HIGH set */
+#define ALLOC_WMARK_MIN 0x04 /* use pages_min watermark */
+#define ALLOC_WMARK_LOW 0x08 /* use pages_low watermark */
+#define ALLOC_WMARK_HIGH 0x10 /* use pages_high watermark */
+#define ALLOC_NO_WATERMARKS 0x20 /* don't check watermarks at all */
+#define ALLOC_CPUSET 0x40 /* check for correct cpuset */
+
+int gfp_to_alloc_flags(gfp_t gfp_mask);
+
#endif
Index: linux-2.6/mm/page_alloc.c
===================================================================
--- linux-2.6.orig/mm/page_alloc.c
+++ linux-2.6/mm/page_alloc.c
@@ -1139,14 +1139,6 @@ failed:
return NULL;
}
-#define ALLOC_NO_WATERMARKS 0x01 /* don't check watermarks at all */
-#define ALLOC_WMARK_MIN 0x02 /* use pages_min watermark */
-#define ALLOC_WMARK_LOW 0x04 /* use pages_low watermark */
-#define ALLOC_WMARK_HIGH 0x08 /* use pages_high watermark */
-#define ALLOC_HARDER 0x10 /* try to alloc harder */
-#define ALLOC_HIGH 0x20 /* __GFP_HIGH set */
-#define ALLOC_CPUSET 0x40 /* check for correct cpuset */
-
#ifdef CONFIG_FAIL_PAGE_ALLOC
static struct fail_page_alloc_attr {
@@ -1535,6 +1527,44 @@ static void set_page_owner(struct page *
#endif /* CONFIG_PAGE_OWNER */
/*
+ * get the deepest reaching allocation flags for the given gfp_mask
+ */
+int gfp_to_alloc_flags(gfp_t gfp_mask)
+{
+ struct task_struct *p = current;
+ int alloc_flags = ALLOC_WMARK_MIN | ALLOC_CPUSET;
+ const gfp_t wait = gfp_mask & __GFP_WAIT;
+
+ /*
+ * The caller may dip into page reserves a bit more if the caller
+ * cannot run direct reclaim, or if the caller has realtime scheduling
+ * policy or is asking for __GFP_HIGH memory. GFP_ATOMIC requests will
+ * set both ALLOC_HARDER (!wait) and ALLOC_HIGH (__GFP_HIGH).
+ */
+ if (gfp_mask & __GFP_HIGH)
+ alloc_flags |= ALLOC_HIGH;
+
+ if (!wait) {
+ alloc_flags |= ALLOC_HARDER;
+ /*
+ * Ignore cpuset if GFP_ATOMIC (!wait) rather than fail alloc.
+ * See also cpuset_zone_allowed() comment in kernel/cpuset.c.
+ */
+ alloc_flags &= ~ALLOC_CPUSET;
+ } else if (unlikely(rt_task(p)) && !in_interrupt())
+ alloc_flags |= ALLOC_HARDER;
+
+ if (likely(!(gfp_mask & __GFP_NOMEMALLOC))) {
+ if (!in_interrupt() &&
+ ((p->flags & PF_MEMALLOC) ||
+ unlikely(test_thread_flag(TIF_MEMDIE))))
+ alloc_flags |= ALLOC_NO_WATERMARKS;
+ }
+
+ return alloc_flags;
+}
+
+/*
* This is the 'heart' of the zoned buddy allocator.
*/
struct page * fastcall
@@ -1589,48 +1619,28 @@ restart:
* OK, we're below the kswapd watermark and have kicked background
* reclaim. Now things get more complex, so set up alloc_flags according
* to how we want to proceed.
- *
- * The caller may dip into page reserves a bit more if the caller
- * cannot run direct reclaim, or if the caller has realtime scheduling
- * policy or is asking for __GFP_HIGH memory. GFP_ATOMIC requests will
- * set both ALLOC_HARDER (!wait) and ALLOC_HIGH (__GFP_HIGH).
*/
- alloc_flags = ALLOC_WMARK_MIN;
- if ((unlikely(rt_task(p)) && !in_interrupt()) || !wait)
- alloc_flags |= ALLOC_HARDER;
- if (gfp_mask & __GFP_HIGH)
- alloc_flags |= ALLOC_HIGH;
- if (wait)
- alloc_flags |= ALLOC_CPUSET;
+ alloc_flags = gfp_to_alloc_flags(gfp_mask);
- /*
- * Go through the zonelist again. Let __GFP_HIGH and allocations
- * coming from realtime tasks go deeper into reserves.
- *
- * This is the last chance, in general, before the goto nopage.
- * Ignore cpuset if GFP_ATOMIC (!wait) rather than fail alloc.
- * See also cpuset_zone_allowed() comment in kernel/cpuset.c.
- */
- page = get_page_from_freelist(gfp_mask, order, zonelist, alloc_flags);
+ /* This is the last chance, in general, before the goto nopage. */
+ page = get_page_from_freelist(gfp_mask, order, zonelist,
+ alloc_flags & ~ALLOC_NO_WATERMARKS);
if (page)
goto got_pg;
/* This allocation should allow future memory freeing. */
-
rebalance:
- if (((p->flags & PF_MEMALLOC) || unlikely(test_thread_flag(TIF_MEMDIE)))
- && !in_interrupt()) {
- if (!(gfp_mask & __GFP_NOMEMALLOC)) {
+ if (alloc_flags & ALLOC_NO_WATERMARKS) {
nofail_alloc:
- /* go through the zonelist yet again, ignoring mins */
- page = get_page_from_freelist(gfp_mask, order,
- zonelist, ALLOC_NO_WATERMARKS);
- if (page)
- goto got_pg;
- if (gfp_mask & __GFP_NOFAIL) {
- congestion_wait(WRITE, HZ/50);
- goto nofail_alloc;
- }
+ /* go through the zonelist yet again, ignoring mins */
+ page = get_page_from_freelist(gfp_mask, order, zonelist,
+ ALLOC_NO_WATERMARKS);
+ if (page)
+ goto got_pg;
+
+ if (wait && (gfp_mask & __GFP_NOFAIL)) {
+ congestion_wait(WRITE, HZ/50);
+ goto nofail_alloc;
}
goto nopage;
}
@@ -1639,6 +1649,10 @@ nofail_alloc:
if (!wait)
goto nopage;
+ /* Avoid recursion of direct reclaim */
+ if (p->flags & PF_MEMALLOC)
+ goto nopage;
+
cond_resched();
/* We now go into synchronous reclaim */
--
* [PATCH 02/33] mm: tag reserve pages
2007-10-30 16:04 [PATCH 00/33] Swap over NFS -v14 Peter Zijlstra
2007-10-30 16:04 ` [PATCH 01/33] mm: gfp_to_alloc_flags() Peter Zijlstra
@ 2007-10-30 16:04 ` Peter Zijlstra
2007-10-30 16:04 ` [PATCH 03/33] mm: slub: add knowledge of reserve pages Peter Zijlstra
` (31 subsequent siblings)
33 siblings, 0 replies; 72+ messages in thread
From: Peter Zijlstra @ 2007-10-30 16:04 UTC (permalink / raw)
To: Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, netdev,
trond.myklebust
Cc: Peter Zijlstra
[-- Attachment #1: page_alloc-reserve.patch --]
[-- Type: text/plain, Size: 1490 bytes --]
Tag pages allocated from the reserves with a non-zero page->reserve.
This allows us to distinguish and account reserve pages.
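A consumer can then detect that its allocation dipped into the reserves,
e.g. (illustrative; the SLUB patch following this one does just that):

	page = alloc_pages(gfp_mask, order);
	if (page && page->reserve)
		reserve = 1;	/* emergency page: account and restrict it */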
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
include/linux/mm_types.h | 1 +
mm/page_alloc.c | 4 +++-
2 files changed, 4 insertions(+), 1 deletion(-)
Index: linux-2.6/include/linux/mm_types.h
===================================================================
--- linux-2.6.orig/include/linux/mm_types.h
+++ linux-2.6/include/linux/mm_types.h
@@ -70,6 +70,7 @@ struct page {
union {
pgoff_t index; /* Our offset within mapping. */
void *freelist; /* SLUB: freelist req. slab lock */
+ int reserve; /* page_alloc: page is a reserve page */
};
struct list_head lru; /* Pageout list, eg. active_list
* protected by zone->lru_lock !
Index: linux-2.6/mm/page_alloc.c
===================================================================
--- linux-2.6.orig/mm/page_alloc.c
+++ linux-2.6/mm/page_alloc.c
@@ -1448,8 +1448,10 @@ zonelist_scan:
}
page = buffered_rmqueue(zonelist, zone, order, gfp_mask);
- if (page)
+ if (page) {
+ page->reserve = !!(alloc_flags & ALLOC_NO_WATERMARKS);
break;
+ }
this_zone_full:
if (NUMA_BUILD)
zlc_mark_zone_full(zonelist, z);
--
* [PATCH 03/33] mm: slub: add knowledge of reserve pages
2007-10-30 16:04 [PATCH 00/33] Swap over NFS -v14 Peter Zijlstra
2007-10-30 16:04 ` [PATCH 01/33] mm: gfp_to_alloc_flags() Peter Zijlstra
2007-10-30 16:04 ` [PATCH 02/33] mm: tag reserve pages Peter Zijlstra
@ 2007-10-30 16:04 ` Peter Zijlstra
2007-10-31 3:37 ` Nick Piggin
2007-10-30 16:04 ` [PATCH 04/33] mm: allow mempool to fall back to memalloc reserves Peter Zijlstra
` (30 subsequent siblings)
33 siblings, 1 reply; 72+ messages in thread
From: Peter Zijlstra @ 2007-10-30 16:04 UTC (permalink / raw)
To: Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, netdev,
trond.myklebust
Cc: Peter Zijlstra
[-- Attachment #1: reserve-slab.patch --]
[-- Type: text/plain, Size: 4012 bytes --]
Restrict objects from reserve slabs (ALLOC_NO_WATERMARKS) to allocation
contexts that are entitled to them.
Care is taken to only touch the SLUB slow path.
This is done to ensure reserve pages don't leak out and get consumed.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
include/linux/slub_def.h | 1 +
mm/slub.c | 31 +++++++++++++++++++++++--------
2 files changed, 24 insertions(+), 8 deletions(-)
Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c
+++ linux-2.6/mm/slub.c
@@ -20,11 +20,12 @@
#include <linux/mempolicy.h>
#include <linux/ctype.h>
#include <linux/kallsyms.h>
+#include "internal.h"
/*
* Lock order:
* 1. slab_lock(page)
- * 2. slab->list_lock
+ * 2. node->list_lock
*
* The slab_lock protects operations on the object of a particular
* slab and its metadata in the page struct. If the slab lock
@@ -1074,7 +1075,7 @@ static void setup_object(struct kmem_cac
s->ctor(s, object);
}
-static struct page *new_slab(struct kmem_cache *s, gfp_t flags, int node)
+static struct page *new_slab(struct kmem_cache *s, gfp_t flags, int node, int *reserve)
{
struct page *page;
struct kmem_cache_node *n;
@@ -1090,6 +1091,7 @@ static struct page *new_slab(struct kmem
if (!page)
goto out;
+ *reserve = page->reserve;
n = get_node(s, page_to_nid(page));
if (n)
atomic_long_inc(&n->nr_slabs);
@@ -1468,10 +1470,22 @@ static void *__slab_alloc(struct kmem_ca
{
void **object;
struct page *new;
+ int reserve = 0;
if (!c->page)
goto new_slab;
+ if (unlikely(c->reserve)) {
+ /*
+ * If the current slab is a reserve slab and the current
+ * allocation context does not allow access to the reserves
+ * we must force an allocation to test the current levels.
+ */
+ if (!(gfp_to_alloc_flags(gfpflags) & ALLOC_NO_WATERMARKS))
+ goto alloc_slab;
+ reserve = 1;
+ }
+
slab_lock(c->page);
if (unlikely(!node_match(c, node)))
goto another_slab;
@@ -1479,10 +1493,9 @@ load_freelist:
object = c->page->freelist;
if (unlikely(!object))
goto another_slab;
- if (unlikely(SlabDebug(c->page)))
+ if (unlikely(SlabDebug(c->page) || reserve))
goto debug;
- object = c->page->freelist;
c->freelist = object[c->offset];
c->page->inuse = s->objects;
c->page->freelist = NULL;
@@ -1500,16 +1513,18 @@ new_slab:
goto load_freelist;
}
+alloc_slab:
if (gfpflags & __GFP_WAIT)
local_irq_enable();
- new = new_slab(s, gfpflags, node);
+ new = new_slab(s, gfpflags, node, &reserve);
if (gfpflags & __GFP_WAIT)
local_irq_disable();
if (new) {
c = get_cpu_slab(s, smp_processor_id());
+ c->reserve = reserve;
if (c->page) {
/*
* Someone else populated the cpu_slab while we
@@ -1537,8 +1552,7 @@ new_slab:
}
return NULL;
debug:
- object = c->page->freelist;
- if (!alloc_debug_processing(s, c->page, object, addr))
+ if (SlabDebug(c->page) && !alloc_debug_processing(s, c->page, object, addr))
goto another_slab;
c->page->inuse++;
@@ -2010,10 +2024,11 @@ static struct kmem_cache_node *early_kme
{
struct page *page;
struct kmem_cache_node *n;
+ int reserve;
BUG_ON(kmalloc_caches->size < sizeof(struct kmem_cache_node));
- page = new_slab(kmalloc_caches, gfpflags, node);
+ page = new_slab(kmalloc_caches, gfpflags, node, &reserve);
BUG_ON(!page);
if (page_to_nid(page) != node) {
Index: linux-2.6/include/linux/slub_def.h
===================================================================
--- linux-2.6.orig/include/linux/slub_def.h
+++ linux-2.6/include/linux/slub_def.h
@@ -17,6 +17,7 @@ struct kmem_cache_cpu {
int node;
unsigned int offset;
unsigned int objsize;
+ int reserve;
};
struct kmem_cache_node {
--
* [PATCH 04/33] mm: allow mempool to fall back to memalloc reserves
2007-10-30 16:04 [PATCH 00/33] Swap over NFS -v14 Peter Zijlstra
` (2 preceding siblings ...)
2007-10-30 16:04 ` [PATCH 03/33] mm: slub: add knowledge of reserve pages Peter Zijlstra
@ 2007-10-30 16:04 ` Peter Zijlstra
2007-10-31 3:40 ` Nick Piggin
2007-10-30 16:04 ` [PATCH 05/33] mm: kmem_estimate_pages() Peter Zijlstra
` (29 subsequent siblings)
33 siblings, 1 reply; 72+ messages in thread
From: Peter Zijlstra @ 2007-10-30 16:04 UTC (permalink / raw)
To: Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, netdev,
trond.myklebust
Cc: Peter Zijlstra
[-- Attachment #1: mm-mempool_fixup.patch --]
[-- Type: text/plain, Size: 1548 bytes --]
Allow the mempool to use the memalloc reserves when all else fails and
the allocation context would otherwise allow it.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
mm/mempool.c | 12 +++++++++++-
1 file changed, 11 insertions(+), 1 deletion(-)
Index: linux-2.6/mm/mempool.c
===================================================================
--- linux-2.6.orig/mm/mempool.c
+++ linux-2.6/mm/mempool.c
@@ -14,6 +14,7 @@
#include <linux/mempool.h>
#include <linux/blkdev.h>
#include <linux/writeback.h>
+#include "internal.h"
static void add_element(mempool_t *pool, void *element)
{
@@ -204,7 +205,7 @@ void * mempool_alloc(mempool_t *pool, gf
void *element;
unsigned long flags;
wait_queue_t wait;
- gfp_t gfp_temp;
+ gfp_t gfp_temp, gfp_orig = gfp_mask;
might_sleep_if(gfp_mask & __GFP_WAIT);
@@ -228,6 +229,15 @@ repeat_alloc:
}
spin_unlock_irqrestore(&pool->lock, flags);
+ /* if we really had a right to the emergency reserves, try those */
+ if (gfp_to_alloc_flags(gfp_orig) & ALLOC_NO_WATERMARKS) {
+ if (gfp_temp & __GFP_NOMEMALLOC) {
+ gfp_temp &= ~(__GFP_NOMEMALLOC|__GFP_NOWARN);
+ goto repeat_alloc;
+ } else
+ gfp_temp |= __GFP_NOMEMALLOC|__GFP_NOWARN;
+ }
+
/* We must not sleep in the GFP_ATOMIC case */
if (!(gfp_mask & __GFP_WAIT))
return NULL;
--
* [PATCH 05/33] mm: kmem_estimate_pages()
2007-10-30 16:04 [PATCH 00/33] Swap over NFS -v14 Peter Zijlstra
` (3 preceding siblings ...)
2007-10-30 16:04 ` [PATCH 04/33] mm: allow mempool to fall back to memalloc reserves Peter Zijlstra
@ 2007-10-30 16:04 ` Peter Zijlstra
2007-10-31 3:43 ` Nick Piggin
2007-10-30 16:04 ` [PATCH 06/33] mm: allow PF_MEMALLOC from softirq context Peter Zijlstra
` (28 subsequent siblings)
33 siblings, 1 reply; 72+ messages in thread
From: Peter Zijlstra @ 2007-10-30 16:04 UTC (permalink / raw)
To: Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, netdev,
trond.myklebust
Cc: Peter Zijlstra
[-- Attachment #1: mm-kmem_estimate_pages.patch --]
[-- Type: text/plain, Size: 4070 bytes --]
Provide a method to get the upper bound on the pages needed to allocate
a given number of objects from a given kmem_cache.
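For instance, to size a reserve that must be able to back nr objects from a
cache (illustrative; my_cachep is a stand-in name):

	pages = kmem_estimate_pages(my_cachep, GFP_ATOMIC, nr);
	/* 'pages' is an upper bound, e.g. for adjust_memalloc_reserve() */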
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
include/linux/slab.h | 3 +
mm/slub.c | 82 +++++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 85 insertions(+)
Index: linux-2.6/include/linux/slab.h
===================================================================
--- linux-2.6.orig/include/linux/slab.h
+++ linux-2.6/include/linux/slab.h
@@ -60,6 +60,7 @@ void kmem_cache_free(struct kmem_cache *
unsigned int kmem_cache_size(struct kmem_cache *);
const char *kmem_cache_name(struct kmem_cache *);
int kmem_ptr_validate(struct kmem_cache *cachep, const void *ptr);
+unsigned kmem_estimate_pages(struct kmem_cache *cachep, gfp_t flags, int objects);
/*
* Please use this macro to create slab caches. Simply specify the
@@ -94,6 +95,8 @@ int kmem_ptr_validate(struct kmem_cache
void * __must_check krealloc(const void *, size_t, gfp_t);
void kfree(const void *);
size_t ksize(const void *);
+unsigned kestimate_single(size_t, gfp_t, int);
+unsigned kestimate(gfp_t, size_t);
/*
* Allocator specific definitions. These are mainly used to establish optimized
Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c
+++ linux-2.6/mm/slub.c
@@ -2293,6 +2293,37 @@ const char *kmem_cache_name(struct kmem_
EXPORT_SYMBOL(kmem_cache_name);
/*
+ * return the max number of pages required to allocate count
+ * objects from the given cache
+ */
+unsigned kmem_estimate_pages(struct kmem_cache *s, gfp_t flags, int objects)
+{
+ unsigned long slabs;
+
+ if (WARN_ON(!s) || WARN_ON(!s->objects))
+ return 0;
+
+ slabs = DIV_ROUND_UP(objects, s->objects);
+
+ /*
+ * Account the possible additional overhead if the slab holds more than
+ * one object.
+ */
+ if (s->objects > 1) {
+ /*
+ * Account the possible additional overhead if per cpu slabs
+ * are currently empty and have to be allocated. This is very
+ * unlikely but a possible scenario immediately after
+ * kmem_cache_shrink.
+ */
+ slabs += num_online_cpus();
+ }
+
+ return slabs << s->order;
+}
+EXPORT_SYMBOL_GPL(kmem_estimate_pages);
+
+/*
* Attempt to free all slabs on a node. Return the number of slabs we
* were unable to free.
*/
@@ -2630,6 +2661,57 @@ void kfree(const void *x)
EXPORT_SYMBOL(kfree);
/*
+ * return the max number of pages required to allocate @count objects
+ * of @size bytes from kmalloc given @flags.
+ */
+unsigned kestimate_single(size_t size, gfp_t flags, int count)
+{
+ struct kmem_cache *s = get_slab(size, flags);
+ if (!s)
+ return 0;
+
+ return kmem_estimate_pages(s, flags, count);
+
+}
+EXPORT_SYMBOL_GPL(kestimate_single);
+
+/*
+ * return the max number of pages required to allocate @bytes from kmalloc
+ * in an unspecified number of allocations of heterogeneous sizes.
+ */
+unsigned kestimate(gfp_t flags, size_t bytes)
+{
+ int i;
+ unsigned long pages;
+
+ /*
+ * multiply by two, in order to account for the worst case slack space
+ * due to the power-of-two allocation sizes.
+ */
+ pages = DIV_ROUND_UP(2 * bytes, PAGE_SIZE);
+
+ /*
+ * add the kmem_cache overhead of each possible kmalloc cache
+ */
+ for (i = 1; i < PAGE_SHIFT; i++) {
+ struct kmem_cache *s;
+
+#ifdef CONFIG_ZONE_DMA
+ if (unlikely(flags & SLUB_DMA))
+ s = dma_kmalloc_cache(i, flags);
+ else
+#endif
+ s = &kmalloc_caches[i];
+
+ if (s)
+ pages += kmem_estimate_pages(s, flags, 0);
+ }
+
+ return pages;
+}
+EXPORT_SYMBOL_GPL(kestimate);
+
+/*
* kmem_cache_shrink removes empty slabs from the partial lists and sorts
* the remaining slabs by the number of items in use. The slabs with the
* most items in use come first. New allocations will then fill those up
--
* [PATCH 06/33] mm: allow PF_MEMALLOC from softirq context
2007-10-30 16:04 [PATCH 00/33] Swap over NFS -v14 Peter Zijlstra
` (4 preceding siblings ...)
2007-10-30 16:04 ` [PATCH 05/33] mm: kmem_estimate_pages() Peter Zijlstra
@ 2007-10-30 16:04 ` Peter Zijlstra
2007-10-31 3:51 ` Nick Piggin
2007-10-30 16:04 ` [PATCH 07/33] mm: serialize access to min_free_kbytes Peter Zijlstra
` (27 subsequent siblings)
33 siblings, 1 reply; 72+ messages in thread
From: Peter Zijlstra @ 2007-10-30 16:04 UTC (permalink / raw)
To: Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, netdev,
trond.myklebust
Cc: Peter Zijlstra
[-- Attachment #1: mm-PF_MEMALLOC-softirq.patch --]
[-- Type: text/plain, Size: 2425 bytes --]
Allow PF_MEMALLOC to be set in softirq context. When running softirqs from
a borrowed context, save and restore current->flags; ksoftirqd will have its
own task_struct.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
include/linux/sched.h | 4 ++++
kernel/softirq.c | 3 +++
mm/page_alloc.c | 7 ++++---
3 files changed, 11 insertions(+), 3 deletions(-)
Index: linux-2.6/mm/page_alloc.c
===================================================================
--- linux-2.6.orig/mm/page_alloc.c
+++ linux-2.6/mm/page_alloc.c
@@ -1557,9 +1557,10 @@ int gfp_to_alloc_flags(gfp_t gfp_mask)
alloc_flags |= ALLOC_HARDER;
if (likely(!(gfp_mask & __GFP_NOMEMALLOC))) {
- if (!in_interrupt() &&
- ((p->flags & PF_MEMALLOC) ||
- unlikely(test_thread_flag(TIF_MEMDIE))))
+ if (!in_irq() && (p->flags & PF_MEMALLOC))
+ alloc_flags |= ALLOC_NO_WATERMARKS;
+ else if (!in_interrupt() &&
+ unlikely(test_thread_flag(TIF_MEMDIE)))
alloc_flags |= ALLOC_NO_WATERMARKS;
}
Index: linux-2.6/kernel/softirq.c
===================================================================
--- linux-2.6.orig/kernel/softirq.c
+++ linux-2.6/kernel/softirq.c
@@ -211,6 +211,8 @@ asmlinkage void __do_softirq(void)
__u32 pending;
int max_restart = MAX_SOFTIRQ_RESTART;
int cpu;
+ unsigned long pflags = current->flags;
+ current->flags &= ~PF_MEMALLOC;
pending = local_softirq_pending();
account_system_vtime(current);
@@ -249,6 +251,7 @@ restart:
account_system_vtime(current);
_local_bh_enable();
+ tsk_restore_flags(current, pflags, PF_MEMALLOC);
}
#ifndef __ARCH_HAS_DO_SOFTIRQ
Index: linux-2.6/include/linux/sched.h
===================================================================
--- linux-2.6.orig/include/linux/sched.h
+++ linux-2.6/include/linux/sched.h
@@ -1389,6 +1389,10 @@ static inline void put_task_struct(struc
#define tsk_used_math(p) ((p)->flags & PF_USED_MATH)
#define used_math() tsk_used_math(current)
+#define tsk_restore_flags(p, pflags, mask) \
+ do { (p)->flags &= ~(mask); \
+ (p)->flags |= ((pflags) & (mask)); } while (0)
+
#ifdef CONFIG_SMP
extern int set_cpus_allowed(struct task_struct *p, cpumask_t new_mask);
#else
--
* [PATCH 07/33] mm: serialize access to min_free_kbytes
2007-10-30 16:04 [PATCH 00/33] Swap over NFS -v14 Peter Zijlstra
` (5 preceding siblings ...)
2007-10-30 16:04 ` [PATCH 06/33] mm: allow PF_MEMALLOC from softirq context Peter Zijlstra
@ 2007-10-30 16:04 ` Peter Zijlstra
2007-10-30 16:04 ` [PATCH 08/33] mm: emergency pool Peter Zijlstra
` (26 subsequent siblings)
33 siblings, 0 replies; 72+ messages in thread
From: Peter Zijlstra @ 2007-10-30 16:04 UTC (permalink / raw)
To: Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, netdev,
trond.myklebust
Cc: Peter Zijlstra
[-- Attachment #1: mm-setup_per_zone_pages_min.patch --]
[-- Type: text/plain, Size: 2061 bytes --]
There is a small race between the procfs caller and the memory hotplug caller
of setup_per_zone_pages_min(). Not a big deal, but the next patch will add yet
another caller. Time to close the gap.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
mm/page_alloc.c | 16 +++++++++++++---
1 file changed, 13 insertions(+), 3 deletions(-)
Index: linux-2.6/mm/page_alloc.c
===================================================================
--- linux-2.6.orig/mm/page_alloc.c
+++ linux-2.6/mm/page_alloc.c
@@ -116,6 +116,7 @@ static char * const zone_names[MAX_NR_ZO
"Movable",
};
+static DEFINE_SPINLOCK(min_free_lock);
int min_free_kbytes = 1024;
unsigned long __meminitdata nr_kernel_pages;
@@ -4162,12 +4163,12 @@ static void setup_per_zone_lowmem_reserv
}
/**
- * setup_per_zone_pages_min - called when min_free_kbytes changes.
+ * __setup_per_zone_pages_min - called when min_free_kbytes changes.
*
* Ensures that the pages_{min,low,high} values for each zone are set correctly
* with respect to min_free_kbytes.
*/
-void setup_per_zone_pages_min(void)
+static void __setup_per_zone_pages_min(void)
{
unsigned long pages_min = min_free_kbytes >> (PAGE_SHIFT - 10);
unsigned long lowmem_pages = 0;
@@ -4222,6 +4223,15 @@ void setup_per_zone_pages_min(void)
calculate_totalreserve_pages();
}
+void setup_per_zone_pages_min(void)
+{
+ unsigned long flags;
+
+ spin_lock_irqsave(&min_free_lock, flags);
+ __setup_per_zone_pages_min();
+ spin_unlock_irqrestore(&min_free_lock, flags);
+}
+
/*
* Initialise min_free_kbytes.
*
@@ -4257,7 +4267,7 @@ static int __init init_per_zone_pages_mi
min_free_kbytes = 128;
if (min_free_kbytes > 65536)
min_free_kbytes = 65536;
- setup_per_zone_pages_min();
+ __setup_per_zone_pages_min();
setup_per_zone_lowmem_reserve();
return 0;
}
--
* [PATCH 08/33] mm: emergency pool
2007-10-30 16:04 [PATCH 00/33] Swap over NFS -v14 Peter Zijlstra
` (6 preceding siblings ...)
2007-10-30 16:04 ` [PATCH 07/33] mm: serialize access to min_free_kbytes Peter Zijlstra
@ 2007-10-30 16:04 ` Peter Zijlstra
2007-10-30 16:04 ` [PATCH 09/33] mm: system wide ALLOC_NO_WATERMARK Peter Zijlstra
` (25 subsequent siblings)
33 siblings, 0 replies; 72+ messages in thread
From: Peter Zijlstra @ 2007-10-30 16:04 UTC (permalink / raw)
To: Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, netdev,
trond.myklebust
Cc: Peter Zijlstra
[-- Attachment #1: mm-page_alloc-emerg.patch --]
[-- Type: text/plain, Size: 6831 bytes --]
Provide means to reserve a specific number of pages.
The emergency pool is separated from the min watermark because ALLOC_HARDER
and ALLOC_HIGH modify the watermark in a relative way and thus do not ensure
a strict minimum.
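To illustrate (this mirrors the existing zone_watermark_ok() logic):

	min = z->pages_min;
	if (alloc_flags & ALLOC_HIGH)		/* __GFP_HIGH */
		min -= min / 2;
	if (alloc_flags & ALLOC_HARDER)		/* !wait or realtime task */
		min -= min / 4;
	/* GFP_ATOMIC can thus reach down to 3/8 of pages_min - a floor that
	 * moves with the flags, hence the separate, strict pages_emerg */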
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
include/linux/mmzone.h | 3 +
mm/page_alloc.c | 82 +++++++++++++++++++++++++++++++++++++++++++------
mm/vmstat.c | 6 +--
3 files changed, 78 insertions(+), 13 deletions(-)
Index: linux-2.6/include/linux/mmzone.h
===================================================================
--- linux-2.6.orig/include/linux/mmzone.h
+++ linux-2.6/include/linux/mmzone.h
@@ -213,7 +213,7 @@ enum zone_type {
struct zone {
/* Fields commonly accessed by the page allocator */
- unsigned long pages_min, pages_low, pages_high;
+ unsigned long pages_emerg, pages_min, pages_low, pages_high;
/*
* We don't know if the memory that we're going to allocate will be freeable
* or/and it will be released eventually, so to avoid totally wasting several
@@ -682,6 +682,7 @@ int sysctl_min_unmapped_ratio_sysctl_han
struct file *, void __user *, size_t *, loff_t *);
int sysctl_min_slab_ratio_sysctl_handler(struct ctl_table *, int,
struct file *, void __user *, size_t *, loff_t *);
+int adjust_memalloc_reserve(int pages);
extern int numa_zonelist_order_handler(struct ctl_table *, int,
struct file *, void __user *, size_t *, loff_t *);
Index: linux-2.6/mm/page_alloc.c
===================================================================
--- linux-2.6.orig/mm/page_alloc.c
+++ linux-2.6/mm/page_alloc.c
@@ -118,6 +118,8 @@ static char * const zone_names[MAX_NR_ZO
static DEFINE_SPINLOCK(min_free_lock);
int min_free_kbytes = 1024;
+static DEFINE_MUTEX(var_free_mutex);
+int var_free_kbytes;
unsigned long __meminitdata nr_kernel_pages;
unsigned long __meminitdata nr_all_pages;
@@ -1252,7 +1254,7 @@ int zone_watermark_ok(struct zone *z, in
if (alloc_flags & ALLOC_HARDER)
min -= min / 4;
- if (free_pages <= min + z->lowmem_reserve[classzone_idx])
+ if (free_pages <= min + z->lowmem_reserve[classzone_idx] + z->pages_emerg)
return 0;
for (o = 0; o < order; o++) {
/* At the next order, this order's pages become unavailable */
@@ -1733,8 +1735,8 @@ nofail_alloc:
nopage:
if (!(gfp_mask & __GFP_NOWARN) && printk_ratelimit()) {
printk(KERN_WARNING "%s: page allocation failure."
- " order:%d, mode:0x%x\n",
- p->comm, order, gfp_mask);
+ " order:%d, mode:0x%x, alloc_flags:0x%x, pflags:0x%x\n",
+ p->comm, order, gfp_mask, alloc_flags, p->flags);
dump_stack();
show_mem();
}
@@ -1952,9 +1954,9 @@ void show_free_areas(void)
"\n",
zone->name,
K(zone_page_state(zone, NR_FREE_PAGES)),
- K(zone->pages_min),
- K(zone->pages_low),
- K(zone->pages_high),
+ K(zone->pages_emerg + zone->pages_min),
+ K(zone->pages_emerg + zone->pages_low),
+ K(zone->pages_emerg + zone->pages_high),
K(zone_page_state(zone, NR_ACTIVE)),
K(zone_page_state(zone, NR_INACTIVE)),
K(zone->present_pages),
@@ -4113,7 +4115,7 @@ static void calculate_totalreserve_pages
}
/* we treat pages_high as reserved pages. */
- max += zone->pages_high;
+ max += zone->pages_high + zone->pages_emerg;
if (max > zone->present_pages)
max = zone->present_pages;
@@ -4170,7 +4172,8 @@ static void setup_per_zone_lowmem_reserv
*/
static void __setup_per_zone_pages_min(void)
{
- unsigned long pages_min = min_free_kbytes >> (PAGE_SHIFT - 10);
+ unsigned pages_min = min_free_kbytes >> (PAGE_SHIFT - 10);
+ unsigned pages_emerg = var_free_kbytes >> (PAGE_SHIFT - 10);
unsigned long lowmem_pages = 0;
struct zone *zone;
unsigned long flags;
@@ -4182,11 +4185,13 @@ static void __setup_per_zone_pages_min(v
}
for_each_zone(zone) {
- u64 tmp;
+ u64 tmp, tmp_emerg;
spin_lock_irqsave(&zone->lru_lock, flags);
tmp = (u64)pages_min * zone->present_pages;
do_div(tmp, lowmem_pages);
+ tmp_emerg = (u64)pages_emerg * zone->present_pages;
+ do_div(tmp_emerg, lowmem_pages);
if (is_highmem(zone)) {
/*
* __GFP_HIGH and PF_MEMALLOC allocations usually don't
@@ -4205,12 +4210,14 @@ static void __setup_per_zone_pages_min(v
if (min_pages > 128)
min_pages = 128;
zone->pages_min = min_pages;
+ zone->pages_emerg = 0;
} else {
/*
* If it's a lowmem zone, reserve a number of pages
* proportionate to the zone's size.
*/
zone->pages_min = tmp;
+ zone->pages_emerg = tmp_emerg;
}
zone->pages_low = zone->pages_min + (tmp >> 2);
@@ -4232,6 +4239,63 @@ void setup_per_zone_pages_min(void)
spin_unlock_irqrestore(&min_free_lock, flags);
}
+static void __adjust_memalloc_reserve(int pages)
+{
+ var_free_kbytes += pages << (PAGE_SHIFT - 10);
+ BUG_ON(var_free_kbytes < 0);
+ setup_per_zone_pages_min();
+}
+
+static int test_reserve_limits(void)
+{
+ struct zone *zone;
+ int node;
+
+ for_each_zone(zone)
+ wakeup_kswapd(zone, 0);
+
+ for_each_online_node(node) {
+ struct page *page = alloc_pages_node(node, GFP_KERNEL, 0);
+ if (!page)
+ return -ENOMEM;
+
+ __free_page(page);
+ }
+
+ return 0;
+}
+
+/**
+ * adjust_memalloc_reserve - adjust the memalloc reserve
+ * @pages: number of pages to add
+ *
+ * It adds a number of pages to the memalloc reserve; if
+ * the number was positive it kicks reclaim into action to
+ * satisfy the higher watermarks.
+ *
+ * Returns -ENOMEM when it fails to satisfy the watermarks.
+ */
+int adjust_memalloc_reserve(int pages)
+{
+ int err = 0;
+
+ mutex_lock(&var_free_mutex);
+ __adjust_memalloc_reserve(pages);
+ if (pages > 0) {
+ err = test_reserve_limits();
+ if (err) {
+ __adjust_memalloc_reserve(-pages);
+ goto unlock;
+ }
+ }
+ printk(KERN_DEBUG "Emergency reserve: %d\n", var_free_kbytes);
+
+unlock:
+ mutex_unlock(&var_free_mutex);
+ return err;
+}
+EXPORT_SYMBOL_GPL(adjust_memalloc_reserve);
+
/*
* Initialise min_free_kbytes.
*
Index: linux-2.6/mm/vmstat.c
===================================================================
--- linux-2.6.orig/mm/vmstat.c
+++ linux-2.6/mm/vmstat.c
@@ -752,9 +752,9 @@ static void zoneinfo_show_print(struct s
"\n spanned %lu"
"\n present %lu",
zone_page_state(zone, NR_FREE_PAGES),
- zone->pages_min,
- zone->pages_low,
- zone->pages_high,
+ zone->pages_emerg + zone->pages_min,
+ zone->pages_emerg + zone->pages_low,
+ zone->pages_emerg + zone->pages_high,
zone->pages_scanned,
zone->nr_scan_active, zone->nr_scan_inactive,
zone->spanned_pages,
--
* [PATCH 09/33] mm: system wide ALLOC_NO_WATERMARK
2007-10-30 16:04 [PATCH 00/33] Swap over NFS -v14 Peter Zijlstra
` (7 preceding siblings ...)
2007-10-30 16:04 ` [PATCH 08/33] mm: emergency pool Peter Zijlstra
@ 2007-10-30 16:04 ` Peter Zijlstra
2007-10-31 3:52 ` Nick Piggin
2007-10-30 16:04 ` [PATCH 10/33] mm: __GFP_MEMALLOC Peter Zijlstra
` (24 subsequent siblings)
33 siblings, 1 reply; 72+ messages in thread
From: Peter Zijlstra @ 2007-10-30 16:04 UTC (permalink / raw)
To: Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, netdev,
trond.myklebust
Cc: Peter Zijlstra
[-- Attachment #1: global-ALLOC_NO_WATERMARKS.patch --]
[-- Type: text/plain, Size: 1078 bytes --]
Change ALLOC_NO_WATERMARKS page allocation such that the reserves are system
wide - which they already are per setup_per_zone_pages_min(). When we scrape
the barrel, do it properly.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
mm/page_alloc.c | 6 ++++++
1 file changed, 6 insertions(+)
Index: linux-2.6/mm/page_alloc.c
===================================================================
--- linux-2.6.orig/mm/page_alloc.c
+++ linux-2.6/mm/page_alloc.c
@@ -1638,6 +1638,12 @@ restart:
rebalance:
if (alloc_flags & ALLOC_NO_WATERMARKS) {
nofail_alloc:
+ /*
+ * break out of mempolicy boundaries
+ */
+ zonelist = NODE_DATA(numa_node_id())->node_zonelists +
+ gfp_zone(gfp_mask);
+
/* go through the zonelist yet again, ignoring mins */
page = get_page_from_freelist(gfp_mask, order, zonelist,
ALLOC_NO_WATERMARKS);
--
* [PATCH 10/33] mm: __GFP_MEMALLOC
2007-10-30 16:04 [PATCH 00/33] Swap over NFS -v14 Peter Zijlstra
` (8 preceding siblings ...)
2007-10-30 16:04 ` [PATCH 09/33] mm: system wide ALLOC_NO_WATERMARK Peter Zijlstra
@ 2007-10-30 16:04 ` Peter Zijlstra
2007-10-30 16:04 ` [PATCH 11/33] mm: memory reserve management Peter Zijlstra
` (23 subsequent siblings)
33 siblings, 0 replies; 72+ messages in thread
From: Peter Zijlstra @ 2007-10-30 16:04 UTC (permalink / raw)
To: Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, netdev,
trond.myklebust
Cc: Peter Zijlstra
[-- Attachment #1: mm-page_alloc-GFP_EMERGENCY.patch --]
[-- Type: text/plain, Size: 2273 bytes --]
__GFP_MEMALLOC will allow the allocation to disregard the watermarks,
much like PF_MEMALLOC.
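Usage would be along these lines (illustrative):

	data = kmalloc(size, gfp_mask | __GFP_MEMALLOC);

Unlike PF_MEMALLOC it marks a single allocation site rather than the whole
task context.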
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
include/linux/gfp.h | 3 ++-
mm/page_alloc.c | 4 +++-
2 files changed, 5 insertions(+), 2 deletions(-)
Index: linux-2.6/include/linux/gfp.h
===================================================================
--- linux-2.6.orig/include/linux/gfp.h
+++ linux-2.6/include/linux/gfp.h
@@ -43,6 +43,7 @@ struct vm_area_struct;
#define __GFP_REPEAT ((__force gfp_t)0x400u) /* Retry the allocation. Might fail */
#define __GFP_NOFAIL ((__force gfp_t)0x800u) /* Retry for ever. Cannot fail */
#define __GFP_NORETRY ((__force gfp_t)0x1000u)/* Do not retry. Might fail */
+#define __GFP_MEMALLOC ((__force gfp_t)0x2000u)/* Use emergency reserves */
#define __GFP_COMP ((__force gfp_t)0x4000u)/* Add compound page metadata */
#define __GFP_ZERO ((__force gfp_t)0x8000u)/* Return zeroed page on success */
#define __GFP_NOMEMALLOC ((__force gfp_t)0x10000u) /* Don't use emergency reserves */
@@ -88,7 +89,7 @@ struct vm_area_struct;
/* Control page allocator reclaim behavior */
#define GFP_RECLAIM_MASK (__GFP_WAIT|__GFP_HIGH|__GFP_IO|__GFP_FS|\
__GFP_NOWARN|__GFP_REPEAT|__GFP_NOFAIL|\
- __GFP_NORETRY|__GFP_NOMEMALLOC)
+ __GFP_NORETRY|__GFP_MEMALLOC|__GFP_NOMEMALLOC)
/* Control allocation constraints */
#define GFP_CONSTRAINT_MASK (__GFP_HARDWALL|__GFP_THISNODE)
Index: linux-2.6/mm/page_alloc.c
===================================================================
--- linux-2.6.orig/mm/page_alloc.c
+++ linux-2.6/mm/page_alloc.c
@@ -1560,7 +1560,9 @@ int gfp_to_alloc_flags(gfp_t gfp_mask)
alloc_flags |= ALLOC_HARDER;
if (likely(!(gfp_mask & __GFP_NOMEMALLOC))) {
- if (!in_irq() && (p->flags & PF_MEMALLOC))
+ if (gfp_mask & __GFP_MEMALLOC)
+ alloc_flags |= ALLOC_NO_WATERMARKS;
+ else if (!in_irq() && (p->flags & PF_MEMALLOC))
alloc_flags |= ALLOC_NO_WATERMARKS;
else if (!in_interrupt() &&
unlikely(test_thread_flag(TIF_MEMDIE)))
--
* [PATCH 11/33] mm: memory reserve management
2007-10-30 16:04 [PATCH 00/33] Swap over NFS -v14 Peter Zijlstra
` (9 preceding siblings ...)
2007-10-30 16:04 ` [PATCH 10/33] mm: __GFP_MEMALLOC Peter Zijlstra
@ 2007-10-30 16:04 ` Peter Zijlstra
2007-10-30 16:04 ` [PATCH 12/33] selinux: tag avc cache alloc as non-critical Peter Zijlstra
` (22 subsequent siblings)
33 siblings, 0 replies; 72+ messages in thread
From: Peter Zijlstra @ 2007-10-30 16:04 UTC (permalink / raw)
To: Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, netdev,
trond.myklebust
Cc: Peter Zijlstra
[-- Attachment #1: mm-reserve.patch --]
[-- Type: text/plain, Size: 14479 bytes --]
Generic reserve management code.
It provides methods to reserve and charge. Upon this, generic alloc/free style
reserve pools could be built, which could fully replace mempool_t
functionality.
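A rough usage sketch of the API introduced below (error handling elided;
my_reserve is a stand-in name):

	static struct mem_reserve my_reserve;

	mem_reserve_init(&my_reserve, "my reserve", &mem_reserve_root);
	mem_reserve_kmalloc_set(&my_reserve, 512*1024);	/* 512K of kmalloc */

	if (!mem_reserve_kmalloc_charge(&my_reserve, size, 0))
		goto fail;			/* reserve limit exceeded */
	obj = kmalloc(size, gfp_mask | __GFP_MEMALLOC);
	/* ... and on free: */
	kfree(obj);
	mem_reserve_kmalloc_charge(&my_reserve, -size, 0);	/* uncharge */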
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
include/linux/reserve.h | 54 +++++
mm/Makefile | 2
mm/reserve.c | 436 ++++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 491 insertions(+), 1 deletion(-)
Index: linux-2.6/include/linux/reserve.h
===================================================================
--- /dev/null
+++ linux-2.6/include/linux/reserve.h
@@ -0,0 +1,54 @@
+/*
+ * Memory reserve management.
+ *
+ * Copyright (C) 2007 Red Hat, Inc., Peter Zijlstra <pzijlstr@redhat.com>
+ *
+ * This file contains the public data structure and API definitions.
+ */
+
+#ifndef _LINUX_RESERVE_H
+#define _LINUX_RESERVE_H
+
+#include <linux/list.h>
+#include <linux/spinlock.h>
+
+struct mem_reserve {
+ struct mem_reserve *parent;
+ struct list_head children;
+ struct list_head siblings;
+
+ const char *name;
+
+ long pages;
+ long limit;
+ long usage;
+ spinlock_t lock; /* protects limit and usage */
+};
+
+extern struct mem_reserve mem_reserve_root;
+
+void mem_reserve_init(struct mem_reserve *res, const char *name,
+ struct mem_reserve *parent);
+int mem_reserve_connect(struct mem_reserve *new_child,
+ struct mem_reserve *node);
+int mem_reserve_disconnect(struct mem_reserve *node);
+
+int mem_reserve_pages_set(struct mem_reserve *res, long pages);
+int mem_reserve_pages_add(struct mem_reserve *res, long pages);
+int mem_reserve_pages_charge(struct mem_reserve *res, long pages,
+ int overcommit);
+
+int mem_reserve_kmalloc_set(struct mem_reserve *res, long bytes);
+int mem_reserve_kmalloc_charge(struct mem_reserve *res, long bytes,
+ int overcommit);
+
+struct kmem_cache;
+
+int mem_reserve_kmem_cache_set(struct mem_reserve *res,
+ struct kmem_cache *s,
+ int objects);
+int mem_reserve_kmem_cache_charge(struct mem_reserve *res,
+ long objs,
+ int overcommit);
+
+#endif /* _LINUX_RESERVE_H */
Index: linux-2.6/mm/Makefile
===================================================================
--- linux-2.6.orig/mm/Makefile
+++ linux-2.6/mm/Makefile
@@ -11,7 +11,7 @@ obj-y := bootmem.o filemap.o mempool.o
page_alloc.o page-writeback.o pdflush.o \
readahead.o swap.o truncate.o vmscan.o \
prio_tree.o util.o mmzone.o vmstat.o backing-dev.o \
- page_isolation.o $(mmu-y)
+ page_isolation.o reserve.o $(mmu-y)
obj-$(CONFIG_BOUNCE) += bounce.o
obj-$(CONFIG_SWAP) += page_io.o swap_state.o swapfile.o thrash.o
Index: linux-2.6/mm/reserve.c
===================================================================
--- /dev/null
+++ linux-2.6/mm/reserve.c
@@ -0,0 +1,436 @@
+/*
+ * Memory reserve management.
+ *
+ * Copyright (C) 2007, Red Hat, Inc., Peter Zijlstra <pzijlstr@redhat.com>
+ *
+ * Description:
+ *
+ * Manage a set of memory reserves.
+ *
+ * A memory reserve is a reserve for a specified number of objects of a specified
+ * size. Since memory is managed in pages, this reserve demand is then
+ * translated into a page unit.
+ *
+ * So each reserve has a specified object limit, an object usage count and a
+ * number of pages required to back these objects.
+ *
+ * Usage is charged against a reserve, if the charge fails, the resource must
+ * not be allocated/used.
+ *
+ * The reserves are managed in a tree, and the resource demands (pages and
+ * limit) are propagated up the tree. Obviously the object limit will be
+ * meaningless as soon as the unit starts mixing, but the required page reserve
+ * (being of one unit) is still valid at the root.
+ *
+ * It is the page demand of the root node that is used to set the global
+ * reserve (adjust_memalloc_reserve() which sets zone->pages_emerg).
+ *
+ * As long as a subtree has the same usage unit, an aggregate node can be used
+ * to charge against, instead of the leaf nodes. However, do be consistent
+ * about who is charged: resource usage is not propagated up the tree (for
+ * performance reasons).
+ */
+
+#include <linux/reserve.h>
+#include <linux/mutex.h>
+#include <linux/mmzone.h>
+#include <linux/log2.h>
+#include <linux/proc_fs.h>
+#include <linux/seq_file.h>
+#include <linux/module.h>
+#include <linux/slab.h>
+
+static DEFINE_MUTEX(mem_reserve_mutex);
+
+/**
+ * @mem_reserve_root - the global reserve root
+ *
+ * The global reserve is empty, and has no limit unit, it merely
+ * acts as an aggregation point for reserves and an interface to
+ * adjust_memalloc_reserve().
+ */
+struct mem_reserve mem_reserve_root = {
+ .children = LIST_HEAD_INIT(mem_reserve_root.children),
+ .siblings = LIST_HEAD_INIT(mem_reserve_root.siblings),
+ .name = "total reserve",
+};
+
+EXPORT_SYMBOL_GPL(mem_reserve_root);
+
+/**
+ * mem_reserve_init - initialize a memory reserve object
+ * @res - the new reserve object
+ * @name - a name for this reserve
+ */
+void mem_reserve_init(struct mem_reserve *res, const char *name,
+ struct mem_reserve *parent)
+{
+ memset(res, 0, sizeof(*res));
+ INIT_LIST_HEAD(&res->children);
+ INIT_LIST_HEAD(&res->siblings);
+ res->name = name;
+
+ if (parent)
+ mem_reserve_connect(res, parent);
+}
+
+EXPORT_SYMBOL_GPL(mem_reserve_init);
+
+/*
+ * propagate the pages and limit changes up the tree.
+ */
+static void __calc_reserve(struct mem_reserve *res, long pages, long limit)
+{
+ unsigned long flags;
+
+ for ( ; res; res = res->parent) {
+ res->pages += pages;
+
+ if (limit) {
+ spin_lock_irqsave(&res->lock, flags);
+ res->limit += limit;
+ spin_unlock_irqrestore(&res->lock, flags);
+ }
+ }
+}
+
+/**
+ * __mem_reserve_add - primitive to change the size of a reserve
+ * @res - reserve to change
+ * @pages - page delta
+ * @limit - usage limit delta
+ *
+ * Returns -ENOMEM when a size increase is not possible atm.
+ */
+static int __mem_reserve_add(struct mem_reserve *res, long pages, long limit)
+{
+ int ret = 0;
+ long reserve;
+
+ reserve = mem_reserve_root.pages;
+ __calc_reserve(res, pages, 0);
+ reserve = mem_reserve_root.pages - reserve;
+
+ if (reserve) {
+ ret = adjust_memalloc_reserve(reserve);
+ if (ret)
+ __calc_reserve(res, -pages, 0);
+ }
+
+ if (!ret)
+ __calc_reserve(res, 0, limit);
+
+ return ret;
+}
+
+/**
+ * __mem_reserve_charge - primitive to charge object usage to a reserve
+ * @res - reserve to charge
+ * @charge - size of the charge
+ * @overcommit - allow despite of limit (use with caution!)
+ *
+ * Returns non-zero on success, zero on failure.
+ */
+static
+int __mem_reserve_charge(struct mem_reserve *res, long charge, int overcommit)
+{
+ unsigned long flags;
+ int ret = 0;
+
+ spin_lock_irqsave(&res->lock, flags);
+ if (charge < 0 || res->usage + charge < res->limit || overcommit) {
+ res->usage += charge;
+ if (unlikely(res->usage < 0))
+ res->usage = 0;
+ ret = 1;
+ }
+ spin_unlock_irqrestore(&res->lock, flags);
+
+ return ret;
+}
+
+/**
+ * mem_reserve_connect - connect a reserve to another in a child-parent relation
+ * @new_child - the reserve node to connect (child)
+ * @node - the reserve node to connect to (parent)
+ *
+ * Returns -ENOMEM when the new connection would increase the reserve (parent
+ * is connected to mem_reserve_root) and there is no memory to do so.
+ *
+ * The child is _NOT_ connected on error.
+ */
+int mem_reserve_connect(struct mem_reserve *new_child, struct mem_reserve *node)
+{
+ int ret;
+
+ WARN_ON(!new_child->name);
+
+ mutex_lock(&mem_reserve_mutex);
+ new_child->parent = node;
+ list_add(&new_child->siblings, &node->children);
+ ret = __mem_reserve_add(node, new_child->pages, new_child->limit);
+ if (ret) {
+ new_child->parent = NULL;
+ list_del_init(&new_child->siblings);
+ }
+ mutex_unlock(&mem_reserve_mutex);
+
+ return ret;
+}
+
+EXPORT_SYMBOL_GPL(mem_reserve_connect);
+
+/**
+ * mem_reserve_disconnect - sever a nodes connection to the reserve tree
+ * @node - the node to disconnect
+ *
+ * Could, in theory, return -ENOMEM, but since disconnecting a node _should_
+ * only decrease the reserves that _should_ not happen.
+ */
+int mem_reserve_disconnect(struct mem_reserve *node)
+{
+ int ret;
+
+ BUG_ON(!node->parent);
+
+ mutex_lock(&mem_reserve_mutex);
+ ret = __mem_reserve_add(node->parent, -node->pages, -node->limit);
+ if (!ret) {
+ node->parent = NULL;
+ list_del_init(&node->siblings);
+ }
+ mutex_unlock(&mem_reserve_mutex);
+
+ return ret;
+}
+
+EXPORT_SYMBOL_GPL(mem_reserve_disconnect);
+
+#ifdef CONFIG_PROC_FS
+
+/*
+ * Simple output of the reserve tree in: /proc/reserve_info
+ * Example:
+ *
+ * localhost ~ # cat /proc/reserve_info
+ * total reserve 8156K (0/544817)
+ * total network reserve 8156K (0/544817)
+ * network TX reserve 196K (0/49)
+ * protocol TX pages 196K (0/49)
+ * network RX reserve 7960K (0/544768)
+ * IPv6 route cache 1372K (0/4096)
+ * IPv4 route cache 5468K (0/16384)
+ * SKB data reserve 1120K (0/524288)
+ * IPv6 fragment cache 560K (0/262144)
+ * IPv4 fragment cache 560K (0/262144)
+ */
+
+static void mem_reserve_show_item(struct seq_file *m, struct mem_reserve *res,
+ int nesting)
+{
+ int i;
+ struct mem_reserve *child;
+
+ for (i = 0; i < nesting; i++)
+ seq_puts(m, " ");
+
+ seq_printf(m, "%-30s %ldK (%ld/%ld)\n",
+ res->name, res->pages << (PAGE_SHIFT - 10),
+ res->usage, res->limit);
+
+ list_for_each_entry(child, &res->children, siblings)
+ mem_reserve_show_item(m, child, nesting+1);
+}
+
+static int mem_reserve_show(struct seq_file *m, void *v)
+{
+ mutex_lock(&mem_reserve_mutex);
+ mem_reserve_show_item(m, &mem_reserve_root, 0);
+ mutex_unlock(&mem_reserve_mutex);
+
+ return 0;
+}
+
+static int mem_reserve_open(struct inode *inode, struct file *file)
+{
+ return single_open(file, mem_reserve_show, NULL);
+}
+
+static const struct file_operations mem_reserve_operations = {
+ .open = mem_reserve_open,
+ .read = seq_read,
+ .llseek = seq_lseek,
+ .release = single_release,
+};
+
+static __init int mem_reserve_proc_init(void)
+{
+ struct proc_dir_entry *entry;
+
+ entry = create_proc_entry("reserve_info", S_IRUSR, NULL);
+ if (entry)
+ entry->proc_fops = &mem_reserve_operations;
+
+ return 0;
+}
+
+__initcall(mem_reserve_proc_init);
+
+#endif
+
+/*
+ * alloc_page helpers
+ */
+
+/**
+ * mem_reserve_pages_set - set reserves size in pages
+ * @res - reserve to set
+ * @pages - size in pages to set it to
+ *
+ * Returns -ENOMEM when it fails to set the reserve. On failure the old size
+ * is preserved.
+ */
+int mem_reserve_pages_set(struct mem_reserve *res, long pages)
+{
+ int ret;
+
+ mutex_lock(&mem_reserve_mutex);
+ pages -= res->pages;
+ ret = __mem_reserve_add(res, pages, pages);
+ mutex_unlock(&mem_reserve_mutex);
+
+ return ret;
+}
+
+EXPORT_SYMBOL_GPL(mem_reserve_pages_set);
+
+/**
+ * mem_reserve_pages_add - change the size in a relative way
+ * @res - reserve to change
+ * @pages - number of pages to add (or subtract when negative)
+ *
+ * Similar to mem_reserve_pages_set, except that the argument is relative instead
+ * of absolute.
+ *
+ * Returns -ENOMEM when it fails to increase.
+ */
+int mem_reserve_pages_add(struct mem_reserve *res, long pages)
+{
+ int ret;
+
+ mutex_lock(&mem_reserve_mutex);
+ ret = __mem_reserve_add(res, pages, pages);
+ mutex_unlock(&mem_reserve_mutex);
+
+ return ret;
+}
+
+/**
+ * mem_reserve_pages_charge - charge page usage to a reserve
+ * @res - reserve to charge
+ * @pages - size to charge
+ * @overcommit - disregard the usage limit (use with caution!)
+ *
+ * Returns non-zero on success.
+ */
+int mem_reserve_pages_charge(struct mem_reserve *res, long pages, int overcommit)
+{
+ return __mem_reserve_charge(res, pages, overcommit);
+}
+
+EXPORT_SYMBOL_GPL(mem_reserve_pages_charge);
+
+/*
+ * kmalloc helpers
+ */
+
+/**
+ * mem_reserve_kmalloc_set - set this reserve to bytes worth of kmalloc
+ * @res - reserve to change
+ * @bytes - size in bytes to reserve
+ *
+ * Returns -ENOMEM on failure.
+ */
+int mem_reserve_kmalloc_set(struct mem_reserve *res, long bytes)
+{
+ int ret;
+ long pages;
+
+ mutex_lock(&mem_reserve_mutex);
+ pages = kestimate(GFP_ATOMIC, bytes);
+ pages -= res->pages;
+ bytes -= res->limit;
+ ret = __mem_reserve_add(res, pages, bytes);
+ mutex_unlock(&mem_reserve_mutex);
+
+ return ret;
+}
+
+EXPORT_SYMBOL_GPL(mem_reserve_kmalloc_set);
+
+/**
+ * mem_reserve_kmalloc_charge - charge bytes to a reserve
+ * @res - reserve to charge
+ * @bytes - bytes to charge
+ * @overcommit - disregard the usage limit (use with caution!)
+ *
+ * Returns non-zero on success.
+ */
+int mem_reserve_kmalloc_charge(struct mem_reserve *res, long bytes,
+ int overcommit)
+{
+ if (bytes < 0)
+ bytes = -roundup_pow_of_two(-bytes);
+ else
+ bytes = roundup_pow_of_two(bytes);
+
+ return __mem_reserve_charge(res, bytes, overcommit);
+}
+
+EXPORT_SYMBOL_GPL(mem_reserve_kmalloc_charge);
+
+/*
+ * kmem_cache helpers
+ */
+
+/**
+ * mem_reserve_kmem_cache_set - set reserve to @objects worth of kmem_cache_alloc of @s
+ * @res - reserve to set
+ * @s - kmem_cache to reserve from
+ * @objects - number of objects to reserve
+ *
+ * Returns -ENOMEM on failure.
+ */
+int mem_reserve_kmem_cache_set(struct mem_reserve *res, struct kmem_cache *s,
+ int objects)
+{
+ int ret;
+ long pages;
+
+ mutex_lock(&mem_reserve_mutex);
+ pages = kmem_estimate_pages(s, GFP_ATOMIC, objects);
+ pages -= res->pages;
+ objects -= res->limit;
+ ret = __mem_reserve_add(res, pages, objects);
+ mutex_unlock(&mem_reserve_mutex);
+
+ return ret;
+}
+
+EXPORT_SYMBOL_GPL(mem_reserve_kmem_cache_set);
+
+/**
+ * mem_reserve_kmem_cache_charge - charge (or uncharge) usage of objs
+ * @res - reserve to charge
+ * @objs - objects to charge for
+ * @overcommit - disregard the usage limit (use with caution!)
+ *
+ * Returns non-zero on success.
+ */
+int mem_reserve_kmem_cache_charge(struct mem_reserve *res, long objs,
+ int overcommit)
+{
+ return __mem_reserve_charge(res, objs, overcommit);
+}
+
+EXPORT_SYMBOL_GPL(mem_reserve_kmem_cache_charge);
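A minimal sketch of the intended calling convention; the reserve, function
names and sizes below are hypothetical, not part of the patch:
static struct mem_reserve my_reserve;
static int __init my_subsys_init(void)
{
	int err;
	mem_reserve_init(&my_reserve, "my subsystem reserve", NULL);
	err = mem_reserve_connect(&my_reserve, &mem_reserve_root);
	if (err)
		return err;
	/* guarantee 16K worth of kmalloc() can be served from the reserves */
	return mem_reserve_kmalloc_set(&my_reserve, 16 * 1024);
}
static void *my_emergency_alloc(size_t size, gfp_t gfp)
{
	void *ptr;
	/* account against the reserve before dipping into it */
	if (!mem_reserve_kmalloc_charge(&my_reserve, size, 0))
		return NULL;	/* reserve exhausted */
	ptr = kmalloc(size, gfp | __GFP_MEMALLOC);
	if (!ptr)
		mem_reserve_kmalloc_charge(&my_reserve, -size, 0);
	return ptr;
}
A real user would try a normal allocation first and only fall back to the
reserve under pressure, as the netvm patches later in the series do; the
charge is likewise returned with a negative size on free.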
* [PATCH 12/33] selinux: tag avc cache alloc as non-critical
2007-10-30 16:04 [PATCH 00/33] Swap over NFS -v14 Peter Zijlstra
` (10 preceding siblings ...)
2007-10-30 16:04 ` [PATCH 11/33] mm: memory reserve management Peter Zijlstra
@ 2007-10-30 16:04 ` Peter Zijlstra
2007-10-30 16:04 ` [PATCH 13/33] net: wrap sk->sk_backlog_rcv() Peter Zijlstra
` (21 subsequent siblings)
33 siblings, 0 replies; 72+ messages in thread
From: Peter Zijlstra @ 2007-10-30 16:04 UTC (permalink / raw)
To: Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, netdev,
trond.myklebust
Cc: Peter Zijlstra, James Morris
[-- Attachment #1: mm-selinux-emergency.patch --]
[-- Type: text/plain, Size: 999 bytes --]
Failing to allocate a cache entry will only harm performance, not correctness.
Do not consume valuable reserve pages for something like that.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: James Morris <jmorris@namei.org>
---
security/selinux/avc.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
Index: linux-2.6-2/security/selinux/avc.c
===================================================================
--- linux-2.6-2.orig/security/selinux/avc.c
+++ linux-2.6-2/security/selinux/avc.c
@@ -334,7 +334,7 @@ static struct avc_node *avc_alloc_node(v
{
struct avc_node *node;
- node = kmem_cache_zalloc(avc_node_cachep, GFP_ATOMIC);
+ node = kmem_cache_zalloc(avc_node_cachep, GFP_ATOMIC|__GFP_NOMEMALLOC);
if (!node)
goto out;
* [PATCH 13/33] net: wrap sk->sk_backlog_rcv()
2007-10-30 16:04 [PATCH 00/33] Swap over NFS -v14 Peter Zijlstra
` (11 preceding siblings ...)
2007-10-30 16:04 ` [PATCH 12/33] selinux: tag avc cache alloc as non-critical Peter Zijlstra
@ 2007-10-30 16:04 ` Peter Zijlstra
2007-10-30 16:04 ` [PATCH 14/33] net: packet split receive api Peter Zijlstra
` (20 subsequent siblings)
33 siblings, 0 replies; 72+ messages in thread
From: Peter Zijlstra @ 2007-10-30 16:04 UTC (permalink / raw)
To: Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, netdev,
trond.myklebust
Cc: Peter Zijlstra
[-- Attachment #1: net-backlog.patch --]
[-- Type: text/plain, Size: 2637 bytes --]
Wrap calling sk->sk_backlog_rcv() in a function. This will allow extending the
generic sk_backlog_rcv behaviour.
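The indirection pays off later in the series: the wrapper is a single choke
point where emergency-skb handling can be injected. A plausible shape of that
extension (a sketch only; skb_emergency() arrives with the later netvm
patches and __sk_backlog_rcv() is assumed, not defined here):
static inline int sk_backlog_rcv(struct sock *sk, struct sk_buff *skb)
{
	if (skb_emergency(skb))
		/* run the protocol callback with the reserves usable */
		return __sk_backlog_rcv(sk, skb);
	return sk->sk_backlog_rcv(sk, skb);
}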
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
include/net/sock.h | 5 +++++
net/core/sock.c | 4 ++--
net/ipv4/tcp.c | 2 +-
net/ipv4/tcp_timer.c | 2 +-
4 files changed, 9 insertions(+), 4 deletions(-)
Index: linux-2.6/include/net/sock.h
===================================================================
--- linux-2.6.orig/include/net/sock.h
+++ linux-2.6/include/net/sock.h
@@ -485,6 +485,11 @@ static inline void sk_add_backlog(struct
skb->next = NULL;
}
+static inline int sk_backlog_rcv(struct sock *sk, struct sk_buff *skb)
+{
+ return sk->sk_backlog_rcv(sk, skb);
+}
+
#define sk_wait_event(__sk, __timeo, __condition) \
({ int __rc; \
release_sock(__sk); \
Index: linux-2.6/net/core/sock.c
===================================================================
--- linux-2.6.orig/net/core/sock.c
+++ linux-2.6/net/core/sock.c
@@ -320,7 +320,7 @@ int sk_receive_skb(struct sock *sk, stru
*/
mutex_acquire(&sk->sk_lock.dep_map, 0, 1, _RET_IP_);
- rc = sk->sk_backlog_rcv(sk, skb);
+ rc = sk_backlog_rcv(sk, skb);
mutex_release(&sk->sk_lock.dep_map, 1, _RET_IP_);
} else
@@ -1312,7 +1312,7 @@ static void __release_sock(struct sock *
struct sk_buff *next = skb->next;
skb->next = NULL;
- sk->sk_backlog_rcv(sk, skb);
+ sk_backlog_rcv(sk, skb);
/*
* We are in process context here with softirqs
Index: linux-2.6/net/ipv4/tcp.c
===================================================================
--- linux-2.6.orig/net/ipv4/tcp.c
+++ linux-2.6/net/ipv4/tcp.c
@@ -1134,7 +1134,7 @@ static void tcp_prequeue_process(struct
* necessary */
local_bh_disable();
while ((skb = __skb_dequeue(&tp->ucopy.prequeue)) != NULL)
- sk->sk_backlog_rcv(sk, skb);
+ sk_backlog_rcv(sk, skb);
local_bh_enable();
/* Clear memory counter. */
Index: linux-2.6/net/ipv4/tcp_timer.c
===================================================================
--- linux-2.6.orig/net/ipv4/tcp_timer.c
+++ linux-2.6/net/ipv4/tcp_timer.c
@@ -196,7 +196,7 @@ static void tcp_delack_timer(unsigned lo
NET_INC_STATS_BH(LINUX_MIB_TCPSCHEDULERFAILED);
while ((skb = __skb_dequeue(&tp->ucopy.prequeue)) != NULL)
- sk->sk_backlog_rcv(sk, skb);
+ sk_backlog_rcv(sk, skb);
tp->ucopy.memory = 0;
}
* [PATCH 14/33] net: packet split receive api
2007-10-30 16:04 [PATCH 00/33] Swap over NFS -v14 Peter Zijlstra
` (12 preceding siblings ...)
2007-10-30 16:04 ` [PATCH 13/33] net: wrap sk->sk_backlog_rcv() Peter Zijlstra
@ 2007-10-30 16:04 ` Peter Zijlstra
2007-10-30 16:04 ` [PATCH 15/33] net: sk_allocation() - concentrate socket related allocations Peter Zijlstra
` (19 subsequent siblings)
33 siblings, 0 replies; 72+ messages in thread
From: Peter Zijlstra @ 2007-10-30 16:04 UTC (permalink / raw)
To: Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, netdev,
trond.myklebust
Cc: Peter Zijlstra
[-- Attachment #1: net-ps_rx.patch --]
[-- Type: text/plain, Size: 5914 bytes --]
Add some packet-split receive hooks.
For one, this allows NUMA node affine page allocations. Later on, these hooks
will be extended to do emergency reserve allocations for fragments.
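A condensed sketch of how a driver RX path uses these hooks (pieced together
from the e1000/sky2 hunks below; ring-management details elided):
/* refill: fragment pages come from the device's NUMA node */
page = netdev_alloc_page(netdev);
if (unlikely(!page))
	goto no_buffers;
/* completion: attach the fragment; skb->len, skb->data_len and
 * skb->truesize are all updated by the helper */
skb_add_rx_frag(skb, i, page, 0, length);
/* unused buffers go back through the matching free hook */
netdev_free_page(netdev, page);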
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
drivers/net/e1000/e1000_main.c | 8 ++------
drivers/net/sky2.c | 16 ++++++----------
include/linux/skbuff.h | 23 +++++++++++++++++++++++
net/core/skbuff.c | 20 ++++++++++++++++++++
4 files changed, 51 insertions(+), 16 deletions(-)
Index: linux-2.6/drivers/net/e1000/e1000_main.c
===================================================================
--- linux-2.6.orig/drivers/net/e1000/e1000_main.c
+++ linux-2.6/drivers/net/e1000/e1000_main.c
@@ -4407,12 +4407,8 @@ e1000_clean_rx_irq_ps(struct e1000_adapt
pci_unmap_page(pdev, ps_page_dma->ps_page_dma[j],
PAGE_SIZE, PCI_DMA_FROMDEVICE);
ps_page_dma->ps_page_dma[j] = 0;
- skb_fill_page_desc(skb, j, ps_page->ps_page[j], 0,
- length);
+ skb_add_rx_frag(skb, j, ps_page->ps_page[j], 0, length);
ps_page->ps_page[j] = NULL;
- skb->len += length;
- skb->data_len += length;
- skb->truesize += length;
}
/* strip the ethernet crc, problem is we're using pages now so
@@ -4618,7 +4614,7 @@ e1000_alloc_rx_buffers_ps(struct e1000_a
if (j < adapter->rx_ps_pages) {
if (likely(!ps_page->ps_page[j])) {
ps_page->ps_page[j] =
- alloc_page(GFP_ATOMIC);
+ netdev_alloc_page(netdev);
if (unlikely(!ps_page->ps_page[j])) {
adapter->alloc_rx_buff_failed++;
goto no_buffers;
Index: linux-2.6/include/linux/skbuff.h
===================================================================
--- linux-2.6.orig/include/linux/skbuff.h
+++ linux-2.6/include/linux/skbuff.h
@@ -846,6 +846,9 @@ static inline void skb_fill_page_desc(st
skb_shinfo(skb)->nr_frags = i + 1;
}
+extern void skb_add_rx_frag(struct sk_buff *skb, int i, struct page *page,
+ int off, int size);
+
#define SKB_PAGE_ASSERT(skb) BUG_ON(skb_shinfo(skb)->nr_frags)
#define SKB_FRAG_ASSERT(skb) BUG_ON(skb_shinfo(skb)->frag_list)
#define SKB_LINEAR_ASSERT(skb) BUG_ON(skb_is_nonlinear(skb))
@@ -1339,6 +1342,26 @@ static inline struct sk_buff *netdev_all
return __netdev_alloc_skb(dev, length, GFP_ATOMIC);
}
+extern struct page *__netdev_alloc_page(struct net_device *dev, gfp_t gfp_mask);
+
+/**
+ * netdev_alloc_page - allocate a page for ps-rx on a specific device
+ * @dev: network device to receive on
+ *
+ * Allocate a new page node local to the specified device.
+ *
+ * %NULL is returned if there is no free memory.
+ */
+static inline struct page *netdev_alloc_page(struct net_device *dev)
+{
+ return __netdev_alloc_page(dev, GFP_ATOMIC);
+}
+
+static inline void netdev_free_page(struct net_device *dev, struct page *page)
+{
+ __free_page(page);
+}
+
/**
* skb_clone_writable - is the header of a clone writable
* @skb: buffer to check
Index: linux-2.6/net/core/skbuff.c
===================================================================
--- linux-2.6.orig/net/core/skbuff.c
+++ linux-2.6/net/core/skbuff.c
@@ -263,6 +263,24 @@ struct sk_buff *__netdev_alloc_skb(struc
return skb;
}
+struct page *__netdev_alloc_page(struct net_device *dev, gfp_t gfp_mask)
+{
+ int node = dev->dev.parent ? dev_to_node(dev->dev.parent) : -1;
+ struct page *page;
+
+ page = alloc_pages_node(node, gfp_mask, 0);
+ return page;
+}
+
+void skb_add_rx_frag(struct sk_buff *skb, int i, struct page *page, int off,
+ int size)
+{
+ skb_fill_page_desc(skb, i, page, off, size);
+ skb->len += size;
+ skb->data_len += size;
+ skb->truesize += size;
+}
+
static void skb_drop_list(struct sk_buff **listp)
{
struct sk_buff *list = *listp;
@@ -2464,6 +2482,8 @@ EXPORT_SYMBOL(kfree_skb);
EXPORT_SYMBOL(__pskb_pull_tail);
EXPORT_SYMBOL(__alloc_skb);
EXPORT_SYMBOL(__netdev_alloc_skb);
+EXPORT_SYMBOL(__netdev_alloc_page);
+EXPORT_SYMBOL(skb_add_rx_frag);
EXPORT_SYMBOL(pskb_copy);
EXPORT_SYMBOL(pskb_expand_head);
EXPORT_SYMBOL(skb_checksum);
Index: linux-2.6/drivers/net/sky2.c
===================================================================
--- linux-2.6.orig/drivers/net/sky2.c
+++ linux-2.6/drivers/net/sky2.c
@@ -1173,7 +1173,7 @@ static struct sk_buff *sky2_rx_alloc(str
skb_reserve(skb, ALIGN(p, RX_SKB_ALIGN) - p);
for (i = 0; i < sky2->rx_nfrags; i++) {
- struct page *page = alloc_page(GFP_ATOMIC);
+ struct page *page = netdev_alloc_page(sky2->netdev);
if (!page)
goto free_partial;
@@ -2089,8 +2089,8 @@ static struct sk_buff *receive_copy(stru
}
/* Adjust length of skb with fragments to match received data */
-static void skb_put_frags(struct sk_buff *skb, unsigned int hdr_space,
- unsigned int length)
+static void skb_put_frags(struct sky2_port *sky2, struct sk_buff *skb,
+ unsigned int hdr_space, unsigned int length)
{
int i, num_frags;
unsigned int size;
@@ -2107,15 +2107,11 @@ static void skb_put_frags(struct sk_buff
if (length == 0) {
/* don't need this page */
- __free_page(frag->page);
+ netdev_free_page(sky2->netdev, frag->page);
--skb_shinfo(skb)->nr_frags;
} else {
size = min(length, (unsigned) PAGE_SIZE);
-
- frag->size = size;
- skb->data_len += size;
- skb->truesize += size;
- skb->len += size;
+ skb_add_rx_frag(skb, i, frag->page, 0, size);
length -= size;
}
}
@@ -2142,7 +2138,7 @@ static struct sk_buff *receive_new(struc
sky2_rx_map_skb(sky2->hw->pdev, re, hdr_space);
if (skb_shinfo(skb)->nr_frags)
- skb_put_frags(skb, hdr_space, length);
+ skb_put_frags(sky2, skb, hdr_space, length);
else
skb_put(skb, length);
return skb;
* [PATCH 15/33] net: sk_allocation() - concentrate socket related allocations
2007-10-30 16:04 [PATCH 00/33] Swap over NFS -v14 Peter Zijlstra
` (13 preceding siblings ...)
2007-10-30 16:04 ` [PATCH 14/33] net: packet split receive api Peter Zijlstra
@ 2007-10-30 16:04 ` Peter Zijlstra
2007-10-30 16:04 ` [PATCH 16/33] netvm: network reserve infrastructure Peter Zijlstra
` (18 subsequent siblings)
33 siblings, 0 replies; 72+ messages in thread
From: Peter Zijlstra @ 2007-10-30 16:04 UTC (permalink / raw)
To: Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, netdev,
trond.myklebust
Cc: Peter Zijlstra
[-- Attachment #1: net-sk_allocation.patch --]
[-- Type: text/plain, Size: 5260 bytes --]
Introduce sk_allocation(); this function allows injecting socket-specific
flags into each socket-related allocation.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
include/net/sock.h | 7 ++++++-
net/ipv4/tcp_output.c | 11 ++++++-----
net/ipv6/tcp_ipv6.c | 14 +++++++++-----
3 files changed, 21 insertions(+), 11 deletions(-)
Index: linux-2.6/net/ipv4/tcp_output.c
===================================================================
--- linux-2.6.orig/net/ipv4/tcp_output.c
+++ linux-2.6/net/ipv4/tcp_output.c
@@ -2081,7 +2081,7 @@ void tcp_send_fin(struct sock *sk)
} else {
/* Socket is locked, keep trying until memory is available. */
for (;;) {
- skb = alloc_skb_fclone(MAX_TCP_HEADER, GFP_KERNEL);
+ skb = alloc_skb_fclone(MAX_TCP_HEADER, sk->sk_allocation);
if (skb)
break;
yield();
@@ -2114,7 +2114,7 @@ void tcp_send_active_reset(struct sock *
struct sk_buff *skb;
/* NOTE: No TCP options attached and we never retransmit this. */
- skb = alloc_skb(MAX_TCP_HEADER, priority);
+ skb = alloc_skb(MAX_TCP_HEADER, sk_allocation(sk, priority));
if (!skb) {
NET_INC_STATS(LINUX_MIB_TCPABORTFAILED);
return;
@@ -2187,7 +2187,8 @@ struct sk_buff * tcp_make_synack(struct
__u8 *md5_hash_location;
#endif
- skb = sock_wmalloc(sk, MAX_TCP_HEADER + 15, 1, GFP_ATOMIC);
+ skb = sock_wmalloc(sk, MAX_TCP_HEADER + 15, 1,
+ sk_allocation(sk, GFP_ATOMIC));
if (skb == NULL)
return NULL;
@@ -2446,7 +2447,7 @@ void tcp_send_ack(struct sock *sk)
* tcp_transmit_skb() will set the ownership to this
* sock.
*/
- buff = alloc_skb(MAX_TCP_HEADER, GFP_ATOMIC);
+ buff = alloc_skb(MAX_TCP_HEADER, sk_allocation(sk, GFP_ATOMIC));
if (buff == NULL) {
inet_csk_schedule_ack(sk);
inet_csk(sk)->icsk_ack.ato = TCP_ATO_MIN;
@@ -2488,7 +2489,7 @@ static int tcp_xmit_probe_skb(struct soc
struct sk_buff *skb;
/* We don't queue it, tcp_transmit_skb() sets ownership. */
- skb = alloc_skb(MAX_TCP_HEADER, GFP_ATOMIC);
+ skb = alloc_skb(MAX_TCP_HEADER, sk_allocation(sk, GFP_ATOMIC));
if (skb == NULL)
return -1;
Index: linux-2.6/include/net/sock.h
===================================================================
--- linux-2.6.orig/include/net/sock.h
+++ linux-2.6/include/net/sock.h
@@ -419,6 +419,11 @@ static inline int sock_flag(struct sock
return test_bit(flag, &sk->sk_flags);
}
+static inline gfp_t sk_allocation(struct sock *sk, gfp_t gfp_mask)
+{
+ return gfp_mask;
+}
+
static inline void sk_acceptq_removed(struct sock *sk)
{
sk->sk_ack_backlog--;
@@ -1212,7 +1217,7 @@ static inline struct sk_buff *sk_stream_
int hdr_len;
hdr_len = SKB_DATA_ALIGN(sk->sk_prot->max_header);
- skb = alloc_skb_fclone(size + hdr_len, gfp);
+ skb = alloc_skb_fclone(size + hdr_len, sk_allocation(sk, gfp));
if (skb) {
skb->truesize += mem;
if (sk_stream_wmem_schedule(sk, skb->truesize)) {
Index: linux-2.6/net/ipv6/tcp_ipv6.c
===================================================================
--- linux-2.6.orig/net/ipv6/tcp_ipv6.c
+++ linux-2.6/net/ipv6/tcp_ipv6.c
@@ -573,7 +573,8 @@ static int tcp_v6_md5_do_add(struct sock
} else {
/* reallocate new list if current one is full. */
if (!tp->md5sig_info) {
- tp->md5sig_info = kzalloc(sizeof(*tp->md5sig_info), GFP_ATOMIC);
+ tp->md5sig_info = kzalloc(sizeof(*tp->md5sig_info),
+ sk_allocation(sk, GFP_ATOMIC));
if (!tp->md5sig_info) {
kfree(newkey);
return -ENOMEM;
@@ -583,7 +584,8 @@ static int tcp_v6_md5_do_add(struct sock
tcp_alloc_md5sig_pool();
if (tp->md5sig_info->alloced6 == tp->md5sig_info->entries6) {
keys = kmalloc((sizeof (tp->md5sig_info->keys6[0]) *
- (tp->md5sig_info->entries6 + 1)), GFP_ATOMIC);
+ (tp->md5sig_info->entries6 + 1)),
+ sk_allocation(sk, GFP_ATOMIC));
if (!keys) {
tcp_free_md5sig_pool();
@@ -709,7 +711,7 @@ static int tcp_v6_parse_md5_keys (struct
struct tcp_sock *tp = tcp_sk(sk);
struct tcp_md5sig_info *p;
- p = kzalloc(sizeof(struct tcp_md5sig_info), GFP_KERNEL);
+ p = kzalloc(sizeof(struct tcp_md5sig_info), sk->sk_allocation);
if (!p)
return -ENOMEM;
@@ -1006,7 +1008,7 @@ static void tcp_v6_send_reset(struct soc
*/
buff = alloc_skb(MAX_HEADER + sizeof(struct ipv6hdr) + tot_len,
- GFP_ATOMIC);
+ sk_allocation(sk, GFP_ATOMIC));
if (buff == NULL)
return;
@@ -1085,10 +1087,12 @@ static void tcp_v6_send_ack(struct tcp_t
struct tcp_md5sig_key *key;
struct tcp_md5sig_key tw_key;
#endif
+ gfp_t gfp_mask = GFP_ATOMIC;
#ifdef CONFIG_TCP_MD5SIG
if (!tw && skb->sk) {
key = tcp_v6_md5_do_lookup(skb->sk, &ipv6_hdr(skb)->daddr);
+ gfp_mask = sk_allocation(skb->sk, gfp_mask);
} else if (tw && tw->tw_md5_keylen) {
tw_key.key = tw->tw_md5_key;
tw_key.keylen = tw->tw_md5_keylen;
@@ -1106,7 +1110,7 @@ static void tcp_v6_send_ack(struct tcp_t
#endif
buff = alloc_skb(MAX_HEADER + sizeof(struct ipv6hdr) + tot_len,
- GFP_ATOMIC);
+ gfp_mask);
if (buff == NULL)
return;
* [PATCH 16/33] netvm: network reserve infrastructure
2007-10-30 16:04 [PATCH 00/33] Swap over NFS -v14 Peter Zijlstra
` (14 preceding siblings ...)
2007-10-30 16:04 ` [PATCH 15/33] net: sk_allocation() - concentrate socket related allocations Peter Zijlstra
@ 2007-10-30 16:04 ` Peter Zijlstra
2007-10-30 16:04 ` [PATCH 17/33] sysctl: propagate conv errors Peter Zijlstra
` (17 subsequent siblings)
33 siblings, 0 replies; 72+ messages in thread
From: Peter Zijlstra @ 2007-10-30 16:04 UTC (permalink / raw)
To: Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, netdev,
trond.myklebust
Cc: Peter Zijlstra
[-- Attachment #1: netvm-reserve.patch --]
[-- Type: text/plain, Size: 7579 bytes --]
Provide the basic infrastructure to reserve and charge/account network memory.
We provide the following reserve tree:
1) total network reserve
2) network TX reserve
3) protocol TX pages
4) network RX reserve
5) SKB data reserve
[1] is used to make all the network reserves a single subtree, for easy
manipulation.
[2] and [4] are merely for aesthetic reasons.
The TX pages reserve [3] is assumed bounded, since it is sized as the upper
bound of memory that can be used for sending pages (not quite true, but good
enough).
The SKB reserve [5] is an aggregate reserve, which is used to charge SKB data
against in the fallback path.
The consumers for these reserves are sockets marked with:
SOCK_MEMALLOC
Such sockets are to be used to service the VM (iow. to swap over). They
must be handled kernel side; exposing such a socket to user-space is a BUG.
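A sketch of a kernel-side consumer (hypothetical function; the real users
appear later in the series when the NFS client is taught to swap):
static int swap_transport_setup(struct socket *sock)
{
	int err;
	/* reserve TX pages for this transport's request queue */
	err = sk_adjust_memalloc(0, TX_RESERVE_PAGES);
	if (err)
		return err;
	/* flip SOCK_MEMALLOC and bump the global socket count */
	err = sk_set_memalloc(sock->sk);
	if (err < 0) {
		sk_adjust_memalloc(0, -TX_RESERVE_PAGES);
		return err;
	}
	return 0;
}
sk_clear_memalloc(), also invoked from sk_free(), drops the flag and the
socket count; the TX pages are returned by a matching negative
sk_adjust_memalloc() call.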
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
include/net/sock.h | 35 +++++++++++++++-
net/Kconfig | 3 +
net/core/sock.c | 113 +++++++++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 150 insertions(+), 1 deletion(-)
Index: linux-2.6/include/net/sock.h
===================================================================
--- linux-2.6.orig/include/net/sock.h
+++ linux-2.6/include/net/sock.h
@@ -50,6 +50,7 @@
#include <linux/skbuff.h> /* struct sk_buff */
#include <linux/mm.h>
#include <linux/security.h>
+#include <linux/reserve.h>
#include <linux/filter.h>
@@ -397,6 +398,7 @@ enum sock_flags {
SOCK_RCVTSTAMPNS, /* %SO_TIMESTAMPNS setting */
SOCK_LOCALROUTE, /* route locally only, %SO_DONTROUTE setting */
SOCK_QUEUE_SHRUNK, /* write queue has been shrunk recently */
+ SOCK_MEMALLOC, /* the VM depends on us - make sure we're serviced */
};
static inline void sock_copy_flags(struct sock *nsk, struct sock *osk)
@@ -419,9 +421,40 @@ static inline int sock_flag(struct sock
return test_bit(flag, &sk->sk_flags);
}
+static inline int sk_has_memalloc(struct sock *sk)
+{
+ return sock_flag(sk, SOCK_MEMALLOC);
+}
+
+/*
+ * Guestimate the per request queue TX upper bound.
+ *
+ * Max packet size is 64k, and we need to reserve that much since the
+ * stack might need to bounce the data. Double it to be on the safe side.
+ */
+#define TX_RESERVE_PAGES DIV_ROUND_UP(2*65536, PAGE_SIZE)
+
+extern atomic_t memalloc_socks;
+
+extern struct mem_reserve net_rx_reserve;
+extern struct mem_reserve net_skb_reserve;
+
+static inline int sk_memalloc_socks(void)
+{
+ return atomic_read(&memalloc_socks);
+}
+
+extern int rx_emergency_get(int bytes);
+extern int rx_emergency_get_overcommit(int bytes);
+extern void rx_emergency_put(int bytes);
+
+extern int sk_adjust_memalloc(int socks, long tx_reserve_pages);
+extern int sk_set_memalloc(struct sock *sk);
+extern int sk_clear_memalloc(struct sock *sk);
+
static inline gfp_t sk_allocation(struct sock *sk, gfp_t gfp_mask)
{
- return gfp_mask;
+ return gfp_mask | (sk->sk_allocation & __GFP_MEMALLOC);
}
static inline void sk_acceptq_removed(struct sock *sk)
Index: linux-2.6/net/core/sock.c
===================================================================
--- linux-2.6.orig/net/core/sock.c
+++ linux-2.6/net/core/sock.c
@@ -112,6 +112,7 @@
#include <linux/tcp.h>
#include <linux/init.h>
#include <linux/highmem.h>
+#include <linux/reserve.h>
#include <asm/uaccess.h>
#include <asm/system.h>
@@ -213,6 +214,111 @@ __u32 sysctl_rmem_default __read_mostly
/* Maximal space eaten by iovec or ancilliary data plus some space */
int sysctl_optmem_max __read_mostly = sizeof(unsigned long)*(2*UIO_MAXIOV+512);
+atomic_t memalloc_socks;
+
+static struct mem_reserve net_reserve;
+struct mem_reserve net_rx_reserve;
+struct mem_reserve net_skb_reserve;
+static struct mem_reserve net_tx_reserve;
+static struct mem_reserve net_tx_pages;
+
+EXPORT_SYMBOL_GPL(net_rx_reserve); /* modular ipv6 only */
+EXPORT_SYMBOL_GPL(net_skb_reserve); /* modular ipv6 only */
+
+/*
+ * is there room for another emergency packet?
+ */
+static int __rx_emergency_get(int bytes, bool overcommit)
+{
+ return mem_reserve_kmalloc_charge(&net_skb_reserve, bytes, overcommit);
+}
+
+int rx_emergency_get(int bytes)
+{
+ return __rx_emergency_get(bytes, false);
+}
+
+int rx_emergency_get_overcommit(int bytes)
+{
+ return __rx_emergency_get(bytes, true);
+}
+
+void rx_emergency_put(int bytes)
+{
+ mem_reserve_kmalloc_charge(&net_skb_reserve, -bytes, 0);
+}
+
+/**
+ * sk_adjust_memalloc - adjust the global memalloc reserve for critical RX
+ * @socks: number of new %SOCK_MEMALLOC sockets
+ * @tx_reserve_pages: number of pages to (un)reserve for TX
+ *
+ * This function adjusts the memalloc reserve based on system demand.
+ * The RX reserve is a limit, and only added once, not for each socket.
+ *
+ * NOTE:
+ * @tx_reserve_pages is an upper bound on the memory used for TX, hence
+ * we need not account the pages as we do for RX pages.
+ */
+int sk_adjust_memalloc(int socks, long tx_reserve_pages)
+{
+ int nr_socks;
+ int err;
+
+ err = mem_reserve_pages_add(&net_tx_pages, tx_reserve_pages);
+ if (err)
+ return err;
+
+ nr_socks = atomic_read(&memalloc_socks);
+ if (!nr_socks && socks > 0)
+ err = mem_reserve_connect(&net_reserve, &mem_reserve_root);
+ nr_socks = atomic_add_return(socks, &memalloc_socks);
+ if (!nr_socks && socks)
+ err = mem_reserve_disconnect(&net_reserve);
+
+ if (err)
+ mem_reserve_pages_add(&net_tx_pages, -tx_reserve_pages);
+
+ return err;
+}
+
+/**
+ * sk_set_memalloc - sets %SOCK_MEMALLOC
+ * @sk: socket to set it on
+ *
+ * Set %SOCK_MEMALLOC on a socket and increase the memalloc reserve
+ * accordingly.
+ */
+int sk_set_memalloc(struct sock *sk)
+{
+ int set = sock_flag(sk, SOCK_MEMALLOC);
+#ifndef CONFIG_NETVM
+ BUG();
+#endif
+ if (!set) {
+ int err = sk_adjust_memalloc(1, 0);
+ if (err)
+ return err;
+
+ sock_set_flag(sk, SOCK_MEMALLOC);
+ sk->sk_allocation |= __GFP_MEMALLOC;
+ }
+ return !set;
+}
+EXPORT_SYMBOL_GPL(sk_set_memalloc);
+
+int sk_clear_memalloc(struct sock *sk)
+{
+ int set = sock_flag(sk, SOCK_MEMALLOC);
+ if (set) {
+ sk_adjust_memalloc(-1, 0);
+ sock_reset_flag(sk, SOCK_MEMALLOC);
+ sk->sk_allocation &= ~__GFP_MEMALLOC;
+ }
+ return set;
+}
+EXPORT_SYMBOL_GPL(sk_clear_memalloc);
+
static int sock_set_timeout(long *timeo_p, char __user *optval, int optlen)
{
struct timeval tv;
@@ -919,6 +1025,7 @@ void sk_free(struct sock *sk)
struct sk_filter *filter;
struct module *owner = sk->sk_prot_creator->owner;
+ sk_clear_memalloc(sk);
if (sk->sk_destruct)
sk->sk_destruct(sk);
@@ -1049,6 +1156,12 @@ void __init sk_init(void)
sysctl_wmem_max = 131071;
sysctl_rmem_max = 131071;
}
+
+ mem_reserve_init(&net_reserve, "total network reserve", NULL);
+ mem_reserve_init(&net_rx_reserve, "network RX reserve", &net_reserve);
+ mem_reserve_init(&net_skb_reserve, "SKB data reserve", &net_rx_reserve);
+ mem_reserve_init(&net_tx_reserve, "network TX reserve", &net_reserve);
+ mem_reserve_init(&net_tx_pages, "protocol TX pages", &net_tx_reserve);
}
/*
Index: linux-2.6/net/Kconfig
===================================================================
--- linux-2.6.orig/net/Kconfig
+++ linux-2.6/net/Kconfig
@@ -237,6 +237,9 @@ endmenu
source "net/rfkill/Kconfig"
source "net/9p/Kconfig"
+config NETVM
+ def_bool n
+
endif # if NET
endmenu # Networking
* [PATCH 17/33] sysctl: propagate conv errors
2007-10-30 16:04 [PATCH 00/33] Swap over NFS -v14 Peter Zijlstra
` (15 preceding siblings ...)
2007-10-30 16:04 ` [PATCH 16/33] netvm: network reserve infrastructure Peter Zijlstra
@ 2007-10-30 16:04 ` Peter Zijlstra
2007-10-30 16:04 ` [PATCH 18/33] netvm: INET reserves Peter Zijlstra
` (16 subsequent siblings)
33 siblings, 0 replies; 72+ messages in thread
From: Peter Zijlstra @ 2007-10-30 16:04 UTC (permalink / raw)
To: Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, netdev,
trond.myklebust
Cc: Peter Zijlstra
[-- Attachment #1: sysctl_parse_error.patch --]
[-- Type: text/plain, Size: 1476 bytes --]
Currently, conv routines can only generate -EINVAL; allow other errors to be
propagated.
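To illustrate the kind of consumer this enables (hypothetical handler; the
next patch adds real ones), a conv routine can now fail a write with a
meaningful error:
static int my_conv(int *negp, unsigned long *lvalp, int *valp,
		   int write, void *data)
{
	if (write && my_reserve_resize(*lvalp))
		return -ENOMEM;	/* now reaches the writer instead of -EINVAL */
	return do_proc_dointvec_conv(negp, lvalp, valp, write, data);
}
where my_reserve_resize() stands in for whatever side effect the write drives.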
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
kernel/sysctl.c | 11 ++++++-----
1 file changed, 6 insertions(+), 5 deletions(-)
Index: linux-2.6/kernel/sysctl.c
===================================================================
--- linux-2.6.orig/kernel/sysctl.c
+++ linux-2.6/kernel/sysctl.c
@@ -1732,6 +1732,7 @@ static int __do_proc_dointvec(void *tbl_
int *i, vleft, first=1, neg, val;
unsigned long lval;
size_t left, len;
+ int ret = 0;
char buf[TMPBUFLEN], *p;
char __user *s = buffer;
@@ -1787,14 +1788,16 @@ static int __do_proc_dointvec(void *tbl_
s += len;
left -= len;
- if (conv(&neg, &lval, i, 1, data))
+ ret = conv(&neg, &lval, i, 1, data);
+ if (ret)
break;
} else {
p = buf;
if (!first)
*p++ = '\t';
- if (conv(&neg, &lval, i, 0, data))
+ ret = conv(&neg, &lval, i, 0, data);
+ if (ret)
break;
sprintf(p, "%s%lu", neg ? "-" : "", lval);
@@ -1823,11 +1826,9 @@ static int __do_proc_dointvec(void *tbl_
left--;
}
}
- if (write && first)
- return -EINVAL;
*lenp -= left;
*ppos += *lenp;
- return 0;
+ return ret;
#undef TMPBUFLEN
}
* [PATCH 18/33] netvm: INET reserves.
2007-10-30 16:04 [PATCH 00/33] Swap over NFS -v14 Peter Zijlstra
` (16 preceding siblings ...)
2007-10-30 16:04 ` [PATCH 17/33] sysctl: propagate conv errors Peter Zijlstra
@ 2007-10-30 16:04 ` Peter Zijlstra
2007-10-30 16:04 ` [PATCH 19/33] netvm: hook skb allocation to reserves Peter Zijlstra
` (15 subsequent siblings)
33 siblings, 0 replies; 72+ messages in thread
From: Peter Zijlstra @ 2007-10-30 16:04 UTC (permalink / raw)
To: Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, netdev,
trond.myklebust
Cc: Peter Zijlstra
[-- Attachment #1: netvm-reserve-inet.patch --]
[-- Type: text/plain, Size: 12290 bytes --]
Add reserves for INET.
The two big users seem to be the route cache and ip-fragment cache.
Reserve the route cache under the generic RX reserve; its usage is bounded by
the high reclaim watermark and thus does not need further accounting.
Reserve the ip-fragment caches under the SKB data reserve; these add to the
SKB RX limit. By ensuring we can at least receive as much data as fits in
the reassembly caches, we avoid fragment attack deadlocks.
Use proc conv() routines to update these limits and return -ENOMEM to user
space.
Adds to the reserve tree:
total network reserve
network TX reserve
protocol TX pages
network RX reserve
+ IPv6 route cache
+ IPv4 route cache
SKB data reserve
+ IPv6 fragment cache
+ IPv4 fragment cache
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
include/linux/sysctl.h | 11 +++++++++++
kernel/sysctl.c | 8 ++++++--
net/ipv4/ip_fragment.c | 7 +++++++
net/ipv4/route.c | 30 +++++++++++++++++++++++++++++-
net/ipv4/sysctl_net_ipv4.c | 24 +++++++++++++++++++++++-
net/ipv6/reassembly.c | 7 +++++++
net/ipv6/route.c | 31 ++++++++++++++++++++++++++++++-
net/ipv6/sysctl_net_ipv6.c | 24 +++++++++++++++++++++++-
8 files changed, 136 insertions(+), 6 deletions(-)
Index: linux-2.6/net/ipv4/sysctl_net_ipv4.c
===================================================================
--- linux-2.6.orig/net/ipv4/sysctl_net_ipv4.c
+++ linux-2.6/net/ipv4/sysctl_net_ipv4.c
@@ -18,6 +18,7 @@
#include <net/route.h>
#include <net/tcp.h>
#include <net/cipso_ipv4.h>
+#include <linux/reserve.h>
/* From af_inet.c */
extern int sysctl_ip_nonlocal_bind;
@@ -186,6 +187,27 @@ static int strategy_allowed_congestion_c
}
+extern struct mem_reserve ipv4_frag_reserve;
+
+static int do_proc_dointvec_fragment_conv(int *negp, unsigned long *lvalp,
+ int *valp, int write, void *data)
+{
+ if (write) {
+ long value = *negp ? -*lvalp : *lvalp;
+ int err = mem_reserve_kmalloc_set(&ipv4_frag_reserve, value);
+ if (err)
+ return err;
+ }
+ return do_proc_dointvec_conv(negp, lvalp, valp, write, data);
+}
+
+static int proc_dointvec_fragment(ctl_table *table, int write, struct file *filp,
+ void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+ return do_proc_dointvec(table, write, filp, buffer, lenp, ppos,
+ do_proc_dointvec_fragment_conv, NULL);
+}
+
ctl_table ipv4_table[] = {
{
.ctl_name = NET_IPV4_TCP_TIMESTAMPS,
@@ -291,7 +313,7 @@ ctl_table ipv4_table[] = {
.data = &sysctl_ipfrag_high_thresh,
.maxlen = sizeof(int),
.mode = 0644,
- .proc_handler = &proc_dointvec
+ .proc_handler = &proc_dointvec_fragment
},
{
.ctl_name = NET_IPV4_IPFRAG_LOW_THRESH,
Index: linux-2.6/net/ipv6/sysctl_net_ipv6.c
===================================================================
--- linux-2.6.orig/net/ipv6/sysctl_net_ipv6.c
+++ linux-2.6/net/ipv6/sysctl_net_ipv6.c
@@ -12,9 +12,31 @@
#include <net/ndisc.h>
#include <net/ipv6.h>
#include <net/addrconf.h>
+#include <linux/reserve.h>
#ifdef CONFIG_SYSCTL
+extern struct mem_reserve ipv6_frag_reserve;
+
+static int do_proc_dointvec_fragment_conv(int *negp, unsigned long *lvalp,
+ int *valp, int write, void *data)
+{
+ if (write) {
+ long value = *negp ? -*lvalp : *lvalp;
+ int err = mem_reserve_kmalloc_set(&ipv6_frag_reserve, value);
+ if (err)
+ return err;
+ }
+ return do_proc_dointvec_conv(negp, lvalp, valp, write, data);
+}
+
+static int proc_dointvec_fragment(ctl_table *table, int write, struct file *filp,
+ void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+ return do_proc_dointvec(table, write, filp, buffer, lenp, ppos,
+ do_proc_dointvec_fragment_conv, NULL);
+}
+
static ctl_table ipv6_table[] = {
{
.ctl_name = NET_IPV6_ROUTE,
@@ -44,7 +66,7 @@ static ctl_table ipv6_table[] = {
.data = &sysctl_ip6frag_high_thresh,
.maxlen = sizeof(int),
.mode = 0644,
- .proc_handler = &proc_dointvec
+ .proc_handler = &proc_dointvec_fragment
},
{
.ctl_name = NET_IPV6_IP6FRAG_LOW_THRESH,
Index: linux-2.6/net/ipv4/ip_fragment.c
===================================================================
--- linux-2.6.orig/net/ipv4/ip_fragment.c
+++ linux-2.6/net/ipv4/ip_fragment.c
@@ -43,6 +43,7 @@
#include <linux/udp.h>
#include <linux/inet.h>
#include <linux/netfilter_ipv4.h>
+#include <linux/reserve.h>
/* NOTE. Logic of IP defragmentation is parallel to corresponding IPv6
* code now. If you change something here, _PLEASE_ update ipv6/reassembly.c
@@ -733,6 +734,8 @@ struct sk_buff *ip_defrag(struct sk_buff
return NULL;
}
+struct mem_reserve ipv4_frag_reserve;
+
void __init ipfrag_init(void)
{
ipfrag_hash_rnd = (u32) ((num_physpages ^ (num_physpages>>7)) ^
@@ -742,6 +745,10 @@ void __init ipfrag_init(void)
ipfrag_secret_timer.function = ipfrag_secret_rebuild;
ipfrag_secret_timer.expires = jiffies + sysctl_ipfrag_secret_interval;
add_timer(&ipfrag_secret_timer);
+
+ mem_reserve_init(&ipv4_frag_reserve, "IPv4 fragment cache",
+ &net_skb_reserve);
+ mem_reserve_kmalloc_set(&ipv4_frag_reserve, sysctl_ipfrag_high_thresh);
}
EXPORT_SYMBOL(ip_defrag);
Index: linux-2.6/net/ipv6/reassembly.c
===================================================================
--- linux-2.6.orig/net/ipv6/reassembly.c
+++ linux-2.6/net/ipv6/reassembly.c
@@ -42,6 +42,7 @@
#include <linux/icmpv6.h>
#include <linux/random.h>
#include <linux/jhash.h>
+#include <linux/reserve.h>
#include <net/sock.h>
#include <net/snmp.h>
@@ -770,6 +771,8 @@ static struct inet6_protocol frag_protoc
.flags = INET6_PROTO_NOPOLICY,
};
+struct mem_reserve ipv6_frag_reserve;
+
void __init ipv6_frag_init(void)
{
if (inet6_add_protocol(&frag_protocol, IPPROTO_FRAGMENT) < 0)
@@ -782,4 +785,8 @@ void __init ipv6_frag_init(void)
ip6_frag_secret_timer.function = ip6_frag_secret_rebuild;
ip6_frag_secret_timer.expires = jiffies + sysctl_ip6frag_secret_interval;
add_timer(&ip6_frag_secret_timer);
+
+ mem_reserve_init(&ipv6_frag_reserve, "IPv6 fragment cache",
+ &net_skb_reserve);
+ mem_reserve_kmalloc_set(&ipv6_frag_reserve, sysctl_ip6frag_high_thresh);
}
Index: linux-2.6/net/ipv4/route.c
===================================================================
--- linux-2.6.orig/net/ipv4/route.c
+++ linux-2.6/net/ipv4/route.c
@@ -108,6 +108,7 @@
#ifdef CONFIG_SYSCTL
#include <linux/sysctl.h>
#endif
+#include <linux/reserve.h>
#define RT_FL_TOS(oldflp) \
((u32)(oldflp->fl4_tos & (IPTOS_RT_MASK | RTO_ONLINK)))
@@ -2698,6 +2699,28 @@ static int ipv4_sysctl_rtcache_flush_str
return 0;
}
+static struct mem_reserve ipv4_route_reserve;
+
+static int do_proc_dointvec_route_conv(int *negp, unsigned long *lvalp,
+ int *valp, int write, void *data)
+{
+ if (write) {
+ long value = *negp ? -*lvalp : *lvalp;
+ int err = mem_reserve_kmem_cache_set(&ipv4_route_reserve,
+ ipv4_dst_ops.kmem_cachep, value);
+ if (err)
+ return err;
+ }
+ return do_proc_dointvec_conv(negp, lvalp, valp, write, data);
+}
+
+static int proc_dointvec_route(ctl_table *table, int write, struct file *filp,
+ void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+ return do_proc_dointvec(table, write, filp, buffer, lenp, ppos,
+ do_proc_dointvec_route_conv, NULL);
+}
+
ctl_table ipv4_route_table[] = {
{
.ctl_name = NET_IPV4_ROUTE_FLUSH,
@@ -2740,7 +2763,7 @@ ctl_table ipv4_route_table[] = {
.data = &ip_rt_max_size,
.maxlen = sizeof(int),
.mode = 0644,
- .proc_handler = &proc_dointvec,
+ .proc_handler = &proc_dointvec_route,
},
{
/* Deprecated. Use gc_min_interval_ms */
@@ -2970,6 +2993,11 @@ int __init ip_rt_init(void)
ipv4_dst_ops.gc_thresh = (rt_hash_mask + 1);
ip_rt_max_size = (rt_hash_mask + 1) * 16;
+ mem_reserve_init(&ipv4_route_reserve, "IPv4 route cache",
+ &net_rx_reserve);
+ mem_reserve_kmem_cache_set(&ipv4_route_reserve,
+ ipv4_dst_ops.kmem_cachep, ip_rt_max_size);
+
devinet_init();
ip_fib_init();
Index: linux-2.6/net/ipv6/route.c
===================================================================
--- linux-2.6.orig/net/ipv6/route.c
+++ linux-2.6/net/ipv6/route.c
@@ -38,6 +38,7 @@
#include <linux/in6.h>
#include <linux/init.h>
#include <linux/if_arp.h>
+#include <linux/reserve.h>
#ifdef CONFIG_PROC_FS
#include <linux/proc_fs.h>
@@ -2454,6 +2455,28 @@ int ipv6_sysctl_rtcache_flush(ctl_table
return -EINVAL;
}
+static struct mem_reserve ipv6_route_reserve;
+
+static int do_proc_dointvec_route6_conv(int *negp, unsigned long *lvalp,
+ int *valp, int write, void *data)
+{
+ if (write) {
+ long value = *negp ? -*lvalp : *lvalp;
+ int err = mem_reserve_kmem_cache_set(&ipv6_route_reserve,
+ ip6_dst_ops.kmem_cachep, value);
+ if (err)
+ return err;
+ }
+ return do_proc_dointvec_conv(negp, lvalp, valp, write, data);
+}
+
+static int proc_dointvec_route6(ctl_table *table, int write, struct file *filp,
+ void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+ return do_proc_dointvec(table, write, filp, buffer, lenp, ppos,
+ do_proc_dointvec_route6_conv, NULL);
+}
+
ctl_table ipv6_route_table[] = {
{
.procname = "flush",
@@ -2476,7 +2499,7 @@ ctl_table ipv6_route_table[] = {
.data = &ip6_rt_max_size,
.maxlen = sizeof(int),
.mode = 0644,
- .proc_handler = &proc_dointvec,
+ .proc_handler = &proc_dointvec_route6,
},
{
.ctl_name = NET_IPV6_ROUTE_GC_MIN_INTERVAL,
@@ -2564,6 +2587,12 @@ void __init ip6_route_init(void)
proc_net_fops_create(&init_net, "rt6_stats", S_IRUGO, &rt6_stats_seq_fops);
#endif
+
+ mem_reserve_init(&ipv6_route_reserve, "IPv6 route cache",
+ &net_rx_reserve);
+ mem_reserve_kmem_cache_set(&ipv6_route_reserve,
+ ip6_dst_ops.kmem_cachep, ip6_rt_max_size);
+
#ifdef CONFIG_XFRM
xfrm6_init();
#endif
Index: linux-2.6/include/linux/sysctl.h
===================================================================
--- linux-2.6.orig/include/linux/sysctl.h
+++ linux-2.6/include/linux/sysctl.h
@@ -966,6 +966,17 @@ typedef int proc_handler (struct ctl_tab
extern int proc_dostring(struct ctl_table *, int, struct file *,
void __user *, size_t *, loff_t *);
+
+extern int do_proc_dointvec_conv(int *negp, unsigned long *lvalp,
+ int *valp,
+ int write, void *data);
+
+extern int do_proc_dointvec(ctl_table *table, int write, struct file *filp,
+ void __user *buffer, size_t *lenp, loff_t *ppos,
+ int (*conv)(int *negp, unsigned long *lvalp, int *valp,
+ int write, void *data),
+ void *data);
+
extern int proc_dointvec(struct ctl_table *, int, struct file *,
void __user *, size_t *, loff_t *);
extern int proc_dointvec_bset(struct ctl_table *, int, struct file *,
Index: linux-2.6/kernel/sysctl.c
===================================================================
--- linux-2.6.orig/kernel/sysctl.c
+++ linux-2.6/kernel/sysctl.c
@@ -1702,7 +1702,7 @@ int proc_dostring(struct ctl_table *tabl
}
-static int do_proc_dointvec_conv(int *negp, unsigned long *lvalp,
+int do_proc_dointvec_conv(int *negp, unsigned long *lvalp,
int *valp,
int write, void *data)
{
@@ -1721,6 +1721,8 @@ static int do_proc_dointvec_conv(int *ne
return 0;
}
+EXPORT_SYMBOL(do_proc_dointvec_conv);
+
static int __do_proc_dointvec(void *tbl_data, struct ctl_table *table,
int write, struct file *filp, void __user *buffer,
size_t *lenp, loff_t *ppos,
@@ -1832,7 +1834,7 @@ static int __do_proc_dointvec(void *tbl_
#undef TMPBUFLEN
}
-static int do_proc_dointvec(struct ctl_table *table, int write, struct file *filp,
+int do_proc_dointvec(struct ctl_table *table, int write, struct file *filp,
void __user *buffer, size_t *lenp, loff_t *ppos,
int (*conv)(int *negp, unsigned long *lvalp, int *valp,
int write, void *data),
@@ -1842,6 +1844,8 @@ static int do_proc_dointvec(struct ctl_t
buffer, lenp, ppos, conv, data);
}
+EXPORT_SYMBOL(do_proc_dointvec);
+
/**
* proc_dointvec - read a vector of integers
* @table: the sysctl table
* [PATCH 19/33] netvm: hook skb allocation to reserves
2007-10-30 16:04 [PATCH 00/33] Swap over NFS -v14 Peter Zijlstra
` (17 preceding siblings ...)
2007-10-30 16:04 ` [PATCH 18/33] netvm: INET reserves Peter Zijlstra
@ 2007-10-30 16:04 ` Peter Zijlstra
2007-10-30 16:04 ` [PATCH 20/33] netvm: filter emergency skbs Peter Zijlstra
` (14 subsequent siblings)
33 siblings, 0 replies; 72+ messages in thread
From: Peter Zijlstra @ 2007-10-30 16:04 UTC (permalink / raw)
To: Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, netdev,
trond.myklebust
Cc: Peter Zijlstra
[-- Attachment #1: netvm-skbuff-reserve.patch --]
[-- Type: text/plain, Size: 15047 bytes --]
Change the skb allocation api to indicate RX usage and use this to fall back to
the reserve when needed. SKBs allocated from the reserve are tagged in
skb->emergency.
Teach all other skb ops about emergency skbs and the reserve accounting.
Use the (new) packet split API to allocate and track fragment pages from the
emergency reserve. Do this using an atomic counter, page->frag_count, which
shares a union with page->index. This is needed because the fragments have a
different sharing semantic from the one indicated by skb_shinfo()->dataref.
Note that the decision to distinguish between regular and emergency SKBs allows
the accounting overhead to be limited to the latter kind.
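To make the accounting flow concrete, a hypothetical emergency RX lifecycle
(helpers from this and the preceding patches):
/* driver refill under pressure: the allocation may dip into the
 * reserve, in which case the page carries page->reserve */
page = netdev_alloc_page(netdev);
/* attaching reconciles the page state with skb->emergency and
 * initialises page->frag_count for emergency skbs */
skb_add_rx_frag(skb, 0, page, 0, length);
/* clones/copies take references via skb_get_page(); the final
 * skb_put_page() returns PAGE_SIZE to the reserve */
kfree_skb(skb);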
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
include/linux/mm_types.h | 1
include/linux/skbuff.h | 25 +++++-
net/core/skbuff.c | 173 +++++++++++++++++++++++++++++++++++++++++------
3 files changed, 173 insertions(+), 26 deletions(-)
Index: linux-2.6/include/linux/skbuff.h
===================================================================
--- linux-2.6.orig/include/linux/skbuff.h
+++ linux-2.6/include/linux/skbuff.h
@@ -289,7 +289,8 @@ struct sk_buff {
__u8 pkt_type:3,
fclone:2,
ipvs_property:1,
- nf_trace:1;
+ nf_trace:1,
+ emergency:1;
__be16 protocol;
void (*destructor)(struct sk_buff *skb);
@@ -341,10 +342,22 @@ struct sk_buff {
#include <asm/system.h>
+#define SKB_ALLOC_FCLONE 0x01
+#define SKB_ALLOC_RX 0x02
+
+static inline bool skb_emergency(const struct sk_buff *skb)
+{
+#ifdef CONFIG_NETVM
+ return unlikely(skb->emergency);
+#else
+ return false;
+#endif
+}
+
extern void kfree_skb(struct sk_buff *skb);
extern void __kfree_skb(struct sk_buff *skb);
extern struct sk_buff *__alloc_skb(unsigned int size,
- gfp_t priority, int fclone, int node);
+ gfp_t priority, int flags, int node);
static inline struct sk_buff *alloc_skb(unsigned int size,
gfp_t priority)
{
@@ -354,7 +367,7 @@ static inline struct sk_buff *alloc_skb(
static inline struct sk_buff *alloc_skb_fclone(unsigned int size,
gfp_t priority)
{
- return __alloc_skb(size, priority, 1, -1);
+ return __alloc_skb(size, priority, SKB_ALLOC_FCLONE, -1);
}
extern void kfree_skbmem(struct sk_buff *skb);
@@ -1297,7 +1310,8 @@ static inline void __skb_queue_purge(str
static inline struct sk_buff *__dev_alloc_skb(unsigned int length,
gfp_t gfp_mask)
{
- struct sk_buff *skb = alloc_skb(length + NET_SKB_PAD, gfp_mask);
+ struct sk_buff *skb =
+ __alloc_skb(length + NET_SKB_PAD, gfp_mask, SKB_ALLOC_RX, -1);
if (likely(skb))
skb_reserve(skb, NET_SKB_PAD);
return skb;
@@ -1343,6 +1357,7 @@ static inline struct sk_buff *netdev_all
}
extern struct page *__netdev_alloc_page(struct net_device *dev, gfp_t gfp_mask);
+extern void __netdev_free_page(struct net_device *dev, struct page *page);
/**
* netdev_alloc_page - allocate a page for ps-rx on a specific device
@@ -1359,7 +1374,7 @@ static inline struct page *netdev_alloc_
static inline void netdev_free_page(struct net_device *dev, struct page *page)
{
- __free_page(page);
+ __netdev_free_page(dev, page);
}
/**
Index: linux-2.6/net/core/skbuff.c
===================================================================
--- linux-2.6.orig/net/core/skbuff.c
+++ linux-2.6/net/core/skbuff.c
@@ -179,21 +179,28 @@ EXPORT_SYMBOL(skb_truesize_bug);
* %GFP_ATOMIC.
*/
struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
- int fclone, int node)
+ int flags, int node)
{
struct kmem_cache *cache;
struct skb_shared_info *shinfo;
struct sk_buff *skb;
u8 *data;
+ int emergency = 0, memalloc = sk_memalloc_socks();
- cache = fclone ? skbuff_fclone_cache : skbuff_head_cache;
+ size = SKB_DATA_ALIGN(size);
+ cache = (flags & SKB_ALLOC_FCLONE)
+ ? skbuff_fclone_cache : skbuff_head_cache;
+#ifdef CONFIG_NETVM
+ if (memalloc && (flags & SKB_ALLOC_RX))
+ gfp_mask |= __GFP_NOMEMALLOC|__GFP_NOWARN;
+retry_alloc:
+#endif
/* Get the HEAD */
skb = kmem_cache_alloc_node(cache, gfp_mask & ~__GFP_DMA, node);
if (!skb)
- goto out;
+ goto noskb;
- size = SKB_DATA_ALIGN(size);
data = kmalloc_node_track_caller(size + sizeof(struct skb_shared_info),
gfp_mask, node);
if (!data)
@@ -203,6 +210,7 @@ struct sk_buff *__alloc_skb(unsigned int
* See comment in sk_buff definition, just before the 'tail' member
*/
memset(skb, 0, offsetof(struct sk_buff, tail));
+ skb->emergency = emergency;
skb->truesize = size + sizeof(struct sk_buff);
atomic_set(&skb->users, 1);
skb->head = data;
@@ -219,7 +227,7 @@ struct sk_buff *__alloc_skb(unsigned int
shinfo->ip6_frag_id = 0;
shinfo->frag_list = NULL;
- if (fclone) {
+ if (flags & SKB_ALLOC_FCLONE) {
struct sk_buff *child = skb + 1;
atomic_t *fclone_ref = (atomic_t *) (child + 1);
@@ -227,12 +235,31 @@ struct sk_buff *__alloc_skb(unsigned int
atomic_set(fclone_ref, 1);
child->fclone = SKB_FCLONE_UNAVAILABLE;
+ child->emergency = skb->emergency;
}
out:
return skb;
+
nodata:
kmem_cache_free(cache, skb);
skb = NULL;
+noskb:
+#ifdef CONFIG_NETVM
+ /* Attempt emergency allocation when RX skb. */
+ if (likely(!(flags & SKB_ALLOC_RX) || !memalloc))
+ goto out;
+
+ if (!emergency) {
+ if (rx_emergency_get(size)) {
+ gfp_mask &= ~(__GFP_NOMEMALLOC|__GFP_NOWARN);
+ gfp_mask |= __GFP_MEMALLOC;
+ emergency = 1;
+ goto retry_alloc;
+ }
+ } else
+ rx_emergency_put(size);
+#endif
+
goto out;
}
@@ -255,7 +282,7 @@ struct sk_buff *__netdev_alloc_skb(struc
int node = dev->dev.parent ? dev_to_node(dev->dev.parent) : -1;
struct sk_buff *skb;
- skb = __alloc_skb(length + NET_SKB_PAD, gfp_mask, 0, node);
+ skb = __alloc_skb(length + NET_SKB_PAD, gfp_mask, SKB_ALLOC_RX, node);
if (likely(skb)) {
skb_reserve(skb, NET_SKB_PAD);
skb->dev = dev;
@@ -268,10 +295,34 @@ struct page *__netdev_alloc_page(struct
int node = dev->dev.parent ? dev_to_node(dev->dev.parent) : -1;
struct page *page;
+#ifdef CONFIG_NETVM
+ gfp_mask |= __GFP_NOMEMALLOC | __GFP_NOWARN;
+#endif
+
page = alloc_pages_node(node, gfp_mask, 0);
+
+#ifdef CONFIG_NETVM
+ if (!page && rx_emergency_get(PAGE_SIZE)) {
+ gfp_mask &= ~(__GFP_NOMEMALLOC | __GFP_NOWARN);
+ gfp_mask |= __GFP_MEMALLOC;
+ page = alloc_pages_node(node, gfp_mask, 0);
+ if (!page)
+ rx_emergency_put(PAGE_SIZE);
+ }
+#endif
+
return page;
}
+void __netdev_free_page(struct net_device *dev, struct page *page)
+{
+#ifdef CONFIG_NETVM
+ if (unlikely(page->reserve))
+ rx_emergency_put(PAGE_SIZE);
+#endif
+ __free_page(page);
+}
+
void skb_add_rx_frag(struct sk_buff *skb, int i, struct page *page, int off,
int size)
{
@@ -279,6 +330,34 @@ void skb_add_rx_frag(struct sk_buff *skb
skb->len += size;
skb->data_len += size;
skb->truesize += size;
+
+#ifdef CONFIG_NETVM
+ /*
+ * Fix-up the emergency accounting; make sure all pages match
+ * skb->emergency.
+ *
+ * This relies on page->reserve to be preserved between
+ * the call to __netdev_alloc_page() and this call.
+ */
+ if (skb_emergency(skb)) {
+ /*
+ * If the page was not an emergency alloc (ALLOC_NO_WATERMARK)
+ * we can use overcommit accounting, since we already have the
+ * memory.
+ */
+ if (!page->reserve)
+ rx_emergency_get_overcommit(PAGE_SIZE);
+ atomic_set(&page->frag_count, 1);
+ } else if (unlikely(page->reserve)) {
+ /*
+ * Rare case; the skb wasn't allocated under pressure but
+ * the page was. We need to return the page. This can offset
+ * the accounting a little, but its a constant shift, it does
+ * not accumulate.
+ */
+ rx_emergency_put(PAGE_SIZE);
+ }
+#endif
}
static void skb_drop_list(struct sk_buff **listp)
@@ -307,21 +386,45 @@ static void skb_clone_fraglist(struct sk
skb_get(list);
}
+static inline void skb_get_page(struct sk_buff *skb, struct page *page)
+{
+ get_page(page);
+ if (skb_emergency(skb))
+ atomic_inc(&page->frag_count);
+}
+
+static inline void skb_put_page(struct sk_buff *skb, struct page *page)
+{
+ if (skb_emergency(skb) && atomic_dec_and_test(&page->frag_count))
+ rx_emergency_put(PAGE_SIZE);
+ put_page(page);
+}
+
static void skb_release_data(struct sk_buff *skb)
{
if (!skb->cloned ||
!atomic_sub_return(skb->nohdr ? (1 << SKB_DATAREF_SHIFT) + 1 : 1,
&skb_shinfo(skb)->dataref)) {
+ int size;
+
+#ifdef NET_SKBUFF_DATA_USES_OFFSET
+ size = skb->end;
+#else
+ size = skb->end - skb->head;
+#endif
+
if (skb_shinfo(skb)->nr_frags) {
int i;
for (i = 0; i < skb_shinfo(skb)->nr_frags; i++)
- put_page(skb_shinfo(skb)->frags[i].page);
+ skb_put_page(skb, skb_shinfo(skb)->frags[i].page);
}
if (skb_shinfo(skb)->frag_list)
skb_drop_fraglist(skb);
kfree(skb->head);
+ if (skb_emergency(skb))
+ rx_emergency_put(size);
}
}
@@ -440,6 +543,9 @@ struct sk_buff *skb_clone(struct sk_buff
n->fclone = SKB_FCLONE_CLONE;
atomic_inc(fclone_ref);
} else {
+ if (skb_emergency(skb))
+ gfp_mask |= __GFP_MEMALLOC;
+
n = kmem_cache_alloc(skbuff_head_cache, gfp_mask);
if (!n)
return NULL;
@@ -477,6 +583,7 @@ struct sk_buff *skb_clone(struct sk_buff
#if defined(CONFIG_IP_VS) || defined(CONFIG_IP_VS_MODULE)
C(ipvs_property);
#endif
+ C(emergency);
C(protocol);
n->destructor = NULL;
C(mark);
@@ -565,6 +672,14 @@ static void copy_skb_header(struct sk_bu
skb_shinfo(new)->gso_type = skb_shinfo(old)->gso_type;
}
+static inline int skb_alloc_rx_flag(const struct sk_buff *skb)
+{
+ if (skb_emergency(skb))
+ return SKB_ALLOC_RX;
+
+ return 0;
+}
+
/**
* skb_copy - create private copy of an sk_buff
* @skb: buffer to copy
@@ -585,15 +700,17 @@ static void copy_skb_header(struct sk_bu
struct sk_buff *skb_copy(const struct sk_buff *skb, gfp_t gfp_mask)
{
int headerlen = skb->data - skb->head;
+ int size;
/*
* Allocate the copy buffer
*/
struct sk_buff *n;
#ifdef NET_SKBUFF_DATA_USES_OFFSET
- n = alloc_skb(skb->end + skb->data_len, gfp_mask);
+ size = skb->end + skb->data_len;
#else
- n = alloc_skb(skb->end - skb->head + skb->data_len, gfp_mask);
+ size = skb->end - skb->head + skb->data_len;
#endif
+ n = __alloc_skb(size, gfp_mask, skb_alloc_rx_flag(skb), -1);
if (!n)
return NULL;
@@ -628,12 +745,14 @@ struct sk_buff *pskb_copy(struct sk_buff
/*
* Allocate the copy buffer
*/
+ int size;
struct sk_buff *n;
#ifdef NET_SKBUFF_DATA_USES_OFFSET
- n = alloc_skb(skb->end, gfp_mask);
+ size = skb->end;
#else
- n = alloc_skb(skb->end - skb->head, gfp_mask);
+ size = skb->end - skb->head;
#endif
+ n = __alloc_skb(size, gfp_mask, skb_alloc_rx_flag(skb), -1);
if (!n)
goto out;
@@ -652,8 +771,9 @@ struct sk_buff *pskb_copy(struct sk_buff
int i;
for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
- skb_shinfo(n)->frags[i] = skb_shinfo(skb)->frags[i];
- get_page(skb_shinfo(n)->frags[i].page);
+ skb_frag_t *frag = &skb_shinfo(skb)->frags[i];
+ skb_shinfo(n)->frags[i] = *frag;
+ skb_get_page(n, frag->page);
}
skb_shinfo(n)->nr_frags = i;
}
@@ -701,6 +821,14 @@ int pskb_expand_head(struct sk_buff *skb
size = SKB_DATA_ALIGN(size);
+ if (skb_emergency(skb)) {
+ if (rx_emergency_get(size))
+ gfp_mask |= __GFP_MEMALLOC;
+ else
+ goto nodata;
+ } else
+ gfp_mask |= __GFP_NOMEMALLOC;
+
data = kmalloc(size + sizeof(struct skb_shared_info), gfp_mask);
if (!data)
goto nodata;
@@ -716,7 +844,7 @@ int pskb_expand_head(struct sk_buff *skb
sizeof(struct skb_shared_info));
for (i = 0; i < skb_shinfo(skb)->nr_frags; i++)
- get_page(skb_shinfo(skb)->frags[i].page);
+ skb_get_page(skb, skb_shinfo(skb)->frags[i].page);
if (skb_shinfo(skb)->frag_list)
skb_clone_fraglist(skb);
@@ -795,8 +923,8 @@ struct sk_buff *skb_copy_expand(const st
/*
* Allocate the copy buffer
*/
- struct sk_buff *n = alloc_skb(newheadroom + skb->len + newtailroom,
- gfp_mask);
+ struct sk_buff *n = __alloc_skb(newheadroom + skb->len + newtailroom,
+ gfp_mask, skb_alloc_rx_flag(skb), -1);
int oldheadroom = skb_headroom(skb);
int head_copy_len, head_copy_off;
int off;
@@ -913,7 +1041,7 @@ drop_pages:
skb_shinfo(skb)->nr_frags = i;
for (; i < nfrags; i++)
- put_page(skb_shinfo(skb)->frags[i].page);
+ skb_put_page(skb, skb_shinfo(skb)->frags[i].page);
if (skb_shinfo(skb)->frag_list)
skb_drop_fraglist(skb);
@@ -1082,7 +1210,7 @@ pull_pages:
k = 0;
for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
if (skb_shinfo(skb)->frags[i].size <= eat) {
- put_page(skb_shinfo(skb)->frags[i].page);
+ skb_put_page(skb, skb_shinfo(skb)->frags[i].page);
eat -= skb_shinfo(skb)->frags[i].size;
} else {
skb_shinfo(skb)->frags[k] = skb_shinfo(skb)->frags[i];
@@ -1854,6 +1982,7 @@ static inline void skb_split_no_header(s
skb_shinfo(skb1)->frags[k] = skb_shinfo(skb)->frags[i];
if (pos < len) {
+ struct page *page = skb_shinfo(skb)->frags[i].page;
/* Split frag.
* We have two variants in this case:
* 1. Move all the frag to the second
@@ -1862,7 +1991,7 @@ static inline void skb_split_no_header(s
* where splitting is expensive.
* 2. Split is accurately. We make this.
*/
- get_page(skb_shinfo(skb)->frags[i].page);
+ skb_get_page(skb1, page);
skb_shinfo(skb1)->frags[0].page_offset += len - pos;
skb_shinfo(skb1)->frags[0].size -= len - pos;
skb_shinfo(skb)->frags[i].size = len - pos;
@@ -2193,7 +2322,8 @@ struct sk_buff *skb_segment(struct sk_bu
if (hsize > len || !sg)
hsize = len;
- nskb = alloc_skb(hsize + doffset + headroom, GFP_ATOMIC);
+ nskb = __alloc_skb(hsize + doffset + headroom, GFP_ATOMIC,
+ skb_alloc_rx_flag(skb), -1);
if (unlikely(!nskb))
goto err;
@@ -2238,7 +2368,7 @@ struct sk_buff *skb_segment(struct sk_bu
BUG_ON(i >= nfrags);
*frag = skb_shinfo(skb)->frags[i];
- get_page(frag->page);
+ skb_get_page(nskb, frag->page);
size = frag->size;
if (pos < offset) {
@@ -2483,6 +2613,7 @@ EXPORT_SYMBOL(__pskb_pull_tail);
EXPORT_SYMBOL(__alloc_skb);
EXPORT_SYMBOL(__netdev_alloc_skb);
EXPORT_SYMBOL(__netdev_alloc_page);
+EXPORT_SYMBOL(__netdev_free_page);
EXPORT_SYMBOL(skb_add_rx_frag);
EXPORT_SYMBOL(pskb_copy);
EXPORT_SYMBOL(pskb_expand_head);
Index: linux-2.6/include/linux/mm_types.h
===================================================================
--- linux-2.6.orig/include/linux/mm_types.h
+++ linux-2.6/include/linux/mm_types.h
@@ -71,6 +71,7 @@ struct page {
pgoff_t index; /* Our offset within mapping. */
void *freelist; /* SLUB: freelist req. slab lock */
int reserve; /* page_alloc: page is a reserve page */
+ atomic_t frag_count; /* skb fragment use count */
};
struct list_head lru; /* Pageout list, eg. active_list
* protected by zone->lru_lock !
* [PATCH 20/33] netvm: filter emergency skbs.
2007-10-30 16:04 [PATCH 00/33] Swap over NFS -v14 Peter Zijlstra
` (18 preceding siblings ...)
2007-10-30 16:04 ` [PATCH 19/33] netvm: hook skb allocation to reserves Peter Zijlstra
@ 2007-10-30 16:04 ` Peter Zijlstra
2007-10-30 16:04 ` [PATCH 21/33] netvm: prevent a TCP specific deadlock Peter Zijlstra
` (13 subsequent siblings)
33 siblings, 0 replies; 72+ messages in thread
From: Peter Zijlstra @ 2007-10-30 16:04 UTC (permalink / raw)
To: Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, netdev,
trond.myklebust
Cc: Peter Zijlstra
[-- Attachment #1: netvm-sk_filter.patch --]
[-- Type: text/plain, Size: 990 bytes --]
Toss all emergency packets not for a SOCK_MEMALLOC socket. This ensures our
precious memory reserve doesn't get stuck waiting for user-space.
The correctness of this approach relies on the fact that networks must be
assumed lossy.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
include/net/sock.h | 3 +++
1 file changed, 3 insertions(+)
Index: linux-2.6/include/net/sock.h
===================================================================
--- linux-2.6.orig/include/net/sock.h
+++ linux-2.6/include/net/sock.h
@@ -930,6 +930,9 @@ static inline int sk_filter(struct sock
{
int err;
struct sk_filter *filter;
+
+ if (skb_emergency(skb) && !sk_has_memalloc(sk))
+ return -ENOMEM;
err = security_sock_rcv_skb(sk, skb);
if (err)
* [PATCH 21/33] netvm: prevent a TCP specific deadlock
2007-10-30 16:04 [PATCH 00/33] Swap over NFS -v14 Peter Zijlstra
` (19 preceding siblings ...)
2007-10-30 16:04 ` [PATCH 20/33] netvm: filter emergency skbs Peter Zijlstra
@ 2007-10-30 16:04 ` Peter Zijlstra
2007-10-30 16:04 ` [PATCH 22/33] netfilter: NF_QUEUE vs emergency skbs Peter Zijlstra
` (12 subsequent siblings)
33 siblings, 0 replies; 72+ messages in thread
From: Peter Zijlstra @ 2007-10-30 16:04 UTC (permalink / raw)
To: Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, netdev,
trond.myklebust
Cc: Peter Zijlstra
[-- Attachment #1: netvm-tcp-deadlock.patch --]
[-- Type: text/plain, Size: 2654 bytes --]
It can happen that all !SOCK_MEMALLOC sockets have buffered so much data
that we are over the global rmem limit. This prevents SOCK_MEMALLOC sockets
from receiving data, which in turn prevents userspace from running, and
userspace must run in order to drain the buffered data.
Fix this by exempting the SOCK_MEMALLOC sockets from the rmem limit.
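For clarity, a condensed sketch of the failure sequence being fixed
(hypothetical numbering, not from the patch):

    /*
     * 1. regular (!SOCK_MEMALLOC) sockets buffer received data until
     *    memory_allocated exceeds sysctl_mem[2], the hard limit
     * 2. an emergency skb - a writeback completion - arrives for a
     *    SOCK_MEMALLOC socket
     * 3. sk_stream_mem_schedule() fails the hard limit check and the
     *    completion is suppressed
     * 4. reclaim stalls, userspace never runs, the regular sockets are
     *    never drained: deadlock
     *
     * After this patch, step 3 admits the skb when skb_emergency()
     * holds, bounding the rmem overshoot by the emergency reserve.
     */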
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
include/net/sock.h | 7 ++++---
net/core/stream.c | 5 +++--
2 files changed, 7 insertions(+), 5 deletions(-)
Index: linux-2.6/include/net/sock.h
===================================================================
--- linux-2.6.orig/include/net/sock.h
+++ linux-2.6/include/net/sock.h
@@ -743,7 +743,8 @@ static inline struct inode *SOCK_INODE(s
}
extern void __sk_stream_mem_reclaim(struct sock *sk);
-extern int sk_stream_mem_schedule(struct sock *sk, int size, int kind);
+extern int sk_stream_mem_schedule(struct sock *sk, struct sk_buff *skb,
+ int size, int kind);
#define SK_STREAM_MEM_QUANTUM ((int)PAGE_SIZE)
@@ -761,13 +762,13 @@ static inline void sk_stream_mem_reclaim
static inline int sk_stream_rmem_schedule(struct sock *sk, struct sk_buff *skb)
{
return (int)skb->truesize <= sk->sk_forward_alloc ||
- sk_stream_mem_schedule(sk, skb->truesize, 1);
+ sk_stream_mem_schedule(sk, skb, skb->truesize, 1);
}
static inline int sk_stream_wmem_schedule(struct sock *sk, int size)
{
return size <= sk->sk_forward_alloc ||
- sk_stream_mem_schedule(sk, size, 0);
+ sk_stream_mem_schedule(sk, NULL, size, 0);
}
/* Used by processes to "lock" a socket state, so that
Index: linux-2.6/net/core/stream.c
===================================================================
--- linux-2.6.orig/net/core/stream.c
+++ linux-2.6/net/core/stream.c
@@ -207,7 +207,7 @@ void __sk_stream_mem_reclaim(struct sock
EXPORT_SYMBOL(__sk_stream_mem_reclaim);
-int sk_stream_mem_schedule(struct sock *sk, int size, int kind)
+int sk_stream_mem_schedule(struct sock *sk, struct sk_buff *skb, int size, int kind)
{
int amt = sk_stream_pages(size);
@@ -224,7 +224,8 @@ int sk_stream_mem_schedule(struct sock *
/* Over hard limit. */
if (atomic_read(sk->sk_prot->memory_allocated) > sk->sk_prot->sysctl_mem[2]) {
sk->sk_prot->enter_memory_pressure();
- goto suppress_allocation;
+ if (!skb || !skb_emergency(skb))
+ goto suppress_allocation;
}
/* Under pressure. */
--
* [PATCH 22/33] netfilter: NF_QUEUE vs emergency skbs
2007-10-30 16:04 [PATCH 00/33] Swap over NFS -v14 Peter Zijlstra
` (20 preceding siblings ...)
2007-10-30 16:04 ` [PATCH 21/33] netvm: prevent a TCP specific deadlock Peter Zijlstra
@ 2007-10-30 16:04 ` Peter Zijlstra
2007-10-30 16:04 ` [PATCH 23/33] netvm: skb processing Peter Zijlstra
` (11 subsequent siblings)
33 siblings, 0 replies; 72+ messages in thread
From: Peter Zijlstra @ 2007-10-30 16:04 UTC (permalink / raw)
To: Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, netdev,
trond.myklebust
Cc: Peter Zijlstra
[-- Attachment #1: emergency-nf_queue.patch --]
[-- Type: text/plain, Size: 1088 bytes --]
To avoid memory getting stuck waiting for userspace, drop all emergency packets that draw an NF_QUEUE verdict.
This of course requires the regular storage route to not include an NF_QUEUE
target ;-)
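Concretely (an illustrative configuration, not taken from the patch):

    /*
     * Hypothetical example: with a userspace queueing rule on the
     * swap-over-NFS path, say
     *
     *     iptables -A INPUT -p tcp --sport 2049 -j QUEUE
     *
     * emergency packets would previously sit in the queue waiting for
     * a daemon that may itself be blocked on memory; with this patch
     * they are dropped instead, which the lossy-network assumption
     * makes safe.
     */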
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
net/netfilter/core.c | 3 +++
1 file changed, 3 insertions(+)
Index: linux-2.6/net/netfilter/core.c
===================================================================
--- linux-2.6.orig/net/netfilter/core.c
+++ linux-2.6/net/netfilter/core.c
@@ -181,9 +181,12 @@ next_hook:
ret = 1;
goto unlock;
} else if (verdict == NF_DROP) {
+drop:
kfree_skb(*pskb);
ret = -EPERM;
} else if ((verdict & NF_VERDICT_MASK) == NF_QUEUE) {
+ if (skb_emergency(*pskb))
+ goto drop;
NFDEBUG("nf_hook: Verdict = QUEUE.\n");
if (!nf_queue(*pskb, elem, pf, hook, indev, outdev, okfn,
verdict >> NF_VERDICT_BITS))
--
* [PATCH 23/33] netvm: skb processing
2007-10-30 16:04 [PATCH 00/33] Swap over NFS -v14 Peter Zijlstra
` (21 preceding siblings ...)
2007-10-30 16:04 ` [PATCH 22/33] netfilter: NF_QUEUE vs emergency skbs Peter Zijlstra
@ 2007-10-30 16:04 ` Peter Zijlstra
2007-10-30 21:26 ` Stephen Hemminger
2007-10-30 21:26 ` Stephen Hemminger
2007-10-30 16:04 ` [PATCH 24/33] mm: prepare swap entry methods for use in page methods Peter Zijlstra
` (10 subsequent siblings)
33 siblings, 2 replies; 72+ messages in thread
From: Peter Zijlstra @ 2007-10-30 16:04 UTC (permalink / raw)
To: Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, netdev,
trond.myklebust
Cc: Peter Zijlstra
[-- Attachment #1: netvm.patch --]
[-- Type: text/plain, Size: 4745 bytes --]
In order to make sure emergency packets receive all the memory needed to
proceed, ensure that processing of emergency SKBs happens under PF_MEMALLOC.
Use the (new) sk_backlog_rcv() wrapper to ensure this for backlog processing.
Skip taps, since those are user-space again.
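The recurring pattern, isolated as a sketch (with_memalloc() is a
hypothetical helper for illustration; tsk_restore_flags() is provided
earlier in this series):

    static int with_memalloc(int (*rcv)(struct sock *, struct sk_buff *),
                             struct sock *sk, struct sk_buff *skb)
    {
        unsigned long pflags = current->flags;
        int ret;

        current->flags |= PF_MEMALLOC;  /* unlock the reserves */
        ret = rcv(sk, skb);
        /* restore only PF_MEMALLOC to its previous state */
        tsk_restore_flags(current, pflags, PF_MEMALLOC);

        return ret;
    }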
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
include/net/sock.h | 5 +++++
net/core/dev.c | 44 ++++++++++++++++++++++++++++++++++++++------
net/core/sock.c | 18 ++++++++++++++++++
3 files changed, 61 insertions(+), 6 deletions(-)
Index: linux-2.6/net/core/dev.c
===================================================================
--- linux-2.6.orig/net/core/dev.c
+++ linux-2.6/net/core/dev.c
@@ -1976,10 +1976,23 @@ int netif_receive_skb(struct sk_buff *sk
struct net_device *orig_dev;
int ret = NET_RX_DROP;
__be16 type;
+ unsigned long pflags = current->flags;
+
+ /* Emergency skbs are special, they should
+ * - be delivered to SOCK_MEMALLOC sockets only
+ * - stay away from userspace
+ * - have bounded memory usage
+ *
+ * Use PF_MEMALLOC as a poor man's memory pool - the grouping kind.
+ * This saves us from propagating the allocation context down to all
+ * allocation sites.
+ */
+ if (skb_emergency(skb))
+ current->flags |= PF_MEMALLOC;
/* if we've gotten here through NAPI, check netpoll */
if (netpoll_receive_skb(skb))
- return NET_RX_DROP;
+ goto out;
if (!skb->tstamp.tv64)
net_timestamp(skb);
@@ -1990,7 +2003,7 @@ int netif_receive_skb(struct sk_buff *sk
orig_dev = skb_bond(skb);
if (!orig_dev)
- return NET_RX_DROP;
+ goto out;
__get_cpu_var(netdev_rx_stat).total++;
@@ -2009,6 +2022,9 @@ int netif_receive_skb(struct sk_buff *sk
}
#endif
+ if (skb_emergency(skb))
+ goto skip_taps;
+
list_for_each_entry_rcu(ptype, &ptype_all, list) {
if (!ptype->dev || ptype->dev == skb->dev) {
if (pt_prev)
@@ -2017,6 +2033,7 @@ int netif_receive_skb(struct sk_buff *sk
}
}
+skip_taps:
#ifdef CONFIG_NET_CLS_ACT
if (pt_prev) {
ret = deliver_skb(skb, pt_prev, orig_dev);
@@ -2029,19 +2046,31 @@ int netif_receive_skb(struct sk_buff *sk
if (ret == TC_ACT_SHOT || (ret == TC_ACT_STOLEN)) {
kfree_skb(skb);
- goto out;
+ goto unlock;
}
skb->tc_verd = 0;
ncls:
#endif
+ if (skb_emergency(skb))
+ switch(skb->protocol) {
+ case __constant_htons(ETH_P_ARP):
+ case __constant_htons(ETH_P_IP):
+ case __constant_htons(ETH_P_IPV6):
+ case __constant_htons(ETH_P_8021Q):
+ break;
+
+ default:
+ goto drop;
+ }
+
skb = handle_bridge(skb, &pt_prev, &ret, orig_dev);
if (!skb)
- goto out;
+ goto unlock;
skb = handle_macvlan(skb, &pt_prev, &ret, orig_dev);
if (!skb)
- goto out;
+ goto unlock;
type = skb->protocol;
list_for_each_entry_rcu(ptype, &ptype_base[ntohs(type)&15], list) {
@@ -2056,6 +2085,7 @@ ncls:
if (pt_prev) {
ret = pt_prev->func(skb, skb->dev, pt_prev, orig_dev);
} else {
+drop:
kfree_skb(skb);
/* Jamal, now you will not able to escape explaining
* me how you were going to use this. :-)
@@ -2063,8 +2093,10 @@ ncls:
ret = NET_RX_DROP;
}
-out:
+unlock:
rcu_read_unlock();
+out:
+ tsk_restore_flags(current, pflags, PF_MEMALLOC);
return ret;
}
Index: linux-2.6/include/net/sock.h
===================================================================
--- linux-2.6.orig/include/net/sock.h
+++ linux-2.6/include/net/sock.h
@@ -523,8 +523,13 @@ static inline void sk_add_backlog(struct
skb->next = NULL;
}
+extern int __sk_backlog_rcv(struct sock *sk, struct sk_buff *skb);
+
static inline int sk_backlog_rcv(struct sock *sk, struct sk_buff *skb)
{
+ if (skb_emergency(skb))
+ return __sk_backlog_rcv(sk, skb);
+
return sk->sk_backlog_rcv(sk, skb);
}
Index: linux-2.6/net/core/sock.c
===================================================================
--- linux-2.6.orig/net/core/sock.c
+++ linux-2.6/net/core/sock.c
@@ -319,6 +319,24 @@ int sk_clear_memalloc(struct sock *sk)
}
EXPORT_SYMBOL_GPL(sk_clear_memalloc);
+#ifdef CONFIG_NETVM
+int __sk_backlog_rcv(struct sock *sk, struct sk_buff *skb)
+{
+ int ret;
+ unsigned long pflags = current->flags;
+
+ /* these should have been dropped before queueing */
+ BUG_ON(!sk_has_memalloc(sk));
+
+ current->flags |= PF_MEMALLOC;
+ ret = sk->sk_backlog_rcv(sk, skb);
+ tsk_restore_flags(current, pflags, PF_MEMALLOC);
+
+ return ret;
+}
+EXPORT_SYMBOL(__sk_backlog_rcv);
+#endif
+
static int sock_set_timeout(long *timeo_p, char __user *optval, int optlen)
{
struct timeval tv;
--
* [PATCH 24/33] mm: prepare swap entry methods for use in page methods
2007-10-30 16:04 [PATCH 00/33] Swap over NFS -v14 Peter Zijlstra
` (22 preceding siblings ...)
2007-10-30 16:04 ` [PATCH 23/33] netvm: skb processing Peter Zijlstra
@ 2007-10-30 16:04 ` Peter Zijlstra
2007-10-30 16:04 ` [PATCH 25/33] mm: add support for non block device backed swap files Peter Zijlstra
` (9 subsequent siblings)
33 siblings, 0 replies; 72+ messages in thread
From: Peter Zijlstra @ 2007-10-30 16:04 UTC (permalink / raw)
To: Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, netdev,
trond.myklebust
Cc: Peter Zijlstra
[-- Attachment #1: mm-swap_entry_methods.patch --]
[-- Type: text/plain, Size: 5407 bytes --]
Move around the swap entry methods in preparation for use from
page methods.
Also provide a function to obtain the swap_info_struct backing
a swap cache page.
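A sketch of the intended use (illustration only; patch 25 does exactly this
to reach the swapfile's address_space):

    /* Given a PG_swapcache page, find the file and the mapping that
     * back it. */
    struct swap_info_struct *sis = page_swap_info(page);
    struct address_space *mapping = sis->swap_file->f_mapping;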
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
include/linux/mm.h | 8 ++++++++
include/linux/swap.h | 48 ++++++++++++++++++++++++++++++++++++++++++++++++
include/linux/swapops.h | 44 --------------------------------------------
mm/swapfile.c | 1 +
4 files changed, 57 insertions(+), 44 deletions(-)
Index: linux-2.6/include/linux/mm.h
===================================================================
--- linux-2.6.orig/include/linux/mm.h
+++ linux-2.6/include/linux/mm.h
@@ -12,6 +12,7 @@
#include <linux/prio_tree.h>
#include <linux/debug_locks.h>
#include <linux/mm_types.h>
+#include <linux/swap.h>
struct mempolicy;
struct anon_vma;
@@ -573,6 +574,13 @@ static inline struct address_space *page
return mapping;
}
+static inline struct swap_info_struct *page_swap_info(struct page *page)
+{
+ swp_entry_t swap = { .val = page_private(page) };
+ BUG_ON(!PageSwapCache(page));
+ return get_swap_info_struct(swp_type(swap));
+}
+
static inline int PageAnon(struct page *page)
{
return ((unsigned long)page->mapping & PAGE_MAPPING_ANON) != 0;
Index: linux-2.6/include/linux/swap.h
===================================================================
--- linux-2.6.orig/include/linux/swap.h
+++ linux-2.6/include/linux/swap.h
@@ -80,6 +80,50 @@ typedef struct {
} swp_entry_t;
/*
+ * swapcache pages are stored in the swapper_space radix tree. We want to
+ * get good packing density in that tree, so the index should be dense in
+ * the low-order bits.
+ *
+ * We arrange the `type' and `offset' fields so that `type' is at the five
+ * high-order bits of the swp_entry_t and `offset' is right-aligned in the
+ * remaining bits.
+ *
+ * swp_entry_t's are *never* stored anywhere in their arch-dependent format.
+ */
+#define SWP_TYPE_SHIFT(e) (sizeof(e.val) * 8 - MAX_SWAPFILES_SHIFT)
+#define SWP_OFFSET_MASK(e) ((1UL << SWP_TYPE_SHIFT(e)) - 1)
+
+/*
+ * Store a type+offset into a swp_entry_t in an arch-independent format
+ */
+static inline swp_entry_t swp_entry(unsigned long type, pgoff_t offset)
+{
+ swp_entry_t ret;
+
+ ret.val = (type << SWP_TYPE_SHIFT(ret)) |
+ (offset & SWP_OFFSET_MASK(ret));
+ return ret;
+}
+
+/*
+ * Extract the `type' field from a swp_entry_t. The swp_entry_t is in
+ * arch-independent format
+ */
+static inline unsigned swp_type(swp_entry_t entry)
+{
+ return (entry.val >> SWP_TYPE_SHIFT(entry));
+}
+
+/*
+ * Extract the `offset' field from a swp_entry_t. The swp_entry_t is in
+ * arch-independent format
+ */
+static inline pgoff_t swp_offset(swp_entry_t entry)
+{
+ return entry.val & SWP_OFFSET_MASK(entry);
+}
+
+/*
* current->reclaim_state points to one of these when a task is running
* memory reclaim
*/
@@ -326,6 +370,10 @@ static inline int valid_swaphandles(swp_
return 0;
}
+static inline struct swap_info_struct *get_swap_info_struct(unsigned type)
+{
+ return NULL;
+}
#define can_share_swap_page(p) (page_mapcount(p) == 1)
static inline int move_to_swap_cache(struct page *page, swp_entry_t entry)
Index: linux-2.6/include/linux/swapops.h
===================================================================
--- linux-2.6.orig/include/linux/swapops.h
+++ linux-2.6/include/linux/swapops.h
@@ -1,48 +1,4 @@
/*
- * swapcache pages are stored in the swapper_space radix tree. We want to
- * get good packing density in that tree, so the index should be dense in
- * the low-order bits.
- *
- * We arrange the `type' and `offset' fields so that `type' is at the five
- * high-order bits of the swp_entry_t and `offset' is right-aligned in the
- * remaining bits.
- *
- * swp_entry_t's are *never* stored anywhere in their arch-dependent format.
- */
-#define SWP_TYPE_SHIFT(e) (sizeof(e.val) * 8 - MAX_SWAPFILES_SHIFT)
-#define SWP_OFFSET_MASK(e) ((1UL << SWP_TYPE_SHIFT(e)) - 1)
-
-/*
- * Store a type+offset into a swp_entry_t in an arch-independent format
- */
-static inline swp_entry_t swp_entry(unsigned long type, pgoff_t offset)
-{
- swp_entry_t ret;
-
- ret.val = (type << SWP_TYPE_SHIFT(ret)) |
- (offset & SWP_OFFSET_MASK(ret));
- return ret;
-}
-
-/*
- * Extract the `type' field from a swp_entry_t. The swp_entry_t is in
- * arch-independent format
- */
-static inline unsigned swp_type(swp_entry_t entry)
-{
- return (entry.val >> SWP_TYPE_SHIFT(entry));
-}
-
-/*
- * Extract the `offset' field from a swp_entry_t. The swp_entry_t is in
- * arch-independent format
- */
-static inline pgoff_t swp_offset(swp_entry_t entry)
-{
- return entry.val & SWP_OFFSET_MASK(entry);
-}
-
-/*
* Convert the arch-dependent pte representation of a swp_entry_t into an
* arch-independent swp_entry_t.
*/
Index: linux-2.6/mm/swapfile.c
===================================================================
--- linux-2.6.orig/mm/swapfile.c
+++ linux-2.6/mm/swapfile.c
@@ -1768,6 +1768,7 @@ get_swap_info_struct(unsigned type)
{
return &swap_info[type];
}
+EXPORT_SYMBOL_GPL(get_swap_info_struct);
/*
* swap_lock prevents swap_map being freed. Don't grab an extra
--
* [PATCH 25/33] mm: add support for non block device backed swap files
2007-10-30 16:04 [PATCH 00/33] Swap over NFS -v14 Peter Zijlstra
` (23 preceding siblings ...)
2007-10-30 16:04 ` [PATCH 24/33] mm: prepare swap entry methods for use in page methods Peter Zijlstra
@ 2007-10-30 16:04 ` Peter Zijlstra
2007-10-30 16:04 ` [PATCH 26/33] mm: methods for teaching filesystems about PG_swapcache pages Peter Zijlstra
` (8 subsequent siblings)
33 siblings, 0 replies; 72+ messages in thread
From: Peter Zijlstra @ 2007-10-30 16:04 UTC (permalink / raw)
To: Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, netdev,
trond.myklebust
Cc: Peter Zijlstra
[-- Attachment #1: mm-swapfile.patch --]
[-- Type: text/plain, Size: 8897 bytes --]
A new address_space_operations method is added:
int swapfile(struct address_space *, int)
When, during sys_swapon(), this method is found and returns no error,
swapper_space.a_ops will proxy to sis->swap_file->f_mapping->a_ops.
The swapfile method will be used to communicate to the address_space that
the VM relies on it, and that the address_space should take adequate
measures (like reserving memory for mempools).
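A minimal sketch of a filesystem implementation (foo_swapfile() and its
helpers are hypothetical; the real NFS implementation appears in patch 31):

    static int foo_swapfile(struct address_space *mapping, int enable)
    {
        if (enable) {
            /* hypothetical: pin whatever memory writeout will need */
            return foo_reserve_swap_memory(mapping->host);
        }
        /* swapoff: release the reservation again */
        foo_release_swap_memory(mapping->host);
        return 0;
    }

    static const struct address_space_operations foo_aops = {
        .writepage  = foo_writepage,   /* hypothetical */
        .readpage   = foo_readpage,    /* hypothetical */
        .swapfile   = foo_swapfile,
    };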
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
Documentation/filesystems/Locking | 9 +++++
include/linux/buffer_head.h | 2 -
include/linux/fs.h | 1
include/linux/swap.h | 3 +
mm/Kconfig | 3 +
mm/page_io.c | 58 ++++++++++++++++++++++++++++++++++++++
mm/swap_state.c | 5 +++
mm/swapfile.c | 22 +++++++++++++-
8 files changed, 101 insertions(+), 2 deletions(-)
Index: linux-2.6/include/linux/swap.h
===================================================================
--- linux-2.6.orig/include/linux/swap.h
+++ linux-2.6/include/linux/swap.h
@@ -164,6 +164,7 @@ enum {
SWP_USED = (1 << 0), /* is slot in swap_info[] used? */
SWP_WRITEOK = (1 << 1), /* ok to write to this swap? */
SWP_ACTIVE = (SWP_USED | SWP_WRITEOK),
+ SWP_FILE = (1 << 2), /* file swap area */
/* add others here before... */
SWP_SCANNING = (1 << 8), /* refcount in scan_swap_map */
};
@@ -264,6 +265,8 @@ extern void swap_unplug_io_fn(struct bac
/* linux/mm/page_io.c */
extern int swap_readpage(struct file *, struct page *);
extern int swap_writepage(struct page *page, struct writeback_control *wbc);
+extern void swap_sync_page(struct page *page);
+extern int swap_set_page_dirty(struct page *page);
extern void end_swap_bio_read(struct bio *bio, int err);
/* linux/mm/swap_state.c */
Index: linux-2.6/mm/page_io.c
===================================================================
--- linux-2.6.orig/mm/page_io.c
+++ linux-2.6/mm/page_io.c
@@ -17,6 +17,7 @@
#include <linux/bio.h>
#include <linux/swapops.h>
#include <linux/writeback.h>
+#include <linux/buffer_head.h>
#include <asm/pgtable.h>
static struct bio *get_swap_bio(gfp_t gfp_flags, pgoff_t index,
@@ -102,6 +103,18 @@ int swap_writepage(struct page *page, st
unlock_page(page);
goto out;
}
+#ifdef CONFIG_SWAP_FILE
+ {
+ struct swap_info_struct *sis = page_swap_info(page);
+ if (sis->flags & SWP_FILE) {
+ ret = sis->swap_file->f_mapping->
+ a_ops->writepage(page, wbc);
+ if (!ret)
+ count_vm_event(PSWPOUT);
+ return ret;
+ }
+ }
+#endif
bio = get_swap_bio(GFP_NOIO, page_private(page), page,
end_swap_bio_write);
if (bio == NULL) {
@@ -120,6 +133,39 @@ out:
return ret;
}
+#ifdef CONFIG_SWAP_FILE
+void swap_sync_page(struct page *page)
+{
+ struct swap_info_struct *sis = page_swap_info(page);
+
+ if (sis->flags & SWP_FILE) {
+ const struct address_space_operations * a_ops =
+ sis->swap_file->f_mapping->a_ops;
+ if (a_ops->sync_page)
+ a_ops->sync_page(page);
+ } else
+ block_sync_page(page);
+}
+
+int swap_set_page_dirty(struct page *page)
+{
+ struct swap_info_struct *sis = page_swap_info(page);
+
+ if (sis->flags & SWP_FILE) {
+ const struct address_space_operations * a_ops =
+ sis->swap_file->f_mapping->a_ops;
+ int (*spd)(struct page *) = a_ops->set_page_dirty;
+#ifdef CONFIG_BLOCK
+ if (!spd)
+ spd = __set_page_dirty_buffers;
+#endif
+ return (*spd)(page);
+ }
+
+ return __set_page_dirty_nobuffers(page);
+}
+#endif
+
int swap_readpage(struct file *file, struct page *page)
{
struct bio *bio;
@@ -127,6 +173,18 @@ int swap_readpage(struct file *file, str
BUG_ON(!PageLocked(page));
ClearPageUptodate(page);
+#ifdef CONFIG_SWAP_FILE
+ {
+ struct swap_info_struct *sis = page_swap_info(page);
+ if (sis->flags & SWP_FILE) {
+ ret = sis->swap_file->f_mapping->
+ a_ops->readpage(sis->swap_file, page);
+ if (!ret)
+ count_vm_event(PSWPIN);
+ return ret;
+ }
+ }
+#endif
bio = get_swap_bio(GFP_KERNEL, page_private(page), page,
end_swap_bio_read);
if (bio == NULL) {
Index: linux-2.6/mm/swap_state.c
===================================================================
--- linux-2.6.orig/mm/swap_state.c
+++ linux-2.6/mm/swap_state.c
@@ -27,8 +27,13 @@
*/
static const struct address_space_operations swap_aops = {
.writepage = swap_writepage,
+#ifdef CONFIG_SWAP_FILE
+ .sync_page = swap_sync_page,
+ .set_page_dirty = swap_set_page_dirty,
+#else
.sync_page = block_sync_page,
.set_page_dirty = __set_page_dirty_nobuffers,
+#endif
.migratepage = migrate_page,
};
Index: linux-2.6/mm/swapfile.c
===================================================================
--- linux-2.6.orig/mm/swapfile.c
+++ linux-2.6/mm/swapfile.c
@@ -988,6 +988,13 @@ static void destroy_swap_extents(struct
list_del(&se->list);
kfree(se);
}
+#ifdef CONFIG_SWAP_FILE
+ if (sis->flags & SWP_FILE) {
+ sis->flags &= ~SWP_FILE;
+ sis->swap_file->f_mapping->a_ops->
+ swapfile(sis->swap_file->f_mapping, 0);
+ }
+#endif
}
/*
@@ -1080,6 +1087,19 @@ static int setup_swap_extents(struct swa
goto done;
}
+#ifdef CONFIG_SWAP_FILE
+ if (sis->swap_file->f_mapping->a_ops->swapfile) {
+ ret = sis->swap_file->f_mapping->a_ops->
+ swapfile(sis->swap_file->f_mapping, 1);
+ if (!ret) {
+ sis->flags |= SWP_FILE;
+ ret = add_swap_extent(sis, 0, sis->max, 0);
+ *span = sis->pages;
+ }
+ goto done;
+ }
+#endif
+
blkbits = inode->i_blkbits;
blocks_per_page = PAGE_SIZE >> blkbits;
@@ -1644,7 +1664,7 @@ asmlinkage long sys_swapon(const char __
mutex_lock(&swapon_mutex);
spin_lock(&swap_lock);
- p->flags = SWP_ACTIVE;
+ p->flags |= SWP_WRITEOK;
nr_swap_pages += nr_good_pages;
total_swap_pages += nr_good_pages;
Index: linux-2.6/include/linux/fs.h
===================================================================
--- linux-2.6.orig/include/linux/fs.h
+++ linux-2.6/include/linux/fs.h
@@ -485,6 +485,7 @@ struct address_space_operations {
int (*migratepage) (struct address_space *,
struct page *, struct page *);
int (*launder_page) (struct page *);
+ int (*swapfile)(struct address_space *, int);
};
/*
Index: linux-2.6/Documentation/filesystems/Locking
===================================================================
--- linux-2.6.orig/Documentation/filesystems/Locking
+++ linux-2.6/Documentation/filesystems/Locking
@@ -174,6 +174,7 @@ prototypes:
int (*direct_IO)(int, struct kiocb *, const struct iovec *iov,
loff_t offset, unsigned long nr_segs);
int (*launder_page) (struct page *);
+ int (*swapfile) (struct address_space *, int);
locking rules:
All except set_page_dirty may block
@@ -195,6 +196,7 @@ invalidatepage: no yes
releasepage: no yes
direct_IO: no
launder_page: no yes
+swapfile: no
->prepare_write(), ->commit_write(), ->sync_page() and ->readpage()
may be called from the request handler (/dev/loop).
@@ -294,6 +296,13 @@ cleaned, or an error value if not. Note
getting mapped back in and redirtied, it needs to be kept locked
across the entire operation.
+ ->swapfile() will be called with a non-zero argument on address spaces
+backing swapfiles that are not backed by a block device. A return value of
+zero indicates success, in which case the address space can be used to back
+swapspace; the swapspace operations will then be proxied to the address
+space operations. Swapoff will call this method with a zero argument to
+release the address space.
+
Note: currently almost all instances of address_space methods are
using BKL for internal serialization and that's one of the worst sources
of contention. Normally they are calling library functions (in fs/buffer.c)
Index: linux-2.6/mm/Kconfig
===================================================================
--- linux-2.6.orig/mm/Kconfig
+++ linux-2.6/mm/Kconfig
@@ -186,6 +186,9 @@ config BOUNCE
def_bool y
depends on BLOCK && MMU && (ZONE_DMA || HIGHMEM)
+config SWAP_FILE
+ def_bool n
+
config NR_QUICK
int
depends on QUICKLIST
Index: linux-2.6/include/linux/buffer_head.h
===================================================================
--- linux-2.6.orig/include/linux/buffer_head.h
+++ linux-2.6/include/linux/buffer_head.h
@@ -329,7 +329,7 @@ static inline void invalidate_inode_buff
static inline int remove_inode_buffers(struct inode *inode) { return 1; }
static inline int sync_mapping_buffers(struct address_space *mapping) { return 0; }
static inline void invalidate_bdev(struct block_device *bdev) {}
-
+static inline void block_sync_page(struct page *page) { }
#endif /* CONFIG_BLOCK */
#endif /* _LINUX_BUFFER_HEAD_H */
--
* [PATCH 26/33] mm: methods for teaching filesystems about PG_swapcache pages
2007-10-30 16:04 [PATCH 00/33] Swap over NFS -v14 Peter Zijlstra
` (24 preceding siblings ...)
2007-10-30 16:04 ` [PATCH 25/33] mm: add support for non block device backed swap files Peter Zijlstra
@ 2007-10-30 16:04 ` Peter Zijlstra
2007-10-30 16:04 ` [PATCH 27/33] nfs: remove mempools Peter Zijlstra
` (7 subsequent siblings)
33 siblings, 0 replies; 72+ messages in thread
From: Peter Zijlstra @ 2007-10-30 16:04 UTC (permalink / raw)
To: Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, netdev,
trond.myklebust
Cc: Peter Zijlstra
[-- Attachment #1: mm-page_file_methods.patch --]
[-- Type: text/plain, Size: 3017 bytes --]
In order to teach filesystems to handle swap cache pages, two new page
functions are introduced:
pgoff_t page_file_index(struct page *);
struct address_space *page_file_mapping(struct page *);
page_file_index - gives the offset of this page in the file in PAGE_CACHE_SIZE
blocks. Like page->index does for mapped pages, this function gives the
correct index for PG_swapcache pages as well.
page_file_mapping - gives the mapping backing the actual page; that is, for
swap cache pages it gives swap_file->f_mapping.
page_offset() is modified to use page_file_index(), so that it will give the
expected result, even for PG_swapcache pages.
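Illustrative use in filesystem code (a sketch; this is the conversion that
patch 28 applies throughout the NFS client):

    /* Works for both regular pagecache pages and PG_swapcache pages,
     * unlike bare page->mapping and page->index. */
    struct inode *inode = page_file_mapping(page)->host;
    loff_t pos = (loff_t)page_file_index(page) << PAGE_CACHE_SHIFT;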
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
include/linux/mm.h | 26 ++++++++++++++++++++++++++
include/linux/pagemap.h | 2 +-
2 files changed, 27 insertions(+), 1 deletion(-)
Index: linux-2.6/include/linux/mm.h
===================================================================
--- linux-2.6.orig/include/linux/mm.h
+++ linux-2.6/include/linux/mm.h
@@ -13,6 +13,7 @@
#include <linux/debug_locks.h>
#include <linux/mm_types.h>
#include <linux/swap.h>
+#include <linux/fs.h>
struct mempolicy;
struct anon_vma;
@@ -581,6 +582,16 @@ static inline struct swap_info_struct *p
return get_swap_info_struct(swp_type(swap));
}
+static inline
+struct address_space *page_file_mapping(struct page *page)
+{
+#ifdef CONFIG_SWAP_FILE
+ if (unlikely(PageSwapCache(page)))
+ return page_swap_info(page)->swap_file->f_mapping;
+#endif
+ return page->mapping;
+}
+
static inline int PageAnon(struct page *page)
{
return ((unsigned long)page->mapping & PAGE_MAPPING_ANON) != 0;
@@ -598,6 +609,21 @@ static inline pgoff_t page_index(struct
}
/*
+ * Return the file index of the page. Regular pagecache pages use ->index
+ * whereas swapcache pages use swp_offset(->private)
+ */
+static inline pgoff_t page_file_index(struct page *page)
+{
+#ifdef CONFIG_SWAP_FILE
+ if (unlikely(PageSwapCache(page))) {
+ swp_entry_t swap = { .val = page_private(page) };
+ return swp_offset(swap);
+ }
+#endif
+ return page->index;
+}
+
+/*
* The atomic page->_mapcount, like _count, starts from -1:
* so that transitions both from it and to it can be tracked,
* using atomic_inc_and_test and atomic_add_negative(-1).
Index: linux-2.6/include/linux/pagemap.h
===================================================================
--- linux-2.6.orig/include/linux/pagemap.h
+++ linux-2.6/include/linux/pagemap.h
@@ -145,7 +145,7 @@ extern void __remove_from_page_cache(str
*/
static inline loff_t page_offset(struct page *page)
{
- return ((loff_t)page->index) << PAGE_CACHE_SHIFT;
+ return ((loff_t)page_file_index(page)) << PAGE_CACHE_SHIFT;
}
static inline pgoff_t linear_page_index(struct vm_area_struct *vma,
--
* [PATCH 27/33] nfs: remove mempools
2007-10-30 16:04 [PATCH 00/33] Swap over NFS -v14 Peter Zijlstra
` (25 preceding siblings ...)
2007-10-30 16:04 ` [PATCH 26/33] mm: methods for teaching filesystems about PG_swapcache pages Peter Zijlstra
@ 2007-10-30 16:04 ` Peter Zijlstra
2007-10-30 16:04 ` [PATCH 28/33] nfs: teach the NFS client how to treat PG_swapcache pages Peter Zijlstra
` (6 subsequent siblings)
33 siblings, 0 replies; 72+ messages in thread
From: Peter Zijlstra @ 2007-10-30 16:04 UTC (permalink / raw)
To: Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, netdev,
trond.myklebust
Cc: Peter Zijlstra
[-- Attachment #1: nfs-no-mempool.patch --]
[-- Type: text/plain, Size: 5144 bytes --]
With the introduction of the shared dirty page accounting in 2.6.19, NFS
should not be able to surprise the VM with all dirty pages. Thus it should
always be able to free some memory, and hence there is no more need for
mempools.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
fs/nfs/read.c | 15 +++------------
fs/nfs/write.c | 27 +++++----------------------
2 files changed, 8 insertions(+), 34 deletions(-)
Index: linux-2.6/fs/nfs/read.c
===================================================================
--- linux-2.6.orig/fs/nfs/read.c
+++ linux-2.6/fs/nfs/read.c
@@ -33,13 +33,10 @@ static const struct rpc_call_ops nfs_rea
static const struct rpc_call_ops nfs_read_full_ops;
static struct kmem_cache *nfs_rdata_cachep;
-static mempool_t *nfs_rdata_mempool;
-
-#define MIN_POOL_READ (32)
struct nfs_read_data *nfs_readdata_alloc(unsigned int pagecount)
{
- struct nfs_read_data *p = mempool_alloc(nfs_rdata_mempool, GFP_NOFS);
+ struct nfs_read_data *p = kmem_cache_alloc(nfs_rdata_cachep, GFP_NOFS);
if (p) {
memset(p, 0, sizeof(*p));
@@ -50,7 +47,7 @@ struct nfs_read_data *nfs_readdata_alloc
else {
p->pagevec = kcalloc(pagecount, sizeof(struct page *), GFP_NOFS);
if (!p->pagevec) {
- mempool_free(p, nfs_rdata_mempool);
+ kmem_cache_free(nfs_rdata_cachep, p);
p = NULL;
}
}
@@ -63,7 +60,7 @@ static void nfs_readdata_rcu_free(struct
struct nfs_read_data *p = container_of(head, struct nfs_read_data, task.u.tk_rcu);
if (p && (p->pagevec != &p->page_array[0]))
kfree(p->pagevec);
- mempool_free(p, nfs_rdata_mempool);
+ kmem_cache_free(nfs_rdata_cachep, p);
}
static void nfs_readdata_free(struct nfs_read_data *rdata)
@@ -597,16 +594,10 @@ int __init nfs_init_readpagecache(void)
if (nfs_rdata_cachep == NULL)
return -ENOMEM;
- nfs_rdata_mempool = mempool_create_slab_pool(MIN_POOL_READ,
- nfs_rdata_cachep);
- if (nfs_rdata_mempool == NULL)
- return -ENOMEM;
-
return 0;
}
void nfs_destroy_readpagecache(void)
{
- mempool_destroy(nfs_rdata_mempool);
kmem_cache_destroy(nfs_rdata_cachep);
}
Index: linux-2.6/fs/nfs/write.c
===================================================================
--- linux-2.6.orig/fs/nfs/write.c
+++ linux-2.6/fs/nfs/write.c
@@ -28,9 +28,6 @@
#define NFSDBG_FACILITY NFSDBG_PAGECACHE
-#define MIN_POOL_WRITE (32)
-#define MIN_POOL_COMMIT (4)
-
/*
* Local function declarations
*/
@@ -44,12 +41,10 @@ static const struct rpc_call_ops nfs_wri
static const struct rpc_call_ops nfs_commit_ops;
static struct kmem_cache *nfs_wdata_cachep;
-static mempool_t *nfs_wdata_mempool;
-static mempool_t *nfs_commit_mempool;
struct nfs_write_data *nfs_commit_alloc(void)
{
- struct nfs_write_data *p = mempool_alloc(nfs_commit_mempool, GFP_NOFS);
+ struct nfs_write_data *p = kmem_cache_alloc(nfs_wdata_cachep, GFP_NOFS);
if (p) {
memset(p, 0, sizeof(*p));
@@ -63,7 +58,7 @@ static void nfs_commit_rcu_free(struct r
struct nfs_write_data *p = container_of(head, struct nfs_write_data, task.u.tk_rcu);
if (p && (p->pagevec != &p->page_array[0]))
kfree(p->pagevec);
- mempool_free(p, nfs_commit_mempool);
+ kmem_cache_free(nfs_wdata_cachep, p);
}
void nfs_commit_free(struct nfs_write_data *wdata)
@@ -73,7 +68,7 @@ void nfs_commit_free(struct nfs_write_da
struct nfs_write_data *nfs_writedata_alloc(unsigned int pagecount)
{
- struct nfs_write_data *p = mempool_alloc(nfs_wdata_mempool, GFP_NOFS);
+ struct nfs_write_data *p = kmem_cache_alloc(nfs_wdata_cachep, GFP_NOFS);
if (p) {
memset(p, 0, sizeof(*p));
@@ -84,7 +79,7 @@ struct nfs_write_data *nfs_writedata_all
else {
p->pagevec = kcalloc(pagecount, sizeof(struct page *), GFP_NOFS);
if (!p->pagevec) {
- mempool_free(p, nfs_wdata_mempool);
+ kmem_cache_free(nfs_wdata_cachep, p);
p = NULL;
}
}
@@ -97,7 +92,7 @@ static void nfs_writedata_rcu_free(struc
struct nfs_write_data *p = container_of(head, struct nfs_write_data, task.u.tk_rcu);
if (p && (p->pagevec != &p->page_array[0]))
kfree(p->pagevec);
- mempool_free(p, nfs_wdata_mempool);
+ kmem_cache_free(nfs_wdata_cachep, p);
}
static void nfs_writedata_free(struct nfs_write_data *wdata)
@@ -1474,16 +1469,6 @@ int __init nfs_init_writepagecache(void)
if (nfs_wdata_cachep == NULL)
return -ENOMEM;
- nfs_wdata_mempool = mempool_create_slab_pool(MIN_POOL_WRITE,
- nfs_wdata_cachep);
- if (nfs_wdata_mempool == NULL)
- return -ENOMEM;
-
- nfs_commit_mempool = mempool_create_slab_pool(MIN_POOL_COMMIT,
- nfs_wdata_cachep);
- if (nfs_commit_mempool == NULL)
- return -ENOMEM;
-
/*
* NFS congestion size, scale with available memory.
*
@@ -1509,8 +1494,6 @@ int __init nfs_init_writepagecache(void)
void nfs_destroy_writepagecache(void)
{
- mempool_destroy(nfs_commit_mempool);
- mempool_destroy(nfs_wdata_mempool);
kmem_cache_destroy(nfs_wdata_cachep);
}
--
* [PATCH 28/33] nfs: teach the NFS client how to treat PG_swapcache pages
2007-10-30 16:04 [PATCH 00/33] Swap over NFS -v14 Peter Zijlstra
` (26 preceding siblings ...)
2007-10-30 16:04 ` [PATCH 27/33] nfs: remove mempools Peter Zijlstra
@ 2007-10-30 16:04 ` Peter Zijlstra
2007-10-31 8:52 ` Christoph Hellwig
2007-10-30 16:04 ` [PATCH 29/33] nfs: disable data cache revalidation for swapfiles Peter Zijlstra
` (5 subsequent siblings)
33 siblings, 1 reply; 72+ messages in thread
From: Peter Zijlstra @ 2007-10-30 16:04 UTC (permalink / raw)
To: Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, netdev,
trond.myklebust
Cc: Peter Zijlstra
[-- Attachment #1: nfs-swapcache.patch --]
[-- Type: text/plain, Size: 12772 bytes --]
Replace all relevant occurrences of page->index and page->mapping in the NFS
client with the new page_file_index() and page_file_mapping() functions.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
fs/nfs/file.c | 8 ++++----
fs/nfs/internal.h | 7 ++++---
fs/nfs/pagelist.c | 6 +++---
fs/nfs/read.c | 6 +++---
fs/nfs/write.c | 49 +++++++++++++++++++++++++------------------------
5 files changed, 39 insertions(+), 37 deletions(-)
Index: linux-2.6/fs/nfs/file.c
===================================================================
--- linux-2.6.orig/fs/nfs/file.c
+++ linux-2.6/fs/nfs/file.c
@@ -357,7 +357,7 @@ static void nfs_invalidate_page(struct p
if (offset != 0)
return;
/* Cancel any unstarted writes on this page */
- nfs_wb_page_cancel(page->mapping->host, page);
+ nfs_wb_page_cancel(page_file_mapping(page)->host, page);
}
static int nfs_release_page(struct page *page, gfp_t gfp)
@@ -368,7 +368,7 @@ static int nfs_release_page(struct page
static int nfs_launder_page(struct page *page)
{
- return nfs_wb_page(page->mapping->host, page);
+ return nfs_wb_page(page_file_mapping(page)->host, page);
}
const struct address_space_operations nfs_file_aops = {
@@ -397,13 +397,13 @@ static int nfs_vm_page_mkwrite(struct vm
loff_t offset;
lock_page(page);
- mapping = page->mapping;
+ mapping = page_file_mapping(page);
if (mapping != vma->vm_file->f_path.dentry->d_inode->i_mapping) {
unlock_page(page);
return -EINVAL;
}
pagelen = nfs_page_length(page);
- offset = (loff_t)page->index << PAGE_CACHE_SHIFT;
+ offset = (loff_t)page_file_index(page) << PAGE_CACHE_SHIFT;
unlock_page(page);
/*
Index: linux-2.6/fs/nfs/pagelist.c
===================================================================
--- linux-2.6.orig/fs/nfs/pagelist.c
+++ linux-2.6/fs/nfs/pagelist.c
@@ -77,11 +77,11 @@ nfs_create_request(struct nfs_open_conte
* update_nfs_request below if the region is not locked. */
req->wb_page = page;
atomic_set(&req->wb_complete, 0);
- req->wb_index = page->index;
+ req->wb_index = page_file_index(page);
page_cache_get(page);
BUG_ON(PagePrivate(page));
BUG_ON(!PageLocked(page));
- BUG_ON(page->mapping->host != inode);
+ BUG_ON(page_file_mapping(page)->host != inode);
req->wb_offset = offset;
req->wb_pgbase = offset;
req->wb_bytes = count;
@@ -383,7 +383,7 @@ void nfs_pageio_cond_complete(struct nfs
* nfs_scan_list - Scan a list for matching requests
* @nfsi: NFS inode
* @dst: Destination list
- * @idx_start: lower bound of page->index to scan
+ * @idx_start: lower bound of page_file_index(page) to scan
* @npages: idx_start + npages sets the upper bound to scan.
* @tag: tag to scan for
*
Index: linux-2.6/fs/nfs/read.c
===================================================================
--- linux-2.6.orig/fs/nfs/read.c
+++ linux-2.6/fs/nfs/read.c
@@ -460,11 +460,11 @@ static const struct rpc_call_ops nfs_rea
int nfs_readpage(struct file *file, struct page *page)
{
struct nfs_open_context *ctx;
- struct inode *inode = page->mapping->host;
+ struct inode *inode = page_file_mapping(page)->host;
int error;
dprintk("NFS: nfs_readpage (%p %ld@%lu)\n",
- page, PAGE_CACHE_SIZE, page->index);
+ page, PAGE_CACHE_SIZE, page_file_index(page));
nfs_inc_stats(inode, NFSIOS_VFSREADPAGE);
nfs_add_stats(inode, NFSIOS_READPAGES, 1);
@@ -511,7 +511,7 @@ static int
readpage_async_filler(void *data, struct page *page)
{
struct nfs_readdesc *desc = (struct nfs_readdesc *)data;
- struct inode *inode = page->mapping->host;
+ struct inode *inode = page_file_mapping(page)->host;
struct nfs_page *new;
unsigned int len;
int error;
Index: linux-2.6/fs/nfs/write.c
===================================================================
--- linux-2.6.orig/fs/nfs/write.c
+++ linux-2.6/fs/nfs/write.c
@@ -126,7 +126,7 @@ static struct nfs_page *nfs_page_find_re
static struct nfs_page *nfs_page_find_request(struct page *page)
{
- struct inode *inode = page->mapping->host;
+ struct inode *inode = page_file_mapping(page)->host;
struct nfs_page *req = NULL;
spin_lock(&inode->i_lock);
@@ -138,13 +138,13 @@ static struct nfs_page *nfs_page_find_re
/* Adjust the file length if we're writing beyond the end */
static void nfs_grow_file(struct page *page, unsigned int offset, unsigned int count)
{
- struct inode *inode = page->mapping->host;
+ struct inode *inode = page_file_mapping(page)->host;
loff_t end, i_size = i_size_read(inode);
pgoff_t end_index = (i_size - 1) >> PAGE_CACHE_SHIFT;
- if (i_size > 0 && page->index < end_index)
+ if (i_size > 0 && page_file_index(page) < end_index)
return;
- end = ((loff_t)page->index << PAGE_CACHE_SHIFT) + ((loff_t)offset+count);
+ end = page_offset(page) + ((loff_t)offset+count);
if (i_size >= end)
return;
nfs_inc_stats(inode, NFSIOS_EXTENDWRITE);
@@ -155,7 +155,7 @@ static void nfs_grow_file(struct page *p
static void nfs_set_pageerror(struct page *page)
{
SetPageError(page);
- nfs_zap_mapping(page->mapping->host, page->mapping);
+ nfs_zap_mapping(page_file_mapping(page)->host, page_file_mapping(page));
}
/* We can set the PG_uptodate flag if we see that a write request
@@ -187,7 +187,7 @@ static int nfs_writepage_setup(struct nf
ret = PTR_ERR(req);
if (ret != -EBUSY)
return ret;
- ret = nfs_wb_page(page->mapping->host, page);
+ ret = nfs_wb_page(page_file_mapping(page)->host, page);
if (ret != 0)
return ret;
}
@@ -221,7 +221,7 @@ static int nfs_set_page_writeback(struct
int ret = test_set_page_writeback(page);
if (!ret) {
- struct inode *inode = page->mapping->host;
+ struct inode *inode = page_file_mapping(page)->host;
struct nfs_server *nfss = NFS_SERVER(inode);
if (atomic_long_inc_return(&nfss->writeback) >
@@ -233,7 +233,7 @@ static int nfs_set_page_writeback(struct
static void nfs_end_page_writeback(struct page *page)
{
- struct inode *inode = page->mapping->host;
+ struct inode *inode = page_file_mapping(page)->host;
struct nfs_server *nfss = NFS_SERVER(inode);
end_page_writeback(page);
@@ -248,7 +248,7 @@ static void nfs_end_page_writeback(struc
static int nfs_page_async_flush(struct nfs_pageio_descriptor *pgio,
struct page *page)
{
- struct inode *inode = page->mapping->host;
+ struct inode *inode = page_file_mapping(page)->host;
struct nfs_inode *nfsi = NFS_I(inode);
struct nfs_page *req;
int ret;
@@ -294,7 +294,7 @@ static int nfs_page_async_flush(struct n
static int nfs_do_writepage(struct page *page, struct writeback_control *wbc, struct nfs_pageio_descriptor *pgio)
{
- struct inode *inode = page->mapping->host;
+ struct inode *inode = page_file_mapping(page)->host;
nfs_inc_stats(inode, NFSIOS_VFSWRITEPAGE);
nfs_add_stats(inode, NFSIOS_WRITEPAGES, 1);
@@ -311,7 +311,7 @@ static int nfs_writepage_locked(struct p
struct nfs_pageio_descriptor pgio;
int err;
- nfs_pageio_init_write(&pgio, page->mapping->host, wb_priority(wbc));
+ nfs_pageio_init_write(&pgio, page_file_mapping(page)->host, wb_priority(wbc));
err = nfs_do_writepage(page, wbc, &pgio);
nfs_pageio_complete(&pgio);
if (err < 0)
@@ -442,7 +442,8 @@ nfs_mark_request_commit(struct nfs_page
NFS_PAGE_TAG_COMMIT);
spin_unlock(&inode->i_lock);
inc_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
- inc_bdi_stat(req->wb_page->mapping->backing_dev_info, BDI_RECLAIMABLE);
+ inc_bdi_stat(page_file_mapping(req->wb_page)->backing_dev_info,
+ BDI_RECLAIMABLE);
__mark_inode_dirty(inode, I_DIRTY_DATASYNC);
}
@@ -529,7 +530,7 @@ static void nfs_cancel_commit_list(struc
while(!list_empty(head)) {
req = nfs_list_entry(head->next);
dec_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
- dec_bdi_stat(req->wb_page->mapping->backing_dev_info,
+ dec_bdi_stat(page_file_mapping(req->wb_page)->backing_dev_info,
BDI_RECLAIMABLE);
nfs_list_remove_request(req);
clear_bit(PG_NEED_COMMIT, &(req)->wb_flags);
@@ -543,7 +544,7 @@ static void nfs_cancel_commit_list(struc
* nfs_scan_commit - Scan an inode for commit requests
* @inode: NFS inode to scan
* @dst: destination list
- * @idx_start: lower bound of page->index to scan.
+ * @idx_start: lower bound of page_file_index(page) to scan.
* @npages: idx_start + npages sets the upper bound to scan.
*
* Moves requests from the inode's 'commit' request list.
@@ -579,7 +580,7 @@ static inline int nfs_scan_commit(struct
static struct nfs_page * nfs_update_request(struct nfs_open_context* ctx,
struct page *page, unsigned int offset, unsigned int bytes)
{
- struct address_space *mapping = page->mapping;
+ struct address_space *mapping = page_file_mapping(page);
struct inode *inode = mapping->host;
struct nfs_page *req, *new = NULL;
pgoff_t rqend, end;
@@ -681,7 +682,7 @@ int nfs_flush_incompatible(struct file *
nfs_release_request(req);
if (!do_flush)
return 0;
- status = nfs_wb_page(page->mapping->host, page);
+ status = nfs_wb_page(page_file_mapping(page)->host, page);
} while (status == 0);
return status;
}
@@ -696,7 +697,7 @@ int nfs_updatepage(struct file *file, st
unsigned int offset, unsigned int count)
{
struct nfs_open_context *ctx = nfs_file_open_context(file);
- struct inode *inode = page->mapping->host;
+ struct inode *inode = page_file_mapping(page)->host;
int status = 0;
nfs_inc_stats(inode, NFSIOS_VFSUPDATEPAGE);
@@ -952,7 +953,7 @@ static void nfs_writeback_done_partial(s
}
if (nfs_write_need_commit(data)) {
- struct inode *inode = page->mapping->host;
+ struct inode *inode = page_file_mapping(page)->host;
spin_lock(&inode->i_lock);
if (test_bit(PG_NEED_RESCHED, &req->wb_flags)) {
@@ -1191,7 +1192,7 @@ nfs_commit_list(struct inode *inode, str
nfs_list_remove_request(req);
nfs_mark_request_commit(req);
dec_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
- dec_bdi_stat(req->wb_page->mapping->backing_dev_info,
+ dec_bdi_stat(page_file_mapping(req->wb_page)->backing_dev_info,
BDI_RECLAIMABLE);
nfs_clear_page_tag_locked(req);
}
@@ -1218,7 +1219,7 @@ static void nfs_commit_done(struct rpc_t
nfs_list_remove_request(req);
clear_bit(PG_NEED_COMMIT, &(req)->wb_flags);
dec_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
- dec_bdi_stat(req->wb_page->mapping->backing_dev_info,
+ dec_bdi_stat(page_file_mapping(req->wb_page)->backing_dev_info,
BDI_RECLAIMABLE);
dprintk("NFS: commit (%s/%Ld %d@%Ld)",
@@ -1384,7 +1385,7 @@ int nfs_wb_page_cancel(struct inode *ino
loff_t range_start = page_offset(page);
loff_t range_end = range_start + (loff_t)(PAGE_CACHE_SIZE - 1);
struct writeback_control wbc = {
- .bdi = page->mapping->backing_dev_info,
+ .bdi = page_file_mapping(page)->backing_dev_info,
.sync_mode = WB_SYNC_ALL,
.nr_to_write = LONG_MAX,
.range_start = range_start,
@@ -1417,7 +1418,7 @@ int nfs_wb_page_cancel(struct inode *ino
}
if (!PagePrivate(page))
return 0;
- ret = nfs_sync_mapping_wait(page->mapping, &wbc, FLUSH_INVALIDATE);
+ ret = nfs_sync_mapping_wait(page_file_mapping(page), &wbc, FLUSH_INVALIDATE);
out:
return ret;
}
@@ -1428,7 +1429,7 @@ static int nfs_wb_page_priority(struct i
loff_t range_start = page_offset(page);
loff_t range_end = range_start + (loff_t)(PAGE_CACHE_SIZE - 1);
struct writeback_control wbc = {
- .bdi = page->mapping->backing_dev_info,
+ .bdi = page_file_mapping(page)->backing_dev_info,
.sync_mode = WB_SYNC_ALL,
.nr_to_write = LONG_MAX,
.range_start = range_start,
@@ -1444,7 +1445,7 @@ static int nfs_wb_page_priority(struct i
}
if (!PagePrivate(page))
return 0;
- ret = nfs_sync_mapping_wait(page->mapping, &wbc, how);
+ ret = nfs_sync_mapping_wait(page_file_mapping(page), &wbc, how);
if (ret >= 0)
return 0;
out:
Index: linux-2.6/fs/nfs/internal.h
===================================================================
--- linux-2.6.orig/fs/nfs/internal.h
+++ linux-2.6/fs/nfs/internal.h
@@ -248,13 +248,14 @@ void nfs_super_set_maxbytes(struct super
static inline
unsigned int nfs_page_length(struct page *page)
{
- loff_t i_size = i_size_read(page->mapping->host);
+ loff_t i_size = i_size_read(page_file_mapping(page)->host);
if (i_size > 0) {
+ pgoff_t page_index = page_file_index(page);
pgoff_t end_index = (i_size - 1) >> PAGE_CACHE_SHIFT;
- if (page->index < end_index)
+ if (page_index < end_index)
return PAGE_CACHE_SIZE;
- if (page->index == end_index)
+ if (page_index == end_index)
return ((i_size - 1) & ~PAGE_CACHE_MASK) + 1;
}
return 0;
--
* [PATCH 29/33] nfs: disable data cache revalidation for swapfiles
2007-10-30 16:04 [PATCH 00/33] Swap over NFS -v14 Peter Zijlstra
` (27 preceding siblings ...)
2007-10-30 16:04 ` [PATCH 28/33] nfs: teach the NFS client how to treat PG_swapcache pages Peter Zijlstra
@ 2007-10-30 16:04 ` Peter Zijlstra
2007-10-30 16:04 ` [PATCH 30/33] nfs: swap vs nfs_writepage Peter Zijlstra
` (4 subsequent siblings)
33 siblings, 0 replies; 72+ messages in thread
From: Peter Zijlstra @ 2007-10-30 16:04 UTC (permalink / raw)
To: Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, netdev,
trond.myklebust
Cc: Peter Zijlstra
[-- Attachment #1: nfs-swapper.patch --]
[-- Type: text/plain, Size: 5511 bytes --]
Do as Trond suggested:
http://lkml.org/lkml/2006/8/25/348
Disable NFS data cache revalidation on swap files since it doesn't really
make sense to have other clients change the file while you are using it.
This lets us stop setting PG_private on swap pages, since there ought to
be no further races with invalidate_inode_pages2() to deal with.
And since we do not set PG_private, we cannot use page->private (which is
already used by PG_swapcache pages anyway) to store the nfs_page. Thus,
augment the new nfs_page_find_request logic to cover this case.
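Condensed from the diff below, the resulting lookup rule:

    /* pagecache pages keep their nfs_page in page_private();
     * swapcache pages cannot, so fall back to the per-inode radix
     * tree keyed by page_file_index(). */
    if (PagePrivate(page))
        req = (struct nfs_page *)page_private(page);
    else if (unlikely(PageSwapCache(page)))
        req = radix_tree_lookup(&nfsi->nfs_page_tree,
                                page_file_index(page));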
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
fs/nfs/inode.c | 6 ++++
fs/nfs/write.c | 73 ++++++++++++++++++++++++++++++++++++++++++++++-----------
2 files changed, 65 insertions(+), 14 deletions(-)
Index: linux-2.6/fs/nfs/inode.c
===================================================================
--- linux-2.6.orig/fs/nfs/inode.c
+++ linux-2.6/fs/nfs/inode.c
@@ -744,6 +744,12 @@ int nfs_revalidate_mapping_nolock(struct
struct nfs_inode *nfsi = NFS_I(inode);
int ret = 0;
+ /*
+ * swapfiles are not supposed to be shared.
+ */
+ if (IS_SWAPFILE(inode))
+ goto out;
+
if ((nfsi->cache_validity & NFS_INO_REVAL_PAGECACHE)
|| nfs_attribute_timeout(inode) || NFS_STALE(inode)) {
ret = __nfs_revalidate_inode(NFS_SERVER(inode), inode);
Index: linux-2.6/fs/nfs/write.c
===================================================================
--- linux-2.6.orig/fs/nfs/write.c
+++ linux-2.6/fs/nfs/write.c
@@ -112,25 +112,62 @@ static void nfs_context_set_write_error(
set_bit(NFS_CONTEXT_ERROR_WRITE, &ctx->flags);
}
-static struct nfs_page *nfs_page_find_request_locked(struct page *page)
+static struct nfs_page *
+__nfs_page_find_request_locked(struct nfs_inode *nfsi, struct page *page, int get)
{
struct nfs_page *req = NULL;
- if (PagePrivate(page)) {
+ if (PagePrivate(page))
req = (struct nfs_page *)page_private(page);
- if (req != NULL)
- kref_get(&req->wb_kref);
- }
+ else if (unlikely(PageSwapCache(page)))
+ req = radix_tree_lookup(&nfsi->nfs_page_tree, page_file_index(page));
+
+ if (get && req)
+ kref_get(&req->wb_kref);
+
return req;
}
+static inline struct nfs_page *
+nfs_page_find_request_locked(struct nfs_inode *nfsi, struct page *page)
+{
+ return __nfs_page_find_request_locked(nfsi, page, 1);
+}
+
+static int __nfs_page_has_request(struct page *page)
+{
+ struct inode *inode = page_file_mapping(page)->host;
+ struct nfs_page *req = NULL;
+
+ spin_lock(&inode->i_lock);
+ req = __nfs_page_find_request_locked(NFS_I(inode), page, 0);
+ spin_unlock(&inode->i_lock);
+
+ /*
+ * hole here plugged by the caller holding onto PG_locked
+ */
+
+ return req != NULL;
+}
+
+static inline int nfs_page_has_request(struct page *page)
+{
+ if (PagePrivate(page))
+ return 1;
+
+ if (unlikely(PageSwapCache(page)))
+ return __nfs_page_has_request(page);
+
+ return 0;
+}
+
static struct nfs_page *nfs_page_find_request(struct page *page)
{
struct inode *inode = page_file_mapping(page)->host;
struct nfs_page *req = NULL;
spin_lock(&inode->i_lock);
- req = nfs_page_find_request_locked(page);
+ req = nfs_page_find_request_locked(NFS_I(inode), page);
spin_unlock(&inode->i_lock);
return req;
}
@@ -255,7 +292,7 @@ static int nfs_page_async_flush(struct n
spin_lock(&inode->i_lock);
for(;;) {
- req = nfs_page_find_request_locked(page);
+ req = nfs_page_find_request_locked(nfsi, page);
if (req == NULL) {
spin_unlock(&inode->i_lock);
return 0;
@@ -374,8 +411,14 @@ static int nfs_inode_add_request(struct
if (nfs_have_delegation(inode, FMODE_WRITE))
nfsi->change_attr++;
}
- SetPagePrivate(req->wb_page);
- set_page_private(req->wb_page, (unsigned long)req);
+ /*
+ * Swap-space should not get truncated. Hence no need to plug the race
+ * with invalidate/truncate.
+ */
+ if (likely(!PageSwapCache(req->wb_page))) {
+ SetPagePrivate(req->wb_page);
+ set_page_private(req->wb_page, (unsigned long)req);
+ }
nfsi->npages++;
kref_get(&req->wb_kref);
return 0;
@@ -392,8 +435,10 @@ static void nfs_inode_remove_request(str
BUG_ON (!NFS_WBACK_BUSY(req));
spin_lock(&inode->i_lock);
- set_page_private(req->wb_page, 0);
- ClearPagePrivate(req->wb_page);
+ if (likely(!PageSwapCache(req->wb_page))) {
+ set_page_private(req->wb_page, 0);
+ ClearPagePrivate(req->wb_page);
+ }
radix_tree_delete(&nfsi->nfs_page_tree, req->wb_index);
nfsi->npages--;
if (!nfsi->npages) {
@@ -592,7 +637,7 @@ static struct nfs_page * nfs_update_requ
* A request for the page we wish to update
*/
spin_lock(&inode->i_lock);
- req = nfs_page_find_request_locked(page);
+ req = nfs_page_find_request_locked(NFS_I(inode), page);
if (req) {
if (!nfs_lock_request_dontget(req)) {
int error;
@@ -1416,7 +1461,7 @@ int nfs_wb_page_cancel(struct inode *ino
if (ret < 0)
goto out;
}
- if (!PagePrivate(page))
+ if (!nfs_page_has_request(page))
return 0;
ret = nfs_sync_mapping_wait(page_file_mapping(page), &wbc, FLUSH_INVALIDATE);
out:
@@ -1443,7 +1488,7 @@ static int nfs_wb_page_priority(struct i
if (ret < 0)
goto out;
}
- if (!PagePrivate(page))
+ if (!nfs_page_has_request(page))
return 0;
ret = nfs_sync_mapping_wait(page_file_mapping(page), &wbc, how);
if (ret >= 0)
--
* [PATCH 30/33] nfs: swap vs nfs_writepage
2007-10-30 16:04 [PATCH 00/33] Swap over NFS -v14 Peter Zijlstra
` (28 preceding siblings ...)
2007-10-30 16:04 ` [PATCH 29/33] nfs: disable data cache revalidation for swapfiles Peter Zijlstra
@ 2007-10-30 16:04 ` Peter Zijlstra
2007-10-30 16:04 ` [PATCH 31/33] nfs: enable swap on NFS Peter Zijlstra
` (3 subsequent siblings)
33 siblings, 0 replies; 72+ messages in thread
From: Peter Zijlstra @ 2007-10-30 16:04 UTC (permalink / raw)
To: Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, netdev,
trond.myklebust
Cc: Peter Zijlstra
[-- Attachment #1: nfs-fix-writepage.patch --]
[-- Type: text/plain, Size: 1409 bytes --]
For now just use the ->writepage() path for swap traffic. Trond would like
to see ->swap_page() or some such additional a_op.
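For orientation, a sketch of the call chain this relies on, given that
patch 25 proxies the swapper_space a_ops to the swapfile's mapping:

    /*
     * shrink_page_list()
     *   pageout(page, mapping)
     *     mapping->a_ops->writepage()              == swap_writepage()
     *       sis->swap_file->f_mapping->a_ops->writepage()
     *                                              == nfs_writepage()
     */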
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
fs/nfs/write.c | 23 +++++++++++++++++++++++
1 file changed, 23 insertions(+)
Index: linux-2.6/fs/nfs/write.c
===================================================================
--- linux-2.6.orig/fs/nfs/write.c
+++ linux-2.6/fs/nfs/write.c
@@ -336,6 +336,29 @@ static int nfs_do_writepage(struct page
nfs_inc_stats(inode, NFSIOS_VFSWRITEPAGE);
nfs_add_stats(inode, NFSIOS_WRITEPAGES, 1);
+ if (unlikely(IS_SWAPFILE(inode))) {
+ struct rpc_cred *cred;
+ struct nfs_open_context *ctx;
+ int status;
+
+ cred = rpcauth_lookupcred(NFS_CLIENT(inode)->cl_auth, 0);
+ if (IS_ERR(cred))
+ return PTR_ERR(cred);
+
+ ctx = nfs_find_open_context(inode, cred, FMODE_WRITE);
+ if (!ctx)
+ return -EBADF;
+
+ status = nfs_writepage_setup(ctx, page, 0, nfs_page_length(page));
+
+ put_nfs_open_context(ctx);
+
+ if (status < 0) {
+ nfs_set_pageerror(page);
+ return status;
+ }
+ }
+
nfs_pageio_cond_complete(pgio, page->index);
return nfs_page_async_flush(pgio, page);
}
--
* [PATCH 31/33] nfs: enable swap on NFS
2007-10-30 16:04 [PATCH 00/33] Swap over NFS -v14 Peter Zijlstra
` (29 preceding siblings ...)
2007-10-30 16:04 ` [PATCH 30/33] nfs: swap vs nfs_writepage Peter Zijlstra
@ 2007-10-30 16:04 ` Peter Zijlstra
2007-10-30 16:04 ` [PATCH 32/33] nfs: fix various memory recursions possible with swap over NFS Peter Zijlstra
` (2 subsequent siblings)
33 siblings, 0 replies; 72+ messages in thread
From: Peter Zijlstra @ 2007-10-30 16:04 UTC (permalink / raw)
To: Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, netdev,
trond.myklebust
Cc: Peter Zijlstra
[-- Attachment #1: nfs-swapfile.patch --]
[-- Type: text/plain, Size: 9536 bytes --]
Provide an a_ops->swapfile() implementation for NFS. This will set the
NFS socket to SOCK_MEMALLOC, run the socket reconnect under PF_MEMALLOC,
and set SOCK_MEMALLOC again before engaging the protocol ->connect() method.
PF_MEMALLOC should allow the allocation of struct socket and related objects,
and the early (re)setting of SOCK_MEMALLOC should allow us to receive the
packets required for the TCP connection buildup.
(swapping continues across a server reset, even during heavy network traffic)
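[ A minimal sketch, assuming the ->swapfile(mapping, enable) signature
added below, of how the swapon side would flip a mapping into swap duty.
The caller name is made up for illustration. ]
	/* illustrative swapon-side hook: a filesystem opts into backing
	 * swap by providing a_ops->swapfile() */
	static int swap_enable_mapping(struct address_space *mapping)
	{
		if (!mapping->a_ops->swapfile)
			return -EINVAL;		/* fs cannot back swap */
		return mapping->a_ops->swapfile(mapping, 1);	/* 1 == enable */
	}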
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
fs/Kconfig | 18 ++++++++++++
fs/nfs/file.c | 10 ++++++
include/linux/sunrpc/xprt.h | 5 ++-
net/sunrpc/sched.c | 9 ++++--
net/sunrpc/xprtsock.c | 63 ++++++++++++++++++++++++++++++++++++++++++++
5 files changed, 102 insertions(+), 3 deletions(-)
Index: linux-2.6/fs/nfs/file.c
===================================================================
--- linux-2.6.orig/fs/nfs/file.c
+++ linux-2.6/fs/nfs/file.c
@@ -371,6 +371,13 @@ static int nfs_launder_page(struct page
return nfs_wb_page(page_file_mapping(page)->host, page);
}
+#ifdef CONFIG_NFS_SWAP
+static int nfs_swapfile(struct address_space *mapping, int enable)
+{
+ return xs_swapper(NFS_CLIENT(mapping->host)->cl_xprt, enable);
+}
+#endif
+
const struct address_space_operations nfs_file_aops = {
.readpage = nfs_readpage,
.readpages = nfs_readpages,
@@ -385,6 +392,9 @@ const struct address_space_operations nf
.direct_IO = nfs_direct_IO,
#endif
.launder_page = nfs_launder_page,
+#ifdef CONFIG_NFS_SWAP
+ .swapfile = nfs_swapfile,
+#endif
};
static int nfs_vm_page_mkwrite(struct vm_area_struct *vma, struct page *page)
Index: linux-2.6/include/linux/sunrpc/xprt.h
===================================================================
--- linux-2.6.orig/include/linux/sunrpc/xprt.h
+++ linux-2.6/include/linux/sunrpc/xprt.h
@@ -143,7 +143,9 @@ struct rpc_xprt {
unsigned int max_reqs; /* total slots */
unsigned long state; /* transport state */
unsigned char shutdown : 1, /* being shut down */
- resvport : 1; /* use a reserved port */
+ resvport : 1, /* use a reserved port */
+ swapper : 1; /* we're swapping over this
+ transport */
unsigned int bind_index; /* bind function index */
/*
@@ -246,6 +248,7 @@ struct rpc_rqst * xprt_lookup_rqst(struc
void xprt_complete_rqst(struct rpc_task *task, int copied);
void xprt_release_rqst_cong(struct rpc_task *task);
void xprt_disconnect(struct rpc_xprt *xprt);
+int xs_swapper(struct rpc_xprt *xprt, int enable);
/*
* Reserved bit positions in xprt->state
Index: linux-2.6/net/sunrpc/sched.c
===================================================================
--- linux-2.6.orig/net/sunrpc/sched.c
+++ linux-2.6/net/sunrpc/sched.c
@@ -761,7 +761,10 @@ struct rpc_buffer {
void *rpc_malloc(struct rpc_task *task, size_t size)
{
struct rpc_buffer *buf;
- gfp_t gfp = RPC_IS_SWAPPER(task) ? GFP_ATOMIC : GFP_NOWAIT;
+ gfp_t gfp = GFP_NOWAIT;
+
+ if (RPC_IS_SWAPPER(task))
+ gfp |= __GFP_MEMALLOC;
size += sizeof(struct rpc_buffer);
if (size <= RPC_BUFFER_MAXSIZE)
@@ -817,6 +820,8 @@ void rpc_init_task(struct rpc_task *task
atomic_set(&task->tk_count, 1);
task->tk_client = clnt;
task->tk_flags = flags;
+ if (clnt->cl_xprt->swapper)
+ task->tk_flags |= RPC_TASK_SWAPPER;
task->tk_ops = tk_ops;
if (tk_ops->rpc_call_prepare != NULL)
task->tk_action = rpc_prepare_task;
@@ -853,7 +858,7 @@ void rpc_init_task(struct rpc_task *task
static struct rpc_task *
rpc_alloc_task(void)
{
- return (struct rpc_task *)mempool_alloc(rpc_task_mempool, GFP_NOFS);
+ return (struct rpc_task *)mempool_alloc(rpc_task_mempool, GFP_NOIO);
}
static void rpc_free_task(struct rcu_head *rcu)
Index: linux-2.6/net/sunrpc/xprtsock.c
===================================================================
--- linux-2.6.orig/net/sunrpc/xprtsock.c
+++ linux-2.6/net/sunrpc/xprtsock.c
@@ -1397,6 +1397,9 @@ static void xs_udp_finish_connecting(str
transport->sock = sock;
transport->inet = sk;
+ if (xprt->swapper)
+ sk_set_memalloc(sk);
+
write_unlock_bh(&sk->sk_callback_lock);
}
xs_udp_do_set_buffer_size(xprt);
@@ -1414,11 +1417,15 @@ static void xs_udp_connect_worker4(struc
container_of(work, struct sock_xprt, connect_worker.work);
struct rpc_xprt *xprt = &transport->xprt;
struct socket *sock = transport->sock;
+ unsigned long pflags = current->flags;
int err, status = -EIO;
if (xprt->shutdown || !xprt_bound(xprt))
goto out;
+ if (xprt->swapper)
+ current->flags |= PF_MEMALLOC;
+
/* Start by resetting any existing state */
xs_close(xprt);
@@ -1441,6 +1448,7 @@ static void xs_udp_connect_worker4(struc
out:
xprt_wake_pending_tasks(xprt, status);
xprt_clear_connecting(xprt);
+ tsk_restore_flags(current, pflags, PF_MEMALLOC);
}
/**
@@ -1455,11 +1463,15 @@ static void xs_udp_connect_worker6(struc
container_of(work, struct sock_xprt, connect_worker.work);
struct rpc_xprt *xprt = &transport->xprt;
struct socket *sock = transport->sock;
+ unsigned long pflags = current->flags;
int err, status = -EIO;
if (xprt->shutdown || !xprt_bound(xprt))
goto out;
+ if (xprt->swapper)
+ current->flags |= PF_MEMALLOC;
+
/* Start by resetting any existing state */
xs_close(xprt);
@@ -1482,6 +1494,7 @@ static void xs_udp_connect_worker6(struc
out:
xprt_wake_pending_tasks(xprt, status);
xprt_clear_connecting(xprt);
+ tsk_restore_flags(current, pflags, PF_MEMALLOC);
}
/*
@@ -1541,6 +1554,9 @@ static int xs_tcp_finish_connecting(stru
write_unlock_bh(&sk->sk_callback_lock);
}
+ if (xprt->swapper)
+ sk_set_memalloc(transport->inet);
+
/* Tell the socket layer to start connecting... */
xprt->stat.connect_count++;
xprt->stat.connect_start = jiffies;
@@ -1559,11 +1575,15 @@ static void xs_tcp_connect_worker4(struc
container_of(work, struct sock_xprt, connect_worker.work);
struct rpc_xprt *xprt = &transport->xprt;
struct socket *sock = transport->sock;
+ unsigned long pflags = current->flags;
int err, status = -EIO;
if (xprt->shutdown || !xprt_bound(xprt))
goto out;
+ if (xprt->swapper)
+ current->flags |= PF_MEMALLOC;
+
if (!sock) {
/* start from scratch */
if ((err = sock_create_kern(PF_INET, SOCK_STREAM, IPPROTO_TCP, &sock)) < 0) {
@@ -1606,6 +1626,7 @@ out:
xprt_wake_pending_tasks(xprt, status);
out_clear:
xprt_clear_connecting(xprt);
+ tsk_restore_flags(current, pflags, PF_MEMALLOC);
}
/**
@@ -1620,11 +1641,15 @@ static void xs_tcp_connect_worker6(struc
container_of(work, struct sock_xprt, connect_worker.work);
struct rpc_xprt *xprt = &transport->xprt;
struct socket *sock = transport->sock;
+ unsigned long pflags = current->flags;
int err, status = -EIO;
if (xprt->shutdown || !xprt_bound(xprt))
goto out;
+ if (xprt->swapper)
+ current->flags |= PF_MEMALLOC;
+
if (!sock) {
/* start from scratch */
if ((err = sock_create_kern(PF_INET6, SOCK_STREAM, IPPROTO_TCP, &sock)) < 0) {
@@ -1666,6 +1691,7 @@ out:
xprt_wake_pending_tasks(xprt, status);
out_clear:
xprt_clear_connecting(xprt);
+ tsk_restore_flags(current, pflags, PF_MEMALLOC);
}
/**
@@ -1985,6 +2011,43 @@ int init_socket_xprt(void)
return 0;
}
+#ifdef CONFIG_SUNRPC_SWAP
+#define RPC_BUF_RESERVE_PAGES \
+ kestimate_single(sizeof(struct rpc_rqst), GFP_KERNEL, RPC_MAX_SLOT_TABLE)
+#define RPC_RESERVE_PAGES (RPC_BUF_RESERVE_PAGES + TX_RESERVE_PAGES)
+
+/**
+ * xs_swapper - Tag this transport as being used for swap.
+ * @xprt: transport to tag
+ * @enable: enable/disable
+ *
+ */
+int xs_swapper(struct rpc_xprt *xprt, int enable)
+{
+ struct sock_xprt *transport = container_of(xprt, struct sock_xprt, xprt);
+ int err = 0;
+
+ if (enable) {
+ /*
+ * keep one extra sock reference so the reserve won't dip
+ * when the socket gets reconnected.
+ */
+ err = sk_adjust_memalloc(1, RPC_RESERVE_PAGES);
+ if (!err) {
+ sk_set_memalloc(transport->inet);
+ xprt->swapper = 1;
+ }
+ } else if (xprt->swapper) {
+ xprt->swapper = 0;
+ sk_clear_memalloc(transport->inet);
+ sk_adjust_memalloc(-1, -RPC_RESERVE_PAGES);
+ }
+
+ return err;
+}
+EXPORT_SYMBOL_GPL(xs_swapper);
+#endif
+
/**
* cleanup_socket_xprt - remove xprtsock's sysctls, unregister
*
Index: linux-2.6/fs/Kconfig
===================================================================
--- linux-2.6.orig/fs/Kconfig
+++ linux-2.6/fs/Kconfig
@@ -1690,6 +1690,18 @@ config NFS_DIRECTIO
causes open() to return EINVAL if a file residing in NFS is
opened with the O_DIRECT flag.
+config NFS_SWAP
+ bool "Provide swap over NFS support"
+ default n
+ depends on NFS_FS
+ select SUNRPC_SWAP
+ help
+ This option enables swapon to work on files located on NFS mounts.
+
+ For more details, see Documentation/vm_deadlock.txt
+
+ If unsure, say N.
+
config NFSD
tristate "NFS server support"
depends on INET
@@ -1824,6 +1836,12 @@ config SUNRPC_BIND34
If unsure, say N to get traditional behavior (version 2 rpcbind
requests only).
+config SUNRPC_SWAP
+ def_bool n
+ depends on SUNRPC
+ select NETVM
+ select SWAP_FILE
+
config RPCSEC_GSS_KRB5
tristate "Secure RPC: Kerberos V mechanism (EXPERIMENTAL)"
depends on SUNRPC && EXPERIMENTAL
--
* [PATCH 32/33] nfs: fix various memory recursions possible with swap over NFS.
2007-10-30 16:04 [PATCH 00/33] Swap over NFS -v14 Peter Zijlstra
` (30 preceding siblings ...)
2007-10-30 16:04 ` [PATCH 31/33] nfs: enable swap on NFS Peter Zijlstra
@ 2007-10-30 16:04 ` Peter Zijlstra
2007-10-30 16:04 ` [PATCH 33/33] nfs: do not warn on radix tree node allocation failures Peter Zijlstra
2007-10-31 3:26 ` [PATCH 00/33] Swap over NFS -v14 Nick Piggin
33 siblings, 0 replies; 72+ messages in thread
From: Peter Zijlstra @ 2007-10-30 16:04 UTC (permalink / raw)
To: Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, netdev,
trond.myklebust
Cc: Peter Zijlstra
[-- Attachment #1: nfs-alloc-recursions.patch --]
[-- Type: text/plain, Size: 2086 bytes --]
GFP_NOFS is not enough, since swap traffic is IO, hence fall back to GFP_NOIO.
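[ For reference, the flag compositions (as of 2.6.23 include/linux/gfp.h)
that make GFP_NOFS insufficient here: ]
	/*
	 * GFP_KERNEL = __GFP_WAIT | __GFP_IO | __GFP_FS
	 * GFP_NOFS   = __GFP_WAIT | __GFP_IO
	 * GFP_NOIO   = __GFP_WAIT
	 *
	 * GFP_NOFS only forbids recursing into the filesystem; it may still
	 * start IO. Swap writeback over NFS _is_ IO, so allocations on this
	 * path must drop to GFP_NOIO or they can recurse into the very
	 * writeback they are servicing.
	 */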
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
fs/nfs/pagelist.c | 2 +-
fs/nfs/write.c | 6 +++---
2 files changed, 4 insertions(+), 4 deletions(-)
Index: linux-2.6/fs/nfs/write.c
===================================================================
--- linux-2.6.orig/fs/nfs/write.c
+++ linux-2.6/fs/nfs/write.c
@@ -44,7 +44,7 @@ static struct kmem_cache *nfs_wdata_cach
struct nfs_write_data *nfs_commit_alloc(void)
{
- struct nfs_write_data *p = kmem_cache_alloc(nfs_wdata_cachep, GFP_NOFS);
+ struct nfs_write_data *p = kmem_cache_alloc(nfs_wdata_cachep, GFP_NOIO);
if (p) {
memset(p, 0, sizeof(*p));
@@ -68,7 +68,7 @@ void nfs_commit_free(struct nfs_write_da
struct nfs_write_data *nfs_writedata_alloc(unsigned int pagecount)
{
- struct nfs_write_data *p = kmem_cache_alloc(nfs_wdata_cachep, GFP_NOFS);
+ struct nfs_write_data *p = kmem_cache_alloc(nfs_wdata_cachep, GFP_NOIO);
if (p) {
memset(p, 0, sizeof(*p));
@@ -77,7 +77,7 @@ struct nfs_write_data *nfs_writedata_all
if (pagecount <= ARRAY_SIZE(p->page_array))
p->pagevec = p->page_array;
else {
- p->pagevec = kcalloc(pagecount, sizeof(struct page *), GFP_NOFS);
+ p->pagevec = kcalloc(pagecount, sizeof(struct page *), GFP_NOIO);
if (!p->pagevec) {
kmem_cache_free(nfs_wdata_cachep, p);
p = NULL;
Index: linux-2.6/fs/nfs/pagelist.c
===================================================================
--- linux-2.6.orig/fs/nfs/pagelist.c
+++ linux-2.6/fs/nfs/pagelist.c
@@ -27,7 +27,7 @@ static inline struct nfs_page *
nfs_page_alloc(void)
{
struct nfs_page *p;
- p = kmem_cache_alloc(nfs_page_cachep, GFP_KERNEL);
+ p = kmem_cache_alloc(nfs_page_cachep, GFP_NOIO);
if (p) {
memset(p, 0, sizeof(*p));
INIT_LIST_HEAD(&p->wb_list);
--
* [PATCH 33/33] nfs: do not warn on radix tree node allocation failures
2007-10-30 16:04 [PATCH 00/33] Swap over NFS -v14 Peter Zijlstra
` (31 preceding siblings ...)
2007-10-30 16:04 ` [PATCH 32/33] nfs: fix various memory recursions possible with swap over NFS Peter Zijlstra
@ 2007-10-30 16:04 ` Peter Zijlstra
2007-10-31 3:26 ` [PATCH 00/33] Swap over NFS -v14 Nick Piggin
33 siblings, 0 replies; 72+ messages in thread
From: Peter Zijlstra @ 2007-10-30 16:04 UTC (permalink / raw)
To: Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, netdev,
trond.myklebust
Cc: Peter Zijlstra
[-- Attachment #1: nfs_radix_nowarn.patch --]
[-- Type: text/plain, Size: 2651 bytes --]
GFP_ATOMIC failures are rather common, so do not warn about them.
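[ For reference, the idiom the write.c hunks below implement, condensed
into one sketch (not the literal patch code): ]
	/* preallocate radix tree nodes with a blocking GFP_NOIO allocation
	 * outside the lock; the insertion under inode->i_lock then cannot
	 * fail for lack of nodes, avoiding the noisy GFP_ATOMIC fallback */
	error = radix_tree_preload(GFP_NOIO);
	if (error)
		return ERR_PTR(error);
	spin_lock(&inode->i_lock);
	/* ... nfs_page_find_request_locked() / nfs_inode_add_request() ... */
	spin_unlock(&inode->i_lock);
	radix_tree_preload_end();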
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
fs/nfs/inode.c | 2 +-
fs/nfs/write.c | 10 ++++++++++
2 files changed, 11 insertions(+), 1 deletion(-)
Index: linux-2.6/fs/nfs/inode.c
===================================================================
--- linux-2.6.orig/fs/nfs/inode.c
+++ linux-2.6/fs/nfs/inode.c
@@ -1172,7 +1172,7 @@ static void init_once(struct kmem_cache
INIT_LIST_HEAD(&nfsi->open_files);
INIT_LIST_HEAD(&nfsi->access_cache_entry_lru);
INIT_LIST_HEAD(&nfsi->access_cache_inode_lru);
- INIT_RADIX_TREE(&nfsi->nfs_page_tree, GFP_ATOMIC);
+ INIT_RADIX_TREE(&nfsi->nfs_page_tree, GFP_ATOMIC|__GFP_NOWARN);
nfsi->ncommit = 0;
nfsi->npages = 0;
nfs4_init_once(nfsi);
Index: linux-2.6/fs/nfs/write.c
===================================================================
--- linux-2.6.orig/fs/nfs/write.c
+++ linux-2.6/fs/nfs/write.c
@@ -652,6 +652,7 @@ static struct nfs_page * nfs_update_requ
struct inode *inode = mapping->host;
struct nfs_page *req, *new = NULL;
pgoff_t rqend, end;
+ int error;
end = offset + bytes;
@@ -659,6 +660,10 @@ static struct nfs_page * nfs_update_requ
/* Loop over all inode entries and see if we find
* A request for the page we wish to update
*/
+ error = radix_tree_preload(GFP_NOIO);
+ if (error)
+ return ERR_PTR(error);
+
spin_lock(&inode->i_lock);
req = nfs_page_find_request_locked(NFS_I(inode), page);
if (req) {
@@ -666,6 +671,7 @@ static struct nfs_page * nfs_update_requ
int error;
spin_unlock(&inode->i_lock);
+ radix_tree_preload_end();
error = nfs_wait_on_request(req);
nfs_release_request(req);
if (error < 0) {
@@ -676,6 +682,7 @@ static struct nfs_page * nfs_update_requ
continue;
}
spin_unlock(&inode->i_lock);
+ radix_tree_preload_end();
if (new)
nfs_release_request(new);
break;
@@ -687,13 +694,16 @@ static struct nfs_page * nfs_update_requ
error = nfs_inode_add_request(inode, new);
if (error) {
spin_unlock(&inode->i_lock);
+ radix_tree_preload_end();
nfs_unlock_request(new);
return ERR_PTR(error);
}
spin_unlock(&inode->i_lock);
+ radix_tree_preload_end();
return new;
}
spin_unlock(&inode->i_lock);
+ radix_tree_preload_end();
new = nfs_create_request(ctx, inode, page, offset, bytes);
if (IS_ERR(new))
--
* Re: [PATCH 23/33] netvm: skb processing
2007-10-30 16:04 ` [PATCH 23/33] netvm: skb processing Peter Zijlstra
2007-10-30 21:26 ` Stephen Hemminger
@ 2007-10-30 21:26 ` Stephen Hemminger
2007-10-30 21:44 ` Peter Zijlstra
1 sibling, 1 reply; 72+ messages in thread
From: Stephen Hemminger @ 2007-10-30 21:26 UTC (permalink / raw)
To: linux-mm; +Cc: linux-kernel
On Tue, 30 Oct 2007 17:04:24 +0100
Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> In order to make sure emergency packets receive all memory needed to proceed
> ensure processing of emergency SKBs happens under PF_MEMALLOC.
>
> Use the (new) sk_backlog_rcv() wrapper to ensure this for backlog processing.
>
> Skip taps, since those are user-space again.
>
> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> ---
> include/net/sock.h | 5 +++++
> net/core/dev.c | 44 ++++++++++++++++++++++++++++++++++++++------
> net/core/sock.c | 18 ++++++++++++++++++
> 3 files changed, 61 insertions(+), 6 deletions(-)
>
> Index: linux-2.6/net/core/dev.c
> ===================================================================
> --- linux-2.6.orig/net/core/dev.c
> +++ linux-2.6/net/core/dev.c
> @@ -1976,10 +1976,23 @@ int netif_receive_skb(struct sk_buff *sk
> struct net_device *orig_dev;
> int ret = NET_RX_DROP;
> __be16 type;
> + unsigned long pflags = current->flags;
> +
> + /* Emergency skb are special, they should
> + * - be delivered to SOCK_MEMALLOC sockets only
> + * - stay away from userspace
> + * - have bounded memory usage
> + *
> + * Use PF_MEMALLOC as a poor mans memory pool - the grouping kind.
> + * This saves us from propagating the allocation context down to all
> + * allocation sites.
> + */
> + if (skb_emergency(skb))
> + current->flags |= PF_MEMALLOC;
>
> /* if we've gotten here through NAPI, check netpoll */
> if (netpoll_receive_skb(skb))
> - return NET_RX_DROP;
> + goto out;
Why the change? Doesn't gcc optimize the common exit case anyway?
>
> if (!skb->tstamp.tv64)
> net_timestamp(skb);
> @@ -1990,7 +2003,7 @@ int netif_receive_skb(struct sk_buff *sk
> orig_dev = skb_bond(skb);
>
> if (!orig_dev)
> - return NET_RX_DROP;
> + goto out;
>
> __get_cpu_var(netdev_rx_stat).total++;
>
> @@ -2009,6 +2022,9 @@ int netif_receive_skb(struct sk_buff *sk
> }
> #endif
>
> + if (skb_emergency(skb))
> + goto skip_taps;
> +
> list_for_each_entry_rcu(ptype, &ptype_all, list) {
> if (!ptype->dev || ptype->dev == skb->dev) {
> if (pt_prev)
> @@ -2017,6 +2033,7 @@ int netif_receive_skb(struct sk_buff *sk
> }
> }
>
> +skip_taps:
> #ifdef CONFIG_NET_CLS_ACT
> if (pt_prev) {
> ret = deliver_skb(skb, pt_prev, orig_dev);
> @@ -2029,19 +2046,31 @@ int netif_receive_skb(struct sk_buff *sk
>
> if (ret == TC_ACT_SHOT || (ret == TC_ACT_STOLEN)) {
> kfree_skb(skb);
> - goto out;
> + goto unlock;
> }
>
> skb->tc_verd = 0;
> ncls:
> #endif
>
> + if (skb_emergency(skb))
> + switch(skb->protocol) {
> + case __constant_htons(ETH_P_ARP):
> + case __constant_htons(ETH_P_IP):
> + case __constant_htons(ETH_P_IPV6):
> + case __constant_htons(ETH_P_8021Q):
> + break;
Indentation is wrong, and hard-coding protocol values as a special case
seems bad here. What about VLANs, etc?
> + default:
> + goto drop;
> + }
> +
> skb = handle_bridge(skb, &pt_prev, &ret, orig_dev);
> if (!skb)
> - goto out;
> + goto unlock;
> skb = handle_macvlan(skb, &pt_prev, &ret, orig_dev);
> if (!skb)
> - goto out;
> + goto unlock;
>
> type = skb->protocol;
> list_for_each_entry_rcu(ptype, &ptype_base[ntohs(type)&15], list) {
> @@ -2056,6 +2085,7 @@ ncls:
> if (pt_prev) {
> ret = pt_prev->func(skb, skb->dev, pt_prev, orig_dev);
> } else {
> +drop:
> kfree_skb(skb);
> /* Jamal, now you will not able to escape explaining
> * me how you were going to use this. :-)
> @@ -2063,8 +2093,10 @@ ncls:
> ret = NET_RX_DROP;
> }
>
> -out:
> +unlock:
> rcu_read_unlock();
> +out:
> + tsk_restore_flags(current, pflags, PF_MEMALLOC);
> return ret;
> }
>
> Index: linux-2.6/include/net/sock.h
> ===================================================================
> --- linux-2.6.orig/include/net/sock.h
> +++ linux-2.6/include/net/sock.h
> @@ -523,8 +523,13 @@ static inline void sk_add_backlog(struct
> skb->next = NULL;
> }
>
> +extern int __sk_backlog_rcv(struct sock *sk, struct sk_buff *skb);
> +
> static inline int sk_backlog_rcv(struct sock *sk, struct sk_buff *skb)
> {
> + if (skb_emergency(skb))
> + return __sk_backlog_rcv(sk, skb);
> +
> return sk->sk_backlog_rcv(sk, skb);
> }
>
> Index: linux-2.6/net/core/sock.c
> ===================================================================
> --- linux-2.6.orig/net/core/sock.c
> +++ linux-2.6/net/core/sock.c
> @@ -319,6 +319,24 @@ int sk_clear_memalloc(struct sock *sk)
> }
> EXPORT_SYMBOL_GPL(sk_clear_memalloc);
>
> +#ifdef CONFIG_NETVM
> +int __sk_backlog_rcv(struct sock *sk, struct sk_buff *skb)
> +{
> + int ret;
> + unsigned long pflags = current->flags;
> +
> + /* these should have been dropped before queueing */
> + BUG_ON(!sk_has_memalloc(sk));
> +
> + current->flags |= PF_MEMALLOC;
> + ret = sk->sk_backlog_rcv(sk, skb);
> + tsk_restore_flags(current, pflags, PF_MEMALLOC);
> +
> + return ret;
> +}
> +EXPORT_SYMBOL(__sk_backlog_rcv);
> +#endif
> +
> static int sock_set_timeout(long *timeo_p, char __user *optval, int optlen)
> {
> struct timeval tv;
I am still not convinced that this solves the problem well enough
to be useful. Can you really survive a heavy memory overcommit?
In other words, can you prove that the added complexity causes the system
to survive a real test where otherwise it would not?
--
Stephen Hemminger <shemminger@linux-foundation.org>
* Re: [PATCH 23/33] netvm: skb processing
2007-10-30 21:26 ` Stephen Hemminger
@ 2007-10-30 21:44 ` Peter Zijlstra
0 siblings, 0 replies; 72+ messages in thread
From: Peter Zijlstra @ 2007-10-30 21:44 UTC (permalink / raw)
To: Stephen Hemminger
Cc: Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, netdev,
Trond Myklebust
On Tue, 2007-10-30 at 14:26 -0700, Stephen Hemminger wrote:
> On Tue, 30 Oct 2007 17:04:24 +0100
> Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
>
> > In order to make sure emergency packets receive all memory needed to proceed
> > ensure processing of emergency SKBs happens under PF_MEMALLOC.
> >
> > Use the (new) sk_backlog_rcv() wrapper to ensure this for backlog processing.
> >
> > Skip taps, since those are user-space again.
> >
> > Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> > ---
> > include/net/sock.h | 5 +++++
> > net/core/dev.c | 44 ++++++++++++++++++++++++++++++++++++++------
> > net/core/sock.c | 18 ++++++++++++++++++
> > 3 files changed, 61 insertions(+), 6 deletions(-)
> >
> > Index: linux-2.6/net/core/dev.c
> > ===================================================================
> > --- linux-2.6.orig/net/core/dev.c
> > +++ linux-2.6/net/core/dev.c
> > @@ -1976,10 +1976,23 @@ int netif_receive_skb(struct sk_buff *sk
> > struct net_device *orig_dev;
> > int ret = NET_RX_DROP;
> > __be16 type;
> > + unsigned long pflags = current->flags;
> > +
> > + /* Emergency skb are special, they should
> > + * - be delivered to SOCK_MEMALLOC sockets only
> > + * - stay away from userspace
> > + * - have bounded memory usage
> > + *
> > + * Use PF_MEMALLOC as a poor mans memory pool - the grouping kind.
> > + * This saves us from propagating the allocation context down to all
> > + * allocation sites.
> > + */
> > + if (skb_emergency(skb))
> > + current->flags |= PF_MEMALLOC;
> >
> > /* if we've gotten here through NAPI, check netpoll */
> > if (netpoll_receive_skb(skb))
> > - return NET_RX_DROP;
> > + goto out;
>
> Why the change? doesn't gcc optimize the common exit case anyway?
It needs to unset PF_MEMALLOC at the exit.
> > @@ -2029,19 +2046,31 @@ int netif_receive_skb(struct sk_buff *sk
> >
> > if (ret == TC_ACT_SHOT || (ret == TC_ACT_STOLEN)) {
> > kfree_skb(skb);
> > - goto out;
> > + goto unlock;
> > }
> >
> > skb->tc_verd = 0;
> > ncls:
> > #endif
> >
> > + if (skb_emergency(skb))
> > + switch(skb->protocol) {
> > + case __constant_htons(ETH_P_ARP):
> > + case __constant_htons(ETH_P_IP):
> > + case __constant_htons(ETH_P_IPV6):
> > + case __constant_htons(ETH_P_8021Q):
> > + break;
>
> Indentation is wrong, and hard coding protocol values as spcial case
> seems bad here. What about vlan's, etc?
The other protocols need analysis of what memory allocations occur
during packet processing. If anything is done that is not yet accounted
for (skb, route cache), that needs to be added to a reserve; if there
are any paths that could touch user-space, those need to be handled.
I've started looking at a few others, but it's difficult work if one is
not familiar with the protocols.
> > @@ -2063,8 +2093,10 @@ ncls:
> > ret = NET_RX_DROP;
> > }
> >
> > -out:
> > +unlock:
> > rcu_read_unlock();
> > +out:
> > + tsk_restore_flags(current, pflags, PF_MEMALLOC);
> > return ret;
> > }
It's that tsk_restore_flags() there that requires the s/return/goto/
changes you noted earlier.
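[ For reference, the semantics of that helper, sketched; the actual
definition is introduced earlier in the series and may differ in detail: ]
	static inline void tsk_restore_flags(struct task_struct *p,
					     unsigned long pflags,
					     unsigned long mask)
	{
		p->flags &= ~mask;		/* clear the bits we may have set */
		p->flags |= pflags & mask;	/* put back their saved state */
	}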
> I am still not convinced that this solves the problem well enough
> to be useful. Can you really survive a heavy memory overcommit?
On a machine with mem=128M, I've run 4 processes of 64M each: 2 file-backed
with the files on NFS, 2 anonymous. The processes just cycle through the
memory using writes. This is a 100% overcommit.
During these tests I've run various network loads.
I've shut down the NFS server, waited for, say, 15 minutes, and restarted
the NFS server, and the machine came back up and continued.
> In other words, can you prove that the added complexity causes the system
> to survive a real test where otherwise it would not?
I've put some statistics in the skb reserve allocations; those are most
definitely used. I'm quite certain the machine would lock up solid
without them.
* Re: [PATCH 00/33] Swap over NFS -v14
2007-10-30 16:04 [PATCH 00/33] Swap over NFS -v14 Peter Zijlstra
` (32 preceding siblings ...)
2007-10-30 16:04 ` [PATCH 33/33] nfs: do not warn on radix tree node allocation failures Peter Zijlstra
@ 2007-10-31 3:26 ` Nick Piggin
2007-10-31 4:37 ` David Miller
2007-10-31 11:27 ` Peter Zijlstra
33 siblings, 2 replies; 72+ messages in thread
From: Nick Piggin @ 2007-10-31 3:26 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, netdev,
trond.myklebust
On Wednesday 31 October 2007 03:04, Peter Zijlstra wrote:
> Hi,
>
> Another posting of the full swap over NFS series.
Hi,
Is it really worth all the added complexity of making swap
over NFS files work, given that you could use a network block
device instead?
Also, have you ensured that page_file_index, page_file_mapping
and page_offset are only ever used on anonymous pages when the
page is locked? (otherwise PageSwapCache could change)
* Re: [PATCH 03/33] mm: slub: add knowledge of reserve pages
2007-10-30 16:04 ` [PATCH 03/33] mm: slub: add knowledge of reserve pages Peter Zijlstra
@ 2007-10-31 3:37 ` Nick Piggin
2007-10-31 10:42 ` Peter Zijlstra
0 siblings, 1 reply; 72+ messages in thread
From: Nick Piggin @ 2007-10-31 3:37 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, netdev,
trond.myklebust
On Wednesday 31 October 2007 03:04, Peter Zijlstra wrote:
> Restrict objects from reserve slabs (ALLOC_NO_WATERMARKS) to allocation
> contexts that are entitled to it.
>
> Care is taken to only touch the SLUB slow path.
>
> This is done to ensure reserve pages don't leak out and get consumed.
I think this is generally a good idea (to prevent slab allocators
from stealing reserve). However, I naively think the implementation
is a bit overengineered and thus has a few holes.
Humour me: what was the problem with failing the slab allocation
(actually, not failing, but just calling into the page allocator to do
the correct waiting / reclaim) in the slowpath if the process fails the
watermark checks?
* Re: [PATCH 04/33] mm: allow mempool to fall back to memalloc reserves
2007-10-30 16:04 ` [PATCH 04/33] mm: allow mempool to fall back to memalloc reserves Peter Zijlstra
@ 2007-10-31 3:40 ` Nick Piggin
0 siblings, 0 replies; 72+ messages in thread
From: Nick Piggin @ 2007-10-31 3:40 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, netdev,
trond.myklebust
On Wednesday 31 October 2007 03:04, Peter Zijlstra wrote:
> Allow the mempool to use the memalloc reserves when all else fails and
> the allocation context would otherwise allow it.
I don't see what this is for. The whole point of my fix that made this
*not* use the memalloc reserves is that processes which were otherwise
allowed to use those reserves were doing so via the mempool. They should
not.
> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> ---
> mm/mempool.c | 12 +++++++++++-
> 1 file changed, 11 insertions(+), 1 deletion(-)
>
> Index: linux-2.6/mm/mempool.c
> ===================================================================
> --- linux-2.6.orig/mm/mempool.c
> +++ linux-2.6/mm/mempool.c
> @@ -14,6 +14,7 @@
> #include <linux/mempool.h>
> #include <linux/blkdev.h>
> #include <linux/writeback.h>
> +#include "internal.h"
>
> static void add_element(mempool_t *pool, void *element)
> {
> @@ -204,7 +205,7 @@ void * mempool_alloc(mempool_t *pool, gf
> void *element;
> unsigned long flags;
> wait_queue_t wait;
> - gfp_t gfp_temp;
> + gfp_t gfp_temp, gfp_orig = gfp_mask;
>
> might_sleep_if(gfp_mask & __GFP_WAIT);
>
> @@ -228,6 +229,15 @@ repeat_alloc:
> }
> spin_unlock_irqrestore(&pool->lock, flags);
>
> + /* if we really had right to the emergency reserves try those */
> + if (gfp_to_alloc_flags(gfp_orig) & ALLOC_NO_WATERMARKS) {
> + if (gfp_temp & __GFP_NOMEMALLOC) {
> + gfp_temp &= ~(__GFP_NOMEMALLOC|__GFP_NOWARN);
> + goto repeat_alloc;
> + } else
> + gfp_temp |= __GFP_NOMEMALLOC|__GFP_NOWARN;
> + }
> +
> /* We must not sleep in the GFP_ATOMIC case */
> if (!(gfp_mask & __GFP_WAIT))
> return NULL;
>
> --
* Re: [PATCH 05/33] mm: kmem_estimate_pages()
2007-10-30 16:04 ` [PATCH 05/33] mm: kmem_estimate_pages() Peter Zijlstra
@ 2007-10-31 3:43 ` Nick Piggin
2007-10-31 10:42 ` Peter Zijlstra
0 siblings, 1 reply; 72+ messages in thread
From: Nick Piggin @ 2007-10-31 3:43 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, netdev,
trond.myklebust
On Wednesday 31 October 2007 03:04, Peter Zijlstra wrote:
> Provide a method to get the upper bound on the pages needed to allocate
> a given number of objects from a given kmem_cache.
>
Fair enough, but just to make it a bit easier, can you provide a
little reasoning as to why in this patch (or reference the patch number
where you use it, or put it together with the patch where you use
it, etc.)?
Thanks,
* Re: [PATCH 06/33] mm: allow PF_MEMALLOC from softirq context
2007-10-30 16:04 ` [PATCH 06/33] mm: allow PF_MEMALLOC from softirq context Peter Zijlstra
@ 2007-10-31 3:51 ` Nick Piggin
2007-10-31 10:42 ` Peter Zijlstra
0 siblings, 1 reply; 72+ messages in thread
From: Nick Piggin @ 2007-10-31 3:51 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, netdev,
trond.myklebust
On Wednesday 31 October 2007 03:04, Peter Zijlstra wrote:
> Allow PF_MEMALLOC to be set in softirq context. When running softirqs from
> a borrowed context save current->flags, ksoftirqd will have its own
> task_struct.
What's this for? Why would ksoftirqd pick up PF_MEMALLOC? (I guess
that some networking thing must be picking it up in a subsequent patch,
but I'm too lazy to look!)... Again, can you have more of a rationale in
your patch headers, or ref the patch that uses it... thanks
* Re: [PATCH 09/33] mm: system wide ALLOC_NO_WATERMARK
2007-10-30 16:04 ` [PATCH 09/33] mm: system wide ALLOC_NO_WATERMARK Peter Zijlstra
@ 2007-10-31 3:52 ` Nick Piggin
2007-10-31 10:45 ` Peter Zijlstra
0 siblings, 1 reply; 72+ messages in thread
From: Nick Piggin @ 2007-10-31 3:52 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, netdev,
trond.myklebust
On Wednesday 31 October 2007 03:04, Peter Zijlstra wrote:
> Change ALLOC_NO_WATERMARK page allocation such that the reserves are system
> wide - which they are per setup_per_zone_pages_min(), when we scrape the
> barrel, do it properly.
>
IIRC it's actually not too uncommon to have allocations coming here via
page reclaim. It's not exactly clear that you want to break mempolicies
at this point.
> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> ---
> mm/page_alloc.c | 6 ++++++
> 1 file changed, 6 insertions(+)
>
> Index: linux-2.6/mm/page_alloc.c
> ===================================================================
> --- linux-2.6.orig/mm/page_alloc.c
> +++ linux-2.6/mm/page_alloc.c
> @@ -1638,6 +1638,12 @@ restart:
> rebalance:
> if (alloc_flags & ALLOC_NO_WATERMARKS) {
> nofail_alloc:
> + /*
> + * break out of mempolicy boundaries
> + */
> + zonelist = NODE_DATA(numa_node_id())->node_zonelists +
> + gfp_zone(gfp_mask);
> +
> /* go through the zonelist yet again, ignoring mins */
> page = get_page_from_freelist(gfp_mask, order, zonelist,
> ALLOC_NO_WATERMARKS);
>
> --
* Re: [PATCH 00/33] Swap over NFS -v14
2007-10-31 4:37 ` David Miller
@ 2007-10-31 4:04 ` Nick Piggin
2007-10-31 14:03 ` Byron Stanoszek
2007-10-31 8:50 ` Christoph Hellwig
2007-10-31 9:53 ` Peter Zijlstra
2 siblings, 1 reply; 72+ messages in thread
From: Nick Piggin @ 2007-10-31 4:04 UTC (permalink / raw)
To: David Miller
Cc: a.p.zijlstra, torvalds, akpm, linux-kernel, linux-mm, netdev,
trond.myklebust
On Wednesday 31 October 2007 15:37, David Miller wrote:
> From: Nick Piggin <nickpiggin@yahoo.com.au>
> Date: Wed, 31 Oct 2007 14:26:32 +1100
>
> > Is it really worth all the added complexity of making swap
> > over NFS files work, given that you could use a network block
> > device instead?
>
> Don't be misled. Swapping over NFS is just a scarecrow for the
> seemingly real impetus behind these changes which is network storage
> stuff like iSCSI.
Oh, I'm OK with the network reserves stuff (not the actual patch,
which I'm not really qualified to review, but at least the idea
of it...).
And I'm also not against the idea of swap over network as such.
What I question is specifically the change to make swapfiles work
through the filesystem layer (ATM it goes straight to the block layer,
modulo some initialisation stuff which uses block filesystem-specific
calls).
I mean, I assume that anybody trying to swap over network *today*
has to be using a network block device anyway, so the idea of
just being able to transparently improve that case seems better
than adding new complexities for seemingly not much gain.
* Re: [PATCH 00/33] Swap over NFS -v14
2007-10-31 3:26 ` [PATCH 00/33] Swap over NFS -v14 Nick Piggin
2007-10-31 4:37 ` David Miller
2007-10-31 4:04 ` Nick Piggin
` (2 more replies)
2007-10-31 11:27 ` Peter Zijlstra
1 sibling, 3 replies; 72+ messages in thread
From: David Miller @ 2007-10-31 4:37 UTC (permalink / raw)
To: nickpiggin
Cc: a.p.zijlstra, torvalds, akpm, linux-kernel, linux-mm, netdev,
trond.myklebust
> Is it really worth all the added complexity of making swap
> over NFS files work, given that you could use a network block
> device instead?
Don't be misled. Swapping over NFS is just a scarecrow for the
seemingly real impetus behind these changes which is network storage
stuff like iSCSI.
* Re: [PATCH 00/33] Swap over NFS -v14
2007-10-31 4:37 ` David Miller
2007-10-31 4:04 ` Nick Piggin
@ 2007-10-31 8:50 ` Christoph Hellwig
2007-10-31 10:56 ` Peter Zijlstra
2007-10-31 9:53 ` Peter Zijlstra
2 siblings, 1 reply; 72+ messages in thread
From: Christoph Hellwig @ 2007-10-31 8:50 UTC (permalink / raw)
To: David Miller
Cc: nickpiggin, a.p.zijlstra, torvalds, akpm, linux-kernel, linux-mm,
netdev, trond.myklebust
On Tue, Oct 30, 2007 at 09:37:53PM -0700, David Miller wrote:
> Don't be misled. Swapping over NFS is just a scarecrow for the
> seemingly real impetus behind these changes which is network storage
> stuff like iSCSI.
So can we please do swap over network storage only first? All these
VM bits look conceptually sane to me, while the changes to the swap
code to support nfs are real crackpipe material. Then again doing
that part properly by adding address_space methods for swap I/O without
the abuse might be a really good idea, especially as the way we
do swapfiles on block-based filesystems is an horrible hack already.
So please get the VM bits for swap over network blockdevices in first,
and then we can look into a complete revamp of the swapfile support
that cleans up the current mess and adds support for nfs instead of
making the mess even worse.
* Re: [PATCH 28/33] nfs: teach the NFS client how to treat PG_swapcache pages
2007-10-30 16:04 ` [PATCH 28/33] nfs: teach the NFS client how to treat PG_swapcache pages Peter Zijlstra
@ 2007-10-31 8:52 ` Christoph Hellwig
0 siblings, 0 replies; 72+ messages in thread
From: Christoph Hellwig @ 2007-10-31 8:52 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, netdev,
trond.myklebust
On Tue, Oct 30, 2007 at 05:04:29PM +0100, Peter Zijlstra wrote:
> Replace all relevant occurences of page->index and page->mapping in the NFS
> client with the new page_file_index() and page_file_mapping() functions.
As discussed personally and on the list a strong NACK for this. Swapcache
pages have no business at all ever coming through ->writepage(s). If you
really want to support swap over NFS, that can only be done properly by
adding separate methods to write out and read in pages separated from the
pagecache. Incidentally that would also clean up the mess we have with
swap files on "normal" filesystems using ->bmap and bypassing the filesystem
later on.
* Re: [PATCH 00/33] Swap over NFS -v14
2007-10-31 4:37 ` David Miller
2007-10-31 4:04 ` Nick Piggin
2007-10-31 8:50 ` Christoph Hellwig
@ 2007-10-31 9:53 ` Peter Zijlstra
2 siblings, 0 replies; 72+ messages in thread
From: Peter Zijlstra @ 2007-10-31 9:53 UTC (permalink / raw)
To: David Miller
Cc: nickpiggin, torvalds, akpm, linux-kernel, linux-mm, netdev,
trond.myklebust
On Tue, 2007-10-30 at 21:37 -0700, David Miller wrote:
> From: Nick Piggin <nickpiggin@yahoo.com.au>
> Date: Wed, 31 Oct 2007 14:26:32 +1100
>
> > Is it really worth all the added complexity of making swap
> > over NFS files work, given that you could use a network block
> > device instead?
>
> Don't be misled. Swapping over NFS is just a scarecrow for the
> seemingly real impetus behind these changes which is network storage
> stuff like iSCSI.
Not quite: yes, iSCSI is also on the 'want' list of quite a few people,
but swap over NFS on its own is also a feature in great demand.
* Re: [PATCH 03/33] mm: slub: add knowledge of reserve pages
2007-10-31 3:37 ` Nick Piggin
@ 2007-10-31 10:42 ` Peter Zijlstra
2007-10-31 10:46 ` Nick Piggin
0 siblings, 1 reply; 72+ messages in thread
From: Peter Zijlstra @ 2007-10-31 10:42 UTC (permalink / raw)
To: Nick Piggin
Cc: Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, netdev,
trond.myklebust
On Wed, 2007-10-31 at 14:37 +1100, Nick Piggin wrote:
> On Wednesday 31 October 2007 03:04, Peter Zijlstra wrote:
> > Restrict objects from reserve slabs (ALLOC_NO_WATERMARKS) to allocation
> > contexts that are entitled to it.
> >
> > Care is taken to only touch the SLUB slow path.
> >
> > This is done to ensure reserve pages don't leak out and get consumed.
>
> I think this is generally a good idea (to prevent slab allocators
> from stealing reserve). However I naively think the implementation
> is a bit overengineered and thus has a few holes.
>
> Humour me, what was the problem with failing the slab allocation
> (actually, not fail but just call into the page allocator to do
> correct waiting / reclaim) in the slowpath if the process fails the
> watermark checks?
Ah, we actually need slabs below the watermarks. It's just that once I've
allocated those slabs using __GFP_MEMALLOC/PF_MEMALLOC, I don't want
allocation contexts that do not have rights to those pages to walk off
with objects.
So, this generic reserve framework still uses the slab allocator to
provide certain kinds of objects (kmalloc, kmem_cache_alloc); it just
separates the contexts that are and are not entitled to the reserves.
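[ In sketch form (names here are illustrative, not the patch's exact
code): ]
	/* SLUB slow path: a slab page allocated below the watermarks is
	 * tagged as a reserve page, and objects from it are only handed to
	 * contexts that would themselves pass the watermark checks */
	if (page_is_reserve(page) &&
	    !(gfp_to_alloc_flags(gfpflags) & ALLOC_NO_WATERMARKS))
		goto try_new_slab;	/* don't let this context drain the reserve */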
* Re: [PATCH 05/33] mm: kmem_estimate_pages()
2007-10-31 3:43 ` Nick Piggin
@ 2007-10-31 10:42 ` Peter Zijlstra
0 siblings, 0 replies; 72+ messages in thread
From: Peter Zijlstra @ 2007-10-31 10:42 UTC (permalink / raw)
To: Nick Piggin
Cc: Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, netdev,
trond.myklebust
On Wed, 2007-10-31 at 14:43 +1100, Nick Piggin wrote:
> On Wednesday 31 October 2007 03:04, Peter Zijlstra wrote:
> > Provide a method to get the upper bound on the pages needed to allocate
> > a given number of objects from a given kmem_cache.
> >
>
> Fair enough, but just to make it a bit easier, can you provide a
> little reason of why in this patch (or reference the patch number
> where you use it, or put it together with the patch where you use
> it, etc.).
A generic reserve framework, as seen in patch 11/33, needs to be able
to convert from an object demand (kmalloc() bytes, kmem_cache_alloc()
objects) to a page reserve.
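[ Concretely, the kind of conversion meant here, using the estimator the
way patch 31 uses it: ]
	/* upper bound on the pages needed to back RPC_MAX_SLOT_TABLE
	 * rpc_rqst objects; this is the number that gets added to the
	 * page reserve */
	int pages = kestimate_single(sizeof(struct rpc_rqst),
				     GFP_KERNEL, RPC_MAX_SLOT_TABLE);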
* Re: [PATCH 06/33] mm: allow PF_MEMALLOC from softirq context
2007-10-31 3:51 ` Nick Piggin
@ 2007-10-31 10:42 ` Peter Zijlstra
2007-10-31 10:49 ` Nick Piggin
0 siblings, 1 reply; 72+ messages in thread
From: Peter Zijlstra @ 2007-10-31 10:42 UTC (permalink / raw)
To: Nick Piggin
Cc: Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, netdev,
trond.myklebust
On Wed, 2007-10-31 at 14:51 +1100, Nick Piggin wrote:
> On Wednesday 31 October 2007 03:04, Peter Zijlstra wrote:
> > Allow PF_MEMALLOC to be set in softirq context. When running softirqs from
> > a borrowed context save current->flags, ksoftirqd will have its own
> > task_struct.
>
>
> What's this for? Why would ksoftirqd pick up PF_MEMALLOC? (I guess
> that some networking thing must be picking it up in a subsequent patch,
> but I'm too lazy to look!)... Again, can you have more of a rationale in
> your patch headers, or ref the patch that uses it... thanks
Right, I knew I was forgetting something in these changelogs.
The network stack does quite a bit of packet processing from softirq
context. Once you start swapping over network, some of the packets want
to be processed under PF_MEMALLOC.
See patch 23/33.
* Re: [PATCH 09/33] mm: system wide ALLOC_NO_WATERMARK
2007-10-31 3:52 ` Nick Piggin
@ 2007-10-31 10:45 ` Peter Zijlstra
0 siblings, 0 replies; 72+ messages in thread
From: Peter Zijlstra @ 2007-10-31 10:45 UTC (permalink / raw)
To: Nick Piggin
Cc: Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, netdev,
trond.myklebust
On Wed, 2007-10-31 at 14:52 +1100, Nick Piggin wrote:
> On Wednesday 31 October 2007 03:04, Peter Zijlstra wrote:
> > Change ALLOC_NO_WATERMARK page allocation such that the reserves are system
> > wide - which they are per setup_per_zone_pages_min(), when we scrape the
> > barrel, do it properly.
> >
>
> IIRC it's actually not too uncommon to have allocations coming here via
> page reclaim. It's not exactly clear that you want to break mempolicies
> at this point.
Hmm, the way I see it is that mempolicies are mainly for user-space
allocations, reserve allocations are always kernel allocations. These
already break mempolicies - for example hardirq context allocations.
Also, as it stands, the reserve is spread out evenly over all
zones/nodes (excluding highmem), so by restricting ourselves to a
subset, we don't have access to the full reserve.
* Re: [PATCH 03/33] mm: slub: add knowledge of reserve pages
2007-10-31 10:42 ` Peter Zijlstra
@ 2007-10-31 10:46 ` Nick Piggin
2007-10-31 12:17 ` Peter Zijlstra
0 siblings, 1 reply; 72+ messages in thread
From: Nick Piggin @ 2007-10-31 10:46 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, netdev,
trond.myklebust
On Wednesday 31 October 2007 21:42, Peter Zijlstra wrote:
> On Wed, 2007-10-31 at 14:37 +1100, Nick Piggin wrote:
> > On Wednesday 31 October 2007 03:04, Peter Zijlstra wrote:
> > > Restrict objects from reserve slabs (ALLOC_NO_WATERMARKS) to allocation
> > > contexts that are entitled to it.
> > >
> > > Care is taken to only touch the SLUB slow path.
> > >
> > > This is done to ensure reserve pages don't leak out and get consumed.
> >
> > I think this is generally a good idea (to prevent slab allocators
> > from stealing reserve). However I naively think the implementation
> > is a bit overengineered and thus has a few holes.
> >
> > Humour me, what was the problem with failing the slab allocation
> > (actually, not fail but just call into the page allocator to do
> > correct waiting / reclaim) in the slowpath if the process fails the
> > watermark checks?
>
> Ah, we actually need slabs below the watermarks.
Right, I'd still allow those guys to allocate slabs. Provided they
have the right allocation context, right?
> It's just that once I've
> allocated those slabs using __GFP_MEMALLOC/PF_MEMALLOC, I don't want
> allocation contexts that do not have rights to those pages to walk off
> with objects.
And I'd prevent these ones from doing so.
Without keeping track of "reserve" pages, which doesn't feel
too clean.
* Re: [PATCH 06/33] mm: allow PF_MEMALLOC from softirq context
2007-10-31 10:42 ` Peter Zijlstra
@ 2007-10-31 10:49 ` Nick Piggin
2007-10-31 13:06 ` Peter Zijlstra
0 siblings, 1 reply; 72+ messages in thread
From: Nick Piggin @ 2007-10-31 10:49 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, netdev,
trond.myklebust
On Wednesday 31 October 2007 21:42, Peter Zijlstra wrote:
> On Wed, 2007-10-31 at 14:51 +1100, Nick Piggin wrote:
> > On Wednesday 31 October 2007 03:04, Peter Zijlstra wrote:
> > > Allow PF_MEMALLOC to be set in softirq context. When running softirqs
> > > from a borrowed context save current->flags, ksoftirqd will have its
> > > own task_struct.
> >
> > What's this for? Why would ksoftirqd pick up PF_MEMALLOC? (I guess
> > that some networking thing must be picking it up in a subsequent patch,
> > but I'm too lazy to look!)... Again, can you have more of a rationale in
> > your patch headers, or ref the patch that uses it... thanks
>
> Right, I knew I was forgetting something in these changelogs.
>
> The network stack does quite a bit of packet processing from softirq
> context. Once you start swapping over network, some of the packets want
> to be processed under PF_MEMALLOC.
Hmm... what about processing from interrupt context?
* Re: [PATCH 00/33] Swap over NFS -v14
2007-10-31 8:50 ` Christoph Hellwig
@ 2007-10-31 10:56 ` Peter Zijlstra
2007-10-31 11:18 ` NBD was " Pavel Machek
2007-10-31 14:54 ` Mike Snitzer
0 siblings, 2 replies; 72+ messages in thread
From: Peter Zijlstra @ 2007-10-31 10:56 UTC (permalink / raw)
To: Christoph Hellwig
Cc: David Miller, nickpiggin, torvalds, akpm, linux-kernel, linux-mm,
netdev, trond.myklebust
[-- Attachment #1: Type: text/plain, Size: 1657 bytes --]
On Wed, 2007-10-31 at 08:50 +0000, Christoph Hellwig wrote:
> On Tue, Oct 30, 2007 at 09:37:53PM -0700, David Miller wrote:
> > Don't be misled. Swapping over NFS is just a scarecrow for the
> > seemingly real impetus behind these changes which is network storage
> > stuff like iSCSI.
>
> So can we please do swap over network storage only first? All these
> VM bits look conceptually sane to me, while the changes to the swap
> code to support nfs are real crackpipe material.
Yeah, I know how you stand on that. I just wanted to post all this
before going off into the woods reworking it all.
> Then again doing
> that part properly by adding address_space methods for swap I/O without
> the abuse might be a really good idea, especially as the way we
> do swapfiles on block-based filesystems is a horrible hack already.
Is planned. What do you think of the proposed a_ops extension to
accomplish this? That is,
->swapfile() - is this address space willing to back swap
->swapout() - write out a page
->swapin() - read in a page
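Roughly, such an extension could look like this (a sketch only; the
method names are from the proposal above, the signatures are invented):

	struct address_space_operations {
		/* ... existing methods elided ... */

		/* is this address space willing to back swap? */
		int (*swapfile)(struct address_space *mapping, int enable);

		/* write out / read in a single swapcache page */
		int (*swapout)(struct address_space *mapping,
			       struct page *page,
			       struct writeback_control *wbc);
		int (*swapin)(struct address_space *mapping,
			      struct page *page);
	};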
> So please get the VM bits for swap over network blockdevices in first,
Trouble with that part is that we don't have any sane network block
devices atm, NBD is utter crap, and iSCSI is too complex to be called
sane.
Maybe Evgeniy's Distributed storage thingy would work, will have a look
at that.
> and then we can look into a complete revamp of the swapfile support
> that cleans up the current mess and adds support for nfs instead of
> making the mess even worse.
Sure, concrete suggestions are always welcome. Just being told something
is utter crap only goes so far.
[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 189 bytes --]
^ permalink raw reply [flat|nested] 72+ messages in thread
* NBD was Re: [PATCH 00/33] Swap over NFS -v14
2007-10-31 10:56 ` Peter Zijlstra
@ 2007-10-31 11:18 ` Pavel Machek
2007-10-31 11:24 ` Peter Zijlstra
2007-10-31 14:54 ` Mike Snitzer
1 sibling, 1 reply; 72+ messages in thread
From: Pavel Machek @ 2007-10-31 11:18 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Christoph Hellwig, David Miller, nickpiggin, torvalds, akpm,
linux-kernel, linux-mm, netdev, trond.myklebust
Hi!
> > So please get the VM bits for swap over network blockdevices in first,
>
> Trouble with that part is that we don't have any sane network block
> devices atm, NBD is utter crap, and iSCSI is too complex to be called
> sane.
Hey, NBD was designed to be _simple_. And I think it works okay in
that area.. so can you elaborate on "utter crap"? [Ok, performance is
not great.]
Plus, I'd suggest you look at ata-over-ethernet. It is in tree
today, quite simple, but should have better performance than nbd.
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: NBD was Re: [PATCH 00/33] Swap over NFS -v14
2007-10-31 11:18 ` NBD was " Pavel Machek
@ 2007-10-31 11:24 ` Peter Zijlstra
0 siblings, 0 replies; 72+ messages in thread
From: Peter Zijlstra @ 2007-10-31 11:24 UTC (permalink / raw)
To: Pavel Machek
Cc: Christoph Hellwig, David Miller, nickpiggin, torvalds, akpm,
linux-kernel, linux-mm, netdev, trond.myklebust, Jens Axboe
[-- Attachment #1: Type: text/plain, Size: 1202 bytes --]
On Wed, 2007-10-31 at 12:18 +0100, Pavel Machek wrote:
> Hi!
>
> > > So please get the VM bits for swap over network blockdevices in first,
> >
> > Trouble with that part is that we don't have any sane network block
> > devices atm, NBD is utter crap, and iSCSI is too complex to be called
> > sane.
>
> Hey, NBD was designed to be _simple_. And I think it works okay in
> that area.. so can you elaborate on "utter crap"? [Ok, performance is
> not great.]
Yeah, sorry, perhaps I was overly strong.
It doesn't work for me, because:
- it does connection management in user-space, which makes it
impossible to reconnect. I'd want a full kernel based client.
- it had some plugging issues, and after talking to Jens about it
he suggested a rewrite using ->make_request() ala AoE. [ sorry if
I'm short on details here, it was a long time ago, and I
forgot, maybe Jens remembers ]
> Plus, I'd suggest you look at ata-over-ethernet. It is in tree
> today, quite simple, but should have better performance than nbd.
Ah, right, I keep forgetting about that one. The only drawback to that
one is that it's raw ethernet, and not some IP protocol.
[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 189 bytes --]
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [PATCH 03/33] mm: slub: add knowledge of reserve pages
2007-10-31 12:17 ` Peter Zijlstra
@ 2007-10-31 11:25 ` Nick Piggin
2007-10-31 12:54 ` Peter Zijlstra
0 siblings, 1 reply; 72+ messages in thread
From: Nick Piggin @ 2007-10-31 11:25 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, netdev,
trond.myklebust
On Wednesday 31 October 2007 23:17, Peter Zijlstra wrote:
> On Wed, 2007-10-31 at 21:46 +1100, Nick Piggin wrote:
> > And I'd prevent these ones from doing so.
> >
> > Without keeping track of "reserve" pages, which doesn't feel
> > too clean.
>
> The problem with that is that once a slab was allocated with the right
> allocation context, anybody can get objects from these slabs.
[snip]
I understand that.
> So we either reserve a page per object, which for 32 byte objects is a
> large waste, or we stop anybody who doesn't have the right permissions
> from obtaining objects. I took the latter approach.
What I'm saying is that the slab allocator slowpath should always
just check watermarks against the current task. Instead of this
->reserve stuff.
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [PATCH 00/33] Swap over NFS -v14
2007-10-31 3:26 ` [PATCH 00/33] Swap over NFS -v14 Nick Piggin
2007-10-31 4:37 ` David Miller, Nick Piggin
@ 2007-10-31 11:27 ` Peter Zijlstra
2007-10-31 12:16 ` Jeff Garzik
1 sibling, 1 reply; 72+ messages in thread
From: Peter Zijlstra @ 2007-10-31 11:27 UTC (permalink / raw)
To: Nick Piggin
Cc: Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, netdev,
trond.myklebust
[-- Attachment #1: Type: text/plain, Size: 968 bytes --]
On Wed, 2007-10-31 at 14:26 +1100, Nick Piggin wrote:
> On Wednesday 31 October 2007 03:04, Peter Zijlstra wrote:
> > Hi,
> >
> > Another posting of the full swap over NFS series.
>
> Hi,
>
> Is it really worth all the added complexity of making swap
> over NFS files work, given that you could use a network block
> device instead?
As it stands, we don't have a usable network block device IMHO.
NFS is by far the most used and usable network storage solution out
there, anybody with half a brain knows how to set it up and use it.
> Also, have you ensured that page_file_index, page_file_mapping
> and page_offset are only ever used on anonymous pages when the
> page is locked? (otherwise PageSwapCache could change)
Good point. I hope so: both ->readpage() and ->writepage() take a locked
page; I'd have to check whether it remains locked throughout the NFS call
chain.
Then again, it might become obsolete with the extended swap a_ops.
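For reference, the helpers in question would have roughly this shape
(a sketch; the __page_file_*() swap-entry lookups are illustrative
stand-ins, not code from this series). The page lock is what keeps
PageSwapCache() stable across the test:

	static inline struct address_space *page_file_mapping(struct page *page)
	{
		if (unlikely(PageSwapCache(page)))
			return __page_file_mapping(page);	/* stand-in */
		return page->mapping;
	}

	static inline pgoff_t page_file_index(struct page *page)
	{
		if (unlikely(PageSwapCache(page)))
			return __page_file_index(page);		/* stand-in */
		return page->index;
	}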
[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 189 bytes --]
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [PATCH 00/33] Swap over NFS -v14
2007-10-31 11:27 ` Peter Zijlstra
@ 2007-10-31 12:16 ` Jeff Garzik
2007-10-31 12:56 ` Peter Zijlstra
0 siblings, 1 reply; 72+ messages in thread
From: Jeff Garzik @ 2007-10-31 12:16 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Nick Piggin, Linus Torvalds, Andrew Morton, linux-kernel,
linux-mm, netdev, trond.myklebust
1) I absolutely agree that NFS is far more prominent and useful than any
network block device, at the present time.
2) Nonetheless, swap over NFS is a pretty rare case. I view this work
as interesting, but I really don't see a huge need, for swapping over
NBD or swapping over NFS. I tend to think swapping to a remote resource
starts to approach "migration" rather than merely swapping. Yes, we can
do it... but given the lack of burning need one must examine the price.
3) You note
> Swap over network has the problem that the network subsystem does not use fixed
> sized allocations, but heavily relies on kmalloc(). This makes mempools
> unusable.
True, but IMO there are mitigating factors that should be researched and
taken into account:
a) To give you some net driver background/history, most mainstream net
drivers were coded to allocate RX skbs of size 1538, under the theory
that they would all be allocating out of the same underlying slab cache.
It would not be difficult to update a great many of the [non-jumbo]
cases to create a fixed size allocation pattern (sketched below).
b) Spare-time experiments and anecdotal evidence points to RX and TX skb
recycling as a potentially valuable area of research. If you are able
to do something like that, then memory suddenly becomes a lot more
bounded and predictable.
So my gut feeling is that taking a hard look at how net drivers function
in the field should give you a lot of good ideas that approach the
shared goal of making network memory allocations more predictable and
bounded.
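As a purely illustrative example of the fixed-size pattern from (a) --
the ring helpers here are invented, only the constant size matters:

	#define RX_SKB_SIZE	1538	/* one slab size for all RX skbs */

	static void example_rx_refill(struct net_device *dev)
	{
		struct sk_buff *skb;

		while (rx_ring_has_room(dev)) {		/* invented helper */
			skb = netdev_alloc_skb(dev, RX_SKB_SIZE);
			if (!skb)
				break;	/* retry later, e.g. from a timer */
			rx_ring_post(dev, skb);		/* invented helper */
		}
	}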
Jeff
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [PATCH 03/33] mm: slub: add knowledge of reserve pages
2007-10-31 10:46 ` Nick Piggin
@ 2007-10-31 12:17 ` Peter Zijlstra
2007-10-31 11:25 ` Nick Piggin
0 siblings, 1 reply; 72+ messages in thread
From: Peter Zijlstra @ 2007-10-31 12:17 UTC (permalink / raw)
To: Nick Piggin
Cc: Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, netdev,
trond.myklebust
[-- Attachment #1: Type: text/plain, Size: 2278 bytes --]
On Wed, 2007-10-31 at 21:46 +1100, Nick Piggin wrote:
> On Wednesday 31 October 2007 21:42, Peter Zijlstra wrote:
> > On Wed, 2007-10-31 at 14:37 +1100, Nick Piggin wrote:
> > > On Wednesday 31 October 2007 03:04, Peter Zijlstra wrote:
> > > > Restrict objects from reserve slabs (ALLOC_NO_WATERMARKS) to allocation
> > > > contexts that are entitled to it.
> > > >
> > > > Care is taken to only touch the SLUB slow path.
> > > >
> > > > This is done to ensure reserve pages don't leak out and get consumed.
> > >
> > > I think this is generally a good idea (to prevent slab allocators
> > > from stealing reserve). However I naively think the implementation
> > > is a bit overengineered and thus has a few holes.
> > >
> > > Humour me, what was the problem with failing the slab allocation
> > > (actually, not fail but just call into the page allocator to do
> > > correct waiting / reclaim) in the slowpath if the process fails the
> > > watermark checks?
> >
> > Ah, we actually need slabs below the watermarks.
>
> Right, I'd still allow those guys to allocate slabs. Provided they
> have the right allocation context, right?
>
>
> > It's just that once I
> > allocated those slabs using __GFP_MEMALLOC/PF_MEMALLOC I don't want
> > allocation contexts that do not have rights to those pages to walk off
> > with objects.
>
> And I'd prevent these ones from doing so.
>
> Without keeping track of "reserve" pages, which doesn't feel
> too clean.
The problem with that is that once a slab was allocated with the right
allocation context, anybody can get objects from these slabs.
low memory, and empty slab:

  task A                              task B

  kmem_cache_alloc() = NULL
                                      current->flags |= PF_MEMALLOC
                                      kmem_cache_alloc() = obj
                                        (slab != NULL)
  kmem_cache_alloc() = obj
  kmem_cache_alloc() = obj
  kmem_cache_alloc() = obj
And now task A, who doesn't have the right permissions, walks
away with all our reserve memory.
So we either reserve a page per object, which for 32 byte objects is a
large waste, or we stop anybody who doesn't have the right permissions
from obtaining objects. I took the latter approach.
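In rough pseudo-code, the gate in the SLUB slow path looks something
like this (a simplified sketch of the approach, not the literal patch;
the slab helpers are invented):

	static void *slub_slowpath_alloc(struct kmem_cache *s, gfp_t gfpflags)
	{
		struct page *page = get_slab(s, gfpflags);	/* invented */

		if (page && page->reserve &&
		    !(gfp_to_alloc_flags(gfpflags) & ALLOC_NO_WATERMARKS))
			return NULL;	/* no right to reserve objects */

		return page ? first_free_object(page) : NULL;	/* invented */
	}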
[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 189 bytes --]
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [PATCH 03/33] mm: slub: add knowledge of reserve pages
2007-10-31 11:25 ` Nick Piggin
@ 2007-10-31 12:54 ` Peter Zijlstra
2007-10-31 13:08 ` Peter Zijlstra
0 siblings, 1 reply; 72+ messages in thread
From: Peter Zijlstra @ 2007-10-31 12:54 UTC (permalink / raw)
To: Nick Piggin
Cc: Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, netdev,
trond.myklebust
[-- Attachment #1: Type: text/plain, Size: 1028 bytes --]
On Wed, 2007-10-31 at 22:25 +1100, Nick Piggin wrote:
> On Wednesday 31 October 2007 23:17, Peter Zijlstra wrote:
> > On Wed, 2007-10-31 at 21:46 +1100, Nick Piggin wrote:
>
> > > And I'd prevent these ones from doing so.
> > >
> > > Without keeping track of "reserve" pages, which doesn't feel
> > > too clean.
> >
> > The problem with that is that once a slab was allocated with the right
> > allocation context, anybody can get objects from these slabs.
>
> [snip]
>
> I understand that.
>
>
> > So we either reserve a page per object, which for 32 byte objects is a
> > large waste, or we stop anybody who doesn't have the right permissions
> > from obtaining objects. I took the latter approach.
>
> What I'm saying is that the slab allocator slowpath should always
> just check watermarks against the current task. Instead of this
> ->reserve stuff.
So what you say is to allocate a slab every time we take the slow path,
even when we already have one?
That sounds rather sub-optimal.
[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 189 bytes --]
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [PATCH 00/33] Swap over NFS -v14
2007-10-31 12:16 ` Jeff Garzik
@ 2007-10-31 12:56 ` Peter Zijlstra
2007-10-31 13:18 ` Arnaldo Carvalho de Melo
` (3 more replies)
0 siblings, 4 replies; 72+ messages in thread
From: Peter Zijlstra @ 2007-10-31 12:56 UTC (permalink / raw)
To: Jeff Garzik
Cc: Nick Piggin, Linus Torvalds, Andrew Morton, linux-kernel,
linux-mm, netdev, trond.myklebust
[-- Attachment #1: Type: text/plain, Size: 2911 bytes --]
On Wed, 2007-10-31 at 08:16 -0400, Jeff Garzik wrote:
> Thoughts:
>
> 1) I absolutely agree that NFS is far more prominent and useful than any
> network block device, at the present time.
>
>
> 2) Nonetheless, swap over NFS is a pretty rare case. I view this work
> as interesting, but I really don't see a huge need, for swapping over
> NBD or swapping over NFS. I tend to think swapping to a remote resource
> starts to approach "migration" rather than merely swapping. Yes, we can
> do it... but given the lack of burning need one must examine the price.
There is a large corporate demand for this, which is why I'm doing this.
The typical usage scenarios are:
- cluster/blades, where having local disks is a cost issue (maintenance
of failures, heat, etc)
- virtualisation, where dumping the storage on a networked storage unit
makes for trivial migration and what not..
But please, people who want this (I'm sure some of you are reading) do
speak up. I'm just the motivated corporate drone implementing the
feature :-)
> 3) You note
> > Swap over network has the problem that the network subsystem does not use fixed
> > sized allocations, but heavily relies on kmalloc(). This makes mempools
> > unusable.
>
> True, but IMO there are mitigating factors that should be researched and
> taken into account:
>
> a) To give you some net driver background/history, most mainstream net
> drivers were coded to allocate RX skbs of size 1538, under the theory
> that they would all be allocating out of the same underlying slab cache.
> It would not be difficult to update a great many of the [non-jumbo]
> cases to create a fixed size allocation pattern.
One issue that comes to mind is how to ensure we'd still overflow the
IP-reassembly buffers. Currently those are managed on the number of
bytes present, not the number of fragments.
One of the goals of my approach was to not rewrite the network subsystem
to accommodate this feature (and I hope I succeeded).
> b) Spare-time experiments and anecdotal evidence points to RX and TX skb
> recycling as a potentially valuable area of research. If you are able
> to do something like that, then memory suddenly becomes a lot more
> bounded and predictable.
>
>
> So my gut feeling is that taking a hard look at how net drivers function
> in the field should give you a lot of good ideas that approach the
> shared goal of making network memory allocations more predictable and
> bounded.
Note that being bounded only comes from dropping most packets before
tying them to a socket. That is the crucial part of the RX path: to
receive all packets from the NIC (regardless of their size) but to not pass
them on to the network stack - unless they belong to a 'special' socket
that promises undelayed processing.
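Something like the following, conceptually (a sketch only;
skb_emergency() and sk_has_memalloc() are illustrative names):

	static int netvm_deliver(struct sock *sk, struct sk_buff *skb)
	{
		if (skb_emergency(skb) && !sk_has_memalloc(sk)) {
			/* the NIC would have dropped it anyway */
			kfree_skb(skb);
			return NET_RX_DROP;
		}
		return sk->sk_backlog_rcv(sk, skb);
	}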
Thanks for these ideas, I'll look into them.
[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 189 bytes --]
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [PATCH 06/33] mm: allow PF_MEMALLOC from softirq context
2007-10-31 10:49 ` Nick Piggin
@ 2007-10-31 13:06 ` Peter Zijlstra
0 siblings, 0 replies; 72+ messages in thread
From: Peter Zijlstra @ 2007-10-31 13:06 UTC (permalink / raw)
To: Nick Piggin
Cc: Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, netdev,
trond.myklebust
[-- Attachment #1: Type: text/plain, Size: 1195 bytes --]
On Wed, 2007-10-31 at 21:49 +1100, Nick Piggin wrote:
> On Wednesday 31 October 2007 21:42, Peter Zijlstra wrote:
> > On Wed, 2007-10-31 at 14:51 +1100, Nick Piggin wrote:
> > > On Wednesday 31 October 2007 03:04, Peter Zijlstra wrote:
> > > > Allow PF_MEMALLOC to be set in softirq context. When running softirqs
> > > > from a borrowed context, save current->flags; ksoftirqd will have its
> > > > own task_struct.
> > >
> > > What's this for? Why would ksoftirqd pick up PF_MEMALLOC? (I guess
> > > that some networking thing must be picking it up in a subsequent patch,
> > > but I'm too lazy to look!)... Again, can you have more of a rationale in
> > > your patch headers, or ref the patch that uses it... thanks
> >
> > Right, I knew I was forgetting something in these changelogs.
> >
> > The network stack does quite a bit of packet processing from softirq
> > context. Once you start swapping over network, some of the packets want
> > to be processed under PF_MEMALLOC.
>
> Hmm... what about processing from interrupt context?
From what I could tell that is not done; the ISR just fills the skb and
sticks it on an RX queue to be further processed by the softirq.
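The borrowed-context handling then amounts to something like this (a
sketch, not the patch itself; the real do_softirq() has more to do):

	asmlinkage void do_softirq(void)
	{
		unsigned long pflags = current->flags;

		__do_softirq();		/* handlers may set PF_MEMALLOC */

		/* don't leak PF_MEMALLOC into the interrupted task */
		current->flags = (current->flags & ~PF_MEMALLOC) |
				 (pflags & PF_MEMALLOC);
	}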
[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 189 bytes --]
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [PATCH 03/33] mm: slub: add knowledge of reserve pages
2007-10-31 12:54 ` Peter Zijlstra
@ 2007-10-31 13:08 ` Peter Zijlstra
0 siblings, 0 replies; 72+ messages in thread
From: Peter Zijlstra @ 2007-10-31 13:08 UTC (permalink / raw)
To: Nick Piggin
Cc: Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, netdev,
trond.myklebust
On Wed, 2007-10-31 at 13:54 +0100, Peter Zijlstra wrote:
> On Wed, 2007-10-31 at 22:25 +1100, Nick Piggin wrote:
> > What I'm saying is that the slab allocator slowpath should always
> > just check watermarks against the current task. Instead of this
> > ->reserve stuff.
>
> So what you say is to allocate a slab every time we take the slow path,
> even when we already have one?
BTW, a task that does not have reserve permissions will already attempt
to allocate a new slab - this is done to probe the current watermarks.
If this succeeds, the reserve status is lifted.
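That probe is conceptually just (names invented, not the actual code):

	static void slub_probe_watermarks(struct kmem_cache *s, gfp_t gfpflags)
	{
		struct page *page;

		/* a regular, watermark-obeying allocation */
		page = alloc_pages(gfpflags & ~__GFP_MEMALLOC, s->order);
		if (page) {
			s->reserve = 0;		/* pressure has lifted */
			add_partial(s, page);	/* invented helper */
		}
	}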
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [PATCH 00/33] Swap over NFS -v14
2007-10-31 12:56 ` Peter Zijlstra
@ 2007-10-31 13:18 ` Arnaldo Carvalho de Melo
2007-10-31 13:44 ` Gregory Haskins
` (2 subsequent siblings)
3 siblings, 0 replies; 72+ messages in thread
From: Arnaldo Carvalho de Melo @ 2007-10-31 13:18 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Jeff Garzik, Nick Piggin, Linus Torvalds, Andrew Morton,
linux-kernel, linux-mm, netdev, trond.myklebust
Em Wed, Oct 31, 2007 at 01:56:53PM +0100, Peter Zijlstra escreveu:
> On Wed, 2007-10-31 at 08:16 -0400, Jeff Garzik wrote:
> > Thoughts:
> >
> > 1) I absolutely agree that NFS is far more prominent and useful than any
> > network block device, at the present time.
> >
> >
> > 2) Nonetheless, swap over NFS is a pretty rare case. I view this work
> > as interesting, but I really don't see a huge need, for swapping over
> > NBD or swapping over NFS. I tend to think swapping to a remote resource
> > starts to approach "migration" rather than merely swapping. Yes, we can
> > do it... but given the lack of burning need one must examine the price.
>
> There is a large corporate demand for this, which is why I'm doing this.
>
> The typical usage scenarios are:
> - cluster/blades, where having local disks is a cost issue (maintenance
> of failures, heat, etc)
> - virtualisation, where dumping the storage on a networked storage unit
> makes for trivial migration and what not..
>
> But please, people who want this (I'm sure some of you are reading) do
> speak up. I'm just the motivated corporate drone implementing the
> feature :-)
Keep it up. Dave already mentioned iSCSI, there is AoE, there are RT
sockets, you name it. We've talked about the networking bits several
times and they look OK, so I'm sorry for not going over all of them in
detail, but you have my support nevertheless.
- Arnaldo
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [PATCH 00/33] Swap over NFS -v14
2007-10-31 12:56 ` Peter Zijlstra
2007-10-31 13:18 ` Arnaldo Carvalho de Melo
@ 2007-10-31 13:44 ` Gregory Haskins
2007-11-02 8:54 ` Pavel Machek
2007-11-18 18:09 ` Robin Humble
3 siblings, 0 replies; 72+ messages in thread
From: Gregory Haskins @ 2007-10-31 13:44 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Jeff Garzik, Nick Piggin, Linus Torvalds, Andrew Morton,
linux-kernel, linux-mm, netdev, trond.myklebust
Peter Zijlstra wrote:
>
> But please, people who want this (I'm sure some of you are reading) do
> speak up. I'm just the motivated corporate drone implementing the
> feature :-)
FWIW, I could have used a "swap to network technology X" like system at
my last job. We were building a large networking switch with blades,
and the IO cards didn't have anywhere near the resources that the
control modules had (no persistent storage, small ram, etc). We were
already doing userspace coredumps over NFS to the control cards. It
would have been nice to swap as well.
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [PATCH 00/33] Swap over NFS -v14
2007-10-31 4:04 ` Nick Piggin
@ 2007-10-31 14:03 ` Byron Stanoszek
0 siblings, 0 replies; 72+ messages in thread
From: Byron Stanoszek @ 2007-10-31 14:03 UTC (permalink / raw)
To: Nick Piggin
Cc: David Miller, a.p.zijlstra, torvalds, akpm, linux-kernel,
linux-mm, netdev, trond.myklebust
On Wed, 31 Oct 2007, Nick Piggin wrote:
> On Wednesday 31 October 2007 15:37, David Miller wrote:
>> From: Nick Piggin <nickpiggin@yahoo.com.au>
>> Date: Wed, 31 Oct 2007 14:26:32 +1100
>>
>>> Is it really worth all the added complexity of making swap
>>> over NFS files work, given that you could use a network block
>>> device instead?
>>
>> Don't be misled. Swapping over NFS is just a scarecrow for the
>> seemingly real impetus behind these changes which is network storage
>> stuff like iSCSI.
>
> Oh, I'm OK with the network reserves stuff (not the actual patch,
> which I'm not really qualified to review, but at least the idea
> of it...).
>
> And also I'm not as such against the idea of swap over network.
>
> However, specifically the change to make swapfiles work through
> the filesystem layer (ATM it goes straight to the block layer,
> modulo some initialisation stuff which uses block filesystem-
> specific calls).
>
> I mean, I assume that anybody trying to swap over network *today*
> has to be using a network block device anyway, so the idea of
> just being able to transparently improve that case seems better
> than adding new complexities for seemingly not much gain.
I have some embedded diskless devices that have 16 MB of RAM and >500 MB of
swap. Their root fs and swap device are both done over NBD because NFS is too
expensive in 16 MB of RAM. Any memory contention (i.e. needing memory to swap
memory over the network), however infrequent, causes the system to freeze when
about 50 MB of VM is used up. I would love to see some work done in this area.
-Byron
--
Byron Stanoszek Ph: (330) 644-3059
Systems Programmer Fax: (330) 644-8110
Commercial Timesharing Inc. Email: byron@comtime.com
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [PATCH 00/33] Swap over NFS -v14
2007-10-31 10:56 ` Peter Zijlstra
2007-10-31 11:18 ` NBD was " Pavel Machek
@ 2007-10-31 14:54 ` Mike Snitzer
2007-10-31 16:31 ` Evgeniy Polyakov
1 sibling, 1 reply; 72+ messages in thread
From: Mike Snitzer @ 2007-10-31 14:54 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Christoph Hellwig, David Miller, nickpiggin, torvalds, akpm,
linux-kernel, linux-mm, netdev, trond.myklebust, Evgeniy Polyakov
On 10/31/07, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> On Wed, 2007-10-31 at 08:50 +0000, Christoph Hellwig wrote:
> > On Tue, Oct 30, 2007 at 09:37:53PM -0700, David Miller wrote:
> > > Don't be misled. Swapping over NFS is just a scarecrow for the
> > > seemingly real impetus behind these changes which is network storage
> > > stuff like iSCSI.
> >
> > So can we please do swap over network storage only first? All these
> > VM bits look conceptually sane to me, while the changes to the swap
> > code to support nfs are real crackpipe material.
>
> Yeah, I know how you stand on that. I just wanted to post all this
> before going off into the woods reworking it all.
...
> > So please get the VM bits for swap over network blockdevices in first,
>
> Trouble with that part is that we don't have any sane network block
> devices atm, NBD is utter crap, and iSCSI is too complex to be called
> sane.
>
> Maybe Evgeniy's Distributed storage thingy would work, will have a look
> at that.
Andrew recently asked Evgeniy if his DST was ready for merging; to
which Evgeniy basically said yes:
http://lkml.org/lkml/2007/10/27/54
It would be great if DST could be merged, thereby addressing the fact
that NBD is lacking for net-vm. Scrutinizing DST in the
context of net-vm should also help it get the review that it needs for
merging.
Mike
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [PATCH 00/33] Swap over NFS -v14
2007-10-31 14:54 ` Mike Snitzer
@ 2007-10-31 16:31 ` Evgeniy Polyakov
0 siblings, 0 replies; 72+ messages in thread
From: Evgeniy Polyakov @ 2007-10-31 16:31 UTC (permalink / raw)
To: Mike Snitzer
Cc: Peter Zijlstra, Christoph Hellwig, David Miller, nickpiggin,
torvalds, akpm, linux-kernel, linux-mm, netdev, trond.myklebust
Hi.
On Wed, Oct 31, 2007 at 10:54:02AM -0400, Mike Snitzer (snitzer@gmail.com) wrote:
> > Trouble with that part is that we don't have any sane network block
> > devices atm, NBD is utter crap, and iSCSI is too complex to be called
> > sane.
> >
> > Maybe Evgeniy's Distributed storage thingy would work, will have a look
> > at that.
>
> Andrew recently asked Evgeniy if his DST was ready for merging; to
> which Evgeniy basically said yes:
> http://lkml.org/lkml/2007/10/27/54
>
> It would be great if DST could be merged; whereby addressing the fact
> that NBD is lacking for net-vm. If DST were scrutinized in the
> context of net-vm it should help it get the review that is needed for
> merging.
By popular request I'm working on adding strong checksumming of the data
transferred, so I cannot say that Andrew will want to merge this during
the development phase. I expect to complete it quite soon (it is in the
testing stage right now), with a new release scheduled this week. It will
also include some small features for userspace (happiness).
Memory management is not changed.
--
Evgeniy Polyakov
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [PATCH 00/33] Swap over NFS -v14
2007-10-31 12:56 ` Peter Zijlstra
2007-10-31 13:18 ` Arnaldo Carvalho de Melo
2007-10-31 13:44 ` Gregory Haskins
@ 2007-11-02 8:54 ` Pavel Machek
2007-11-18 18:09 ` Robin Humble
3 siblings, 0 replies; 72+ messages in thread
From: Pavel Machek @ 2007-11-02 8:54 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Jeff Garzik, Nick Piggin, Linus Torvalds, Andrew Morton,
linux-kernel, linux-mm, netdev, trond.myklebust
Hi!
> > 2) Nonetheless, swap over NFS is a pretty rare case. I view this work
> > as interesting, but I really don't see a huge need, for swapping over
> > NBD or swapping over NFS. I tend to think swapping to a remote resource
> > starts to approach "migration" rather than merely swapping. Yes, we can
> > do it... but given the lack of burning need one must examine the price.
>
> There is a large corporate demand for this, which is why I'm doing this.
>
> The typical usage scenarios are:
> - cluster/blades, where having local disks is a cost issue (maintenance
> of failures, heat, etc)
> - virtualisation, where dumping the storage on a networked storage unit
> makes for trivial migration and what not..
>
> But please, people who want this (I'm sure some of you are reading) do
> speak up. I'm just the motivated corporate drone implementing the
> feature :-)
I have a Wyse thin client here: a Geode (or something) CPU, 128MB flash,
256MB RAM (IIRC). You want to swap on this one, and no, you don't want
to swap to flash.
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [PATCH 00/33] Swap over NFS -v14
2007-10-31 12:56 ` Peter Zijlstra
` (2 preceding siblings ...)
2007-11-02 8:54 ` Pavel Machek
@ 2007-11-18 18:09 ` Robin Humble
3 siblings, 0 replies; 72+ messages in thread
From: Robin Humble @ 2007-11-18 18:09 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Jeff Garzik, Nick Piggin, Linus Torvalds, Andrew Morton,
linux-kernel, linux-mm, netdev, trond.myklebust
<apologies for being insanely late into this thread>
On Wed, Oct 31, 2007 at 01:56:53PM +0100, Peter Zijlstra wrote:
>On Wed, 2007-10-31 at 08:16 -0400, Jeff Garzik wrote:
>> Thoughts:
>> 1) I absolutely agree that NFS is far more prominent and useful than any
>> network block device, at the present time.
>>
>> 2) Nonetheless, swap over NFS is a pretty rare case. I view this work
>> as interesting, but I really don't see a huge need, for swapping over
>> NBD or swapping over NFS. I tend to think swapping to a remote resource
>> starts to approach "migration" rather than merely swapping. Yes, we can
>> do it... but given the lack of burning need one must examine the price.
>
>There is a large corporate demand for this, which is why I'm doing this.
>
>The typical usage scenarios are:
> - cluster/blades, where having local disks is a cost issue (maintenance
> of failures, heat, etc)
HPC clusters are increasingly diskless, especially at the high end,
for all the reasons you mention, but also because networks are faster
than disks.
>But please, people who want this (I'm sure some of you are reading) do
>speak up. I'm just the motivated corporate drone implementing the
>feature :-)
Swap to iSCSI has worked well in the past with your anti-deadlock
patches, and I'd definitely like to see that continue and be merged
into mainline!! Swap-to-network is a highly desirable feature for
modern clusters.
Performance and scalability of NFS are poor, so it's not a good option.
Actually, swap to a file on Lustre(*) would be best, but iSER and iSCSI
would be my next choices. iSER is better than iSCSI as it's ~5x faster
in practice, and InfiniBand seems to be here to stay.
Hmmm - any idea what the issues are with RDMA in low-memory situations?
Presumably if DMA regions are mapped early then there's not actually
much of a problem? I might try it with tgtd's iSER...
cheers,
robin
(*) obviously not your responsibility. although Lustre (Sun/CFS) could
presumably use your infrastructure once you have it in mainline.
>> 3) You note
>> > Swap over network has the problem that the network subsystem does not use fixed
>> > sized allocations, but heavily relies on kmalloc(). This makes mempools
>> > unusable.
>>
>> True, but IMO there are mitigating factors that should be researched and
>> taken into account:
>>
>> a) To give you some net driver background/history, most mainstream net
>> drivers were coded to allocate RX skbs of size 1538, under the theory
>> that they would all be allocating out of the same underlying slab cache.
>> It would not be difficult to update a great many of the [non-jumbo]
>> cases to create a fixed size allocation pattern.
>
>One issue that comes to mind is how to ensure we'd still overflow the
>IP-reassembly buffers. Currently those are managed on the number of
>bytes present, not the number of fragments.
>
>One of the goals of my approach was to not rewrite the network subsystem
>to accommodate this feature (and I hope I succeeded).
>
>> b) Spare-time experiments and anecdotal evidence points to RX and TX skb
>> recycling as a potentially valuable area of research. If you are able
>> to do something like that, then memory suddenly becomes a lot more
>> bounded and predictable.
>>
>>
>> So my gut feeling is that taking a hard look at how net drivers function
>> in the field should give you a lot of good ideas that approach the
>> shared goal of making network memory allocations more predictable and
>> bounded.
>
>Note that being bounded only comes from dropping most packets before
>tying them to a socket. That is the crucial part of the RX path: to
>receive all packets from the NIC (regardless of their size) but to not pass
>them on to the network stack - unless they belong to a 'special' socket
>that promises undelayed processing.
>
>Thanks for these ideas, I'll look into them.
^ permalink raw reply [flat|nested] 72+ messages in thread
end of thread (newest: 2007-11-18 18:09 UTC)
Thread overview: 72+ messages
2007-10-30 16:04 [PATCH 00/33] Swap over NFS -v14 Peter Zijlstra
2007-10-30 16:04 ` [PATCH 01/33] mm: gfp_to_alloc_flags() Peter Zijlstra
2007-10-30 16:04 ` [PATCH 02/33] mm: tag reseve pages Peter Zijlstra
2007-10-30 16:04 ` [PATCH 03/33] mm: slub: add knowledge of reserve pages Peter Zijlstra
2007-10-31 3:37 ` Nick Piggin
2007-10-31 10:42 ` Peter Zijlstra
2007-10-31 10:46 ` Nick Piggin
2007-10-31 12:17 ` Peter Zijlstra
2007-10-31 11:25 ` Nick Piggin
2007-10-31 12:54 ` Peter Zijlstra
2007-10-31 13:08 ` Peter Zijlstra
2007-10-30 16:04 ` [PATCH 04/33] mm: allow mempool to fall back to memalloc reserves Peter Zijlstra
2007-10-31 3:40 ` Nick Piggin
2007-10-30 16:04 ` [PATCH 05/33] mm: kmem_estimate_pages() Peter Zijlstra
2007-10-31 3:43 ` Nick Piggin
2007-10-31 10:42 ` Peter Zijlstra
2007-10-30 16:04 ` [PATCH 06/33] mm: allow PF_MEMALLOC from softirq context Peter Zijlstra
2007-10-31 3:51 ` Nick Piggin
2007-10-31 10:42 ` Peter Zijlstra
2007-10-31 10:49 ` Nick Piggin
2007-10-31 13:06 ` Peter Zijlstra
2007-10-30 16:04 ` [PATCH 07/33] mm: serialize access to min_free_kbytes Peter Zijlstra
2007-10-30 16:04 ` [PATCH 08/33] mm: emergency pool Peter Zijlstra
2007-10-30 16:04 ` [PATCH 09/33] mm: system wide ALLOC_NO_WATERMARK Peter Zijlstra
2007-10-31 3:52 ` Nick Piggin
2007-10-31 10:45 ` Peter Zijlstra
2007-10-30 16:04 ` [PATCH 10/33] mm: __GFP_MEMALLOC Peter Zijlstra
2007-10-30 16:04 ` [PATCH 11/33] mm: memory reserve management Peter Zijlstra
2007-10-30 16:04 ` [PATCH 12/33] selinux: tag avc cache alloc as non-critical Peter Zijlstra
2007-10-30 16:04 ` [PATCH 13/33] net: wrap sk->sk_backlog_rcv() Peter Zijlstra
2007-10-30 16:04 ` [PATCH 14/33] net: packet split receive api Peter Zijlstra
2007-10-30 16:04 ` [PATCH 15/33] net: sk_allocation() - concentrate socket related allocations Peter Zijlstra
2007-10-30 16:04 ` [PATCH 16/33] netvm: network reserve infrastructure Peter Zijlstra
2007-10-30 16:04 ` [PATCH 17/33] sysctl: propagate conv errors Peter Zijlstra
2007-10-30 16:04 ` [PATCH 18/33] netvm: INET reserves Peter Zijlstra
2007-10-30 16:04 ` [PATCH 19/33] netvm: hook skb allocation to reserves Peter Zijlstra
2007-10-30 16:04 ` [PATCH 20/33] netvm: filter emergency skbs Peter Zijlstra
2007-10-30 16:04 ` [PATCH 21/33] netvm: prevent a TCP specific deadlock Peter Zijlstra
2007-10-30 16:04 ` [PATCH 22/33] netfilter: NF_QUEUE vs emergency skbs Peter Zijlstra
2007-10-30 16:04 ` [PATCH 23/33] netvm: skb processing Peter Zijlstra
2007-10-30 21:26 ` Stephen Hemminger
2007-10-30 21:26 ` Stephen Hemminger
2007-10-30 21:44 ` Peter Zijlstra
2007-10-30 16:04 ` [PATCH 24/33] mm: prepare swap entry methods for use in page methods Peter Zijlstra
2007-10-30 16:04 ` [PATCH 25/33] mm: add support for non block device backed swap files Peter Zijlstra
2007-10-30 16:04 ` [PATCH 26/33] mm: methods for teaching filesystems about PG_swapcache pages Peter Zijlstra
2007-10-30 16:04 ` [PATCH 27/33] nfs: remove mempools Peter Zijlstra
2007-10-30 16:04 ` [PATCH 28/33] nfs: teach the NFS client how to treat PG_swapcache pages Peter Zijlstra
2007-10-31 8:52 ` Christoph Hellwig
2007-10-30 16:04 ` [PATCH 29/33] nfs: disable data cache revalidation for swapfiles Peter Zijlstra
2007-10-30 16:04 ` [PATCH 30/33] nfs: swap vs nfs_writepage Peter Zijlstra
2007-10-30 16:04 ` [PATCH 31/33] nfs: enable swap on NFS Peter Zijlstra
2007-10-30 16:04 ` [PATCH 32/33] nfs: fix various memory recursions possible with swap over NFS Peter Zijlstra
2007-10-30 16:04 ` [PATCH 33/33] nfs: do not warn on radix tree node allocation failures Peter Zijlstra
2007-10-31 3:26 ` [PATCH 00/33] Swap over NFS -v14 Nick Piggin
2007-10-31 4:37 ` David Miller, Nick Piggin
2007-10-31 4:04 ` Nick Piggin
2007-10-31 14:03 ` Byron Stanoszek
2007-10-31 8:50 ` Christoph Hellwig
2007-10-31 10:56 ` Peter Zijlstra
2007-10-31 11:18 ` NBD was " Pavel Machek
2007-10-31 11:24 ` Peter Zijlstra
2007-10-31 14:54 ` Mike Snitzer
2007-10-31 16:31 ` Evgeniy Polyakov
2007-10-31 9:53 ` Peter Zijlstra
2007-10-31 11:27 ` Peter Zijlstra
2007-10-31 12:16 ` Jeff Garzik
2007-10-31 12:56 ` Peter Zijlstra
2007-10-31 13:18 ` Arnaldo Carvalho de Melo
2007-10-31 13:44 ` Gregory Haskins
2007-11-02 8:54 ` Pavel Machek
2007-11-18 18:09 ` Robin Humble