Netdev List
* [PATCH 20/31] netvm: prevent a stream specific deadlock
From: Suresh Jayaraman @ 2009-10-01 14:08 UTC
  To: Linus Torvalds, Andrew Morton, linux-kernel, linux-mm
  Cc: netdev, Neil Brown, Miklos Szeredi, Wouter Verhelst,
	Peter Zijlstra, trond.myklebust, Suresh Jayaraman

From: Peter Zijlstra <a.p.zijlstra@chello.nl> 

It could happen that all !SOCK_MEMALLOC sockets have buffered so much data
that we're over the global rmem limit. This will prevent SOCK_MEMALLOC sockets
from receiving data, which will prevent userspace from running, which is needed
to reduce the buffered data.

Fix this by exempting the SOCK_MEMALLOC sockets from the rmem limit.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de>
---
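
A minimal user-space sketch of the admission check this patch changes,
offered as an illustration rather than the kernel code. The model_* structs,
the global counter, and model_mem_schedule() are simplified stand-ins
(assumptions) for struct sock, struct sk_buff and __sk_mem_schedule(); only
the order of the three conditions is taken from the diff below.

```c
#include <stdbool.h>

/* Simplified stand-ins; these are not the kernel structures. */
struct model_skb {
	unsigned int truesize;	/* bytes charged for this buffer */
	bool emergency;		/* buffer came from the memory reserve */
};

struct model_sock {
	bool has_account;	/* protocol does memory accounting */
	int forward_alloc;	/* bytes already reserved for this socket */
	long *global_alloc;	/* shared receive-memory counter */
	long global_limit;	/* stand-in for the global rmem limit */
};

/* Stand-in for __sk_mem_schedule(): succeeds only while under the limit. */
static bool model_mem_schedule(struct model_sock *sk, unsigned int size)
{
	if (*sk->global_alloc + (long)size > sk->global_limit)
		return false;
	*sk->global_alloc += size;
	sk->forward_alloc += size;
	return true;
}

/* Same condition order as the patched sk_rmem_schedule(). */
static bool model_rmem_schedule(struct model_sock *sk,
				const struct model_skb *skb)
{
	if (!sk->has_account)
		return true;
	return (int)skb->truesize <= sk->forward_alloc ||
	       model_mem_schedule(sk, skb->truesize) ||
	       skb->emergency;	/* last resort: exempt from the limit */
}
```

With the global counter already at its limit, a regular buffer is refused
while an emergency buffer is still admitted. The exemption is keyed on the
skb, not the socket; it is patch 19/31 that keeps emergency skbs away from
sockets that are not SOCK_MEMALLOC.
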
 include/net/sock.h   |    7 ++++---
 net/core/sock.c      |    2 +-
 net/ipv4/tcp_input.c |   12 ++++++------
 net/sctp/ulpevent.c  |    2 +-
 4 files changed, 12 insertions(+), 11 deletions(-)

Index: mmotm/include/net/sock.h
===================================================================
--- mmotm.orig/include/net/sock.h
+++ mmotm/include/net/sock.h
@@ -882,12 +882,13 @@ static inline int sk_wmem_schedule(struc
 		__sk_mem_schedule(sk, size, SK_MEM_SEND);
 }
 
-static inline int sk_rmem_schedule(struct sock *sk, int size)
+static inline int sk_rmem_schedule(struct sock *sk, struct sk_buff *skb)
 {
 	if (!sk_has_account(sk))
 		return 1;
-	return size <= sk->sk_forward_alloc ||
-		__sk_mem_schedule(sk, size, SK_MEM_RECV);
+	return skb->truesize <= sk->sk_forward_alloc ||
+		__sk_mem_schedule(sk, skb->truesize, SK_MEM_RECV) ||
+		skb_emergency(skb);
 }
 
 static inline void sk_mem_reclaim(struct sock *sk)
Index: mmotm/net/core/sock.c
===================================================================
--- mmotm.orig/net/core/sock.c
+++ mmotm/net/core/sock.c
@@ -390,7 +390,7 @@ int sock_queue_rcv_skb(struct sock *sk,
 	if (err)
 		goto out;
 
-	if (!sk_rmem_schedule(sk, skb->truesize)) {
+	if (!sk_rmem_schedule(sk, skb)) {
 		err = -ENOBUFS;
 		goto out;
 	}
Index: mmotm/net/ipv4/tcp_input.c
===================================================================
--- mmotm.orig/net/ipv4/tcp_input.c
+++ mmotm/net/ipv4/tcp_input.c
@@ -4269,19 +4269,19 @@ static void tcp_ofo_queue(struct sock *s
 static int tcp_prune_ofo_queue(struct sock *sk);
 static int tcp_prune_queue(struct sock *sk);
 
-static inline int tcp_try_rmem_schedule(struct sock *sk, unsigned int size)
+static inline int tcp_try_rmem_schedule(struct sock *sk, struct sk_buff *skb)
 {
 	if (atomic_read(&sk->sk_rmem_alloc) > sk->sk_rcvbuf ||
-	    !sk_rmem_schedule(sk, size)) {
+	    !sk_rmem_schedule(sk, skb)) {
 
 		if (tcp_prune_queue(sk) < 0)
 			return -1;
 
-		if (!sk_rmem_schedule(sk, size)) {
+		if (!sk_rmem_schedule(sk, skb)) {
 			if (!tcp_prune_ofo_queue(sk))
 				return -1;
 
-			if (!sk_rmem_schedule(sk, size))
+			if (!sk_rmem_schedule(sk, skb))
 				return -1;
 		}
 	}
@@ -4333,7 +4333,7 @@ static void tcp_data_queue(struct sock *
 		if (eaten <= 0) {
 queue_and_out:
 			if (eaten < 0 &&
-			    tcp_try_rmem_schedule(sk, skb->truesize))
+			    tcp_try_rmem_schedule(sk, skb))
 				goto drop;
 
 			skb_set_owner_r(skb, sk);
@@ -4404,7 +4404,7 @@ drop:
 
 	TCP_ECN_check_ce(tp, skb);
 
-	if (tcp_try_rmem_schedule(sk, skb->truesize))
+	if (tcp_try_rmem_schedule(sk, skb))
 		goto drop;
 
 	/* Disable header prediction. */
Index: mmotm/net/sctp/ulpevent.c
===================================================================
--- mmotm.orig/net/sctp/ulpevent.c
+++ mmotm/net/sctp/ulpevent.c
@@ -701,7 +701,7 @@ struct sctp_ulpevent *sctp_ulpevent_make
 	if (rx_count >= asoc->base.sk->sk_rcvbuf) {
 
 		if ((asoc->base.sk->sk_userlocks & SOCK_RCVBUF_LOCK) ||
-		    (!sk_rmem_schedule(asoc->base.sk, chunk->skb->truesize)))
+		    (!sk_rmem_schedule(asoc->base.sk, chunk->skb)))
 			goto fail;
 	}
 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

* [PATCH 19/31] netvm: filter emergency skbs.
From: Suresh Jayaraman @ 2009-10-01 14:08 UTC
  To: Linus Torvalds, Andrew Morton, linux-kernel, linux-mm
  Cc: netdev, Neil Brown, Miklos Szeredi, Wouter Verhelst,
	Peter Zijlstra, trond.myklebust, Suresh Jayaraman

From: Peter Zijlstra <a.p.zijlstra@chello.nl> 

Toss all emergency packets not for a SOCK_MEMALLOC socket. This ensures our
precious memory reserve doesn't get stuck waiting for user-space.

The correctness of this approach relies on the fact that networks must be
assumed lossy.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de>
---
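
The rule added to sk_filter() can be sketched in isolation as follows. This
is a hedged user-space model: the structs and model_filter_precheck() are
hypothetical names, with the two booleans standing in for skb_emergency()
and sk_has_memalloc().

```c
#include <errno.h>
#include <stdbool.h>

/* Simplified stand-ins; these are not the kernel structures. */
struct model_skb {
	bool emergency;		/* models skb_emergency(skb) */
};

struct model_sock {
	bool memalloc;		/* models sk_has_memalloc(sk) */
};

/*
 * Models the early exit added at the top of sk_filter(): a packet that
 * consumed reserve memory is dropped unless the receiving socket is
 * allowed to use the reserve.
 */
static int model_filter_precheck(const struct model_sock *sk,
				 const struct model_skb *skb)
{
	if (skb->emergency && !sk->memalloc)
		return -ENOMEM;
	return 0;
}
```

Regular packets pass the check for any socket; only the combination of an
emergency packet and a socket without SOCK_MEMALLOC is rejected.
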
 net/core/filter.c |    3 +++
 1 file changed, 3 insertions(+)

Index: mmotm/net/core/filter.c
===================================================================
--- mmotm.orig/net/core/filter.c
+++ mmotm/net/core/filter.c
@@ -81,6 +81,9 @@ int sk_filter(struct sock *sk, struct sk
 	int err;
 	struct sk_filter *filter;
 
+	if (skb_emergency(skb) && !sk_has_memalloc(sk))
+		return -ENOMEM;
+
 	err = security_sock_rcv_skb(sk, skb);
 	if (err)
 		return err;

* [PATCH 18/31] netvm: hook skb allocation to reserves
From: Suresh Jayaraman @ 2009-10-01 14:08 UTC
  To: Linus Torvalds, Andrew Morton, linux-kernel, linux-mm
  Cc: netdev, Neil Brown, Miklos Szeredi, Wouter Verhelst,
	Peter Zijlstra, trond.myklebust, Suresh Jayaraman

From: Peter Zijlstra <a.p.zijlstra@chello.nl> 

Change the skb allocation API to indicate RX usage and use this to fall back to
the reserve when needed. SKBs allocated from the reserve are tagged in
skb->emergency.

Teach all other skb ops about emergency skbs and the reserve accounting.

Use the (new) packet split API to allocate and track fragment pages from the
emergency reserve. Do this using an atomic counter, page->frag_count, which
shares a union with page->index. This is needed because the fragments have
different sharing semantics than those indicated by skb_shinfo()->dataref.

Note that the decision to distinguish between regular and emergency SKBs allows
the accounting overhead to be limited to the latter kind.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de>
---
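
The fragment-page accounting described above can be sketched as a small
user-space model. This is an illustration under stated assumptions: plain
ints replace the atomic counters, reserve_pages_charged is a hypothetical
stand-in for the pages charged to net_skb_reserve, and the model_* helpers
only mirror the branches of skb_add_rx_frag(), skb_get_page() and
skb_put_page() in the diff below.

```c
#include <stdbool.h>

/* Simplified stand-ins; these are not the kernel structures. */
struct model_page {
	int refcount;		/* models the page reference count */
	int frag_count;		/* models the atomic page->frag_count */
	bool reserve;		/* page was taken from the reserve */
};

struct model_skb {
	bool emergency;		/* models skb_emergency(skb) */
};

/* Hypothetical counter for pages charged to the skb reserve. */
static long reserve_pages_charged;

/* Models the accounting branch added to skb_add_rx_frag(). */
static void model_add_rx_frag(const struct model_skb *skb,
			      struct model_page *page)
{
	if (skb->emergency)
		page->frag_count = 1;		/* start tracking users */
	else if (page->reserve)
		reserve_pages_charged -= 1;	/* regular skb: uncharge now */
}

/* Models skb_get_page(): one more emergency user of the fragment. */
static void model_get_page(const struct model_skb *skb,
			   struct model_page *page)
{
	page->refcount++;
	if (skb->emergency)
		page->frag_count++;
}

/* Models skb_put_page(): the last emergency user uncharges the page. */
static void model_put_page(const struct model_skb *skb,
			   struct model_page *page)
{
	if (skb->emergency && --page->frag_count == 0)
		reserve_pages_charged -= 1;
	page->refcount--;
}
```

In this model a page attached to an emergency skb and then shared once stays
charged after the first put and is uncharged only by the second, which is
the behaviour the separate counter is meant to provide.
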
 include/linux/mm_types.h |    1 
 include/linux/skbuff.h   |   28 +++++++--
 net/core/skbuff.c        |  137 +++++++++++++++++++++++++++++++++++++----------
 3 files changed, 133 insertions(+), 33 deletions(-)

Index: mmotm/include/linux/mm_types.h
===================================================================
--- mmotm.orig/include/linux/mm_types.h
+++ mmotm/include/linux/mm_types.h
@@ -78,6 +78,7 @@ struct page {
 		pgoff_t index;		/* Our offset within mapping. */
 		void *freelist;		/* SLUB: freelist req. slab lock */
 		int reserve;		/* page_alloc: page is a reserve page */
+		atomic_t frag_count;	/* skb fragment use count */
 	};
 	struct list_head lru;		/* Pageout list, eg. active_list
 					 * protected by zone->lru_lock !
Index: mmotm/include/linux/skbuff.h
===================================================================
--- mmotm.orig/include/linux/skbuff.h
+++ mmotm/include/linux/skbuff.h
@@ -384,8 +384,10 @@ struct sk_buff {
 	__u8			do_not_encrypt:1;
 #endif
 	kmemcheck_bitfield_end(flags2);
-
-	/* 0/13/14 bit hole */
+#ifdef CONFIG_NETVM
+	__u8			emergency:1;
+#endif
+	/* 0/14 bit hole */
 
 #ifdef CONFIG_NET_DMA
 	dma_cookie_t		dma_cookie;
@@ -426,6 +428,18 @@ extern void skb_dma_unmap(struct device
 			  enum dma_data_direction dir);
 #endif
 
+#define SKB_ALLOC_FCLONE	0x01
+#define SKB_ALLOC_RX		0x02
+
+static inline bool skb_emergency(const struct sk_buff *skb)
+{
+#ifdef CONFIG_NETVM
+	return unlikely(skb->emergency);
+#else
+	return false;
+#endif
+}
+
 static inline struct dst_entry *skb_dst(const struct sk_buff *skb)
 {
 	return (struct dst_entry *)skb->_skb_dst;
@@ -445,7 +459,7 @@ extern void kfree_skb(struct sk_buff *sk
 extern void consume_skb(struct sk_buff *skb);
 extern void	       __kfree_skb(struct sk_buff *skb);
 extern struct sk_buff *__alloc_skb(unsigned int size,
-				   gfp_t priority, int fclone, int node);
+				   gfp_t priority, int flags, int node);
 static inline struct sk_buff *alloc_skb(unsigned int size,
 					gfp_t priority)
 {
@@ -455,7 +469,7 @@ static inline struct sk_buff *alloc_skb(
 static inline struct sk_buff *alloc_skb_fclone(unsigned int size,
 					       gfp_t priority)
 {
-	return __alloc_skb(size, priority, 1, -1);
+	return __alloc_skb(size, priority, SKB_ALLOC_FCLONE, -1);
 }
 
 extern int skb_recycle_check(struct sk_buff *skb, int skb_size);
@@ -1466,7 +1480,8 @@ static inline void __skb_queue_purge(str
 static inline struct sk_buff *__dev_alloc_skb(unsigned int length,
 					      gfp_t gfp_mask)
 {
-	struct sk_buff *skb = alloc_skb(length + NET_SKB_PAD, gfp_mask);
+	struct sk_buff *skb =
+		__alloc_skb(length + NET_SKB_PAD, gfp_mask, SKB_ALLOC_RX, -1);
 	if (likely(skb))
 		skb_reserve(skb, NET_SKB_PAD);
 	return skb;
@@ -1497,6 +1512,7 @@ static inline struct sk_buff *netdev_all
 }
 
 extern struct page *__netdev_alloc_page(struct net_device *dev, gfp_t gfp_mask);
+extern void __netdev_free_page(struct net_device *dev, struct page *page);
 
 /**
  *	netdev_alloc_page - allocate a page for ps-rx on a specific device
@@ -1513,7 +1529,7 @@ static inline struct page *netdev_alloc_
 
 static inline void netdev_free_page(struct net_device *dev, struct page *page)
 {
-	__free_page(page);
+	__netdev_free_page(dev, page);
 }
 
 /**
Index: mmotm/net/core/skbuff.c
===================================================================
--- mmotm.orig/net/core/skbuff.c
+++ mmotm/net/core/skbuff.c
@@ -170,23 +170,29 @@ EXPORT_SYMBOL(skb_under_panic);
  *	%GFP_ATOMIC.
  */
 struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
-			    int fclone, int node)
+			    int flags, int node)
 {
 	struct kmem_cache *cache;
 	struct skb_shared_info *shinfo;
 	struct sk_buff *skb;
 	u8 *data;
+	int emergency = 0;
+	int memalloc = sk_memalloc_socks();
 
-	cache = fclone ? skbuff_fclone_cache : skbuff_head_cache;
+	size = SKB_DATA_ALIGN(size);
+	cache = (flags & SKB_ALLOC_FCLONE)
+		? skbuff_fclone_cache : skbuff_head_cache;
+
+	if (memalloc && (flags & SKB_ALLOC_RX))
+		gfp_mask |= __GFP_MEMALLOC;
 
 	/* Get the HEAD */
 	skb = kmem_cache_alloc_node(cache, gfp_mask & ~__GFP_DMA, node);
 	if (!skb)
 		goto out;
 
-	size = SKB_DATA_ALIGN(size);
-	data = kmalloc_node_track_caller(size + sizeof(struct skb_shared_info),
-			gfp_mask, node);
+	data = kmalloc_reserve(size + sizeof(struct skb_shared_info),
+			gfp_mask, node, &net_skb_reserve, &emergency);
 	if (!data)
 		goto nodata;
 
@@ -196,6 +202,9 @@ struct sk_buff *__alloc_skb(unsigned int
 	 * the tail pointer in struct sk_buff!
 	 */
 	memset(skb, 0, offsetof(struct sk_buff, tail));
+#ifdef CONFIG_NETVM
+	skb->emergency = emergency;
+#endif
 	skb->truesize = size + sizeof(struct sk_buff);
 	atomic_set(&skb->users, 1);
 	skb->head = data;
@@ -220,7 +229,7 @@ struct sk_buff *__alloc_skb(unsigned int
 	skb_frag_list_init(skb);
 	memset(&shinfo->hwtstamps, 0, sizeof(shinfo->hwtstamps));
 
-	if (fclone) {
+	if (flags & SKB_ALLOC_FCLONE) {
 		struct sk_buff *child = skb + 1;
 		atomic_t *fclone_ref = (atomic_t *) (child + 1);
 
@@ -230,6 +239,9 @@ struct sk_buff *__alloc_skb(unsigned int
 		atomic_set(fclone_ref, 1);
 
 		child->fclone = SKB_FCLONE_UNAVAILABLE;
+#ifdef CONFIG_NETVM
+		child->emergency = skb->emergency;
+#endif
 	}
 out:
 	return skb;
@@ -259,7 +271,7 @@ struct sk_buff *__netdev_alloc_skb(struc
 	int node = dev->dev.parent ? dev_to_node(dev->dev.parent) : -1;
 	struct sk_buff *skb;
 
-	skb = __alloc_skb(length + NET_SKB_PAD, gfp_mask, 0, node);
+	skb = __alloc_skb(length + NET_SKB_PAD, gfp_mask, SKB_ALLOC_RX, node);
 	if (likely(skb)) {
 		skb_reserve(skb, NET_SKB_PAD);
 		skb->dev = dev;
@@ -273,11 +285,19 @@ struct page *__netdev_alloc_page(struct
 	int node = dev->dev.parent ? dev_to_node(dev->dev.parent) : -1;
 	struct page *page;
 
-	page = alloc_pages_node(node, gfp_mask, 0);
+	page = alloc_pages_reserve(node, gfp_mask | __GFP_MEMALLOC, 0,
+			&net_skb_reserve, NULL);
+
 	return page;
 }
 EXPORT_SYMBOL(__netdev_alloc_page);
 
+void __netdev_free_page(struct net_device *dev, struct page *page)
+{
+	free_pages_reserve(page, 0, &net_skb_reserve, page->reserve);
+}
+EXPORT_SYMBOL(__netdev_free_page);
+
 void skb_add_rx_frag(struct sk_buff *skb, int i, struct page *page, int off,
 		int size)
 {
@@ -285,6 +305,27 @@ void skb_add_rx_frag(struct sk_buff *skb
 	skb->len += size;
 	skb->data_len += size;
 	skb->truesize += size;
+
+#ifdef CONFIG_NETVM
+	/*
+	 * In the rare case that skb_emergency() != page->reserve we'll
+	 * skew the accounting slightly, but since it's only a 'small' constant
+	 * shift it's ok.
+	 */
+	if (skb_emergency(skb)) {
+		/*
+		 * We need to track fragment pages so that we properly
+		 * release their reserve in skb_put_page().
+		 */
+		atomic_set(&page->frag_count, 1);
+	} else if (unlikely(page->reserve)) {
+		/*
+		 * Release the reserve now, because normal skbs don't
+		 * do the emergency accounting.
+		 */
+		mem_reserve_pages_charge(&net_skb_reserve, -1);
+	}
+#endif
 }
 EXPORT_SYMBOL(skb_add_rx_frag);
 
@@ -336,21 +377,38 @@ static void skb_clone_fraglist(struct sk
 		skb_get(list);
 }
 
+static void skb_get_page(struct sk_buff *skb, struct page *page)
+{
+	get_page(page);
+	if (skb_emergency(skb))
+		atomic_inc(&page->frag_count);
+}
+
+static void skb_put_page(struct sk_buff *skb, struct page *page)
+{
+	if (skb_emergency(skb) && atomic_dec_and_test(&page->frag_count))
+		mem_reserve_pages_charge(&net_skb_reserve, -1);
+	put_page(page);
+}
+
 static void skb_release_data(struct sk_buff *skb)
 {
 	if (!skb->cloned ||
 	    !atomic_sub_return(skb->nohdr ? (1 << SKB_DATAREF_SHIFT) + 1 : 1,
 			       &skb_shinfo(skb)->dataref)) {
+
 		if (skb_shinfo(skb)->nr_frags) {
 			int i;
-			for (i = 0; i < skb_shinfo(skb)->nr_frags; i++)
-				put_page(skb_shinfo(skb)->frags[i].page);
+			for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
+				skb_put_page(skb,
+					     skb_shinfo(skb)->frags[i].page);
+			}
 		}
 
 		if (skb_has_frags(skb))
 			skb_drop_fraglist(skb);
 
-		kfree(skb->head);
+		kfree_reserve(skb->head, &net_skb_reserve, skb_emergency(skb));
 	}
 }
 
@@ -544,6 +602,9 @@ static void __copy_skb_header(struct sk_
 #if defined(CONFIG_IP_VS) || defined(CONFIG_IP_VS_MODULE)
 	new->ipvs_property	= old->ipvs_property;
 #endif
+#ifdef CONFIG_NETVM
+	new->emergency		= old->emergency;
+#endif
 	new->protocol		= old->protocol;
 	new->mark		= old->mark;
 	new->iif		= old->iif;
@@ -641,6 +702,9 @@ struct sk_buff *skb_clone(struct sk_buff
 		n->fclone = SKB_FCLONE_CLONE;
 		atomic_inc(fclone_ref);
 	} else {
+		if (skb_emergency(skb))
+			gfp_mask |= __GFP_MEMALLOC;
+
 		n = kmem_cache_alloc(skbuff_head_cache, gfp_mask);
 		if (!n)
 			return NULL;
@@ -677,6 +741,14 @@ static void copy_skb_header(struct sk_bu
 	skb_shinfo(new)->gso_type = skb_shinfo(old)->gso_type;
 }
 
+static inline int skb_alloc_rx_flag(const struct sk_buff *skb)
+{
+	if (skb_emergency(skb))
+		return SKB_ALLOC_RX;
+
+	return 0;
+}
+
 /**
  *	skb_copy	-	create private copy of an sk_buff
  *	@skb: buffer to copy
@@ -697,15 +769,17 @@ static void copy_skb_header(struct sk_bu
 struct sk_buff *skb_copy(const struct sk_buff *skb, gfp_t gfp_mask)
 {
 	int headerlen = skb->data - skb->head;
+	int size;
 	/*
 	 *	Allocate the copy buffer
 	 */
 	struct sk_buff *n;
 #ifdef NET_SKBUFF_DATA_USES_OFFSET
-	n = alloc_skb(skb->end + skb->data_len, gfp_mask);
+	size = skb->end + skb->data_len;
 #else
-	n = alloc_skb(skb->end - skb->head + skb->data_len, gfp_mask);
+	size = skb->end - skb->head + skb->data_len;
 #endif
+	n = __alloc_skb(size, gfp_mask, skb_alloc_rx_flag(skb), -1);
 	if (!n)
 		return NULL;
 
@@ -740,12 +814,14 @@ struct sk_buff *pskb_copy(struct sk_buff
 	/*
 	 *	Allocate the copy buffer
 	 */
+	int size;
 	struct sk_buff *n;
 #ifdef NET_SKBUFF_DATA_USES_OFFSET
-	n = alloc_skb(skb->end, gfp_mask);
+	size = skb->end;
 #else
-	n = alloc_skb(skb->end - skb->head, gfp_mask);
+	size = skb->end - skb->head;
 #endif
+	n = __alloc_skb(size, gfp_mask, skb_alloc_rx_flag(skb), -1);
 	if (!n)
 		goto out;
 
@@ -764,8 +840,9 @@ struct sk_buff *pskb_copy(struct sk_buff
 		int i;
 
 		for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
-			skb_shinfo(n)->frags[i] = skb_shinfo(skb)->frags[i];
-			get_page(skb_shinfo(n)->frags[i].page);
+			skb_frag_t *frag = &skb_shinfo(skb)->frags[i];
+			skb_shinfo(n)->frags[i] = *frag;
+			skb_get_page(n, frag->page);
 		}
 		skb_shinfo(n)->nr_frags = i;
 	}
@@ -816,7 +893,11 @@ int pskb_expand_head(struct sk_buff *skb
 
 	size = SKB_DATA_ALIGN(size);
 
-	data = kmalloc(size + sizeof(struct skb_shared_info), gfp_mask);
+	if (skb_emergency(skb))
+		gfp_mask |= __GFP_MEMALLOC;
+
+	data = kmalloc_reserve(size + sizeof(struct skb_shared_info),
+			gfp_mask, -1, &net_skb_reserve, NULL);
 	if (!data)
 		goto nodata;
 
@@ -831,7 +912,7 @@ int pskb_expand_head(struct sk_buff *skb
 	       sizeof(struct skb_shared_info));
 
 	for (i = 0; i < skb_shinfo(skb)->nr_frags; i++)
-		get_page(skb_shinfo(skb)->frags[i].page);
+		skb_get_page(skb, skb_shinfo(skb)->frags[i].page);
 
 	if (skb_has_frags(skb))
 		skb_clone_fraglist(skb);
@@ -912,8 +993,8 @@ struct sk_buff *skb_copy_expand(const st
 	/*
 	 *	Allocate the copy buffer
 	 */
-	struct sk_buff *n = alloc_skb(newheadroom + skb->len + newtailroom,
-				      gfp_mask);
+	struct sk_buff *n = __alloc_skb(newheadroom + skb->len + newtailroom,
+					gfp_mask, skb_alloc_rx_flag(skb), -1);
 	int oldheadroom = skb_headroom(skb);
 	int head_copy_len, head_copy_off;
 	int off;
@@ -1105,7 +1186,7 @@ drop_pages:
 		skb_shinfo(skb)->nr_frags = i;
 
 		for (; i < nfrags; i++)
-			put_page(skb_shinfo(skb)->frags[i].page);
+			skb_put_page(skb, skb_shinfo(skb)->frags[i].page);
 
 		if (skb_has_frags(skb))
 			skb_drop_fraglist(skb);
@@ -1274,7 +1355,7 @@ pull_pages:
 	k = 0;
 	for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
 		if (skb_shinfo(skb)->frags[i].size <= eat) {
-			put_page(skb_shinfo(skb)->frags[i].page);
+			skb_put_page(skb, skb_shinfo(skb)->frags[i].page);
 			eat -= skb_shinfo(skb)->frags[i].size;
 		} else {
 			skb_shinfo(skb)->frags[k] = skb_shinfo(skb)->frags[i];
@@ -2052,6 +2133,7 @@ static inline void skb_split_no_header(s
 			skb_shinfo(skb1)->frags[k] = skb_shinfo(skb)->frags[i];
 
 			if (pos < len) {
+				struct page *page = skb_shinfo(skb)->frags[i].page;
 				/* Split frag.
 				 * We have two variants in this case:
 				 * 1. Move all the frag to the second
@@ -2060,7 +2142,7 @@ static inline void skb_split_no_header(s
 				 *    where splitting is expensive.
 				 * 2. Split is accurately. We make this.
 				 */
-				get_page(skb_shinfo(skb)->frags[i].page);
+				skb_get_page(skb1, page);
 				skb_shinfo(skb1)->frags[0].page_offset += len - pos;
 				skb_shinfo(skb1)->frags[0].size -= len - pos;
 				skb_shinfo(skb)->frags[i].size	= len - pos;
@@ -2559,8 +2641,9 @@ struct sk_buff *skb_segment(struct sk_bu
 			skb_release_head_state(nskb);
 			__skb_push(nskb, doffset);
 		} else {
-			nskb = alloc_skb(hsize + doffset + headroom,
-					 GFP_ATOMIC);
+			nskb = __alloc_skb(hsize + doffset + headroom,
+					 GFP_ATOMIC, skb_alloc_rx_flag(skb),
+					 -1);
 
 			if (unlikely(!nskb))
 				goto err;
@@ -2602,7 +2685,7 @@ struct sk_buff *skb_segment(struct sk_bu
 
 		while (pos < offset + len && i < nfrags) {
 			*frag = skb_shinfo(skb)->frags[i];
-			get_page(frag->page);
+			skb_get_page(nskb, frag->page);
 			size = frag->size;
 
 			if (pos < offset) {

* [PATCH 17/31] Fix initialization of ipv4_route_lock
From: Suresh Jayaraman @ 2009-10-01 14:08 UTC
  To: Linus Torvalds, Andrew Morton, linux-kernel, linux-mm
  Cc: netdev, Neil Brown, Miklos Szeredi, Wouter Verhelst,
	Peter Zijlstra, trond.myklebust, Jeff Mahoney, Suresh Jayaraman

From: Jeff Mahoney <jeffm@suse.com>

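The #ifdef guarding mutex_init(&ipv4_route_lock) used a misspelled symbol, so
the mutex was never initialized. It's CONFIG_PROC_FS, not CONFIG_PROCFS.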

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de>
---
 net/ipv4/route.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: mmotm/net/ipv4/route.c
===================================================================
--- mmotm.orig/net/ipv4/route.c
+++ mmotm/net/ipv4/route.c
@@ -3483,7 +3483,7 @@ int __init ip_rt_init(void)
 	ipv4_dst_ops.gc_thresh = (rt_hash_mask + 1);
 	ip_rt_max_size = (rt_hash_mask + 1) * 16;
 
-#ifdef CONFIG_PROCFS
+#ifdef CONFIG_PROC_FS
 	mutex_init(&ipv4_route_lock);
 #endif
 

* [PATCH 16/31] netvm: INET reserves.
From: Suresh Jayaraman @ 2009-10-01 14:07 UTC
  To: Linus Torvalds, Andrew Morton, linux-kernel, linux-mm
  Cc: netdev, Neil Brown, Miklos Szeredi, Wouter Verhelst,
	Peter Zijlstra, trond.myklebust, Suresh Jayaraman

From: Peter Zijlstra <a.p.zijlstra@chello.nl> 

Add reserves for INET.

The two big users seem to be the route cache and ip-fragment cache.

Reserve the route cache under the generic RX reserve; its usage is bounded by
the high reclaim watermark, and thus does not need further accounting.

Reserve the ip-fragment caches under the SKB data reserve; these add to the
SKB RX limit. By ensuring we can at least receive as much data as fits in
the reassembly line we avoid fragment attack deadlocks.

Adds to the reserve tree:

  total network reserve
    network TX reserve
      protocol TX pages
    network RX reserve
+     IPv6 route cache
+     IPv4 route cache
      SKB data reserve
+       IPv6 fragment cache
+       IPv4 fragment cache

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de>
---
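
The sysctl handlers added below share one pattern: parse the new value into
a temporary, try to resize the reserve, and commit the new threshold only if
that succeeds. A minimal user-space sketch follows, assuming a hypothetical
model_reserve_set() in place of mem_reserve_kmalloc_set() and a made-up
reserve_max cap to provoke a failure. Locking, which the real handlers do
with a mutex, is left out.

```c
#include <errno.h>

/* Simplified stand-in for the per-namespace fragment state. */
struct model_frags {
	int high_thresh;	/* value exposed through sysctl */
	long reserve_bytes;	/* current size of the backing reserve */
	long reserve_max;	/* hypothetical cap so resizing can fail */
};

/* Hypothetical stand-in for mem_reserve_kmalloc_set(). */
static int model_reserve_set(struct model_frags *f, int bytes)
{
	if (bytes < 0 || bytes > f->reserve_max)
		return -ENOMEM;
	f->reserve_bytes = bytes;
	return 0;
}

/*
 * Models the write path of proc_dointvec_fragment(): the threshold is
 * updated only after the reserve has been resized successfully, so the
 * two values cannot drift apart.
 */
static int model_write_high_thresh(struct model_frags *f, int new_bytes)
{
	int ret = model_reserve_set(f, new_bytes);

	if (!ret)
		f->high_thresh = new_bytes;
	return ret;
}
```

A failed resize leaves both the old threshold and the old reserve size in
place, so the fragment cache never advertises a limit that the reserve
cannot back.
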
 include/net/inet_frag.h  |    7 +++
 include/net/netns/ipv6.h |    4 ++
 net/ipv4/inet_fragment.c |    3 +
 net/ipv4/ip_fragment.c   |   86 +++++++++++++++++++++++++++++++++++++++++++++--
 net/ipv4/route.c         |   70 +++++++++++++++++++++++++++++++++++++-
 net/ipv6/reassembly.c    |   85 +++++++++++++++++++++++++++++++++++++++++++++-
 net/ipv6/route.c         |   77 ++++++++++++++++++++++++++++++++++++++++--
 7 files changed, 325 insertions(+), 7 deletions(-)

Index: mmotm/net/ipv4/ip_fragment.c
===================================================================
--- mmotm.orig/net/ipv4/ip_fragment.c
+++ mmotm/net/ipv4/ip_fragment.c
@@ -42,6 +42,8 @@
 #include <linux/udp.h>
 #include <linux/inet.h>
 #include <linux/netfilter_ipv4.h>
+#include <linux/reserve.h>
+#include <linux/nsproxy.h>
 
 /* NOTE. Logic of IP defragmentation is parallel to corresponding IPv6
  * code now. If you change something here, _PLEASE_ update ipv6/reassembly.c
@@ -599,6 +601,63 @@ int ip_defrag(struct sk_buff *skb, u32 u
 }
 
 #ifdef CONFIG_SYSCTL
+static int
+proc_dointvec_fragment(struct ctl_table *table, int write, struct file *filp,
+		void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+	struct net *net = container_of(table->data, struct net,
+				       ipv4.frags.high_thresh);
+	ctl_table tmp = *table;
+	int new_bytes, ret;
+
+	mutex_lock(&net->ipv4.frags.lock);
+	if (write) {
+		tmp.data = &new_bytes;
+		table = &tmp;
+	}
+
+	ret = proc_dointvec(table, write, filp, buffer, lenp, ppos);
+
+	if (!ret && write) {
+		ret = mem_reserve_kmalloc_set(&net->ipv4.frags.reserve,
+				new_bytes);
+		if (!ret)
+			net->ipv4.frags.high_thresh = new_bytes;
+	}
+	mutex_unlock(&net->ipv4.frags.lock);
+
+	return ret;
+}
+
+static int sysctl_intvec_fragment(struct ctl_table *table,
+		void __user *oldval, size_t __user *oldlenp,
+		void __user *newval, size_t newlen)
+{
+	struct net *net = container_of(table->data, struct net,
+				       ipv4.frags.high_thresh);
+	int write = (newval && newlen);
+	ctl_table tmp = *table;
+	int new_bytes, ret;
+
+	mutex_lock(&net->ipv4.frags.lock);
+	if (write) {
+		tmp.data = &new_bytes;
+		table = &tmp;
+	}
+
+	ret = sysctl_intvec(table, oldval, oldlenp, newval, newlen);
+
+	if (!ret && write) {
+		ret = mem_reserve_kmalloc_set(&net->ipv4.frags.reserve,
+				new_bytes);
+		if (!ret)
+			net->ipv4.frags.high_thresh = new_bytes;
+	}
+	mutex_unlock(&net->ipv4.frags.lock);
+
+	return ret;
+}
+
 static int zero;
 
 static struct ctl_table ip4_frags_ns_ctl_table[] = {
@@ -608,7 +667,8 @@ static struct ctl_table ip4_frags_ns_ctl
 		.data		= &init_net.ipv4.frags.high_thresh,
 		.maxlen		= sizeof(int),
 		.mode		= 0644,
-		.proc_handler	= proc_dointvec
+		.proc_handler	= &proc_dointvec_fragment,
+		.strategy	= &sysctl_intvec_fragment,
 	},
 	{
 		.ctl_name	= NET_IPV4_IPFRAG_LOW_THRESH,
@@ -711,6 +771,8 @@ static inline void ip4_frags_ctl_registe
 
 static int ipv4_frags_init_net(struct net *net)
 {
+	int ret;
+
 	/*
 	 * Fragment cache limits. We will commit 256K at one time. Should we
 	 * cross that limit we will prune down to 192K. This should cope with
@@ -728,11 +790,31 @@ static int ipv4_frags_init_net(struct ne
 
 	inet_frags_init_net(&net->ipv4.frags);
 
-	return ip4_frags_ns_ctl_register(net);
+	ret = ip4_frags_ns_ctl_register(net);
+	if (ret)
+		goto out_reg;
+
+	mem_reserve_init(&net->ipv4.frags.reserve, "IPv4 fragment cache",
+			&net_skb_reserve);
+	ret = mem_reserve_kmalloc_set(&net->ipv4.frags.reserve,
+			net->ipv4.frags.high_thresh);
+	if (ret)
+		goto out_reserve;
+
+	return 0;
+
+out_reserve:
+	mem_reserve_disconnect(&net->ipv4.frags.reserve);
+	ip4_frags_ns_ctl_unregister(net);
+out_reg:
+	inet_frags_exit_net(&net->ipv4.frags, &ip4_frags);
+
+	return ret;
 }
 
 static void ipv4_frags_exit_net(struct net *net)
 {
+	mem_reserve_disconnect(&net->ipv4.frags.reserve);
 	ip4_frags_ns_ctl_unregister(net);
 	inet_frags_exit_net(&net->ipv4.frags, &ip4_frags);
 }
Index: mmotm/net/ipv6/reassembly.c
===================================================================
--- mmotm.orig/net/ipv6/reassembly.c
+++ mmotm/net/ipv6/reassembly.c
@@ -41,6 +41,7 @@
 #include <linux/random.h>
 #include <linux/jhash.h>
 #include <linux/skbuff.h>
+#include <linux/reserve.h>
 
 #include <net/sock.h>
 #include <net/snmp.h>
@@ -634,6 +635,63 @@ static struct inet6_protocol frag_protoc
 };
 
 #ifdef CONFIG_SYSCTL
+static int
+proc_dointvec_fragment(struct ctl_table *table, int write, struct file *filp,
+		void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+	struct net *net = container_of(table->data, struct net,
+				       ipv6.frags.high_thresh);
+	ctl_table tmp = *table;
+	int new_bytes, ret;
+
+	mutex_lock(&net->ipv6.frags.lock);
+	if (write) {
+		tmp.data = &new_bytes;
+		table = &tmp;
+	}
+
+	ret = proc_dointvec(table, write, filp, buffer, lenp, ppos);
+
+	if (!ret && write) {
+		ret = mem_reserve_kmalloc_set(&net->ipv6.frags.reserve,
+					      new_bytes);
+		if (!ret)
+			net->ipv6.frags.high_thresh = new_bytes;
+	}
+	mutex_unlock(&net->ipv6.frags.lock);
+
+	return ret;
+}
+
+static int sysctl_intvec_fragment(struct ctl_table *table,
+		void __user *oldval, size_t __user *oldlenp,
+		void __user *newval, size_t newlen)
+{
+	struct net *net = container_of(table->data, struct net,
+				       ipv6.frags.high_thresh);
+	int write = (newval && newlen);
+	ctl_table tmp = *table;
+	int new_bytes, ret;
+
+	mutex_lock(&net->ipv6.frags.lock);
+	if (write) {
+		tmp.data = &new_bytes;
+		table = &tmp;
+	}
+
+	ret = sysctl_intvec(table, oldval, oldlenp, newval, newlen);
+
+	if (!ret && write) {
+		ret = mem_reserve_kmalloc_set(&net->ipv6.frags.reserve,
+					      new_bytes);
+		if (!ret)
+			net->ipv6.frags.high_thresh = new_bytes;
+	}
+	mutex_unlock(&net->ipv6.frags.lock);
+
+	return ret;
+}
+
 static struct ctl_table ip6_frags_ns_ctl_table[] = {
 	{
 		.ctl_name	= NET_IPV6_IP6FRAG_HIGH_THRESH,
@@ -641,7 +699,8 @@ static struct ctl_table ip6_frags_ns_ctl
 		.data		= &init_net.ipv6.frags.high_thresh,
 		.maxlen		= sizeof(int),
 		.mode		= 0644,
-		.proc_handler	= proc_dointvec
+		.proc_handler	= &proc_dointvec_fragment,
+		.strategy	= &sysctl_intvec_fragment,
 	},
 	{
 		.ctl_name	= NET_IPV6_IP6FRAG_LOW_THRESH,
@@ -750,17 +809,39 @@ static inline void ip6_frags_sysctl_unre
 
 static int ipv6_frags_init_net(struct net *net)
 {
+	int ret;
+
 	net->ipv6.frags.high_thresh = 256 * 1024;
 	net->ipv6.frags.low_thresh = 192 * 1024;
 	net->ipv6.frags.timeout = IPV6_FRAG_TIMEOUT;
 
 	inet_frags_init_net(&net->ipv6.frags);
 
-	return ip6_frags_ns_sysctl_register(net);
+	ret = ip6_frags_ns_sysctl_register(net);
+	if (ret)
+		goto out_reg;
+
+	mem_reserve_init(&net->ipv6.frags.reserve, "IPv6 fragment cache",
+			 &net_skb_reserve);
+	ret = mem_reserve_kmalloc_set(&net->ipv6.frags.reserve,
+				      net->ipv6.frags.high_thresh);
+	if (ret)
+		goto out_reserve;
+
+	return 0;
+
+out_reserve:
+	mem_reserve_disconnect(&net->ipv6.frags.reserve);
+	ip6_frags_ns_sysctl_unregister(net);
+out_reg:
+	inet_frags_exit_net(&net->ipv6.frags, &ip6_frags);
+
+	return ret;
 }
 
 static void ipv6_frags_exit_net(struct net *net)
 {
+	mem_reserve_disconnect(&net->ipv6.frags.reserve);
 	ip6_frags_ns_sysctl_unregister(net);
 	inet_frags_exit_net(&net->ipv6.frags, &ip6_frags);
 }
Index: mmotm/net/ipv4/route.c
===================================================================
--- mmotm.orig/net/ipv4/route.c
+++ mmotm/net/ipv4/route.c
@@ -107,6 +107,7 @@
 #ifdef CONFIG_SYSCTL
 #include <linux/sysctl.h>
 #endif
+#include <linux/reserve.h>
 
 #define RT_FL_TOS(oldflp) \
     ((u32)(oldflp->fl4_tos & (IPTOS_RT_MASK | RTO_ONLINK)))
@@ -271,6 +272,8 @@ static inline int rt_genid(struct net *n
 	return atomic_read(&net->ipv4.rt_genid);
 }
 
+static struct mem_reserve ipv4_route_reserve;
+
 #ifdef CONFIG_PROC_FS
 struct rt_cache_iter_state {
 	struct seq_net_private p;
@@ -400,6 +403,61 @@ static int rt_cache_seq_show(struct seq_
 	return 0;
 }
 
+static struct mutex ipv4_route_lock;
+
+static int
+proc_dointvec_route(struct ctl_table *table, int write, struct file *filp,
+		void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+	ctl_table tmp = *table;
+	int new_size, ret;
+
+	mutex_lock(&ipv4_route_lock);
+	if (write) {
+		tmp.data = &new_size;
+		table = &tmp;
+	}
+
+	ret = proc_dointvec(table, write, filp, buffer, lenp, ppos);
+
+	if (!ret && write) {
+		ret = mem_reserve_kmem_cache_set(&ipv4_route_reserve,
+				ipv4_dst_ops.kmem_cachep, new_size);
+		if (!ret)
+			ip_rt_max_size = new_size;
+	}
+	mutex_unlock(&ipv4_route_lock);
+
+	return ret;
+}
+
+static int sysctl_intvec_route(struct ctl_table *table,
+		void __user *oldval, size_t __user *oldlenp,
+		void __user *newval, size_t newlen)
+{
+	int write = (newval && newlen);
+	ctl_table tmp = *table;
+	int new_size, ret;
+
+	mutex_lock(&ipv4_route_lock);
+	if (write) {
+		tmp.data = &new_size;
+		table = &tmp;
+	}
+
+	ret = sysctl_intvec(table, oldval, oldlenp, newval, newlen);
+
+	if (!ret && write) {
+		ret = mem_reserve_kmem_cache_set(&ipv4_route_reserve,
+				ipv4_dst_ops.kmem_cachep, new_size);
+		if (!ret)
+			ip_rt_max_size = new_size;
+	}
+	mutex_unlock(&ipv4_route_lock);
+
+	return ret;
+}
+
 static const struct seq_operations rt_cache_seq_ops = {
 	.start  = rt_cache_seq_start,
 	.next   = rt_cache_seq_next,
@@ -3145,7 +3203,8 @@ static ctl_table ipv4_route_table[] = {
 		.data		= &ip_rt_max_size,
 		.maxlen		= sizeof(int),
 		.mode		= 0644,
-		.proc_handler	= proc_dointvec,
+		.proc_handler	= &proc_dointvec_route,
+		.strategy	= &sysctl_intvec_route,
 	},
 	{
 		/*  Deprecated. Use gc_min_interval_ms */
@@ -3424,6 +3483,15 @@ int __init ip_rt_init(void)
 	ipv4_dst_ops.gc_thresh = (rt_hash_mask + 1);
 	ip_rt_max_size = (rt_hash_mask + 1) * 16;
 
+#ifdef CONFIG_PROC_FS
+	mutex_init(&ipv4_route_lock);
+#endif
+
+	mem_reserve_init(&ipv4_route_reserve, "IPv4 route cache",
+			&net_rx_reserve);
+	mem_reserve_kmem_cache_set(&ipv4_route_reserve,
+			ipv4_dst_ops.kmem_cachep, ip_rt_max_size);
+
 	devinet_init();
 	ip_fib_init();
 
Index: mmotm/net/ipv6/route.c
===================================================================
--- mmotm.orig/net/ipv6/route.c
+++ mmotm/net/ipv6/route.c
@@ -37,6 +37,7 @@
 #include <linux/mroute6.h>
 #include <linux/init.h>
 #include <linux/if_arp.h>
+#include <linux/reserve.h>
 #include <linux/proc_fs.h>
 #include <linux/seq_file.h>
 #include <linux/nsproxy.h>
@@ -2537,6 +2538,63 @@ int ipv6_sysctl_rtcache_flush(ctl_table
 		return -EINVAL;
 }
 
+static int
+proc_dointvec_route(struct ctl_table *table, int write, struct file *filp,
+		void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+	struct net *net = container_of(table->data, struct net,
+				       ipv6.sysctl.ip6_rt_max_size);
+	ctl_table tmp = *table;
+	int new_size, ret;
+
+	mutex_lock(&net->ipv6.sysctl.ip6_rt_lock);
+	if (write) {
+		tmp.data = &new_size;
+		table = &tmp;
+	}
+
+	ret = proc_dointvec(table, write, filp, buffer, lenp, ppos);
+
+	if (!ret && write) {
+		ret = mem_reserve_kmem_cache_set(&net->ipv6.ip6_rt_reserve,
+				net->ipv6.ip6_dst_ops->kmem_cachep, new_size);
+		if (!ret)
+			net->ipv6.sysctl.ip6_rt_max_size = new_size;
+	}
+	mutex_unlock(&net->ipv6.sysctl.ip6_rt_lock);
+
+	return ret;
+}
+
+static int sysctl_intvec_route(struct ctl_table *table,
+		void __user *oldval, size_t __user *oldlenp,
+		void __user *newval, size_t newlen)
+{
+	struct net *net = container_of(table->data, struct net,
+				       ipv6.sysctl.ip6_rt_max_size);
+	int write = (newval && newlen);
+	ctl_table tmp = *table;
+	int new_size, ret;
+
+	mutex_lock(&net->ipv6.sysctl.ip6_rt_lock);
+	if (write) {
+		tmp.data = &new_size;
+		table = &tmp;
+	}
+
+	ret = sysctl_intvec(table, oldval, oldlenp, newval, newlen);
+
+	if (!ret && write) {
+		ret = mem_reserve_kmem_cache_set(&net->ipv6.ip6_rt_reserve,
+				net->ipv6.ip6_dst_ops->kmem_cachep, new_size);
+		if (!ret)
+			net->ipv6.sysctl.ip6_rt_max_size = new_size;
+	}
+	mutex_unlock(&net->ipv6.sysctl.ip6_rt_lock);
+
+	return ret;
+}
+
 ctl_table ipv6_route_table_template[] = {
 	{
 		.procname	=	"flush",
@@ -2559,7 +2617,8 @@ ctl_table ipv6_route_table_template[] =
 		.data		=	&init_net.ipv6.sysctl.ip6_rt_max_size,
 		.maxlen		=	sizeof(int),
 		.mode		=	0644,
-		.proc_handler	=	proc_dointvec,
+		.proc_handler	=	&proc_dointvec_route,
+		.strategy	= 	&sysctl_intvec_route,
 	},
 	{
 		.ctl_name	=	NET_IPV6_ROUTE_GC_MIN_INTERVAL,
@@ -2647,6 +2706,8 @@ struct ctl_table *ipv6_route_sysctl_init
 		table[8].data = &net->ipv6.sysctl.ip6_rt_min_advmss;
 	}
 
+	mutex_init(&net->ipv6.sysctl.ip6_rt_lock);
+
 	return table;
 }
 #endif
@@ -2700,6 +2761,14 @@ static int ip6_route_net_init(struct net
 	net->ipv6.sysctl.ip6_rt_mtu_expires = 10*60*HZ;
 	net->ipv6.sysctl.ip6_rt_min_advmss = IPV6_MIN_MTU - 20 - 40;
 
+	mem_reserve_init(&net->ipv6.ip6_rt_reserve, "IPv6 route cache",
+			 &net_rx_reserve);
+	ret = mem_reserve_kmem_cache_set(&net->ipv6.ip6_rt_reserve,
+			net->ipv6.ip6_dst_ops->kmem_cachep,
+			net->ipv6.sysctl.ip6_rt_max_size);
+	if (ret)
+		goto out_reserve_fail;
+
 #ifdef CONFIG_PROC_FS
 	proc_net_fops_create(net, "ipv6_route", 0, &ipv6_route_proc_fops);
 	proc_net_fops_create(net, "rt6_stats", S_IRUGO, &rt6_stats_seq_fops);
@@ -2710,12 +2779,15 @@ static int ip6_route_net_init(struct net
 out:
 	return ret;
 
+out_reserve_fail:
+	mem_reserve_disconnect(&net->ipv6.ip6_rt_reserve);
 #ifdef CONFIG_IPV6_MULTIPLE_TABLES
+	kfree(net->ipv6.ip6_blk_hole_entry);
 out_ip6_prohibit_entry:
 	kfree(net->ipv6.ip6_prohibit_entry);
 out_ip6_null_entry:
-	kfree(net->ipv6.ip6_null_entry);
 #endif
+	kfree(net->ipv6.ip6_null_entry);
 out_ip6_dst_ops:
 	release_net(net->ipv6.ip6_dst_ops->dst_net);
 	kfree(net->ipv6.ip6_dst_ops);
@@ -2728,6 +2800,7 @@ static void ip6_route_net_exit(struct ne
 	proc_net_remove(net, "ipv6_route");
 	proc_net_remove(net, "rt6_stats");
 #endif
+	mem_reserve_disconnect(&net->ipv6.ip6_rt_reserve);
 	kfree(net->ipv6.ip6_null_entry);
 #ifdef CONFIG_IPV6_MULTIPLE_TABLES
 	kfree(net->ipv6.ip6_prohibit_entry);
Index: mmotm/include/net/inet_frag.h
===================================================================
--- mmotm.orig/include/net/inet_frag.h
+++ mmotm/include/net/inet_frag.h
@@ -1,6 +1,9 @@
 #ifndef __NET_FRAG_H__
 #define __NET_FRAG_H__
 
+#include <linux/reserve.h>
+#include <linux/mutex.h>
+
 struct netns_frags {
 	int			nqueues;
 	atomic_t		mem;
@@ -10,6 +13,10 @@ struct netns_frags {
 	int			timeout;
 	int			high_thresh;
 	int			low_thresh;
+
+	/* reserves */
+	struct mutex		lock;
+	struct mem_reserve	reserve;
 };
 
 struct inet_frag_queue {
Index: mmotm/net/ipv4/inet_fragment.c
===================================================================
--- mmotm.orig/net/ipv4/inet_fragment.c
+++ mmotm/net/ipv4/inet_fragment.c
@@ -19,6 +19,7 @@
 #include <linux/random.h>
 #include <linux/skbuff.h>
 #include <linux/rtnetlink.h>
+#include <linux/reserve.h>
 
 #include <net/inet_frag.h>
 
@@ -74,6 +75,8 @@ void inet_frags_init_net(struct netns_fr
 	nf->nqueues = 0;
 	atomic_set(&nf->mem, 0);
 	INIT_LIST_HEAD(&nf->lru_list);
+	mutex_init(&nf->lock);
+	mem_reserve_init(&nf->reserve, "IP fragment cache", NULL);
 }
 EXPORT_SYMBOL(inet_frags_init_net);
 
Index: mmotm/include/net/netns/ipv6.h
===================================================================
--- mmotm.orig/include/net/netns/ipv6.h
+++ mmotm/include/net/netns/ipv6.h
@@ -24,6 +24,8 @@ struct netns_sysctl_ipv6 {
 	int ip6_rt_mtu_expires;
 	int ip6_rt_min_advmss;
 	int icmpv6_time;
+
+	struct mutex ip6_rt_lock;
 };
 
 struct netns_ipv6 {
@@ -55,6 +57,8 @@ struct netns_ipv6 {
 	struct sock             *ndisc_sk;
 	struct sock             *tcp_sk;
 	struct sock             *igmp_sk;
+
+	struct mem_reserve	ip6_rt_reserve;
 #ifdef CONFIG_IPV6_MROUTE
 	struct sock		*mroute6_sk;
 	struct mfc6_cache	**mfc6_cache_array;

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

* [PATCH 15/31] netvm: network reserve infrastructure
From: Suresh Jayaraman @ 2009-10-01 14:07 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, linux-kernel, linux-mm
  Cc: netdev, Neil Brown, Miklos Szeredi, Wouter Verhelst,
	Peter Zijlstra, trond.myklebust, Suresh Jayaraman

From: Peter Zijlstra <a.p.zijlstra@chello.nl> 

Provide the basic infrastructure to reserve and charge/account network memory.

We provide the following reserve tree:

1)  total network reserve
2)    network TX reserve
3)      protocol TX pages
4)    network RX reserve
5)      SKB data reserve

[1] is used to make all the network reserves a single subtree, for easy
manipulation.

[2] and [4] are merely for aesthetic reasons.

The TX pages reserve [3] is assumed to be bounded, since it is the upper bound of
memory that can be used for sending pages (not quite true, but good enough).

The SKB reserve [5] is an aggregate reserve, which is used to charge SKB data
against in the fallback path.

The consumers for these reserves are sockets marked with:
  SOCK_MEMALLOC

Such sockets are to be used to service the VM (i.e. to swap over). They
must be handled kernel side; exposing such a socket to user-space is a BUG.
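
As a rough illustration of how the reserve tree is tied to the consumers: the
patch below attaches the network subtree to the root only while at least one
SOCK_MEMALLOC socket exists. The following is a minimal userspace sketch of
that counting logic, not the kernel code. The names toy_memalloc_socks,
toy_tree_connected and toy_adjust_memalloc are hypothetical, and locking and
the TX page accounting are left out.

```c
#include <assert.h>

/* Hypothetical stand-ins for the reserve tree state; not kernel symbols. */
static int toy_memalloc_socks;	/* number of sockets marked as memalloc */
static int toy_tree_connected;	/* 1 while the network subtree is attached */

/*
 * Adjust the socket count by @socks (may be negative) and attach or detach
 * the subtree on the 0 -> !0 and !0 -> 0 transitions.
 */
static void toy_adjust_memalloc(int socks)
{
	/* First user shows up: connect the subtree before counting it. */
	if (!toy_memalloc_socks && socks > 0)
		toy_tree_connected = 1;

	toy_memalloc_socks += socks;
	assert(toy_memalloc_socks >= 0);

	/* Last user went away: disconnect the subtree again. */
	if (!toy_memalloc_socks && socks)
		toy_tree_connected = 0;
}
```

Calling toy_adjust_memalloc(1) twice and toy_adjust_memalloc(-1) once leaves
the subtree connected; only the second toy_adjust_memalloc(-1) detaches it.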

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de>
---
 include/net/sock.h |   43 ++++++++++++++++++++-
 net/Kconfig        |    3 +
 net/core/sock.c    |  107 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 152 insertions(+), 1 deletion(-)

Index: mmotm/include/net/sock.h
===================================================================
--- mmotm.orig/include/net/sock.h
+++ mmotm/include/net/sock.h
@@ -51,6 +51,7 @@
 #include <linux/skbuff.h>	/* struct sk_buff */
 #include <linux/mm.h>
 #include <linux/security.h>
+#include <linux/reserve.h>
 
 #include <linux/filter.h>
 #include <linux/rculist_nulls.h>
@@ -497,6 +498,7 @@ enum sock_flags {
 	SOCK_RCVTSTAMPNS, /* %SO_TIMESTAMPNS setting */
 	SOCK_LOCALROUTE, /* route locally only, %SO_DONTROUTE setting */
 	SOCK_QUEUE_SHRUNK, /* write queue has been shrunk recently */
+	SOCK_MEMALLOC, /* the VM depends on us - make sure we're serviced */
 	SOCK_TIMESTAMPING_TX_HARDWARE,  /* %SOF_TIMESTAMPING_TX_HARDWARE */
 	SOCK_TIMESTAMPING_TX_SOFTWARE,  /* %SOF_TIMESTAMPING_TX_SOFTWARE */
 	SOCK_TIMESTAMPING_RX_HARDWARE,  /* %SOF_TIMESTAMPING_RX_HARDWARE */
@@ -526,9 +528,48 @@ static inline int sock_flag(struct sock
 	return test_bit(flag, &sk->sk_flags);
 }
 
+static inline int sk_has_memalloc(struct sock *sk)
+{
+	return sock_flag(sk, SOCK_MEMALLOC);
+}
+
+extern struct mem_reserve net_rx_reserve;
+extern struct mem_reserve net_skb_reserve;
+
+#ifdef CONFIG_NETVM
+/*
+ * Guesstimate the per request queue TX upper bound.
+ *
+ * Max packet size is 64k, and we need to reserve that much since the data
+ * might need to be bounced. Double it to be on the safe side.
+ */
+#define TX_RESERVE_PAGES DIV_ROUND_UP(2*65536, PAGE_SIZE)
+
+extern int memalloc_socks;
+
+static inline int sk_memalloc_socks(void)
+{
+	return memalloc_socks;
+}
+
+extern int sk_adjust_memalloc(int socks, long tx_reserve_pages);
+extern int sk_set_memalloc(struct sock *sk);
+extern int sk_clear_memalloc(struct sock *sk);
+#else
+static inline int sk_memalloc_socks(void)
+{
+	return 0;
+}
+
+static inline int sk_clear_memalloc(struct sock *sk)
+{
+	return 0;
+}
+#endif
+
 static inline gfp_t sk_allocation(struct sock *sk, gfp_t gfp_mask)
 {
-	return gfp_mask;
+	return gfp_mask | (sk->sk_allocation & __GFP_MEMALLOC);
 }
 
 static inline void sk_acceptq_removed(struct sock *sk)
Index: mmotm/net/core/sock.c
===================================================================
--- mmotm.orig/net/core/sock.c
+++ mmotm/net/core/sock.c
@@ -110,6 +110,7 @@
 #include <linux/tcp.h>
 #include <linux/init.h>
 #include <linux/highmem.h>
+#include <linux/reserve.h>
 
 #include <asm/uaccess.h>
 #include <asm/system.h>
@@ -217,6 +218,105 @@ __u32 sysctl_rmem_default __read_mostly
 int sysctl_optmem_max __read_mostly = sizeof(unsigned long)*(2*UIO_MAXIOV+512);
 EXPORT_SYMBOL(sysctl_optmem_max);
 
+static struct mem_reserve net_reserve;
+struct mem_reserve net_rx_reserve;
+EXPORT_SYMBOL_GPL(net_rx_reserve); /* modular ipv6 only */
+struct mem_reserve net_skb_reserve;
+EXPORT_SYMBOL_GPL(net_skb_reserve); /* modular ipv6 only */
+static struct mem_reserve net_tx_reserve;
+static struct mem_reserve net_tx_pages;
+
+#ifdef CONFIG_NETVM
+static DEFINE_MUTEX(memalloc_socks_lock);
+int memalloc_socks;
+
+/**
+ *	sk_adjust_memalloc - adjust the global memalloc reserve for critical RX
+ *	@socks: number of new %SOCK_MEMALLOC sockets
+ *	@tx_reserve_pages: number of pages to (un)reserve for TX
+ *
+ *	This function adjusts the memalloc reserve based on system demand.
+ *	The RX reserve is a limit, and only added once, not for each socket.
+ *
+ *	NOTE:
+ *	   @tx_reserve_pages is an upper-bound of memory used for TX hence
+ *	   we need not account the pages like we do for RX pages.
+ */
+int sk_adjust_memalloc(int socks, long tx_reserve_pages)
+{
+	int err;
+
+	mutex_lock(&memalloc_socks_lock);
+	err = mem_reserve_pages_add(&net_tx_pages, tx_reserve_pages);
+	if (err)
+		goto unlock;
+
+	/*
+	 * either socks is positive and we need to check for 0 -> !0
+	 * transition and connect the reserve tree when we observe it.
+	 */
+	if (!memalloc_socks && socks > 0) {
+		err = mem_reserve_connect(&net_reserve, &mem_reserve_root);
+		if (err) {
+			/*
+			 * if we failed to connect the tree, undo the tx
+			 * reserve so that failure has no side effects.
+			 */
+			mem_reserve_pages_add(&net_tx_pages, -tx_reserve_pages);
+			goto unlock;
+		}
+	}
+	memalloc_socks += socks;
+	/*
+	 * or socks is negative and we must observe the !0 -> 0 transition
+	 * and disconnect the reserve tree.
+	 */
+	if (!memalloc_socks && socks)
+		mem_reserve_disconnect(&net_reserve);
+
+unlock:
+	mutex_unlock(&memalloc_socks_lock);
+
+	return err;
+}
+EXPORT_SYMBOL_GPL(sk_adjust_memalloc);
+
+/**
+ *	sk_set_memalloc - sets %SOCK_MEMALLOC
+ *	@sk: socket to set it on
+ *
+ *	Set %SOCK_MEMALLOC on a socket and increase the memalloc reserve
+ *	accordingly.
+ */
+int sk_set_memalloc(struct sock *sk)
+{
+	int set = sock_flag(sk, SOCK_MEMALLOC);
+
+	if (!set) {
+		int err = sk_adjust_memalloc(1, 0);
+		if (err)
+			return err;
+
+		sock_set_flag(sk, SOCK_MEMALLOC);
+		sk->sk_allocation |= __GFP_MEMALLOC;
+	}
+	return !set;
+}
+EXPORT_SYMBOL_GPL(sk_set_memalloc);
+
+int sk_clear_memalloc(struct sock *sk)
+{
+	int set = sock_flag(sk, SOCK_MEMALLOC);
+	if (set) {
+		sk_adjust_memalloc(-1, 0);
+		sock_reset_flag(sk, SOCK_MEMALLOC);
+		sk->sk_allocation &= ~__GFP_MEMALLOC;
+	}
+	return set;
+}
+EXPORT_SYMBOL_GPL(sk_clear_memalloc);
+#endif
+
 static int sock_set_timeout(long *timeo_p, char __user *optval, int optlen)
 {
 	struct timeval tv;
@@ -1036,6 +1136,7 @@ static void __sk_free(struct sock *sk)
 {
 	struct sk_filter *filter;
 
+	sk_clear_memalloc(sk);
 	if (sk->sk_destruct)
 		sk->sk_destruct(sk);
 
@@ -1205,6 +1306,12 @@ void __init sk_init(void)
 		sysctl_wmem_max = 131071;
 		sysctl_rmem_max = 131071;
 	}
+
+	mem_reserve_init(&net_reserve, "total network reserve", NULL);
+	mem_reserve_init(&net_rx_reserve, "network RX reserve", &net_reserve);
+	mem_reserve_init(&net_skb_reserve, "SKB data reserve", &net_rx_reserve);
+	mem_reserve_init(&net_tx_reserve, "network TX reserve", &net_reserve);
+	mem_reserve_init(&net_tx_pages, "protocol TX pages", &net_tx_reserve);
 }
 
 /*
Index: mmotm/net/Kconfig
===================================================================
--- mmotm.orig/net/Kconfig
+++ mmotm/net/Kconfig
@@ -256,4 +256,7 @@ source "net/wimax/Kconfig"
 source "net/rfkill/Kconfig"
 source "net/9p/Kconfig"
 
+config NETVM
+	def_bool n
+
 endif   # if NET

* [PATCH 14/31] net: sk_allocation() - concentrate socket related allocations
From: Suresh Jayaraman @ 2009-10-01 14:07 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, linux-kernel, linux-mm
  Cc: netdev, Neil Brown, Miklos Szeredi, Wouter Verhelst,
	Peter Zijlstra, trond.myklebust, Suresh Jayaraman

From: Peter Zijlstra <a.p.zijlstra@chello.nl> 

Introduce sk_allocation(). This function allows sock-specific flags to be
injected into each sock-related allocation.
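
To make the intent concrete, here is a small userspace sketch of the idea. In
this patch sk_allocation() returns the mask unchanged; the netvm patch in this
series (15/31) later ORs in a per-socket flag. The sketch models that later
form. All toy_* names and the flag values are hypothetical and do not match
the kernel's gfp_t bits.

```c
#include <assert.h>

typedef unsigned int toy_gfp_t;

/* Illustrative flag values only; the real GFP bits differ. */
#define TOY_GFP_ATOMIC   0x01u
#define TOY_GFP_MEMALLOC 0x10u

/* Hypothetical socket carrying its own allocation flags. */
struct toy_sock {
	toy_gfp_t allocation;
};

/*
 * Return the caller's mask plus the socket's MEMALLOC bit, if set.
 * Every other per-socket bit is deliberately ignored.
 */
static toy_gfp_t toy_sk_allocation(const struct toy_sock *sk, toy_gfp_t gfp_mask)
{
	return gfp_mask | (sk->allocation & TOY_GFP_MEMALLOC);
}
```

Routing every socket-related allocation through one helper means a flag set
once on the socket reaches all call sites without touching them again.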

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de>
---
 include/net/sock.h    |    5 +++++
 net/ipv4/tcp.c        |    3 ++-
 net/ipv4/tcp_output.c |   12 +++++++-----
 net/ipv6/tcp_ipv6.c   |   15 +++++++++++----
 4 files changed, 25 insertions(+), 10 deletions(-)

Index: mmotm/include/net/sock.h
===================================================================
--- mmotm.orig/include/net/sock.h
+++ mmotm/include/net/sock.h
@@ -526,6 +526,11 @@ static inline int sock_flag(struct sock
 	return test_bit(flag, &sk->sk_flags);
 }
 
+static inline gfp_t sk_allocation(struct sock *sk, gfp_t gfp_mask)
+{
+	return gfp_mask;
+}
+
 static inline void sk_acceptq_removed(struct sock *sk)
 {
 	sk->sk_ack_backlog--;
Index: mmotm/net/ipv4/tcp.c
===================================================================
--- mmotm.orig/net/ipv4/tcp.c
+++ mmotm/net/ipv4/tcp.c
@@ -645,7 +645,8 @@ struct sk_buff *sk_stream_alloc_skb(stru
 	/* The TCP header must be at least 32-bit aligned.  */
 	size = ALIGN(size, 4);
 
-	skb = alloc_skb_fclone(size + sk->sk_prot->max_header, gfp);
+	skb = alloc_skb_fclone(size + sk->sk_prot->max_header,
+			       sk_allocation(sk, gfp));
 	if (skb) {
 		if (sk_wmem_schedule(sk, skb->truesize)) {
 			/*
Index: mmotm/net/ipv4/tcp_output.c
===================================================================
--- mmotm.orig/net/ipv4/tcp_output.c
+++ mmotm/net/ipv4/tcp_output.c
@@ -2101,7 +2101,8 @@ void tcp_send_fin(struct sock *sk)
 	} else {
 		/* Socket is locked, keep trying until memory is available. */
 		for (;;) {
-			skb = alloc_skb_fclone(MAX_TCP_HEADER, GFP_KERNEL);
+			skb = alloc_skb_fclone(MAX_TCP_HEADER,
+					       sk_allocation(sk, GFP_KERNEL));
 			if (skb)
 				break;
 			yield();
@@ -2127,7 +2128,7 @@ void tcp_send_active_reset(struct sock *
 	struct sk_buff *skb;
 
 	/* NOTE: No TCP options attached and we never retransmit this. */
-	skb = alloc_skb(MAX_TCP_HEADER, priority);
+	skb = alloc_skb(MAX_TCP_HEADER, sk_allocation(sk, priority));
 	if (!skb) {
 		NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPABORTFAILED);
 		return;
@@ -2196,7 +2197,8 @@ struct sk_buff *tcp_make_synack(struct s
 	__u8 *md5_hash_location;
 	int mss;
 
-	skb = sock_wmalloc(sk, MAX_TCP_HEADER + 15, 1, GFP_ATOMIC);
+	skb = sock_wmalloc(sk, MAX_TCP_HEADER + 15, 1,
+			sk_allocation(sk, GFP_ATOMIC));
 	if (skb == NULL)
 		return NULL;
 
@@ -2443,7 +2445,7 @@ void tcp_send_ack(struct sock *sk)
 	 * tcp_transmit_skb() will set the ownership to this
 	 * sock.
 	 */
-	buff = alloc_skb(MAX_TCP_HEADER, GFP_ATOMIC);
+	buff = alloc_skb(MAX_TCP_HEADER, sk_allocation(sk, GFP_ATOMIC));
 	if (buff == NULL) {
 		inet_csk_schedule_ack(sk);
 		inet_csk(sk)->icsk_ack.ato = TCP_ATO_MIN;
@@ -2478,7 +2480,7 @@ static int tcp_xmit_probe_skb(struct soc
 	struct sk_buff *skb;
 
 	/* We don't queue it, tcp_transmit_skb() sets ownership. */
-	skb = alloc_skb(MAX_TCP_HEADER, GFP_ATOMIC);
+	skb = alloc_skb(MAX_TCP_HEADER, sk_allocation(sk, GFP_ATOMIC));
 	if (skb == NULL)
 		return -1;
 
Index: mmotm/net/ipv6/tcp_ipv6.c
===================================================================
--- mmotm.orig/net/ipv6/tcp_ipv6.c
+++ mmotm/net/ipv6/tcp_ipv6.c
@@ -584,7 +584,8 @@ static int tcp_v6_md5_do_add(struct sock
 	} else {
 		/* reallocate new list if current one is full. */
 		if (!tp->md5sig_info) {
-			tp->md5sig_info = kzalloc(sizeof(*tp->md5sig_info), GFP_ATOMIC);
+			tp->md5sig_info = kzalloc(sizeof(*tp->md5sig_info),
+					sk_allocation(sk, GFP_ATOMIC));
 			if (!tp->md5sig_info) {
 				kfree(newkey);
 				return -ENOMEM;
@@ -597,7 +598,8 @@ static int tcp_v6_md5_do_add(struct sock
 		}
 		if (tp->md5sig_info->alloced6 == tp->md5sig_info->entries6) {
 			keys = kmalloc((sizeof (tp->md5sig_info->keys6[0]) *
-				       (tp->md5sig_info->entries6 + 1)), GFP_ATOMIC);
+				       (tp->md5sig_info->entries6 + 1)),
+				       sk_allocation(sk, GFP_ATOMIC));
 
 			if (!keys) {
 				tcp_free_md5sig_pool();
@@ -721,7 +723,8 @@ static int tcp_v6_parse_md5_keys (struct
 		struct tcp_sock *tp = tcp_sk(sk);
 		struct tcp_md5sig_info *p;
 
-		p = kzalloc(sizeof(struct tcp_md5sig_info), GFP_KERNEL);
+		p = kzalloc(sizeof(struct tcp_md5sig_info),
+				   sk_allocation(sk, GFP_KERNEL));
 		if (!p)
 			return -ENOMEM;
 
@@ -987,6 +990,7 @@ static void tcp_v6_send_response(struct
 	unsigned int tot_len = sizeof(struct tcphdr);
 	struct dst_entry *dst;
 	__be32 *topt;
+	gfp_t gfp_mask = GFP_ATOMIC;
 
 	if (ts)
 		tot_len += TCPOLEN_TSTAMP_ALIGNED;
@@ -996,7 +1000,7 @@ static void tcp_v6_send_response(struct
 #endif
 
 	buff = alloc_skb(MAX_HEADER + sizeof(struct ipv6hdr) + tot_len,
-			 GFP_ATOMIC);
+			 gfp_mask);
 	if (buff == NULL)
 		return;
 
@@ -1073,6 +1077,7 @@ static void tcp_v6_send_reset(struct soc
 	struct tcphdr *th = tcp_hdr(skb);
 	u32 seq = 0, ack_seq = 0;
 	struct tcp_md5sig_key *key = NULL;
+	gfp_t gfp_mask = GFP_ATOMIC;
 
 	if (th->rst)
 		return;
@@ -1084,6 +1089,8 @@ static void tcp_v6_send_reset(struct soc
 	if (sk)
 		key = tcp_v6_md5_do_lookup(sk, &ipv6_hdr(skb)->daddr);
 #endif
+	if (sk)
+		gfp_mask = sk_allocation(sk, gfp_mask);
 
 	if (th->ack)
 		seq = ntohl(th->ack_seq);

* [PATCH 13/31] net: packet split receive api
From: Suresh Jayaraman @ 2009-10-01 14:07 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, linux-kernel, linux-mm
  Cc: netdev, Neil Brown, Miklos Szeredi, Wouter Verhelst,
	Peter Zijlstra, trond.myklebust, Jiri Bohac, Suresh Jayaraman

From: Peter Zijlstra <a.p.zijlstra@chello.nl> 

Add some packet-split receive hooks.

For one, this allows NUMA node affine page allocations. Later on, these hooks
will be extended to do emergency reserve allocations for fragments.
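
The driver changes below replace an skb_fill_page_desc() call followed by
manual len/data_len/truesize updates with a single skb_add_rx_frag() call. A
minimal userspace sketch of that bookkeeping follows. The toy_skb layout and
the toy_* names are hypothetical and only model the accounting visible in the
removed driver lines, not the real struct sk_buff.

```c
#include <assert.h>

#define TOY_MAX_FRAGS 4

/* Hypothetical fragment descriptor: a page pointer, an offset and a size. */
struct toy_frag {
	const void *page;
	int off;
	int size;
};

/* Hypothetical, heavily reduced stand-in for struct sk_buff. */
struct toy_skb {
	unsigned int len;	/* total bytes of data */
	unsigned int data_len;	/* bytes held in page fragments */
	unsigned int truesize;	/* memory accounted to the buffer */
	int nr_frags;
	struct toy_frag frags[TOY_MAX_FRAGS];
};

/* Fill fragment slot @i and do the length accounting in one place. */
static void toy_add_rx_frag(struct toy_skb *skb, int i, const void *page,
			    int off, int size)
{
	assert(i >= 0 && i < TOY_MAX_FRAGS);

	skb->frags[i].page = page;
	skb->frags[i].off = off;
	skb->frags[i].size = size;
	skb->nr_frags = i + 1;

	skb->len += size;
	skb->data_len += size;
	skb->truesize += size;
}
```

Keeping the accounting inside the helper removes the three duplicated
additions from every driver receive path.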

Thanks to Jiri Bohac for fixing a bug in bnx2.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Jiri Bohac <jbohac@suse.cz>
Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de>
---
 drivers/net/bnx2.c             |    9 +++------
 drivers/net/e1000e/netdev.c    |    7 ++-----
 drivers/net/igb/igb_main.c     |    9 ++-------
 drivers/net/ixgbe/ixgbe_main.c |   14 ++++++--------
 drivers/net/sky2.c             |   16 ++++++----------
 include/linux/skbuff.h         |    3 +++
 6 files changed, 22 insertions(+), 36 deletions(-)

Index: mmotm/drivers/net/bnx2.c
===================================================================
--- mmotm.orig/drivers/net/bnx2.c
+++ mmotm/drivers/net/bnx2.c
@@ -2648,7 +2648,7 @@ bnx2_alloc_rx_page(struct bnx2 *bp, stru
 	struct sw_pg *rx_pg = &rxr->rx_pg_ring[index];
 	struct rx_bd *rxbd =
 		&rxr->rx_pg_desc_ring[RX_RING(index)][RX_IDX(index)];
-	struct page *page = alloc_page(GFP_ATOMIC);
+	struct page *page = netdev_alloc_page(bp->dev);
 
 	if (!page)
 		return -ENOMEM;
@@ -2678,7 +2678,7 @@ bnx2_free_rx_page(struct bnx2 *bp, struc
 	pci_unmap_page(bp->pdev, pci_unmap_addr(rx_pg, mapping), PAGE_SIZE,
 		       PCI_DMA_FROMDEVICE);
 
-	__free_page(page);
+	netdev_free_page(bp->dev, page);
 	rx_pg->page = NULL;
 }
 
@@ -3003,7 +3003,7 @@ bnx2_rx_skb(struct bnx2 *bp, struct bnx2
 			if (i == pages - 1)
 				frag_len -= 4;
 
-			skb_fill_page_desc(skb, i, rx_pg->page, 0, frag_len);
+			skb_add_rx_frag(skb, i, rx_pg->page, 0, frag_len);
 			rx_pg->page = NULL;
 
 			err = bnx2_alloc_rx_page(bp, rxr,
@@ -3020,9 +3020,6 @@ bnx2_rx_skb(struct bnx2 *bp, struct bnx2
 				       PAGE_SIZE, PCI_DMA_FROMDEVICE);
 
 			frag_size -= frag_len;
-			skb->data_len += frag_len;
-			skb->truesize += frag_len;
-			skb->len += frag_len;
 
 			pg_prod = NEXT_RX_BD(pg_prod);
 			pg_cons = RX_PG_RING_IDX(NEXT_RX_BD(pg_cons));
Index: mmotm/drivers/net/e1000e/netdev.c
===================================================================
--- mmotm.orig/drivers/net/e1000e/netdev.c
+++ mmotm/drivers/net/e1000e/netdev.c
@@ -259,7 +259,7 @@ static void e1000_alloc_rx_buffers_ps(st
 				continue;
 			}
 			if (!ps_page->page) {
-				ps_page->page = alloc_page(GFP_ATOMIC);
+				ps_page->page = netdev_alloc_page(netdev);
 				if (!ps_page->page) {
 					adapter->alloc_rx_buff_failed++;
 					goto no_buffers;
@@ -820,11 +820,8 @@ static bool e1000_clean_rx_irq_ps(struct
 			pci_unmap_page(pdev, ps_page->dma, PAGE_SIZE,
 				       PCI_DMA_FROMDEVICE);
 			ps_page->dma = 0;
-			skb_fill_page_desc(skb, j, ps_page->page, 0, length);
+			skb_add_rx_frag(skb, j, ps_page->page, 0, length);
 			ps_page->page = NULL;
-			skb->len += length;
-			skb->data_len += length;
-			skb->truesize += length;
 		}
 
 		/* strip the ethernet crc, problem is we're using pages now so
Index: mmotm/drivers/net/igb/igb_main.c
===================================================================
--- mmotm.orig/drivers/net/igb/igb_main.c
+++ mmotm/drivers/net/igb/igb_main.c
@@ -4616,7 +4616,7 @@ static bool igb_clean_rx_irq_adv(struct
 				       PAGE_SIZE / 2, PCI_DMA_FROMDEVICE);
 			buffer_info->page_dma = 0;
 
-			skb_fill_page_desc(skb, skb_shinfo(skb)->nr_frags++,
+			skb_add_rx_frag(skb, skb_shinfo(skb)->nr_frags++,
 						buffer_info->page,
 						buffer_info->page_offset,
 						length);
@@ -4626,11 +4626,6 @@ static bool igb_clean_rx_irq_adv(struct
 				buffer_info->page = NULL;
 			else
 				get_page(buffer_info->page);
-
-			skb->len += length;
-			skb->data_len += length;
-
-			skb->truesize += length;
 		}
 
 		if (!(staterr & E1000_RXD_STAT_EOP)) {
@@ -4755,7 +4750,7 @@ static void igb_alloc_rx_buffers_adv(str
 
 		if (adapter->rx_ps_hdr_size && !buffer_info->page_dma) {
 			if (!buffer_info->page) {
-				buffer_info->page = alloc_page(GFP_ATOMIC);
+				buffer_info->page = netdev_alloc_page(netdev);
 				if (!buffer_info->page) {
 					adapter->alloc_rx_buff_failed++;
 					goto no_buffers;
Index: mmotm/drivers/net/ixgbe/ixgbe_main.c
===================================================================
--- mmotm.orig/drivers/net/ixgbe/ixgbe_main.c
+++ mmotm/drivers/net/ixgbe/ixgbe_main.c
@@ -574,6 +574,7 @@ static void ixgbe_alloc_rx_buffers(struc
                                    int cleaned_count)
 {
 	struct pci_dev *pdev = adapter->pdev;
+	struct net_device *netdev = adapter->netdev;
 	union ixgbe_adv_rx_desc *rx_desc;
 	struct ixgbe_rx_buffer *bi;
 	unsigned int i;
@@ -587,7 +588,7 @@ static void ixgbe_alloc_rx_buffers(struc
 		if (!bi->page_dma &&
 		    (rx_ring->flags & IXGBE_RING_RX_PS_ENABLED)) {
 			if (!bi->page) {
-				bi->page = alloc_page(GFP_ATOMIC);
+				bi->page = netdev_alloc_page(netdev);
 				if (!bi->page) {
 					adapter->alloc_rx_page_failed++;
 					goto no_buffers;
@@ -756,10 +757,10 @@ static bool ixgbe_clean_rx_irq(struct ix
 			pci_unmap_page(pdev, rx_buffer_info->page_dma,
 			               PAGE_SIZE / 2, PCI_DMA_FROMDEVICE);
 			rx_buffer_info->page_dma = 0;
-			skb_fill_page_desc(skb, skb_shinfo(skb)->nr_frags,
-			                   rx_buffer_info->page,
-			                   rx_buffer_info->page_offset,
-			                   upper_len);
+			skb_add_rx_frag(skb, skb_shinfo(skb)->nr_frags,
+					rx_buffer_info->page,
+					rx_buffer_info->page_offset,
+					upper_len);
 
 			if ((rx_ring->rx_buf_len > (PAGE_SIZE / 2)) ||
 			    (page_count(rx_buffer_info->page) != 1))
@@ -767,9 +768,6 @@ static bool ixgbe_clean_rx_irq(struct ix
 			else
 				get_page(rx_buffer_info->page);
 
-			skb->len += upper_len;
-			skb->data_len += upper_len;
-			skb->truesize += upper_len;
 		}
 
 		i++;
Index: mmotm/drivers/net/sky2.c
===================================================================
--- mmotm.orig/drivers/net/sky2.c
+++ mmotm/drivers/net/sky2.c
@@ -1282,7 +1282,7 @@ static struct sk_buff *sky2_rx_alloc(str
 		skb_reserve(skb, NET_IP_ALIGN);
 
 	for (i = 0; i < sky2->rx_nfrags; i++) {
-		struct page *page = alloc_page(GFP_ATOMIC);
+		struct page *page = netdev_alloc_page(sky2->netdev);
 
 		if (!page)
 			goto free_partial;
@@ -2218,8 +2218,8 @@ static struct sk_buff *receive_copy(stru
 }
 
 /* Adjust length of skb with fragments to match received data */
-static void skb_put_frags(struct sk_buff *skb, unsigned int hdr_space,
-			  unsigned int length)
+static void skb_put_frags(struct sky2_port *sky2, struct sk_buff *skb,
+			  unsigned int hdr_space, unsigned int length)
 {
 	int i, num_frags;
 	unsigned int size;
@@ -2236,15 +2236,11 @@ static void skb_put_frags(struct sk_buff
 
 		if (length == 0) {
 			/* don't need this page */
-			__free_page(frag->page);
+			netdev_free_page(sky2->netdev, frag->page);
 			--skb_shinfo(skb)->nr_frags;
 		} else {
 			size = min(length, (unsigned) PAGE_SIZE);
-
-			frag->size = size;
-			skb->data_len += size;
-			skb->truesize += size;
-			skb->len += size;
+			skb_add_rx_frag(skb, i, frag->page, 0, size);
 			length -= size;
 		}
 	}
@@ -2275,7 +2271,7 @@ static struct sk_buff *receive_new(struc
 	}
 
 	if (skb_shinfo(skb)->nr_frags)
-		skb_put_frags(skb, hdr_space, length);
+		skb_put_frags(sky2, skb, hdr_space, length);
 	else
 		skb_put(skb, length);
 	return skb;
Index: mmotm/include/linux/skbuff.h
===================================================================
--- mmotm.orig/include/linux/skbuff.h
+++ mmotm/include/linux/skbuff.h
@@ -1079,6 +1079,9 @@ static inline void skb_fill_page_desc(st
 extern void skb_add_rx_frag(struct sk_buff *skb, int i, struct page *page,
 			    int off, int size);
 
+extern void skb_add_rx_frag(struct sk_buff *skb, int i, struct page *page,
+			    int off, int size);
+
 #define SKB_PAGE_ASSERT(skb) 	BUG_ON(skb_shinfo(skb)->nr_frags)
 #define SKB_FRAG_ASSERT(skb) 	BUG_ON(skb_has_frags(skb))
 #define SKB_LINEAR_ASSERT(skb)  BUG_ON(skb_is_nonlinear(skb))

* [PATCH 12/31] selinux: tag avc cache alloc as non-critical
From: Suresh Jayaraman @ 2009-10-01 14:06 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, linux-kernel, linux-mm
  Cc: netdev, Neil Brown, Miklos Szeredi, Wouter Verhelst,
	Peter Zijlstra, trond.myklebust, Suresh Jayaraman

From: Peter Zijlstra <a.p.zijlstra@chello.nl> 

Failing to allocate a cache entry will only harm performance, not correctness.
Do not consume valuable reserve pages for something like that.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de>
---
 security/selinux/avc.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: mmotm/security/selinux/avc.c
===================================================================
--- mmotm.orig/security/selinux/avc.c
+++ mmotm/security/selinux/avc.c
@@ -344,7 +344,7 @@ static struct avc_node *avc_alloc_node(v
 {
 	struct avc_node *node;
 
-	node = kmem_cache_zalloc(avc_node_cachep, GFP_ATOMIC);
+	node = kmem_cache_zalloc(avc_node_cachep, GFP_ATOMIC|__GFP_NOMEMALLOC);
 	if (!node)
 		goto out;
 

* [PATCH 11/31] mm: memory reserve management
From: Suresh Jayaraman @ 2009-10-01 14:06 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, linux-kernel, linux-mm
  Cc: netdev, Neil Brown, Miklos Szeredi, Wouter Verhelst,
	Peter Zijlstra, trond.myklebust, Suresh Jayaraman

From: Peter Zijlstra <a.p.zijlstra@chello.nl> 

Generic reserve management code.

It provides methods to reserve and charge. On top of this, generic alloc/free
style reserve pools could be built, which could fully replace mempool_t
functionality.

It should also allow for a Banker's algorithm replacement of __GFP_NOFAIL.
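
For readers new to the reserve/charge split, here is a minimal userspace
sketch of the idea: reserving raises a limit, charging consumes against it and
is refused once the limit would be exceeded. This is a model under stated
assumptions, not the code in mm/reserve.c. The toy_reserve type and the toy_*
functions are hypothetical, locking and waiters are omitted, and propagating
the limit to every ancestor is an assumption of this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace model of a reserve node. */
struct toy_reserve {
	struct toy_reserve *parent;
	long limit;	/* how much may be charged against this node */
	long usage;	/* how much is currently charged */
};

/* Grow or shrink the limit of @res and of every ancestor (assumption). */
static void toy_reserve_add(struct toy_reserve *res, long delta)
{
	for (; res; res = res->parent) {
		res->limit += delta;
		assert(res->limit >= 0);
	}
}

/*
 * Charge @amount against @res. A positive charge that would exceed the
 * limit is refused (returns 0); a negative amount uncharges.
 */
static int toy_reserve_charge(struct toy_reserve *res, long amount)
{
	if (amount > 0 && res->usage + amount > res->limit)
		return 0;

	res->usage += amount;
	assert(res->usage >= 0);
	return 1;
}
```

A caller would first reserve with toy_reserve_add(), charge before each
emergency allocation, and uncharge with a negative amount on free.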

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de>
---
 include/linux/reserve.h |  198 ++++++++++++++
 include/linux/slab.h    |   19 -
 mm/Makefile             |    2 
 mm/reserve.c            |  637 ++++++++++++++++++++++++++++++++++++++++++++++++
 mm/slub.c               |    2 
 5 files changed, 847 insertions(+), 11 deletions(-)

Index: mmotm/include/linux/reserve.h
===================================================================
--- /dev/null
+++ mmotm/include/linux/reserve.h
@@ -0,0 +1,198 @@
+/*
+ * Memory reserve management.
+ *
+ *  Copyright (C) 2007-2008 Red Hat, Inc.,
+ *  			    Peter Zijlstra <pzijlstr@redhat.com>
+ *
+ * This file contains the public data structure and API definitions.
+ */
+
+#ifndef _LINUX_RESERVE_H
+#define _LINUX_RESERVE_H
+
+#include <linux/list.h>
+#include <linux/spinlock.h>
+#include <linux/wait.h>
+#include <linux/slab.h>
+
+struct mem_reserve {
+	struct mem_reserve *parent;
+	struct list_head children;
+	struct list_head siblings;
+
+	const char *name;
+
+	long pages;
+	long limit;
+	long usage;
+	spinlock_t lock;	/* protects limit and usage */
+
+	wait_queue_head_t waitqueue;
+};
+
+extern struct mem_reserve mem_reserve_root;
+
+void mem_reserve_init(struct mem_reserve *res, const char *name,
+		      struct mem_reserve *parent);
+int mem_reserve_connect(struct mem_reserve *new_child,
+			struct mem_reserve *node);
+void mem_reserve_disconnect(struct mem_reserve *node);
+
+int mem_reserve_pages_set(struct mem_reserve *res, long pages);
+int mem_reserve_pages_add(struct mem_reserve *res, long pages);
+int mem_reserve_pages_charge(struct mem_reserve *res, long pages);
+
+int mem_reserve_kmalloc_set(struct mem_reserve *res, long bytes);
+int mem_reserve_kmalloc_charge(struct mem_reserve *res, long bytes);
+
+struct kmem_cache;
+
+int mem_reserve_kmem_cache_set(struct mem_reserve *res,
+			       struct kmem_cache *s,
+			       int objects);
+int mem_reserve_kmem_cache_charge(struct mem_reserve *res,
+				  struct kmem_cache *s, long objs);
+
+void *___kmalloc_reserve(size_t size, gfp_t flags, int node, void *ip,
+			 struct mem_reserve *res, int *emerg);
+
+static inline
+void *__kmalloc_reserve(size_t size, gfp_t flags, int node, void *ip,
+			struct mem_reserve *res, int *emerg)
+{
+	void *obj;
+
+	obj = __kmalloc_node_track_caller(size,
+			flags | __GFP_NOMEMALLOC | __GFP_NOWARN, node, ip);
+	if (!obj)
+		obj = ___kmalloc_reserve(size, flags, node, ip, res, emerg);
+
+	return obj;
+}
+
+/**
+ * kmalloc_reserve() - kmalloc() and charge against @res for @emerg allocations
+ * @size - size of the requested memory region
+ * @gfp - allocation flags to use for this allocation
+ * @node - preferred memory node for this allocation
+ * @res - reserve to charge emergency allocations against
+ * @emerg - bit 0 is set when the allocation was an emergency allocation
+ *
+ * Returns NULL on failure
+ */
+#define kmalloc_reserve(size, gfp, node, res, emerg) 			\
+	__kmalloc_reserve(size, gfp, node, 				\
+			  __builtin_return_address(0), res, emerg)
+
+void __kfree_reserve(void *obj, struct mem_reserve *res, int emerg);
+
+/**
+ * kfree_reserve() - kfree() and uncharge against @res for @emerg allocations
+ * @obj - memory to free
+ * @res - reserve to uncharge emergency allocations from
+ * @emerg - was this an emergency allocation
+ */
+static inline
+void kfree_reserve(void *obj, struct mem_reserve *res, int emerg)
+{
+	if (unlikely(obj && res && emerg))
+		__kfree_reserve(obj, res, emerg);
+	else
+		kfree(obj);
+}
+
+void *__kmem_cache_alloc_reserve(struct kmem_cache *s, gfp_t flags, int node,
+				 struct mem_reserve *res, int *emerg);
+
+/**
+ * kmem_cache_alloc_reserve() - kmem_cache_alloc() and charge against @res
+ * @s - kmem_cache to allocate from
+ * @gfp - allocation flags to use for this allocation
+ * @node - preferred memory node for this allocation
+ * @res - reserve to charge emergency allocations against
+ * @emerg - bit 0 is set when the allocation was an emergency allocation
+ *
+ * Returns NULL on failure
+ */
+static inline
+void *kmem_cache_alloc_reserve(struct kmem_cache *s, gfp_t flags, int node,
+			       struct mem_reserve *res, int *emerg)
+{
+	void *obj;
+
+	obj = kmem_cache_alloc_node(s,
+			flags | __GFP_NOMEMALLOC | __GFP_NOWARN, node);
+	if (!obj)
+		obj = __kmem_cache_alloc_reserve(s, flags, node, res, emerg);
+
+	return obj;
+}
+
+void __kmem_cache_free_reserve(struct kmem_cache *s, void *obj,
+			       struct mem_reserve *res, int emerg);
+
+/**
+ * kmem_cache_free_reserve() - kmem_cache_free() and uncharge against @res
+ * @s - kmem_cache to free to
+ * @obj - memory to free
+ * @res - reserve to uncharge emergency allocations from
+ * @emerg - was this an emergency allocation
+ */
+static inline
+void kmem_cache_free_reserve(struct kmem_cache *s, void *obj,
+			     struct mem_reserve *res, int emerg)
+{
+	if (unlikely(obj && res && emerg))
+		__kmem_cache_free_reserve(s, obj, res, emerg);
+	else
+		kmem_cache_free(s, obj);
+}
+
+struct page *__alloc_pages_reserve(int node, gfp_t flags, int order,
+				  struct mem_reserve *res, int *emerg);
+
+/**
+ * alloc_pages_reserve() - alloc_pages() and charge against @res
+ * @node - preferred memory node for this allocation
+ * @gfp - allocation flags to use for this allocation
+ * @order - page order
+ * @res - reserve to charge emergency allocations against
+ * @emerg - bit 0 is set when the allocation was an emergency allocation
+ *
+ * Returns NULL on failure
+ */
+static inline
+struct page *alloc_pages_reserve(int node, gfp_t flags, int order,
+				 struct mem_reserve *res, int *emerg)
+{
+	struct page *page;
+
+	page = alloc_pages_node(node,
+			flags | __GFP_NOMEMALLOC | __GFP_NOWARN, order);
+	if (!page)
+		page = __alloc_pages_reserve(node, flags, order, res, emerg);
+
+	return page;
+}
+
+void __free_pages_reserve(struct page *page, int order,
+			  struct mem_reserve *res, int emerg);
+
+/**
+ * free_pages_reserve() - __free_pages() and uncharge against @res
+ * @page - page to free
+ * @order - page order
+ * @res - reserve to uncharge emergency allocations from
+ * @emerg - was this an emergency allocation
+ */
+static inline
+void free_pages_reserve(struct page *page, int order,
+			struct mem_reserve *res, int emerg)
+{
+	if (unlikely(page && res && emerg))
+		__free_pages_reserve(page, order, res, emerg);
+	else
+		__free_pages(page, order);
+}
+
+#endif /* _LINUX_RESERVE_H */
Index: mmotm/include/linux/slab.h
===================================================================
--- mmotm.orig/include/linux/slab.h
+++ mmotm/include/linux/slab.h
@@ -268,13 +268,14 @@ static inline void *kmem_cache_alloc_nod
  */
 #if defined(CONFIG_DEBUG_SLAB) || defined(CONFIG_SLUB)
 extern void *__kmalloc_track_caller(size_t, gfp_t, unsigned long);
-#define kmalloc_track_caller(size, flags) \
-	__kmalloc_track_caller(size, flags, _RET_IP_)
 #else
-#define kmalloc_track_caller(size, flags) \
+#define __kmalloc_track_caller(size, flags, ip) \
 	__kmalloc(size, flags)
 #endif /* DEBUG_SLAB */
 
+#define kmalloc_track_caller(size, flags) \
+	__kmalloc_track_caller(size, flags, _RET_IP_)
+
 #ifdef CONFIG_NUMA
 /*
  * kmalloc_node_track_caller is a special version of kmalloc_node that
@@ -286,21 +287,21 @@ extern void *__kmalloc_track_caller(size
  */
 #if defined(CONFIG_DEBUG_SLAB) || defined(CONFIG_SLUB)
 extern void *__kmalloc_node_track_caller(size_t, gfp_t, int, unsigned long);
-#define kmalloc_node_track_caller(size, flags, node) \
-	__kmalloc_node_track_caller(size, flags, node, \
-			_RET_IP_)
 #else
-#define kmalloc_node_track_caller(size, flags, node) \
+#define __kmalloc_node_track_caller(size, flags, node, ip) \
 	__kmalloc_node(size, flags, node)
 #endif
 
 #else /* CONFIG_NUMA */
 
-#define kmalloc_node_track_caller(size, flags, node) \
-	kmalloc_track_caller(size, flags)
+#define __kmalloc_node_track_caller(size, flags, node, ip) \
+	__kmalloc_track_caller(size, flags, ip)
 
 #endif /* CONFIG_NUMA */
 
+#define kmalloc_node_track_caller(size, flags, node) \
+	__kmalloc_node_track_caller(size, flags, node, \
+			_RET_IP_)
 /*
  * Shortcuts
  */
Index: mmotm/mm/Makefile
===================================================================
--- mmotm.orig/mm/Makefile
+++ mmotm/mm/Makefile
@@ -11,7 +11,7 @@ obj-y			:= bootmem.o filemap.o mempool.o
 			   maccess.o page_alloc.o page-writeback.o pdflush.o \
 			   readahead.o swap.o truncate.o vmscan.o shmem.o \
 			   prio_tree.o util.o mmzone.o vmstat.o backing-dev.o \
-			   page_isolation.o mm_init.o $(mmu-y)
+			   page_isolation.o mm_init.o reserve.o $(mmu-y)
 obj-y += init-mm.o
 
 obj-$(CONFIG_PROC_PAGE_MONITOR) += pagewalk.o
Index: mmotm/mm/reserve.c
===================================================================
--- /dev/null
+++ mmotm/mm/reserve.c
@@ -0,0 +1,637 @@
+/*
+ * Memory reserve management.
+ *
+ *  Copyright (C) 2007-2008, Red Hat, Inc.,
+ *  			     Peter Zijlstra <pzijlstr@redhat.com>
+ *
+ * Description:
+ *
+ * Manage a set of memory reserves.
+ *
+ * A memory reserve is a reserve for a specified number of objects of specified
+ * size. Since memory is managed in pages, this reserve demand is then
+ * translated into a page unit.
+ *
+ * So each reserve has a specified object limit, an object usage count and a
+ * number of pages required to back these objects.
+ *
+ * Usage is charged against a reserve, if the charge fails, the resource must
+ * not be allocated/used.
+ *
+ * The reserves are managed in a tree, and the resource demands (pages and
+ * limit) are propagated up the tree. Obviously the object limit will be
+ * meaningless as soon as the unit starts mixing, but the required page reserve
+ * (being of one unit) is still valid at the root.
+ *
+ * It is the page demand of the root node that is used to set the global
+ * reserve (adjust_memalloc_reserve() which sets zone->pages_emerg).
+ *
+ * As long as a subtree has the same usage unit, an aggregate node can be used
+ * to charge against, instead of the leaf nodes. However, do be consistent with
+ * who is charged, resource usage is not propagated up the tree (for
+ * performance reasons).
+ */
+
+#include <linux/reserve.h>
+#include <linux/mutex.h>
+#include <linux/mmzone.h>
+#include <linux/log2.h>
+#include <linux/proc_fs.h>
+#include <linux/seq_file.h>
+#include <linux/module.h>
+#include <linux/slab.h>
+#include <linux/sched.h>
+#include "internal.h"
+
+static DEFINE_MUTEX(mem_reserve_mutex);
+
+/**
+ * @mem_reserve_root - the global reserve root
+ *
+ * The global reserve is empty, and has no limit unit, it merely
+ * acts as an aggregation point for reserves and an interface to
+ * adjust_memalloc_reserve().
+ */
+struct mem_reserve mem_reserve_root = {
+	.children = LIST_HEAD_INIT(mem_reserve_root.children),
+	.siblings = LIST_HEAD_INIT(mem_reserve_root.siblings),
+	.name = "total reserve",
+	.lock = __SPIN_LOCK_UNLOCKED(mem_reserve_root.lock),
+	.waitqueue = __WAIT_QUEUE_HEAD_INITIALIZER(mem_reserve_root.waitqueue),
+};
+EXPORT_SYMBOL_GPL(mem_reserve_root);
+
+/**
+ * mem_reserve_init() - initialize a memory reserve object
+ * @res - the new reserve object
+ * @name - a name for this reserve
+ * @parent - when non NULL, the parent to connect to.
+ */
+void mem_reserve_init(struct mem_reserve *res, const char *name,
+		      struct mem_reserve *parent)
+{
+	memset(res, 0, sizeof(*res));
+	INIT_LIST_HEAD(&res->children);
+	INIT_LIST_HEAD(&res->siblings);
+	res->name = name;
+	spin_lock_init(&res->lock);
+	init_waitqueue_head(&res->waitqueue);
+
+	if (parent)
+		mem_reserve_connect(res, parent);
+}
+EXPORT_SYMBOL_GPL(mem_reserve_init);
+
+/*
+ * propagate the pages and limit changes up the (sub)tree.
+ */
+static void __calc_reserve(struct mem_reserve *res, long pages, long limit)
+{
+	unsigned long flags;
+
+	for ( ; res; res = res->parent) {
+		res->pages += pages;
+
+		if (limit) {
+			spin_lock_irqsave(&res->lock, flags);
+			res->limit += limit;
+			spin_unlock_irqrestore(&res->lock, flags);
+		}
+	}
+}
+
+/**
+ * __mem_reserve_add() - primitive to change the size of a reserve
+ * @res - reserve to change
+ * @pages - page delta
+ * @limit - usage limit delta
+ *
+ * Returns -ENOMEM when a size increase is not possible atm.
+ */
+static int __mem_reserve_add(struct mem_reserve *res, long pages, long limit)
+{
+	int ret = 0;
+	long reserve;
+
+	/*
+	 * This looks more complex than need be, that is because we handle
+	 * the case where @res isn't actually connected to mem_reserve_root.
+	 *
+	 * So, by propagating the new pages up the (sub)tree and computing
+	 * the difference in mem_reserve_root.pages we find if this action
+	 * affects the actual reserve.
+	 *
+	 * The (partial) propagation also makes that mem_reserve_connect()
+	 * needs only look at the direct child, since each disconnected
+	 * sub-tree is fully up-to-date.
+	 */
+	reserve = mem_reserve_root.pages;
+	__calc_reserve(res, pages, 0);
+	reserve = mem_reserve_root.pages - reserve;
+
+	if (reserve) {
+		ret = adjust_memalloc_reserve(reserve);
+		if (ret)
+			__calc_reserve(res, -pages, 0);
+	}
+
+	/*
+	 * Delay updating the limits until we've acquired the resources to
+	 * back it.
+	 */
+	if (!ret)
+		__calc_reserve(res, 0, limit);
+
+	return ret;
+}
+
+/**
+ * __mem_reserve_charge() - primitive to charge object usage of a reserve
+ * @res - reserve to charge
+ * @charge - size of the charge
+ *
+ * Returns non-zero on success, zero on failure.
+ */
+static
+int __mem_reserve_charge(struct mem_reserve *res, long charge)
+{
+	unsigned long flags;
+	int ret = 0;
+
+	spin_lock_irqsave(&res->lock, flags);
+	if (charge < 0 || res->usage + charge < res->limit) {
+		res->usage += charge;
+		if (unlikely(res->usage < 0))
+			res->usage = 0;
+		ret = 1;
+	}
+	if (charge < 0)
+		wake_up_all(&res->waitqueue);
+	spin_unlock_irqrestore(&res->lock, flags);
+
+	return ret;
+}
+
+/**
+ * mem_reserve_connect() - connect a reserve to another in a child-parent relation
+ * @new_child - the reserve node to connect (child)
+ * @node - the reserve node to connect to (parent)
+ *
+ * Connecting a node results in an increase of the reserve by the amount of
+ * pages in @new_child->pages if @node has a connection to mem_reserve_root.
+ *
+ * Returns -ENOMEM when the new connection would increase the reserve (parent
+ * is connected to mem_reserve_root) and there is no memory to do so.
+ *
+ * On error, the child is _NOT_ connected.
+ */
+int mem_reserve_connect(struct mem_reserve *new_child, struct mem_reserve *node)
+{
+	int ret;
+
+	WARN_ON(!new_child->name);
+
+	mutex_lock(&mem_reserve_mutex);
+	if (new_child->parent) {
+		ret = -EEXIST;
+		goto unlock;
+	}
+	new_child->parent = node;
+	list_add(&new_child->siblings, &node->children);
+	ret = __mem_reserve_add(node, new_child->pages, new_child->limit);
+	if (ret) {
+		new_child->parent = NULL;
+		list_del_init(&new_child->siblings);
+	}
+unlock:
+	mutex_unlock(&mem_reserve_mutex);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(mem_reserve_connect);
+
+/**
+ * mem_reserve_disconnect() - sever a node's connection to the reserve tree
+ * @node - the node to disconnect
+ *
+ * Disconnecting a node results in a reduction of the reserve by @node->pages
+ * if node had a connection to mem_reserve_root.
+ */
+void mem_reserve_disconnect(struct mem_reserve *node)
+{
+	int ret;
+
+	BUG_ON(!node->parent);
+
+	mutex_lock(&mem_reserve_mutex);
+	if (!node->parent) {
+		ret = -ENOENT;
+		goto unlock;
+	}
+	ret = __mem_reserve_add(node->parent, -node->pages, -node->limit);
+	if (!ret) {
+		node->parent = NULL;
+		list_del_init(&node->siblings);
+	}
+unlock:
+	mutex_unlock(&mem_reserve_mutex);
+
+	/*
+	 * We cannot fail to shrink the reserves, can we?
+	 */
+	WARN_ON(ret);
+}
+EXPORT_SYMBOL_GPL(mem_reserve_disconnect);
+
+#ifdef CONFIG_PROC_FS
+
+/*
+ * Simple output of the reserve tree in: /proc/reserve_info
+ * Example:
+ *
+ * localhost ~ # cat /proc/reserve_info
+ * 1:0 "total reserve" 6232K 0/278581
+ * 2:1 "total network reserve" 6232K 0/278581
+ * 3:2 "network TX reserve" 212K 0/53
+ * 4:3 "protocol TX pages" 212K 0/53
+ * 5:2 "network RX reserve" 6020K 0/278528
+ * 6:5 "IPv4 route cache" 5508K 0/16384
+ * 7:5 "SKB data reserve" 512K 0/262144
+ * 8:7 "IPv4 fragment cache" 512K 0/262144
+ */
+
+static void mem_reserve_show_item(struct seq_file *m, struct mem_reserve *res,
+				  unsigned int parent, unsigned int *id)
+{
+	struct mem_reserve *child;
+	unsigned int my_id = ++*id;
+
+	seq_printf(m, "%d:%d \"%s\" %ldK %ld/%ld\n",
+			my_id, parent, res->name,
+			res->pages << (PAGE_SHIFT - 10),
+			res->usage, res->limit);
+
+	list_for_each_entry(child, &res->children, siblings)
+		mem_reserve_show_item(m, child, my_id, id);
+}
+
+static int mem_reserve_show(struct seq_file *m, void *v)
+{
+	unsigned int ident = 0;
+
+	mutex_lock(&mem_reserve_mutex);
+	mem_reserve_show_item(m, &mem_reserve_root, ident, &ident);
+	mutex_unlock(&mem_reserve_mutex);
+
+	return 0;
+}
+
+static int mem_reserve_open(struct inode *inode, struct file *file)
+{
+	return single_open(file, mem_reserve_show, NULL);
+}
+
+static const struct file_operations mem_reserve_opterations = {
+	.open = mem_reserve_open,
+	.read = seq_read,
+	.llseek = seq_lseek,
+	.release = single_release,
+};
+
+static __init int mem_reserve_proc_init(void)
+{
+	proc_create("reserve_info", S_IRUSR, NULL, &mem_reserve_opterations);
+	return 0;
+}
+
+module_init(mem_reserve_proc_init);
+
+#endif
+
+/*
+ * alloc_page helpers
+ */
+
+/**
+ * mem_reserve_pages_set() - set reserves size in pages
+ * @res - reserve to set
+ * @pages - size in pages to set it to
+ *
+ * Returns -ENOMEM when it fails to set the reserve. On failure the old size
+ * is preserved.
+ */
+int mem_reserve_pages_set(struct mem_reserve *res, long pages)
+{
+	int ret;
+
+	mutex_lock(&mem_reserve_mutex);
+	pages -= res->pages;
+	ret = __mem_reserve_add(res, pages, pages * PAGE_SIZE);
+	mutex_unlock(&mem_reserve_mutex);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(mem_reserve_pages_set);
+
+/**
+ * mem_reserve_pages_add() - change the size in a relative way
+ * @res - reserve to change
+ * @pages - number of pages to add (or subtract when negative)
+ *
+ * Similar to mem_reserve_pages_set, except that the argument is relative
+ * instead of absolute.
+ *
+ * Returns -ENOMEM when it fails to increase.
+ */
+int mem_reserve_pages_add(struct mem_reserve *res, long pages)
+{
+	int ret;
+
+	mutex_lock(&mem_reserve_mutex);
+	ret = __mem_reserve_add(res, pages, pages * PAGE_SIZE);
+	mutex_unlock(&mem_reserve_mutex);
+
+	return ret;
+}
+
+/**
+ * mem_reserve_pages_charge() - charge page usage to a reserve
+ * @res - reserve to charge
+ * @pages - size to charge
+ *
+ * Returns non-zero on success.
+ */
+int mem_reserve_pages_charge(struct mem_reserve *res, long pages)
+{
+	return __mem_reserve_charge(res, pages * PAGE_SIZE);
+}
+EXPORT_SYMBOL_GPL(mem_reserve_pages_charge);
+
+/*
+ * kmalloc helpers
+ */
+
+/**
+ * mem_reserve_kmalloc_set() - set this reserve to bytes worth of kmalloc
+ * @res - reserve to change
+ * @bytes - size in bytes to reserve
+ *
+ * Returns -ENOMEM on failure.
+ */
+int mem_reserve_kmalloc_set(struct mem_reserve *res, long bytes)
+{
+	int ret;
+	long pages;
+
+	mutex_lock(&mem_reserve_mutex);
+	pages = kmalloc_estimate_bytes(GFP_ATOMIC, bytes);
+	pages -= res->pages;
+	bytes -= res->limit;
+	ret = __mem_reserve_add(res, pages, bytes);
+	mutex_unlock(&mem_reserve_mutex);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(mem_reserve_kmalloc_set);
+
+/**
+ * mem_reserve_kmalloc_charge() - charge bytes to a reserve
+ * @res - reserve to charge
+ * @bytes - bytes to charge
+ *
+ * Returns non-zero on success.
+ */
+int mem_reserve_kmalloc_charge(struct mem_reserve *res, long bytes)
+{
+	if (bytes < 0)
+		bytes = -roundup_pow_of_two(-bytes);
+	else
+		bytes = roundup_pow_of_two(bytes);
+
+	return __mem_reserve_charge(res, bytes);
+}
+EXPORT_SYMBOL_GPL(mem_reserve_kmalloc_charge);
+
+/*
+ * kmem_cache helpers
+ */
+
+/**
+ * mem_reserve_kmem_cache_set() - set reserve to @objects worth of kmem_cache_alloc of @s
+ * @res - reserve to set
+ * @s - kmem_cache to reserve from
+ * @objects - number of objects to reserve
+ *
+ * Returns -ENOMEM on failure.
+ */
+int mem_reserve_kmem_cache_set(struct mem_reserve *res, struct kmem_cache *s,
+			       int objects)
+{
+	int ret;
+	long pages, bytes;
+
+	mutex_lock(&mem_reserve_mutex);
+	pages = kmem_alloc_estimate(s, GFP_ATOMIC, objects);
+	pages -= res->pages;
+	bytes = objects * kmem_cache_size(s) - res->limit;
+	ret = __mem_reserve_add(res, pages, bytes);
+	mutex_unlock(&mem_reserve_mutex);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(mem_reserve_kmem_cache_set);
+
+/**
+ * mem_reserve_kmem_cache_charge() - charge (or uncharge) usage of objs
+ * @res - reserve to charge
+ * @objs - objects to charge for
+ *
+ * Returns non-zero on success.
+ */
+int mem_reserve_kmem_cache_charge(struct mem_reserve *res, struct kmem_cache *s,
+				  long objs)
+{
+	return __mem_reserve_charge(res, objs * kmem_cache_size(s));
+}
+EXPORT_SYMBOL_GPL(mem_reserve_kmem_cache_charge);
+
+/*
+ * Alloc wrappers.
+ *
+ * Actual usage is commented in linux/reserve.h where the interface functions
+ * live. Furthermore, the code is 3 instances of the same paradigm, hence only
+ * the first contains extensive comments.
+ */
+
+/*
+ * kmalloc/kfree
+ */
+
+void *___kmalloc_reserve(size_t size, gfp_t flags, int node, void *ip,
+			 struct mem_reserve *res, int *emerg)
+{
+	void *obj;
+	gfp_t gfp;
+
+	/*
+	 * Try a regular allocation, when that fails and we're not entitled
+	 * to the reserves, fail.
+	 */
+	gfp = flags | __GFP_NOMEMALLOC | __GFP_NOWARN;
+	obj = __kmalloc_node_track_caller(size, gfp, node, ip);
+
+	if (obj || !(gfp_to_alloc_flags(flags) & ALLOC_NO_WATERMARKS))
+		goto out;
+
+	/*
+	 * If we were given a reserve to charge against, try that.
+	 */
+	if (res && !mem_reserve_kmalloc_charge(res, size)) {
+		/*
+		 * If we failed to charge and we're not allowed to wait for
+		 * it to succeed, bail.
+		 */
+		if (!(flags & __GFP_WAIT))
+			goto out;
+
+		/*
+		 * Wait for a successful charge against the reserve. All
+		 * uncharge operations against this reserve will wake us up.
+		 */
+		wait_event(res->waitqueue,
+				mem_reserve_kmalloc_charge(res, size));
+
+		/*
+		 * After waiting for it, again try a regular allocation.
+		 * Pressure could have lifted during our sleep. If this
+		 * succeeds, uncharge the reserve.
+		 */
+		obj = __kmalloc_node_track_caller(size, gfp, node, ip);
+		if (obj) {
+			mem_reserve_kmalloc_charge(res, -size);
+			goto out;
+		}
+	}
+
+	/*
+	 * Regular allocation failed, and we've successfully charged our
+	 * requested usage against the reserve. Do the emergency allocation.
+	 */
+	obj = __kmalloc_node_track_caller(size, flags, node, ip);
+	WARN_ON(!obj);
+	if (emerg)
+		*emerg = 1;
+
+out:
+	return obj;
+}
+
+void __kfree_reserve(void *obj, struct mem_reserve *res, int emerg)
+{
+	/*
+	 * ksize gives the full allocated size vs the requested size we used to
+	 * charge; however since we round up to the nearest power of two, this
+	 * should all work nicely.
+	 */
+	size_t size = ksize(obj);
+
+	kfree(obj);
+	/*
+	 * Free before uncharge, this ensures memory is actually present when
+	 * a subsequent charge succeeds.
+	 */
+	mem_reserve_kmalloc_charge(res, -size);
+}
+
+/*
+ * kmem_cache_alloc/kmem_cache_free
+ */
+
+void *__kmem_cache_alloc_reserve(struct kmem_cache *s, gfp_t flags, int node,
+				 struct mem_reserve *res, int *emerg)
+{
+	void *obj;
+	gfp_t gfp;
+
+	gfp = flags | __GFP_NOMEMALLOC | __GFP_NOWARN;
+	obj = kmem_cache_alloc_node(s, gfp, node);
+
+	if (obj || !(gfp_to_alloc_flags(flags) & ALLOC_NO_WATERMARKS))
+		goto out;
+
+	if (res && !mem_reserve_kmem_cache_charge(res, s, 1)) {
+		if (!(flags & __GFP_WAIT))
+			goto out;
+
+		wait_event(res->waitqueue,
+				mem_reserve_kmem_cache_charge(res, s, 1));
+
+		obj = kmem_cache_alloc_node(s, gfp, node);
+		if (obj) {
+			mem_reserve_kmem_cache_charge(res, s, -1);
+			goto out;
+		}
+	}
+
+	obj = kmem_cache_alloc_node(s, flags, node);
+	WARN_ON(!obj);
+	if (emerg)
+		*emerg = 1;
+
+out:
+	return obj;
+}
+
+void __kmem_cache_free_reserve(struct kmem_cache *s, void *obj,
+			       struct mem_reserve *res, int emerg)
+{
+	kmem_cache_free(s, obj);
+	mem_reserve_kmem_cache_charge(res, s, -1);
+}
+
+/*
+ * alloc_pages/free_pages
+ */
+
+struct page *__alloc_pages_reserve(int node, gfp_t flags, int order,
+				   struct mem_reserve *res, int *emerg)
+{
+	struct page *page;
+	gfp_t gfp;
+	long pages = 1 << order;
+
+	gfp = flags | __GFP_NOMEMALLOC | __GFP_NOWARN;
+	page = alloc_pages_node(node, gfp, order);
+
+	if (page || !(gfp_to_alloc_flags(flags) & ALLOC_NO_WATERMARKS))
+		goto out;
+
+	if (res && !mem_reserve_pages_charge(res, pages)) {
+		if (!(flags & __GFP_WAIT))
+			goto out;
+
+		wait_event(res->waitqueue,
+				mem_reserve_pages_charge(res, pages));
+
+		page = alloc_pages_node(node, gfp, order);
+		if (page) {
+			mem_reserve_pages_charge(res, -pages);
+			goto out;
+		}
+	}
+
+	page = alloc_pages_node(node, flags, order);
+	WARN_ON(!page);
+	if (emerg)
+		*emerg = 1;
+
+out:
+	return page;
+}
+
+void __free_pages_reserve(struct page *page, int order,
+			  struct mem_reserve *res, int emerg)
+{
+	__free_pages(page, order);
+	mem_reserve_pages_charge(res, -(1 << order));
+}
Index: mmotm/mm/slub.c
===================================================================
--- mmotm.orig/mm/slub.c
+++ mmotm/mm/slub.c
@@ -2896,6 +2896,7 @@ void *__kmalloc(size_t size, gfp_t flags
 }
 EXPORT_SYMBOL(__kmalloc);
 
+#ifdef CONFIG_NUMA
 static void *kmalloc_large_node(size_t size, gfp_t flags, int node)
 {
 	struct page *page;
@@ -2910,7 +2911,6 @@ static void *kmalloc_large_node(size_t s
 	return ptr;
 }
 
-#ifdef CONFIG_NUMA
 void *__kmalloc_node(size_t size, gfp_t flags, int node)
 {
 	struct kmem_cache *s;


* [PATCH 10/31] mm: __GFP_MEMALLOC
From: Suresh Jayaraman @ 2009-10-01 14:06 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, linux-kernel, linux-mm
  Cc: netdev, Neil Brown, Miklos Szeredi, Wouter Verhelst,
	Peter Zijlstra, trond.myklebust, Suresh Jayaraman

From: Peter Zijlstra <a.p.zijlstra@chello.nl> 

__GFP_MEMALLOC will allow the allocation to disregard the watermarks, 
much like PF_MEMALLOC.

It allows one to pass along the memalloc state in object-related allocation
flags, such as sk->sk_allocation, as opposed to task-related flags.
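
The resulting precedence can be sketched in userspace as follows. This is only
a model of the branch changed in gfp_to_alloc_flags(): the function name and
the MODEL_ prefixed constants are hypothetical, the two GFP bit values are the
ones from the gfp.h hunk below, and the TIF_MEMDIE case is omitted.

```c
#include <assert.h>
#include <stdbool.h>

/* GFP bit values as given in this patch's gfp.h hunk. */
#define MODEL_GFP_MEMALLOC	0x2000u
#define MODEL_GFP_NOMEMALLOC	0x10000u

/* Stand-in for ALLOC_NO_WATERMARKS; the value is chosen for this model. */
#define MODEL_ALLOC_NO_WATERMARKS	0x04

/*
 * Decide whether an allocation may ignore the watermarks.
 * __GFP_NOMEMALLOC vetoes everything; otherwise __GFP_MEMALLOC grants
 * access regardless of context, and PF_MEMALLOC (task_memalloc here)
 * grants it only outside hard interrupt context.
 */
static int model_memalloc_flags(unsigned int gfp_mask, bool in_irq,
				bool task_memalloc)
{
	int alloc_flags = 0;

	if (!(gfp_mask & MODEL_GFP_NOMEMALLOC)) {
		if (gfp_mask & MODEL_GFP_MEMALLOC)
			alloc_flags |= MODEL_ALLOC_NO_WATERMARKS;
		else if (!in_irq && task_memalloc)
			alloc_flags |= MODEL_ALLOC_NO_WATERMARKS;
	}

	return alloc_flags;
}
```

The point of the object-related flag shows in the first branch: unlike
PF_MEMALLOC, it still takes effect when the allocation happens in interrupt
context.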

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de>
---
 include/linux/gfp.h |    3 ++-
 mm/page_alloc.c     |    4 +++-
 2 files changed, 5 insertions(+), 2 deletions(-)

Index: mmotm/include/linux/gfp.h
===================================================================
--- mmotm.orig/include/linux/gfp.h
+++ mmotm/include/linux/gfp.h
@@ -46,6 +46,7 @@ struct vm_area_struct;
 #define __GFP_REPEAT	((__force gfp_t)0x400u)	/* See above */
 #define __GFP_NOFAIL	((__force gfp_t)0x800u)	/* See above */
 #define __GFP_NORETRY	((__force gfp_t)0x1000u)/* See above */
+#define __GFP_MEMALLOC  ((__force gfp_t)0x2000u)/* Use emergency reserves */
 #define __GFP_COMP	((__force gfp_t)0x4000u)/* Add compound page metadata */
 #define __GFP_ZERO	((__force gfp_t)0x8000u)/* Return zeroed page on success */
 #define __GFP_NOMEMALLOC ((__force gfp_t)0x10000u) /* Don't use emergency reserves */
@@ -96,7 +97,7 @@ struct vm_area_struct;
 /* Control page allocator reclaim behavior */
 #define GFP_RECLAIM_MASK (__GFP_WAIT|__GFP_HIGH|__GFP_IO|__GFP_FS|\
 			__GFP_NOWARN|__GFP_REPEAT|__GFP_NOFAIL|\
-			__GFP_NORETRY|__GFP_NOMEMALLOC)
+			__GFP_NORETRY|__GFP_MEMALLOC|__GFP_NOMEMALLOC)
 
 /* Control slab gfp mask during early boot */
 #define GFP_BOOT_MASK __GFP_BITS_MASK & ~(__GFP_WAIT|__GFP_IO|__GFP_FS)
Index: mmotm/mm/page_alloc.c
===================================================================
--- mmotm.orig/mm/page_alloc.c
+++ mmotm/mm/page_alloc.c
@@ -1710,7 +1710,9 @@ int gfp_to_alloc_flags(gfp_t gfp_mask)
 		alloc_flags |= ALLOC_HARDER;
 
 	if (likely(!(gfp_mask & __GFP_NOMEMALLOC))) {
-		if (!in_irq() && (p->flags & PF_MEMALLOC))
+		if (gfp_mask & __GFP_MEMALLOC)
+			alloc_flags |= ALLOC_NO_WATERMARKS;
+		else if (!in_irq() && (p->flags & PF_MEMALLOC))
 			alloc_flags |= ALLOC_NO_WATERMARKS;
 		else if (!in_interrupt() &&
 				unlikely(test_thread_flag(TIF_MEMDIE)))


* [09/31] mm: system wide ALLOC_NO_WATERMARK
From: Suresh Jayaraman @ 2009-10-01 14:06 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, linux-kernel, linux-mm
  Cc: netdev, Neil Brown, Miklos Szeredi, Wouter Verhelst,
	Peter Zijlstra, trond.myklebust, Suresh Jayaraman

From: Peter Zijlstra <a.p.zijlstra@chello.nl> 

The reserve is proportionally distributed over all (!highmem) zones in the
system. So we need to allow an emergency allocation access to all zones. In
order to do that we need to break out of any mempolicy boundaries we might
have.

In my opinion that does not break mempolicies as those are user oriented
and not system oriented. That is, system allocations are not guaranteed to be
within mempolicy boundaries. For instance IRQs don't even have a mempolicy.

So breaking out of mempolicy boundaries for 'rare' emergency allocations,
which are always system allocations (as opposed to user) is ok.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de>
---
 mm/page_alloc.c |    5 +++++
 1 file changed, 5 insertions(+)

Index: mmotm/mm/page_alloc.c
===================================================================
--- mmotm.orig/mm/page_alloc.c
+++ mmotm/mm/page_alloc.c
@@ -1775,6 +1775,11 @@ restart:
 rebalance:
 	/* Allocate without watermarks if the context allows */
 	if (alloc_flags & ALLOC_NO_WATERMARKS) {
+		/*
+		 * break out mempolicy boundaries
+		 */
+		zonelist = node_zonelist(numa_node_id(), gfp_mask);
+
 		page = __alloc_pages_high_priority(gfp_mask, order,
 				zonelist, high_zoneidx, nodemask,
 				preferred_zone, migratetype);


* [PATCH 08/31] mm: emergency pool
From: Suresh Jayaraman @ 2009-10-01 14:06 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, linux-kernel, linux-mm
  Cc: netdev, Neil Brown, Miklos Szeredi, Wouter Verhelst,
	Peter Zijlstra, trond.myklebust, Suresh Jayaraman

From: Peter Zijlstra <a.p.zijlstra@chello.nl> 

Provide a means to reserve a specific number of pages.

The emergency pool is separated from the min watermark because ALLOC_HARDER
and ALLOC_HIGH modify the watermark in a relative way and thus do not ensure
a strict minimum.
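
A small userspace sketch of the arithmetic may make this concrete. It models
only the order-0 comparison in zone_watermark_ok(); the function name and the
MODEL_ prefixed flag values are hypothetical, and the per-order loop is left
out.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-ins for ALLOC_HARDER and ALLOC_HIGH; values chosen for this model. */
#define MODEL_ALLOC_HARDER	0x10
#define MODEL_ALLOC_HIGH	0x20

/*
 * Order-0 watermark check. ALLOC_HIGH and ALLOC_HARDER scale the
 * watermark down by a fraction of itself, so they cannot guarantee a
 * fixed floor. pages_emerg is added after the scaling and is therefore
 * never reduced by either flag.
 */
static bool model_watermark_ok(long free_pages, long mark,
			       long lowmem_reserve, long pages_emerg,
			       int alloc_flags)
{
	long min = mark;

	if (alloc_flags & MODEL_ALLOC_HIGH)
		min -= min / 2;
	if (alloc_flags & MODEL_ALLOC_HARDER)
		min -= min / 4;

	return free_pages > min + lowmem_reserve + pages_emerg;
}
```

With a watermark of 1000 pages and both flags set, the scaled minimum in this
model drops to 375 pages, while an emergency pool of 200 pages stays fully
protected on top of it.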

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de>
---
 include/linux/mmzone.h |    3 +
 mm/page_alloc.c        |   84 +++++++++++++++++++++++++++++++++++++++++++------
 mm/vmstat.c            |    6 +--
 3 files changed, 80 insertions(+), 13 deletions(-)

Index: mmotm/include/linux/mmzone.h
===================================================================
--- mmotm.orig/include/linux/mmzone.h
+++ mmotm/include/linux/mmzone.h
@@ -273,6 +273,7 @@ struct zone_reclaim_stat {
 
 struct zone {
 	/* Fields commonly accessed by the page allocator */
+	unsigned long           pages_emerg;    /* emergency pool */
 
 	/* zone watermarks, access with *_wmark_pages(zone) macros */
 	unsigned long watermark[NR_WMARK];
@@ -757,6 +758,8 @@ int sysctl_min_unmapped_ratio_sysctl_han
 int sysctl_min_slab_ratio_sysctl_handler(struct ctl_table *, int,
 			struct file *, void __user *, size_t *, loff_t *);
 
+int adjust_memalloc_reserve(int pages);
+
 extern int numa_zonelist_order_handler(struct ctl_table *, int,
 			struct file *, void __user *, size_t *, loff_t *);
 extern char numa_zonelist_order[];
Index: mmotm/mm/page_alloc.c
===================================================================
--- mmotm.orig/mm/page_alloc.c
+++ mmotm/mm/page_alloc.c
@@ -123,6 +123,8 @@ static char * const zone_names[MAX_NR_ZO
 
 static DEFINE_SPINLOCK(min_free_lock);
 int min_free_kbytes = 1024;
+static DEFINE_MUTEX(var_free_mutex);
+int var_free_kbytes;
 
 unsigned long __meminitdata nr_kernel_pages;
 unsigned long __meminitdata nr_all_pages;
@@ -1302,7 +1304,7 @@ int zone_watermark_ok(struct zone *z, in
 	if (alloc_flags & ALLOC_HARDER)
 		min -= min / 4;
 
-	if (free_pages <= min + z->lowmem_reserve[classzone_idx])
+	if (free_pages <= min+z->lowmem_reserve[classzone_idx]+z->pages_emerg)
 		return 0;
 	for (o = 0; o < order; o++) {
 		/* At the next order, this order's pages become unavailable */
@@ -1726,7 +1728,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, u
 {
 	const gfp_t wait = gfp_mask & __GFP_WAIT;
 	struct page *page = NULL;
-	int alloc_flags;
+	int alloc_flags = 0;
 	unsigned long pages_reclaimed = 0;
 	unsigned long did_some_progress;
 	struct task_struct *p = current;
@@ -1841,8 +1843,8 @@ rebalance:
 nopage:
 	if (!(gfp_mask & __GFP_NOWARN) && printk_ratelimit()) {
 		printk(KERN_WARNING "%s: page allocation failure."
-			" order:%d, mode:0x%x\n",
-			p->comm, order, gfp_mask);
+			" order:%d, mode:0x%x, alloc_flags:0x%x, pflags:0x%x\n",
+			p->comm, order, gfp_mask, alloc_flags, p->flags);
 		dump_stack();
 		show_mem();
 	}
@@ -2158,9 +2160,9 @@ void show_free_areas(void)
 			"\n",
 			zone->name,
 			K(zone_page_state(zone, NR_FREE_PAGES)),
-			K(min_wmark_pages(zone)),
-			K(low_wmark_pages(zone)),
-			K(high_wmark_pages(zone)),
+			K(zone->pages_emerg + min_wmark_pages(zone)),
+			K(zone->pages_emerg + low_wmark_pages(zone)),
+			K(zone->pages_emerg + high_wmark_pages(zone)),
 			K(zone_page_state(zone, NR_ACTIVE_ANON)),
 			K(zone_page_state(zone, NR_INACTIVE_ANON)),
 			K(zone_page_state(zone, NR_ACTIVE_FILE)),
@@ -4388,7 +4390,7 @@ static void calculate_totalreserve_pages
 			}
 
 			/* we treat the high watermark as reserved pages. */
-			max += high_wmark_pages(zone);
+			max += high_wmark_pages(zone) + zone->pages_emerg;
 
 			if (max > zone->present_pages)
 				max = zone->present_pages;
@@ -4446,7 +4448,8 @@ static void setup_per_zone_lowmem_reserv
  */
 static void __setup_per_zone_wmarks(void)
 {
-	unsigned long pages_min = min_free_kbytes >> (PAGE_SHIFT - 10);
+	unsigned pages_min = min_free_kbytes >> (PAGE_SHIFT - 10);
+	unsigned pages_emerg = var_free_kbytes >> (PAGE_SHIFT - 10);
 	unsigned long lowmem_pages = 0;
 	struct zone *zone;
 	unsigned long flags;
@@ -4458,11 +4461,13 @@ static void __setup_per_zone_wmarks(void
 	}
 
 	for_each_zone(zone) {
-		u64 tmp;
+		u64 tmp, tmp_emerg;
 
 		spin_lock_irqsave(&zone->lock, flags);
 		tmp = (u64)pages_min * zone->present_pages;
 		do_div(tmp, lowmem_pages);
+		tmp_emerg = (u64)pages_emerg * zone->present_pages;
+		do_div(tmp_emerg, lowmem_pages);
 		if (is_highmem(zone)) {
 			/*
 			 * __GFP_HIGH and PF_MEMALLOC allocations usually don't
@@ -4481,12 +4486,14 @@ static void __setup_per_zone_wmarks(void
 			if (min_pages > 128)
 				min_pages = 128;
 			zone->watermark[WMARK_MIN] = min_pages;
+			zone->pages_emerg = 0;
 		} else {
 			/*
 			 * If it's a lowmem zone, reserve a number of pages
 			 * proportionate to the zone's size.
 			 */
 			zone->watermark[WMARK_MIN] = tmp;
+			zone->pages_emerg = tmp_emerg;
 		}
 
 		zone->watermark[WMARK_LOW]  = min_wmark_pages(zone) + (tmp >> 2);
@@ -4551,6 +4558,63 @@ void setup_per_zone_wmarks(void)
 	spin_unlock_irqrestore(&min_free_lock, flags);
 }
 
+static void __adjust_memalloc_reserve(int pages)
+{
+	var_free_kbytes += pages << (PAGE_SHIFT - 10);
+	BUG_ON(var_free_kbytes < 0);
+	setup_per_zone_wmarks();
+}
+
+static int test_reserve_limits(void)
+{
+	struct zone *zone;
+	int node;
+
+	for_each_zone(zone)
+		wakeup_kswapd(zone, 0);
+
+	for_each_online_node(node) {
+		struct page *page = alloc_pages_node(node, GFP_KERNEL, 0);
+		if (!page)
+			return -ENOMEM;
+
+		__free_page(page);
+	}
+
+	return 0;
+}
+
+/**
+ *	adjust_memalloc_reserve - adjust the memalloc reserve
+ *	@pages: number of pages to add
+ *
+ *	It adds a number of pages to the memalloc reserve; if
+ *	the number was positive it kicks reclaim into action to
+ *	satisfy the higher watermarks.
+ *
+ *	returns -ENOMEM when it failed to satisfy the watermarks.
+ */
+int adjust_memalloc_reserve(int pages)
+{
+	int err = 0;
+
+	mutex_lock(&var_free_mutex);
+	__adjust_memalloc_reserve(pages);
+	if (pages > 0) {
+		err = test_reserve_limits();
+		if (err) {
+			__adjust_memalloc_reserve(-pages);
+			goto unlock;
+		}
+	}
+	printk(KERN_DEBUG "Emergency reserve: %d\n", var_free_kbytes);
+
+unlock:
+	mutex_unlock(&var_free_mutex);
+	return err;
+}
+EXPORT_SYMBOL_GPL(adjust_memalloc_reserve);
+
 /*
  * Initialise min_free_kbytes.
  *
Index: mmotm/mm/vmstat.c
===================================================================
--- mmotm.orig/mm/vmstat.c
+++ mmotm/mm/vmstat.c
@@ -713,9 +713,9 @@ static void zoneinfo_show_print(struct s
 		   "\n        spanned  %lu"
 		   "\n        present  %lu",
 		   zone_page_state(zone, NR_FREE_PAGES),
-		   min_wmark_pages(zone),
-		   low_wmark_pages(zone),
-		   high_wmark_pages(zone),
+		   zone->pages_emerg + min_wmark_pages(zone),
+		   zone->pages_emerg + low_wmark_pages(zone),
+		   zone->pages_emerg + high_wmark_pages(zone),
 		   zone->pages_scanned,
 		   zone->spanned_pages,
 		   zone->present_pages);

^ permalink raw reply

* [PATCH 07/31] mm: allow PF_MEMALLOC from softirq context
From: Suresh Jayaraman @ 2009-10-01 14:05 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, linux-kernel, linux-mm
  Cc: netdev, Neil Brown, Miklos Szeredi, Wouter Verhelst,
	Peter Zijlstra, trond.myklebust, Suresh Jayaraman

From: Peter Zijlstra <a.p.zijlstra@chello.nl> 

This is needed to allow network softirq packet processing to make use of
PF_MEMALLOC.

Currently softirq context cannot use PF_MEMALLOC because it is not associated
with a task and therefore has no task flags to fiddle with. Thus the gfp to
alloc flag mapping ignores the task flags when in interrupt (hard or soft)
context.

Allowing softirqs to make use of PF_MEMALLOC therefore requires some trickery.
We basically borrow the task flags from whatever process happens to be
preempted by the softirq.

So we modify the gfp to alloc flags mapping to not exclude task flags in
softirq context, and modify the softirq code to save, clear and restore the
PF_MEMALLOC flag.

The save and clear ensure the preempted task's PF_MEMALLOC flag doesn't
leak into the softirq. The restore ensures a softirq's PF_MEMALLOC flag cannot
leak back into the preempted process.
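
The save, clear and restore sequence can be sketched in userspace as follows.
This is only an illustration: the task struct, the handler callback and the
flag values are hypothetical stand-ins, and restore_flags_sketch() merely
mirrors the shape of the tsk_restore_flags() helper added below.

```c
#include <assert.h>

/* Hypothetical flag bits for this sketch only. */
#define SKETCH_PF_MEMALLOC 0x0800ul
#define SKETCH_PF_OTHER    0x0001ul

/* Hypothetical stand-in for the flags word of struct task_struct. */
struct task_sketch {
	unsigned long flags;
};

/* Restore only the bits selected by mask from the saved flags word. */
static void restore_flags_sketch(struct task_sketch *p, unsigned long saved,
				 unsigned long mask)
{
	p->flags &= ~mask;
	p->flags |= saved & mask;
}

typedef void (*softirq_handler_sketch)(struct task_sketch *);

/* Run a handler on top of the preempted task, as a softirq would. */
static void run_softirq_sketch(struct task_sketch *cur,
			       softirq_handler_sketch handler)
{
	unsigned long saved = cur->flags;	/* save */

	cur->flags &= ~SKETCH_PF_MEMALLOC;	/* clear: nothing leaks in */
	handler(cur);
	/* restore: nothing leaks back out */
	restore_flags_sketch(cur, saved, SKETCH_PF_MEMALLOC);
}

/* Example handler: must start without the flag, then sets it for itself. */
static void handler_sets_memalloc_sketch(struct task_sketch *cur)
{
	assert(!(cur->flags & SKETCH_PF_MEMALLOC));
	cur->flags |= SKETCH_PF_MEMALLOC | SKETCH_PF_OTHER;
}
```

Because the restore is masked, only PF_MEMALLOC is rolled back; any other
flag bit changed while the handler ran is left alone.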

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de>
---
 include/linux/sched.h |    7 +++++++
 kernel/softirq.c      |    3 +++
 mm/page_alloc.c       |    7 ++++---
 3 files changed, 14 insertions(+), 3 deletions(-)

Index: mmotm/include/linux/sched.h
===================================================================
--- mmotm.orig/include/linux/sched.h
+++ mmotm/include/linux/sched.h
@@ -1724,6 +1724,13 @@ extern cputime_t task_gtime(struct task_
 #define tsk_used_math(p) ((p)->flags & PF_USED_MATH)
 #define used_math() tsk_used_math(current)
 
+static inline void tsk_restore_flags(struct task_struct *p,
+				     unsigned long pflags, unsigned long mask)
+{
+	p->flags &= ~mask;
+	p->flags |= pflags & mask;
+}
+
 #ifdef CONFIG_SMP
 extern int set_cpus_allowed_ptr(struct task_struct *p,
 				const struct cpumask *new_mask);
Index: mmotm/kernel/softirq.c
===================================================================
--- mmotm.orig/kernel/softirq.c
+++ mmotm/kernel/softirq.c
@@ -194,6 +194,8 @@ asmlinkage void __do_softirq(void)
 	__u32 pending;
 	int max_restart = MAX_SOFTIRQ_RESTART;
 	int cpu;
+	unsigned long pflags = current->flags;
+	current->flags &= ~PF_MEMALLOC;
 
 	pending = local_softirq_pending();
 	account_system_vtime(current);
@@ -246,6 +248,7 @@ restart:
 
 	account_system_vtime(current);
 	_local_bh_enable();
+	tsk_restore_flags(current, pflags, PF_MEMALLOC);
 }
 
 #ifndef __ARCH_HAS_DO_SOFTIRQ
Index: mmotm/mm/page_alloc.c
===================================================================
--- mmotm.orig/mm/page_alloc.c
+++ mmotm/mm/page_alloc.c
@@ -1708,9 +1708,10 @@ int gfp_to_alloc_flags(gfp_t gfp_mask)
 		alloc_flags |= ALLOC_HARDER;
 
 	if (likely(!(gfp_mask & __GFP_NOMEMALLOC))) {
-		if (!in_interrupt() &&
-		    ((p->flags & PF_MEMALLOC) ||
-		     unlikely(test_thread_flag(TIF_MEMDIE))))
+		if (!in_irq() && (p->flags & PF_MEMALLOC))
+			alloc_flags |= ALLOC_NO_WATERMARKS;
+		else if (!in_interrupt() &&
+				unlikely(test_thread_flag(TIF_MEMDIE)))
 			alloc_flags |= ALLOC_NO_WATERMARKS;
 	}
 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* [PATCH 06/31] mm: kmem_alloc_estimate()
From: Suresh Jayaraman @ 2009-10-01 14:05 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, linux-kernel, linux-mm
  Cc: netdev, Neil Brown, Miklos Szeredi, Wouter Verhelst,
	Peter Zijlstra, trond.myklebust, Suresh Jayaraman

From: Peter Zijlstra <a.p.zijlstra@chello.nl> 

Provide a method to get the upper bound on the pages needed to allocate
a given number of objects from a given kmem_cache.

This lays the foundation for a generic reserve framework as presented in
a later patch in this series. This framework needs to convert object demand
(kmalloc() bytes, kmem_cache_alloc() objects) to pages.
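
To make the object-to-page conversion concrete, here is a userspace sketch of
the arithmetic used by the SLAB variant below. The function names, the fixed
4096-byte page size and the explicit zero-object case are assumptions of this
sketch, not part of the patch.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096u	/* assumed page size for this sketch */
#define SKETCH_DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

/* floor(log2(v)) for v > 0; userspace stand-in for the kernel's ilog2(). */
static unsigned floor_log2_sketch(unsigned v)
{
	unsigned r = 0;

	assert(v > 0);
	while (v >>= 1)
		r++;
	return r;
}

/*
 * Upper bound of pages for `objects` objects when each slab holds
 * `objs_per_slab` objects and spans (1 << order) pages.
 */
static unsigned slab_estimate_sketch(unsigned objs_per_slab, unsigned order,
				     unsigned objects)
{
	unsigned nr_slabs, nr_pages;

	assert(objs_per_slab > 0);
	nr_slabs = SKETCH_DIV_ROUND_UP(objects, objs_per_slab);
	nr_pages = nr_slabs << order;
	if (nr_pages == 0)
		return 0;	/* sketch choice: nothing requested, no overhead */

	/* Rough allowance for queues and slab management structures. */
	return nr_pages + 1 + floor_log2_sketch(nr_pages);
}

/* Worst-case slack of power-of-two kmalloc sizes: double the byte count. */
static unsigned long bytes_slack_estimate_sketch(size_t bytes)
{
	return (unsigned long)SKETCH_DIV_ROUND_UP(2 * bytes, SKETCH_PAGE_SIZE);
}
```

For example, 100 objects at 30 objects per order-0 slab need 4 slabs, so the
estimate is 4 + 1 + 2 = 7 pages.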

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de>
---
 include/linux/slab.h |    4 ++
 mm/slab.c            |   75 +++++++++++++++++++++++++++++++++++++++++++
 mm/slob.c            |   67 +++++++++++++++++++++++++++++++++++++++
 mm/slub.c            |   87 +++++++++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 233 insertions(+)

Index: mmotm/include/linux/slab.h
===================================================================
--- mmotm.orig/include/linux/slab.h
+++ mmotm/include/linux/slab.h
@@ -102,6 +102,8 @@ void kmem_cache_free(struct kmem_cache *
 unsigned int kmem_cache_size(struct kmem_cache *);
 const char *kmem_cache_name(struct kmem_cache *);
 int kmem_ptr_validate(struct kmem_cache *cachep, const void *ptr);
+unsigned kmem_alloc_estimate(struct kmem_cache *cachep,
+			gfp_t flags, int objects);
 
 /*
  * Please use this macro to create slab caches. Simply specify the
@@ -138,6 +140,8 @@ void * __must_check krealloc(const void
 void kfree(const void *);
 void kzfree(const void *);
 size_t ksize(const void *);
+unsigned kmalloc_estimate_objs(size_t, gfp_t, int);
+unsigned kmalloc_estimate_bytes(gfp_t, size_t);
 
 /*
  * Allocator specific definitions. These are mainly used to establish optimized
Index: mmotm/mm/slab.c
===================================================================
--- mmotm.orig/mm/slab.c
+++ mmotm/mm/slab.c
@@ -3829,6 +3829,81 @@ const char *kmem_cache_name(struct kmem_
 EXPORT_SYMBOL_GPL(kmem_cache_name);
 
 /*
+ * Calculate the upper bound of pages required to sequentially allocate
+ * @objects objects from @cachep.
+ */
+unsigned kmem_alloc_estimate(struct kmem_cache *cachep,
+		gfp_t flags, int objects)
+{
+	/*
+	 * (1) memory for objects,
+	 */
+	unsigned nr_slabs = DIV_ROUND_UP(objects, cachep->num);
+	unsigned nr_pages = nr_slabs << cachep->gfporder;
+
+	/*
+	 * (2) memory for each per-cpu queue (nr_cpu_ids),
+	 * (3) memory for each per-node alien queues (nr_cpu_ids), and
+	 * (4) some amount of memory for the slab management structures
+	 *
+	 * XXX: truly account these
+	 */
+	nr_pages += 1 + ilog2(nr_pages);
+
+	return nr_pages;
+}
+
+/*
+ * Calculate the upper bound of pages required to sequentially allocate
+ * @count objects of @size bytes from kmalloc given @flags.
+ */
+unsigned kmalloc_estimate_objs(size_t size, gfp_t flags, int count)
+{
+	struct kmem_cache *s = kmem_find_general_cachep(size, flags);
+	if (!s)
+		return 0;
+
+	return kmem_alloc_estimate(s, flags, count);
+}
+EXPORT_SYMBOL_GPL(kmalloc_estimate_objs);
+
+/*
+ * Calculate the upper bound of pages required to sequentially allocate @bytes
+ * from kmalloc in an unspecified number of allocations of nonuniform size.
+ */
+unsigned kmalloc_estimate_bytes(gfp_t flags, size_t bytes)
+{
+	unsigned long pages;
+	struct cache_sizes *csizep = malloc_sizes;
+
+	/*
+	 * multiply by two, in order to account the worst case slack space
+	 * due to the power-of-two allocation sizes.
+	 */
+	pages = DIV_ROUND_UP(2 * bytes, PAGE_SIZE);
+
+	/*
+	 * add the kmem_cache overhead of each possible kmalloc cache
+	 */
+	for (csizep = malloc_sizes; csizep->cs_cachep; csizep++) {
+		struct kmem_cache *s;
+
+#ifdef CONFIG_ZONE_DMA
+		if (unlikely(flags & __GFP_DMA))
+			s = csizep->cs_dmacachep;
+		else
+#endif
+			s = csizep->cs_cachep;
+
+		if (s)
+			pages += kmem_alloc_estimate(s, flags, 0);
+	}
+
+	return pages;
+}
+EXPORT_SYMBOL_GPL(kmalloc_estimate_bytes);
+
+/*
  * This initializes kmem_list3 or resizes various caches for all nodes.
  */
 static int alloc_kmemlist(struct kmem_cache *cachep, gfp_t gfp)
Index: mmotm/mm/slob.c
===================================================================
--- mmotm.orig/mm/slob.c
+++ mmotm/mm/slob.c
@@ -702,6 +702,73 @@ int slab_is_available(void)
 	return slob_ready;
 }
 
+static unsigned __slob_estimate(unsigned size, unsigned align, unsigned objects)
+{
+	unsigned nr_pages;
+
+	size = SLOB_UNIT * SLOB_UNITS(size + align - 1);
+
+	if (size <= PAGE_SIZE) {
+		nr_pages = DIV_ROUND_UP(objects, PAGE_SIZE / size);
+	} else {
+		nr_pages = objects << get_order(size);
+	}
+
+	return nr_pages;
+}
+
+/*
+ * Calculate the upper bound of pages required to sequentially allocate
+ * @objects objects from @cachep.
+ */
+unsigned kmem_alloc_estimate(struct kmem_cache *c, gfp_t flags, int objects)
+{
+	unsigned size = c->size;
+
+	if (c->flags & SLAB_DESTROY_BY_RCU)
+		size += sizeof(struct slob_rcu);
+
+	return __slob_estimate(size, c->align, objects);
+}
+
+/*
+ * Calculate the upper bound of pages required to sequentially allocate
+ * @count objects of @size bytes from kmalloc given @flags.
+ */
+unsigned kmalloc_estimate_objs(size_t size, gfp_t flags, int count)
+{
+	unsigned align = max(ARCH_KMALLOC_MINALIGN, ARCH_SLAB_MINALIGN);
+
+	return __slob_estimate(size, align, count);
+}
+EXPORT_SYMBOL_GPL(kmalloc_estimate_objs);
+
+/*
+ * Calculate the upper bound of pages required to sequentially allocate @bytes
+ * from kmalloc in an unspecified number of allocations of nonuniform size.
+ */
+unsigned kmalloc_estimate_bytes(gfp_t flags, size_t bytes)
+{
+	unsigned long pages;
+
+	/*
+	 * Multiply by two, in order to account the worst case slack space
+	 * due to the power-of-two allocation sizes.
+	 *
+	 * While not true for slob, it cannot do worse than that for sequential
+	 * allocations.
+	 */
+	pages = DIV_ROUND_UP(2 * bytes, PAGE_SIZE);
+
+	/*
+	 * Our power of two series starts at PAGE_SIZE, so add one page.
+	 */
+	pages++;
+
+	return pages;
+}
+EXPORT_SYMBOL_GPL(kmalloc_estimate_bytes);
+
 void __init kmem_cache_init(void)
 {
 	slob_ready = 1;
Index: mmotm/mm/slub.c
===================================================================
--- mmotm.orig/mm/slub.c
+++ mmotm/mm/slub.c
@@ -2547,6 +2547,42 @@ const char *kmem_cache_name(struct kmem_
 }
 EXPORT_SYMBOL(kmem_cache_name);
 
+/*
+ * Calculate the upper bound of pages required to sequentially allocate
+ * @objects objects from @cachep.
+ *
+ * We should use s->min_objects because those are the least efficient.
+ */
+unsigned kmem_alloc_estimate(struct kmem_cache *s, gfp_t flags, int objects)
+{
+	unsigned long pages;
+	struct kmem_cache_order_objects x;
+
+	if (WARN_ON(!s) || WARN_ON(!oo_objects(s->min)))
+		return 0;
+
+	x = s->min;
+	pages = DIV_ROUND_UP(objects, oo_objects(x)) << oo_order(x);
+
+	/*
+	 * Account the possible additional overhead if the slab holds more than
+	 * one object. Use s->max_objects because that's the worst case.
+	 */
+	x = s->oo;
+	if (oo_objects(x) > 1) {
+		/*
+		 * Account the possible additional overhead if per cpu slabs
+		 * are currently empty and have to be allocated. This is very
+		 * unlikely but a possible scenario immediately after
+		 * kmem_cache_shrink.
+		 */
+		pages += num_possible_cpus() << oo_order(x);
+	}
+
+	return pages;
+}
+EXPORT_SYMBOL_GPL(kmem_alloc_estimate);
+
 static void list_slab_objects(struct kmem_cache *s, struct page *page,
 							const char *text)
 {
@@ -2965,6 +3001,57 @@ void kfree(const void *x)
 EXPORT_SYMBOL(kfree);
 
 /*
+ * Calculate the upper bound of pages required to sequentially allocate
+ * @count objects of @size bytes from kmalloc given @flags.
+ */
+unsigned kmalloc_estimate_objs(size_t size, gfp_t flags, int count)
+{
+	struct kmem_cache *s = get_slab(size, flags);
+	if (!s)
+		return 0;
+
+	return kmem_alloc_estimate(s, flags, count);
+
+}
+EXPORT_SYMBOL_GPL(kmalloc_estimate_objs);
+
+/*
+ * Calculate the upper bound of pages required to sequentially allocate @bytes
+ * from kmalloc in an unspecified number of allocations of nonuniform size.
+ */
+unsigned kmalloc_estimate_bytes(gfp_t flags, size_t bytes)
+{
+	int i;
+	unsigned long pages;
+
+	/*
+	 * multiply by two, in order to account the worst case slack space
+	 * due to the power-of-two allocation sizes.
+	 */
+	pages = DIV_ROUND_UP(2 * bytes, PAGE_SIZE);
+
+	/*
+	 * add the kmem_cache overhead of each possible kmalloc cache
+	 */
+	for (i = 1; i < PAGE_SHIFT; i++) {
+		struct kmem_cache *s;
+
+#ifdef CONFIG_ZONE_DMA
+		if (unlikely(flags & SLUB_DMA))
+			s = dma_kmalloc_cache(i, flags);
+		else
+#endif
+			s = &kmalloc_caches[i];
+
+		if (s)
+			pages += kmem_alloc_estimate(s, flags, 0);
+	}
+
+	return pages;
+}
+EXPORT_SYMBOL_GPL(kmalloc_estimate_bytes);
+
+/*
  * kmem_cache_shrink removes empty slabs from the partial lists and sorts
  * the remaining slabs by the number of items in use. The slabs with the
  * most items in use come first. New allocations will then fill those up

^ permalink raw reply

* [PATCH 05/31] mm: sl[au]b: add knowledge of reserve pages
From: Suresh Jayaraman @ 2009-10-01 14:05 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, linux-kernel, linux-mm
  Cc: netdev, Neil Brown, Miklos Szeredi, Wouter Verhelst,
	Peter Zijlstra, trond.myklebust, Suresh Jayaraman

From: Peter Zijlstra <a.p.zijlstra@chello.nl> 

Restrict objects from reserve slabs (ALLOC_NO_WATERMARKS) to allocation
contexts that are entitled to it. This is done to ensure reserve pages don't
leak out and get consumed.

The basic pattern used for all three allocators (SLAB, SLOB and SLUB) is the
following: for each active slab page we store whether it came from an
emergency allocation. When we find it did, make sure the current allocation
context would have been able to allocate a page from the emergency reserves as
well. In that case allow the allocation. If not, force a new slab allocation.
When that works, the memory pressure has lifted enough to allow this context
to get an object; otherwise fail the allocation.
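
The pattern described above can be sketched as a small userspace model. All
names are hypothetical, the page allocator is reduced to a single
"above watermarks" boolean, and a successful grow simply replaces the active
slab; none of this is the actual allocator code in the diff below.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-cpu cache state: one active slab. */
struct cache_sketch {
	bool reserve;	/* did the active slab come from the reserves? */
	int avail;	/* objects left in the active slab */
};

/* Hypothetical stand-in for the page allocator's state. */
struct page_alloc_sketch {
	bool above_watermarks;
};

/* Try to get a fresh slab; record whether it came from the reserves. */
static bool grow_sketch(struct cache_sketch *c,
			const struct page_alloc_sketch *pa,
			bool entitled, int objs_per_slab)
{
	bool from_reserve;

	if (pa->above_watermarks)
		from_reserve = false;	/* normal allocation succeeded */
	else if (entitled)
		from_reserve = true;	/* had to dip into the reserves */
	else
		return false;		/* below watermarks, not entitled */

	c->reserve = from_reserve;
	c->avail = objs_per_slab;	/* simplification: replace old slab */
	return true;
}

/* entitled: the context could use ALLOC_NO_WATERMARKS itself. */
static bool alloc_object_sketch(struct cache_sketch *c,
				const struct page_alloc_sketch *pa,
				bool entitled, int objs_per_slab)
{
	bool must_grow = c->reserve && !entitled;

	assert(objs_per_slab > 0);

	/* Reserve slab and unentitled context: test the watermarks again. */
	if (must_grow || c->avail == 0) {
		if (!grow_sketch(c, pa, entitled, objs_per_slab))
			return false;
	}

	c->avail--;
	return true;
}
```

In this model an unentitled context fails while the active slab is a reserve
slab and memory is still tight, and succeeds again once a normal allocation
works.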

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de>
---
 include/linux/slub_def.h |    1 
 mm/slab.c                |   61 ++++++++++++++++++++++++++++++++++++++++-------
 mm/slob.c                |   16 +++++++++++-
 mm/slub.c                |   43 +++++++++++++++++++++++++++------
 4 files changed, 104 insertions(+), 17 deletions(-)

Index: mmotm/mm/slub.c
===================================================================
--- mmotm.orig/mm/slub.c
+++ mmotm/mm/slub.c
@@ -28,6 +28,8 @@
 #include <linux/memory.h>
 #include <linux/math64.h>
 #include <linux/fault-inject.h>
+#include "internal.h"
+
 
 /*
  * Lock order:
@@ -1142,7 +1144,8 @@ static void setup_object(struct kmem_cac
 		s->ctor(object);
 }
 
-static struct page *new_slab(struct kmem_cache *s, gfp_t flags, int node)
+static
+struct page *new_slab(struct kmem_cache *s, gfp_t flags, int node, int *reserve)
 {
 	struct page *page;
 	void *start;
@@ -1156,6 +1159,8 @@ static struct page *new_slab(struct kmem
 	if (!page)
 		goto out;
 
+	*reserve = page->reserve;
+
 	inc_slabs_node(s, page_to_nid(page), page->objects);
 	page->slab = s;
 	page->flags |= 1 << PG_slab;
@@ -1602,10 +1607,20 @@ static void *__slab_alloc(struct kmem_ca
 {
 	void **object;
 	struct page *new;
+	int reserve;
 
 	/* We handle __GFP_ZERO in the caller */
 	gfpflags &= ~__GFP_ZERO;
 
+	if (unlikely(c->reserve)) {
+		/*
+		 * If the current slab is a reserve slab and the current
+		 * allocation context does not allow access to the reserves we
+		 * must force an allocation to test the current levels.
+		 */
+		if (!(gfp_to_alloc_flags(gfpflags) & ALLOC_NO_WATERMARKS))
+			goto grow_slab;
+	}
 	if (!c->page)
 		goto new_slab;
 
@@ -1619,8 +1634,8 @@ load_freelist:
 	object = c->page->freelist;
 	if (unlikely(!object))
 		goto another_slab;
-	if (unlikely(SLABDEBUG && PageSlubDebug(c->page)))
-		goto debug;
+	if (unlikely(PageSlubDebug(c->page) || c->reserve))
+		goto slow_path;
 
 	c->freelist = object[c->offset];
 	c->page->inuse = c->page->objects;
@@ -1642,16 +1657,18 @@ new_slab:
 		goto load_freelist;
 	}
 
+grow_slab:
 	if (gfpflags & __GFP_WAIT)
 		local_irq_enable();
 
-	new = new_slab(s, gfpflags, node);
+	new = new_slab(s, gfpflags, node, &reserve);
 
 	if (gfpflags & __GFP_WAIT)
 		local_irq_disable();
 
 	if (new) {
 		c = get_cpu_slab(s, smp_processor_id());
+		c->reserve = reserve;
 		stat(c, ALLOC_SLAB);
 		if (c->page)
 			flush_slab(s, c);
@@ -1663,10 +1680,21 @@ new_slab:
 	if (!(gfpflags & __GFP_NOWARN) && printk_ratelimit())
 		slab_out_of_memory(s, gfpflags, node);
 	return NULL;
-debug:
-	if (!alloc_debug_processing(s, c->page, object, addr))
+
+slow_path:
+	if (PageSlubDebug(c->page) &&
+			!alloc_debug_processing(s, c->page, object, addr))
 		goto another_slab;
 
+	/*
+	 * Avoid the slub fast path in slab_alloc() by not setting
+	 * c->freelist and the fast path in slab_free() by making
+	 * node_match() fail by setting c->node to -1.
+	 *
+	 * We use this for debug and reserve checks which need
+	 * to be done for each allocation.
+	 */
+
 	c->page->inuse++;
 	c->page->freelist = object[c->offset];
 	c->node = -1;
@@ -2213,10 +2241,11 @@ static void early_kmem_cache_node_alloc(
 	struct page *page;
 	struct kmem_cache_node *n;
 	unsigned long flags;
+	int reserve;
 
 	BUG_ON(kmalloc_caches->size < sizeof(struct kmem_cache_node));
 
-	page = new_slab(kmalloc_caches, gfpflags, node);
+	page = new_slab(kmalloc_caches, gfpflags, node, &reserve);
 
 	BUG_ON(!page);
 	if (page_to_nid(page) != node) {
Index: mmotm/include/linux/slub_def.h
===================================================================
--- mmotm.orig/include/linux/slub_def.h
+++ mmotm/include/linux/slub_def.h
@@ -40,6 +40,7 @@ struct kmem_cache_cpu {
 	int node;		/* The node of the page (or -1 for debug) */
 	unsigned int offset;	/* Freepointer offset (in word units) */
 	unsigned int objsize;	/* Size of an object (from kmem_cache) */
+	int reserve;		/* Did the current page come from the reserve */
 #ifdef CONFIG_SLUB_STATS
 	unsigned stat[NR_SLUB_STAT_ITEMS];
 #endif
Index: mmotm/mm/slab.c
===================================================================
--- mmotm.orig/mm/slab.c
+++ mmotm/mm/slab.c
@@ -120,6 +120,8 @@
 #include	<asm/tlbflush.h>
 #include	<asm/page.h>
 
+#include 	"internal.h"
+
 /*
  * DEBUG	- 1 for kmem_cache_create() to honour; SLAB_RED_ZONE & SLAB_POISON.
  *		  0 for faster, smaller code (especially in the critical paths).
@@ -268,7 +270,8 @@ struct array_cache {
 	unsigned int avail;
 	unsigned int limit;
 	unsigned int batchcount;
-	unsigned int touched;
+	unsigned int touched:1,
+		     reserve:1;
 	spinlock_t lock;
 	void *entry[];	/*
 			 * Must have this definition in here for the proper
@@ -692,6 +695,27 @@ static inline struct array_cache *cpu_ca
 	return cachep->array[smp_processor_id()];
 }
 
+/*
+ * If the last page came from the reserves, and the current allocation context
+ * does not have access to them, force an allocation to test the watermarks.
+ */
+static inline int slab_force_alloc(struct kmem_cache *cachep, gfp_t flags)
+{
+	if (unlikely(cpu_cache_get(cachep)->reserve) &&
+			!(gfp_to_alloc_flags(flags) & ALLOC_NO_WATERMARKS))
+		return 1;
+
+	return 0;
+}
+
+static inline void slab_set_reserve(struct kmem_cache *cachep, int reserve)
+{
+	struct array_cache *ac = cpu_cache_get(cachep);
+
+	if (unlikely(ac->reserve != reserve))
+		ac->reserve = reserve;
+}
+
 static inline struct kmem_cache *__find_general_cachep(size_t size,
 							gfp_t gfpflags)
 {
@@ -898,6 +922,7 @@ static struct array_cache *alloc_arrayca
 		nc->limit = entries;
 		nc->batchcount = batchcount;
 		nc->touched = 0;
+		nc->reserve = 0;
 		spin_lock_init(&nc->lock);
 	}
 	return nc;
@@ -1595,7 +1620,8 @@ __initcall(cpucache_init);
  * did not request dmaable memory, we might get it, but that
  * would be relatively rare and ignorable.
  */
-static void *kmem_getpages(struct kmem_cache *cachep, gfp_t flags, int nodeid)
+static void *kmem_getpages(struct kmem_cache *cachep, gfp_t flags, int nodeid,
+		int *reserve)
 {
 	struct page *page;
 	int nr_pages;
@@ -1617,6 +1643,7 @@ static void *kmem_getpages(struct kmem_c
 	if (!page)
 		return NULL;
 
+	*reserve = page->reserve;
 	nr_pages = (1 << cachep->gfporder);
 	if (cachep->flags & SLAB_RECLAIM_ACCOUNT)
 		add_zone_page_state(page_zone(page),
@@ -2049,6 +2076,7 @@ static int __init_refok setup_cpu_cache(
 	cpu_cache_get(cachep)->limit = BOOT_CPUCACHE_ENTRIES;
 	cpu_cache_get(cachep)->batchcount = 1;
 	cpu_cache_get(cachep)->touched = 0;
+	cpu_cache_get(cachep)->reserve = 0;
 	cachep->batchcount = 1;
 	cachep->limit = BOOT_CPUCACHE_ENTRIES;
 	return 0;
@@ -2732,6 +2760,7 @@ static int cache_grow(struct kmem_cache
 	size_t offset;
 	gfp_t local_flags;
 	struct kmem_list3 *l3;
+	int reserve;
 
 	/*
 	 * Be lazy and only check for valid flags here,  keeping it out of the
@@ -2770,7 +2799,7 @@ static int cache_grow(struct kmem_cache
 	 * 'nodeid'.
 	 */
 	if (!objp)
-		objp = kmem_getpages(cachep, local_flags, nodeid);
+		objp = kmem_getpages(cachep, local_flags, nodeid, &reserve);
 	if (!objp)
 		goto failed;
 
@@ -2787,6 +2816,7 @@ static int cache_grow(struct kmem_cache
 	if (local_flags & __GFP_WAIT)
 		local_irq_disable();
 	check_irq_off();
+	slab_set_reserve(cachep, reserve);
 	spin_lock(&l3->list_lock);
 
 	/* Make slab active. */
@@ -2921,7 +2951,8 @@ bad:
 #define check_slabp(x,y) do { } while(0)
 #endif
 
-static void *cache_alloc_refill(struct kmem_cache *cachep, gfp_t flags)
+static void *cache_alloc_refill(struct kmem_cache *cachep,
+		gfp_t flags, int must_refill)
 {
 	int batchcount;
 	struct kmem_list3 *l3;
@@ -2931,6 +2962,8 @@ static void *cache_alloc_refill(struct k
 retry:
 	check_irq_off();
 	node = numa_node_id();
+	if (unlikely(must_refill))
+		goto force_grow;
 	ac = cpu_cache_get(cachep);
 	batchcount = ac->batchcount;
 	if (!ac->touched && batchcount > BATCHREFILL_LIMIT) {
@@ -2998,11 +3031,14 @@ alloc_done:
 
 	if (unlikely(!ac->avail)) {
 		int x;
+force_grow:
 		x = cache_grow(cachep, flags | GFP_THISNODE, node, NULL);
 
 		/* cache_grow can reenable interrupts, then ac could change. */
 		ac = cpu_cache_get(cachep);
-		if (!x && ac->avail == 0)	/* no objects in sight? abort */
+
+		/* no objects in sight? abort */
+		if (!x && (ac->avail == 0 || must_refill))
 			return NULL;
 
 		if (!ac->avail)		/* objects refilled by interrupt? */
@@ -3092,17 +3128,18 @@ static inline void *____cache_alloc(stru
 {
 	void *objp;
 	struct array_cache *ac;
+	int must_refill = slab_force_alloc(cachep, flags);
 
 	check_irq_off();
 
 	ac = cpu_cache_get(cachep);
-	if (likely(ac->avail)) {
+	if (likely(ac->avail && !must_refill)) {
 		STATS_INC_ALLOCHIT(cachep);
 		ac->touched = 1;
 		objp = ac->entry[--ac->avail];
 	} else {
 		STATS_INC_ALLOCMISS(cachep);
-		objp = cache_alloc_refill(cachep, flags);
+		objp = cache_alloc_refill(cachep, flags, must_refill);
 	}
 	/*
 	 * To avoid a false negative, if an object that is in one of the
@@ -3152,7 +3189,7 @@ static void *fallback_alloc(struct kmem_
 	struct zone *zone;
 	enum zone_type high_zoneidx = gfp_zone(flags);
 	void *obj = NULL;
-	int nid;
+	int nid, reserve;
 
 	if (flags & __GFP_THISNODE)
 		return NULL;
@@ -3188,10 +3225,12 @@ retry:
 		if (local_flags & __GFP_WAIT)
 			local_irq_enable();
 		kmem_flagcheck(cache, flags);
-		obj = kmem_getpages(cache, local_flags, numa_node_id());
+		obj = kmem_getpages(cache, local_flags, numa_node_id(),
+				    &reserve);
 		if (local_flags & __GFP_WAIT)
 			local_irq_disable();
 		if (obj) {
+			slab_set_reserve(cache, reserve);
 			/*
 			 * Insert into the appropriate per node queues
 			 */
@@ -3230,6 +3269,9 @@ static void *____cache_alloc_node(struct
 	l3 = cachep->nodelists[nodeid];
 	BUG_ON(!l3);
 
+	if (unlikely(slab_force_alloc(cachep, flags)))
+		goto force_grow;
+
 retry:
 	check_irq_off();
 	spin_lock(&l3->list_lock);
@@ -3267,6 +3309,7 @@ retry:
 
 must_grow:
 	spin_unlock(&l3->list_lock);
+force_grow:
 	x = cache_grow(cachep, flags | GFP_THISNODE, nodeid, NULL);
 	if (x)
 		goto retry;
Index: mmotm/mm/slob.c
===================================================================
--- mmotm.orig/mm/slob.c
+++ mmotm/mm/slob.c
@@ -69,6 +69,7 @@
 #include <linux/kmemtrace.h>
 #include <linux/kmemleak.h>
 #include <asm/atomic.h>
+#include "internal.h"
 
 /*
  * slob_block has a field 'units', which indicates size of block if +ve,
@@ -191,6 +192,11 @@ struct slob_rcu {
 static DEFINE_SPINLOCK(slob_lock);
 
 /*
+ * tracks the reserve state for the allocator.
+ */
+static int slob_reserve;
+
+/*
  * Encode the given size and next info into a free slob block s.
  */
 static void set_slob(slob_t *s, slobidx_t size, slob_t *next)
@@ -240,7 +246,7 @@ static int slob_last(slob_t *s)
 
 static void *slob_new_pages(gfp_t gfp, int order, int node)
 {
-	void *page;
+	struct page *page;
 
 #ifdef CONFIG_NUMA
 	if (node != -1)
@@ -252,6 +258,8 @@ static void *slob_new_pages(gfp_t gfp, i
 	if (!page)
 		return NULL;
 
+	slob_reserve = page->reserve;
+
 	return page_address(page);
 }
 
@@ -324,6 +332,11 @@ static void *slob_alloc(size_t size, gfp
 	slob_t *b = NULL;
 	unsigned long flags;
 
+	if (unlikely(slob_reserve)) {
+		if (!(gfp_to_alloc_flags(gfp) & ALLOC_NO_WATERMARKS))
+			goto grow;
+	}
+
 	if (size < SLOB_BREAK1)
 		slob_list = &free_slob_small;
 	else if (size < SLOB_BREAK2)
@@ -362,6 +375,7 @@ static void *slob_alloc(size_t size, gfp
 	}
 	spin_unlock_irqrestore(&slob_lock, flags);
 
+grow:
 	/* Not enough space: must allocate a new page */
 	if (!b) {
 		b = slob_new_pages(gfp & ~__GFP_ZERO, 0, node);

^ permalink raw reply

* [PATCH 04/31] mm: tag reserve pages
From: Suresh Jayaraman @ 2009-10-01 14:05 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, linux-kernel, linux-mm
  Cc: netdev, Neil Brown, Miklos Szeredi, Wouter Verhelst,
	Peter Zijlstra, trond.myklebust, Suresh Jayaraman

From: Peter Zijlstra <a.p.zijlstra@chello.nl> 

Tag pages allocated from the reserves with a non-zero page->reserve.
This allows us to distinguish and account reserve pages.

Since low-memory situations are transient, and unrelated to the actual
page (any page can be on the freelist when we run low), don't mark the
page in any permanent way - just pass along the information to the
allocatee.
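
A minimal sketch of the tagging step, assuming a hypothetical flag value and
a reduced page struct: the tag is derived from the flags that made the
allocation succeed, not from any property of the page itself, so it only
describes this one allocation.

```c
#include <assert.h>

/* Hypothetical flag values for this sketch only. */
#define SKETCH_ALLOC_NO_WATERMARKS 0x04u
#define SKETCH_ALLOC_HARDER        0x10u

/* Hypothetical, reduced stand-in for struct page. */
struct page_sketch {
	int reserve;	/* non-zero: handed out from the reserves */
};

/* Tag a freshly allocated page; !! normalises the masked bit to 0 or 1. */
static int tag_reserve_sketch(struct page_sketch *page, unsigned alloc_flags)
{
	page->reserve = !!(alloc_flags & SKETCH_ALLOC_NO_WATERMARKS);
	return page->reserve;
}
```

The same page would be tagged 0 if it were freed and later allocated again
under normal conditions.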

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de>
---
 include/linux/mm_types.h |    1 +
 mm/page_alloc.c          |    4 +++-
 2 files changed, 4 insertions(+), 1 deletion(-)

Index: mmotm/include/linux/mm_types.h
===================================================================
--- mmotm.orig/include/linux/mm_types.h
+++ mmotm/include/linux/mm_types.h
@@ -77,6 +77,7 @@ struct page {
 	union {
 		pgoff_t index;		/* Our offset within mapping. */
 		void *freelist;		/* SLUB: freelist req. slab lock */
+		int reserve;		/* page_alloc: page is a reserve page */
 	};
 	struct list_head lru;		/* Pageout list, eg. active_list
 					 * protected by zone->lru_lock !
Index: mmotm/mm/page_alloc.c
===================================================================
--- mmotm.orig/mm/page_alloc.c
+++ mmotm/mm/page_alloc.c
@@ -1501,8 +1501,10 @@ zonelist_scan:
 try_this_zone:
 		page = buffered_rmqueue(preferred_zone, zone, order,
 						gfp_mask, migratetype);
-		if (page)
+		if (page) {
+			page->reserve = !!(alloc_flags & ALLOC_NO_WATERMARKS);
 			break;
+		}
 this_zone_full:
 		if (NUMA_BUILD)
 			zlc_mark_zone_full(zonelist, z);
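
For anyone who wants to poke at the tagging outside the kernel, here is a minimal userspace sketch. It is a toy model and not kernel code: `toy_page`, `toy_alloc_page()` and `toy_page_from_reserve()` are hypothetical names made up for illustration. Only the `ALLOC_NO_WATERMARKS` value and the `!!` normalisation are taken from the series.

```c
#include <stdbool.h>
#include <stddef.h>

#define ALLOC_NO_WATERMARKS 0x04 /* value as defined in mm/internal.h by patch 03 */

/* Hypothetical stand-in for struct page; only the field of interest is kept. */
struct toy_page {
	int reserve;	/* non-zero: allocated while ignoring the watermarks */
};

/*
 * Hypothetical allocator tail: tag the page the same way the hunk above does.
 * The !! collapses any non-zero masked value to exactly 1.
 */
static struct toy_page *toy_alloc_page(struct toy_page *page, int alloc_flags)
{
	if (!page)
		return NULL;
	page->reserve = !!(alloc_flags & ALLOC_NO_WATERMARKS);
	return page;
}

/* Caller-side check: did this allocation have to dip into the reserves? */
static bool toy_page_from_reserve(const struct toy_page *page)
{
	return page->reserve != 0;
}
```

The tag is rewritten on every allocation, which matches the intent stated in the changelog: it describes the allocation, not the page.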

^ permalink raw reply

* [PATCH 03/31] mm: expose gfp_to_alloc_flags()
From: Suresh Jayaraman @ 2009-10-01 14:05 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, linux-kernel, linux-mm
  Cc: netdev, Neil Brown, Miklos Szeredi, Wouter Verhelst,
	Peter Zijlstra, trond.myklebust, Suresh Jayaraman

From: Peter Zijlstra <a.p.zijlstra@chello.nl> 

Expose the gfp to alloc_flags mapping, so we can use it in other parts
of the vm.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de>
---
 mm/internal.h   |   15 +++++++++++++++
 mm/page_alloc.c |   16 +---------------
 2 files changed, 16 insertions(+), 15 deletions(-)

Index: mmotm/mm/internal.h
===================================================================
--- mmotm.orig/mm/internal.h
+++ mmotm/mm/internal.h
@@ -194,6 +194,21 @@ static inline struct page *mem_map_next(
 #define __paginginit __init
 #endif
 
+/* The ALLOC_WMARK bits are used as an index to zone->watermark */
+#define ALLOC_WMARK_MIN		WMARK_MIN
+#define ALLOC_WMARK_LOW		WMARK_LOW
+#define ALLOC_WMARK_HIGH	WMARK_HIGH
+#define ALLOC_NO_WATERMARKS	0x04 /* don't check watermarks at all */
+
+/* Mask to get the watermark bits */
+#define ALLOC_WMARK_MASK	(ALLOC_NO_WATERMARKS-1)
+
+#define ALLOC_HARDER		0x10 /* try to alloc harder */
+#define ALLOC_HIGH		0x20 /* __GFP_HIGH set */
+#define ALLOC_CPUSET		0x40 /* check for correct cpuset */
+
+int gfp_to_alloc_flags(gfp_t gfp_mask);
+
 /* Memory initialisation debug and verification */
 enum mminit_level {
 	MMINIT_WARNING,
Index: mmotm/mm/page_alloc.c
===================================================================
--- mmotm.orig/mm/page_alloc.c
+++ mmotm/mm/page_alloc.c
@@ -1190,19 +1190,6 @@ failed:
 	return NULL;
 }
 
-/* The ALLOC_WMARK bits are used as an index to zone->watermark */
-#define ALLOC_WMARK_MIN		WMARK_MIN
-#define ALLOC_WMARK_LOW		WMARK_LOW
-#define ALLOC_WMARK_HIGH	WMARK_HIGH
-#define ALLOC_NO_WATERMARKS	0x04 /* don't check watermarks at all */
-
-/* Mask to get the watermark bits */
-#define ALLOC_WMARK_MASK	(ALLOC_NO_WATERMARKS-1)
-
-#define ALLOC_HARDER		0x10 /* try to alloc harder */
-#define ALLOC_HIGH		0x20 /* __GFP_HIGH set */
-#define ALLOC_CPUSET		0x40 /* check for correct cpuset */
-
 #ifdef CONFIG_FAIL_PAGE_ALLOC
 
 static struct fail_page_alloc_attr {
@@ -1691,8 +1678,7 @@ void wake_all_kswapd(unsigned int order,
 		wakeup_kswapd(zone, order);
 }
 
-static inline int
-gfp_to_alloc_flags(gfp_t gfp_mask)
+int gfp_to_alloc_flags(gfp_t gfp_mask)
 {
 	struct task_struct *p = current;
 	int alloc_flags = ALLOC_WMARK_MIN | ALLOC_CPUSET;
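
To illustrate what the exported flags allow, here is a small userspace sketch. The `ALLOC_*` values are copied from the hunk above; the enum is assumed to mirror `enum zone_watermarks` from include/linux/mmzone.h, and `wmark_index()` and `may_use_reserves()` are hypothetical helpers that are not part of the patch.

```c
#include <stdbool.h>

/* Assumed to mirror enum zone_watermarks in include/linux/mmzone.h. */
enum zone_watermarks { WMARK_MIN, WMARK_LOW, WMARK_HIGH, NR_WMARK };

/* The ALLOC_WMARK bits are used as an index to zone->watermark */
#define ALLOC_WMARK_MIN		WMARK_MIN
#define ALLOC_WMARK_LOW		WMARK_LOW
#define ALLOC_WMARK_HIGH	WMARK_HIGH
#define ALLOC_NO_WATERMARKS	0x04 /* don't check watermarks at all */

/* Mask to get the watermark bits */
#define ALLOC_WMARK_MASK	(ALLOC_NO_WATERMARKS-1)

#define ALLOC_HARDER		0x10 /* try to alloc harder */
#define ALLOC_HIGH		0x20 /* __GFP_HIGH set */
#define ALLOC_CPUSET		0x40 /* check for correct cpuset */

/* Hypothetical helper: the low two bits select the watermark to check. */
static int wmark_index(int alloc_flags)
{
	return alloc_flags & ALLOC_WMARK_MASK;
}

/* Hypothetical helper: the test a slab allocator can make once the flags are visible. */
static bool may_use_reserves(int alloc_flags)
{
	return (alloc_flags & ALLOC_NO_WATERMARKS) != 0;
}
```

Because `ALLOC_NO_WATERMARKS` is 0x04, the mask works out to 0x03, so the watermark index and the "ignore watermarks" bit never overlap.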


^ permalink raw reply

* [PATCH 02/31] swap over network documentation
From: Suresh Jayaraman @ 2009-10-01 14:04 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, linux-kernel, linux-mm
  Cc: netdev, Neil Brown, Miklos Szeredi, Wouter Verhelst,
	Peter Zijlstra, trond.myklebust, Suresh Jayaraman

From: Neil Brown <neilb@suse.de>

Add a document describing the problem and the proposed solution.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de>
---
 Documentation/network-swap.txt |  270 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 270 insertions(+)

Index: mmotm/Documentation/network-swap.txt
===================================================================
--- /dev/null
+++ mmotm/Documentation/network-swap.txt
@@ -0,0 +1,270 @@
+
+Problem:
+   When Linux needs to allocate memory it may find that there is
+   insufficient free memory so it needs to reclaim space that is in
+   use but not needed at the moment.  There are several options:
+
+   1/ Shrink a kernel cache such as the inode or dentry cache.  This
+      is fairly easy but provides limited returns.
+   2/ Discard 'clean' pages from the page cache.  This is easy, and
+      works well as long as there are clean pages in the page cache.
+      Similarly clean 'anonymous' pages can be discarded - if there
+      are any.
+   3/ Write out some dirty page-cache pages so that they become clean.
+      The VM limits the number of dirty page-cache pages to e.g. 40%
+      of available memory so that (among other reasons) a "sync" will
+      not take excessively long.  So there should never be excessive
+      amounts of dirty pagecache.
+      Writing out dirty page-cache pages involves work by the
+      filesystem which may need to allocate memory itself.  To avoid
+      deadlock, filesystems use GFP_NOFS when allocating memory on the
+      write-out path.  When this is used, cleaning dirty page-cache
+      pages is not an option so if the filesystem finds that memory
+      is tight, another option must be found.
+   4/ Write out dirty anonymous pages to the "Swap" partition/file.
+      This is the most interesting for a couple of reasons.
+      a/ Unlike dirty page-cache pages, there is no need to write anon
+         pages out unless we are actually short of memory.  Thus they
+         tend to be left to last.
+      b/ Anon pages tend to be updated randomly and unpredictably, and
+         flushing them out of memory can have a very significant
+         performance impact on the process using them.  This contrasts
+         with page-cache pages which are often written sequentially
+         and often treated as "write-once, read-many".
+      So anon pages tend to be left until last to be cleaned, and may
+      be the only cleanable pages while there are still some dirty
+      page-cache pages (which are waiting on a GFP_NOFS allocation).
+
+[I don't find the above wholly satisfying.  There seems to be too much
+ hand-waving.  If someone can provide better text explaining why
+ swapout is a special case, that would be great.]
+
+So we need to be able to write to the swap file/partition without
+needing to allocate any memory ... or only a small well controlled
+amount.
+
+The VM reserves a small amount of memory that can only be allocated
+for use as part of the swap-out procedure.  It is only available to
+processes with the PF_MEMALLOC flag set, which is typically just the
+memory cleaner.
+
+Traditionally swap-out is performed directly to block devices (swap
+files on block-device filesystems are supported by examining the
+mapping from file offset to device offset in advance, and then using
+the device offsets to write directly to the device).  Block devices
+are (required to be) written to pre-allocate any memory that might be
+needed during write-out, and to block when the pre-allocated memory is
+exhausted and no other memory is available.  They can be sure not to
+block forever as the pre-allocated memory will be returned as soon as
+the data it is being used for has been written out.  The primary
+mechanism for pre-allocating memory is called "mempools".
+
+This approach does not work for writing anonymous pages
+(i.e. swapping) over a network, using e.g. NFS or NBD or iSCSI.
+
+
+The main reason that it does not work is that when data from an anon
+page is written to the network, we must wait for a reply to confirm
+the data is safe.  Receiving that reply will consume memory and,
+significantly, we need to allocate memory to an incoming packet before
+we can tell if it is the reply we are waiting for or not.
+
+The secondary reason is that the network code is not written to use
+mempools and in most cases does not need to use them.  Changing all
+allocations in the networking layer to use mempools would be quite
+intrusive, and would waste memory, and probably cause a slow-down in
+the common case of not swapping over the network.
+
+These problems are addressed by enhancing the system of memory
+reserves used by PF_MEMALLOC and requiring any in-kernel networking
+client that is used for swap-out to indicate which sockets are used
+for swapout so they can be handled specially in low memory situations.
+
+There are several major parts to this enhancement:
+
+1/ page->reserve, GFP_MEMALLOC
+
+  To handle low memory conditions we need to know when those
+  conditions exist.  Having a global "low on memory" flag seems easy,
+  but its implementation is problematic.  Instead we make it possible
+  to tell if a recent memory allocation required use of the emergency
+  memory pool.
+  For pages returned by alloc_page, the new page->reserve flag
+  can be tested.  If this is set, then a low memory condition was
+  current when the page was allocated, so the memory should be used
+  carefully. (Because low memory conditions are transient, this
+  state is kept in an overloaded member instead of in page flags, which
+  would suggest a more permanent state.)
+
+  For memory allocated using slab/slub: If a page that is added to a
+  kmem_cache is found to have page->reserve set, then a s->reserve
+  flag is set for the whole kmem_cache.  Further allocations will only
+  be returned from that page (or any other page in the cache) if they
+  are emergency allocations (i.e. PF_MEMALLOC or GFP_MEMALLOC is set).
+  Non-emergency allocations will block in alloc_page until a
+  non-reserve page is available.  Once a non-reserve page has been
+  added to the cache, the s->reserve flag on the cache is removed.
+
+  Because slab objects have no individual state, it is hard to pass
+  reserve state along; the current code relies on a regular alloc
+  failing.  Various allocation wrappers help here.
+
+  This allows us to
+   a/ request use of the emergency pool when allocating memory
+     (GFP_MEMALLOC), and
+   b/ to find out if the emergency pool was used.
+
+2/ SK_MEMALLOC, sk_buff->emergency.
+
+  When memory from the reserve is used to store incoming network
+  packets, the memory must be freed (and the packet dropped) as soon
+  as we find out that the packet is not for a socket that is used for
+  swap-out.
+  To achieve this we have an ->emergency flag for skbs, and an
+  SK_MEMALLOC flag for sockets.
+  When memory is allocated for an skb, it is allocated with
+  GFP_MEMALLOC (if we are currently swapping over the network at
+  all).  If a subsequent test shows that the emergency pool was used,
+  ->emergency is set.
+  When the skb is finally attached to its destination socket, the
+  SK_MEMALLOC flag on the socket is tested.  If the skb has
+  ->emergency set, but the socket does not have SK_MEMALLOC set, then
+  the skb is immediately freed and the packet is dropped.
+  This ensures that reserve memory is never queued on a socket that is
+  not used for swapout.
+
+  Similarly, if an skb is ever queued for delivery to user-space for
+  example by netfilter, the ->emergency flag is tested and the skb is
+  released if ->emergency is set. (so obviously the storage route may
+  not pass through a userspace helper, otherwise the packets will never
+  arrive and we'll deadlock)
+
+  This ensures that memory from the emergency reserve can be used to
+  allow swapout to proceed, but will not get caught up in any other
+  network queue.
+
+
+3/ pages_emergency
+
+  The above would be sufficient if the total memory below the lowest
+  memory watermark (i.e. the size of the emergency reserve) were known
+  to be enough to hold all transient allocations needed for writeout.
+  I'm a little blurry on how big the current emergency pool is, but it
+  isn't big and certainly hasn't been sized to allow network traffic
+  to consume any.
+
+  We could simply make the size of the reserve bigger. However in the
+  common case that we are not swapping over the network, that would be
+  a waste of memory.
+
+  So a new "watermark" is defined: pages_emergency.  This is
+  effectively added to the current low water marks, so that pages from
+  this emergency pool can only be allocated if one of PF_MEMALLOC or
+  GFP_MEMALLOC is set.
+
+  pages_emergency can be changed dynamically based on need.  When
+  swapout over the network is required, pages_emergency is increased
+  to cover the maximum expected load.  When network swapout is
+  disabled, pages_emergency is decreased.
+
+  To determine how much to increase it by, we introduce reservation
+  groups....
+
+3a/ reservation groups
+
+  The memory used transiently for swapout can be in a number of
+  different places.  e.g. the network route cache, the network
+  fragment cache, in transit between network card and socket, or (in
+  the case of NFS) in sunrpc data structures awaiting a reply.
+  We need to ensure each of these is limited in the amount of memory
+  they use, and that the maximum is included in the reserve.
+
+  The memory required by the network layer only needs to be reserved
+  once, even if there are multiple swapout paths using the network
+  (e.g. NFS and NBD and iSCSI, though using all three for swapout at
+  the same time would be unusual).
+
+  So we create a tree of reservation groups.  The network might
+  register a collection of reservations, but not mark them as being in
+  use.  NFS and sunrpc might similarly register a collection of
+  reservations, and attach it to the network reservations as it
+  depends on them.
+  When swapout over NFS is requested, the NFS/sunrpc reservations are
+  activated which implicitly activates the network reservations.
+
+  The total new reservation is added to pages_emergency.
+
+  Provided each memory usage stays beneath the registered limit (at
+  least when allocating memory from reserves), the system will never
+  run out of emergency memory, and swapout will not deadlock.
+
+  It is worth noting here that it is not critical that each usage
+  stays beneath the limit 100% of the time.  Occasional excess is
+  acceptable provided that the memory will be freed again within a
+  short amount of time that does *not* require waiting for any event
+  that itself might require memory.
+  This is because, at all stages of transmit and receive, it is
+  acceptable to discard all transient memory associated with a
+  particular writeout and try again later.  On transmit, the page can
+  be re-queued for later transmission.  On receive, the packet can be
+  dropped assuming that the peer will resend after a timeout.
+
+  Thus allocations that are truly transient and will be freed without
+  blocking do not strictly need to be reserved for.  Doing so might
+  still be a good idea to ensure forward progress doesn't take too
+  long.
+
+4/ low-mem accounting
+
+  Most places that might hold on to emergency memory (e.g. route
+  cache, fragment cache etc) already place a limit on the amount of
+  memory that they can use.  This limit can simply be reserved using
+  the above mechanism and no more needs to be done.
+
+  However some memory usage might not be accounted with sufficient
+  firmness to allow an appropriate emergency reservation.  The
+  in-flight skbs for incoming packets are one such example.
+
+  To support this, a low-overhead mechanism for accounting memory
+  usage against the reserves is provided.  This mechanism uses the
+  same data structure that is used to store the emergency memory
+  reservations through the addition of a 'usage' field.
+
+  Before we attempt allocation from the memory reserves, we must check
+  if the resulting 'usage' is below the reservation. If so, we increase
+  the usage and attempt the allocation (which should succeed). If
+  the projected 'usage' exceeds the reservation we'll either fail the
+  allocation, or wait for 'usage' to decrease enough so that it would
+  succeed, depending on __GFP_WAIT.
+
+  When memory that was allocated for that purpose is freed, the
+  'usage' field is checked again.  If it is non-zero, then the size of
+  the freed memory is subtracted from the usage, making sure the usage
+  never becomes less than zero.
+
+  This provides adequate accounting with minimal overheads when not in
+  a low memory condition.  When a low memory condition is encountered
+  it does add the cost of a spin lock necessary to serialise updates
+  to 'usage'.
+
+
+
+5/ swapon/swapoff/swap_out/swap_in
+
+  So that a filesystem (e.g. NFS) can know when to set SK_MEMALLOC on
+  any network socket that it uses, and can know when to account
+  reserve memory carefully, new address_space_operations are
+  available.
+  "swapon" requests that an address space (i.e. a file) be made ready
+  for swapout.  swap_out and swap_in request the actual IO.  They
+  together must ensure that each swap_out request can succeed without
+  allocating more emergency memory than was reserved by swapon. swapoff
+  is used to reverse the state changes caused by swapon when we disable
+  the swap file.
+
+
+Thanks for reading this far.  I hope it made sense :-)
+
+Neil Brown (with updates from Peter Zijlstra)
+
+
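
As a reading aid for section 2/ of the document, the drop rule can be sketched in a few lines of userspace C. This is a toy model under stated assumptions: `toy_skb`, `toy_sock` and `toy_sock_queue()` are hypothetical names, and the real series implements the check inside the socket receive paths rather than in one helper.

```c
#include <stdbool.h>

/* Hypothetical stand-in for an skb: only the ->emergency flag is modelled. */
struct toy_skb {
	bool emergency;		/* data came out of the emergency reserve */
};

/* Hypothetical stand-in for a socket: memalloc models the SK_MEMALLOC flag. */
struct toy_sock {
	bool memalloc;		/* socket is used for swap-out */
	int queued;		/* packets accepted */
	int dropped;		/* packets freed immediately */
};

/*
 * Returns true if the packet was queued, false if it was dropped.
 * Reserve memory may only be queued on sockets that serve swap-out.
 */
static bool toy_sock_queue(struct toy_sock *sk, const struct toy_skb *skb)
{
	if (skb->emergency && !sk->memalloc) {
		sk->dropped++;
		return false;
	}
	sk->queued++;
	return true;
}
```

Ordinary packets are unaffected on either kind of socket; only the combination of an emergency skb and a non-swap socket leads to a drop.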

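The charge/uncharge rule from section 4/ of the document can be sketched as follows. This is a single-threaded toy: `toy_reserve`, `toy_charge()` and `toy_uncharge()` are hypothetical names, the spin lock the document mentions is left out, only the failing (non-waiting) case is modelled, and treating a charge that lands exactly on the limit as allowed is my assumption.

```c
#include <stdbool.h>

/* Hypothetical stand-in for a reservation with the added 'usage' field. */
struct toy_reserve {
	long limit;	/* bytes reserved for this user */
	long usage;	/* bytes currently charged against the reserve */
};

/*
 * Charge 'size' bytes before allocating from the reserves.
 * Fails without changing 'usage' if the projected usage exceeds the limit.
 */
static bool toy_charge(struct toy_reserve *res, long size)
{
	if (res->usage + size > res->limit)
		return false;
	res->usage += size;
	return true;
}

/*
 * Uncharge on free.  Nothing to do when usage is already zero, and the
 * result is clamped so that usage never becomes negative.
 */
static void toy_uncharge(struct toy_reserve *res, long size)
{
	if (res->usage == 0)
		return;
	res->usage -= size;
	if (res->usage < 0)
		res->usage = 0;
}
```

The clamp matters because memory allocated outside a low-memory period was never charged, yet may be freed while 'usage' is non-zero.
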
^ permalink raw reply

* Re: [PATCH] net: fix NOHZ: local_softirq_pending 08
From: Michael Buesch @ 2009-10-01 14:04 UTC (permalink / raw)
  To: David Miller
  Cc: oliver, johannes, kalle.valo, linville, linux-wireless, netdev
In-Reply-To: <20090930.163333.234658158.davem@davemloft.net>

On Thursday 01 October 2009 01:33:33 David Miller wrote:

> I'm not applying this until all of these details are sorted out 

John, please apply my fix to wireless-testing to get rid of the regression.
You can revert it later, if there's a better fix available.

-- 
Greetings, Michael.

^ permalink raw reply

* [PATCH 01/31] mm: serialize access to min_free_kbytes
From: Suresh Jayaraman @ 2009-10-01 14:04 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, linux-kernel, linux-mm
  Cc: netdev, Neil Brown, Miklos Szeredi, Wouter Verhelst,
	Peter Zijlstra, trond.myklebust, Suresh Jayaraman

From: Peter Zijlstra <a.p.zijlstra@chello.nl> 

There is a small race between the procfs caller and the memory hotplug caller
of setup_per_zone_wmarks(). Not a big deal, but the next patch will add yet
another caller. Time to close the gap.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de>
---
 mm/page_alloc.c |   16 +++++++++++++---
 1 file changed, 13 insertions(+), 3 deletions(-)

Index: mmotm/mm/page_alloc.c
===================================================================
--- mmotm.orig/mm/page_alloc.c
+++ mmotm/mm/page_alloc.c
@@ -121,6 +121,7 @@ static char * const zone_names[MAX_NR_ZO
 	 "Movable",
 };
 
+static DEFINE_SPINLOCK(min_free_lock);
 int min_free_kbytes = 1024;
 
 unsigned long __meminitdata nr_kernel_pages;
@@ -4448,13 +4449,13 @@ static void setup_per_zone_lowmem_reserv
 }
 
 /**
- * setup_per_zone_wmarks - called when min_free_kbytes changes
+ * __setup_per_zone_wmarks - called when min_free_kbytes changes
  * or when memory is hot-{added|removed}
  *
  * Ensures that the watermark[min,low,high] values for each zone are set
  * correctly with respect to min_free_kbytes.
  */
-void setup_per_zone_wmarks(void)
+static void __setup_per_zone_wmarks(void)
 {
 	unsigned long pages_min = min_free_kbytes >> (PAGE_SHIFT - 10);
 	unsigned long lowmem_pages = 0;
@@ -4552,6 +4553,15 @@ static void __init setup_per_zone_inacti
 		calculate_zone_inactive_ratio(zone);
 }
 
+void setup_per_zone_wmarks(void)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&min_free_lock, flags);
+	__setup_per_zone_wmarks();
+	spin_unlock_irqrestore(&min_free_lock, flags);
+}
+
 /*
  * Initialise min_free_kbytes.
  *
@@ -4587,7 +4597,7 @@ static int __init init_per_zone_wmark_mi
 		min_free_kbytes = 128;
 	if (min_free_kbytes > 65536)
 		min_free_kbytes = 65536;
-	setup_per_zone_wmarks();
+	__setup_per_zone_wmarks();
 	setup_per_zone_lowmem_reserve();
 	setup_per_zone_inactive_ratio();
 	return 0;
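
The locked-wrapper pattern used here can be tried out in userspace. The sketch below is only an analogy: a C11 `atomic_flag` stands in for `min_free_lock`, all `toy_*` names are hypothetical, a PAGE_SHIFT of 12 is assumed, and the kernel's `__` prefix is replaced by an `_unlocked` suffix because double-underscore names are reserved in userspace C.

```c
#include <stdatomic.h>

static atomic_flag toy_lock = ATOMIC_FLAG_INIT;	/* stand-in for min_free_lock */
static int toy_min_free_kbytes = 1024;
static unsigned long toy_pages_min;

/*
 * Unlocked worker.  The caller must either hold toy_lock or run where no
 * concurrent caller can exist, as the __init path in the patch does.
 */
static void toy_setup_wmarks_unlocked(void)
{
	/* kbytes to pages, assuming 4 KiB pages (PAGE_SHIFT == 12) */
	toy_pages_min = (unsigned long)toy_min_free_kbytes >> (12 - 10);
}

/* Public entry point: serialises all runtime callers around the worker. */
void toy_setup_wmarks(void)
{
	while (atomic_flag_test_and_set_explicit(&toy_lock, memory_order_acquire))
		;	/* spin until the flag is released */
	toy_setup_wmarks_unlocked();
	atomic_flag_clear_explicit(&toy_lock, memory_order_release);
}
```

Keeping the worker separate lets the early-init caller skip the lock, while procfs and hotplug style callers go through the wrapper.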

^ permalink raw reply

* [PATCH 00/31] Swap over NFS -v20
From: Suresh Jayaraman @ 2009-10-01 14:04 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, linux-kernel, linux-mm
  Cc: netdev, Neil Brown, Miklos Szeredi, Wouter Verhelst,
	Peter Zijlstra, trond.myklebust

Hi,

Here's the latest version of the swap over NFS series, the first since -v19
by Peter Zijlstra last October. Peter does not have time to pursue this further
(though he has not lost interest), so I have taken over the patchset and will
try to get it merged upstream.

The patches are against the current mmotm. The series does not support SLQB yet.
These patches can also be found online here:
	http://www.suse.de/~sjayaraman/patches/swap-over-nfs/

The swap over NFS patches have been shipping with openSUSE 11.1 and SLE 11 (with
CONFIG_NFS_SWAP enabled by default) for several months now. No bugs caused by
these patches have been reported so far, and they have proven stable.

Changes since -v19:
 - rebased patches against current -mm
 - adapted changes pertaining to using zone->watermarks array
 - dropped cleanup patches/fixes that have already made it upstream
 - dropped the patch that removes nfs mempools
 - fixed racy nature of sync_page in swap_sync_page (NeilBrown)
 - fixed use of uninitialized variable in cache_grow() (Miklos Szeredi)
 - fixed a bug in bnx2 driver (Jiri Bohac)
 - fixed null-pointer dereferences in swapfile code path when s_bdev is NULL

Thanks,
Suresh Jayaraman

--

Peter Zijlstra (26)
 mm: serialize access to min_free_kbytes
 mm: expose gfp_to_alloc_flags()
 mm: tag reserve pages
 mm: sl[au]b: add knowledge of reserve pages
 mm: kmem_alloc_estimate()
 mm: allow PF_MEMALLOC from softirq context
 mm: emergency pool
 mm: system wide ALLOC_NO_WATERMARK
 mm: __GFP_MEMALLOC
 mm: memory reserve management
 mm: add support for non block device backed swap files
 mm: methods for teaching filesystems about PG_swapcache pages
 net: packet split receive api
 net: sk_allocation() - concentrate socket related allocations
 selinux: tag avc cache alloc as non-critical
 netvm: network reserve infrastructure
 netvm: INET reserves
 netvm: hook skb allocation to reserves
 netvm: filter emergency skbs
 netvm: prevent a stream specific deadlock
 netvm: skb processing
 netfilter: NF_QUEUE vs emergency skbs
 nfs: teach the NFS client how to treat PG_swapcache pages
 nfs: disable data cache revalidation for swapfiles
 nfs: enable swap on NFS
 nfs: fix various memory recursions possible with swap over NFS

Jeff Mahoney (1)
 Fix initialization of ipv4_route_lock

Neil Brown (2)
 swap over network documentation
 Cope with racy nature of sync_page in swap_sync_page

Miklos Szeredi (1)
 Fix use of uninitialized variable in cache_grow()

Suresh Jayaraman (1)
 swapfile: avoid NULL pointer dereference in swapon when s_bdev is NULL


 fs/nfs/file.c                           |   18 
 fs/nfs/pagelist.c                       |    2 
 fs/nfs/write.c                          |   99 ++++
 include/linux/mm_types.h                |    1 
 include/linux/skbuff.h                  |   28 +
 include/linux/slab.h                    |   19 
 include/net/sock.h                      |   55 ++
 mm/page_alloc.c                         |  120 ++++--
 mm/page_io.c                            |    2 
 mm/slab.c                               |   80 +++-
 mm/slob.c                               |   67 +++
 mm/slub.c                               |   89 ++++
 mm/swapfile.c                           |   53 ++
 Documentation/filesystems/Locking	 |   22 +
 Documentation/filesystems/vfs.txt	 |   18 
 Documentation/network-swap.txt		 |  270 +++++++++++++
 drivers/net/bnx2.c               	 |    9 
 drivers/net/e1000e/netdev.c      	 |    7 
 drivers/net/igb/igb_main.c        	 |    9 
 drivers/net/ixgbe/ixgbe_main.c    	 |   14 
 drivers/net/sky2.c                	 |   16 
 fs/nfs/Kconfig                    	 |   10 
 fs/nfs/file.c                     	 |    6 
 fs/nfs/inode.c                    	 |    6 
 fs/nfs/internal.h                  	 |    7 
 fs/nfs/pagelist.c                 	 |    6 
 fs/nfs/read.c                     	 |    6 
 fs/nfs/write.c                    	 |   53 +-
 include/linux/buffer_head.h       	 |    1 
 include/linux/fs.h                	 |    9 
 include/linux/gfp.h               	 |    3 
 include/linux/mm.h                	 |   25 +
 include/linux/mm_types.h          	 |    1 
 include/linux/mmzone.h            	 |    3 
 include/linux/nfs_fs.h            	 |    2 
 include/linux/pagemap.h           	 |    5 
 include/linux/reserve.h           	 |  198 +++++++++
 include/linux/sched.h             	 |    7 
 include/linux/skbuff.h            	 |    3 
 include/linux/slab.h              	 |    4 
 include/linux/slub_def.h          	 |    1 
 include/linux/sunrpc/xprt.h       	 |    5 
 include/linux/swap.h              	 |    4 
 include/net/inet_frag.h           	 |    7 
 include/net/netns/ipv6.h          	 |    4 
 include/net/sock.h                	 |    5 
 kernel/softirq.c                  	 |    3 
 mm/Makefile                       	 |    2 
 mm/internal.h                     	 |   15 
 mm/page_alloc.c                   	 |   16 
 mm/page_io.c                      	 |   51 ++
 mm/reserve.c                      	 |  637 ++++++++++++++++++++++++++++++++
 mm/slab.c                         	 |   61 ++-
 mm/slob.c                         	 |   16 
 mm/slub.c                          	 |   43 +-
 mm/swap_state.c                   	 |    4 
 mm/swapfile.c                     	 |   30 +
 mm/vmstat.c                       	 |    6 
 net/Kconfig                       	 |    3 
 net/core/dev.c                    	 |   57 ++
 net/core/filter.c                 	 |    3 
 net/core/skbuff.c                 	 |  137 +++++-
 net/core/sock.c                   	 |  107 +++++
 net/ipv4/inet_fragment.c          	 |    3 
 net/ipv4/ip_fragment.c            	 |   86 ++++
 net/ipv4/route.c                  	 |   70 +++
 net/ipv4/tcp.c                    	 |    3 
 net/ipv4/tcp_input.c              	 |   12 
 net/ipv4/tcp_output.c             	 |   12 
 net/ipv6/reassembly.c             	 |   85 ++++
 net/ipv6/route.c                  	 |   77 +++
 net/ipv6/tcp_ipv6.c               	 |   15 
 net/netfilter/core.c              	 |    3 
 net/sctp/ulpevent.c               	 |    2 
 net/sunrpc/Kconfig                	 |    5 
 net/sunrpc/sched.c                	 |    9 
 net/sunrpc/xprtsock.c             	 |   68 +++
 security/selinux/avc.c            	 |    2 
 net/core/sock.c                         |   18 
 net/ipv4/route.c                        |    2 

 80 files changed, 2797 insertions(+), 245 deletions(-)


^ permalink raw reply

* Re: r8169c: Support for Realtek 8168DP chip?
From: David Dillow @ 2009-10-01 13:38 UTC (permalink / raw)
  To: Rainer Koenig; +Cc: netdev
In-Reply-To: <4AC494EC.8050405@ts.fujitsu.com>

On Thu, 2009-10-01 at 13:39 +0200, Rainer Koenig wrote:
> The reason why is easy to decode when looking at the source: The
> TxConfig register returns 2b800000 and there is no MAC_VERSION in the
> list of valid versions. That means no PHY initialization code is
> executed, so there is no working device. :-(

Francois Romieu posted a patch yesterday (today, his time) to the thread
"r8169 chips on some Intel D945GSEJT boards fail to work after PXE boot"

It looks to add MAC support for your card; you should be able to find it
at any of your favorite mail archives, Google, or better yet,
http://patchwork.ozlabs.org/project/netdev/list/

Hmm, patchwork doesn't seem to have picked it up, yet.

Please test that and let us know how it works.
Dave


^ permalink raw reply

* Re: [PATCH] TI DaVinci EMAC: Minor macro related updates
From: Sergei Shtylyov @ 2009-10-01 12:11 UTC (permalink / raw)
  To: Chaithrika U S; +Cc: netdev, davinci-linux-open-source, davem
In-Reply-To: <1254428719-13960-1-git-send-email-chaithrika@ti.com>

Hello.

Chaithrika U S wrote:

> Use BIT for macro definitions wherever possible, remove
> unused and redundant macros.
> 
> Signed-off-by: Chaithrika U S <chaithrika@ti.com>
[...]
> diff --git a/drivers/net/davinci_emac.c b/drivers/net/davinci_emac.c
> index 65a2d0b..a421ec0 100644
> --- a/drivers/net/davinci_emac.c
> +++ b/drivers/net/davinci_emac.c
> @@ -164,16 +164,14 @@ static const char emac_version_string[] = "TI DaVinci EMAC Linux v6.1";
>  # define EMAC_MBP_MCASTCHAN(ch)		((ch) & 0x7)
>  
>  /* EMAC mac_control register */
> -#define EMAC_MACCONTROL_TXPTYPE		(0x200)
> -#define EMAC_MACCONTROL_TXPACEEN	(0x40)
> -#define EMAC_MACCONTROL_MIIEN		(0x20)
> -#define EMAC_MACCONTROL_GIGABITEN	(0x80)
> -#define EMAC_MACCONTROL_GIGABITEN_SHIFT (7)
> -#define EMAC_MACCONTROL_FULLDUPLEXEN	(0x1)
> +#define EMAC_MACCONTROL_TXPTYPE		BIT(9)
> +#define EMAC_MACCONTROL_TXPACEEN	BIT(6)
> +#define EMAC_MACCONTROL_GMIIEN		BIT(5)
> +#define EMAC_MACCONTROL_GIGABITEN	BIT(7)
> +#define EMAC_MACCONTROL_FULLDUPLEXEN	BIT(0)
>  #define EMAC_MACCONTROL_RMIISPEED_MASK	BIT(15)

    Can we have these properly sorted by value, while you're at it?

>  
>  /* GIGABIT MODE related bits */
> -#define EMAC_DM646X_MACCONTORL_GMIIEN	BIT(5)
>  #define EMAC_DM646X_MACCONTORL_GIG	BIT(7)
>  #define EMAC_DM646X_MACCONTORL_GIGFORCE	BIT(17)
>  
> @@ -192,10 +190,10 @@ static const char emac_version_string[] = "TI DaVinci EMAC Linux v6.1";
>  #define EMAC_RX_BUFFER_OFFSET_MASK	(0xFFFF)
>  
>  /* MAC_IN_VECTOR (0x180) register bit fields */
> -#define EMAC_DM644X_MAC_IN_VECTOR_HOST_INT	      (0x20000)
> -#define EMAC_DM644X_MAC_IN_VECTOR_STATPEND_INT	      (0x10000)
> -#define EMAC_DM644X_MAC_IN_VECTOR_RX_INT_VEC	      (0x0100)
> -#define EMAC_DM644X_MAC_IN_VECTOR_TX_INT_VEC	      (0x01)
> +#define EMAC_DM644X_MAC_IN_VECTOR_HOST_INT	BIT(17)
> +#define EMAC_DM644X_MAC_IN_VECTOR_STATPEND_INT	BIT(16)
> +#define EMAC_DM644X_MAC_IN_VECTOR_RX_INT_VEC	BIT(8)
> +#define EMAC_DM644X_MAC_IN_VECTOR_TX_INT_VEC	BIT(0)
>  
>  /** NOTE:: For DM646x the IN_VECTOR has changed */
>  #define EMAC_DM646X_MAC_IN_VECTOR_RX_INT_VEC	BIT(EMAC_DEF_RX_CH)
> @@ -203,7 +201,6 @@ static const char emac_version_string[] = "TI DaVinci EMAC Linux v6.1";
>  #define EMAC_DM646X_MAC_IN_VECTOR_HOST_INT	BIT(26)
>  #define EMAC_DM646X_MAC_IN_VECTOR_STATPEND_INT	BIT(27)
>  
> -
>  /* CPPI bit positions */
>  #define EMAC_CPPI_SOP_BIT		BIT(31)
>  #define EMAC_CPPI_EOP_BIT		BIT(30)
> @@ -747,8 +744,7 @@ static void emac_update_phystatus(struct emac_priv *priv)
>  
>  	if (priv->speed == SPEED_1000 && (priv->version == EMAC_VERSION_2)) {
>  		mac_control = emac_read(EMAC_MACCONTROL);
> -		mac_control |= (EMAC_DM646X_MACCONTORL_GMIIEN |
> -				EMAC_DM646X_MACCONTORL_GIG |
> +		mac_control |= (EMAC_DM646X_MACCONTORL_GIG |
>  				EMAC_DM646X_MACCONTORL_GIGFORCE);
>  	} else {
>  		/* Clear the GIG bit and GIGFORCE bit */
> @@ -2105,7 +2101,7 @@ static int emac_hw_enable(struct emac_priv *priv)
>  
>  	/* Enable MII */
>  	val = emac_read(EMAC_MACCONTROL);
> -	val |= (EMAC_MACCONTROL_MIIEN);
> +	val |= (EMAC_MACCONTROL_GMIIEN);

    Parens not needed.

>  	emac_write(EMAC_MACCONTROL, val);
>  
>  	/* Enable NAPI and interrupts */

WBR, Sergei


* r8169c: Support for Realtek 8168DP chip?
From: Rainer Koenig @ 2009-10-01 11:39 UTC (permalink / raw)
  To: netdev

Hi there,

I got several new workstation models that come with the Realtek 8168DP
chip (8168 with DASH capabilities).
When trying to use this chip with the r8169 driver module I get the
following errors:

localhost kernel: r8169 0000:05:00.0: unknown MAC (2b800600)
localhost kernel: eth0: RTL8169 at 0xffffc2000004c000, 9:65:d3:9f, XID 28800000 IRQ 138
localhost kernel: eth0: PHY reset failed.
localhost kernel: r8169: eth0: TBI auto-negotiating
localhost kernel: r8169: eth0: unknown chipset (mac_version = 1).
localhost kernel: r8169: eth0: link down
localhost kernel: ADDRCONF(NETDEV_UP): eth0: link is not ready

The distribution running is RHEL 5.4, but actually that doesn't matter
since I didn't see the necessary code lines even in the latest blob from
git.kernel.org.

The reason is easy to see when looking at the source: the TxConfig
register returns 2b800000 and there is no matching MAC_VERSION in the
list of valid versions. That means no PHY initialization code is
executed and, full stop, no working device. :-(
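
To illustrate the failure mode described above, here is a minimal
sketch of a table-driven lookup that maps a masked TxConfig value to a
MAC version and falls back to a catch-all entry when nothing matches.
Everything in it is an assumption for illustration: the names
(lookup_mac_version, struct mac_info, the MAC_VER_* values), the mask
0x7c800000 and the table entries are hypothetical and are not copied
from r8169 or from the Realtek driver. Only the input value 0x2b800600
comes from the log above.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical version IDs; not the driver's real enum. */
enum mac_version {
	MAC_VER_UNKNOWN    = 1,	/* catch-all, like "mac_version = 1" in the log */
	MAC_VER_EXAMPLE_A  = 2,
	MAC_VER_EXAMPLE_DP = 3,
};

struct mac_info {
	uint32_t mask;		/* bits of TxConfig that identify the chip */
	uint32_t val;		/* expected value after masking */
	enum mac_version ver;
};

/* Assumed mask, chosen for this sketch only. */
#define FAMILY_MASK 0x7c800000u

/* Return the first matching version, or the catch-all if none matches. */
static enum mac_version lookup_mac_version(const struct mac_info *tbl,
					   size_t n, uint32_t txconfig)
{
	for (size_t i = 0; i < n; i++)
		if ((txconfig & tbl[i].mask) == tbl[i].val)
			return tbl[i].ver;
	return MAC_VER_UNKNOWN;
}

/* A table with no entry for the new chip (made-up values). */
static const struct mac_info table_without_dp[] = {
	{ FAMILY_MASK, 0x3c000000u, MAC_VER_EXAMPLE_A },
};

/* The same table with one hypothetical entry added for the new chip. */
static const struct mac_info table_with_dp[] = {
	{ FAMILY_MASK, 0x3c000000u, MAC_VER_EXAMPLE_A },
	{ FAMILY_MASK, 0x28800000u, MAC_VER_EXAMPLE_DP },
};
```

With the first table, lookup_mac_version(table_without_dp, 1,
0x2b800600) ends up in the catch-all, so no chip-specific PHY setup
would be selected. With the second table the same value resolves to the
added entry. Whether a single table entry plus a PHY init routine is
all that is really needed for the 8168DP is exactly the open question.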

The latest OEM download from Realtek
http://218.210.127.131/downloads/RedirectFTPSite.aspx?SiteID=1&DownTypeID=3&DownID=332&PFid=5&Conn=4
compiles and works. Looking at the source of this driver it shows code
for this TxConfig value and it has a special part for the PHY
initialization.

So the questions are:
- Will there be a patch for the 8168DP chip in the r8169 driver soon?
- What is necessary to get a patch?
- Are the maintainers of r8169 talking to the people that do the OEM
  driver or is r8169 just a reverse engineered driver?

Best regards
Rainer
-- 
Dipl.-Inf. (FH) Rainer Koenig
Project Manager Linux Business Clients
Dept. TSP CLI E SW OSE

Fujitsu Technology Solutions
Bürgermeister-Ullrich-Str. 100
86199 Augsburg
Germany

Telephone: +49-821-804-3321
Telefax:   +49-821-804-2131
Mail:      mailto:Rainer.Koenig@ts.fujitsu.com

Internet         ts.fujtsu.com
Company Details  ts.fujitsu.com/imprint.html
