Netdev List

Netdev List
 help / color / mirror / Atom feed

* How does SACK or FACK determine the time to start fast retransmition?
From: LovelyLich @ 2012-06-20 14:31 UTC (permalink / raw)
  To: netdev; +Cc: linux-kernel

HI all,
    When tcp uses reno as its congestion control algothim, it uses
tp->sacked_out as dup-ack. When the third dup-ack(under default
condition) comes, tcp will initiate its fast retransmition.
    But how about sack ?
    According to kernel source code comments, when sack or fack tcp option
is enabled, there is no dup-ack counter. See comments for function
tcp_dupack_heuristics():
http://lxr.linux.no/linux+v2.6.37/net/ipv4/tcp_input.c#L2300
    So , how does tcp know the current dup-ack is the last one which
triggers the fast retransmition?

    According to rfc3517 section 5:
    "Upon the receipt of the first (DupThresh - 1) duplicate ACKs, the
scoreboard is to be updated as normal."
    "When a TCP sender receives the duplicate ACK corresponding to
DupThresh ACKs,
the scoreboard MUST be updated with the new SACK information (via
Update ()). If no previous loss event has occurred
on the connection or the cumulative acknowledgment point is beyond
the last value of RecoveryPoint, a loss recovery phase SHOULD be
initiated, per the fast retransmit algorithm outlined in [RFC2581]."

    But these sentences doesn't describe how tcp knows the current ack
is the dup-threshold dup-ack.

    Accorrding to rfc3517 seciton 4 and isLost(Seqnum) function:
    "The routine returns true when either
DupThresh discontiguous SACKed sequences have arrived above
’SeqNum’ or (DupThresh * SMSS) bytes with sequence numbers greater
than ’SeqNum’ have been SACKed. Otherwise, the routine returns
false."
    I think this is just what I am searching for, but I still don't know
which line of code in Linux tcp protocol does this check.
    Can any one help me ? thks in advance.

^ permalink raw reply

* Re: [PATCH 01/12] netvm: Prevent a stream-specific deadlock
From: Mel Gorman @ 2012-06-20 14:21 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux-MM, Linux-Netdev, Linux-NFS, LKML, David Miller,
	Trond Myklebust, Neil Brown, Christoph Hellwig, Peter Zijlstra,
	Mike Christie, Eric B Munson
In-Reply-To: <1340185081-22525-2-git-send-email-mgorman@suse.de>

On Wed, Jun 20, 2012 at 10:37:50AM +0100, Mel Gorman wrote:
> It could happen that all !SOCK_MEMALLOC sockets have buffered so
> much data that we're over the global rmem limit. This will prevent
> SOCK_MEMALLOC buffers from receiving data, which will prevent userspace
> from running, which is needed to reduce the buffered data.
> 
> Fix this by exempting the SOCK_MEMALLOC sockets from the rmem limit.
> Once this change it applied, it is important that sockets that set
> SOCK_MEMALLOC do not clear the flag until the socket is being torn down.
> If this happens, a warning is generated and the tokens reclaimed to
> avoid accounting errors until the bug is fixed.
> 
> [davem@davemloft.net: Warning about clearing SOCK_MEMALLOC]
> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> Signed-off-by: Mel Gorman <mgorman@suse.de>
> Acked-by: David S. Miller <davem@davemloft.net>

This patch introduced a new warning that I had previously missed. I'll
fix it up when rebasing this series on top of linux-next.

-- 
Mel Gorman
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Re: [PATCH 07/17] net: Introduce sk_gfp_atomic() to allow addition of GFP flags depending on the individual socket
From: Mel Gorman @ 2012-06-20 14:20 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Andrew Morton, Linux-MM, Linux-Netdev, LKML, David Miller,
	Neil Brown, Peter Zijlstra, Mike Christie, Eric B Munson,
	Sebastian Andrzej Siewior
In-Reply-To: <1340193999.4604.867.camel@edumazet-glaptop>

On Wed, Jun 20, 2012 at 02:06:39PM +0200, Eric Dumazet wrote:
> On Wed, 2012-06-20 at 12:44 +0100, Mel Gorman wrote:
> > Introduce sk_gfp_atomic(), this function allows to inject sock specific
> > flags to each sock related allocation. It is only used on allocation
> > paths that may be required for writing pages back to network storage.
> > 
> > [davem@davemloft.net: Use sk_gfp_atomic only when necessary]
> > Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> > Signed-off-by: Mel Gorman <mgorman@suse.de>
> > Acked-by: David S. Miller <davem@davemloft.net>
> > ---
> >  include/net/sock.h    |    5 +++++
> >  net/ipv4/tcp_output.c |    9 +++++----
> >  net/ipv6/tcp_ipv6.c   |    8 +++++---
> >  3 files changed, 15 insertions(+), 7 deletions(-)
> > 
> 
> > diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> > index 803cbfe..440b47e 100644
> > --- a/net/ipv4/tcp_output.c
> > +++ b/net/ipv4/tcp_output.c
> > @@ -2461,7 +2461,8 @@ struct sk_buff *tcp_make_synack(struct sock *sk, struct dst_entry *dst,
> >  
> >  	if (cvp != NULL && cvp->s_data_constant && cvp->s_data_desired)
> >  		s_data_desired = cvp->s_data_desired;
> > -	skb = sock_wmalloc(sk, MAX_TCP_HEADER + 15 + s_data_desired, 1, GFP_ATOMIC);
> > +	skb = sock_wmalloc(sk, MAX_TCP_HEADER + 15 + s_data_desired, 1,
> > +			   sk_gfp_atomic(sk, GFP_ATOMIC));
> >  	if (skb == NULL)
> >  
> 
> This bit no longer applies on net-next, sock_wmalloc() was changed to a
> mere alloc_skb()
> 

Thanks, I'll rebase and retest.

-- 
Mel Gorman
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Re: [PATCH 08/17] net: Do not coalesce skbs belonging to PFMEMALLOC sockets
From: Mel Gorman @ 2012-06-20 14:18 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Andrew Morton, Linux-MM, Linux-Netdev, LKML, David Miller,
	Neil Brown, Peter Zijlstra, Mike Christie, Eric B Munson,
	Sebastian Andrzej Siewior
In-Reply-To: <1340200312.4604.1008.camel@edumazet-glaptop>

On Wed, Jun 20, 2012 at 03:51:52PM +0200, Eric Dumazet wrote:
> On Wed, 2012-06-20 at 14:36 +0100, Mel Gorman wrote:
> > The intention was to avoid any coalescing in the input path due to avoid
> > packets that "were held back due to TCP_CORK or attempt at coalescing
> > tiny packet". I recognise that it is clumsy and will take the approach
> > instead of having __tcp_push_pending_frames() use sk_gfp_atomic() in the
> > output path.
> 
> But coalescing in input path needs no additional memory allocation, it
> can actually free some memory.
> 

When I wrote it I thought the timing of the transmission of pending frames
was the problem rather than the actual memory usage. My intention was that
any data related to swapping be handled immediately without delay instead of
deferring until a time when GFP_ATOMIC allocations might fail. I arrived
at this patch because tcp_input.c does call tcp_push_pending_frames()
on the receive path and that led me to believe that coalescing was a
factor.

> And it avoids most of the time the infamous "tcp collapses" that needed
> extra memory allocations to group tcp payload on single pages.
> 
> If you want tcp output path being safer, you should disable TSO/GSO
> because some drivers have special handling for skbs that cannot be
> mapped because of various hardware limitations.
> 

Understood. Thanks for the explanation.

-- 
Mel Gorman
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* [PATCH net-next] inetpeer: inetpeer_invalidate_tree() cleanup
From: Eric Dumazet @ 2012-06-20 14:02 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, Steffen Klassert

From: Eric Dumazet <edumazet@google.com>

No need to use cmpxchg() in inetpeer_invalidate_tree() since we hold
base lock.

Also use correct rcu annotations to remove sparse errors
(CONFIG_SPARSE_RCU_POINTER=y)

net/ipv4/inetpeer.c:144:19: error: incompatible types in comparison
expression (different address spaces)
net/ipv4/inetpeer.c:149:20: error: incompatible types in comparison
expression (different address spaces)
net/ipv4/inetpeer.c:595:10: error: incompatible types in comparison
expression (different address spaces)


Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
---
 net/ipv4/inetpeer.c |   34 +++++++++++++++-------------------
 1 file changed, 15 insertions(+), 19 deletions(-)

diff --git a/net/ipv4/inetpeer.c b/net/ipv4/inetpeer.c
index cac02ad..da90a8c 100644
--- a/net/ipv4/inetpeer.c
+++ b/net/ipv4/inetpeer.c
@@ -126,7 +126,7 @@ int inet_peer_maxttl __read_mostly = 10 * 60 * HZ;	/* usual time to live: 10 min
 
 static void inetpeer_gc_worker(struct work_struct *work)
 {
-	struct inet_peer *p, *n;
+	struct inet_peer *p, *n, *c;
 	LIST_HEAD(list);
 
 	spin_lock_bh(&gc_lock);
@@ -138,17 +138,19 @@ static void inetpeer_gc_worker(struct work_struct *work)
 
 	list_for_each_entry_safe(p, n, &list, gc_list) {
 
-		if(need_resched())
+		if (need_resched())
 			cond_resched();
 
-		if (p->avl_left != peer_avl_empty) {
-			list_add_tail(&p->avl_left->gc_list, &list);
-			p->avl_left = peer_avl_empty;
+		c = rcu_dereference_protected(p->avl_left, 1);
+		if (c != peer_avl_empty) {
+			list_add_tail(&c->gc_list, &list);
+			p->avl_left = peer_avl_empty_rcu;
 		}
 
-		if (p->avl_right != peer_avl_empty) {
-			list_add_tail(&p->avl_right->gc_list, &list);
-			p->avl_right = peer_avl_empty;
+		c = rcu_dereference_protected(p->avl_right, 1);
+		if (c != peer_avl_empty) {
+			list_add_tail(&c->gc_list, &list);
+			p->avl_right = peer_avl_empty_rcu;
 		}
 
 		n = list_entry(p->gc_list.next, struct inet_peer, gc_list);
@@ -587,23 +589,17 @@ static void inetpeer_inval_rcu(struct rcu_head *head)
 
 void inetpeer_invalidate_tree(struct inet_peer_base *base)
 {
-	struct inet_peer *old, *new, *prev;
+	struct inet_peer *root;
 
 	write_seqlock_bh(&base->lock);
 
-	old = base->root;
-	if (old == peer_avl_empty_rcu)
-		goto out;
-
-	new = peer_avl_empty_rcu;
-
-	prev = cmpxchg(&base->root, old, new);
-	if (prev == old) {
+	root = rcu_deref_locked(base->root, base);
+	if (root != peer_avl_empty) {
+		base->root = peer_avl_empty_rcu;
 		base->total = 0;
-		call_rcu(&prev->gc_rcu, inetpeer_inval_rcu);
+		call_rcu(&root->gc_rcu, inetpeer_inval_rcu);
 	}
 
-out:
 	write_sequnlock_bh(&base->lock);
 }
 EXPORT_SYMBOL(inetpeer_invalidate_tree);

^ permalink raw reply related

* Re: [PATCH 08/17] net: Do not coalesce skbs belonging to PFMEMALLOC sockets
From: Eric Dumazet @ 2012-06-20 13:51 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Linux-MM, Linux-Netdev, LKML, David Miller,
	Neil Brown, Peter Zijlstra, Mike Christie, Eric B Munson,
	Sebastian Andrzej Siewior
In-Reply-To: <20120620133656.GH4011@suse.de>

On Wed, 2012-06-20 at 14:36 +0100, Mel Gorman wrote:
> The intention was to avoid any coalescing in the input path due to avoid
> packets that "were held back due to TCP_CORK or attempt at coalescing
> tiny packet". I recognise that it is clumsy and will take the approach
> instead of having __tcp_push_pending_frames() use sk_gfp_atomic() in the
> output path.

But coalescing in input path needs no additional memory allocation, it
can actually free some memory.

And it avoids most of the time the infamous "tcp collapses" that needed
extra memory allocations to group tcp payload on single pages.


If you want tcp output path being safer, you should disable TSO/GSO
because some drivers have special handling for skbs that cannot be
mapped because of various hardware limitations.

(for example, tg3 and its tg3_tso_bug() or tigon3_dma_hwbug_workaround()
functions)




--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Re: [PATCH 08/17] net: Do not coalesce skbs belonging to PFMEMALLOC sockets
From: Mel Gorman @ 2012-06-20 13:36 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Andrew Morton, Linux-MM, Linux-Netdev, LKML, David Miller,
	Neil Brown, Peter Zijlstra, Mike Christie, Eric B Munson,
	Sebastian Andrzej Siewior
In-Reply-To: <1340193892.4604.865.camel@edumazet-glaptop>

On Wed, Jun 20, 2012 at 02:04:52PM +0200, Eric Dumazet wrote:
> On Wed, 2012-06-20 at 12:44 +0100, Mel Gorman wrote:
> > Commit [bad43ca8: net: introduce skb_try_coalesce()] introduced an
> > optimisation to coalesce skbs to reduce memory usage and cache line
> > misses. In the case where the socket is used for swapping this can result
> > in a warning like the following.
> > 
> > [  110.476565] nbd0: page allocation failure: order:0, mode:0x20
> > [  110.476568] Pid: 2714, comm: nbd0 Not tainted 3.5.0-rc2-swapnbd-v12r2-slab #3
> > [  110.476569] Call Trace:
> > [  110.476573]  [<ffffffff811042d3>] warn_alloc_failed+0xf3/0x160
> > [  110.476578]  [<ffffffff81107c92>] __alloc_pages_nodemask+0x6e2/0x930
> >
> > <SNIP
> >  
> 
> 
> This makes absolutely no sense to me.
> 
> This patch changes input path, while your stack trace is about output
> path and a packet being fragmented.
> 

The intention was to avoid any coalescing in the input path due to avoid
packets that "were held back due to TCP_CORK or attempt at coalescing
tiny packet". I recognise that it is clumsy and will take the approach
instead of having __tcp_push_pending_frames() use sk_gfp_atomic() in the
output path.

Thanks.

-- 
Mel Gorman
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Re: [net-next:master 257/266] drivers/net/team/team_mode_loadbalance.c:99:30: sparse: incompatible types in comparison expression (different address spaces)
From: Jiri Pirko @ 2012-06-20 13:23 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: paulmck, Fengguang Wu, LKML, netdev
In-Reply-To: <1340197909.4604.956.camel@edumazet-glaptop>

Wed, Jun 20, 2012 at 03:11:49PM CEST, eric.dumazet@gmail.com wrote:
>On Wed, 2012-06-20 at 05:49 -0700, Paul E. McKenney wrote:
>> On Wed, Jun 20, 2012 at 02:50:55PM +0800, Fengguang Wu wrote:
>> > [CC Paul, the RCU maintainer]
>> > 
>> > On Wed, Jun 20, 2012 at 08:36:07AM +0200, Jiri Pirko wrote:
>> > > Wed, Jun 20, 2012 at 06:27:43AM CEST, wfg@linux.intel.com wrote:
>> > > >Hi Jiri,
>> > > >
>> > > >There are new sparse warnings show up in
>> > > >
>> > > >tree:   git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git master
>> > > >head:   677a3d60fb3153f786a0d28fcf0287670e7bd3c2
>> > > >commit: ab8250d70063f77929fc404c02390a1f64d66416 [257/266] team: lb: introduce infrastructure for userspace driven tx loadbalancing
>> > > >
>> > > >All sparse warnings:
>> > > >
>> > > >drivers/net/team/team_mode_loadbalance.c:99:30: sparse: incompatible types in comparison expression (different address spaces)
>> > > >
>> > > >drivers/net/team/team_mode_loadbalance.c:99:
>> > > >    96			struct lb_port_mapping *pm;
>> > > >    97	
>> > > >    98			pm = &lb_priv->ex->tx_hash_to_port_mapping[i];
>> > > >  > 99			if (pm->port == port) {
>> > > 
>> > > This looks like your checker does not like
>> > > (struct team_port __rcu *) == (struct team_port *)
>> > > But I wonder why (or how should I fix that)
>> 
>> Because you said that it was an RCU-protected pointer, but then did
>> not use an RCU primitive to access it.  In this case, where you are
>> just using the value but not dereferencing it, you can use
>> rcu_access_pointer().
>
>Yes, and please Jiri change the 
>
>rcu_assign_pointer(pm->port, NULL);
>
>by
>
>RCU_INIT_POINTER(pm->port, NULL);

Will do.

>
>
>

^ permalink raw reply

* Re: [PATCH] net: Update netdev_alloc_frag to work more efficiently with TCP and GRO
From: Eric Dumazet @ 2012-06-20 13:21 UTC (permalink / raw)
  To: Alexander Duyck; +Cc: netdev, davem, jeffrey.t.kirsher
In-Reply-To: <1340180223.4604.828.camel@edumazet-glaptop>

On Wed, 2012-06-20 at 10:17 +0200, Eric Dumazet wrote:

> Strange, I did again benchs with order-2 allocations and got good
> results this time, but with latest net-next, maybe things have changed
> since last time I did this.
> 
> (netdev_alloc_frag(), get_page_from_freelist() and put_page() less
> prevalent in perf results)
> 

In fact, since SLUB uses order-3 for kmalloc-2048, I felt lucky to try
this as well, and results are really good, on ixgbe at least.

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 5b21522..ffd2cba 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -299,6 +299,9 @@ struct netdev_alloc_cache {
 };
 static DEFINE_PER_CPU(struct netdev_alloc_cache, netdev_alloc_cache);
 
+#define MAX_NETDEV_FRAGSIZE	max_t(unsigned int, PAGE_SIZE, 32768)
+#define NETDEV_FRAG_ORDER	get_order(MAX_NETDEV_FRAGSIZE)
+
 /**
  * netdev_alloc_frag - allocate a page fragment
  * @fragsz: fragment size
@@ -316,11 +319,13 @@ void *netdev_alloc_frag(unsigned int fragsz)
 	nc = &__get_cpu_var(netdev_alloc_cache);
 	if (unlikely(!nc->page)) {
 refill:
-		nc->page = alloc_page(GFP_ATOMIC | __GFP_COLD);
+		nc->page = alloc_pages(GFP_ATOMIC | __GFP_COLD |
+				       (NETDEV_FRAG_ORDER ? __GFP_COMP : 0),
+				       NETDEV_FRAG_ORDER);
 		nc->offset = 0;
 	}
 	if (likely(nc->page)) {
-		if (nc->offset + fragsz > PAGE_SIZE) {
+		if (nc->offset + fragsz > MAX_NETDEV_FRAGSIZE) {
 			put_page(nc->page);
 			goto refill;
 		}
@@ -353,7 +358,7 @@ struct sk_buff *__netdev_alloc_skb(struct net_device *dev,
 	unsigned int fragsz = SKB_DATA_ALIGN(length + NET_SKB_PAD) +
 			      SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
 
-	if (fragsz <= PAGE_SIZE && !(gfp_mask & __GFP_WAIT)) {
+	if (fragsz <= MAX_NETDEV_FRAGSIZE && !(gfp_mask & __GFP_WAIT)) {
 		void *data = netdev_alloc_frag(fragsz);
 
 		if (likely(data)) {

^ permalink raw reply related

* Re: [net-next:master 257/266] drivers/net/team/team_mode_loadbalance.c:99:30: sparse: incompatible types in comparison expression (different address spaces)
From: Eric Dumazet @ 2012-06-20 13:11 UTC (permalink / raw)
  To: paulmck; +Cc: Fengguang Wu, Jiri Pirko, LKML, netdev
In-Reply-To: <20120620124913.GF2432@linux.vnet.ibm.com>

On Wed, 2012-06-20 at 05:49 -0700, Paul E. McKenney wrote:
> On Wed, Jun 20, 2012 at 02:50:55PM +0800, Fengguang Wu wrote:
> > [CC Paul, the RCU maintainer]
> > 
> > On Wed, Jun 20, 2012 at 08:36:07AM +0200, Jiri Pirko wrote:
> > > Wed, Jun 20, 2012 at 06:27:43AM CEST, wfg@linux.intel.com wrote:
> > > >Hi Jiri,
> > > >
> > > >There are new sparse warnings show up in
> > > >
> > > >tree:   git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git master
> > > >head:   677a3d60fb3153f786a0d28fcf0287670e7bd3c2
> > > >commit: ab8250d70063f77929fc404c02390a1f64d66416 [257/266] team: lb: introduce infrastructure for userspace driven tx loadbalancing
> > > >
> > > >All sparse warnings:
> > > >
> > > >drivers/net/team/team_mode_loadbalance.c:99:30: sparse: incompatible types in comparison expression (different address spaces)
> > > >
> > > >drivers/net/team/team_mode_loadbalance.c:99:
> > > >    96			struct lb_port_mapping *pm;
> > > >    97	
> > > >    98			pm = &lb_priv->ex->tx_hash_to_port_mapping[i];
> > > >  > 99			if (pm->port == port) {
> > > 
> > > This looks like your checker does not like
> > > (struct team_port __rcu *) == (struct team_port *)
> > > But I wonder why (or how should I fix that)
> 
> Because you said that it was an RCU-protected pointer, but then did
> not use an RCU primitive to access it.  In this case, where you are
> just using the value but not dereferencing it, you can use
> rcu_access_pointer().

Yes, and please Jiri change the 

rcu_assign_pointer(pm->port, NULL);

by

RCU_INIT_POINTER(pm->port, NULL);

^ permalink raw reply

* Re: [PATCH] net: sh_eth: fix the rxdesc pointer when rx descriptor empty happens
From: Guennadi Liakhovetski @ 2012-06-20 13:10 UTC (permalink / raw)
  To: Shimoda, Yoshihiro; +Cc: netdev, SH-Linux
In-Reply-To: <4FC491EB.4040002@renesas.com>

Hello Shimoda-san

On Tue, 29 May 2012, Shimoda, Yoshihiro wrote:

> When Receive Descriptor Empty happens, rxdesc pointer of the driver
> and actual next descriptor of the controller may be mismatch.
> This patch fixes it.

Unfortunately, this patch breaks networking on ecovec (sh7724). Booting 
with dhcp and NFS-root progresses very slowly with lots of "nfs: server 
not responding / Ok" messages and never completes. Reverting it in current 
Linus' tree fixes the problem.

Thanks
Guennadi

> 
> Signed-off-by: Yoshihiro Shimoda <yoshihiro.shimoda.uh@renesas.com>
> ---
>  drivers/net/ethernet/renesas/sh_eth.c |    8 +++++---
>  1 files changed, 5 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/net/ethernet/renesas/sh_eth.c b/drivers/net/ethernet/renesas/sh_eth.c
> index be3c221..667169b 100644
> --- a/drivers/net/ethernet/renesas/sh_eth.c
> +++ b/drivers/net/ethernet/renesas/sh_eth.c
> @@ -1101,8 +1101,12 @@ static int sh_eth_rx(struct net_device *ndev)
> 
>  	/* Restart Rx engine if stopped. */
>  	/* If we don't need to check status, don't. -KDU */
> -	if (!(sh_eth_read(ndev, EDRRR) & EDRRR_R))
> +	if (!(sh_eth_read(ndev, EDRRR) & EDRRR_R)) {
> +		/* fix the values for the next receiving */
> +		mdp->cur_rx = mdp->dirty_rx = (sh_eth_read(ndev, RDFAR) -
> +					       sh_eth_read(ndev, RDLAR)) >> 4;
>  		sh_eth_write(ndev, EDRRR_R, EDRRR);
> +	}
> 
>  	return 0;
>  }
> @@ -1199,8 +1203,6 @@ static void sh_eth_error(struct net_device *ndev, int intr_status)
>  		/* Receive Descriptor Empty int */
>  		ndev->stats.rx_over_errors++;
> 
> -		if (sh_eth_read(ndev, EDRRR) ^ EDRRR_R)
> -			sh_eth_write(ndev, EDRRR_R, EDRRR);
>  		if (netif_msg_rx_err(mdp))
>  			dev_err(&ndev->dev, "Receive Descriptor Empty\n");
>  	}
> -- 
> 1.7.1
> --
> To unsubscribe from this list: send the line "unsubscribe linux-sh" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

---
Guennadi Liakhovetski, Ph.D.
Freelance Open-Source Software Developer
http://www.open-technology.de/

^ permalink raw reply

* RE: [PATCH] netxen : Error return off by one for XG port.
From: Rajesh Borundia @ 2012-06-20 13:06 UTC (permalink / raw)
  To: santosh nayak, Sony Chacko; +Cc: netdev, kernel-janitors@vger.kernel.org
In-Reply-To: <1340189578-18308-1-git-send-email-santoshprasadnayak@gmail.com>

______________________________________
From: santosh nayak [santoshprasadnayak@gmail.com]
Sent: Wednesday, June 20, 2012 4:22 PM
To: Sony Chacko; Rajesh Borundia
Cc: netdev; kernel-janitors@vger.kernel.org; Santosh Nayak
Subject: [PATCH] netxen : Error return off by one for XG port.

From: Santosh Nayak <santoshprasadnayak@gmail.com>

There are  NETXEN_NIU_MAX_XG_PORTS ports.
Port indexing starts from zero.
Hence we should also return error for  'port == NETXEN_NIU_MAX_XG_PORTS'.

Signed-off-by: Santosh Nayak <santoshprasadnayak@gmail.com>
---
 .../ethernet/qlogic/netxen/netxen_nic_ethtool.c    |    4 ++--
 drivers/net/ethernet/qlogic/netxen/netxen_nic_hw.c |    4 ++--
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/qlogic/netxen/netxen_nic_ethtool.c b/drivers/net/ethernet/qlogic/netxen/netxen_nic_ethtool.c
index d4f179f..9103e3e 100644
--- a/drivers/net/ethernet/qlogic/netxen/netxen_nic_ethtool.c
+++ b/drivers/net/ethernet/qlogic/netxen/netxen_nic_ethtool.c
@@ -511,7 +511,7 @@ netxen_nic_get_pauseparam(struct net_device *dev,
                                break;
                }
        } else if (adapter->ahw.port_type == NETXEN_NIC_XGBE) {
-               if ((port < 0) || (port > NETXEN_NIU_MAX_XG_PORTS))
+               if ((port < 0) || (port >= NETXEN_NIU_MAX_XG_PORTS))
                        return;
                pause->rx_pause = 1;
                val = NXRD32(adapter, NETXEN_NIU_XG_PAUSE_CTL);
@@ -577,7 +577,7 @@ netxen_nic_set_pauseparam(struct net_device *dev,
                }
                NXWR32(adapter, NETXEN_NIU_GB_PAUSE_CTL, val);
        } else if (adapter->ahw.port_type == NETXEN_NIC_XGBE) {
-               if ((port < 0) || (port > NETXEN_NIU_MAX_XG_PORTS))
+               if ((port < 0) || (port >= NETXEN_NIU_MAX_XG_PORTS))
                        return -EIO;
                val = NXRD32(adapter, NETXEN_NIU_XG_PAUSE_CTL);
                if (port == 0) {
diff --git a/drivers/net/ethernet/qlogic/netxen/netxen_nic_hw.c b/drivers/net/ethernet/qlogic/netxen/netxen_nic_hw.c
index de96a94..946160f 100644
--- a/drivers/net/ethernet/qlogic/netxen/netxen_nic_hw.c
+++ b/drivers/net/ethernet/qlogic/netxen/netxen_nic_hw.c
@@ -365,7 +365,7 @@ static int netxen_niu_disable_xg_port(struct netxen_adapter *adapter)
        if (NX_IS_REVISION_P3(adapter->ahw.revision_id))
                return 0;

-       if (port > NETXEN_NIU_MAX_XG_PORTS)
+       if (port >= NETXEN_NIU_MAX_XG_PORTS)
                return -EINVAL;

        mac_cfg = 0;
@@ -392,7 +392,7 @@ static int netxen_p2_nic_set_promisc(struct netxen_adapter *adapter, u32 mode)
        u32 port = adapter->physical_port;
        u16 board_type = adapter->ahw.board_type;

-       if (port > NETXEN_NIU_MAX_XG_PORTS)
+       if (port >= NETXEN_NIU_MAX_XG_PORTS)
                return -EINVAL;

        mac_cfg = NXRD32(adapter, NETXEN_NIU_XGE_CONFIG_0 + (0x10000 * port));
--
1.7.4.4

Looks ok to me.

Rajesh

^ permalink raw reply related

* Re: [PATCH v2] ipv4: Early TCP socket demux.
From: Eric Dumazet @ 2012-06-20 12:38 UTC (permalink / raw)
  To: David Miller; +Cc: shemminger, netdev
In-Reply-To: <20120620.031543.1511134879638711616.davem@davemloft.net>

On Wed, 2012-06-20 at 03:15 -0700, David Miller wrote:
> 			       dev->ifindex);
> +		if (sk) {
> +			skb_orphan(skb);
> +			skb->sk = sk;
> +			skb->destructor = sock_edemux;
> +			if (!skb_dst(skb) &&
> +			    sk->sk_state != TCP_TIME_WAIT) {
> +				struct dst_entry *dst = sk->sk_rx_dst;
> +				if (dst)
> +					dst = dst_check(dst, 0);
> +				if (dst) {
> +					struct rtable *rt = (struct rtable *) dst;
> +
> +					if (rt->rt_iif == dev->ifindex)
> +						skb_dst_set_noref(skb, dst);
> +				}
> +			}
> +		}
> +	}
> +	return pp;

I am trying to convince myself its safe.

skb_dst_set_noref() assumes caller hold rcu_read_lock() until we use the
skb dst.

And dev_gro_receive() releases RCU...

Problem could happen if sk->sk_rx_dst is freed while some packets are
still in napi or socket backlog (can happen with some network
reordering)

1) Socket backlog must be flushed before sk->sk_rx_dst freeing

2) Even if we move rcu_read_lock() in net_rx_action(), we need some
napi_gro_forcedstrefs() in case we sofnet_break

Or maybe just use napi_gro_flush() ?

diff --git a/net/core/dev.c b/net/core/dev.c
index 57c4f9b..c0f71a0 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3861,6 +3861,9 @@ static void net_rx_action(struct softirq_action *h)
 
 		budget -= work;
 
+		if (work == weight)
+			napi_gro_flush(n);
+
 		local_irq_disable();
 
 		/* Drivers must not modify the NAPI state if they

^ permalink raw reply related

* Re: [PATCH 07/17] net: Introduce sk_gfp_atomic() to allow addition of GFP flags depending on the individual socket
From: Eric Dumazet @ 2012-06-20 12:06 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Linux-MM, Linux-Netdev, LKML, David Miller,
	Neil Brown, Peter Zijlstra, Mike Christie, Eric B Munson,
	Sebastian Andrzej Siewior
In-Reply-To: <1340192652-31658-8-git-send-email-mgorman@suse.de>

On Wed, 2012-06-20 at 12:44 +0100, Mel Gorman wrote:
> Introduce sk_gfp_atomic(), this function allows to inject sock specific
> flags to each sock related allocation. It is only used on allocation
> paths that may be required for writing pages back to network storage.
> 
> [davem@davemloft.net: Use sk_gfp_atomic only when necessary]
> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> Signed-off-by: Mel Gorman <mgorman@suse.de>
> Acked-by: David S. Miller <davem@davemloft.net>
> ---
>  include/net/sock.h    |    5 +++++
>  net/ipv4/tcp_output.c |    9 +++++----
>  net/ipv6/tcp_ipv6.c   |    8 +++++---
>  3 files changed, 15 insertions(+), 7 deletions(-)
> 

> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index 803cbfe..440b47e 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -2461,7 +2461,8 @@ struct sk_buff *tcp_make_synack(struct sock *sk, struct dst_entry *dst,
>  
>  	if (cvp != NULL && cvp->s_data_constant && cvp->s_data_desired)
>  		s_data_desired = cvp->s_data_desired;
> -	skb = sock_wmalloc(sk, MAX_TCP_HEADER + 15 + s_data_desired, 1, GFP_ATOMIC);
> +	skb = sock_wmalloc(sk, MAX_TCP_HEADER + 15 + s_data_desired, 1,
> +			   sk_gfp_atomic(sk, GFP_ATOMIC));
>  	if (skb == NULL)
>  

This bit no longer applies on net-next, sock_wmalloc() was changed to a
mere alloc_skb()


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Re: [PATCH 08/17] net: Do not coalesce skbs belonging to PFMEMALLOC sockets
From: Eric Dumazet @ 2012-06-20 12:04 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Linux-MM, Linux-Netdev, LKML, David Miller,
	Neil Brown, Peter Zijlstra, Mike Christie, Eric B Munson,
	Sebastian Andrzej Siewior
In-Reply-To: <1340192652-31658-9-git-send-email-mgorman@suse.de>

On Wed, 2012-06-20 at 12:44 +0100, Mel Gorman wrote:
> Commit [bad43ca8: net: introduce skb_try_coalesce()] introduced an
> optimisation to coalesce skbs to reduce memory usage and cache line
> misses. In the case where the socket is used for swapping this can result
> in a warning like the following.
> 
> [  110.476565] nbd0: page allocation failure: order:0, mode:0x20
> [  110.476568] Pid: 2714, comm: nbd0 Not tainted 3.5.0-rc2-swapnbd-v12r2-slab #3
> [  110.476569] Call Trace:
> [  110.476573]  [<ffffffff811042d3>] warn_alloc_failed+0xf3/0x160
> [  110.476578]  [<ffffffff81107c92>] __alloc_pages_nodemask+0x6e2/0x930
> [  110.476582]  [<ffffffff81107c92>] ?  __alloc_pages_nodemask+0x6e2/0x930
> [  110.476588]  [<ffffffff81149f09>] kmem_getpages+0x59/0x1a0
> [  110.476593]  [<ffffffff8114ae5b>] fallback_alloc+0x17b/0x260
> [  110.476597]  [<ffffffff8114ac26>] ____cache_alloc_node+0x96/0x150
> [  110.476602]  [<ffffffff8114a458>] kmem_cache_alloc_node+0x78/0x1b0
> [  110.476607]  [<ffffffff8136c127>] __alloc_skb+0x57/0x1e0
> [  110.476612]  [<ffffffff813b9f81>] sk_stream_alloc_skb+0x41/0x120
> [  110.476617]  [<ffffffff813c8c72>] tcp_fragment+0x62/0x370
> [  110.476622]  [<ffffffff813c8fb9>] tso_fragment+0x39/0x180
> [  110.476628]  [<ffffffff813ca2a9>] tcp_write_xmit+0x1a9/0x3f0
> [  110.476634]  [<ffffffff813ca556>] __tcp_push_pending_frames+0x26/0xd0
> [  110.476639]  [<ffffffff813c61f5>] tcp_rcv_established+0x385/0x760
> [  110.476644]  [<ffffffff813ce671>] tcp_v4_do_rcv+0x111/0x1f0
> [  110.476648]  [<ffffffff81367259>] release_sock+0x99/0x140
> [  110.476652]  [<ffffffff813ba82b> tcp_sendmsg+0x7cb/0xe80
> [  110.476657]  [<ffffffff813df9b4>] inet_sendmsg+0x64/0xb0
> [  110.476661]  [<ffffffff811f0a00>] ? security_socket_sendmsg+0x10/0x20
> [  110.476666]  [<ffffffff81361dd8>] sock_sendmsg+0xf8/0x130
> [  110.476672]  [<ffffffff8124ba4c>] ? cpumask_next_and+0x3c/0x50
> [  110.476677]  [<ffffffff8107b053>] ? update_sd_lb_stats+0x123/0x620
> [  110.476683]  [<ffffffff8105164f>] ? recalc_sigpending+0x1f/0x70
> [  110.476688]  [<ffffffff81051e17>] ? __set_task_blocked+0x37/0x80
> [  110.476693]  [<ffffffff81361e51>] kernel_sendmsg+0x41/0x60
> [  110.476698]  [<ffffffffa048d417>] sock_xmit+0xb7/0x300 [nbd]
> [  110.476703]  [<ffffffff8107bad7>] ? load_balance+0xd7/0x490
> [  110.476710]  [<ffffffffa048d7ac>] nbd_send_req+0x14c/0x270 [nbd]
> [  110.476716]  [<ffffffffa048e21e>] nbd_handle_req+0x9e/0x180 [nbd]
> [  110.476721]  [<ffffffffa048e4f2>] nbd_thread+0xb2/0x150 [nbd]
> [  110.476725]  [<ffffffff81062580>] ? wake_up_bit+0x40/0x40
> [  110.476730]  [<ffffffffa048e440>] ? do_nbd_request+0x140/0x140 [nbd]
> [  110.476733]  [<ffffffff81061d7e>] kthread+0x9e/0xb0
> [  110.476739]  [<ffffffff81439d64>] kernel_thread_helper+0x4/0x10
> [  110.476743]  [<ffffffff81061ce0>] ? flush_kthread_worker+0xc0/0xc0
> [  110.476748]  [<ffffffff81439d60>] ? gs_change+0x13/0x13
> 
> There were two ways this could be addressed. The first would be to
> teach __tcp_push_pending_frames() to use __GFP_MEMALLOC if the socket
> has SOCK_MEMALLOC set. This potentially defers the time of allocation
> to a point where we are applying greater pressure on PFMEMALLOC reserves
> which is undesirable.  The second approach is to disable skb coalescing
> for SOCK_MEMALLOC sockets and process them immediately. This patch takes
> the second approach.
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>
> ---
>  net/core/skbuff.c    |    7 +++++++
>  net/ipv4/tcp_input.c |    8 ++++++++
>  2 files changed, 15 insertions(+)
> 
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index d78671e..1d6ecc8 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -3370,6 +3370,13 @@ bool skb_try_coalesce(struct sk_buff *to, struct sk_buff *from,
>  
>  	*fragstolen = false;
>  
> +	/*
> +	 * Avoid coalescing of SOCK_MEMALLOC socks are we do not want to defer
> +	 * RX/TX to a time when pfmemallo reserves are under greater pressure
> +	 */
> +	if (sk_memalloc_socks() && sock_flag(to->sk, SOCK_MEMALLOC))
> +		return false;
> +
>  	if (skb_cloned(to))
>  		return false;
>  
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index b224eb8..448f130 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -4553,6 +4553,14 @@ static bool tcp_try_coalesce(struct sock *sk,
>  
>  	*fragstolen = false;
>  
> +	/*
> +	 * Do not attempt merging if the socket is used by the VM for swapping.
> +	 * Attempts to defer can result in allocation failures during RX when
> +	 * an attempt is made to push pending frames
> +	 */
> +	if (sk_memalloc_socks() && sock_flag(sk, SOCK_MEMALLOC))
> +		return false;
> +
>  	if (tcp_hdr(from)->fin)
>  		return false;
>  


This makes absolutely no sense to me.

This patch changes input path, while your stack trace is about output
path and a packet being fragmented.

If you really want to avoid this kind of thing, you'll need to disable
TSO/GSO on the socket.



--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Re: [PATCH 01/17] mm: sl[au]b: Add knowledge of PFMEMALLOC reserve pages
From: Mel Gorman @ 2012-06-20 11:44 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Andrew Morton, Linux-MM, Linux-Netdev, LKML, David Miller,
	Neil Brown, Peter Zijlstra, Mike Christie, Eric B Munson
In-Reply-To: <20120620110512.GA4208@breakpoint.cc>

On Wed, Jun 20, 2012 at 01:05:13PM +0200, Sebastian Andrzej Siewior wrote:
> On Wed, Jun 20, 2012 at 10:35:04AM +0100, Mel Gorman wrote:
> > [a.p.zijlstra@chello.nl: Original implementation]
> > Signed-off-by: Mel Gorman <mgorman@suse.de>
> > ---
> > diff --git a/mm/slab.c b/mm/slab.c
> > index e901a36..b190cac 100644
> > --- a/mm/slab.c
> > +++ b/mm/slab.c
> > @@ -1851,6 +1984,7 @@ static void kmem_freepages(struct kmem_cache *cachep, void *addr)
> >  	while (i--) {
> >  		BUG_ON(!PageSlab(page));
> >  		__ClearPageSlab(page);
> > +		__ClearPageSlabPfmemalloc(page);
> >  		page++;
> >  	}
> >  	if (current->reclaim_state)
> > @@ -3120,16 +3254,19 @@ bad:
> > diff --git a/mm/slub.c b/mm/slub.c
> > index 8c691fa..43738c9 100644
> > --- a/mm/slub.c
> > +++ b/mm/slub.c
> > @@ -1414,6 +1418,7 @@ static void __free_slab(struct kmem_cache *s, struct page *page)
> >  		-pages);
> >  
> >  	__ClearPageSlab(page);
> > +	__ClearPageSlabPfmemalloc(page);
> >  	reset_page_mapcount(page);
> >  	if (current->reclaim_state)
> >  		current->reclaim_state->reclaimed_slab += pages;
> 
> So you mention a change here in v11's changelog but I don't see it.
> 

Because I'm an idiot and send out the wrong branch and then was rude
enough to not include you on the CC. I have resent the series, correctly
this time I hope. Sorry about that.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply

* [PATCH 17/17] mm: Account for the number of times direct reclaimers get throttled
From: Mel Gorman @ 2012-06-20 11:44 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux-MM, Linux-Netdev, LKML, David Miller, Neil Brown,
	Peter Zijlstra, Mike Christie, Eric B Munson,
	Sebastian Andrzej Siewior, Mel Gorman
In-Reply-To: <1340192652-31658-1-git-send-email-mgorman@suse.de>

Under significant pressure when writing back to network-backed storage,
direct reclaimers may get throttled. This is expected to be a
short-lived event and the processes get woken up again but processes do
get stalled. This patch counts how many times such stalling occurs. It's
up to the administrator whether to reduce these stalls by increasing
min_free_kbytes.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/vm_event_item.h |    1 +
 mm/vmscan.c                   |    3 +++
 mm/vmstat.c                   |    1 +
 3 files changed, 5 insertions(+)

diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 06f8e38..57f7b10 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -30,6 +30,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 		FOR_ALL_ZONES(PGSTEAL_DIRECT),
 		FOR_ALL_ZONES(PGSCAN_KSWAPD),
 		FOR_ALL_ZONES(PGSCAN_DIRECT),
+		PGSCAN_DIRECT_THROTTLE,
 #ifdef CONFIG_NUMA
 		PGSCAN_ZONE_RECLAIM_FAILED,
 #endif
diff --git a/mm/vmscan.c b/mm/vmscan.c
index bf8625c..bcc65c5 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2166,6 +2166,9 @@ static void throttle_direct_reclaim(gfp_t gfp_mask, struct zonelist *zonelist,
 	if (pfmemalloc_watermark_ok(pgdat))
 		return;
 
+	/* Account for the throttling */
+	count_vm_event(PGSCAN_DIRECT_THROTTLE);
+
 	/*
 	 * If the caller cannot enter the filesystem, it's possible that it
 	 * is due to the caller holding an FS lock or performing a journal
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 1bbbbd9..df7a674 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -745,6 +745,7 @@ const char * const vmstat_text[] = {
 	TEXTS_FOR_ZONES("pgsteal_direct")
 	TEXTS_FOR_ZONES("pgscan_kswapd")
 	TEXTS_FOR_ZONES("pgscan_direct")
+	"pgscan_direct_throttle",
 
 #ifdef CONFIG_NUMA
 	"zone_reclaim_failed",
-- 
1.7.9.2

^ permalink raw reply related

* [PATCH 16/17] mm: Throttle direct reclaimers if PF_MEMALLOC reserves are low and swap is backed by network storage
From: Mel Gorman @ 2012-06-20 11:44 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux-MM, Linux-Netdev, LKML, David Miller, Neil Brown,
	Peter Zijlstra, Mike Christie, Eric B Munson,
	Sebastian Andrzej Siewior, Mel Gorman
In-Reply-To: <1340192652-31658-1-git-send-email-mgorman@suse.de>

If swap is backed by network storage such as NBD, there is a risk
that a large number of reclaimers can hang the system by consuming
all PF_MEMALLOC reserves. To avoid these hangs, the administrator
must tune min_free_kbytes in advance which is a bit fragile.

This patch throttles direct reclaimers if half the PF_MEMALLOC reserves
are in use. If the system is routinely getting throttled the system
administrator can increase min_free_kbytes so degradation is smoother
but the system will keep running.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/mmzone.h |    1 +
 mm/page_alloc.c        |    1 +
 mm/vmscan.c            |  128 +++++++++++++++++++++++++++++++++++++++++++++---
 3 files changed, 122 insertions(+), 8 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 2427706..b97c3d3 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -694,6 +694,7 @@ typedef struct pglist_data {
 					     range, including holes */
 	int node_id;
 	wait_queue_head_t kswapd_wait;
+	wait_queue_head_t pfmemalloc_wait;
 	struct task_struct *kswapd;
 	int kswapd_max_order;
 	enum zone_type classzone_idx;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 6c48965..95da786 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4379,6 +4379,7 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,
 	pgdat_resize_init(pgdat);
 	pgdat->nr_zones = 0;
 	init_waitqueue_head(&pgdat->kswapd_wait);
+	init_waitqueue_head(&pgdat->pfmemalloc_wait);
 	pgdat->kswapd_max_order = 0;
 	pgdat_page_cgroup_init(pgdat);
 
diff --git a/mm/vmscan.c b/mm/vmscan.c
index eeb3bc9..bf8625c 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2111,6 +2111,80 @@ out:
 	return 0;
 }
 
+static bool pfmemalloc_watermark_ok(pg_data_t *pgdat)
+{
+	struct zone *zone;
+	unsigned long pfmemalloc_reserve = 0;
+	unsigned long free_pages = 0;
+	int i;
+	bool wmark_ok;
+
+	for (i = 0; i <= ZONE_NORMAL; i++) {
+		zone = &pgdat->node_zones[i];
+		pfmemalloc_reserve += min_wmark_pages(zone);
+		free_pages += zone_page_state(zone, NR_FREE_PAGES);
+	}
+
+	wmark_ok = free_pages > pfmemalloc_reserve / 2;
+
+	/* kswapd must be awake if processes are being throttled */
+	if (!wmark_ok && waitqueue_active(&pgdat->kswapd_wait)) {
+		pgdat->classzone_idx = min(pgdat->classzone_idx,
+						(enum zone_type)ZONE_NORMAL);
+		wake_up_interruptible(&pgdat->kswapd_wait);
+	}
+
+	return wmark_ok;
+}
+
+/*
+ * Throttle direct reclaimers if backing storage is backed by the network
+ * and the PFMEMALLOC reserve for the preferred node is getting dangerously
+ * depleted. kswapd will continue to make progress and wake the processes
+ * when the low watermark is reached
+ */
+static void throttle_direct_reclaim(gfp_t gfp_mask, struct zonelist *zonelist,
+					nodemask_t *nodemask)
+{
+	struct zone *zone;
+	int high_zoneidx = gfp_zone(gfp_mask);
+	pg_data_t *pgdat;
+
+	/*
+	 * Kernel threads should not be throttled as they may be indirectly
+	 * responsible for cleaning pages necessary for reclaim to make forward
+	 * progress. kjournald for example may enter direct reclaim while
+	 * committing a transaction where throttling it could forcing other
+	 * processes to block on log_wait_commit().
+	 */
+	if (current->flags & PF_KTHREAD)
+		return;
+
+	/* Check if the pfmemalloc reserves are ok */
+	first_zones_zonelist(zonelist, high_zoneidx, NULL, &zone);
+	pgdat = zone->zone_pgdat;
+	if (pfmemalloc_watermark_ok(pgdat))
+		return;
+
+	/*
+	 * If the caller cannot enter the filesystem, it's possible that it
+	 * is due to the caller holding an FS lock or performing a journal
+	 * transaction in the case of a filesystem like ext[3|4]. In this case,
+	 * it is not safe to block on pfmemalloc_wait as kswapd could be
+	 * blocked waiting on the same lock. Instead, throttle for up to a
+	 * second before continuing.
+	 */
+	if (!(gfp_mask & __GFP_FS)) {
+		wait_event_interruptible_timeout(pgdat->pfmemalloc_wait,
+			pfmemalloc_watermark_ok(pgdat), HZ);
+		return;
+	}
+
+	/* Throttle until kswapd wakes the process */
+	wait_event_killable(zone->zone_pgdat->pfmemalloc_wait,
+		pfmemalloc_watermark_ok(pgdat));
+}
+
 unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 				gfp_t gfp_mask, nodemask_t *nodemask)
 {
@@ -2130,6 +2204,15 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 		.gfp_mask = sc.gfp_mask,
 	};
 
+	throttle_direct_reclaim(gfp_mask, zonelist, nodemask);
+
+	/*
+	 * Do not enter reclaim if fatal signal is pending. 1 is returned so
+	 * that the page allocator does not consider triggering OOM
+	 */
+	if (fatal_signal_pending(current))
+		return 1;
+
 	trace_mm_vmscan_direct_reclaim_begin(order,
 				sc.may_writepage,
 				gfp_mask);
@@ -2274,8 +2357,13 @@ static bool pgdat_balanced(pg_data_t *pgdat, unsigned long balanced_pages,
 	return balanced_pages >= (present_pages >> 2);
 }
 
-/* is kswapd sleeping prematurely? */
-static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining,
+/*
+ * Prepare kswapd for sleeping. This verifies that there are no processes
+ * waiting in throttle_direct_reclaim() and that watermarks have been met.
+ *
+ * Returns true if kswapd is ready to sleep
+ */
+static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, long remaining,
 					int classzone_idx)
 {
 	int i;
@@ -2284,7 +2372,21 @@ static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining,
 
 	/* If a direct reclaimer woke kswapd within HZ/10, it's premature */
 	if (remaining)
-		return true;
+		return false;
+
+	/*
+	 * There is a potential race between when kswapd checks its watermarks
+	 * and a process gets throttled. There is also a potential race if
+	 * processes get throttled, kswapd wakes, a large process exits therby
+	 * balancing the zones that causes kswapd to miss a wakeup. If kswapd
+	 * is going to sleep, no process should be sleeping on pfmemalloc_wait
+	 * so wake them now if necessary. If necessary, processes will wake
+	 * kswapd and get throttled again
+	 */
+	if (waitqueue_active(&pgdat->pfmemalloc_wait)) {
+		wake_up(&pgdat->pfmemalloc_wait);
+		return false;
+	}
 
 	/* Check the watermark levels */
 	for (i = 0; i <= classzone_idx; i++) {
@@ -2317,9 +2419,9 @@ static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining,
 	 * must be balanced
 	 */
 	if (order)
-		return !pgdat_balanced(pgdat, balanced, classzone_idx);
+		return pgdat_balanced(pgdat, balanced, classzone_idx);
 	else
-		return !all_zones_ok;
+		return all_zones_ok;
 }
 
 /*
@@ -2545,6 +2647,16 @@ loop_again:
 			}
 
 		}
+
+		/*
+		 * If the low watermark is met there is no need for processes
+		 * to be throttled on pfmemalloc_wait as they should not be
+		 * able to safely make forward progress. Wake them
+		 */
+		if (waitqueue_active(&pgdat->pfmemalloc_wait) &&
+				pfmemalloc_watermark_ok(pgdat))
+			wake_up(&pgdat->pfmemalloc_wait);
+
 		if (all_zones_ok || (order && pgdat_balanced(pgdat, balanced, *classzone_idx)))
 			break;		/* kswapd: all done */
 		/*
@@ -2646,7 +2758,7 @@ out:
 	}
 
 	/*
-	 * Return the order we were reclaiming at so sleeping_prematurely()
+	 * Return the order we were reclaiming at so prepare_kswapd_sleep()
 	 * makes a decision on the order we were last reclaiming at. However,
 	 * if another caller entered the allocator slow path while kswapd
 	 * was awake, order will remain at the higher level
@@ -2666,7 +2778,7 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int order, int classzone_idx)
 	prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);
 
 	/* Try to sleep for a short interval */
-	if (!sleeping_prematurely(pgdat, order, remaining, classzone_idx)) {
+	if (prepare_kswapd_sleep(pgdat, order, remaining, classzone_idx)) {
 		remaining = schedule_timeout(HZ/10);
 		finish_wait(&pgdat->kswapd_wait, &wait);
 		prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);
@@ -2676,7 +2788,7 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int order, int classzone_idx)
 	 * After a short sleep, check if it was a premature sleep. If not, then
 	 * go fully to sleep until explicitly woken up.
 	 */
-	if (!sleeping_prematurely(pgdat, order, remaining, classzone_idx)) {
+	if (prepare_kswapd_sleep(pgdat, order, remaining, classzone_idx)) {
 		trace_mm_vmscan_kswapd_sleep(pgdat->node_id);
 
 		/*
-- 
1.7.9.2

^ permalink raw reply related

* [PATCH 15/17] nbd: Set SOCK_MEMALLOC for access to PFMEMALLOC reserves
From: Mel Gorman @ 2012-06-20 11:44 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux-MM, Linux-Netdev, LKML, David Miller, Neil Brown,
	Peter Zijlstra, Mike Christie, Eric B Munson,
	Sebastian Andrzej Siewior, Mel Gorman
In-Reply-To: <1340192652-31658-1-git-send-email-mgorman@suse.de>

Set SOCK_MEMALLOC on the NBD socket to allow access to PFMEMALLOC
reserves so pages backed by NBD, particularly if swap related, can
be cleaned to prevent the machine being deadlocked. It is still
possible that the PFMEMALLOC reserves get depleted resulting in
deadlock but this can be resolved by the administrator by increasing
min_free_kbytes.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 drivers/block/nbd.c |    6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c
index 061427a75d..76bc96f 100644
--- a/drivers/block/nbd.c
+++ b/drivers/block/nbd.c
@@ -154,6 +154,7 @@ static int sock_xmit(struct nbd_device *nbd, int send, void *buf, int size,
 	struct msghdr msg;
 	struct kvec iov;
 	sigset_t blocked, oldset;
+	unsigned long pflags = current->flags;
 
 	if (unlikely(!sock)) {
 		dev_err(disk_to_dev(nbd->disk),
@@ -167,8 +168,9 @@ static int sock_xmit(struct nbd_device *nbd, int send, void *buf, int size,
 	siginitsetinv(&blocked, sigmask(SIGKILL));
 	sigprocmask(SIG_SETMASK, &blocked, &oldset);
 
+	current->flags |= PF_MEMALLOC;
 	do {
-		sock->sk->sk_allocation = GFP_NOIO;
+		sock->sk->sk_allocation = GFP_NOIO | __GFP_MEMALLOC;
 		iov.iov_base = buf;
 		iov.iov_len = size;
 		msg.msg_name = NULL;
@@ -214,6 +216,7 @@ static int sock_xmit(struct nbd_device *nbd, int send, void *buf, int size,
 	} while (size > 0);
 
 	sigprocmask(SIG_SETMASK, &oldset, NULL);
+	tsk_restore_flags(current, pflags, PF_MEMALLOC);
 
 	return result;
 }
@@ -405,6 +408,7 @@ static int nbd_do_it(struct nbd_device *nbd)
 
 	BUG_ON(nbd->magic != NBD_MAGIC);
 
+	sk_set_memalloc(nbd->sock->sk);
 	nbd->pid = task_pid_nr(current);
 	ret = device_create_file(disk_to_dev(nbd->disk), &pid_attr);
 	if (ret) {
-- 
1.7.9.2

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related

* [PATCH 14/17] mm: Micro-optimise slab to avoid a function call
From: Mel Gorman @ 2012-06-20 11:44 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux-MM, Linux-Netdev, LKML, David Miller, Neil Brown,
	Peter Zijlstra, Mike Christie, Eric B Munson,
	Sebastian Andrzej Siewior, Mel Gorman
In-Reply-To: <1340192652-31658-1-git-send-email-mgorman@suse.de>

Getting and putting objects in SLAB currently requires a function call
but the bulk of the work is related to PFMEMALLOC reserves which are
only consumed when network-backed storage is critical. Use an inline
function to determine if the function call is required.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/slab.c |   28 ++++++++++++++++++++++++++--
 1 file changed, 26 insertions(+), 2 deletions(-)

diff --git a/mm/slab.c b/mm/slab.c
index 5268368..b1a39f7 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -117,6 +117,8 @@
 #include	<linux/memory.h>
 #include	<linux/prefetch.h>
 
+#include	<net/sock.h>
+
 #include	<asm/cacheflush.h>
 #include	<asm/tlbflush.h>
 #include	<asm/page.h>
@@ -1016,7 +1018,7 @@ out:
 	spin_unlock_irqrestore(&l3->list_lock, flags);
 }
 
-static void *ac_get_obj(struct kmem_cache *cachep, struct array_cache *ac,
+static void *__ac_get_obj(struct kmem_cache *cachep, struct array_cache *ac,
 						gfp_t flags, bool force_refill)
 {
 	int i;
@@ -1063,7 +1065,20 @@ static void *ac_get_obj(struct kmem_cache *cachep, struct array_cache *ac,
 	return objp;
 }
 
-static void ac_put_obj(struct kmem_cache *cachep, struct array_cache *ac,
+static inline void *ac_get_obj(struct kmem_cache *cachep,
+			struct array_cache *ac, gfp_t flags, bool force_refill)
+{
+	void *objp;
+
+	if (unlikely(sk_memalloc_socks()))
+		objp = __ac_get_obj(cachep, ac, flags, force_refill);
+	else
+		objp = ac->entry[--ac->avail];
+
+	return objp;
+}
+
+static void *__ac_put_obj(struct kmem_cache *cachep, struct array_cache *ac,
 								void *objp)
 {
 	if (unlikely(pfmemalloc_active)) {
@@ -1073,6 +1088,15 @@ static void ac_put_obj(struct kmem_cache *cachep, struct array_cache *ac,
 			set_obj_pfmemalloc(&objp);
 	}
 
+	return objp;
+}
+
+static inline void ac_put_obj(struct kmem_cache *cachep, struct array_cache *ac,
+								void *objp)
+{
+	if (unlikely(sk_memalloc_socks()))
+		objp = __ac_put_obj(cachep, ac, objp);
+
 	ac->entry[ac->avail++] = objp;
 }
 
-- 
1.7.9.2

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related

* [PATCH 13/17] netvm: Set PF_MEMALLOC as appropriate during SKB processing
From: Mel Gorman @ 2012-06-20 11:44 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux-MM, Linux-Netdev, LKML, David Miller, Neil Brown,
	Peter Zijlstra, Mike Christie, Eric B Munson,
	Sebastian Andrzej Siewior, Mel Gorman
In-Reply-To: <1340192652-31658-1-git-send-email-mgorman@suse.de>

In order to make sure pfmemalloc packets receive all memory
needed to proceed, ensure processing of pfmemalloc SKBs happens
under PF_MEMALLOC. This is limited to a subset of protocols that
are expected to be used for writing to swap. Taps are not allowed to
use PF_MEMALLOC as these are expected to communicate with userspace
processes which could be paged out.

[a.p.zijlstra@chello.nl: Ideas taken from various patches]
[jslaby@suse.cz: Lock imbalance fix]
Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: David S. Miller <davem@davemloft.net>
---
 include/net/sock.h |    5 +++++
 net/core/dev.c     |   53 ++++++++++++++++++++++++++++++++++++++++++++++------
 net/core/sock.c    |   16 ++++++++++++++++
 3 files changed, 68 insertions(+), 6 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index e3fe462..772577f 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -743,8 +743,13 @@ static inline __must_check int sk_add_backlog(struct sock *sk, struct sk_buff *s
 	return 0;
 }
 
+extern int __sk_backlog_rcv(struct sock *sk, struct sk_buff *skb);
+
 static inline int sk_backlog_rcv(struct sock *sk, struct sk_buff *skb)
 {
+	if (sk_memalloc_socks() && skb_pfmemalloc(skb))
+		return __sk_backlog_rcv(sk, skb);
+
 	return sk->sk_backlog_rcv(sk, skb);
 }
 
diff --git a/net/core/dev.c b/net/core/dev.c
index cd09819..16f2f58 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3140,6 +3140,23 @@ void netdev_rx_handler_unregister(struct net_device *dev)
 }
 EXPORT_SYMBOL_GPL(netdev_rx_handler_unregister);
 
+/*
+ * Limit the use of PFMEMALLOC reserves to those protocols that implement
+ * the special handling of PFMEMALLOC skbs.
+ */
+static bool skb_pfmemalloc_protocol(struct sk_buff *skb)
+{
+	switch (skb->protocol) {
+	case __constant_htons(ETH_P_ARP):
+	case __constant_htons(ETH_P_IP):
+	case __constant_htons(ETH_P_IPV6):
+	case __constant_htons(ETH_P_8021Q):
+		return true;
+	default:
+		return false;
+	}
+}
+
 static int __netif_receive_skb(struct sk_buff *skb)
 {
 	struct packet_type *ptype, *pt_prev;
@@ -3149,14 +3166,27 @@ static int __netif_receive_skb(struct sk_buff *skb)
 	bool deliver_exact = false;
 	int ret = NET_RX_DROP;
 	__be16 type;
+	unsigned long pflags = current->flags;
 
 	net_timestamp_check(!netdev_tstamp_prequeue, skb);
 
 	trace_netif_receive_skb(skb);
 
+	/*
+	 * PFMEMALLOC skbs are special, they should
+	 * - be delivered to SOCK_MEMALLOC sockets only
+	 * - stay away from userspace
+	 * - have bounded memory usage
+	 *
+	 * Use PF_MEMALLOC as this saves us from propagating the allocation
+	 * context down to all allocation sites.
+	 */
+	if (sk_memalloc_socks() && skb_pfmemalloc(skb))
+		current->flags |= PF_MEMALLOC;
+
 	/* if we've gotten here through NAPI, check netpoll */
 	if (netpoll_receive_skb(skb))
-		return NET_RX_DROP;
+		goto out;
 
 	if (!skb->skb_iif)
 		skb->skb_iif = skb->dev->ifindex;
@@ -3177,7 +3207,7 @@ another_round:
 	if (skb->protocol == cpu_to_be16(ETH_P_8021Q)) {
 		skb = vlan_untag(skb);
 		if (unlikely(!skb))
-			goto out;
+			goto unlock;
 	}
 
 #ifdef CONFIG_NET_CLS_ACT
@@ -3187,6 +3217,9 @@ another_round:
 	}
 #endif
 
+	if (sk_memalloc_socks() && skb_pfmemalloc(skb))
+		goto skip_taps;
+
 	list_for_each_entry_rcu(ptype, &ptype_all, list) {
 		if (!ptype->dev || ptype->dev == skb->dev) {
 			if (pt_prev)
@@ -3195,13 +3228,18 @@ another_round:
 		}
 	}
 
+skip_taps:
 #ifdef CONFIG_NET_CLS_ACT
 	skb = handle_ing(skb, &pt_prev, &ret, orig_dev);
 	if (!skb)
-		goto out;
+		goto unlock;
 ncls:
 #endif
 
+	if (sk_memalloc_socks() && skb_pfmemalloc(skb)
+				&& !skb_pfmemalloc_protocol(skb))
+		goto drop;
+
 	rx_handler = rcu_dereference(skb->dev->rx_handler);
 	if (vlan_tx_tag_present(skb)) {
 		if (pt_prev) {
@@ -3211,7 +3249,7 @@ ncls:
 		if (vlan_do_receive(&skb, !rx_handler))
 			goto another_round;
 		else if (unlikely(!skb))
-			goto out;
+			goto unlock;
 	}
 
 	if (rx_handler) {
@@ -3221,7 +3259,7 @@ ncls:
 		}
 		switch (rx_handler(&skb)) {
 		case RX_HANDLER_CONSUMED:
-			goto out;
+			goto unlock;
 		case RX_HANDLER_ANOTHER:
 			goto another_round;
 		case RX_HANDLER_EXACT:
@@ -3251,6 +3289,7 @@ ncls:
 	if (pt_prev) {
 		ret = pt_prev->func(skb, skb->dev, pt_prev, orig_dev);
 	} else {
+drop:
 		atomic_long_inc(&skb->dev->rx_dropped);
 		kfree_skb(skb);
 		/* Jamal, now you will not able to escape explaining
@@ -3259,8 +3298,10 @@ ncls:
 		ret = NET_RX_DROP;
 	}
 
-out:
+unlock:
 	rcu_read_unlock();
+out:
+	tsk_restore_flags(current, pflags, PF_MEMALLOC);
 	return ret;
 }
 
diff --git a/net/core/sock.c b/net/core/sock.c
index 9f7240f..17f0813 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -298,6 +298,22 @@ void sk_clear_memalloc(struct sock *sk)
 }
 EXPORT_SYMBOL_GPL(sk_clear_memalloc);
 
+int __sk_backlog_rcv(struct sock *sk, struct sk_buff *skb)
+{
+	int ret;
+	unsigned long pflags = current->flags;
+
+	/* these should have been dropped before queueing */
+	BUG_ON(!sock_flag(sk, SOCK_MEMALLOC));
+
+	current->flags |= PF_MEMALLOC;
+	ret = sk->sk_backlog_rcv(sk, skb);
+	tsk_restore_flags(current, pflags, PF_MEMALLOC);
+
+	return ret;
+}
+EXPORT_SYMBOL(__sk_backlog_rcv);
+
 #if defined(CONFIG_CGROUPS)
 #if !defined(CONFIG_NET_CLS_CGROUP)
 int net_cls_subsys_id = -1;
-- 
1.7.9.2

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related

* [PATCH 12/17] netvm: Propagate page->pfmemalloc from skb_alloc_page to skb
From: Mel Gorman @ 2012-06-20 11:44 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux-MM, Linux-Netdev, LKML, David Miller, Neil Brown,
	Peter Zijlstra, Mike Christie, Eric B Munson,
	Sebastian Andrzej Siewior, Mel Gorman
In-Reply-To: <1340192652-31658-1-git-send-email-mgorman@suse.de>

The skb->pfmemalloc flag gets set to true iff during the slab
allocation of data in __alloc_skb that the the PFMEMALLOC reserves
were used. If page splitting is used, it is possible that pages will
be allocated from the PFMEMALLOC reserve without propagating this
information to the skb. This patch propagates page->pfmemalloc from
pages allocated for fragments to the skb.

It works by reintroducing and expanding the skb_alloc_page() API
to take an skb. If the page was allocated from pfmemalloc reserves,
it is automatically copied. If the driver allocates the page before
the skb, it should call skb_propagate_pfmemalloc() after the skb is
allocated to ensure the flag is copied properly.

Failure to do so is not critical. The resulting driver may perform
slower if it is used for swap-over-NBD or swap-over-NFS but it should
not result in failure.

[davem@davemloft.net: API rename and consistency]
Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: David S. Miller <davem@davemloft.net>
---
 drivers/net/ethernet/chelsio/cxgb4/sge.c          |    2 +-
 drivers/net/ethernet/chelsio/cxgb4vf/sge.c        |    2 +-
 drivers/net/ethernet/intel/igb/igb_main.c         |    2 +-
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c     |    2 +-
 drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c |    3 +-
 drivers/net/usb/cdc-phonet.c                      |    2 +-
 drivers/usb/gadget/f_phonet.c                     |    2 +-
 include/linux/skbuff.h                            |   55 +++++++++++++++++++++
 8 files changed, 63 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/chelsio/cxgb4/sge.c b/drivers/net/ethernet/chelsio/cxgb4/sge.c
index e111d97..496df78 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/sge.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/sge.c
@@ -528,7 +528,7 @@ static unsigned int refill_fl(struct adapter *adap, struct sge_fl *q, int n,
 #endif
 
 	while (n--) {
-		pg = alloc_page(gfp);
+		pg = __skb_alloc_page(gfp, NULL);
 		if (unlikely(!pg)) {
 			q->alloc_failed++;
 			break;
diff --git a/drivers/net/ethernet/chelsio/cxgb4vf/sge.c b/drivers/net/ethernet/chelsio/cxgb4vf/sge.c
index 0bd585b..dca0716 100644
--- a/drivers/net/ethernet/chelsio/cxgb4vf/sge.c
+++ b/drivers/net/ethernet/chelsio/cxgb4vf/sge.c
@@ -653,7 +653,7 @@ static unsigned int refill_fl(struct adapter *adapter, struct sge_fl *fl,
 
 alloc_small_pages:
 	while (n--) {
-		page = alloc_page(gfp | __GFP_NOWARN | __GFP_COLD);
+		page = __skb_alloc_page(gfp | __GFP_NOWARN, NULL);
 		if (unlikely(!page)) {
 			fl->alloc_failed++;
 			break;
diff --git a/drivers/net/ethernet/intel/igb/igb_main.c b/drivers/net/ethernet/intel/igb/igb_main.c
index dd3bfe8..603a702 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -6142,7 +6142,7 @@ static bool igb_alloc_mapped_page(struct igb_ring *rx_ring,
 		return true;
 
 	if (!page) {
-		page = alloc_page(GFP_ATOMIC | __GFP_COLD);
+		page = __skb_alloc_page(GFP_ATOMIC, bi->skb);
 		bi->page = page;
 		if (unlikely(!page)) {
 			rx_ring->rx_stats.alloc_failed++;
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index 17ad6a3..6f9d902 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -1148,7 +1148,7 @@ static bool ixgbe_alloc_mapped_page(struct ixgbe_ring *rx_ring,
 
 	/* alloc new page for storage */
 	if (likely(!page)) {
-		page = alloc_pages(GFP_ATOMIC | __GFP_COLD,
+		page = __skb_alloc_pages(GFP_ATOMIC, bi->skb,
 				   ixgbe_rx_pg_order(rx_ring));
 		if (unlikely(!page)) {
 			rx_ring->rx_stats.alloc_rx_page_failed++;
diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
index f69ec42..cd65fd8 100644
--- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
+++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
@@ -369,7 +369,7 @@ static void ixgbevf_alloc_rx_buffers(struct ixgbevf_adapter *adapter,
 		if (!bi->page_dma &&
 		    (adapter->flags & IXGBE_FLAG_RX_PS_ENABLED)) {
 			if (!bi->page) {
-				bi->page = alloc_page(GFP_ATOMIC | __GFP_COLD);
+				bi->page = __skb_alloc_page(GFP_ATOMIC, NULL);
 				if (!bi->page) {
 					adapter->alloc_rx_page_failed++;
 					goto no_buffers;
@@ -403,6 +403,7 @@ static void ixgbevf_alloc_rx_buffers(struct ixgbevf_adapter *adapter,
 			 */
 			skb_reserve(skb, NET_IP_ALIGN);
 
+			skb_propagate_pfmemalloc(bi->page_dma, skb);
 			bi->skb = skb;
 		}
 		if (!bi->dma) {
diff --git a/drivers/net/usb/cdc-phonet.c b/drivers/net/usb/cdc-phonet.c
index d848d4d..85e8bc5 100644
--- a/drivers/net/usb/cdc-phonet.c
+++ b/drivers/net/usb/cdc-phonet.c
@@ -130,7 +130,7 @@ static int rx_submit(struct usbpn_dev *pnd, struct urb *req, gfp_t gfp_flags)
 	struct page *page;
 	int err;
 
-	page = alloc_page(gfp_flags);
+	page = __skb_alloc_page(gfp_flags | __GFP_NOMEMALLOC, NULL);
 	if (!page)
 		return -ENOMEM;
 
diff --git a/drivers/usb/gadget/f_phonet.c b/drivers/usb/gadget/f_phonet.c
index 965a629..8ee9268 100644
--- a/drivers/usb/gadget/f_phonet.c
+++ b/drivers/usb/gadget/f_phonet.c
@@ -301,7 +301,7 @@ pn_rx_submit(struct f_phonet *fp, struct usb_request *req, gfp_t gfp_flags)
 	struct page *page;
 	int err;
 
-	page = alloc_page(gfp_flags);
+	page = __skb_alloc_page(gfp_flags | __GFP_NOMEMALLOC, NULL);
 	if (!page)
 		return -ENOMEM;
 
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index c421ee0..3ae2f60 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1761,6 +1761,61 @@ static inline struct sk_buff *netdev_alloc_skb_ip_align(struct net_device *dev,
 	return __netdev_alloc_skb_ip_align(dev, length, GFP_ATOMIC);
 }
 
+/*
+ *	__skb_alloc_page - allocate pages for ps-rx on a skb and preserve pfmemalloc data
+ *	@gfp_mask: alloc_pages_node mask. Set __GFP_NOMEMALLOC if not for network packet RX
+ *	@skb: skb to set pfmemalloc on if __GFP_MEMALLOC is used
+ *	@order: size of the allocation
+ *
+ * 	Allocate a new page.
+ *
+ * 	%NULL is returned if there is no free memory.
+*/
+static inline struct page *__skb_alloc_pages(gfp_t gfp_mask,
+					      struct sk_buff *skb,
+					      unsigned int order)
+{
+	struct page *page;
+
+	gfp_mask |= __GFP_COLD;
+
+	if (!(gfp_mask & __GFP_NOMEMALLOC))
+		gfp_mask |= __GFP_MEMALLOC;
+
+	page = alloc_pages_node(NUMA_NO_NODE, gfp_mask, order);
+	if (skb && page && page->pfmemalloc)
+		skb->pfmemalloc = true;
+
+	return page;
+}
+
+/**
+ *	__skb_alloc_page - allocate a page for ps-rx for a given skb and preserve pfmemalloc data
+ *	@gfp_mask: alloc_pages_node mask. Set __GFP_NOMEMALLOC if not for network packet RX
+ *	@skb: skb to set pfmemalloc on if __GFP_MEMALLOC is used
+ *
+ * 	Allocate a new page.
+ *
+ * 	%NULL is returned if there is no free memory.
+ */
+static inline struct page *__skb_alloc_page(gfp_t gfp_mask,
+					     struct sk_buff *skb)
+{
+	return __skb_alloc_pages(gfp_mask, skb, 0);
+}
+
+/**
+ *	skb_propagate_pfmemalloc - Propagate pfmemalloc if skb is allocated after RX page
+ *	@page: The page that was allocated from skb_alloc_page
+ *	@skb: The skb that may need pfmemalloc set
+ */
+static inline void skb_propagate_pfmemalloc(struct page *page,
+					     struct sk_buff *skb)
+{
+	if (page && page->pfmemalloc)
+		skb->pfmemalloc = true;
+}
+
 /**
  * skb_frag_page - retrieve the page refered to by a paged fragment
  * @frag: the paged fragment
-- 
1.7.9.2

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related

* [PATCH 11/17] netvm: Propagate page->pfmemalloc to skb
From: Mel Gorman @ 2012-06-20 11:44 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux-MM, Linux-Netdev, LKML, David Miller, Neil Brown,
	Peter Zijlstra, Mike Christie, Eric B Munson,
	Sebastian Andrzej Siewior, Mel Gorman
In-Reply-To: <1340192652-31658-1-git-send-email-mgorman@suse.de>

The skb->pfmemalloc flag gets set to true iff during the slab
allocation of data in __alloc_skb that the the PFMEMALLOC reserves
were used. If the packet is fragmented, it is possible that pages
will be allocated from the PFMEMALLOC reserve without propagating
this information to the skb. This patch propagates page->pfmemalloc
from pages allocated for fragments to the skb.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: David S. Miller <davem@davemloft.net>
---
 include/linux/skbuff.h |   11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 61c951f..c421ee0 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1250,6 +1250,17 @@ static inline void __skb_fill_page_desc(struct sk_buff *skb, int i,
 {
 	skb_frag_t *frag = &skb_shinfo(skb)->frags[i];
 
+	/*
+	 * Propagate page->pfmemalloc to the skb if we can. The problem is
+	 * that not all callers have unique ownership of the page. If
+	 * pfmemalloc is set, we check the mapping as a mapping implies
+	 * page->index is set (index and pfmemalloc share space).
+	 * If it's a valid mapping, we cannot use page->pfmemalloc but we
+	 * do not lose pfmemalloc information as the pages would not be
+	 * allocated using __GFP_MEMALLOC.
+	 */
+	if (page->pfmemalloc && !page->mapping)
+		skb->pfmemalloc	= true;
 	frag->page.p		  = page;
 	frag->page_offset	  = off;
 	skb_frag_size_set(frag, size);
-- 
1.7.9.2

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related

* [PATCH 10/17] netvm: Allow skb allocation to use PFMEMALLOC reserves
From: Mel Gorman @ 2012-06-20 11:44 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux-MM, Linux-Netdev, LKML, David Miller, Neil Brown,
	Peter Zijlstra, Mike Christie, Eric B Munson,
	Sebastian Andrzej Siewior, Mel Gorman
In-Reply-To: <1340192652-31658-1-git-send-email-mgorman@suse.de>

Change the skb allocation API to indicate RX usage and use this to fall
back to the PFMEMALLOC reserve when needed. SKBs allocated from the
reserve are tagged in skb->pfmemalloc. If an SKB is allocated from
the reserve and the socket is later found to be unrelated to page
reclaim, the packet is dropped so that the memory remains available
for page reclaim. Network protocols are expected to recover from this
packet loss.

[davem@davemloft.net: Use static branches, coding style corrections]
[a.p.zijlstra@chello.nl: Ideas taken from various patches]
Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: David S. Miller <davem@davemloft.net>
---
 include/linux/gfp.h    |    3 ++
 include/linux/skbuff.h |   14 +++++-
 include/net/sock.h     |    6 +++
 mm/internal.h          |    3 --
 net/core/filter.c      |    8 ++++
 net/core/skbuff.c      |  124 ++++++++++++++++++++++++++++++++++++++----------
 net/core/sock.c        |    5 ++
 7 files changed, 133 insertions(+), 30 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index cbd7400..4883f39 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -385,6 +385,9 @@ void drain_local_pages(void *dummy);
  */
 extern gfp_t gfp_allowed_mask;
 
+/* Returns true if the gfp_mask allows use of ALLOC_NO_WATERMARK */
+bool gfp_pfmemalloc_allowed(gfp_t gfp_mask);
+
 extern void pm_restrict_gfp_mask(void);
 extern void pm_restore_gfp_mask(void);
 
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index b534a1b..61c951f 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -465,6 +465,7 @@ struct sk_buff {
 #ifdef CONFIG_IPV6_NDISC_NODETYPE
 	__u8			ndisc_nodetype:2;
 #endif
+	__u8			pfmemalloc:1;
 	__u8			ooo_okay:1;
 	__u8			l4_rxhash:1;
 	__u8			wifi_acked_valid:1;
@@ -505,6 +506,15 @@ struct sk_buff {
 #include <linux/slab.h>
 
 
+#define SKB_ALLOC_FCLONE	0x01
+#define SKB_ALLOC_RX		0x02
+
+/* Returns true if the skb was allocated from PFMEMALLOC reserves */
+static inline bool skb_pfmemalloc(struct sk_buff *skb)
+{
+	return unlikely(skb->pfmemalloc);
+}
+
 /*
  * skb might have a dst pointer attached, refcounted or not.
  * _skb_refdst low order bit is set if refcount was _not_ taken
@@ -568,7 +578,7 @@ extern bool skb_try_coalesce(struct sk_buff *to, struct sk_buff *from,
 			     bool *fragstolen, int *delta_truesize);
 
 extern struct sk_buff *__alloc_skb(unsigned int size,
-				   gfp_t priority, int fclone, int node);
+				   gfp_t priority, int flags, int node);
 extern struct sk_buff *build_skb(void *data, unsigned int frag_size);
 static inline struct sk_buff *alloc_skb(unsigned int size,
 					gfp_t priority)
@@ -579,7 +589,7 @@ static inline struct sk_buff *alloc_skb(unsigned int size,
 static inline struct sk_buff *alloc_skb_fclone(unsigned int size,
 					       gfp_t priority)
 {
-	return __alloc_skb(size, priority, 1, NUMA_NO_NODE);
+	return __alloc_skb(size, priority, SKB_ALLOC_FCLONE, NUMA_NO_NODE);
 }
 
 extern void skb_recycle(struct sk_buff *skb);
diff --git a/include/net/sock.h b/include/net/sock.h
index 9f38b7d..e3fe462 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -657,6 +657,12 @@ static inline bool sock_flag(const struct sock *sk, enum sock_flags flag)
 	return test_bit(flag, &sk->sk_flags);
 }
 
+extern struct static_key memalloc_socks;
+static inline int sk_memalloc_socks(void)
+{
+	return static_key_false(&memalloc_socks);
+}
+
 static inline gfp_t sk_gfp_atomic(struct sock *sk, gfp_t gfp_mask)
 {
 	return GFP_ATOMIC | (sk->sk_allocation & __GFP_MEMALLOC);
diff --git a/mm/internal.h b/mm/internal.h
index 2d0dd52..2ba87fb 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -273,9 +273,6 @@ static inline struct page *mem_map_next(struct page *iter,
 #define __paginginit __init
 #endif
 
-/* Returns true if the gfp_mask allows use of ALLOC_NO_WATERMARK */
-bool gfp_pfmemalloc_allowed(gfp_t gfp_mask);
-
 /* Memory initialisation debug and verification */
 enum mminit_level {
 	MMINIT_WARNING,
diff --git a/net/core/filter.c b/net/core/filter.c
index d4ce2dc..907efd2 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -83,6 +83,14 @@ int sk_filter(struct sock *sk, struct sk_buff *skb)
 	int err;
 	struct sk_filter *filter;
 
+	/*
+	 * If the skb was allocated from pfmemalloc reserves, only
+	 * allow SOCK_MEMALLOC sockets to use it as this socket is
+	 * helping free memory
+	 */
+	if (skb_pfmemalloc(skb) && !sock_flag(sk, SOCK_MEMALLOC))
+		return -ENOMEM;
+
 	err = security_sock_rcv_skb(sk, skb);
 	if (err)
 		return err;
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 1d6ecc8..9a58dcc 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -145,6 +145,43 @@ static void skb_under_panic(struct sk_buff *skb, int sz, void *here)
 	BUG();
 }
 
+
+/*
+ * kmalloc_reserve is a wrapper around kmalloc_node_track_caller that tells
+ * the caller if emergency pfmemalloc reserves are being used. If it is and
+ * the socket is later found to be SOCK_MEMALLOC then PFMEMALLOC reserves
+ * may be used. Otherwise, the packet data may be discarded until enough
+ * memory is free
+ */
+#define kmalloc_reserve(size, gfp, node, pfmemalloc) \
+	 __kmalloc_reserve(size, gfp, node, _RET_IP_, pfmemalloc)
+void *__kmalloc_reserve(size_t size, gfp_t flags, int node, unsigned long ip,
+			 bool *pfmemalloc)
+{
+	void *obj;
+	bool ret_pfmemalloc = false;
+
+	/*
+	 * Try a regular allocation, when that fails and we're not entitled
+	 * to the reserves, fail.
+	 */
+	obj = kmalloc_node_track_caller(size,
+					flags | __GFP_NOMEMALLOC | __GFP_NOWARN,
+					node);
+	if (obj || !(gfp_pfmemalloc_allowed(flags)))
+		goto out;
+
+	/* Try again but now we are using pfmemalloc reserves */
+	ret_pfmemalloc = true;
+	obj = kmalloc_node_track_caller(size, flags, node);
+
+out:
+	if (pfmemalloc)
+		*pfmemalloc = ret_pfmemalloc;
+
+	return obj;
+}
+
 /* 	Allocate a new skbuff. We do this ourselves so we can fill in a few
  *	'private' fields and also do memory statistics to find all the
  *	[BEEP] leaks.
@@ -155,8 +192,10 @@ static void skb_under_panic(struct sk_buff *skb, int sz, void *here)
  *	__alloc_skb	-	allocate a network buffer
  *	@size: size to allocate
  *	@gfp_mask: allocation mask
- *	@fclone: allocate from fclone cache instead of head cache
- *		and allocate a cloned (child) skb
+ *	@flags: If SKB_ALLOC_FCLONE is set, allocate from fclone cache
+ *		instead of head cache and allocate a cloned (child) skb.
+ *		If SKB_ALLOC_RX is set, __GFP_MEMALLOC will be used for
+ *		allocations in case the data is required for writeback
  *	@node: numa node to allocate memory on
  *
  *	Allocate a new &sk_buff. The returned buffer has no headroom and a
@@ -167,14 +206,19 @@ static void skb_under_panic(struct sk_buff *skb, int sz, void *here)
  *	%GFP_ATOMIC.
  */
 struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
-			    int fclone, int node)
+			    int flags, int node)
 {
 	struct kmem_cache *cache;
 	struct skb_shared_info *shinfo;
 	struct sk_buff *skb;
 	u8 *data;
+	bool pfmemalloc;
 
-	cache = fclone ? skbuff_fclone_cache : skbuff_head_cache;
+	cache = (flags & SKB_ALLOC_FCLONE)
+		? skbuff_fclone_cache : skbuff_head_cache;
+
+	if (sk_memalloc_socks() && (flags & SKB_ALLOC_RX))
+		gfp_mask |= __GFP_MEMALLOC;
 
 	/* Get the HEAD */
 	skb = kmem_cache_alloc_node(cache, gfp_mask & ~__GFP_DMA, node);
@@ -189,7 +233,7 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
 	 */
 	size = SKB_DATA_ALIGN(size);
 	size += SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
-	data = kmalloc_node_track_caller(size, gfp_mask, node);
+	data = kmalloc_reserve(size, gfp_mask, node, &pfmemalloc);
 	if (!data)
 		goto nodata;
 	/* kmalloc(size) might give us more room than requested.
@@ -207,6 +251,7 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
 	memset(skb, 0, offsetof(struct sk_buff, tail));
 	/* Account for allocated memory : skb + skb->head */
 	skb->truesize = SKB_TRUESIZE(size);
+	skb->pfmemalloc = pfmemalloc;
 	atomic_set(&skb->users, 1);
 	skb->head = data;
 	skb->data = data;
@@ -222,7 +267,7 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
 	atomic_set(&shinfo->dataref, 1);
 	kmemcheck_annotate_variable(shinfo->destructor_arg);
 
-	if (fclone) {
+	if (flags & SKB_ALLOC_FCLONE) {
 		struct sk_buff *child = skb + 1;
 		atomic_t *fclone_ref = (atomic_t *) (child + 1);
 
@@ -232,6 +277,7 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
 		atomic_set(fclone_ref, 1);
 
 		child->fclone = SKB_FCLONE_UNAVAILABLE;
+		child->pfmemalloc = pfmemalloc;
 	}
 out:
 	return skb;
@@ -299,14 +345,7 @@ struct netdev_alloc_cache {
 };
 static DEFINE_PER_CPU(struct netdev_alloc_cache, netdev_alloc_cache);
 
-/**
- * netdev_alloc_frag - allocate a page fragment
- * @fragsz: fragment size
- *
- * Allocates a frag from a page for receive buffer.
- * Uses GFP_ATOMIC allocations.
- */
-void *netdev_alloc_frag(unsigned int fragsz)
+static void *__netdev_alloc_frag(unsigned int fragsz, gfp_t gfp_mask)
 {
 	struct netdev_alloc_cache *nc;
 	void *data = NULL;
@@ -316,7 +355,7 @@ void *netdev_alloc_frag(unsigned int fragsz)
 	nc = &__get_cpu_var(netdev_alloc_cache);
 	if (unlikely(!nc->page)) {
 refill:
-		nc->page = alloc_page(GFP_ATOMIC | __GFP_COLD);
+		nc->page = alloc_page(gfp_mask);
 		nc->offset = 0;
 	}
 	if (likely(nc->page)) {
@@ -331,6 +370,18 @@ refill:
 	local_irq_restore(flags);
 	return data;
 }
+
+/**
+ * netdev_alloc_frag - allocate a page fragment
+ * @fragsz: fragment size
+ *
+ * Allocates a frag from a page for receive buffer.
+ * Uses GFP_ATOMIC allocations.
+ */
+void *netdev_alloc_frag(unsigned int fragsz)
+{
+	return __netdev_alloc_frag(fragsz, GFP_ATOMIC | __GFP_COLD);
+}
 EXPORT_SYMBOL(netdev_alloc_frag);
 
 /**
@@ -354,7 +405,12 @@ struct sk_buff *__netdev_alloc_skb(struct net_device *dev,
 			      SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
 
 	if (fragsz <= PAGE_SIZE && !(gfp_mask & __GFP_WAIT)) {
-		void *data = netdev_alloc_frag(fragsz);
+		void *data;
+
+		if (sk_memalloc_socks())
+			gfp_mask |= __GFP_MEMALLOC;
+
+		data = __netdev_alloc_frag(fragsz, gfp_mask);
 
 		if (likely(data)) {
 			skb = build_skb(data, fragsz);
@@ -362,7 +418,8 @@ struct sk_buff *__netdev_alloc_skb(struct net_device *dev,
 				put_page(virt_to_head_page(data));
 		}
 	} else {
-		skb = __alloc_skb(length + NET_SKB_PAD, gfp_mask, 0, NUMA_NO_NODE);
+		skb = __alloc_skb(length + NET_SKB_PAD, gfp_mask,
+				  SKB_ALLOC_RX, NUMA_NO_NODE);
 	}
 	if (likely(skb)) {
 		skb_reserve(skb, NET_SKB_PAD);
@@ -644,6 +701,7 @@ static void __copy_skb_header(struct sk_buff *new, const struct sk_buff *old)
 #if IS_ENABLED(CONFIG_IP_VS)
 	new->ipvs_property	= old->ipvs_property;
 #endif
+	new->pfmemalloc		= old->pfmemalloc;
 	new->protocol		= old->protocol;
 	new->mark		= old->mark;
 	new->skb_iif		= old->skb_iif;
@@ -803,6 +861,9 @@ struct sk_buff *skb_clone(struct sk_buff *skb, gfp_t gfp_mask)
 		n->fclone = SKB_FCLONE_CLONE;
 		atomic_inc(fclone_ref);
 	} else {
+		if (skb_pfmemalloc(skb))
+			gfp_mask |= __GFP_MEMALLOC;
+
 		n = kmem_cache_alloc(skbuff_head_cache, gfp_mask);
 		if (!n)
 			return NULL;
@@ -839,6 +900,13 @@ static void copy_skb_header(struct sk_buff *new, const struct sk_buff *old)
 	skb_shinfo(new)->gso_type = skb_shinfo(old)->gso_type;
 }
 
+static inline int skb_alloc_rx_flag(const struct sk_buff *skb)
+{
+	if (skb_pfmemalloc((struct sk_buff *)skb))
+		return SKB_ALLOC_RX;
+	return 0;
+}
+
 /**
  *	skb_copy	-	create private copy of an sk_buff
  *	@skb: buffer to copy
@@ -860,7 +928,8 @@ struct sk_buff *skb_copy(const struct sk_buff *skb, gfp_t gfp_mask)
 {
 	int headerlen = skb_headroom(skb);
 	unsigned int size = skb_end_offset(skb) + skb->data_len;
-	struct sk_buff *n = alloc_skb(size, gfp_mask);
+	struct sk_buff *n = __alloc_skb(size, gfp_mask,
+					skb_alloc_rx_flag(skb), NUMA_NO_NODE);
 
 	if (!n)
 		return NULL;
@@ -895,7 +964,8 @@ EXPORT_SYMBOL(skb_copy);
 struct sk_buff *__pskb_copy(struct sk_buff *skb, int headroom, gfp_t gfp_mask)
 {
 	unsigned int size = skb_headlen(skb) + headroom;
-	struct sk_buff *n = alloc_skb(size, gfp_mask);
+	struct sk_buff *n = __alloc_skb(size, gfp_mask,
+					skb_alloc_rx_flag(skb), NUMA_NO_NODE);
 
 	if (!n)
 		goto out;
@@ -970,8 +1040,10 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail,
 
 	size = SKB_DATA_ALIGN(size);
 
-	data = kmalloc(size + SKB_DATA_ALIGN(sizeof(struct skb_shared_info)),
-		       gfp_mask);
+	if (skb_pfmemalloc(skb))
+		gfp_mask |= __GFP_MEMALLOC;
+	data = kmalloc_reserve(size + SKB_DATA_ALIGN(sizeof(struct skb_shared_info)),
+			       gfp_mask, NUMA_NO_NODE, NULL);
 	if (!data)
 		goto nodata;
 	size = SKB_WITH_OVERHEAD(ksize(data));
@@ -1085,8 +1157,9 @@ struct sk_buff *skb_copy_expand(const struct sk_buff *skb,
 	/*
 	 *	Allocate the copy buffer
 	 */
-	struct sk_buff *n = alloc_skb(newheadroom + skb->len + newtailroom,
-				      gfp_mask);
+	struct sk_buff *n = __alloc_skb(newheadroom + skb->len + newtailroom,
+					gfp_mask, skb_alloc_rx_flag(skb),
+					NUMA_NO_NODE);
 	int oldheadroom = skb_headroom(skb);
 	int head_copy_len, head_copy_off;
 	int off;
@@ -2767,8 +2840,9 @@ struct sk_buff *skb_segment(struct sk_buff *skb, netdev_features_t features)
 			skb_release_head_state(nskb);
 			__skb_push(nskb, doffset);
 		} else {
-			nskb = alloc_skb(hsize + doffset + headroom,
-					 GFP_ATOMIC);
+			nskb = __alloc_skb(hsize + doffset + headroom,
+					   GFP_ATOMIC, skb_alloc_rx_flag(skb),
+					   NUMA_NO_NODE);
 
 			if (unlikely(!nskb))
 				goto err;
diff --git a/net/core/sock.c b/net/core/sock.c
index d45d6fd..9f7240f 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -271,6 +271,9 @@ __u32 sysctl_rmem_default __read_mostly = SK_RMEM_MAX;
 int sysctl_optmem_max __read_mostly = sizeof(unsigned long)*(2*UIO_MAXIOV+512);
 EXPORT_SYMBOL(sysctl_optmem_max);
 
+struct static_key memalloc_socks = STATIC_KEY_INIT_FALSE;
+EXPORT_SYMBOL_GPL(memalloc_socks);
+
 /**
  * sk_set_memalloc - sets %SOCK_MEMALLOC
  * @sk: socket to set it on
@@ -283,6 +286,7 @@ void sk_set_memalloc(struct sock *sk)
 {
 	sock_set_flag(sk, SOCK_MEMALLOC);
 	sk->sk_allocation |= __GFP_MEMALLOC;
+	static_key_slow_inc(&memalloc_socks);
 }
 EXPORT_SYMBOL_GPL(sk_set_memalloc);
 
@@ -290,6 +294,7 @@ void sk_clear_memalloc(struct sock *sk)
 {
 	sock_reset_flag(sk, SOCK_MEMALLOC);
 	sk->sk_allocation &= ~__GFP_MEMALLOC;
+	static_key_slow_dec(&memalloc_socks);
 }
 EXPORT_SYMBOL_GPL(sk_clear_memalloc);
 
-- 
1.7.9.2

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related

* [PATCH 09/17] netvm: Allow the use of __GFP_MEMALLOC by specific sockets
From: Mel Gorman @ 2012-06-20 11:44 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux-MM, Linux-Netdev, LKML, David Miller, Neil Brown,
	Peter Zijlstra, Mike Christie, Eric B Munson,
	Sebastian Andrzej Siewior, Mel Gorman
In-Reply-To: <1340192652-31658-1-git-send-email-mgorman@suse.de>

Allow specific sockets to be tagged SOCK_MEMALLOC and use
__GFP_MEMALLOC for their allocations. These sockets will be able to go
below watermarks and allocate from the emergency reserve. Such sockets
are to be used to service the VM (iow. to swap over). They must be
handled kernel side, exposing such a socket to user-space is a bug.

There is a risk that the reserves be depleted so for now, the
administrator is responsible for increasing min_free_kbytes as
necessary to prevent deadlock for their workloads.

[a.p.zijlstra@chello.nl: Original patches]
Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: David S. Miller <davem@davemloft.net>
---
 include/net/sock.h |    5 ++++-
 net/core/sock.c    |   22 ++++++++++++++++++++++
 2 files changed, 26 insertions(+), 1 deletion(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index 5b47673..9f38b7d 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -619,6 +619,7 @@ enum sock_flags {
 	SOCK_RCVTSTAMPNS, /* %SO_TIMESTAMPNS setting */
 	SOCK_LOCALROUTE, /* route locally only, %SO_DONTROUTE setting */
 	SOCK_QUEUE_SHRUNK, /* write queue has been shrunk recently */
+	SOCK_MEMALLOC, /* VM depends on this socket for swapping */
 	SOCK_TIMESTAMPING_TX_HARDWARE,  /* %SOF_TIMESTAMPING_TX_HARDWARE */
 	SOCK_TIMESTAMPING_TX_SOFTWARE,  /* %SOF_TIMESTAMPING_TX_SOFTWARE */
 	SOCK_TIMESTAMPING_RX_HARDWARE,  /* %SOF_TIMESTAMPING_RX_HARDWARE */
@@ -658,7 +659,7 @@ static inline bool sock_flag(const struct sock *sk, enum sock_flags flag)
 
 static inline gfp_t sk_gfp_atomic(struct sock *sk, gfp_t gfp_mask)
 {
-	return GFP_ATOMIC;
+	return GFP_ATOMIC | (sk->sk_allocation & __GFP_MEMALLOC);
 }
 
 static inline void sk_acceptq_removed(struct sock *sk)
@@ -801,6 +802,8 @@ extern int sk_stream_wait_memory(struct sock *sk, long *timeo_p);
 extern void sk_stream_wait_close(struct sock *sk, long timeo_p);
 extern int sk_stream_error(struct sock *sk, int flags, int err);
 extern void sk_stream_kill_queues(struct sock *sk);
+extern void sk_set_memalloc(struct sock *sk);
+extern void sk_clear_memalloc(struct sock *sk);
 
 extern int sk_wait_data(struct sock *sk, long *timeo);
 
diff --git a/net/core/sock.c b/net/core/sock.c
index 9e5b71f..d45d6fd 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -271,6 +271,28 @@ __u32 sysctl_rmem_default __read_mostly = SK_RMEM_MAX;
 int sysctl_optmem_max __read_mostly = sizeof(unsigned long)*(2*UIO_MAXIOV+512);
 EXPORT_SYMBOL(sysctl_optmem_max);
 
+/**
+ * sk_set_memalloc - sets %SOCK_MEMALLOC
+ * @sk: socket to set it on
+ *
+ * Set %SOCK_MEMALLOC on a socket for access to emergency reserves.
+ * It's the responsibility of the admin to adjust min_free_kbytes
+ * to meet the requirements
+ */
+void sk_set_memalloc(struct sock *sk)
+{
+	sock_set_flag(sk, SOCK_MEMALLOC);
+	sk->sk_allocation |= __GFP_MEMALLOC;
+}
+EXPORT_SYMBOL_GPL(sk_set_memalloc);
+
+void sk_clear_memalloc(struct sock *sk)
+{
+	sock_reset_flag(sk, SOCK_MEMALLOC);
+	sk->sk_allocation &= ~__GFP_MEMALLOC;
+}
+EXPORT_SYMBOL_GPL(sk_clear_memalloc);
+
 #if defined(CONFIG_CGROUPS)
 #if !defined(CONFIG_NET_CLS_CGROUP)
 int net_cls_subsys_id = -1;
-- 
1.7.9.2

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox