Netdev List


* Re: [PATCH 01/17] mm: sl[au]b: Add knowledge of PFMEMALLOC reserve pages
From: Mel Gorman @ 2012-06-20 11:44 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Andrew Morton, Linux-MM, Linux-Netdev, LKML, David Miller,
	Neil Brown, Peter Zijlstra, Mike Christie, Eric B Munson
In-Reply-To: <20120620110512.GA4208@breakpoint.cc>

On Wed, Jun 20, 2012 at 01:05:13PM +0200, Sebastian Andrzej Siewior wrote:
> On Wed, Jun 20, 2012 at 10:35:04AM +0100, Mel Gorman wrote:
> > [a.p.zijlstra@chello.nl: Original implementation]
> > Signed-off-by: Mel Gorman <mgorman@suse.de>
> > ---
> > diff --git a/mm/slab.c b/mm/slab.c
> > index e901a36..b190cac 100644
> > --- a/mm/slab.c
> > +++ b/mm/slab.c
> > @@ -1851,6 +1984,7 @@ static void kmem_freepages(struct kmem_cache *cachep, void *addr)
> >  	while (i--) {
> >  		BUG_ON(!PageSlab(page));
> >  		__ClearPageSlab(page);
> > +		__ClearPageSlabPfmemalloc(page);
> >  		page++;
> >  	}
> >  	if (current->reclaim_state)
> > @@ -3120,16 +3254,19 @@ bad:
> > diff --git a/mm/slub.c b/mm/slub.c
> > index 8c691fa..43738c9 100644
> > --- a/mm/slub.c
> > +++ b/mm/slub.c
> > @@ -1414,6 +1418,7 @@ static void __free_slab(struct kmem_cache *s, struct page *page)
> >  		-pages);
> >  
> >  	__ClearPageSlab(page);
> > +	__ClearPageSlabPfmemalloc(page);
> >  	reset_page_mapcount(page);
> >  	if (current->reclaim_state)
> >  		current->reclaim_state->reclaimed_slab += pages;
> 
> So you mention a change here in v11's changelog but I don't see it.
> 

Because I'm an idiot and sent out the wrong branch, and then was rude
enough not to include you on the CC. I have resent the series, correctly
this time I hope. Sorry about that.

-- 
Mel Gorman
SUSE Labs


* Re: [PATCH 08/17] net: Do not coalesce skbs belonging to PFMEMALLOC sockets
From: Eric Dumazet @ 2012-06-20 12:04 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Linux-MM, Linux-Netdev, LKML, David Miller,
	Neil Brown, Peter Zijlstra, Mike Christie, Eric B Munson,
	Sebastian Andrzej Siewior
In-Reply-To: <1340192652-31658-9-git-send-email-mgorman@suse.de>

On Wed, 2012-06-20 at 12:44 +0100, Mel Gorman wrote:
> Commit [bad43ca8: net: introduce skb_try_coalesce()] introduced an
> optimisation to coalesce skbs to reduce memory usage and cache line
> misses. In the case where the socket is used for swapping this can result
> in a warning like the following.
> 
> [  110.476565] nbd0: page allocation failure: order:0, mode:0x20
> [  110.476568] Pid: 2714, comm: nbd0 Not tainted 3.5.0-rc2-swapnbd-v12r2-slab #3
> [  110.476569] Call Trace:
> [  110.476573]  [<ffffffff811042d3>] warn_alloc_failed+0xf3/0x160
> [  110.476578]  [<ffffffff81107c92>] __alloc_pages_nodemask+0x6e2/0x930
> [  110.476582]  [<ffffffff81107c92>] ?  __alloc_pages_nodemask+0x6e2/0x930
> [  110.476588]  [<ffffffff81149f09>] kmem_getpages+0x59/0x1a0
> [  110.476593]  [<ffffffff8114ae5b>] fallback_alloc+0x17b/0x260
> [  110.476597]  [<ffffffff8114ac26>] ____cache_alloc_node+0x96/0x150
> [  110.476602]  [<ffffffff8114a458>] kmem_cache_alloc_node+0x78/0x1b0
> [  110.476607]  [<ffffffff8136c127>] __alloc_skb+0x57/0x1e0
> [  110.476612]  [<ffffffff813b9f81>] sk_stream_alloc_skb+0x41/0x120
> [  110.476617]  [<ffffffff813c8c72>] tcp_fragment+0x62/0x370
> [  110.476622]  [<ffffffff813c8fb9>] tso_fragment+0x39/0x180
> [  110.476628]  [<ffffffff813ca2a9>] tcp_write_xmit+0x1a9/0x3f0
> [  110.476634]  [<ffffffff813ca556>] __tcp_push_pending_frames+0x26/0xd0
> [  110.476639]  [<ffffffff813c61f5>] tcp_rcv_established+0x385/0x760
> [  110.476644]  [<ffffffff813ce671>] tcp_v4_do_rcv+0x111/0x1f0
> [  110.476648]  [<ffffffff81367259>] release_sock+0x99/0x140
> [  110.476652]  [<ffffffff813ba82b>] tcp_sendmsg+0x7cb/0xe80
> [  110.476657]  [<ffffffff813df9b4>] inet_sendmsg+0x64/0xb0
> [  110.476661]  [<ffffffff811f0a00>] ? security_socket_sendmsg+0x10/0x20
> [  110.476666]  [<ffffffff81361dd8>] sock_sendmsg+0xf8/0x130
> [  110.476672]  [<ffffffff8124ba4c>] ? cpumask_next_and+0x3c/0x50
> [  110.476677]  [<ffffffff8107b053>] ? update_sd_lb_stats+0x123/0x620
> [  110.476683]  [<ffffffff8105164f>] ? recalc_sigpending+0x1f/0x70
> [  110.476688]  [<ffffffff81051e17>] ? __set_task_blocked+0x37/0x80
> [  110.476693]  [<ffffffff81361e51>] kernel_sendmsg+0x41/0x60
> [  110.476698]  [<ffffffffa048d417>] sock_xmit+0xb7/0x300 [nbd]
> [  110.476703]  [<ffffffff8107bad7>] ? load_balance+0xd7/0x490
> [  110.476710]  [<ffffffffa048d7ac>] nbd_send_req+0x14c/0x270 [nbd]
> [  110.476716]  [<ffffffffa048e21e>] nbd_handle_req+0x9e/0x180 [nbd]
> [  110.476721]  [<ffffffffa048e4f2>] nbd_thread+0xb2/0x150 [nbd]
> [  110.476725]  [<ffffffff81062580>] ? wake_up_bit+0x40/0x40
> [  110.476730]  [<ffffffffa048e440>] ? do_nbd_request+0x140/0x140 [nbd]
> [  110.476733]  [<ffffffff81061d7e>] kthread+0x9e/0xb0
> [  110.476739]  [<ffffffff81439d64>] kernel_thread_helper+0x4/0x10
> [  110.476743]  [<ffffffff81061ce0>] ? flush_kthread_worker+0xc0/0xc0
> [  110.476748]  [<ffffffff81439d60>] ? gs_change+0x13/0x13
> 
> There were two ways this could be addressed. The first would be to
> teach __tcp_push_pending_frames() to use __GFP_MEMALLOC if the socket
> has SOCK_MEMALLOC set. This potentially defers the time of allocation
> to a point where we are applying greater pressure on PFMEMALLOC reserves
> which is undesirable.  The second approach is to disable skb coalescing
> for SOCK_MEMALLOC sockets and process them immediately. This patch takes
> the second approach.
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>
> ---
>  net/core/skbuff.c    |    7 +++++++
>  net/ipv4/tcp_input.c |    8 ++++++++
>  2 files changed, 15 insertions(+)
> 
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index d78671e..1d6ecc8 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -3370,6 +3370,13 @@ bool skb_try_coalesce(struct sk_buff *to, struct sk_buff *from,
>  
>  	*fragstolen = false;
>  
> +	/*
> +	 * Avoid coalescing of SOCK_MEMALLOC socks as we do not want to defer
> +	 * RX/TX to a time when pfmemalloc reserves are under greater pressure
> +	 */
> +	if (sk_memalloc_socks() && sock_flag(to->sk, SOCK_MEMALLOC))
> +		return false;
> +
>  	if (skb_cloned(to))
>  		return false;
>  
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index b224eb8..448f130 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -4553,6 +4553,14 @@ static bool tcp_try_coalesce(struct sock *sk,
>  
>  	*fragstolen = false;
>  
> +	/*
> +	 * Do not attempt merging if the socket is used by the VM for swapping.
> +	 * Attempts to defer can result in allocation failures during RX when
> +	 * an attempt is made to push pending frames
> +	 */
> +	if (sk_memalloc_socks() && sock_flag(sk, SOCK_MEMALLOC))
> +		return false;
> +
>  	if (tcp_hdr(from)->fin)
>  		return false;
>  


This makes absolutely no sense to me.

This patch changes the input path, while your stack trace is about the
output path and a packet being fragmented.

If you really want to avoid this kind of thing, you'll need to disable
TSO/GSO on the socket.



--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>


* Re: [PATCH 07/17] net: Introduce sk_gfp_atomic() to allow addition of GFP flags depending on the individual socket
From: Eric Dumazet @ 2012-06-20 12:06 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Linux-MM, Linux-Netdev, LKML, David Miller,
	Neil Brown, Peter Zijlstra, Mike Christie, Eric B Munson,
	Sebastian Andrzej Siewior
In-Reply-To: <1340192652-31658-8-git-send-email-mgorman@suse.de>

On Wed, 2012-06-20 at 12:44 +0100, Mel Gorman wrote:
> Introduce sk_gfp_atomic(), this function allows to inject sock specific
> flags to each sock related allocation. It is only used on allocation
> paths that may be required for writing pages back to network storage.
> 
> [davem@davemloft.net: Use sk_gfp_atomic only when necessary]
> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> Signed-off-by: Mel Gorman <mgorman@suse.de>
> Acked-by: David S. Miller <davem@davemloft.net>
> ---
>  include/net/sock.h    |    5 +++++
>  net/ipv4/tcp_output.c |    9 +++++----
>  net/ipv6/tcp_ipv6.c   |    8 +++++---
>  3 files changed, 15 insertions(+), 7 deletions(-)
> 

> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index 803cbfe..440b47e 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -2461,7 +2461,8 @@ struct sk_buff *tcp_make_synack(struct sock *sk, struct dst_entry *dst,
>  
>  	if (cvp != NULL && cvp->s_data_constant && cvp->s_data_desired)
>  		s_data_desired = cvp->s_data_desired;
> -	skb = sock_wmalloc(sk, MAX_TCP_HEADER + 15 + s_data_desired, 1, GFP_ATOMIC);
> +	skb = sock_wmalloc(sk, MAX_TCP_HEADER + 15 + s_data_desired, 1,
> +			   sk_gfp_atomic(sk, GFP_ATOMIC));
>  	if (skb == NULL)
>  

This bit no longer applies on net-next, sock_wmalloc() was changed to a
mere alloc_skb()
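
For reference, a minimal userspace sketch of what a helper like
sk_gfp_atomic() does: start from the caller's allocation mask and add a
reserve-access flag only when the socket is marked as backing swap. The
include/net/sock.h hunk is not quoted in this thread, so the shape below is
an assumption; the demo_* names and the flag values are hypothetical and are
not the kernel's.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag values, for illustration only; the kernel's differ. */
#define DEMO_GFP_ATOMIC   0x20u
#define DEMO_GFP_MEMALLOC 0x2000u

struct demo_sock {
	bool memalloc;	/* stands in for sock_flag(sk, SOCK_MEMALLOC) */
};

/*
 * Return the caller's mask unchanged for ordinary sockets, and the mask
 * plus the reserve flag for sockets used for swapping.
 */
static unsigned int demo_sk_gfp_atomic(const struct demo_sock *sk,
				       unsigned int gfp_mask)
{
	assert(sk != NULL);
	return sk->memalloc ? (gfp_mask | DEMO_GFP_MEMALLOC) : gfp_mask;
}
```

Keeping the decision in one helper means each call site only has to pass its
existing mask through it, which is why the patch touches just the allocation
paths needed for writeback.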



* Re: [PATCH v2] ipv4: Early TCP socket demux.
From: Eric Dumazet @ 2012-06-20 12:38 UTC (permalink / raw)
  To: David Miller; +Cc: shemminger, netdev
In-Reply-To: <20120620.031543.1511134879638711616.davem@davemloft.net>

On Wed, 2012-06-20 at 03:15 -0700, David Miller wrote:
> 			       dev->ifindex);
> +		if (sk) {
> +			skb_orphan(skb);
> +			skb->sk = sk;
> +			skb->destructor = sock_edemux;
> +			if (!skb_dst(skb) &&
> +			    sk->sk_state != TCP_TIME_WAIT) {
> +				struct dst_entry *dst = sk->sk_rx_dst;
> +				if (dst)
> +					dst = dst_check(dst, 0);
> +				if (dst) {
> +					struct rtable *rt = (struct rtable *) dst;
> +
> +					if (rt->rt_iif == dev->ifindex)
> +						skb_dst_set_noref(skb, dst);
> +				}
> +			}
> +		}
> +	}
> +	return pp;

I am trying to convince myself it's safe.

skb_dst_set_noref() assumes the caller holds rcu_read_lock() until we use the
skb dst.

And dev_gro_receive() releases RCU...

A problem could happen if sk->sk_rx_dst is freed while some packets are
still in the napi or socket backlog (can happen with some network
reordering).

1) Socket backlog must be flushed before sk->sk_rx_dst freeing

2) Even if we move rcu_read_lock() in net_rx_action(), we need some
napi_gro_forcedstrefs() in case we softnet_break

Or maybe just use napi_gro_flush() ?

diff --git a/net/core/dev.c b/net/core/dev.c
index 57c4f9b..c0f71a0 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3861,6 +3861,9 @@ static void net_rx_action(struct softirq_action *h)
 
 		budget -= work;
 
+		if (work == weight)
+			napi_gro_flush(n);
+
 		local_irq_disable();
 
 		/* Drivers must not modify the NAPI state if they


* RE: [PATCH] netxen : Error return off by one for XG port.
From: Rajesh Borundia @ 2012-06-20 13:06 UTC (permalink / raw)
  To: santosh nayak, Sony Chacko; +Cc: netdev, kernel-janitors@vger.kernel.org
In-Reply-To: <1340189578-18308-1-git-send-email-santoshprasadnayak@gmail.com>

______________________________________
From: santosh nayak [santoshprasadnayak@gmail.com]
Sent: Wednesday, June 20, 2012 4:22 PM
To: Sony Chacko; Rajesh Borundia
Cc: netdev; kernel-janitors@vger.kernel.org; Santosh Nayak
Subject: [PATCH] netxen : Error return off by one for XG port.

From: Santosh Nayak <santoshprasadnayak@gmail.com>

There are NETXEN_NIU_MAX_XG_PORTS ports.
Port indexing starts from zero.
Hence we should also return an error for 'port == NETXEN_NIU_MAX_XG_PORTS'.

Signed-off-by: Santosh Nayak <santoshprasadnayak@gmail.com>
---
 .../ethernet/qlogic/netxen/netxen_nic_ethtool.c    |    4 ++--
 drivers/net/ethernet/qlogic/netxen/netxen_nic_hw.c |    4 ++--
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/qlogic/netxen/netxen_nic_ethtool.c b/drivers/net/ethernet/qlogic/netxen/netxen_nic_ethtool.c
index d4f179f..9103e3e 100644
--- a/drivers/net/ethernet/qlogic/netxen/netxen_nic_ethtool.c
+++ b/drivers/net/ethernet/qlogic/netxen/netxen_nic_ethtool.c
@@ -511,7 +511,7 @@ netxen_nic_get_pauseparam(struct net_device *dev,
                                break;
                }
        } else if (adapter->ahw.port_type == NETXEN_NIC_XGBE) {
-               if ((port < 0) || (port > NETXEN_NIU_MAX_XG_PORTS))
+               if ((port < 0) || (port >= NETXEN_NIU_MAX_XG_PORTS))
                        return;
                pause->rx_pause = 1;
                val = NXRD32(adapter, NETXEN_NIU_XG_PAUSE_CTL);
@@ -577,7 +577,7 @@ netxen_nic_set_pauseparam(struct net_device *dev,
                }
                NXWR32(adapter, NETXEN_NIU_GB_PAUSE_CTL, val);
        } else if (adapter->ahw.port_type == NETXEN_NIC_XGBE) {
-               if ((port < 0) || (port > NETXEN_NIU_MAX_XG_PORTS))
+               if ((port < 0) || (port >= NETXEN_NIU_MAX_XG_PORTS))
                        return -EIO;
                val = NXRD32(adapter, NETXEN_NIU_XG_PAUSE_CTL);
                if (port == 0) {
diff --git a/drivers/net/ethernet/qlogic/netxen/netxen_nic_hw.c b/drivers/net/ethernet/qlogic/netxen/netxen_nic_hw.c
index de96a94..946160f 100644
--- a/drivers/net/ethernet/qlogic/netxen/netxen_nic_hw.c
+++ b/drivers/net/ethernet/qlogic/netxen/netxen_nic_hw.c
@@ -365,7 +365,7 @@ static int netxen_niu_disable_xg_port(struct netxen_adapter *adapter)
        if (NX_IS_REVISION_P3(adapter->ahw.revision_id))
                return 0;

-       if (port > NETXEN_NIU_MAX_XG_PORTS)
+       if (port >= NETXEN_NIU_MAX_XG_PORTS)
                return -EINVAL;

        mac_cfg = 0;
@@ -392,7 +392,7 @@ static int netxen_p2_nic_set_promisc(struct netxen_adapter *adapter, u32 mode)
        u32 port = adapter->physical_port;
        u16 board_type = adapter->ahw.board_type;

-       if (port > NETXEN_NIU_MAX_XG_PORTS)
+       if (port >= NETXEN_NIU_MAX_XG_PORTS)
                return -EINVAL;

        mac_cfg = NXRD32(adapter, NETXEN_NIU_XGE_CONFIG_0 + (0x10000 * port));
--
1.7.4.4

Looks ok to me.

Rajesh
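
The off-by-one can be seen in a small standalone sketch. With N ports indexed
from zero, the valid indices are 0..N-1, so the check has to reject
'port == N' as well. DEMO_MAX_XG_PORTS and the demo_* functions are
hypothetical stand-ins, not the driver's code, and the value 2 is only an
example.

```c
#include <assert.h>
#include <errno.h>

#define DEMO_MAX_XG_PORTS 2	/* example count: valid indices are 0 and 1 */

/* Old check: lets port == DEMO_MAX_XG_PORTS through. */
static int demo_check_port_old(int port)
{
	if (port < 0 || port > DEMO_MAX_XG_PORTS)
		return -EINVAL;
	return 0;
}

/* Fixed check: rejects every index outside 0..DEMO_MAX_XG_PORTS-1. */
static int demo_check_port_new(int port)
{
	if (port < 0 || port >= DEMO_MAX_XG_PORTS)
		return -EINVAL;
	return 0;
}

/* Shows where the two checks disagree: only at the boundary value. */
static void demo_compare_checks(void)
{
	assert(demo_check_port_old(DEMO_MAX_XG_PORTS) == 0);
	assert(demo_check_port_new(DEMO_MAX_XG_PORTS) == -EINVAL);
	assert(demo_check_port_old(1) == demo_check_port_new(1));
}
```

In the driver the accepted out-of-range value would go on to be used as a
register offset multiplier, which is why the boundary matters.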


* Re: [PATCH] net: sh_eth: fix the rxdesc pointer when rx descriptor empty happens
From: Guennadi Liakhovetski @ 2012-06-20 13:10 UTC (permalink / raw)
  To: Shimoda, Yoshihiro; +Cc: netdev, SH-Linux
In-Reply-To: <4FC491EB.4040002@renesas.com>

Hello Shimoda-san

On Tue, 29 May 2012, Shimoda, Yoshihiro wrote:

> When Receive Descriptor Empty happens, the rxdesc pointer of the driver
> and the actual next descriptor of the controller may not match.
> This patch fixes it.

Unfortunately, this patch breaks networking on ecovec (sh7724). Booting 
with dhcp and NFS-root progresses very slowly with lots of "nfs: server 
not responding / Ok" messages and never completes. Reverting it in current 
Linus' tree fixes the problem.

Thanks
Guennadi

> 
> Signed-off-by: Yoshihiro Shimoda <yoshihiro.shimoda.uh@renesas.com>
> ---
>  drivers/net/ethernet/renesas/sh_eth.c |    8 +++++---
>  1 files changed, 5 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/net/ethernet/renesas/sh_eth.c b/drivers/net/ethernet/renesas/sh_eth.c
> index be3c221..667169b 100644
> --- a/drivers/net/ethernet/renesas/sh_eth.c
> +++ b/drivers/net/ethernet/renesas/sh_eth.c
> @@ -1101,8 +1101,12 @@ static int sh_eth_rx(struct net_device *ndev)
> 
>  	/* Restart Rx engine if stopped. */
>  	/* If we don't need to check status, don't. -KDU */
> -	if (!(sh_eth_read(ndev, EDRRR) & EDRRR_R))
> +	if (!(sh_eth_read(ndev, EDRRR) & EDRRR_R)) {
> +		/* fix the values for the next receiving */
> +		mdp->cur_rx = mdp->dirty_rx = (sh_eth_read(ndev, RDFAR) -
> +					       sh_eth_read(ndev, RDLAR)) >> 4;
>  		sh_eth_write(ndev, EDRRR_R, EDRRR);
> +	}
> 
>  	return 0;
>  }
> @@ -1199,8 +1203,6 @@ static void sh_eth_error(struct net_device *ndev, int intr_status)
>  		/* Receive Descriptor Empty int */
>  		ndev->stats.rx_over_errors++;
> 
> -		if (sh_eth_read(ndev, EDRRR) ^ EDRRR_R)
> -			sh_eth_write(ndev, EDRRR_R, EDRRR);
>  		if (netif_msg_rx_err(mdp))
>  			dev_err(&ndev->dev, "Receive Descriptor Empty\n");
>  	}
> -- 
> 1.7.1
> --
> To unsubscribe from this list: send the line "unsubscribe linux-sh" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

---
Guennadi Liakhovetski, Ph.D.
Freelance Open-Source Software Developer
http://www.open-technology.de/
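
A small sketch of the index arithmetic in the quoted patch, assuming (as the
">> 4" suggests) 16-byte descriptors, with RDLAR holding the start address of
the descriptor ring and RDFAR the address the controller will fetch next.
Those register meanings are an assumption here, and demo_desc_index() is a
hypothetical helper, not driver code.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_DESC_SIZE 16u	/* assumed descriptor size, implied by ">> 4" */

/* Index of the descriptor at address 'cur' in a ring starting at 'base'. */
static uint32_t demo_desc_index(uint32_t base, uint32_t cur)
{
	assert(cur >= base);
	assert((cur - base) % DEMO_DESC_SIZE == 0);
	return (cur - base) >> 4;	/* same as dividing by DEMO_DESC_SIZE */
}
```

Note that the patch stores this raw index into both cur_rx and dirty_rx, so
any code that relies on those counters running freely past the ring size
would see them jump.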


* Re: [net-next:master 257/266] drivers/net/team/team_mode_loadbalance.c:99:30: sparse: incompatible types in comparison expression (different address spaces)
From: Eric Dumazet @ 2012-06-20 13:11 UTC (permalink / raw)
  To: paulmck; +Cc: Fengguang Wu, Jiri Pirko, LKML, netdev
In-Reply-To: <20120620124913.GF2432@linux.vnet.ibm.com>

On Wed, 2012-06-20 at 05:49 -0700, Paul E. McKenney wrote:
> On Wed, Jun 20, 2012 at 02:50:55PM +0800, Fengguang Wu wrote:
> > [CC Paul, the RCU maintainer]
> > 
> > On Wed, Jun 20, 2012 at 08:36:07AM +0200, Jiri Pirko wrote:
> > > Wed, Jun 20, 2012 at 06:27:43AM CEST, wfg@linux.intel.com wrote:
> > > >Hi Jiri,
> > > >
> > > >There are new sparse warnings show up in
> > > >
> > > >tree:   git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git master
> > > >head:   677a3d60fb3153f786a0d28fcf0287670e7bd3c2
> > > >commit: ab8250d70063f77929fc404c02390a1f64d66416 [257/266] team: lb: introduce infrastructure for userspace driven tx loadbalancing
> > > >
> > > >All sparse warnings:
> > > >
> > > >drivers/net/team/team_mode_loadbalance.c:99:30: sparse: incompatible types in comparison expression (different address spaces)
> > > >
> > > >drivers/net/team/team_mode_loadbalance.c:99:
> > > >    96			struct lb_port_mapping *pm;
> > > >    97	
> > > >    98			pm = &lb_priv->ex->tx_hash_to_port_mapping[i];
> > > >  > 99			if (pm->port == port) {
> > > 
> > > This looks like your checker does not like
> > > (struct team_port __rcu *) == (struct team_port *)
> > > But I wonder why (or how should I fix that)
> 
> Because you said that it was an RCU-protected pointer, but then did
> not use an RCU primitive to access it.  In this case, where you are
> just using the value but not dereferencing it, you can use
> rcu_access_pointer().

Yes, and Jiri, please replace

rcu_assign_pointer(pm->port, NULL);

with

RCU_INIT_POINTER(pm->port, NULL);


* Re: [PATCH] net: Update netdev_alloc_frag to work more efficiently with TCP and GRO
From: Eric Dumazet @ 2012-06-20 13:21 UTC (permalink / raw)
  To: Alexander Duyck; +Cc: netdev, davem, jeffrey.t.kirsher
In-Reply-To: <1340180223.4604.828.camel@edumazet-glaptop>

On Wed, 2012-06-20 at 10:17 +0200, Eric Dumazet wrote:

> Strange, I ran the benchmarks again with order-2 allocations and got good
> results this time, but with latest net-next, maybe things have changed
> since last time I did this.
> 
> (netdev_alloc_frag(), get_page_from_freelist() and put_page() less
> prevalent in perf results)
> 

In fact, since SLUB uses order-3 for kmalloc-2048, I felt lucky to try
this as well, and results are really good, on ixgbe at least.

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 5b21522..ffd2cba 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -299,6 +299,9 @@ struct netdev_alloc_cache {
 };
 static DEFINE_PER_CPU(struct netdev_alloc_cache, netdev_alloc_cache);
 
+#define MAX_NETDEV_FRAGSIZE	max_t(unsigned int, PAGE_SIZE, 32768)
+#define NETDEV_FRAG_ORDER	get_order(MAX_NETDEV_FRAGSIZE)
+
 /**
  * netdev_alloc_frag - allocate a page fragment
  * @fragsz: fragment size
@@ -316,11 +319,13 @@ void *netdev_alloc_frag(unsigned int fragsz)
 	nc = &__get_cpu_var(netdev_alloc_cache);
 	if (unlikely(!nc->page)) {
 refill:
-		nc->page = alloc_page(GFP_ATOMIC | __GFP_COLD);
+		nc->page = alloc_pages(GFP_ATOMIC | __GFP_COLD |
+				       (NETDEV_FRAG_ORDER ? __GFP_COMP : 0),
+				       NETDEV_FRAG_ORDER);
 		nc->offset = 0;
 	}
 	if (likely(nc->page)) {
-		if (nc->offset + fragsz > PAGE_SIZE) {
+		if (nc->offset + fragsz > MAX_NETDEV_FRAGSIZE) {
 			put_page(nc->page);
 			goto refill;
 		}
@@ -353,7 +358,7 @@ struct sk_buff *__netdev_alloc_skb(struct net_device *dev,
 	unsigned int fragsz = SKB_DATA_ALIGN(length + NET_SKB_PAD) +
 			      SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
 
-	if (fragsz <= PAGE_SIZE && !(gfp_mask & __GFP_WAIT)) {
+	if (fragsz <= MAX_NETDEV_FRAGSIZE && !(gfp_mask & __GFP_WAIT)) {
 		void *data = netdev_alloc_frag(fragsz);
 
 		if (likely(data)) {
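
To see why a larger backing allocation helps, here is a userspace sketch that
models only the bookkeeping of a fragment cache: carve fragments from the
current buffer and take a fresh one when the next fragment does not fit. It
allocates no real memory and ignores page refcounting. The demo_* names and
the 2048-byte fragment size are hypothetical; real fragment sizes depend on
the driver and on skb_shared_info overhead.

```c
#include <assert.h>
#include <stddef.h>

struct demo_frag_cache {
	size_t bufsize;		/* size of each backing allocation */
	size_t offset;		/* next free byte in the current allocation */
	unsigned int refills;	/* backing allocations taken so far */
};

/* Carve fragsz bytes; returns the fragment's offset within its buffer. */
static size_t demo_alloc_frag(struct demo_frag_cache *nc, size_t fragsz)
{
	size_t off;

	assert(fragsz > 0 && fragsz <= nc->bufsize);

	/* First use, or the fragment does not fit: take a fresh buffer. */
	if (nc->refills == 0 || nc->offset + fragsz > nc->bufsize) {
		nc->refills++;
		nc->offset = 0;
	}
	off = nc->offset;
	nc->offset += fragsz;
	return off;
}

/* Count the backing allocations needed for n fragments of fragsz bytes. */
static unsigned int demo_refills_for(size_t bufsize, size_t fragsz,
				     unsigned int n)
{
	struct demo_frag_cache nc = { .bufsize = bufsize };

	while (n--)
		demo_alloc_frag(&nc, fragsz);
	return nc.refills;
}
```

With these example numbers, 32 fragments of 2048 bytes need 16 backing
allocations from 4096-byte buffers but only 2 from 32768-byte buffers, which
is the kind of reduction in page allocator calls the patch is after.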


* Re: [net-next:master 257/266] drivers/net/team/team_mode_loadbalance.c:99:30: sparse: incompatible types in comparison expression (different address spaces)
From: Jiri Pirko @ 2012-06-20 13:23 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: paulmck, Fengguang Wu, LKML, netdev
In-Reply-To: <1340197909.4604.956.camel@edumazet-glaptop>

Wed, Jun 20, 2012 at 03:11:49PM CEST, eric.dumazet@gmail.com wrote:
>On Wed, 2012-06-20 at 05:49 -0700, Paul E. McKenney wrote:
>> On Wed, Jun 20, 2012 at 02:50:55PM +0800, Fengguang Wu wrote:
>> > [CC Paul, the RCU maintainer]
>> > 
>> > On Wed, Jun 20, 2012 at 08:36:07AM +0200, Jiri Pirko wrote:
>> > > Wed, Jun 20, 2012 at 06:27:43AM CEST, wfg@linux.intel.com wrote:
>> > > >Hi Jiri,
>> > > >
>> > > >There are new sparse warnings show up in
>> > > >
>> > > >tree:   git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git master
>> > > >head:   677a3d60fb3153f786a0d28fcf0287670e7bd3c2
>> > > >commit: ab8250d70063f77929fc404c02390a1f64d66416 [257/266] team: lb: introduce infrastructure for userspace driven tx loadbalancing
>> > > >
>> > > >All sparse warnings:
>> > > >
>> > > >drivers/net/team/team_mode_loadbalance.c:99:30: sparse: incompatible types in comparison expression (different address spaces)
>> > > >
>> > > >drivers/net/team/team_mode_loadbalance.c:99:
>> > > >    96			struct lb_port_mapping *pm;
>> > > >    97	
>> > > >    98			pm = &lb_priv->ex->tx_hash_to_port_mapping[i];
>> > > >  > 99			if (pm->port == port) {
>> > > 
>> > > This looks like your checker does not like
>> > > (struct team_port __rcu *) == (struct team_port *)
>> > > But I wonder why (or how should I fix that)
>> 
>> Because you said that it was an RCU-protected pointer, but then did
>> not use an RCU primitive to access it.  In this case, where you are
>> just using the value but not dereferencing it, you can use
>> rcu_access_pointer().
>
>Yes, and please Jiri change the 
>
>rcu_assign_pointer(pm->port, NULL);
>
>by
>
>RCU_INIT_POINTER(pm->port, NULL);

Will do.

>
>
>


* Re: [PATCH 08/17] net: Do not coalesce skbs belonging to PFMEMALLOC sockets
From: Mel Gorman @ 2012-06-20 13:36 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Andrew Morton, Linux-MM, Linux-Netdev, LKML, David Miller,
	Neil Brown, Peter Zijlstra, Mike Christie, Eric B Munson,
	Sebastian Andrzej Siewior
In-Reply-To: <1340193892.4604.865.camel@edumazet-glaptop>

On Wed, Jun 20, 2012 at 02:04:52PM +0200, Eric Dumazet wrote:
> On Wed, 2012-06-20 at 12:44 +0100, Mel Gorman wrote:
> > Commit [bad43ca8: net: introduce skb_try_coalesce()] introduced an
> > optimisation to coalesce skbs to reduce memory usage and cache line
> > misses. In the case where the socket is used for swapping this can result
> > in a warning like the following.
> > 
> > [  110.476565] nbd0: page allocation failure: order:0, mode:0x20
> > [  110.476568] Pid: 2714, comm: nbd0 Not tainted 3.5.0-rc2-swapnbd-v12r2-slab #3
> > [  110.476569] Call Trace:
> > [  110.476573]  [<ffffffff811042d3>] warn_alloc_failed+0xf3/0x160
> > [  110.476578]  [<ffffffff81107c92>] __alloc_pages_nodemask+0x6e2/0x930
> >
> > <SNIP>
> >  
> 
> 
> This makes absolutely no sense to me.
> 
> This patch changes input path, while your stack trace is about output
> path and a packet being fragmented.
> 

The intention was to avoid any coalescing in the input path, so as to avoid
packets that "were held back due to TCP_CORK or attempt at coalescing
tiny packet". I recognise that it is clumsy and will instead take the
approach of having __tcp_push_pending_frames() use sk_gfp_atomic() in the
output path.

Thanks.

-- 
Mel Gorman
SUSE Labs


* Re: [PATCH 08/17] net: Do not coalesce skbs belonging to PFMEMALLOC sockets
From: Eric Dumazet @ 2012-06-20 13:51 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Linux-MM, Linux-Netdev, LKML, David Miller,
	Neil Brown, Peter Zijlstra, Mike Christie, Eric B Munson,
	Sebastian Andrzej Siewior
In-Reply-To: <20120620133656.GH4011@suse.de>

On Wed, 2012-06-20 at 14:36 +0100, Mel Gorman wrote:
> The intention was to avoid any coalescing in the input path, so as to avoid
> packets that "were held back due to TCP_CORK or attempt at coalescing
> tiny packet". I recognise that it is clumsy and will instead take the
> approach of having __tcp_push_pending_frames() use sk_gfp_atomic() in the
> output path.

But coalescing in input path needs no additional memory allocation, it
can actually free some memory.

And most of the time it avoids the infamous "tcp collapses" that needed
extra memory allocations to group tcp payload on single pages.


If you want the tcp output path to be safer, you should disable TSO/GSO
because some drivers have special handling for skbs that cannot be
mapped because of various hardware limitations.

(for example, tg3 and its tg3_tso_bug() or tigon3_dma_hwbug_workaround()
functions)





* [PATCH net-next] inetpeer: inetpeer_invalidate_tree() cleanup
From: Eric Dumazet @ 2012-06-20 14:02 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, Steffen Klassert

From: Eric Dumazet <edumazet@google.com>

No need to use cmpxchg() in inetpeer_invalidate_tree() since we hold
base lock.

Also use correct rcu annotations to remove sparse errors
(CONFIG_SPARSE_RCU_POINTER=y)

net/ipv4/inetpeer.c:144:19: error: incompatible types in comparison
expression (different address spaces)
net/ipv4/inetpeer.c:149:20: error: incompatible types in comparison
expression (different address spaces)
net/ipv4/inetpeer.c:595:10: error: incompatible types in comparison
expression (different address spaces)


Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
---
 net/ipv4/inetpeer.c |   34 +++++++++++++++-------------------
 1 file changed, 15 insertions(+), 19 deletions(-)

diff --git a/net/ipv4/inetpeer.c b/net/ipv4/inetpeer.c
index cac02ad..da90a8c 100644
--- a/net/ipv4/inetpeer.c
+++ b/net/ipv4/inetpeer.c
@@ -126,7 +126,7 @@ int inet_peer_maxttl __read_mostly = 10 * 60 * HZ;	/* usual time to live: 10 min
 
 static void inetpeer_gc_worker(struct work_struct *work)
 {
-	struct inet_peer *p, *n;
+	struct inet_peer *p, *n, *c;
 	LIST_HEAD(list);
 
 	spin_lock_bh(&gc_lock);
@@ -138,17 +138,19 @@ static void inetpeer_gc_worker(struct work_struct *work)
 
 	list_for_each_entry_safe(p, n, &list, gc_list) {
 
-		if(need_resched())
+		if (need_resched())
 			cond_resched();
 
-		if (p->avl_left != peer_avl_empty) {
-			list_add_tail(&p->avl_left->gc_list, &list);
-			p->avl_left = peer_avl_empty;
+		c = rcu_dereference_protected(p->avl_left, 1);
+		if (c != peer_avl_empty) {
+			list_add_tail(&c->gc_list, &list);
+			p->avl_left = peer_avl_empty_rcu;
 		}
 
-		if (p->avl_right != peer_avl_empty) {
-			list_add_tail(&p->avl_right->gc_list, &list);
-			p->avl_right = peer_avl_empty;
+		c = rcu_dereference_protected(p->avl_right, 1);
+		if (c != peer_avl_empty) {
+			list_add_tail(&c->gc_list, &list);
+			p->avl_right = peer_avl_empty_rcu;
 		}
 
 		n = list_entry(p->gc_list.next, struct inet_peer, gc_list);
@@ -587,23 +589,17 @@ static void inetpeer_inval_rcu(struct rcu_head *head)
 
 void inetpeer_invalidate_tree(struct inet_peer_base *base)
 {
-	struct inet_peer *old, *new, *prev;
+	struct inet_peer *root;
 
 	write_seqlock_bh(&base->lock);
 
-	old = base->root;
-	if (old == peer_avl_empty_rcu)
-		goto out;
-
-	new = peer_avl_empty_rcu;
-
-	prev = cmpxchg(&base->root, old, new);
-	if (prev == old) {
+	root = rcu_deref_locked(base->root, base);
+	if (root != peer_avl_empty) {
+		base->root = peer_avl_empty_rcu;
 		base->total = 0;
-		call_rcu(&prev->gc_rcu, inetpeer_inval_rcu);
+		call_rcu(&root->gc_rcu, inetpeer_inval_rcu);
 	}
 
-out:
 	write_sequnlock_bh(&base->lock);
 }
 EXPORT_SYMBOL(inetpeer_invalidate_tree);


* Re: [PATCH 08/17] net: Do not coalesce skbs belonging to PFMEMALLOC sockets
From: Mel Gorman @ 2012-06-20 14:18 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Andrew Morton, Linux-MM, Linux-Netdev, LKML, David Miller,
	Neil Brown, Peter Zijlstra, Mike Christie, Eric B Munson,
	Sebastian Andrzej Siewior
In-Reply-To: <1340200312.4604.1008.camel@edumazet-glaptop>

On Wed, Jun 20, 2012 at 03:51:52PM +0200, Eric Dumazet wrote:
> On Wed, 2012-06-20 at 14:36 +0100, Mel Gorman wrote:
> > The intention was to avoid any coalescing in the input path, so as to avoid
> > packets that "were held back due to TCP_CORK or attempt at coalescing
> > tiny packet". I recognise that it is clumsy and will instead take the
> > approach of having __tcp_push_pending_frames() use sk_gfp_atomic() in the
> > output path.
> 
> But coalescing in input path needs no additional memory allocation, it
> can actually free some memory.
> 

When I wrote it I thought the timing of the transmission of pending frames
was the problem rather than the actual memory usage. My intention was that
any data related to swapping be handled immediately without delay instead of
deferring until a time when GFP_ATOMIC allocations might fail. I arrived
at this patch because tcp_input.c does call tcp_push_pending_frames()
on the receive path and that led me to believe that coalescing was a
factor.

> And it avoids most of the time the infamous "tcp collapses" that needed
> extra memory allocations to group tcp payload on single pages.
> 
> If you want tcp output path being safer, you should disable TSO/GSO
> because some drivers have special handling for skbs that cannot be
> mapped because of various hardware limitations.
> 

Understood. Thanks for the explanation.

-- 
Mel Gorman
SUSE Labs


* Re: [PATCH 07/17] net: Introduce sk_gfp_atomic() to allow addition of GFP flags depending on the individual socket
From: Mel Gorman @ 2012-06-20 14:20 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Andrew Morton, Linux-MM, Linux-Netdev, LKML, David Miller,
	Neil Brown, Peter Zijlstra, Mike Christie, Eric B Munson,
	Sebastian Andrzej Siewior
In-Reply-To: <1340193999.4604.867.camel@edumazet-glaptop>

On Wed, Jun 20, 2012 at 02:06:39PM +0200, Eric Dumazet wrote:
> On Wed, 2012-06-20 at 12:44 +0100, Mel Gorman wrote:
> > Introduce sk_gfp_atomic(), this function allows to inject sock specific
> > flags to each sock related allocation. It is only used on allocation
> > paths that may be required for writing pages back to network storage.
> > 
> > [davem@davemloft.net: Use sk_gfp_atomic only when necessary]
> > Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> > Signed-off-by: Mel Gorman <mgorman@suse.de>
> > Acked-by: David S. Miller <davem@davemloft.net>
> > ---
> >  include/net/sock.h    |    5 +++++
> >  net/ipv4/tcp_output.c |    9 +++++----
> >  net/ipv6/tcp_ipv6.c   |    8 +++++---
> >  3 files changed, 15 insertions(+), 7 deletions(-)
> > 
> 
> > diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> > index 803cbfe..440b47e 100644
> > --- a/net/ipv4/tcp_output.c
> > +++ b/net/ipv4/tcp_output.c
> > @@ -2461,7 +2461,8 @@ struct sk_buff *tcp_make_synack(struct sock *sk, struct dst_entry *dst,
> >  
> >  	if (cvp != NULL && cvp->s_data_constant && cvp->s_data_desired)
> >  		s_data_desired = cvp->s_data_desired;
> > -	skb = sock_wmalloc(sk, MAX_TCP_HEADER + 15 + s_data_desired, 1, GFP_ATOMIC);
> > +	skb = sock_wmalloc(sk, MAX_TCP_HEADER + 15 + s_data_desired, 1,
> > +			   sk_gfp_atomic(sk, GFP_ATOMIC));
> >  	if (skb == NULL)
> >  
> 
> This bit no longer applies on net-next, sock_wmalloc() was changed to a
> mere alloc_skb()
> 

Thanks, I'll rebase and retest.
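
For readers following the thread, here is a minimal userspace sketch of the idea described in the quoted changelog: the caller's GFP mask is passed through, and a socket-specific reserve flag is added only when the socket carries it. The flag values, `struct sketch_sock`, and `sketch_sk_gfp_atomic()` are illustrative assumptions; the include/net/sock.h hunk is not quoted in this message, so this is not the patch's code.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative flag bits only; the real kernel values differ. */
typedef unsigned int gfp_t;
#define SKETCH_GFP_ATOMIC   0x01u
#define SKETCH_GFP_MEMALLOC 0x10u

/* Hypothetical stand-in for struct sock: only the allocation mask. */
struct sketch_sock {
	gfp_t sk_allocation;
};

/*
 * Return the caller's mask, plus the reserve flag if (and only if)
 * the socket's own allocation mask has it set.
 */
static gfp_t sketch_sk_gfp_atomic(const struct sketch_sock *sk, gfp_t gfp_mask)
{
	assert(sk != NULL);
	return gfp_mask | (sk->sk_allocation & SKETCH_GFP_MEMALLOC);
}
```

With this shape, an ordinary socket gets back exactly the mask it passed in, while a socket marked for swap-over-network allocations also gets the reserve flag.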

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply

* Re: [PATCH 01/12] netvm: Prevent a stream-specific deadlock
From: Mel Gorman @ 2012-06-20 14:21 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux-MM, Linux-Netdev, Linux-NFS, LKML, David Miller,
	Trond Myklebust, Neil Brown, Christoph Hellwig, Peter Zijlstra,
	Mike Christie, Eric B Munson
In-Reply-To: <1340185081-22525-2-git-send-email-mgorman@suse.de>

On Wed, Jun 20, 2012 at 10:37:50AM +0100, Mel Gorman wrote:
> It could happen that all !SOCK_MEMALLOC sockets have buffered so
> much data that we're over the global rmem limit. This will prevent
> SOCK_MEMALLOC buffers from receiving data, which will prevent userspace
> from running, which is needed to reduce the buffered data.
> 
> Fix this by exempting the SOCK_MEMALLOC sockets from the rmem limit.
> Once this change is applied, it is important that sockets that set
> SOCK_MEMALLOC do not clear the flag until the socket is being torn down.
> If this happens, a warning is generated and the tokens reclaimed to
> avoid accounting errors until the bug is fixed.
> 
> [davem@davemloft.net: Warning about clearing SOCK_MEMALLOC]
> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> Signed-off-by: Mel Gorman <mgorman@suse.de>
> Acked-by: David S. Miller <davem@davemloft.net>

This patch introduced a new warning that I had previously missed. I'll
fix it up when rebasing this series on top of linux-next.
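
The exemption described in the quoted changelog can be pictured with a small userspace model. This is a sketch under stated assumptions: `struct sketch_sock`, `struct sketch_rmem_pool`, and `sketch_rmem_charge()` are hypothetical names, and the real accounting in the network stack is more involved.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical socket model: only the flag that matters here. */
struct sketch_sock {
	bool memalloc;	/* models SOCK_MEMALLOC */
};

/* Hypothetical global receive-memory accounting. */
struct sketch_rmem_pool {
	long allocated;
	long limit;
};

/*
 * Charge 'size' units to the pool. Ordinary sockets are refused once
 * the global limit would be exceeded; memalloc sockets are exempt, so
 * they can still receive data when everyone else has filled the pool.
 */
static bool sketch_rmem_charge(struct sketch_rmem_pool *pool,
			       const struct sketch_sock *sk, long size)
{
	assert(pool != NULL && sk != NULL && size >= 0);
	if (!sk->memalloc && pool->allocated + size > pool->limit)
		return false;
	pool->allocated += size;
	return true;
}
```

In this model a pool that is nearly full rejects an ordinary socket but still admits the memalloc socket, which is the property the changelog relies on to break the deadlock.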

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply

* How does SACK or FACK determine the time to start fast retransmition?
From: LovelyLich @ 2012-06-20 14:31 UTC (permalink / raw)
  To: netdev; +Cc: linux-kernel

Hi all,
    When TCP uses Reno as its congestion control algorithm, it uses
tp->sacked_out as the dup-ack count. When the third dup-ack (under the
default settings) arrives, TCP initiates fast retransmission.
    But what about SACK?
    According to the kernel source code comments, when the SACK or FACK TCP
option is enabled, there is no dup-ack counter. See the comments for function
tcp_dupack_heuristics():
http://lxr.linux.no/linux+v2.6.37/net/ipv4/tcp_input.c#L2300
    So how does TCP know that the current dup-ack is the one that
triggers fast retransmission?

    According to rfc3517 section 5:
    "Upon the receipt of the first (DupThresh - 1) duplicate ACKs, the
scoreboard is to be updated as normal."
    "When a TCP sender receives the duplicate ACK corresponding to
DupThresh ACKs,
the scoreboard MUST be updated with the new SACK information (via
Update ()). If no previous loss event has occurred
on the connection or the cumulative acknowledgment point is beyond
the last value of RecoveryPoint, a loss recovery phase SHOULD be
initiated, per the fast retransmit algorithm outlined in [RFC2581]."

    But these sentences don't describe how TCP knows that the current ACK
is the DupThresh-th dup-ack.

    According to RFC 3517 section 4 and the IsLost(SeqNum) function:
    "The routine returns true when either
DupThresh discontiguous SACKed sequences have arrived above
’SeqNum’ or (DupThresh * SMSS) bytes with sequence numbers greater
than ’SeqNum’ have been SACKed. Otherwise, the routine returns
false."
    I think this is just what I am searching for, but I still don't know
which line of code in the Linux TCP stack does this check.
    Can anyone help me? Thanks in advance.
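
The IsLost() rule quoted above can be written down as a short userspace sketch. It is a reading of the RFC text only, not the Linux implementation: `struct sack_block` and `sketch_is_lost()` are hypothetical names, the scoreboard is assumed to hold non-overlapping, non-adjacent SACKed ranges, the queried sequence number is assumed not to be SACKed itself, and sequence-number wraparound is ignored.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* One SACKed range, half-open: [start, end). */
struct sack_block {
	uint32_t start;
	uint32_t end;
};

/*
 * True when either dup_thresh discontiguous SACKed ranges lie above
 * 'seq', or at least dup_thresh * smss bytes above 'seq' are SACKed.
 */
static bool sketch_is_lost(const struct sack_block *blocks, size_t n,
			   uint32_t seq, unsigned int dup_thresh,
			   uint32_t smss)
{
	size_t above = 0;	/* SACKed ranges starting above seq */
	uint64_t bytes = 0;	/* SACKed bytes above seq */

	for (size_t i = 0; i < n; i++) {
		assert(blocks[i].start < blocks[i].end);
		if (blocks[i].start > seq) {
			above++;
			bytes += blocks[i].end - blocks[i].start;
		}
	}
	return above >= dup_thresh || bytes >= (uint64_t)dup_thresh * smss;
}
```

Note that under this rule no per-ACK duplicate counter is needed: the decision is recomputed from the scoreboard contents each time new SACK information arrives.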

^ permalink raw reply

* Re: [PATCH net-next 2/6] bnx2x: link cleanup
From: Joe Perches @ 2012-06-20 14:53 UTC (permalink / raw)
  To: Yuval Mintz; +Cc: davem, netdev, eilong, Yaniv Rosner
In-Reply-To: <1340182175-916-3-git-send-email-yuvalmin@broadcom.com>

On Wed, 2012-06-20 at 11:49 +0300, Yuval Mintz wrote:
> This patch does several things:
[]
>  3. Change msleep(1) --> usleep_range(1000, 1000)

I believe replacing msleep(small) with
usleep_range(small * 1000, small * 1000) is
not generally a good idea.

Please give usleep_range an actual range to
work with and not a repeated single value.

Please think a little more about what a
good upper range for the maximum time to
sleep should be.

usleep_range(small * 1000, small * 2000)
or something similar maybe.
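
As a small illustration of the suggestion above, the sketch below derives a real range from a millisecond delay instead of repeating a single value. `struct sketch_usleep_range` and `sketch_msleep_to_range()` are hypothetical helpers for this note; the factor of two for the upper bound is only the example given above, not a recommendation for any particular driver.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical pair of bounds to hand to usleep_range(). */
struct sketch_usleep_range {
	unsigned long min_us;
	unsigned long max_us;
};

/*
 * Turn "sleep about ms milliseconds" into [ms * 1000, ms * 2000]
 * microseconds, leaving slack between the two bounds.
 */
static struct sketch_usleep_range sketch_msleep_to_range(unsigned long ms)
{
	struct sketch_usleep_range r;

	assert(ms > 0 && ms <= ULONG_MAX / 2000);
	r.min_us = ms * 1000;
	r.max_us = ms * 2000;
	return r;
}
```

In kernel code the two fields would be passed as the minimum and maximum arguments of usleep_range(); the gap between them is what lets a wakeup be merged with other timer events.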

^ permalink raw reply

* [PATCH net-next] ipv4: tcp: dont cache output dst for syncookies
From: Eric Dumazet @ 2012-06-20 15:02 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, Hans Schillstrom

From: Eric Dumazet <edumazet@google.com>

Don't cache output dst for syncookies, as this adds pressure on IP route
cache and rcu subsystem for no gain.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Hans Schillstrom <hans.schillstrom@ericsson.com>
---
 include/net/flow.h                 |    1 +
 include/net/inet_connection_sock.h |    3 ++-
 net/dccp/ipv4.c                    |    2 +-
 net/ipv4/inet_connection_sock.c    |    8 ++++++--
 net/ipv4/route.c                   |    5 ++++-
 net/ipv4/tcp_ipv4.c                |   12 +++++++-----
 6 files changed, 21 insertions(+), 10 deletions(-)

diff --git a/include/net/flow.h b/include/net/flow.h
index 6c469db..bd524f5 100644
--- a/include/net/flow.h
+++ b/include/net/flow.h
@@ -22,6 +22,7 @@ struct flowi_common {
 #define FLOWI_FLAG_ANYSRC		0x01
 #define FLOWI_FLAG_PRECOW_METRICS	0x02
 #define FLOWI_FLAG_CAN_SLEEP		0x04
+#define FLOWI_FLAG_RT_NOCACHE		0x08
 	__u32	flowic_secid;
 };
 
diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h
index e1b7734..af3c743 100644
--- a/include/net/inet_connection_sock.h
+++ b/include/net/inet_connection_sock.h
@@ -251,7 +251,8 @@ extern int inet_csk_get_port(struct sock *sk, unsigned short snum);
 
 extern struct dst_entry* inet_csk_route_req(struct sock *sk,
 					    struct flowi4 *fl4,
-					    const struct request_sock *req);
+					    const struct request_sock *req,
+					    bool nocache);
 extern struct dst_entry* inet_csk_route_child_sock(struct sock *sk,
 						   struct sock *newsk,
 						   const struct request_sock *req);
diff --git a/net/dccp/ipv4.c b/net/dccp/ipv4.c
index 07f5579..3eb76b5 100644
--- a/net/dccp/ipv4.c
+++ b/net/dccp/ipv4.c
@@ -504,7 +504,7 @@ static int dccp_v4_send_response(struct sock *sk, struct request_sock *req,
 	struct dst_entry *dst;
 	struct flowi4 fl4;
 
-	dst = inet_csk_route_req(sk, &fl4, req);
+	dst = inet_csk_route_req(sk, &fl4, req, false);
 	if (dst == NULL)
 		goto out;
 
diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index f9ee741..034ddbe 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -368,17 +368,21 @@ EXPORT_SYMBOL(inet_csk_reset_keepalive_timer);
 
 struct dst_entry *inet_csk_route_req(struct sock *sk,
 				     struct flowi4 *fl4,
-				     const struct request_sock *req)
+				     const struct request_sock *req,
+				     bool nocache)
 {
 	struct rtable *rt;
 	const struct inet_request_sock *ireq = inet_rsk(req);
 	struct ip_options_rcu *opt = inet_rsk(req)->opt;
 	struct net *net = sock_net(sk);
+	int flags = inet_sk_flowi_flags(sk) & ~FLOWI_FLAG_PRECOW_METRICS;
 
+	if (nocache)
+		flags |= FLOWI_FLAG_RT_NOCACHE;
 	flowi4_init_output(fl4, sk->sk_bound_dev_if, sk->sk_mark,
 			   RT_CONN_FLAGS(sk), RT_SCOPE_UNIVERSE,
 			   sk->sk_protocol,
-			   inet_sk_flowi_flags(sk) & ~FLOWI_FLAG_PRECOW_METRICS,
+			   flags,
 			   (opt && opt->opt.srr) ? opt->opt.faddr : ireq->rmt_addr,
 			   ireq->loc_addr, ireq->rmt_port, inet_sk(sk)->inet_sport);
 	security_req_classify_flow(req, flowi4_to_flowi(fl4));
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index a91f6d3..8d62d85 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -1156,7 +1156,7 @@ restart:
 	candp = NULL;
 	now = jiffies;
 
-	if (!rt_caching(dev_net(rt->dst.dev))) {
+	if (!rt_caching(dev_net(rt->dst.dev)) || (rt->dst.flags & DST_NOCACHE)) {
 		/*
 		 * If we're not caching, just tell the caller we
 		 * were successful and don't touch the route.  The
@@ -2582,6 +2582,9 @@ static struct rtable *__mkroute_output(const struct fib_result *res,
 
 	rt_set_nexthop(rth, fl4, res, fi, type, 0);
 
+	if (fl4->flowi4_flags & FLOWI_FLAG_RT_NOCACHE)
+		rth->dst.flags |= DST_NOCACHE;
+
 	return rth;
 }
 
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 13857df..6abc0fd 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -825,7 +825,8 @@ static void tcp_v4_reqsk_send_ack(struct sock *sk, struct sk_buff *skb,
 static int tcp_v4_send_synack(struct sock *sk, struct dst_entry *dst,
 			      struct request_sock *req,
 			      struct request_values *rvp,
-			      u16 queue_mapping)
+			      u16 queue_mapping,
+			      bool nocache)
 {
 	const struct inet_request_sock *ireq = inet_rsk(req);
 	struct flowi4 fl4;
@@ -833,7 +834,7 @@ static int tcp_v4_send_synack(struct sock *sk, struct dst_entry *dst,
 	struct sk_buff * skb;
 
 	/* First, grab a route. */
-	if (!dst && (dst = inet_csk_route_req(sk, &fl4, req)) == NULL)
+	if (!dst && (dst = inet_csk_route_req(sk, &fl4, req, nocache)) == NULL)
 		return -1;
 
 	skb = tcp_make_synack(sk, dst, req, rvp);
@@ -855,7 +856,7 @@ static int tcp_v4_rtx_synack(struct sock *sk, struct request_sock *req,
 			      struct request_values *rvp)
 {
 	TCP_INC_STATS_BH(sock_net(sk), TCP_MIB_RETRANSSEGS);
-	return tcp_v4_send_synack(sk, NULL, req, rvp, 0);
+	return tcp_v4_send_synack(sk, NULL, req, rvp, 0, false);
 }
 
 /*
@@ -1388,7 +1389,7 @@ int tcp_v4_conn_request(struct sock *sk, struct sk_buff *skb)
 		 */
 		if (tmp_opt.saw_tstamp &&
 		    tcp_death_row.sysctl_tw_recycle &&
-		    (dst = inet_csk_route_req(sk, &fl4, req)) != NULL &&
+		    (dst = inet_csk_route_req(sk, &fl4, req, want_cookie)) != NULL &&
 		    fl4.daddr == saddr &&
 		    (peer = rt_get_peer((struct rtable *)dst, fl4.daddr)) != NULL) {
 			inet_peer_refcheck(peer);
@@ -1424,7 +1425,8 @@ int tcp_v4_conn_request(struct sock *sk, struct sk_buff *skb)
 
 	if (tcp_v4_send_synack(sk, dst, req,
 			       (struct request_values *)&tmp_ext,
-			       skb_get_queue_mapping(skb)) ||
+			       skb_get_queue_mapping(skb),
+			       want_cookie) ||
 	    want_cookie)
 		goto drop_and_free;
 

^ permalink raw reply related

* Re: [PATCH 01/12] netvm: Prevent a stream-specific deadlock
From: Rik van Riel @ 2012-06-20 15:22 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Linux-MM, Linux-Netdev, Linux-NFS, LKML,
	David Miller, Trond Myklebust, Neil Brown, Christoph Hellwig,
	Peter Zijlstra, Mike Christie, Eric B Munson
In-Reply-To: <1340185081-22525-2-git-send-email-mgorman@suse.de>

On 06/20/2012 05:37 AM, Mel Gorman wrote:
> It could happen that all !SOCK_MEMALLOC sockets have buffered so
> much data that we're over the global rmem limit. This will prevent
> SOCK_MEMALLOC buffers from receiving data, which will prevent userspace
> from running, which is needed to reduce the buffered data.
>
> Fix this by exempting the SOCK_MEMALLOC sockets from the rmem limit.
> Once this change is applied, it is important that sockets that set
> SOCK_MEMALLOC do not clear the flag until the socket is being torn down.
> If this happens, a warning is generated and the tokens reclaimed to
> avoid accounting errors until the bug is fixed.
>
> [davem@davemloft.net: Warning about clearing SOCK_MEMALLOC]
> Signed-off-by: Peter Zijlstra<a.p.zijlstra@chello.nl>
> Signed-off-by: Mel Gorman<mgorman@suse.de>
> Acked-by: David S. Miller<davem@davemloft.net>

Acked-by: Rik van Riel<riel@redhat.com>

^ permalink raw reply

* Re: [RFC net-next 07/14] Fix intel/ixgbe
From: John Fastabend @ 2012-06-20 15:30 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Alexander Duyck, Yuval Mintz, netdev, davem, eilong, Jeff Kirsher
In-Reply-To: <1340182514.4604.843.camel@edumazet-glaptop>

On 6/20/2012 1:55 AM, Eric Dumazet wrote:
> On Tue, 2012-06-19 at 08:54 -0700, Alexander Duyck wrote:
>
>> This patch doesn't limit the number of queues.  It is limiting the
>> number of interrupts.  The two are not directly related as we can
>> support multiple queues per interrupt.
>>
>> Also this change assumes we are only using receive side scaling.  We
>> have other features such as DCB, FCoE, and Flow Director which require
>> additional queues.
>>
>
> Yet, it would be good if ixgbe didn't allocate 36 queues on a 4 cpu
> machine.
>
> "tc -s class show dev eth0" output is full of not used classes.
>
>
>

We do this for the DCB/FCoE/RSS/Flow Director case where we want to
use multiple queues per traffic class (802.1Qaz). As it is now we
have to set the max queues at alloc_etherdev_mq() time so we use a
max of
	(num_cpu * max traffic classes) + num_cpu

The last num_cpu is in error and I have a patch in JeffK's tree to
remove this. In many cases it seems excessive but sometimes it is
helpful.
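
The arithmetic above can be checked with a few lines of C. This is a sketch only: `sketch_max_tx_queues()` is a hypothetical helper, and the value of eight traffic classes used in the note below is an assumption, since the thread does not state it.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Queue count requested at allocation time:
 *   (num_cpu * max_tc) + num_cpu
 * The trailing num_cpu term is the one described above as an error,
 * so it can be switched off to see the intended value.
 */
static unsigned int sketch_max_tx_queues(unsigned int num_cpu,
					 unsigned int max_tc,
					 bool with_extra_cpu_term)
{
	unsigned int queues;

	assert(num_cpu > 0 && max_tc > 0);
	queues = num_cpu * max_tc;
	if (with_extra_cpu_term)
		queues += num_cpu;
	return queues;
}
```

Assuming eight traffic classes, four CPUs give 4 * 8 + 4 = 36 queues, which matches the figure Eric quotes; without the extra term the same machine would get 32.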

.John

^ permalink raw reply

* [patch net-next 0/2] team: two RCU fixups
From: Jiri Pirko @ 2012-06-20 15:31 UTC (permalink / raw)
  To: netdev; +Cc: davem, eric.dumazet, jbrouer, paulmck, wfg

Jiri Pirko (2):
  team: use rcu_access_pointer to access RCU pointer by writer
  team: use RCU_INIT_POINTER for NULL assignment of RCU pointer

 drivers/net/team/team_mode_activebackup.c |    7 +++++--
 drivers/net/team/team_mode_loadbalance.c  |   10 ++++++----
 2 files changed, 11 insertions(+), 6 deletions(-)

-- 
1.7.10.4

^ permalink raw reply

* [patch net-next 2/2] team: use RCU_INIT_POINTER for NULL assignment of RCU pointer
From: Jiri Pirko @ 2012-06-20 15:32 UTC (permalink / raw)
  To: netdev; +Cc: davem, eric.dumazet, jbrouer, paulmck, wfg
In-Reply-To: <1340206321-5986-1-git-send-email-jpirko@redhat.com>

Signed-off-by: Jiri Pirko <jpirko@redhat.com>
---
 drivers/net/team/team_mode_loadbalance.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/team/team_mode_loadbalance.c b/drivers/net/team/team_mode_loadbalance.c
index b4475a5..c385b45 100644
--- a/drivers/net/team/team_mode_loadbalance.c
+++ b/drivers/net/team/team_mode_loadbalance.c
@@ -97,7 +97,7 @@ static void lb_tx_hash_to_port_mapping_null_port(struct team *team,
 
 		pm = &lb_priv->ex->tx_hash_to_port_mapping[i];
 		if (rcu_access_pointer(pm->port) == port) {
-			rcu_assign_pointer(pm->port, NULL);
+			RCU_INIT_POINTER(pm->port, NULL);
 			team_option_inst_set_change(pm->opt_inst_info);
 			changed = true;
 		}
-- 
1.7.10.4

^ permalink raw reply related

* [patch net-next 1/2] team: use rcu_access_pointer to access RCU pointer by writer
From: Jiri Pirko @ 2012-06-20 15:32 UTC (permalink / raw)
  To: netdev; +Cc: davem, eric.dumazet, jbrouer, paulmck, wfg
In-Reply-To: <1340206321-5986-1-git-send-email-jpirko@redhat.com>

Signed-off-by: Jiri Pirko <jpirko@redhat.com>
---
 drivers/net/team/team_mode_activebackup.c |    7 +++++--
 drivers/net/team/team_mode_loadbalance.c  |    8 +++++---
 2 files changed, 10 insertions(+), 5 deletions(-)

diff --git a/drivers/net/team/team_mode_activebackup.c b/drivers/net/team/team_mode_activebackup.c
index 2fe02a8..c9e7621 100644
--- a/drivers/net/team/team_mode_activebackup.c
+++ b/drivers/net/team/team_mode_activebackup.c
@@ -61,8 +61,11 @@ static void ab_port_leave(struct team *team, struct team_port *port)
 
 static int ab_active_port_get(struct team *team, struct team_gsetter_ctx *ctx)
 {
-	if (ab_priv(team)->active_port)
-		ctx->data.u32_val = ab_priv(team)->active_port->dev->ifindex;
+	struct team_port *active_port;
+
+	active_port = rcu_access_pointer(ab_priv(team)->active_port);
+	if (active_port)
+		ctx->data.u32_val = active_port->dev->ifindex;
 	else
 		ctx->data.u32_val = 0;
 	return 0;
diff --git a/drivers/net/team/team_mode_loadbalance.c b/drivers/net/team/team_mode_loadbalance.c
index 45cc095..b4475a5 100644
--- a/drivers/net/team/team_mode_loadbalance.c
+++ b/drivers/net/team/team_mode_loadbalance.c
@@ -96,7 +96,7 @@ static void lb_tx_hash_to_port_mapping_null_port(struct team *team,
 		struct lb_port_mapping *pm;
 
 		pm = &lb_priv->ex->tx_hash_to_port_mapping[i];
-		if (pm->port == port) {
+		if (rcu_access_pointer(pm->port) == port) {
 			rcu_assign_pointer(pm->port, NULL);
 			team_option_inst_set_change(pm->opt_inst_info);
 			changed = true;
@@ -292,7 +292,7 @@ static int lb_bpf_func_set(struct team *team, struct team_gsetter_ctx *ctx)
 	if (lb_priv->ex->orig_fprog) {
 		/* Clear old filter data */
 		__fprog_destroy(lb_priv->ex->orig_fprog);
-		sk_unattached_filter_destroy(lb_priv->fp);
+		sk_unattached_filter_destroy(rcu_access_pointer(lb_priv->fp));
 	}
 
 	rcu_assign_pointer(lb_priv->fp, fp);
@@ -303,9 +303,11 @@ static int lb_bpf_func_set(struct team *team, struct team_gsetter_ctx *ctx)
 static int lb_tx_method_get(struct team *team, struct team_gsetter_ctx *ctx)
 {
 	struct lb_priv *lb_priv = get_lb_priv(team);
+	lb_select_tx_port_func_t *func;
 	char *name;
 
-	name = lb_select_tx_port_get_name(lb_priv->select_tx_port_func);
+	func = rcu_access_pointer(lb_priv->select_tx_port_func);
+	name = lb_select_tx_port_get_name(func);
 	BUG_ON(!name);
 	ctx->data.str_val = name;
 	return 0;
-- 
1.7.10.4

^ permalink raw reply related

* Re: [PATCH 02/12] selinux: tag avc cache alloc as non-critical
From: Rik van Riel @ 2012-06-20 15:33 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Linux-MM, Linux-Netdev, Linux-NFS, LKML,
	David Miller, Trond Myklebust, Neil Brown, Christoph Hellwig,
	Peter Zijlstra, Mike Christie, Eric B Munson
In-Reply-To: <1340185081-22525-3-git-send-email-mgorman@suse.de>

On 06/20/2012 05:37 AM, Mel Gorman wrote:
> Failing to allocate a cache entry will only harm performance not
> correctness.  Do not consume valuable reserve pages for something
> like that.
>
> Signed-off-by: Peter Zijlstra<a.p.zijlstra@chello.nl>
> Signed-off-by: Mel Gorman<mgorman@suse.de>
> Acked-by: Eric Paris<eparis@redhat.com>

Acked-by: Rik van Riel<riel@redhat.com>

^ permalink raw reply

* Re: [PATCH 03/12] mm: Methods for teaching filesystems about PG_swapcache pages
From: Rik van Riel @ 2012-06-20 15:49 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Linux-MM, Linux-Netdev, Linux-NFS, LKML,
	David Miller, Trond Myklebust, Neil Brown, Christoph Hellwig,
	Peter Zijlstra, Mike Christie, Eric B Munson
In-Reply-To: <1340185081-22525-4-git-send-email-mgorman@suse.de>

On 06/20/2012 05:37 AM, Mel Gorman wrote:
> In order to teach filesystems to handle swap cache pages, three new
> page functions are introduced:
>
>    pgoff_t page_file_index(struct page *);
>    loff_t page_file_offset(struct page *);
>    struct address_space *page_file_mapping(struct page *);
>
> page_file_index() - gives the offset of this page in the file in
> PAGE_CACHE_SIZE blocks. Like page->index is for mapped pages, this
> function also gives the correct index for PG_swapcache pages.
>
> page_file_offset() - uses page_file_index(), so that it will give
> the expected result, even for PG_swapcache pages.
>
> page_file_mapping() - gives the mapping backing the actual page;
> that is for swap cache pages it will give swap_file->f_mapping.
>
> Signed-off-by: Peter Zijlstra<a.p.zijlstra@chello.nl>
> Signed-off-by: Mel Gorman<mgorman@suse.de>

Reviewed-by: Rik van Riel<riel@redhat.com>
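
The three helpers described in the quoted changelog can be modelled in userspace as follows. This is a simplified sketch: the `sketch_` structures and fields are hypothetical stand-ins rather than the kernel's struct page layout, and the shift of 12 assumes 4 KiB pages.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_PAGE_SHIFT 12	/* assumes 4 KiB pages */

/* Hypothetical stand-in for struct address_space. */
struct sketch_address_space {
	int id;
};

/* Hypothetical page model with only the fields needed here. */
struct sketch_page {
	bool swapcache;				/* models PG_swapcache */
	unsigned long index;			/* index for ordinary file pages */
	unsigned long swap_offset;		/* page slot within the swap file */
	struct sketch_address_space *mapping;	/* mapping for ordinary pages */
	struct sketch_address_space *swap_file_mapping;
};

/* Index of the page within its backing file, in page-sized blocks. */
static unsigned long sketch_page_file_index(const struct sketch_page *p)
{
	assert(p != NULL);
	return p->swapcache ? p->swap_offset : p->index;
}

/* Byte offset within the backing file, built on the index helper. */
static long long sketch_page_file_offset(const struct sketch_page *p)
{
	return (long long)sketch_page_file_index(p) << SKETCH_PAGE_SHIFT;
}

/* Mapping that actually backs the page. */
static struct sketch_address_space *
sketch_page_file_mapping(const struct sketch_page *p)
{
	assert(p != NULL);
	return p->swapcache ? p->swap_file_mapping : p->mapping;
}
```

The point of the model is that a filesystem calling these helpers gets sensible answers for both kinds of page without having to test for the swap cache itself.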

^ permalink raw reply
