Netdev List

* Re: [patch v4 2/2] infiniband: pass rdma_cm module to netlink_dump_start
From: David Miller @ 2012-10-07  4:31 UTC (permalink / raw)
  To: gaofeng-BthXqXjhjHXQFUHtdCDX3A
  Cc: eric.dumazet-Re5JQEeQqe8AvxtiuMwx3w,
	steffen.klassert-opNxpl+3fjRBDgjK7y7TUQ,
	pablo-Cap9r6Oaw4JrovVCs/uTlw, netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	stephen.hemminger-ZtmgI6mnKB3QT0dZR+AlfA, jengelh-9+2X+4sQBs8,
	roland-DgEjT+Ai2ygdnm+yROfE0A, sean.hefty-ral2JQCrhuEAvxtiuMwx3w
In-Reply-To: <1349417749-24869-2-git-send-email-gaofeng-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>

From: Gao feng <gaofeng-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
Date: Fri, 5 Oct 2012 14:15:49 +0800

> set netlink_dump_control.module to avoid panic.
> 
> Signed-off-by: Gao feng <gaofeng-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
> Cc: Roland Dreier <roland-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
> Cc: Sean Hefty <sean.hefty-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>

Applied.

* Re: [PATCH] net: remove skb recycling
From: David Miller @ 2012-10-07  4:41 UTC (permalink / raw)
  To: eric.dumazet; +Cc: mbizon, david+ml, romieu, netdev
In-Reply-To: <1349454235.21172.132.camel@edumazet-glaptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Fri, 05 Oct 2012 18:23:55 +0200

> From: Eric Dumazet <edumazet@google.com>
> 
> Over time, skb recycling infrastructure got little interest and
> many bugs. Generic rx path skb allocation is now using page
> fragments for efficient GRO / TCP coalescing, and recycling
> a tx skb for rx path is not worth the pain.
> 
> The last identified bug is that fat skbs can be recycled
> and can end up using high-order pages after a few iterations.
> 
> With help from Maxime Bizon, who pointed out that commit
> 87151b8689d (net: allow pskb_expand_head() to get maximum tailroom)
> introduced this regression for recycled skbs.
> 
> Instead of fixing this bug, let's remove skb recycling.
> 
> Drivers wanting really hot skbs should use build_skb() anyway,
> to allocate/populate sk_buff right before netif_receive_skb()
> 
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Cc: Maxime Bizon <mbizon@freebox.fr>

Applied, thanks Eric.

* Re: [PATCH] net: gro: selective flush of packets
From: Eric Dumazet @ 2012-10-07  5:29 UTC (permalink / raw)
  To: Herbert Xu; +Cc: David Miller, netdev, Jesse Gross, Tom Herbert, Yuchung Cheng
In-Reply-To: <20121007003208.GA31839@gondor.apana.org.au>

On Sun, 2012-10-07 at 08:32 +0800, Herbert Xu wrote:
> On Sat, Oct 06, 2012 at 08:08:49PM +0200, Eric Dumazet wrote:
> >
> > @@ -3981,8 +3996,17 @@ static void net_rx_action(struct softirq_action *h)
> >  				local_irq_enable();
> >  				napi_complete(n);
> >  				local_irq_disable();
> > -			} else
> > +			} else {
> > +				if (n->gro_list) {
> > +					/* flush too old packets
> > +					 * If HZ < 1000, flush all packets.
> > +					 */
> > +					local_irq_enable();
> > +					napi_gro_flush(n, HZ >= 1000);
> > +					local_irq_disable();
> > +				}
> >  				list_move_tail(&n->poll_list, &sd->poll_list);
> > +			}
> 
> Why don't we just always flush everything?

This is what I tried first, but it lowered performance on several
typical workloads.

Using this simple heuristic increases performance.
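
One way to picture the heuristic is the userspace model below. It is a minimal sketch, not the kernel code: struct sketch_pkt, sketch_flush() and the tick counter are hypothetical stand-ins, and it assumes each held packet records the tick at which it was queued. With flush_old set, only packets queued in an earlier tick are flushed; otherwise everything is flushed.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a packet held on a GRO list. */
struct sketch_pkt {
	unsigned long age;	/* tick at which the packet was queued */
	bool held;		/* still waiting to be merged or flushed */
};

/*
 * Flush held packets and return how many were flushed.
 * flush_old == true:  flush only packets queued before the current tick.
 * flush_old == false: flush every held packet.
 */
static size_t sketch_flush(struct sketch_pkt *pkts, size_t n,
			   unsigned long now, bool flush_old)
{
	size_t flushed = 0;

	for (size_t i = 0; i < n; i++) {
		if (!pkts[i].held)
			continue;
		if (flush_old && pkts[i].age == now)
			continue;	/* recent packet: keep it for more merging */
		pkts[i].held = false;
		flushed++;
	}
	return flushed;
}
```

The quoted hunk passes HZ >= 1000 as the flag, presumably because a coarser tick would let "recent" packets wait too long; with HZ < 1000 the model degenerates to flushing everything.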

* Re: [PATCH] iputils: ping: Fix typo in echo reply
From: YOSHIFUJI Hideaki @ 2012-10-07  5:44 UTC (permalink / raw)
  To: Jan Synacek; +Cc: netdev
In-Reply-To: <506BF6D3.7070206@redhat.com>

Hello.

Jan Synacek wrote:
> Hello,
>
> here is a fix for a typo that's currently present in ping.
>
> Cheers,
> --
> Jan Synacek
> Software Engineer, BaseOS team Brno, Red Hat
>

Good catch.  Applied, thank you!

--yoshfuji

* skge: Add DMA mask quirk for Marvell 88E8001 on ASUS P5NSLI motherboard.
From: Graham Gower @ 2012-10-07  5:55 UTC (permalink / raw)
  To: netdev, Stephen Hemminger

Marvell 88E8001 on an ASUS P5NSLI motherboard is unable to send/receive
packets on a system with >4GB RAM unless a 32-bit DMA mask is used.

This issue has been around for years and a fix was sent 3.5 years ago, but
there was some debate as to whether it should instead be fixed as a PCI quirk.
http://www.spinics.net/lists/netdev/msg88670.html

However, 18 months later a similar workaround was introduced for another
chipset exhibiting the same problem.
http://www.spinics.net/lists/netdev/msg142287.html

Signed-off-by: Graham Gower <graham.gower@gmail.com>

--- skge.c.bak	2012-10-07 13:00:56.000000000 +1030
+++ skge.c	2012-10-07 13:26:03.000000000 +1030
@@ -4143,6 +4143,13 @@
  			DMI_MATCH(DMI_BOARD_NAME, "nForce"),
  		},
  	},
+	{
+		.ident = "ASUS P5NSLI",
+		.matches = {
+			DMI_MATCH(DMI_BOARD_VENDOR, "ASUSTeK Computer INC."),
+			DMI_MATCH(DMI_BOARD_NAME, "P5NSLI")
+		},
+	},
  	{}
  };

* Re: [RFC] ipv6: gro: IPV6_GRO_CB(skb)->proto problem
From: Eric Dumazet @ 2012-10-07  5:56 UTC (permalink / raw)
  To: Herbert Xu; +Cc: David Miller, netdev
In-Reply-To: <20121007042701.GB31839@gondor.apana.org.au>

On Sun, 2012-10-07 at 12:27 +0800, Herbert Xu wrote:
> On Sat, Oct 06, 2012 at 09:15:27PM +0200, Eric Dumazet wrote:
> > It seems IPV6_GRO_CB(skb)->proto can be destroyed in skb_gro_receive()
> > if a new skb is allocated (to serve as an anchor for frag_list)
> > 
> > At line 3049 we copy NAPI_GRO_CB() only (not the IPV6 specific part)
> > 
> > *NAPI_GRO_CB(nskb) = *NAPI_GRO_CB(p);
> > 
> > So we leave IPV6_GRO_CB(nskb)->proto to 0 (fresh skb allocation) instead
> > of PROTO_TCP
> > 
> > So ipv6_gro_complete() wont be able to call ops->gro_complete()
> > [ tcp6_gro_complete() ]
> > 
> > I would fix this by moving proto in NAPI_GRO_CB() [ ie getting rid of
> > IPV6_GRO_CB ]
> > 
> > Am I missing something ?
> > 
> > (I'll submit a proper patch once/if prior GRO ones are accepted/merged
> > by David)
> 
> No I think you're absolutely right.
> 
> >  include/linux/netdevice.h |    2 ++
> >  net/ipv6/af_inet6.c       |   11 ++---------
> >  2 files changed, 4 insertions(+), 9 deletions(-)
> > 
> > diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> > index 01646aa..3f13441 100644
> > --- a/include/linux/netdevice.h
> > +++ b/include/linux/netdevice.h
> > @@ -1510,6 +1510,8 @@ struct napi_gro_cb {
> >  	int free;
> >  #define NAPI_GRO_FREE		  1
> >  #define NAPI_GRO_FREE_STOLEN_HEAD 2
> > +
> > +	u8 proto;
> 
> I'd prefer to keep it as an int since we're not really running
> short on space.
> 

Sure I can do that for stable anyway, but when net-next is open, I'll
submit the patch to use a hash table, and I'll need one more pointer in
this structure.

On 64bit we reach current cb[48] limit...

(skb->next/skb->prev will be used for the global chain, gro_list becomes
a list_head, and each hash chain will use a pointer in napi_gro_cb)

(I'll probably use a u32 instead of "unsigned long" for the age)

Thanks
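
The space constraint can be made concrete with a compile-time check. This is only a sketch under stated assumptions: struct sketch_gro_cb is a hypothetical layout (frag0, frag0_len, data_offset, free, age and proto appear in this thread; the remaining fields are guesses), and SKETCH_CB_SIZE simply restates the cb[48] figure above.

```c
#include <stddef.h>

#define SKETCH_CB_SIZE 48	/* size of the cb[] scratch area cited above */

/* Hypothetical layout; not the real struct napi_gro_cb. */
struct sketch_gro_cb {
	void *frag0;
	unsigned int frag0_len;
	int data_offset;
	int same_flow;
	int flush;
	int count;
	int free;
	unsigned long age;
	int proto;
};

/* Fails the build if the private block outgrows the scratch area. */
_Static_assert(sizeof(struct sketch_gro_cb) <= SKETCH_CB_SIZE,
	       "GRO control block must fit in cb[]");

/* Bytes still free in the scratch area for this layout. */
static size_t sketch_cb_headroom(void)
{
	return SKETCH_CB_SIZE - sizeof(struct sketch_gro_cb);
}
```

Under this assumed layout a typical LP64 build leaves no headroom at all, so one more pointer would not fit; narrowing age to a 32-bit type is one way to win the space back.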

* Re: [PATCH] iputils: ping: Fix typo in echo reply
From: YOSHIFUJI Hideaki @ 2012-10-07  6:02 UTC (permalink / raw)
  To: Jan Synacek; +Cc: netdev
In-Reply-To: <506BF6D3.7070206@redhat.com>

Hello.

Jan Synacek wrote:
> Hello,
>
> here is a fix for a typo that's currently present in ping.
>
> Cheers,
> --
> Jan Synacek
> Software Engineer, BaseOS team Brno, Red Hat
>

Good catch.  Applied, thank you!

--yoshfuji

* [PATCH] net: gro: fix a potential crash in skb_gro_reset_offset
From: Eric Dumazet @ 2012-10-07  8:28 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, Herbert Xu

From: Eric Dumazet <edumazet@google.com>

Before accessing skb first fragment, better make sure there
is one.

This is probably not needed for old kernels, since an ethernet frame
cannot contain only an ethernet header, but the recent GRO addition
to tunnels makes this patch needed.

Also, skb_gro_reset_offset() can be static, which allows the
compiler to inline it.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
---
 include/linux/netdevice.h |    1 -
 net/core/dev.c            |   14 ++++++++------
 2 files changed, 8 insertions(+), 7 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 01646aa..a659fd0 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1663,7 +1663,6 @@ extern int		netpoll_trap(void);
 #endif
 extern int	       skb_gro_receive(struct sk_buff **head,
 				       struct sk_buff *skb);
-extern void	       skb_gro_reset_offset(struct sk_buff *skb);
 
 static inline unsigned int skb_gro_offset(const struct sk_buff *skb)
 {
diff --git a/net/core/dev.c b/net/core/dev.c
index 1e0a184..de2bad7 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3631,20 +3631,22 @@ gro_result_t napi_skb_finish(gro_result_t ret, struct sk_buff *skb)
 }
 EXPORT_SYMBOL(napi_skb_finish);
 
-void skb_gro_reset_offset(struct sk_buff *skb)
+static void skb_gro_reset_offset(struct sk_buff *skb)
 {
+	const struct skb_shared_info *pinfo = skb_shinfo(skb);
+	const skb_frag_t *frag0 = &pinfo->frags[0];
+
 	NAPI_GRO_CB(skb)->data_offset = 0;
 	NAPI_GRO_CB(skb)->frag0 = NULL;
 	NAPI_GRO_CB(skb)->frag0_len = 0;
 
 	if (skb->mac_header == skb->tail &&
-	    !PageHighMem(skb_frag_page(&skb_shinfo(skb)->frags[0]))) {
-		NAPI_GRO_CB(skb)->frag0 =
-			skb_frag_address(&skb_shinfo(skb)->frags[0]);
-		NAPI_GRO_CB(skb)->frag0_len = skb_frag_size(&skb_shinfo(skb)->frags[0]);
+	    pinfo->nr_frags &&
+	    !PageHighMem(skb_frag_page(frag0))) {
+		NAPI_GRO_CB(skb)->frag0 = skb_frag_address(frag0);
+		NAPI_GRO_CB(skb)->frag0_len = skb_frag_size(frag0);
 	}
 }
-EXPORT_SYMBOL(skb_gro_reset_offset);
 
 gro_result_t napi_gro_receive(struct napi_struct *napi, struct sk_buff *skb)
 {
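
The guard added by this patch can be modelled in userspace as follows. This is a hedged sketch, not the kernel function: struct sketch_buf and its fields are hypothetical, and linear_len == 0 stands in for the skb->mac_header == skb->tail test. The point is only that frags[0] is consulted after nr_frags has been checked.

```c
#include <stddef.h>

#define SKETCH_MAX_FRAGS 4

/* Hypothetical stand-ins for a page fragment and a buffer carrying them. */
struct sketch_frag {
	const unsigned char *addr;
	size_t len;
};

struct sketch_buf {
	size_t linear_len;	/* bytes in the linear area */
	unsigned int nr_frags;	/* number of valid entries in frags[] */
	struct sketch_frag frags[SKETCH_MAX_FRAGS];
};

struct sketch_frag0 {
	const unsigned char *ptr;
	size_t len;
};

/* Return the first fragment only when the buffer really has one. */
static struct sketch_frag0 sketch_reset_offset(const struct sketch_buf *b)
{
	struct sketch_frag0 out = { NULL, 0 };

	if (b->linear_len == 0 && b->nr_frags > 0) {
		out.ptr = b->frags[0].addr;
		out.len = b->frags[0].len;
	}
	return out;
}
```

Without the nr_frags test, a buffer with an empty linear area and no fragments would hand back whatever stale data sits in frags[0].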

* Re: skge: Add DMA mask quirk for Marvell 88E8001 on ASUS P5NSLI motherboard.
From: Jan Ceuleers @ 2012-10-07  9:32 UTC (permalink / raw)
  To: Graham Gower; +Cc: netdev, Stephen Hemminger
In-Reply-To: <50711961.1010101@gmail.com>

On 10/07/2012 07:55 AM, Graham Gower wrote:
> Signed-off-by: Graham Gower <graham.gower@gmail.com>
> 
> --- skge.c.bak    2012-10-07 13:00:56.000000000 +1030
> +++ skge.c    2012-10-07 13:26:03.000000000 +1030
> @@ -4143,6 +4143,13 @@

Graham,

Your patch should include the path to the file being patched relative to
the base directory. So in this case something like

--- a/drivers/net/ethernet/marvell/skge.c ...
+++ b/drivers/net/ethernet/marvell/skge.c ...

HTH, Jan

* Re: [PATCH] Packet mmap : allow the user to choose the offset of the tx payload.
From: Daniel Borkmann @ 2012-10-07 10:50 UTC (permalink / raw)
  To: pchavent; +Cc: davem, edumazet, xemul, herbert, netdev, johann.baudy, uaca
In-Reply-To: <0a20d3610fc578b516cad3f81d14bf50@sybille.onecert.fr>

On Sat, Oct 6, 2012 at 9:43 AM, pchavent <Paul.Chavent@onera.fr> wrote:
> On Fri, 5 Oct 2012 21:37:58 +0200, Daniel Borkmann wrote:
>> On Fri, Oct 5, 2012 at 9:21 PM, pchavent <Paul.Chavent@onera.fr> wrote:
>>> On Fri, 5 Oct 2012 16:17:12 +0200, Daniel Borkmann wrote:
>>>> On Fri, Oct 5, 2012 at 3:10 PM, Paul Chavent <Paul.Chavent@onera.fr>
>>>> wrote:
>>>>>
>>>>>
>>>>> The tx offset of packet mmap tx ring used to be :
>>>>> (TPACKET2_HDRLEN - sizeof(struct sockaddr_ll))
>>>>>
>>>>> The problem is that depending on the usage of SOCK_DGRAM or
>>>>> SOCK_RAW, the payload could be aligned or not.
>>>>>
>>>>> This patch lets the user give an offset for its tx
>>>>> payload if desired.
>>>>>
>>>>> Signed-off-by: Paul Chavent <paul.chavent@onera.fr>
>>>>
>>>>
>>>> Can you provide an example when it doesn't hit TPACKET_ALIGNMENT?
>>>
>>>
>>> When we use the tx ring, the user has to write at (TPACKET_HDRLEN -
>>> sizeof(struct sockaddr_ll))
>>>
>>> This address is aligned on TPACKET_ALIGNMENT since
>>> TPACKET_HDRLEN = (TPACKET_ALIGN(sizeof(struct tpacket_hdr)) +
>>> sizeof(struct sockaddr_ll))
>>>
>>> When we use the tx ring with the SOCK_RAW option, the mac header is aligned
>>> on TPACKET_ALIGNMENT, but not the payload (14 bytes away).
>>
>>
>> Okay, I'm confused about your intentions, maybe I'm missing something.
>> The man-page of packet(7) clearly says:
>>
>> The socket_type is either SOCK_RAW for raw packets *including* the
>> link level header or SOCK_DGRAM for cooked packets with  the link
>> level header *removed*.
>>
>> So this is perfectly intended behavior of PF_PACKET.
>>
>> Cheers,
>>
>> Daniel
>
>
> Yes, I also expect to be able to include the link level header when I use
> SOCK_RAW.
>
> My intention is to send a frame with this payload (for example) :
> typedef struct
> {
>   double   ts;
>   uint64_t foo;
> } test_t;
>
> So I get a pointer to the raw packet :
> void * raw_packet = frame_base + (TPACKET_HDRLEN - sizeof(struct sockaddr_ll));
>
> I cook the header :
> memcpy(raw_packet +  0, dst_addr, sizeof(dst_addr));
> memcpy(raw_packet +  6, src_addr, sizeof(src_addr));
> memcpy(raw_packet + 12, type    , sizeof(type));
>
> Then i get a pointer to the beginning of payload :
> test_t * payload = raw_packet + 14;
>
> Here payload is at 58 bytes from the beginning of the frame.
>
> Then i fill the payload :
> payload->ts = 1.0;
> payload->foo = 2;
> ...
>
> These are misaligned accesses.
>
> I don't mind if the header I fill in is misaligned, but I would like
> to be able to fill the frame directly in the ring buffer on aligned
> boundaries.

Okay.

Maybe what you could do in a new version of your patch is to introduce
a TP_STATUS flag, e.g. TP_STATUS_SEND_HAS_OFF that you pass along
binary or'ed with the commonly used flags, and then you can fill
tp_mac resp. tp_net with offsets. By that, you won't break legacy
stuff.
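
The alignment issue discussed in this thread comes down to a few lines of arithmetic. The following is a minimal sketch, not code from the patch: SKETCH_ALIGNMENT and SKETCH_ETH_HLEN are assumed values mirroring TPACKET_ALIGNMENT (16) and a 14-byte Ethernet header, and the helper names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed values for illustration only. */
#define SKETCH_ALIGNMENT 16u
#define SKETCH_ETH_HLEN  14u

/* Round x up to the next multiple of a power-of-two alignment. */
static size_t sketch_align_up(size_t x, size_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return (x + align - 1) & ~(align - 1);
}

/* Payload offset when the link-level header starts at an aligned offset. */
static size_t sketch_payload_offset(size_t frame_hdr_len)
{
	return sketch_align_up(frame_hdr_len, SKETCH_ALIGNMENT) + SKETCH_ETH_HLEN;
}

/* Nonzero when an object needing 'required' alignment fits at 'offset'. */
static int sketch_is_aligned(size_t offset, size_t required)
{
	return offset % required == 0;
}
```

Whatever the frame header length, an aligned start plus 14 bytes leaves the payload 6 bytes past an 8-byte boundary, so a double or uint64_t placed there is misaligned; a user-chosen offset lets the payload, rather than the link-level header, land on the boundary.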

* [PATCH net 0/6] ipv4: Changes for rt_gateway
From: Julian Anastasov @ 2012-10-07 11:26 UTC (permalink / raw)
  To: netdev

	This patchset fixes some problems in routing caused
by the new rt_gateway semantics. What started as a fix for
IPVS-DR ended up as fixes for more problems. To solve the IPVS
problem I decided to name the flag FLOWI_FLAG_KNOWN_NH, so that
we can even get a route cached in FNHE or FIB NH.

	A different flag, FLOWI_FLAG_RT_NOCACHE, could be equally good
for IPVS: we would again be able to use data from the fnhe, but working
with cached routes should be preferred. If there is no FNHE, the
common case is for IPVS to get an uncached route; of course, IPVS caches
it itself.

	Patches 1-3 are fixes not related to IPVS problem,
4 and 5 add code that will be used by IPVS in patch 6.

Julian Anastasov (6):
  ipv4: fix sending of redirects
  ipv4: fix forwarding for strict source routes
  ipv4: add check if nh_pcpu_rth_output is allocated
  ipv4: introduce rt_uses_gateway
  ipv4: Add FLOWI_FLAG_KNOWN_NH
  ipvs: fix ARP resolving for direct routing mode

 include/net/flow.h              |    1 +
 include/net/route.h             |    3 +-
 net/ipv4/fib_frontend.c         |    3 +-
 net/ipv4/inet_connection_sock.c |    4 +-
 net/ipv4/ip_forward.c           |    2 +-
 net/ipv4/ip_output.c            |    4 +-
 net/ipv4/route.c                |   96 ++++++++++++++++++++++++---------------
 net/ipv4/xfrm4_policy.c         |    1 +
 net/netfilter/ipvs/ip_vs_xmit.c |    6 ++-
 9 files changed, 75 insertions(+), 45 deletions(-)

-- 
1.7.3.4

* [PATCH net 3/6] ipv4: add check if nh_pcpu_rth_output is allocated
From: Julian Anastasov @ 2012-10-07 11:26 UTC (permalink / raw)
  To: netdev
In-Reply-To: <1349609168-9848-1-git-send-email-ja@ssi.bg>

	Avoid NULL ptr dereference and caching if
nh_pcpu_rth_output is not allocated.

Signed-off-by: Julian Anastasov <ja@ssi.bg>
---
 net/ipv4/route.c |   12 +++++++++---
 1 files changed, 9 insertions(+), 3 deletions(-)

diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 488a8bb..0a600cc 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -1798,18 +1798,24 @@ static struct rtable *__mkroute_output(const struct fib_result *res,
 	fnhe = NULL;
 	if (fi) {
 		struct rtable __rcu **prth;
+		struct fib_nh *nh = &FIB_RES_NH(*res);
 
-		fnhe = find_exception(&FIB_RES_NH(*res), fl4->daddr);
+		fnhe = find_exception(nh, fl4->daddr);
 		if (fnhe)
 			prth = &fnhe->fnhe_rth;
-		else
-			prth = __this_cpu_ptr(FIB_RES_NH(*res).nh_pcpu_rth_output);
+		else {
+			if (!nh->nh_pcpu_rth_output)
+				goto add;
+			prth = __this_cpu_ptr(nh->nh_pcpu_rth_output);
+		}
 		rth = rcu_dereference(*prth);
 		if (rt_cache_valid(rth)) {
 			dst_hold(&rth->dst);
 			return rth;
 		}
 	}
+
+add:
 	rth = rt_dst_alloc(dev_out,
 			   IN_DEV_CONF_GET(in_dev, NOPOLICY),
 			   IN_DEV_CONF_GET(in_dev, NOXFRM),
-- 
1.7.3.4

* [PATCH net 2/6] ipv4: fix forwarding for strict source routes
From: Julian Anastasov @ 2012-10-07 11:26 UTC (permalink / raw)
  To: netdev
In-Reply-To: <1349609168-9848-1-git-send-email-ja@ssi.bg>

	After the change "Adjust semantics of rt->rt_gateway"
(commit f8126f1d51) rt_gateway can be 0, but ip_forward() compares
it directly with nexthop. What we want here is to check whether traffic
goes to a directly connected nexthop and to fail if a gateway is used.

Signed-off-by: Julian Anastasov <ja@ssi.bg>
---
 net/ipv4/ip_forward.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/net/ipv4/ip_forward.c b/net/ipv4/ip_forward.c
index ab09b12..7f35ac2 100644
--- a/net/ipv4/ip_forward.c
+++ b/net/ipv4/ip_forward.c
@@ -85,7 +85,7 @@ int ip_forward(struct sk_buff *skb)
 
 	rt = skb_rtable(skb);
 
-	if (opt->is_strictroute && opt->nexthop != rt->rt_gateway)
+	if (opt->is_strictroute && rt->rt_gateway)
 		goto sr_failed;
 
 	if (unlikely(skb->len > dst_mtu(&rt->dst) && !skb_is_gso(skb) &&
-- 
1.7.3.4
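
The difference between the old and new test can be seen in a small userspace model. This is a sketch only: struct sketch_route and both helpers are hypothetical, and addresses are plain integers rather than __be32.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the route field used by the check. */
struct sketch_route {
	uint32_t gateway;	/* 0 when the destination is directly connected */
};

/* Old check: compares the source-route next hop with the gateway field. */
static bool sketch_sr_failed_old(uint32_t sr_nexthop,
				 const struct sketch_route *rt)
{
	return sr_nexthop != rt->gateway;
}

/* New check: fail only when the route goes through a gateway. */
static bool sketch_sr_failed_new(const struct sketch_route *rt)
{
	return rt->gateway != 0;
}
```

For a directly connected next hop the gateway field is 0, so the old comparison always reports a failure, while the new one lets the packet through and still rejects routes via a gateway.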

* [PATCH net 1/6] ipv4: fix sending of redirects
From: Julian Anastasov @ 2012-10-07 11:26 UTC (permalink / raw)
  To: netdev
In-Reply-To: <1349609168-9848-1-git-send-email-ja@ssi.bg>

	After "Cache input routes in fib_info nexthops" (commit
d2d68ba9fe) and "Elide fib_validate_source() completely when possible"
(commit 7a9bc9b81a) we cannot send ICMP redirects. It seems we
should not cache the RTCF_DOREDIRECT flag in nh_rth_input because
the same fib_info can be used for traffic that is not redirected,
e.g. from other input devices or from sources that are not in the same subnet.

	As a result, we have to disable the caching of the RTCF_DOREDIRECT
flag and force source validation for the case when forwarding
traffic to the input device. If traffic comes from a directly connected
source we allow redirection as it was done before both changes.

	After the change "Adjust semantics of rt->rt_gateway"
(commit f8126f1d51) we should make sure our ICMP_REDIR_HOST messages
contain daddr instead of 0.0.0.0 when target is directly connected.

Signed-off-by: Julian Anastasov <ja@ssi.bg>
---
 net/ipv4/fib_frontend.c |    3 ++-
 net/ipv4/route.c        |   28 +++++++++++++++-------------
 2 files changed, 17 insertions(+), 14 deletions(-)

diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
index 68c93d1..6dacae6 100644
--- a/net/ipv4/fib_frontend.c
+++ b/net/ipv4/fib_frontend.c
@@ -322,7 +322,8 @@ int fib_validate_source(struct sk_buff *skb, __be32 src, __be32 dst,
 {
 	int r = secpath_exists(skb) ? 0 : IN_DEV_RPFILTER(idev);
 
-	if (!r && !fib_num_tclassid_users(dev_net(dev))) {
+	if (!r && !fib_num_tclassid_users(dev_net(dev)) &&
+	    dev->ifindex != oif) {
 		*itag = 0;
 		return 0;
 	}
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index ff62206..488a8bb 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -802,7 +802,8 @@ void ip_rt_send_redirect(struct sk_buff *skb)
 	net = dev_net(rt->dst.dev);
 	peer = inet_getpeer_v4(net->ipv4.peers, ip_hdr(skb)->saddr, 1);
 	if (!peer) {
-		icmp_send(skb, ICMP_REDIRECT, ICMP_REDIR_HOST, rt->rt_gateway);
+		icmp_send(skb, ICMP_REDIRECT, ICMP_REDIR_HOST,
+			  rt->rt_gateway ? : ip_hdr(skb)->daddr);
 		return;
 	}
 
@@ -827,7 +828,9 @@ void ip_rt_send_redirect(struct sk_buff *skb)
 	    time_after(jiffies,
 		       (peer->rate_last +
 			(ip_rt_redirect_load << peer->rate_tokens)))) {
-		icmp_send(skb, ICMP_REDIRECT, ICMP_REDIR_HOST, rt->rt_gateway);
+		__be32 gw = rt->rt_gateway ? : ip_hdr(skb)->daddr;
+
+		icmp_send(skb, ICMP_REDIRECT, ICMP_REDIR_HOST, gw);
 		peer->rate_last = jiffies;
 		++peer->rate_tokens;
 #ifdef CONFIG_IP_ROUTE_VERBOSE
@@ -835,7 +838,7 @@ void ip_rt_send_redirect(struct sk_buff *skb)
 		    peer->rate_tokens == ip_rt_redirect_number)
 			net_warn_ratelimited("host %pI4/if%d ignores redirects for %pI4 to %pI4\n",
 					     &ip_hdr(skb)->saddr, inet_iif(skb),
-					     &ip_hdr(skb)->daddr, &rt->rt_gateway);
+					     &ip_hdr(skb)->daddr, &gw);
 #endif
 	}
 out_put_peer:
@@ -1439,10 +1442,13 @@ static int __mkroute_input(struct sk_buff *skb,
 		goto cleanup;
 	}
 
+	do_cache = res->fi && !itag;
 	if (out_dev == in_dev && err &&
 	    (IN_DEV_SHARED_MEDIA(out_dev) ||
-	     inet_addr_onlink(out_dev, saddr, FIB_RES_GW(*res))))
+	     inet_addr_onlink(out_dev, saddr, FIB_RES_GW(*res)))) {
 		flags |= RTCF_DOREDIRECT;
+		do_cache = false;
+	}
 
 	if (skb->protocol != htons(ETH_P_IP)) {
 		/* Not IP (i.e. ARP). Do not create route, if it is
@@ -1459,15 +1465,11 @@ static int __mkroute_input(struct sk_buff *skb,
 		}
 	}
 
-	do_cache = false;
-	if (res->fi) {
-		if (!itag) {
-			rth = rcu_dereference(FIB_RES_NH(*res).nh_rth_input);
-			if (rt_cache_valid(rth)) {
-				skb_dst_set_noref(skb, &rth->dst);
-				goto out;
-			}
-			do_cache = true;
+	if (do_cache) {
+		rth = rcu_dereference(FIB_RES_NH(*res).nh_rth_input);
+		if (rt_cache_valid(rth)) {
+			skb_dst_set_noref(skb, &rth->dst);
+			goto out;
 		}
 	}
 
-- 
1.7.3.4
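
The fallback used for the redirect target is small enough to show in isolation. This is a sketch with a hypothetical helper name; the patch itself uses the GNU "a ? : b" shorthand, which is written out here as a standard conditional, and addresses are plain integers.

```c
#include <stdint.h>

/*
 * Address to advertise in the redirect: the gateway when the route has
 * one, otherwise the packet's own destination (directly connected target).
 */
static uint32_t sketch_redirect_target(uint32_t rt_gateway, uint32_t daddr)
{
	return rt_gateway ? rt_gateway : daddr;
}
```

With rt_gateway equal to 0 the helper returns daddr, which is what keeps 0.0.0.0 out of the ICMP_REDIR_HOST message.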

* [PATCH net 4/6] ipv4: introduce rt_uses_gateway
From: Julian Anastasov @ 2012-10-07 11:26 UTC (permalink / raw)
  To: netdev
In-Reply-To: <1349609168-9848-1-git-send-email-ja@ssi.bg>

	Add a new flag to remember when a route is via a gateway.
We will use it to allow rt_gateway to contain the address of a
directly connected host for the cases when DST_NOCACHE is
used or when the NH exception caches a per-destination route
without the DST_NOCACHE flag, i.e. when routes are not used for
other destinations. This way we force the neighbour
resolving to work with the routed destination while we
can use a different address in the packet, a feature needed
for IPVS-DR where the original packet for the virtual IP is routed
via the route to the real IP.

Signed-off-by: Julian Anastasov <ja@ssi.bg>
---
 include/net/route.h             |    3 +-
 net/ipv4/inet_connection_sock.c |    4 +-
 net/ipv4/ip_forward.c           |    2 +-
 net/ipv4/ip_output.c            |    4 +-
 net/ipv4/route.c                |   45 +++++++++++++++++++++-----------------
 net/ipv4/xfrm4_policy.c         |    1 +
 6 files changed, 33 insertions(+), 26 deletions(-)

diff --git a/include/net/route.h b/include/net/route.h
index da22243..bc40b63 100644
--- a/include/net/route.h
+++ b/include/net/route.h
@@ -48,7 +48,8 @@ struct rtable {
 	int			rt_genid;
 	unsigned int		rt_flags;
 	__u16			rt_type;
-	__u16			rt_is_input;
+	__u8			rt_is_input;
+	__u8			rt_uses_gateway;
 
 	int			rt_iif;
 
diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index f0c5b9c..d34ce29 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -406,7 +406,7 @@ struct dst_entry *inet_csk_route_req(struct sock *sk,
 	rt = ip_route_output_flow(net, fl4, sk);
 	if (IS_ERR(rt))
 		goto no_route;
-	if (opt && opt->opt.is_strictroute && rt->rt_gateway)
+	if (opt && opt->opt.is_strictroute && rt->rt_uses_gateway)
 		goto route_err;
 	return &rt->dst;
 
@@ -442,7 +442,7 @@ struct dst_entry *inet_csk_route_child_sock(struct sock *sk,
 	rt = ip_route_output_flow(net, fl4, sk);
 	if (IS_ERR(rt))
 		goto no_route;
-	if (opt && opt->opt.is_strictroute && rt->rt_gateway)
+	if (opt && opt->opt.is_strictroute && rt->rt_uses_gateway)
 		goto route_err;
 	rcu_read_unlock();
 	return &rt->dst;
diff --git a/net/ipv4/ip_forward.c b/net/ipv4/ip_forward.c
index 7f35ac2..694de3b 100644
--- a/net/ipv4/ip_forward.c
+++ b/net/ipv4/ip_forward.c
@@ -85,7 +85,7 @@ int ip_forward(struct sk_buff *skb)
 
 	rt = skb_rtable(skb);
 
-	if (opt->is_strictroute && rt->rt_gateway)
+	if (opt->is_strictroute && rt->rt_uses_gateway)
 		goto sr_failed;
 
 	if (unlikely(skb->len > dst_mtu(&rt->dst) && !skb_is_gso(skb) &&
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 24a29a3..6537a40 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -193,7 +193,7 @@ static inline int ip_finish_output2(struct sk_buff *skb)
 	}
 
 	rcu_read_lock_bh();
-	nexthop = rt->rt_gateway ? rt->rt_gateway : ip_hdr(skb)->daddr;
+	nexthop = (__force u32) rt_nexthop(rt, ip_hdr(skb)->daddr);
 	neigh = __ipv4_neigh_lookup_noref(dev, nexthop);
 	if (unlikely(!neigh))
 		neigh = __neigh_create(&arp_tbl, &nexthop, dev, false);
@@ -371,7 +371,7 @@ int ip_queue_xmit(struct sk_buff *skb, struct flowi *fl)
 	skb_dst_set_noref(skb, &rt->dst);
 
 packet_routed:
-	if (inet_opt && inet_opt->opt.is_strictroute && rt->rt_gateway)
+	if (inet_opt && inet_opt->opt.is_strictroute && rt->rt_uses_gateway)
 		goto no_route;
 
 	/* OK, we know where to send it, allocate and build IP header. */
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 0a600cc..eaf9575 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -1123,7 +1123,7 @@ static unsigned int ipv4_mtu(const struct dst_entry *dst)
 	mtu = dst->dev->mtu;
 
 	if (unlikely(dst_metric_locked(dst, RTAX_MTU))) {
-		if (rt->rt_gateway && mtu > 576)
+		if (rt->rt_uses_gateway && mtu > 576)
 			mtu = 576;
 	}
 
@@ -1174,7 +1174,9 @@ static bool rt_bind_exception(struct rtable *rt, struct fib_nh_exception *fnhe,
 		if (fnhe->fnhe_gw) {
 			rt->rt_flags |= RTCF_REDIRECTED;
 			rt->rt_gateway = fnhe->fnhe_gw;
-		}
+			rt->rt_uses_gateway = 1;
+		} else if (!rt->rt_gateway)
+			rt->rt_gateway = daddr;
 
 		orig = rcu_dereference(fnhe->fnhe_rth);
 		rcu_assign_pointer(fnhe->fnhe_rth, rt);
@@ -1183,13 +1185,6 @@ static bool rt_bind_exception(struct rtable *rt, struct fib_nh_exception *fnhe,
 
 		fnhe->fnhe_stamp = jiffies;
 		ret = true;
-	} else {
-		/* Routes we intend to cache in nexthop exception have
-		 * the DST_NOCACHE bit clear.  However, if we are
-		 * unsuccessful at storing this route into the cache
-		 * we really need to set it.
-		 */
-		rt->dst.flags |= DST_NOCACHE;
 	}
 	spin_unlock_bh(&fnhe_lock);
 
@@ -1215,13 +1210,7 @@ static bool rt_cache_route(struct fib_nh *nh, struct rtable *rt)
 		if (orig)
 			rt_free(orig);
 	} else {
-		/* Routes we intend to cache in the FIB nexthop have
-		 * the DST_NOCACHE bit clear.  However, if we are
-		 * unsuccessful at storing this route into the cache
-		 * we really need to set it.
-		 */
 nocache:
-		rt->dst.flags |= DST_NOCACHE;
 		ret = false;
 	}
 
@@ -1284,8 +1273,10 @@ static void rt_set_nexthop(struct rtable *rt, __be32 daddr,
 	if (fi) {
 		struct fib_nh *nh = &FIB_RES_NH(*res);
 
-		if (nh->nh_gw && nh->nh_scope == RT_SCOPE_LINK)
+		if (nh->nh_gw && nh->nh_scope == RT_SCOPE_LINK) {
 			rt->rt_gateway = nh->nh_gw;
+			rt->rt_uses_gateway = 1;
+		}
 		dst_init_metrics(&rt->dst, fi->fib_metrics, true);
 #ifdef CONFIG_IP_ROUTE_CLASSID
 		rt->dst.tclassid = nh->nh_tclassid;
@@ -1294,8 +1285,18 @@ static void rt_set_nexthop(struct rtable *rt, __be32 daddr,
 			cached = rt_bind_exception(rt, fnhe, daddr);
 		else if (!(rt->dst.flags & DST_NOCACHE))
 			cached = rt_cache_route(nh, rt);
-	}
-	if (unlikely(!cached))
+		if (unlikely(!cached)) {
+			/* Routes we intend to cache in nexthop exception or
+			 * FIB nexthop have the DST_NOCACHE bit clear.
+			 * However, if we are unsuccessful at storing this
+			 * route into the cache we really need to set it.
+			 */
+			rt->dst.flags |= DST_NOCACHE;
+			if (!rt->rt_gateway)
+				rt->rt_gateway = daddr;
+			rt_add_uncached_list(rt);
+		}
+	} else
 		rt_add_uncached_list(rt);
 
 #ifdef CONFIG_IP_ROUTE_CLASSID
@@ -1363,6 +1364,7 @@ static int ip_route_input_mc(struct sk_buff *skb, __be32 daddr, __be32 saddr,
 	rth->rt_iif	= 0;
 	rth->rt_pmtu	= 0;
 	rth->rt_gateway	= 0;
+	rth->rt_uses_gateway = 0;
 	INIT_LIST_HEAD(&rth->rt_uncached);
 	if (our) {
 		rth->dst.input= ip_local_deliver;
@@ -1432,7 +1434,6 @@ static int __mkroute_input(struct sk_buff *skb,
 		return -EINVAL;
 	}
 
-
 	err = fib_validate_source(skb, saddr, daddr, tos, FIB_RES_OIF(*res),
 				  in_dev->dev, in_dev, &itag);
 	if (err < 0) {
@@ -1488,6 +1489,7 @@ static int __mkroute_input(struct sk_buff *skb,
 	rth->rt_iif 	= 0;
 	rth->rt_pmtu	= 0;
 	rth->rt_gateway	= 0;
+	rth->rt_uses_gateway = 0;
 	INIT_LIST_HEAD(&rth->rt_uncached);
 
 	rth->dst.input = ip_forward;
@@ -1658,6 +1660,7 @@ local_input:
 	rth->rt_iif	= 0;
 	rth->rt_pmtu	= 0;
 	rth->rt_gateway	= 0;
+	rth->rt_uses_gateway = 0;
 	INIT_LIST_HEAD(&rth->rt_uncached);
 	if (res.type == RTN_UNREACHABLE) {
 		rth->dst.input= ip_error;
@@ -1832,6 +1835,7 @@ add:
 	rth->rt_iif	= orig_oif ? : 0;
 	rth->rt_pmtu	= 0;
 	rth->rt_gateway = 0;
+	rth->rt_uses_gateway = 0;
 	INIT_LIST_HEAD(&rth->rt_uncached);
 
 	RT_CACHE_STAT_INC(out_slow_tot);
@@ -2110,6 +2114,7 @@ struct dst_entry *ipv4_blackhole_route(struct net *net, struct dst_entry *dst_or
 		rt->rt_flags = ort->rt_flags;
 		rt->rt_type = ort->rt_type;
 		rt->rt_gateway = ort->rt_gateway;
+		rt->rt_uses_gateway = ort->rt_uses_gateway;
 
 		INIT_LIST_HEAD(&rt->rt_uncached);
 
@@ -2188,7 +2193,7 @@ static int rt_fill_info(struct net *net,  __be32 dst, __be32 src,
 		if (nla_put_be32(skb, RTA_PREFSRC, fl4->saddr))
 			goto nla_put_failure;
 	}
-	if (rt->rt_gateway &&
+	if (rt->rt_uses_gateway &&
 	    nla_put_be32(skb, RTA_GATEWAY, rt->rt_gateway))
 		goto nla_put_failure;
 
diff --git a/net/ipv4/xfrm4_policy.c b/net/ipv4/xfrm4_policy.c
index 681ea2f..05c5ab8 100644
--- a/net/ipv4/xfrm4_policy.c
+++ b/net/ipv4/xfrm4_policy.c
@@ -91,6 +91,7 @@ static int xfrm4_fill_dst(struct xfrm_dst *xdst, struct net_device *dev,
 					      RTCF_LOCAL);
 	xdst->u.rt.rt_type = rt->rt_type;
 	xdst->u.rt.rt_gateway = rt->rt_gateway;
+	xdst->u.rt.rt_uses_gateway = rt->rt_uses_gateway;
 	xdst->u.rt.rt_pmtu = rt->rt_pmtu;
 	INIT_LIST_HEAD(&xdst->u.rt.rt_uncached);
 
-- 
1.7.3.4
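
The split between "address to resolve" and "route is via a gateway" can be sketched in userspace as follows. The struct and helpers are hypothetical models, not the kernel definitions, and addresses are plain integers.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the two route fields discussed in this patch. */
struct sketch_rtable {
	uint32_t gateway;	/* next-hop address used for neighbour lookup */
	bool uses_gateway;	/* true only when the route is via a gateway */
};

/* Address to resolve: the stored next hop if any, else the destination. */
static uint32_t sketch_nexthop(const struct sketch_rtable *rt, uint32_t daddr)
{
	return rt->gateway ? rt->gateway : daddr;
}

/* Strict source routing is refused only for real gateways. */
static bool sketch_strict_route_ok(const struct sketch_rtable *rt)
{
	return !rt->uses_gateway;
}
```

In the IPVS-DR case the gateway field would hold the real server address with uses_gateway left at 0: the neighbour lookup targets the real server while the packet keeps the virtual IP as destination, and strict-route checks are not tripped.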

* [PATCH net 5/6] ipv4: Add FLOWI_FLAG_KNOWN_NH
From: Julian Anastasov @ 2012-10-07 11:26 UTC (permalink / raw)
  To: netdev
In-Reply-To: <1349609168-9848-1-git-send-email-ja@ssi.bg>

	Add flag to request that output route should be
returned with known rt_gateway, in case we want to use
it as nexthop for neighbour resolving.

	The returned route can be cached as follows:

- in NH exception: because the cached routes are not shared
	with other destinations
- in FIB NH: when using gateway because all destinations for
	NH share same gateway

	As a last option, to return rt_gateway!=0 we have to
set DST_NOCACHE.

Signed-off-by: Julian Anastasov <ja@ssi.bg>
---
 include/net/flow.h |    1 +
 net/ipv4/route.c   |   11 ++++++++++-
 2 files changed, 11 insertions(+), 1 deletions(-)

diff --git a/include/net/flow.h b/include/net/flow.h
index e1dd508..628e11b 100644
--- a/include/net/flow.h
+++ b/include/net/flow.h
@@ -21,6 +21,7 @@ struct flowi_common {
 	__u8	flowic_flags;
 #define FLOWI_FLAG_ANYSRC		0x01
 #define FLOWI_FLAG_CAN_SLEEP		0x02
+#define FLOWI_FLAG_KNOWN_NH		0x04
 	__u32	flowic_secid;
 };
 
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index eaf9575..7f20b6e 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -1763,6 +1763,7 @@ static struct rtable *__mkroute_output(const struct fib_result *res,
 	struct in_device *in_dev;
 	u16 type = res->type;
 	struct rtable *rth;
+	bool do_cache;
 
 	in_dev = __in_dev_get_rcu(dev_out);
 	if (!in_dev)
@@ -1799,6 +1800,7 @@ static struct rtable *__mkroute_output(const struct fib_result *res,
 	}
 
 	fnhe = NULL;
+	do_cache = fi != NULL;
 	if (fi) {
 		struct rtable __rcu **prth;
 		struct fib_nh *nh = &FIB_RES_NH(*res);
@@ -1807,6 +1809,13 @@ static struct rtable *__mkroute_output(const struct fib_result *res,
 		if (fnhe)
 			prth = &fnhe->fnhe_rth;
 		else {
+			if (unlikely(fl4->flowi4_flags &
+				     FLOWI_FLAG_KNOWN_NH &&
+				     !(nh->nh_gw &&
+				       nh->nh_scope == RT_SCOPE_LINK))) {
+				do_cache = false;
+				goto add;
+			}
 			if (!nh->nh_pcpu_rth_output)
 				goto add;
 			prth = __this_cpu_ptr(nh->nh_pcpu_rth_output);
@@ -1822,7 +1831,7 @@ add:
 	rth = rt_dst_alloc(dev_out,
 			   IN_DEV_CONF_GET(in_dev, NOPOLICY),
 			   IN_DEV_CONF_GET(in_dev, NOXFRM),
-			   fi);
+			   do_cache);
 	if (!rth)
 		return ERR_PTR(-ENOBUFS);
 
-- 
1.7.3.4

^ permalink raw reply related
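
The caching test added to __mkroute_output() in patch 5/6 above can be modelled in userspace. This is only a sketch: struct nh_model and may_cache_in_fib_nh() are hypothetical names, FLOWI_FLAG_KNOWN_NH uses the value from the patch, and RT_SCOPE_LINK is assumed to be 253 as in linux/rtnetlink.h. It covers only the new check, not exceptions or the per-cpu cache lookup.

```c
#include <stdbool.h>
#include <stdint.h>

#define FLOWI_FLAG_KNOWN_NH 0x04 /* value taken from the patch */
#define RT_SCOPE_LINK       253  /* assumed, as in linux/rtnetlink.h */

/* Hypothetical, reduced stand-in for struct fib_nh. */
struct nh_model {
	uint32_t nh_gw;    /* gateway address, 0 when the NH has none */
	uint8_t  nh_scope; /* scope of the nexthop */
};

/*
 * Returns true when a route may be cached in the FIB NH.
 * With FLOWI_FLAG_KNOWN_NH set, caching is allowed only if the NH
 * has a gateway with link scope, since then every destination
 * using this NH shares the same rt_gateway.
 */
bool may_cache_in_fib_nh(uint8_t flow_flags, const struct nh_model *nh)
{
	if ((flow_flags & FLOWI_FLAG_KNOWN_NH) &&
	    !(nh->nh_gw && nh->nh_scope == RT_SCOPE_LINK))
		return false; /* patch: do_cache = false; goto add; */
	return true;
}
```

Note that the condition in the patch relies on `&` binding tighter than `&&`; the sketch adds parentheses to make that grouping explicit.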

* [PATCH net 6/6] ipvs: fix ARP resolving for direct routing mode
From: Julian Anastasov @ 2012-10-07 11:26 UTC (permalink / raw)
  To: netdev
In-Reply-To: <1349609168-9848-1-git-send-email-ja@ssi.bg>

	After the change "Make neigh lookups directly in output packet path"
(commit a263b30936), IPVS cannot reach the real server in DR mode
because we resolve the destination address from the IP header, not
from the route neighbour. Use the new FLOWI_FLAG_KNOWN_NH flag to
request output routes with a known nexthop, so that the nexthop takes
precedence when resolving.

Signed-off-by: Julian Anastasov <ja@ssi.bg>
---
 net/netfilter/ipvs/ip_vs_xmit.c |    6 +++++-
 1 files changed, 5 insertions(+), 1 deletions(-)

diff --git a/net/netfilter/ipvs/ip_vs_xmit.c b/net/netfilter/ipvs/ip_vs_xmit.c
index 56f6d5d..cc4c809 100644
--- a/net/netfilter/ipvs/ip_vs_xmit.c
+++ b/net/netfilter/ipvs/ip_vs_xmit.c
@@ -50,6 +50,7 @@ enum {
 				      * local
 				      */
 	IP_VS_RT_MODE_CONNECT	= 8, /* Always bind route to saddr */
+	IP_VS_RT_MODE_KNOWN_NH	= 16,/* Route via remote addr */
 };
 
 /*
@@ -113,6 +114,8 @@ static struct rtable *do_output_route4(struct net *net, __be32 daddr,
 	fl4.daddr = daddr;
 	fl4.saddr = (rt_mode & IP_VS_RT_MODE_CONNECT) ? *saddr : 0;
 	fl4.flowi4_tos = rtos;
+	fl4.flowi4_flags = (rt_mode & IP_VS_RT_MODE_KNOWN_NH) ?
+			   FLOWI_FLAG_KNOWN_NH : 0;
 
 retry:
 	rt = ip_route_output_key(net, &fl4);
@@ -1061,7 +1064,8 @@ ip_vs_dr_xmit(struct sk_buff *skb, struct ip_vs_conn *cp,
 	if (!(rt = __ip_vs_get_out_rt(skb, cp->dest, cp->daddr.ip,
 				      RT_TOS(iph->tos),
 				      IP_VS_RT_MODE_LOCAL |
-					IP_VS_RT_MODE_NON_LOCAL, NULL)))
+				      IP_VS_RT_MODE_NON_LOCAL |
+				      IP_VS_RT_MODE_KNOWN_NH, NULL)))
 		goto tx_error_icmp;
 	if (rt->rt_flags & RTCF_LOCAL) {
 		ip_rt_put(rt);
-- 
1.7.3.4

^ permalink raw reply related
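
The translation from IPVS route mode bits to flow flags in patch 6/6 above can be sketched as a small pure function. The helper names are hypothetical. IP_VS_RT_MODE_CONNECT, IP_VS_RT_MODE_KNOWN_NH and FLOWI_FLAG_KNOWN_NH use the values visible in the patches; the values for IP_VS_RT_MODE_LOCAL and IP_VS_RT_MODE_NON_LOCAL are not shown in the diff and are assumed here.

```c
#define IP_VS_RT_MODE_LOCAL     1  /* assumed, not shown in the diff */
#define IP_VS_RT_MODE_NON_LOCAL 2  /* assumed, not shown in the diff */
#define IP_VS_RT_MODE_CONNECT   8  /* from the patch context */
#define IP_VS_RT_MODE_KNOWN_NH  16 /* added by the patch */

#define FLOWI_FLAG_KNOWN_NH     0x04 /* from patch 5/6 */

/* Hypothetical helper: flow flags to use for a given rt_mode. */
unsigned char flow_flags_for_rt_mode(int rt_mode)
{
	return (rt_mode & IP_VS_RT_MODE_KNOWN_NH) ? FLOWI_FLAG_KNOWN_NH : 0;
}

/* Hypothetical helper: the rt_mode that ip_vs_dr_xmit() passes after the patch. */
int dr_xmit_rt_mode(void)
{
	return IP_VS_RT_MODE_LOCAL |
	       IP_VS_RT_MODE_NON_LOCAL |
	       IP_VS_RT_MODE_KNOWN_NH;
}
```

Only the direct routing transmitter requests the flag in this patch, so other callers of do_output_route4() keep flowi4_flags at 0.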

* Re: [1/2] PCI-Express Non-Transparent Bridge Support
From: Jakub Kicinski @ 2012-10-07 12:13 UTC (permalink / raw)
  To: Jon Mason; +Cc: linux-kernel, netdev, linux-pci, Dave Jiang, Nicholas Bellinger
In-Reply-To: <1349213177-9985-2-git-send-email-jon.mason@intel.com>

Hi,

it's good to see some NTB code getting into mainline! I have a few comments
though.

On Tue, 02 Oct 2012 21:26:16 -0000, Jon Mason <jon.mason@intel.com>
wrote:

[...]
>+/**
>+ * ntb_write_local_spad() - write to the secondary scratchpad register
>+ * @ndev: pointer to ntb_device instance
>+ * @idx: index to the scratchpad register, 0 based
>+ * @val: the data value to put into the register
>+ *
>+ * This function allows writing of a 32bit value to the indexed scratchpad
>+ * register. The register resides on the secondary (external) side.
>+ *
>+ * RETURNS: An appropriate -ERRNO error value on error, or zero for success.
>+ */
[...]
>+/**
>+ * ntb_write_remote_spad() - write to the secondary scratchpad register
>+ * @ndev: pointer to ntb_device instance
>+ * @idx: index to the scratchpad register, 0 based
>+ * @val: the data value to put into the register
>+ *
>+ * This function allows writing of a 32bit value to the indexed scratchpad
>+ * register. The register resides on the secondary (external) side.
>+ *
>+ * RETURNS: An appropriate -ERRNO error value on error, or zero for success.
>+ */

Those comments look suspiciously similar. I think one of the functions
writes to the primary scratchpad?

[...]
>+/**
>+ * ntb_read_local_spad() - read from the primary scratchpad register
>+ * @ndev: pointer to ntb_device instance
>+ * @idx: index to scratchpad register, 0 based
>+ * @val: pointer to 32bit integer for storing the register value
>+ *
>+ * This function allows reading of the 32bit scratchpad register on
>+ * the primary (internal) side.
>+ *
>+ * RETURNS: An appropriate -ERRNO error value on error, or zero for success.
>+ */
[...]
>+/**
>+ * ntb_read_remote_spad() - read from the primary scratchpad register
>+ * @ndev: pointer to ntb_device instance
>+ * @idx: index to scratchpad register, 0 based
>+ * @val: pointer to 32bit integer for storing the register value
>+ *
>+ * This function allows reading of the 32bit scratchpad register on
>+ * the primary (internal) side.
>+ *
>+ * RETURNS: An appropriate -ERRNO error value on error, or zero for success.
>+ */

Same here.

[...]
>+static int ntb_setup_msix(struct ntb_device *ndev)
>+{
>+	struct pci_dev *pdev = ndev->pdev;
>+	struct msix_entry *msix;
>+	int msix_entries;
>+	int rc, i, pos;
>+	u16 val;
>+
>+	pos = pci_find_capability(pdev, PCI_CAP_ID_MSIX);
>+	if (!pos) {
>+		rc = -EIO;
>+		goto err;
>+	}
>+
>+	rc = pci_read_config_word(pdev, pos + PCI_MSIX_FLAGS, &val);
>+	if (rc)
>+		goto err;
>+
>+	msix_entries = msix_table_size(val);
>+	if (msix_entries > ndev->limits.msix_cnt) {
>+		rc = -EINVAL;
>+		goto err;
>+	}
>+
>+	ndev->msix_entries = kmalloc(sizeof(struct msix_entry) * msix_entries,
>+				     GFP_KERNEL);
>+	if (!ndev->msix_entries) {
>+		rc = -ENOMEM;
>+		goto err;
>+	}
>+
>+	for (i = 0; i < msix_entries; i++)
>+		ndev->msix_entries[i].entry = i;
>+
>+	rc = pci_enable_msix(pdev, ndev->msix_entries, msix_entries);
>+	if (rc < 0)
>+		goto err1;
>+	if (rc > 0) {

rc > 0 doesn't mean that vectors were allocated. Have a look at the
example in Documentation/PCI/MSI-HOWTO.txt.

>+		/* On SNB, the link interrupt is always tied to 4th vector.  If
>+		 * we can't get all 4, then we can't use MSI-X.
>+		 */
>+		if (ndev->hw_type != BWD_HW) {
>+			rc = -EIO;
>+			goto err1;
>+		}

This looks fragile: what if msix_table_size(val) was < 4?

>+
>+		dev_warn(&pdev->dev,
>+			 "Only %d MSI-X vectors.  Limiting the number of queues to that number.\n",
>+			 rc);
>+		msix_entries = rc;
>+	}
>+
>+	for (i = 0; i < msix_entries; i++) {
>+		msix = &ndev->msix_entries[i];
>+		WARN_ON(!msix->vector);
>+
>+		/* Use the last MSI-X vector for Link status */
>+		if (ndev->hw_type == BWD_HW) {
>+			rc = request_irq(msix->vector, bwd_callback_msix_irq, 0,
>+					 "ntb-callback-msix", &ndev->db_cb[i]);
>+			if (rc)
>+				goto err2;
>+		} else {
>+			if (i == msix_entries - 1) {
>+				rc = request_irq(msix->vector,
>+						 xeon_event_msix_irq, 0,
>+						 "ntb-event-msix", ndev);
>+				if (rc)
>+					goto err2;
>+			} else {
>+				rc = request_irq(msix->vector,
>+						 xeon_callback_msix_irq, 0,
>+						 "ntb-callback-msix",
>+						 &ndev->db_cb[i]);
>+				if (rc)
>+					goto err2;
>+			}
>+		}
>+	}
>+
>+	ndev->num_msix = msix_entries;
>+	if (ndev->hw_type == BWD_HW)
>+		ndev->max_cbs = msix_entries;
>+	else
>+		ndev->max_cbs = msix_entries - 1;
>+
>+	return 0;
>+
>+err2:
>+	while (--i >= 0) {
>+		msix = &ndev->msix_entries[i];
>+		if (ndev->hw_type != BWD_HW && i == ndev->num_msix - 1)
>+			free_irq(msix->vector, ndev);
>+		else
>+			free_irq(msix->vector, &ndev->db_cb[i]);
>+	}
>+	pci_disable_msix(pdev);
>+err1:
>+	kfree(ndev->msix_entries);
>+	dev_err(&pdev->dev, "Error allocating MSI-X interrupt\n");
>+err:
>+	ndev->num_msix = 0;
>+	return rc;
>+}

Thanks for your work,

  -- Kuba

^ permalink raw reply
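
The review remark above about rc > 0 can be illustrated with a userspace model of the retry pattern. This is a sketch under stated assumptions: struct fake_dev, fake_enable_msix() and enable_msix_retry() are hypothetical, and fake_enable_msix() only mimics the contract the reviewer refers to, where a positive return value reports how many vectors could be allocated while nothing has been allocated yet.

```c
#include <errno.h>

/* Hypothetical stand-in for a device with a limited vector budget. */
struct fake_dev {
	int available; /* vectors the platform can grant */
	int enabled;   /* vectors actually enabled */
};

/*
 * Mimics the enable call's contract:
 *   0  -> all nvec vectors allocated
 *  <0  -> hard failure
 *  >0  -> nothing allocated; value is the count that could be allocated
 */
int fake_enable_msix(struct fake_dev *d, int nvec)
{
	if (nvec <= 0)
		return -EINVAL;
	if (d->available <= 0)
		return -ENOSPC;
	if (nvec > d->available)
		return d->available;
	d->enabled = nvec;
	return 0;
}

/*
 * Retry with the smaller count until the call succeeds or the count
 * drops below the minimum the driver can work with.
 * Returns the number of enabled vectors, or a negative errno.
 */
int enable_msix_retry(struct fake_dev *d, int want, int min)
{
	int nvec = want;

	while (nvec >= min) {
		int rc = fake_enable_msix(d, nvec);

		if (rc > 0) {
			nvec = rc; /* nothing enabled yet, ask again */
			continue;
		}
		if (rc < 0)
			return rc;
		return nvec;
	}
	return -ENOSPC;
}
```

Passing the hardware's minimum as `min` (for example 4 where the link interrupt is tied to the 4th vector) also addresses the second remark, since the loop then fails cleanly instead of continuing with too few vectors.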

* Re: [PATCH] Packet mmap : allow the user to choose the offset of the tx payload.
From: Daniel Borkmann @ 2012-10-07 12:44 UTC (permalink / raw)
  To: pchavent; +Cc: davem, edumazet, xemul, herbert, netdev, johann.baudy, uaca
In-Reply-To: <CAD6jFUS+ovW79iOZxN-ptTZiTF9VsYnoLA-aWFSBX-zUR8OKGQ@mail.gmail.com>

On Sun, Oct 7, 2012 at 12:50 PM, Daniel Borkmann
<danborkmann@iogearbox.net> wrote:
> On Sat, Oct 6, 2012 at 9:43 AM, pchavent <Paul.Chavent@onera.fr> wrote:
>> On Fri, 5 Oct 2012 21:37:58 +0200, Daniel Borkmann wrote:
>>> On Fri, Oct 5, 2012 at 9:21 PM, pchavent <Paul.Chavent@onera.fr> wrote:
>>>> On Fri, 5 Oct 2012 16:17:12 +0200, Daniel Borkmann wrote:
>>>>> On Fri, Oct 5, 2012 at 3:10 PM, Paul Chavent <Paul.Chavent@onera.fr>
>>>>> wrote:
>>>>>>
>>>>>>
>>>>>> The tx offset of packet mmap tx ring used to be :
>>>>>> (TPACKET2_HDRLEN - sizeof(struct sockaddr_ll))
>>>>>>
>>>>>> The problem is that depending on the usage of SOCK_DGRAM or
>>>>>> SOCK_RAW, the payload could be aligned or not.
>>>>>>
>>>>>> This patch allow to let the user give an offset for it's tx
>>>>>> payload if he desires.
>>>>>>
>>>>>> Signed-off-by: Paul Chavent <paul.chavent@onera.fr>
>>>>>
>>>>>
>>>>> Can you provide an example when it doesn't hit TPACKET_ALIGNMENT?
>>>>
>>>>
>>>> When we use tx ring, the user have to write at (TPACKET_HDRLEN -
>>>> sizeof(struct sockaddr_ll))
>>>>
>>>> This adress is aligned on TPACKET_ALIGNMENT since
>>>> TPACKET_HDRLEN = (TPACKET_ALIGN(sizeof(struct tpacket_hdr)) +
>>>> sizeof(struct
>>>> sockaddr_ll))
>>>>
>>>> When we use the tx ring with SOCK_RAW option, the mac header is aligned
>>>> on
>>>> TPACKET_ALIGNMENT, but not the payload (14 bytes away).
>>>
>>>
>>> Okay, I'm confused about your intentions, maybe I'm missing something.
>>> The man-page of packet(7) clearly says:
>>>
>>> The socket_type is either SOCK_RAW for raw packets *including* the
>>> link level header or SOCK_DGRAM for cooked packets with  the link
>>> level header *removed*.
>>>
>>> So this is perfectly intended behavior of PF_PACKET.
>>>
>>> Cheers,
>>>
>>> Daniel
>>
>>
>> Yes, i also expect to be able to include the link level header when i use
>> SOCK_RAW.
>>
>> My intention is to send a frame with this payload (for example) :
>> typedef struct
>> {
>>   double   ts;
>>   uint64_t foo;
>> } test_t;
>>
>> So i get a pointer to the raw packet :
>> void * raw_packet = frame_base + (TPACKET_HDRLEN - sizeof(struct
>> sockaddr_ll));
>>
>> I cook the header :
>> memcpy(raw_packet +  0, dst_addr, sizeof(dst_addr));
>> memcpy(raw_packet +  6, src_addr, sizeof(src_addr));
>> memcpy(raw_packet + 12, type    , sizeof(type));
>>
>> Then i get a pointer to the beginning of payload :
>> test_t * payload = raw_packet + 14;
>>
>> Here payload is at 58 bytes from the beginning of the frame.
>>
>> Then i fill the payload :
>> payload->ts = 1.0;
>> payload->foo = 2;
>> ...
>>
>> These are misaligned accesses.
>>
>> I don't care to fill the cooked header if it's misaligned, but i would like
>> to be able to fill the frame directly in the ring buffer being on aligned
>> boundaries.

Just a minor remark: speaking about your patch, in the case of
SOCK_RAW I think you rather might want to care whether your *mac
header* starts at an aligned offset or not, since the kernel doesn't
care about your particular payload, but about the frame as a whole
that it needs to process.

> Okay.
>
> Maybe what you could do in a new version of your patch is to introduce
> a TP_STATUS flag, e.g. TP_STATUS_SEND_HAS_OFF that you pass along
> binary or'ed with the commonly used flags, and then you can fill
> tp_mac resp. tp_net with offsets. By that, you won't break legacy
> stuff.

Also, note that you should hold off on your submission until net-next reopens.

^ permalink raw reply
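
The alignment arithmetic discussed in the thread above can be sketched as follows. RING_ALIGNMENT and RING_ALIGN mirror TPACKET_ALIGNMENT and TPACKET_ALIGN (assumed to be 16-byte alignment), ETH_HDR_LEN is the 14-byte Ethernet header from the quoted example, and the function names are hypothetical. The offsets used are illustrative and are not the ones from the quoted mail.

```c
#include <stddef.h>

#define RING_ALIGNMENT ((size_t)16) /* mirrors TPACKET_ALIGNMENT (assumed) */
#define RING_ALIGN(x)  (((x) + RING_ALIGNMENT - 1) & ~(RING_ALIGNMENT - 1))
#define ETH_HDR_LEN    ((size_t)14) /* dst MAC + src MAC + ethertype */

/* With SOCK_RAW the payload follows the link-level header. */
size_t payload_offset(size_t mac_off)
{
	return mac_off + ETH_HDR_LEN;
}

int is_aligned(size_t off, size_t align)
{
	return off % align == 0;
}

/*
 * Smallest MAC header offset >= min_mac_off that places the payload
 * on a RING_ALIGNMENT boundary.
 */
size_t mac_offset_for_aligned_payload(size_t min_mac_off)
{
	return RING_ALIGN(min_mac_off + ETH_HDR_LEN) - ETH_HDR_LEN;
}
```

The trade-off is the one raised in the reply: aligning the payload this way leaves the MAC header itself only 2-byte aligned, and it is the frame as a whole that the kernel has to process.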

* [PATCH] vxlan: remove unused including <linux/version.h>
From: Wei Yongjun @ 2012-10-07 13:23 UTC (permalink / raw)
  To: shemminger; +Cc: yongjun_wei, netdev

From: Wei Yongjun <yongjun_wei@trendmicro.com.cn>

Remove the include of <linux/version.h>, which is not needed.

This patch was auto-generated by the dpatch engine.
(https://github.com/weiyj/dpatch)

Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
---
 drivers/net/vxlan.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
index 51de9ed..6f95580 100644
--- a/drivers/net/vxlan.c
+++ b/drivers/net/vxlan.c
@@ -28,7 +28,6 @@
 #include <linux/igmp.h>
 #include <linux/etherdevice.h>
 #include <linux/if_ether.h>
-#include <linux/version.h>
 #include <linux/hash.h>
 #include <net/ip.h>
 #include <net/icmp.h>

^ permalink raw reply related

* Re: [PATCH net 3/6] ipv4: add check if nh_pcpu_rth_output is allocated
From: Eric Dumazet @ 2012-10-07 13:34 UTC (permalink / raw)
  To: Julian Anastasov; +Cc: netdev
In-Reply-To: <1349609168-9848-4-git-send-email-ja@ssi.bg>

On Sun, 2012-10-07 at 14:26 +0300, Julian Anastasov wrote:
> 	Avoid NULL ptr dereference and caching if
> nh_pcpu_rth_output is not allocated.
> 
> Signed-off-by: Julian Anastasov <ja@ssi.bg>
> ---
>  net/ipv4/route.c |   12 +++++++++---
>  1 files changed, 9 insertions(+), 3 deletions(-)
> 
> diff --git a/net/ipv4/route.c b/net/ipv4/route.c
> index 488a8bb..0a600cc 100644
> --- a/net/ipv4/route.c
> +++ b/net/ipv4/route.c
> @@ -1798,18 +1798,24 @@ static struct rtable *__mkroute_output(const struct fib_result *res,
>  	fnhe = NULL;
>  	if (fi) {
>  		struct rtable __rcu **prth;
> +		struct fib_nh *nh = &FIB_RES_NH(*res);
>  
> -		fnhe = find_exception(&FIB_RES_NH(*res), fl4->daddr);
> +		fnhe = find_exception(nh, fl4->daddr);
>  		if (fnhe)
>  			prth = &fnhe->fnhe_rth;
> -		else
> -			prth = __this_cpu_ptr(FIB_RES_NH(*res).nh_pcpu_rth_output);
> +		else {
> +			if (!nh->nh_pcpu_rth_output)
> +				goto add;
> +			prth = __this_cpu_ptr(nh->nh_pcpu_rth_output);
> +		}
>  		rth = rcu_dereference(*prth);
>  		if (rt_cache_valid(rth)) {
>  			dst_hold(&rth->dst);
>  			return rth;
>  		}
>  	}
> +
> +add:
>  	rth = rt_dst_alloc(dev_out,
>  			   IN_DEV_CONF_GET(in_dev, NOPOLICY),
>  			   IN_DEV_CONF_GET(in_dev, NOXFRM),

An alternative would be to make sure the allocation succeeded in
fib_create_info(), but I have no idea of the maximum number of fib_info
entries a complex routing setup might need.

I guess a typical machine needs fewer than 30 fib_info...

^ permalink raw reply

* [PATCH] ptp: use list_move instead of list_del/list_add
From: Wei Yongjun @ 2012-10-07 13:41 UTC (permalink / raw)
  To: linux-net-drivers, bhutchings; +Cc: yongjun_wei, netdev

From: Wei Yongjun <yongjun_wei@trendmicro.com.cn>

Use list_move() instead of list_del() + list_add().

This patch was auto-generated by the dpatch engine.
(https://github.com/weiyj/dpatch)

Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
---
 drivers/net/ethernet/sfc/ptp.c | 9 +++------
 1 file changed, 3 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/sfc/ptp.c b/drivers/net/ethernet/sfc/ptp.c
index 5b3dd02..0767043f 100644
--- a/drivers/net/ethernet/sfc/ptp.c
+++ b/drivers/net/ethernet/sfc/ptp.c
@@ -640,8 +640,7 @@ static void efx_ptp_drop_time_expired_events(struct efx_nic *efx)
 			evt = list_entry(cursor, struct efx_ptp_event_rx,
 					 link);
 			if (time_after(jiffies, evt->expiry)) {
-				list_del(&evt->link);
-				list_add(&evt->link, &ptp->evt_free_list);
+				list_move(&evt->link, &ptp->evt_free_list);
 				netif_warn(efx, hw, efx->net_dev,
 					   "PTP rx event dropped\n");
 			}
@@ -684,8 +683,7 @@ static enum ptp_packet_state efx_ptp_match_rx(struct efx_nic *efx,
 
 			match->state = PTP_PACKET_STATE_MATCHED;
 			rc = PTP_PACKET_STATE_MATCHED;
-			list_del(&evt->link);
-			list_add(&evt->link, &ptp->evt_free_list);
+			list_move(&evt->link, &ptp->evt_free_list);
 			break;
 		}
 	}
@@ -820,8 +818,7 @@ static int efx_ptp_stop(struct efx_nic *efx)
 	/* Drop any pending receive events */
 	spin_lock_bh(&efx->ptp_data->evt_lock);
 	list_for_each_safe(cursor, next, &efx->ptp_data->evt_list) {
-		list_del(cursor);
-		list_add(cursor, &efx->ptp_data->evt_free_list);
+		list_move(cursor, &efx->ptp_data->evt_free_list);
 	}
 	spin_unlock_bh(&efx->ptp_data->evt_lock);
 

^ permalink raw reply related
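
The substitution made by the ptp patch above can be checked against a minimal userspace model of a circular doubly linked list. This is a sketch, not the kernel's list.h: all names below are hypothetical, chosen to differ from the kernel helpers.

```c
/* Minimal circular doubly linked list, modelled on the kernel's idea. */
struct list_node {
	struct list_node *next, *prev;
};

void list_init(struct list_node *head)
{
	head->next = head;
	head->prev = head;
}

/* Unlink an entry from whatever list it is on. */
void list_unlink(struct list_node *e)
{
	e->prev->next = e->next;
	e->next->prev = e->prev;
}

/* Insert an entry right after the head. */
void list_add_front(struct list_node *e, struct list_node *head)
{
	e->next = head->next;
	e->prev = head;
	head->next->prev = e;
	head->next = e;
}

/* Unlink followed by insert, as one operation. */
void list_move_front(struct list_node *e, struct list_node *head)
{
	list_unlink(e);
	list_add_front(e, head);
}

int list_count(const struct list_node *head)
{
	const struct list_node *p;
	int n = 0;

	for (p = head->next; p != head; p = p->next)
		n++;
	return n;
}
```

In this model the combined call leaves both lists in the same state as the two separate calls. One difference in the real kernel helpers is that list_del() also poisons the removed entry's pointers, which is redundant when the entry is re-added immediately.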

* [PATCH] openvswitch: using nla_for_each_X to simplify the code
From: Wei Yongjun @ 2012-10-07 13:42 UTC (permalink / raw)
  To: jesse-l0M0P4e3n4LQT0dZR+AlfA, davem-fT/PcQaiUtIeIZ0/mPfg9Q
  Cc: dev-yBygre7rU0TnMu66kgdUjQ, netdev-u79uwXL29TY76Z2rM5mHXA,
	yongjun_wei-zrsr2BFq86L20UzCJQGyNP8+0UxHXcjY

From: Wei Yongjun <yongjun_wei-zrsr2BFq86L20UzCJQGyNP8+0UxHXcjY@public.gmane.org>

Use nla_for_each_nested() or nla_for_each_attr()
to simplify the code.

This patch was auto-generated by the dpatch engine.
(https://github.com/weiyj/dpatch)

Signed-off-by: Wei Yongjun <yongjun_wei-zrsr2BFq86L20UzCJQGyNP8+0UxHXcjY@public.gmane.org>
---
 net/openvswitch/actions.c | 9 +++------
 1 file changed, 3 insertions(+), 6 deletions(-)

diff --git a/net/openvswitch/actions.c b/net/openvswitch/actions.c
index 0811447..b67594f 100644
--- a/net/openvswitch/actions.c
+++ b/net/openvswitch/actions.c
@@ -288,8 +288,7 @@ static int output_userspace(struct datapath *dp, struct sk_buff *skb,
 	upcall.userdata = NULL;
 	upcall.portid = 0;
 
-	for (a = nla_data(attr), rem = nla_len(attr); rem > 0;
-		 a = nla_next(a, &rem)) {
+	nla_for_each_nested(a, attr, rem) {
 		switch (nla_type(a)) {
 		case OVS_USERSPACE_ATTR_USERDATA:
 			upcall.userdata = a;
@@ -311,8 +310,7 @@ static int sample(struct datapath *dp, struct sk_buff *skb,
 	const struct nlattr *a;
 	int rem;
 
-	for (a = nla_data(attr), rem = nla_len(attr); rem > 0;
-		 a = nla_next(a, &rem)) {
+	nla_for_each_nested(a, attr, rem) {
 		switch (nla_type(a)) {
 		case OVS_SAMPLE_ATTR_PROBABILITY:
 			if (net_random() >= nla_get_u32(a))
@@ -371,8 +369,7 @@ static int do_execute_actions(struct datapath *dp, struct sk_buff *skb,
 	const struct nlattr *a;
 	int rem;
 
-	for (a = attr, rem = len; rem > 0;
-	     a = nla_next(a, &rem)) {
+	nla_for_each_attr(a, attr, len, rem) {
 		int err = 0;
 
 		if (prev_port != -1) {

^ permalink raw reply related
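
The iteration that the openvswitch patch above switches to can be sketched in userspace. The definitions below are simplified re-creations for illustration, assumed to match the shape of the kernel's netlink attribute helpers; put_attr() and sum_attr_types() are hypothetical. One point worth noting: the open-coded loops tested `rem > 0`, while the macro is assumed to test nla_ok(), which also rejects a truncated or malformed attribute header.

```c
#include <stdint.h>

struct nlattr {
	uint16_t nla_len;  /* header + payload, without padding */
	uint16_t nla_type;
};

#define NLA_ALIGNTO    4
#define NLA_ALIGN(len) (((len) + NLA_ALIGNTO - 1) & ~(NLA_ALIGNTO - 1))
#define NLA_HDRLEN     ((int)NLA_ALIGN(sizeof(struct nlattr)))

/* True when a complete attribute fits in the remaining bytes. */
int nla_ok(const struct nlattr *nla, int remaining)
{
	return remaining >= (int)sizeof(*nla) &&
	       nla->nla_len >= sizeof(*nla) &&
	       nla->nla_len <= remaining;
}

/* Step to the next attribute and account for the padded length. */
const struct nlattr *nla_next(const struct nlattr *nla, int *remaining)
{
	int totlen = NLA_ALIGN(nla->nla_len);

	*remaining -= totlen;
	return (const struct nlattr *)((const char *)nla + totlen);
}

#define nla_for_each_attr(pos, head, len, rem) \
	for (pos = head, rem = len; nla_ok(pos, rem); \
	     pos = nla_next(pos, &(rem)))

/* Hypothetical helper: write one attribute header at byte offset off. */
int put_attr(struct nlattr *buf, int off, uint16_t type, uint16_t payload_len)
{
	struct nlattr *h = (struct nlattr *)((char *)buf + off);

	h->nla_len = (uint16_t)(NLA_HDRLEN + payload_len);
	h->nla_type = type;
	return off + NLA_ALIGN(h->nla_len);
}

/* Hypothetical consumer: add up the types of all valid attributes. */
int sum_attr_types(const struct nlattr *head, int len)
{
	const struct nlattr *a;
	int rem, sum = 0;

	nla_for_each_attr(a, head, len, rem)
		sum += a->nla_type;
	return sum;
}
```

With a stream that is cut short in the middle of the second attribute, the nla_ok() test stops after the first one instead of reading a partial header.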

* [PATCH] Fix PTP dependencies: explicitly select all the possible dependencies.
From: Haicheng Li @ 2012-10-07 14:14 UTC (permalink / raw)
  To: David Miller; +Cc: fengguang.wu, netdev, linux-kernel
In-Reply-To: <20121006.171748.734171045678392820.davem@davemloft.net>

Fengguang reported a kernel build failure as follows:
drivers/built-in.o: In function `pch_gbe_ioctl':
pch_gbe_main.c:(.text+0x510370): undefined reference to `pch_ch_control_write'
pch_gbe_main.c:(.text+0x510393): undefined reference to `pch_ch_control_write'
pch_gbe_main.c:(.text+0x5103b3): undefined reference to `pch_ch_control_write'
...

It is a regression introduced by commit da1586461. The root cause is that
CONFIG_PPS is not set there, so CONFIG_PTP_1588_CLOCK cannot be set
either, which ultimately causes the ptp_pch and pch_gbe_main
build failures.

As David prefers to use *select* to fix such module co-dependency issues,
this patch explicitly selects all the possible dependencies of PCH_PTP.

Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Reviewed-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Haicheng Li <haicheng.lee@gmail.com>
---
  drivers/net/ethernet/oki-semi/pch_gbe/Kconfig |    3 +++
  1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/drivers/net/ethernet/oki-semi/pch_gbe/Kconfig b/drivers/net/ethernet/oki-semi/pch_gbe/Kconfig
index 9730241..5296cc8 100644
--- a/drivers/net/ethernet/oki-semi/pch_gbe/Kconfig
+++ b/drivers/net/ethernet/oki-semi/pch_gbe/Kconfig
@@ -26,6 +26,9 @@ if PCH_GBE
  config PCH_PTP
  	bool "PCH PTP clock support"
  	default n
+	depends on EXPERIMENTAL
+	select PPS
+	select PTP_1588_CLOCK
  	select PTP_1588_CLOCK_PCH
  	---help---
  	  Say Y here if you want to use Precision Time Protocol (PTP) in the
-- 
1.7.1

^ permalink raw reply related

* Re: [Patch net-next] netpoll: call ->ndo_select_queue() in tx path
From: Cong Wang @ 2012-10-07 14:34 UTC (permalink / raw)
  To: Sylvain Munaut; +Cc: David Miller, netdev, edumazet
In-Reply-To: <CAF6-1L5Gp9kpV+Koru6uAmYxQg_WQav9RVD4MVBDa2UGSoPOCA@mail.gmail.com>

On Wed, 2012-10-03 at 11:33 +0200, Sylvain Munaut wrote:
> Hi,
> 

Hi, Sylvain

> 
> Huh, I don't see it in the final 3.6 ?
> That's rather inconvenient :(
> 

We can backport it to 3.6 stable if you request. :)

Thanks.

^ permalink raw reply
