Netdev List
* Re: Pull request: sfc-next 2012-09-19
From: David Miller @ 2012-09-20 20:42 UTC
  To: bhutchings; +Cc: richardcochran, giometti, linux-net-drivers, netdev, ajackson
In-Reply-To: <1348081775.2636.15.camel@bwh-desktop.uk.solarflarecom.com>

From: Ben Hutchings <bhutchings@solarflare.com>
Date: Wed, 19 Sep 2012 20:09:35 +0100

> The following changes since commit b4516a288e71c64d7e214902250baf78b7b3cdcf:
> 
>   llc: Remove stray reference to sysctl_llc_station_ack_timeout. (2012-09-17 13:13:24 -0400)
> 
> are available in the git repository at:
>   git://git.kernel.org/pub/scm/linux/kernel/git/bwh/sfc-next.git for-davem
> 
> (commit 450783747f42dfa3883920acfad4acdd93ce69af)
> 
> 1. Extension to PPS/PTP to allow for PHC devices where pulses are
>    subject to a variable but measurable delay.
> 2. PPS/PTP/PHC support for Solarflare boards with a timestamping 
>    peripheral.
> 3. MTD support for updating the timestamping peripheral on those boards.
> 4. Fix for potential over-length requests to firmware.

Pulled, thanks Ben.

* Re: [RFC] tcp: use order-3 pages in tcp_sendmsg()
From: Eric Dumazet @ 2012-09-20 20:25 UTC
  To: Rick Jones; +Cc: David Miller, subramanian.vijay, netdev
In-Reply-To: <505B773E.9070501@hp.com>

On Thu, 2012-09-20 at 13:06 -0700, Rick Jones wrote:

> 
> Yes, I was being too fast and loose with my wording, paying more 
> attention to the netperf tests than the rest of it.  While loopback may 
> be lossless, TCP retransmissions over loopback shouldn't be all *that* 
> surprising.

Sending perfect packets (large packets) should trigger no retransmits.

In your tests, you send one-byte packets, so obviously the receiver will
drop some of them, because sk_rcvbuf limit (or the backlog limit) is hit
very fast.

(This should be less frequent with the recently introduced TCP
coalescing: we are able to coalesce about 1600 'one-byte packets' into
a single one.)

netperf -t TCP_STREAM over loopback should not drop packets or
retransmit them.

# netstat -s|grep TCPRcvCoalesce
    TCPRcvCoalesce: 0

# netperf -t TCP_RR -- -b 1024 -D -S 16K -o local_transport_retrans,remote_transport_retrans
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 () port 0 AF_INET to localhost () port 0 AF_INET : nodelay : first burst 1024
Local Transport Retransmissions,Remote Transport Retransmissions
0,0

# netstat -s|grep TCPRcvCoalesce
    TCPRcvCoalesce: 2072191

* RE: bnx2x: link detected up at startup even when it should be down
From: Dmitry Kravkov @ 2012-09-20 20:08 UTC
  To: Jean-Michel Hautbois, netdev
  Cc: Barak Witkowski, Eilon Greenstein, davem@davemloft.net
In-Reply-To: <CAL8zT=hFPQ-NZw8eKQwnktRrcpsOFk3aa_ac5mSWa+TAwyTGBQ@mail.gmail.com>

> -----Original Message-----
> From: netdev-owner@vger.kernel.org [mailto:netdev-owner@vger.kernel.org]
> On Behalf Of Jean-Michel Hautbois
> Sent: Thursday, September 20, 2012 6:39 PM
> To: netdev
> Cc: Barak Witkowski; Eilon Greenstein; davem@davemloft.net
> Subject: bnx2x: link detected up at startup even when it should be down
> 
> Hi all,
> 
> I am working with a HP blade which has a bnx2x based card (Broadcom
> NetXtreme II BCM57810 10 Gigabit Ethernet).
> I am using a 3.2 linux kernel, which works very well except on
> detecting the link state at startup.
> I have my ethernet interfaces linked with a bond, and I want to
> configure it for HA (in miimon mode).
> I am using a managed switch which helps me in disabling/enabling ports.
> 
> When the port is disabled, at startup, the link is detected "UP".
> When I enable the port, it is still "UP", and when I disable it again,
> then it is detected "DOWN".
> 
> I have tried the latest 3.6-rc6 kernel, and it works well (link is
> "DOWN" at startup when port is disabled).
> Then I bisected it, and I found out that the commit which makes it
> working (yes, it is an inverse bisect, thanks to this powerful git
> tool :)) is :
> 
> a334872224a67b614dc888460377862621f3dac7 is the first bad commit
> commit a334872224a67b614dc888460377862621f3dac7
> Author: Barak Witkowski <barak@broadcom.com>
> Date:   Mon Apr 23 03:04:46 2012 +0000
> 
>     bnx2x: add afex support
> 
>     Following patch adds afex multifunction support to the driver (afex
>     multifunction is based on vntag header) and updates FW version
> used to 7.2.51.
> 
>     Support includes the following:
> 
>     1. Configure vif parameters in firmware (default vlan, vif id, default
>        priority, allowed priorities) according to values received from NIC.
>     2. Configure FW to strip/add default vlan according to afex vlan mode.
>     3. Notify link up to OS only after vif is fully initialized.
>     4. Support vif list set/get requests and configure FW accordingly.
>     5. Supply afex statistics upon request from NIC.
>     6. Special handling to L2 interface in case of FCoE vif.
> 
>     Signed-off-by: Barak Witkowski <barak@broadcom.com>
>     Signed-off-by: Eilon Greenstein <eilong@broadcom.com>
>     Signed-off-by: David S. Miller <davem@davemloft.net>
> 
> This commit is present in the 3.5.y stable branch, but not the 3.2.y one.
> Is there a workaround which would make this feature work correctly
> even on older kernels ?
> It does not seem to be trivial, but I may miss something as this
> driver is pretty big...

Jean,
I have gone over the patch, but was unable to find any link-related change outside of
the AFEX flow. We will take a closer look ASAP in our lab (the guys are already out for the weekend).

Can you double-check the bisect result for me, please?

Thanks
> 
> Cheers,
> JM
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: [RFC] tcp: use order-3 pages in tcp_sendmsg()
From: Rick Jones @ 2012-09-20 20:06 UTC
  To: David Miller; +Cc: eric.dumazet, subramanian.vijay, netdev
In-Reply-To: <20120920.154007.1073423645697694793.davem@davemloft.net>

On 09/20/2012 12:40 PM, David Miller wrote:
> From: Rick Jones <rick.jones2@hp.com>
> Date: Thu, 20 Sep 2012 10:10:43 -0700
>
>> On 09/19/2012 10:37 PM, Eric Dumazet wrote:
>>> loopback is lossless, so its always surprising we can have TCP
>>> retransmits on this medium ;)
>>
>> Is it lossless?
>>
>> raj@tardy:~/netperf2_trunk$ netstat -s | grep pru
>>      19 packets pruned from receive queue because of socket buffer overrun
>
> Those packets are not being dropped by the loopback device.
>

Yes, I was being too fast and loose with my wording, paying more 
attention to the netperf tests than the rest of it.  While loopback may 
be lossless, TCP retransmissions over loopback shouldn't be all *that* 
surprising.

rick

* [PATCH v3 5/7] xfrm_user: ensure user supplied esn replay window is valid
From: Mathias Krause @ 2012-09-20 20:01 UTC
  To: David S. Miller
  Cc: Steffen Klassert, netdev, linux-kernel, Mathias Krause,
	Martin Willi, Ben Hutchings
In-Reply-To: <20120920070508.GA4221@secunet.com>

The current code fails to ensure that the netlink message actually
contains as many bytes as the header indicates. If a user creates a new
state or updates an existing one but does not supply the bytes for the
whole ESN replay window, the kernel copies random heap bytes into the
replay bitmap, namely the ones that happen to follow the
XFRMA_REPLAY_ESN_VAL netlink attribute. This leads to the following issues:

1. The replay window has random bits set, confusing the replay handling
   code later on.

2. A malicious user could use this flaw to leak up to ~3.5kB of heap
   memory when she has access to the XFRM netlink interface (requires
   CAP_NET_ADMIN).

Known users of the ESN replay window are strongSwan and Steffen's
iproute2 patch (<http://patchwork.ozlabs.org/patch/85962/>). The latter
uses the interface with a bitmap supplied while the former does not.
strongSwan is therefore prone to run into issue 1.

To fix both issues without breaking existing userland, allow using the
XFRMA_REPLAY_ESN_VAL netlink attribute with either an empty bitmap or a
fully specified one. For the former case we initialize the in-kernel
bitmap with zero; for the latter we copy the user supplied bitmap. For
state updates the full bitmap must be supplied.

To prevent overflows in the bitmap length calculation, the maximum size
of bmp_len is limited to 128 by this patch -- resulting in a maximum
replay window of 4096 packets. This should be sufficient for all real
life scenarios (RFC 4303 recommends a default replay window size of 64).

Cc: Steffen Klassert <steffen.klassert@secunet.com>
Cc: Martin Willi <martin@revosec.ch>
Cc: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: Mathias Krause <minipli@googlemail.com>
---
v3:
- revert size_t change to xfrm_replay_state_esn_len() (requested by Steffen)
- switch to int types for lengths (suggested by Ben)
- implement 4096 packets limit for bmp_len to avoid overflows in
  xfrm_replay_state_esn_len()
v2:
- compare against klen in xfrm_alloc_replay_state_esn (suggested by Ben)
- make xfrm_replay_state_esn_len() return size_t

 include/linux/xfrm.h |    2 ++
 net/xfrm/xfrm_user.c |   31 +++++++++++++++++++++++++------
 2 files changed, 27 insertions(+), 6 deletions(-)

diff --git a/include/linux/xfrm.h b/include/linux/xfrm.h
index 22e61fd..28e493b 100644
--- a/include/linux/xfrm.h
+++ b/include/linux/xfrm.h
@@ -84,6 +84,8 @@ struct xfrm_replay_state {
 	__u32	bitmap;
 };
 
+#define XFRMA_REPLAY_ESN_MAX	4096
+
 struct xfrm_replay_state_esn {
 	unsigned int	bmp_len;
 	__u32		oseq;
diff --git a/net/xfrm/xfrm_user.c b/net/xfrm/xfrm_user.c
index 9f1e749..e761562 100644
--- a/net/xfrm/xfrm_user.c
+++ b/net/xfrm/xfrm_user.c
@@ -123,9 +123,21 @@ static inline int verify_replay(struct xfrm_usersa_info *p,
 				struct nlattr **attrs)
 {
 	struct nlattr *rt = attrs[XFRMA_REPLAY_ESN_VAL];
+	struct xfrm_replay_state_esn *rs;
 
-	if ((p->flags & XFRM_STATE_ESN) && !rt)
-		return -EINVAL;
+	if (p->flags & XFRM_STATE_ESN) {
+		if (!rt)
+			return -EINVAL;
+
+		rs = nla_data(rt);
+
+		if (rs->bmp_len > XFRMA_REPLAY_ESN_MAX / sizeof(rs->bmp[0]) / 8)
+			return -EINVAL;
+
+		if (nla_len(rt) < xfrm_replay_state_esn_len(rs) &&
+		    nla_len(rt) != sizeof(*rs))
+			return -EINVAL;
+	}
 
 	if (!rt)
 		return 0;
@@ -370,14 +382,15 @@ static inline int xfrm_replay_verify_len(struct xfrm_replay_state_esn *replay_es
 					 struct nlattr *rp)
 {
 	struct xfrm_replay_state_esn *up;
+	int ulen;
 
 	if (!replay_esn || !rp)
 		return 0;
 
 	up = nla_data(rp);
+	ulen = xfrm_replay_state_esn_len(up);
 
-	if (xfrm_replay_state_esn_len(replay_esn) !=
-			xfrm_replay_state_esn_len(up))
+	if (nla_len(rp) < ulen || xfrm_replay_state_esn_len(replay_esn) != ulen)
 		return -EINVAL;
 
 	return 0;
@@ -388,22 +401,28 @@ static int xfrm_alloc_replay_state_esn(struct xfrm_replay_state_esn **replay_esn
 				       struct nlattr *rta)
 {
 	struct xfrm_replay_state_esn *p, *pp, *up;
+	int klen, ulen;
 
 	if (!rta)
 		return 0;
 
 	up = nla_data(rta);
+	klen = xfrm_replay_state_esn_len(up);
+	ulen = nla_len(rta) >= klen ? klen : sizeof(*up);
 
-	p = kmemdup(up, xfrm_replay_state_esn_len(up), GFP_KERNEL);
+	p = kzalloc(klen, GFP_KERNEL);
 	if (!p)
 		return -ENOMEM;
 
-	pp = kmemdup(up, xfrm_replay_state_esn_len(up), GFP_KERNEL);
+	pp = kzalloc(klen, GFP_KERNEL);
 	if (!pp) {
 		kfree(p);
 		return -ENOMEM;
 	}
 
+	memcpy(p, up, ulen);
+	memcpy(pp, up, ulen);
+
 	*replay_esn = p;
 	*preplay_esn = pp;
 
-- 
1.7.10.4
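
The length rules from this patch can be tried out in userspace. The
following is a sketch under assumptions, not the kernel code: struct
esn_hdr is a hypothetical stand-in for the fixed part of struct
xfrm_replay_state_esn, esn_len() mirrors what
xfrm_replay_state_esn_len() is assumed to compute (header plus bmp_len
32-bit words), and attr_len stands in for nla_len().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ESN_MAX_BITS 4096u	/* mirrors XFRMA_REPLAY_ESN_MAX from the patch */

/* Hypothetical stand-in for the fixed part of the ESN replay state. */
struct esn_hdr {
	unsigned int bmp_len;	/* bitmap length in 32-bit words */
	uint32_t oseq, seq, oseq_hi, seq_hi, replay_window;
};

/* Size of header plus bitmap (assumed to match the kernel helper). */
static int esn_len(const struct esn_hdr *h)
{
	return (int)(sizeof(*h) + h->bmp_len * sizeof(uint32_t));
}

/* Returns 0 if the attribute is acceptable for a new state, -1 otherwise. */
static int esn_verify(const struct esn_hdr *h, int attr_len)
{
	assert(h != NULL);

	/* Cap bmp_len first so the length computation cannot overflow. */
	if (h->bmp_len > ESN_MAX_BITS / sizeof(uint32_t) / 8)
		return -1;

	/* Accept a full bitmap, or a bare header (empty bitmap); reject
	 * anything in between. */
	if (attr_len < esn_len(h) && attr_len != (int)sizeof(*h))
		return -1;

	return 0;
}
```

With these definitions a bmp_len of 128 is the largest accepted value
(128 words of 32 bits give the 4096 packet window), and an attribute
that carries only part of the announced bitmap is rejected.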

* Re: [PATCH v3] net-tcp: TCP/IP stack bypass for loopback connections
From: David Miller @ 2012-09-20 19:41 UTC
  To: rick.jones2; +Cc: eric.dumazet, sclark46, brutus, edumazet, netdev
In-Reply-To: <505B5154.3020002@hp.com>

From: Rick Jones <rick.jones2@hp.com>
Date: Thu, 20 Sep 2012 10:24:36 -0700

> On 09/20/2012 04:51 AM, Eric Dumazet wrote:
>> On Thu, 2012-09-20 at 07:28 -0400, Stephen Clark wrote:
>>>
>>> Does this mean traffic on the loopback interface will not traverse
>>> netfilter?
>>>
>>
>> Yes this was already mentioned.
>>
>> Only the SYN / SYNACK messages will
>>
>> All data will bypass IP stack, qdisc (if any), loopback driver, and
>> netfilter.
> 
> Does that then lift the tent flap for TOE?  As I recall, TOE's
> bypassing of all those things is one of the reasons used to reject
> TOE.

Wrong.  This bypassing is completely in software, and completely
controlled by us.

Which is completely opposite to TOE.

Don't spread fud like this, even on a whim.

* Re: [RFC] tcp: use order-3 pages in tcp_sendmsg()
From: David Miller @ 2012-09-20 19:40 UTC
  To: rick.jones2; +Cc: eric.dumazet, subramanian.vijay, netdev
In-Reply-To: <505B4E13.5000501@hp.com>

From: Rick Jones <rick.jones2@hp.com>
Date: Thu, 20 Sep 2012 10:10:43 -0700

> On 09/19/2012 10:37 PM, Eric Dumazet wrote:
>> loopback is lossless, so its always surprising we can have TCP
>> retransmits on this medium ;)
> 
> Is it lossless?
> 
> raj@tardy:~/netperf2_trunk$ netstat -s | grep pru
>     19 packets pruned from receive queue because of socket buffer overrun

Those packets are not being dropped by the loopback device.

* Re: [PATCH v3] net-tcp: TCP/IP stack bypass for loopback connections
From: David Miller @ 2012-09-20 19:30 UTC
  To: sclark46; +Cc: brutus, eric.dumazet, edumazet, netdev
In-Reply-To: <505AFDE9.4080602@earthlink.net>

From: Stephen Clark <sclark46@earthlink.net>
Date: Thu, 20 Sep 2012 07:28:41 -0400

> Does this mean traffic on the loopback interface will not traverse
> netfilter?

Please, we've had this discussion before, let's not have it again.
Read the archives for the postings of the previous versions of
this patch.

* Re: [RFC] tcp: use order-3 pages in tcp_sendmsg()
From: Yuchung Cheng @ 2012-09-20 18:37 UTC
  To: Eric Dumazet; +Cc: Rick Jones, Vijay Subramanian, David Miller, netdev
In-Reply-To: <1348163031.3440.3.camel@edumazet-glaptop>

On Thu, Sep 20, 2012 at 10:43 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Thu, 2012-09-20 at 10:10 -0700, Rick Jones wrote:
>> On 09/19/2012 10:37 PM, Eric Dumazet wrote:
>> > loopback is lossless, so its always surprising we can have TCP
>> > retransmits on this medium ;)
>>
>> Is it lossless?
>>
>
> It is lossless, yes.
>
> But packets can be dropped by TCP stack for various reasons, including
> reordering and retransmits.
I'd recommend checking reordering stats. If it's lossless, set
tp->reordering = 127 for loopback.

* Re: [PATCH] openvswitch: using kfree_rcu() to simplify the code
From: Paul E. McKenney @ 2012-09-20 17:47 UTC
  To: Wei Yongjun
  Cc: dev-yBygre7rU0TnMu66kgdUjQ, netdev-u79uwXL29TY76Z2rM5mHXA,
	yongjun_wei-zrsr2BFq86L20UzCJQGyNP8+0UxHXcjY,
	davem-fT/PcQaiUtIeIZ0/mPfg9Q
In-Reply-To: <CAPgLHd_71Qh90j9FCkT2cQ35wNMkZeLDwHT2QLP55_3gfzfjTQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

On Mon, Aug 27, 2012 at 12:20:45PM +0800, Wei Yongjun wrote:
> From: Wei Yongjun <yongjun_wei-zrsr2BFq86L20UzCJQGyNP8+0UxHXcjY@public.gmane.org>
> 
> The callback function of call_rcu() just calls a kfree(), so we
> can use kfree_rcu() instead of call_rcu() + callback function.
> 
> spatch with a semantic match is used to found this problem.
> (http://coccinelle.lip6.fr/)
> 
> Signed-off-by: Wei Yongjun <yongjun_wei-zrsr2BFq86L20UzCJQGyNP8+0UxHXcjY@public.gmane.org>

Reviewed-by: Paul E. McKenney <paulmck-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>

> ---
>  net/openvswitch/flow.c | 10 +---------
>  1 file changed, 1 insertion(+), 9 deletions(-)
> 
> diff --git a/net/openvswitch/flow.c b/net/openvswitch/flow.c
> index b7f38b1..c7bf2f2 100644
> --- a/net/openvswitch/flow.c
> +++ b/net/openvswitch/flow.c
> @@ -427,19 +427,11 @@ void ovs_flow_deferred_free(struct sw_flow *flow)
>  	call_rcu(&flow->rcu, rcu_free_flow_callback);
>  }
> 
> -/* RCU callback used by ovs_flow_deferred_free_acts. */
> -static void rcu_free_acts_callback(struct rcu_head *rcu)
> -{
> -	struct sw_flow_actions *sf_acts = container_of(rcu,
> -			struct sw_flow_actions, rcu);
> -	kfree(sf_acts);
> -}
> -
>  /* Schedules 'sf_acts' to be freed after the next RCU grace period.
>   * The caller must hold rcu_read_lock for this to be sensible. */
>  void ovs_flow_deferred_free_acts(struct sw_flow_actions *sf_acts)
>  {
> -	call_rcu(&sf_acts->rcu, rcu_free_acts_callback);
> +	kfree_rcu(sf_acts, rcu);
>  }
> 
>  static int parse_vlan(struct sk_buff *skb, struct sw_flow_key *key)
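
What the removed callback did can be shown in userspace: it mapped the
embedded rcu head back to its enclosing object and freed it, which is
the pattern kfree_rcu() expresses in one call. The sketch below is an
illustration under assumptions: the types ending in _model are
hypothetical stand-ins, container_of() is re-created with offsetof(),
and there is no real grace period, so the callback runs immediately.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Userspace re-creation of the container_of() idea. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical stand-ins; not the kernel definitions. */
struct rcu_head_model {
	void (*func)(struct rcu_head_model *head);
};

struct flow_actions_model {
	unsigned int actions_len;
	struct rcu_head_model rcu;
};

static int freed_count;

/* Same shape as the removed callback: recover the object, free it. */
static void free_acts_callback(struct rcu_head_model *head)
{
	struct flow_actions_model *acts =
		container_of(head, struct flow_actions_model, rcu);

	free(acts);
	freed_count++;
}

/* Model of a deferred free. Real RCU would wait for a grace period
 * before running the callback; this model calls it right away. */
static void deferred_free_model(struct flow_actions_model *acts)
{
	assert(acts != NULL);
	acts->rcu.func = free_acts_callback;
	acts->rcu.func(&acts->rcu);
}
```

Because the callback body carries no logic beyond locating the object
and freeing it, naming the object and its rcu member is enough, which
is why the one-line form in the patch is equivalent.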

* Re: [RFC] tcp: use order-3 pages in tcp_sendmsg()
From: Eric Dumazet @ 2012-09-20 17:43 UTC
  To: Rick Jones; +Cc: Vijay Subramanian, David Miller, netdev
In-Reply-To: <505B4E13.5000501@hp.com>

On Thu, 2012-09-20 at 10:10 -0700, Rick Jones wrote:
> On 09/19/2012 10:37 PM, Eric Dumazet wrote:
> > loopback is lossless, so its always surprising we can have TCP
> > retransmits on this medium ;)
> 
> Is it lossless?
> 

It is lossless, yes.

But packets can be dropped by TCP stack for various reasons, including
reordering and retransmits.

* Re: [PATCH v3] net-tcp: TCP/IP stack bypass for loopback connections
From: Bill Fink @ 2012-09-20 16:21 UTC
  To: Eric Dumazet; +Cc: sclark46, Bruce Curtis, David Miller, edumazet, netdev
In-Reply-To: <1348141871.31352.66.camel@edumazet-glaptop>

On Thu, 20 Sep 2012, Eric Dumazet wrote:

> On Thu, 2012-09-20 at 07:28 -0400, Stephen Clark wrote:
> >  
> > Does this mean traffic on the loopback interface will not traverse 
> > netfilter?
> > 
> 
> Yes this was already mentioned.
> 
> Only the SYN / SYNACK messages will
> 
> All data will bypass IP stack, qdisc (if any), loopback driver, and
> netfilter.

These restrictions and any others should be documented in ip-sysctl.txt.

From Eric's earlier e-mail:

-> no iptables, 
   no qdisc (by default there is no qdisc on lo),
   no loopback stats (ifconfig lo).
   some SNMP stats missing as well (netstat -s)

						-Bill

* Re: [PATCH v3] net-tcp: TCP/IP stack bypass for loopback connections
From: Rick Jones @ 2012-09-20 17:24 UTC
  To: Eric Dumazet; +Cc: sclark46, Bruce Curtis, David Miller, edumazet, netdev
In-Reply-To: <1348141871.31352.66.camel@edumazet-glaptop>

On 09/20/2012 04:51 AM, Eric Dumazet wrote:
> On Thu, 2012-09-20 at 07:28 -0400, Stephen Clark wrote:
>>
>> Does this mean traffic on the loopback interface will not traverse
>> netfilter?
>>
>
> Yes this was already mentioned.
>
> Only the SYN / SYNACK messages will
>
> All data will bypass IP stack, qdisc (if any), loopback driver, and
> netfilter.

Does that then lift the tent flap for TOE?  As I recall, TOE's bypassing 
of all those things is one of the reasons used to reject TOE.

rick jones

* Re: [RFC] tcp: use order-3 pages in tcp_sendmsg()
From: Rick Jones @ 2012-09-20 17:10 UTC
  To: Eric Dumazet; +Cc: Vijay Subramanian, David Miller, netdev
In-Reply-To: <1348119475.31352.60.camel@edumazet-glaptop>

On 09/19/2012 10:37 PM, Eric Dumazet wrote:
> loopback is lossless, so its always surprising we can have TCP
> retransmits on this medium ;)

Is it lossless?

raj@tardy:~/netperf2_trunk$ netstat -s | grep pru
     19 packets pruned from receive queue because of socket buffer overrun

raj@tardy:~/netperf2_trunk$ src/netperf -t TCP_RR -- -b 256 -D -o burst_size,local_transport_retrans,remote_transport_retrans
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to localhost.localdomain () port 0 AF_INET : nodelay : histogram : demo : first burst 256
Initial Burst Requests,Local Transport Retransmissions,Remote Transport Retransmissions
256,151,94

raj@tardy:~/netperf2_trunk$ netstat -s | grep pru
     26 packets pruned from receive queue because of socket buffer overrun
raj@tardy:~/netperf2_trunk$ uname -a
Linux tardy 2.6.38-16-generic #67-Ubuntu SMP Thu Sep 6 17:58:38 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux

Admittedly, my test is on an older kernel, but have things changed in 
this regard since then?   I had to get a bit more contrived on a later 
kernel in a VM (vs what is running directly on my workstation):

raj@tardy-ubuntu-1204:~$ netstat -s | grep -e prune -e retrans
     1 segments retransmited
     4 packets pruned from receive queue because of socket buffer overrun
     1 fast retransmits
raj@tardy-ubuntu-1204:~$ netperf -t TCP_RR -- -b 1024 -D -S 16K -o local_transport_retrans,remote_transport_retrans
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to localhost () port 0 AF_INET : nodelay : demo : first burst 1024
Local Transport Retransmissions,Remote Transport Retransmissions
1,0
raj@tardy-ubuntu-1204:~$ netstat -s | grep -e prune -e retrans
     2 segments retransmited
     7 packets pruned from receive queue because of socket buffer overrun
     2 fast retransmits
raj@tardy-ubuntu-1204:~$ uname -a
Linux tardy-ubuntu-1204 3.6.0-rc3+ #7 SMP Mon Sep 10 14:46:05 PDT 2012 x86_64 x86_64 x86_64 GNU/Linux

rick

* Re: Oops with latest (netfilter) nf-next tree, when unloading iptable_nat
From: Patrick McHardy @ 2012-09-20 17:06 UTC
  To: Pablo Neira Ayuso
  Cc: Jesper Dangaard Brouer, Florian Westphal, netfilter-devel, netdev,
	yongjun_wei
In-Reply-To: <Pine.GSO.4.63.1209201212480.1775@stinky-local.trash.net>

On Thu, 20 Sep 2012, Patrick McHardy wrote:

>>> diff --git a/net/netfilter/nf_conntrack_core.c 
>>> b/net/netfilter/nf_conntrack_core.c
>>> index dcb2791..0f241be 100644
>>> --- a/net/netfilter/nf_conntrack_core.c
>>> +++ b/net/netfilter/nf_conntrack_core.c
>>> @@ -1224,6 +1224,8 @@ get_next_corpse(struct net *net, int (*iter)(struct 
>>> nf_conn *i, void *data),
>>>  	spin_lock_bh(&nf_conntrack_lock);
>>>  	for (; *bucket < net->ct.htable_size; (*bucket)++) {
>>>  		hlist_nulls_for_each_entry(h, n, &net->ct.hash[*bucket], 
>>> hnnode) {
>>> +			if (NF_CT_DIRECTION(h) != IP_CT_DIR_ORIGINAL)
>>> +				continue;
>> 
>> I think this will make the deletion of entries via `conntrack -F'
>> slowier as we'll have to iterate over more entries (we won't delete
>> entries for the reply tuple).
>
> Slightly maybe, but I doubt it makes much of a difference.
>
>> I think I prefer Florian's patch, it's fairly small and it does not
>> change the current nf_ct_iterate behaviour or adding some
>> nf_nat_iterate cleanup.
>
> I don't think I've received it. Could you forward it to me please?

Florian forwarded the patch to me. While it fixes the problem, it
is a workaround and it certainly is inelegant to do the
list_del_rcu_init() and memset up to *four* times for a single conntrack.

The correct thing IMO is to invoke the callbacks exactly once per
conntrack, either through my nf_ct_iterate_cleanup() change or through
a new iteration function for callers that don't kill conntracks. As
soon as we start generating events for NAT section cleanup this will be
needed in any case.

Unless I'm missing something, conntrack flushing is also a really rare
operation, and for large tables, where this might make a small
difference, it will take quite a long time anyway.
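
Why a direction check yields one callback per conntrack can be seen in
a toy model. This is a sketch under assumptions, not the netfilter
code: it assumes each conntrack is linked into the table once per
direction (original and reply), and the names ending in _model as well
as conn_init() and iterate() are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

enum dir_model { DIR_ORIGINAL = 0, DIR_REPLY = 1, DIR_MAX = 2 };

struct conn_model;

/* One table node per direction, pointing back at its connection. */
struct tuple_hash_model {
	enum dir_model dir;
	struct conn_model *ct;
};

struct conn_model {
	struct tuple_hash_model tuplehash[DIR_MAX];
	int callbacks;	/* how often the iterator's callback ran for this entry */
};

static void conn_init(struct conn_model *ct)
{
	assert(ct != NULL);
	for (int d = 0; d < DIR_MAX; d++) {
		ct->tuplehash[d].dir = (enum dir_model)d;
		ct->tuplehash[d].ct = ct;
	}
	ct->callbacks = 0;
}

/* Walk every node of a flat "table". With original_only set, reply
 * nodes are skipped, so each connection is visited exactly once. */
static void iterate(struct tuple_hash_model **table, size_t n,
		    int original_only)
{
	for (size_t i = 0; i < n; i++) {
		if (original_only && table[i]->dir != DIR_ORIGINAL)
			continue;
		table[i]->ct->callbacks++;
	}
}
```

Without the filter every connection in this model is visited twice;
with it, the walk still touches every node but the callback runs once
per connection.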

* [PATCH net-next] gianfar: Change default HW Tx queue scheduling mode
From: Claudiu Manoil @ 2012-09-20 15:57 UTC
  To: netdev; +Cc: David S. Miller, Paul Gortmaker, Claudiu Manoil

This is primarily to address transmission timeout occurrences, when
multiple H/W Tx queues are being used concurrently. Because in
the priority scheduling mode the controller does not service the
Tx queues equally (but in ascending index order), Tx timeouts are
being triggered right away for a basic test with multiple simultaneous
connections like:
iperf -c <server_ip> -n 100M -P 8

resulting in kernel trace:
NETDEV WATCHDOG: eth1 (fsl-gianfar): transmit queue <X> timed out
------------[ cut here ]------------
WARNING: at net/sched/sch_generic.c:255
...
and controller reset during intense traffic, and possibly further
complications.

This patch changes the default H/W Tx scheduling setting (TXSCHED)
for multi-queue devices, from priority scheduling mode to a weighted
round robin mode with equal weights for all H/W Tx queues, and
addresses the issue above.
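
The equal-weight setting can be checked with a little arithmetic. A
sketch, assuming each of the tr03wt and tr47wt registers holds four
8-bit per-queue weights (which the register names and the value
0x18181818 suggest, but which the patch does not spell out).
wrrs_pack() and wrrs_weight_bytes() are hypothetical helper names, and
the byte-to-queue ordering is a guess that does not matter when all
weights are equal.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Per the patch comment, weights are measured in 64-byte units. */
#define WRRS_UNIT_BYTES 64u

/* Pack four per-queue weights, one byte each. Placing w[0] in the most
 * significant byte is an assumption made for this illustration. */
static uint32_t wrrs_pack(const uint8_t w[4])
{
	assert(w != NULL);
	return ((uint32_t)w[0] << 24) | ((uint32_t)w[1] << 16) |
	       ((uint32_t)w[2] << 8) | (uint32_t)w[3];
}

/* Convert one 8-bit weight into bytes. */
static unsigned int wrrs_weight_bytes(uint8_t weight)
{
	return weight * WRRS_UNIT_BYTES;
}
```

Under these assumptions a weight of 0x18 per queue means 24 units of
64 bytes, i.e. 1536 bytes per queue, and four such weights pack into
the 0x18181818 value used as DEFAULT_WRRS_WEIGHT.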

The TXSCHED setting may be changed at runtime, via sysfs, for devices
using multiple H/W Tx queues. For single queue devices this config
option is disabled, as the TXSCHED setting is superfluous in those cases.

Signed-off-by: Claudiu Manoil <claudiu.manoil@freescale.com>
---
 drivers/net/ethernet/freescale/gianfar.c       |   11 +++-
 drivers/net/ethernet/freescale/gianfar.h       |   11 +++-
 drivers/net/ethernet/freescale/gianfar_sysfs.c |   71 ++++++++++++++++++++++++
 3 files changed, 91 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/freescale/gianfar.c b/drivers/net/ethernet/freescale/gianfar.c
index 4d5b58c..a1b52ec 100644
--- a/drivers/net/ethernet/freescale/gianfar.c
+++ b/drivers/net/ethernet/freescale/gianfar.c
@@ -394,7 +394,13 @@ static void gfar_init_mac(struct net_device *ndev)
 	if (ndev->features & NETIF_F_IP_CSUM)
 		tctrl |= TCTRL_INIT_CSUM;
 
-	tctrl |= TCTRL_TXSCHED_PRIO;
+	if (priv->prio_sched_en)
+		tctrl |= TCTRL_TXSCHED_PRIO;
+	else {
+		tctrl |= TCTRL_TXSCHED_WRRS;
+		gfar_write(&regs->tr03wt, DEFAULT_WRRS_WEIGHT);
+		gfar_write(&regs->tr47wt, DEFAULT_WRRS_WEIGHT);
+	}
 
 	gfar_write(&regs->tctrl, tctrl);
 
@@ -1160,6 +1166,9 @@ static int gfar_probe(struct platform_device *ofdev)
 	priv->rx_filer_enable = 1;
 	/* Enable most messages by default */
 	priv->msg_enable = (NETIF_MSG_IFUP << 1 ) - 1;
+	/* use priority h/w tx queue scheduling for single queue devices */
+	if (priv->num_tx_queues == 1)
+		priv->prio_sched_en = 1;
 
 	/* Carrier starts down, phylib will bring it up */
 	netif_carrier_off(dev);
diff --git a/drivers/net/ethernet/freescale/gianfar.h b/drivers/net/ethernet/freescale/gianfar.h
index 2136c7f..4141ef2 100644
--- a/drivers/net/ethernet/freescale/gianfar.h
+++ b/drivers/net/ethernet/freescale/gianfar.h
@@ -301,8 +301,16 @@ extern const char gfar_driver_version[];
 #define TCTRL_TFCPAUSE		0x00000008
 #define TCTRL_TXSCHED_MASK	0x00000006
 #define TCTRL_TXSCHED_INIT	0x00000000
+/* priority scheduling */
 #define TCTRL_TXSCHED_PRIO	0x00000002
+/* weighted round-robin scheduling (WRRS) */
 #define TCTRL_TXSCHED_WRRS	0x00000004
+/* default WRRS weight and policy setting,
+ * tailored to the tr03wt and tr47wt registers:
+ * equal weight for all Tx Qs, measured in 64byte units
+ */
+#define DEFAULT_WRRS_WEIGHT	0x18181818
+
 #define TCTRL_INIT_CSUM		(TCTRL_TUCSEN | TCTRL_IPCSEN)
 
 #define IEVENT_INIT_CLEAR	0xffffffff
@@ -1098,7 +1106,8 @@ struct gfar_private {
 		extended_hash:1,
 		bd_stash_en:1,
 		rx_filer_enable:1,
-		wol_en:1; /* Wake-on-LAN enabled */
+		wol_en:1, /* Wake-on-LAN enabled */
+		prio_sched_en:1; /* Enable priority based Tx scheduling in Hw */
 	unsigned short padding;
 
 	/* PHY stuff */
diff --git a/drivers/net/ethernet/freescale/gianfar_sysfs.c b/drivers/net/ethernet/freescale/gianfar_sysfs.c
index cd14a4d..b942bfc 100644
--- a/drivers/net/ethernet/freescale/gianfar_sysfs.c
+++ b/drivers/net/ethernet/freescale/gianfar_sysfs.c
@@ -319,6 +319,76 @@ static ssize_t gfar_set_fifo_starve_off(struct device *dev,
 static DEVICE_ATTR(fifo_starve_off, 0644, gfar_show_fifo_starve_off,
 		   gfar_set_fifo_starve_off);
 
+static ssize_t gfar_show_tx_prio_sched(struct device *dev,
+				       struct device_attribute *attr,
+				       char *buf)
+{
+	struct gfar_private *priv = netdev_priv(to_net_dev(dev));
+
+	return sprintf(buf, "%s\n", priv->prio_sched_en ? "on" : "off");
+}
+
+static ssize_t gfar_set_tx_prio_sched(struct device *dev,
+				      struct device_attribute *attr,
+				      const char *buf, size_t count)
+{
+	struct net_device *ndev = to_net_dev(dev);
+	struct gfar_private *priv = netdev_priv(to_net_dev(dev));
+	bool new_setting;
+
+	/* no changes if device doesn't have multiple Tx queues */
+	if (priv->num_tx_queues <= 1)
+		return count;
+
+	/* find out the new setting */
+	if (!strncmp("on", buf, count - 1) || !strncmp("1", buf, count - 1))
+		new_setting = 1;
+	else if (!strncmp("off", buf, count - 1) ||
+		 !strncmp("0", buf, count - 1))
+		new_setting = 0;
+	else
+		return count;
+
+	if (priv->prio_sched_en == new_setting)
+		return count;
+
+	/* Only stop and start the controller if it isn't already
+	 * stopped and we're about to update TCTRL with the new Tx
+	 * scheduling policy
+	 */
+	if (ndev->flags & IFF_UP) {
+		unsigned long flags;
+
+		/* Halt TX and RX, and process the frames which
+		 * have already been received
+		 */
+		netif_tx_stop_all_queues(ndev);
+		local_irq_save(flags);
+		lock_tx_qs(priv);
+		lock_rx_qs(priv);
+
+		gfar_halt(ndev);
+
+		unlock_tx_qs(priv);
+		unlock_rx_qs(priv);
+		local_irq_restore(flags);
+
+		/* take down the rings to rebuild them */
+		stop_gfar(ndev);
+
+		priv->prio_sched_en = new_setting;
+
+		startup_gfar(ndev);
+		netif_tx_wake_all_queues(ndev);
+	} else
+		priv->prio_sched_en = new_setting;
+
+	return count;
+}
+
+static DEVICE_ATTR(tx_prio_sched, 0644, gfar_show_tx_prio_sched,
+		   gfar_set_tx_prio_sched);
+
 void gfar_init_sysfs(struct net_device *dev)
 {
 	struct gfar_private *priv = netdev_priv(dev);
@@ -336,6 +406,7 @@ void gfar_init_sysfs(struct net_device *dev)
 	rc |= device_create_file(&dev->dev, &dev_attr_fifo_threshold);
 	rc |= device_create_file(&dev->dev, &dev_attr_fifo_starve);
 	rc |= device_create_file(&dev->dev, &dev_attr_fifo_starve_off);
+	rc |= device_create_file(&dev->dev, &dev_attr_tx_prio_sched);
 	if (rc)
 		dev_err(&dev->dev, "Error creating gianfar sysfs files.\n");
 }
-- 
1.7.6.5
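
A note on the input parsing in gfar_set_tx_prio_sched(): strncmp() with
a length of count - 1 compares only that many bytes, so a write of
"o\n" compares a single byte and matches "on", and a bare newline
compares zero bytes. One possible stricter parse is sketched below in
userspace. word_matches() and parse_on_off() are hypothetical names;
in the kernel, the existing sysfs_streq() helper performs a similar
newline-tolerant comparison.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Compare a buffer of 'count' bytes against 'word', tolerating one
 * trailing newline and requiring the whole word to match. */
static bool word_matches(const char *buf, size_t count, const char *word)
{
	assert(buf != NULL && word != NULL);
	if (count > 0 && buf[count - 1] == '\n')
		count--;
	return count == strlen(word) && memcmp(buf, word, count) == 0;
}

/* Returns 1 for "on"/"1", 0 for "off"/"0", -1 for anything else. */
static int parse_on_off(const char *buf, size_t count)
{
	if (word_matches(buf, count, "on") || word_matches(buf, count, "1"))
		return 1;
	if (word_matches(buf, count, "off") || word_matches(buf, count, "0"))
		return 0;
	return -1;
}
```

The sketch reports unrecognized input instead of silently accepting
it. Whether the driver should return an error such as -EINVAL in that
case is a design choice left to the patch author.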

* bnx2x: link detected up at startup even when it should be down
From: Jean-Michel Hautbois @ 2012-09-20 15:39 UTC
  To: netdev; +Cc: barak, eilong, davem

Hi all,

I am working with a HP blade which has a bnx2x based card (Broadcom
NetXtreme II BCM57810 10 Gigabit Ethernet).
I am using a 3.2 linux kernel, which works very well except on
detecting the link state at startup.
I have my ethernet interfaces linked with a bond, and I want to
configure it for HA (in miimon mode).
I am using a managed switch which helps me in disabling/enabling ports.

When the port is disabled, at startup, the link is detected "UP".
When I enable the port, it is still "UP", and when I disable it again,
then it is detected "DOWN".

I have tried the latest 3.6-rc6 kernel, and it works well (link is
"DOWN" at startup when port is disabled).
Then I bisected it, and I found out that the commit which makes it
work (yes, it is an inverse bisect, thanks to this powerful git
tool :)) is:

a334872224a67b614dc888460377862621f3dac7 is the first bad commit
commit a334872224a67b614dc888460377862621f3dac7
Author: Barak Witkowski <barak@broadcom.com>
Date:   Mon Apr 23 03:04:46 2012 +0000

    bnx2x: add afex support

    Following patch adds afex multifunction support to the driver (afex
    multifunction is based on vntag header) and updates FW version
    used to 7.2.51.

    Support includes the following:

    1. Configure vif parameters in firmware (default vlan, vif id, default
       priority, allowed priorities) according to values received from NIC.
    2. Configure FW to strip/add default vlan according to afex vlan mode.
    3. Notify link up to OS only after vif is fully initialized.
    4. Support vif list set/get requests and configure FW accordingly.
    5. Supply afex statistics upon request from NIC.
    6. Special handling to L2 interface in case of FCoE vif.

    Signed-off-by: Barak Witkowski <barak@broadcom.com>
    Signed-off-by: Eilon Greenstein <eilong@broadcom.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

This commit is present in the 3.5.y stable branch, but not the 3.2.y one.
Is there a workaround which would make this feature work correctly
even on older kernels ?
It does not seem to be trivial, but I may be missing something as this
driver is pretty big...

Cheers,
JM

^ permalink raw reply

* RE: New commands to configure IOV features
From: Rose, Gregory V @ 2012-09-20 15:39 UTC (permalink / raw)
  To: Ben Hutchings, Yinghai Lu
  Cc: Bjorn Helgaas, Yuval Mintz, davem@davemloft.net,
	netdev@vger.kernel.org, Ariel Elior, Eilon Greenstein, linux-pci
In-Reply-To: <1348104190.4836.61.camel@deadeye.wl.decadent.org.uk>

> -----Original Message-----
> From: Ben Hutchings [mailto:bhutchings@solarflare.com]
> Sent: Wednesday, September 19, 2012 6:23 PM
> To: Yinghai Lu
> Cc: Bjorn Helgaas; Rose, Gregory V; Yuval Mintz; davem@davemloft.net;
> netdev@vger.kernel.org; Ariel Elior; Eilon Greenstein; linux-pci
> Subject: Re: New commands to configure IOV features
> 
> On Wed, 2012-09-19 at 17:19 -0700, Yinghai Lu wrote:
> > On Wed, Sep 19, 2012 at 3:46 PM, Ben Hutchings
> > <bhutchings@solarflare.com> wrote:
> > > On Wed, 2012-09-19 at 15:17 -0700, Yinghai Lu wrote:
> > >> +max_vfs_store(struct device *dev, struct device_attribute *attr,
> > >> +                const char *buf, size_t count) {
> > >> +       unsigned long val;
> > >> +       struct pci_dev *pdev = to_pci_dev(dev);
> > >> +
> > >> +       if (strict_strtoul(buf, 0, &val) < 0)
> > >> +               return -EINVAL;
> > >> +
> > >> +       pdev->max_vfs = val;
> > >> +
> > >> +       return count;
> > >> +}
> > > [...]
> > >
> > > Then what would actually trigger creation of the VFs?  There's no
> > > way we can assume that some sysfs attribute will be written before
> > > the PF driver is loaded (what if it's built-in?).  I thought the
> > > idea was to add a driver callback that would be called when the
> > > sysfs attribute was written.
> >
> > could just stop the device and add it back again?
> 
> This is highly disruptive and I think it would be totally unacceptable for
> at least networking devices.

Agreed.

We need the driver callback.
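
To make the idea of a driver callback concrete, here is a minimal user-space sketch in which the store handler parses the value and hands it to the bound driver instead of only recording it. All names (`demo_dev`, `demo_driver`, `set_num_vfs`, `num_vfs_store`, the limit of 8 VFs) are hypothetical and are not taken from the PCI core or from any posted patch; `strtoul()` stands in for the kernel's string parsing helper.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the kernel structures. */
struct demo_dev;

struct demo_driver {
	/* Called when user space asks for a new VF count.
	 * Returns 0 on success or a negative errno. */
	int (*set_num_vfs)(struct demo_dev *dev, unsigned long num_vfs);
};

struct demo_dev {
	const struct demo_driver *driver;	/* NULL if no driver is bound */
	unsigned long num_vfs;
};

/* Sketch of a sysfs-style store handler. 'buf' is assumed to be
 * NUL-terminated, as sysfs buffers are. */
static long num_vfs_store(struct demo_dev *dev, const char *buf, size_t count)
{
	char *end;
	unsigned long val;
	int rc;

	assert(dev != NULL && buf != NULL);

	errno = 0;
	val = strtoul(buf, &end, 0);
	if (errno || end == buf || (*end != '\0' && *end != '\n'))
		return -EINVAL;

	/* Without a bound driver there is nobody to create the VFs. */
	if (!dev->driver || !dev->driver->set_num_vfs)
		return -ENODEV;

	/* The driver decides whether and how to apply the request. */
	rc = dev->driver->set_num_vfs(dev, val);
	if (rc < 0)
		return rc;

	dev->num_vfs = val;
	return (long)count;
}

/* Example driver callback: this imaginary device supports at most 8 VFs. */
static int demo_set_num_vfs(struct demo_dev *dev, unsigned long num_vfs)
{
	(void)dev;
	return num_vfs <= 8 ? 0 : -ERANGE;
}
```

With this shape, the write itself triggers the driver, so it does not matter whether the attribute is written before or after the PF driver is loaded, and the device does not have to be removed and re-added.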

- Greg


^ permalink raw reply

* [PATCH net] bnx2x: remove false warning regarding interrupt number
From: Ariel Elior @ 2012-09-20 15:26 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, Eilon Greenstein, Ariel Elior

Since version 7.4 the FW configures in the pci config space the max
number of interrupts available to the physical function, instead of
the exact number to use.
This causes a false warning in driver when comparing the number of
configured interrupts to the number about to be used.

Signed-off-by: Ariel Elior <ariele@broadcom.com>
Signed-off-by: Eilon Greenstein <eilong@broadcom.com>
---
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c |   11 ++++++-----
 1 files changed, 6 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
index 211753e..0875ecf 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
@@ -9831,12 +9831,13 @@ static void __devinit bnx2x_get_igu_cam_info(struct bnx2x *bp)
 	}
 
 #ifdef CONFIG_PCI_MSI
-	/*
-	 * It's expected that number of CAM entries for this functions is equal
-	 * to the number evaluated based on the MSI-X table size. We want a
-	 * harsh warning if these values are different!
+	/* Due to new PF resource allocation by MFW T7.4 and above, it's
+	 * optional that number of CAM entries will not be equal to the value
+	 * advertised in PCI.
+	 * Driver should use the minimal value of both as the actual status
+	 * block count
 	 */
-	WARN_ON(bp->igu_sb_cnt != igu_sb_cnt);
+	bp->igu_sb_cnt = min_t(int, bp->igu_sb_cnt, igu_sb_cnt);
 #endif
 
 	if (igu_sb_cnt == 0)
-- 
1.7.9.GIT

^ permalink raw reply related

* Re: mlx4: dropping multicast packets at promisc leave
From: Marcelo Ricardo Leitner @ 2012-09-20 15:04 UTC (permalink / raw)
  To: Or Gerlitz; +Cc: netdev, Yevgeny Petrilin, Amir Vadai
In-Reply-To: <505B1874.3040904@mellanox.com>

On 09/20/2012 10:21 AM, Or Gerlitz wrote:
> On 20/09/2012 03:43, Marcelo Ricardo Leitner wrote:
>> I have a report that our mlx4 driver (RHEL 6.3) is dropping multicast
>> packets when the NIC leaves promisc mode. It seems this is being caused
>> by the new steering mode that was introduced around commit
>> 1679200f91da6a054b06954c9bd3eeed29b6731f. As it seems, the new
>> steering mode needs more commands/time to leave the promisc mode,
>> which may be leading to packet drops.
>
> Marcelo,
>
> The commit you point to below, 6d19993 "net/mlx4_en: Re-design multicast
> attachments flow", makes sure to avoid
> doing extra firmware commands and not leave a window in time where
> "correct" addresses are not attached. It's hard to say what's the case on
> that RHEL 6.3 system; it would be very helpful though if you managed to
> reproduce the problem on an upstream kernel -- BTW you didn't say on

Okay, I understand that the commit prevents a window. I may be missing 
something, but isn't there another one in there? Between:
mlx4_SET_MCAST_FLTR MLX4_MCAST_DISABLE and
mlx4_SET_MCAST_FLTR MLX4_MCAST_ENABLE
because mlx4_multicast_promisc_remove() was called just before those.
Otherwise I don't see how the NIC would be receiving multicast packets in 
there.

I understand the difficulty about the kernel version, I am sorry for 
that. As I'm unable to reproduce the issue by myself, I couldn't run a 
test in a plain upstream kernel so far or experiment much.

I was holding this email: I just got access to a server that seems to 
reproduce the issue. It has a MT27500 ConnectX-3 NIC. I have only tried our 
stock RHEL 6.3 kernel so far. I'll keep you posted on further tests!

This was the result of ifconfig mlx4_2 -promisc:
[  3] 34.0-35.0 sec  61.7 MBytes   517 Mbits/sec   0.024 ms  274/43502 (0.63%)
[  3] 34.0-35.0 sec  756 datagrams received out-of-order

> which kernel you are trying to reproduce this, note that upstream has
> also commit 60d31c1475f2 "net/mlx4_core: Looking for promiscuous entries
> on the correct port" (reported by you...) , so if somehow this commit
> makes the diff you could use it also on their system.

Sorry, that would be 2.6.32-279.el6. It has additional commits up to 
somewhere near commit
58a3de0 - mlx4_core: fix race on comm channel
but maybe not all before that one. Can't tell you for sure.

And then I tried 3 additional patches applied at once:
- 60d31c1475f2 "net/mlx4_core: Looking for promiscuous entries on the 
correct port"
- f1f75f0 - mlx4: attach multicast with correct flag
   - Yes, this one wasn't in 2.6.32-279.el6.
- 6d19993 - net/mlx4_en: Re-design multicast attachments flow

And they still reported drops.

>> It takes 300ms to perform the change there against my 600us. Hitting
>> something like tcpdump -c 10 in a loop helps triggering it.
>
> Do you have any insight for this huge difference?

No idea. Couldn't track it yet.

Thanks,
Marcelo.

^ permalink raw reply

* Re: regression: tethering fails in 3.5 with iwlwifi
From: Eric Dumazet @ 2012-09-20 14:25 UTC (permalink / raw)
  To: artem.bityutskiy; +Cc: Eric Dumazet, Johannes Berg, linux-wireless, netdev
In-Reply-To: <1348147353.2388.19.camel@sauron.fi.intel.com>

On Thu, 2012-09-20 at 16:22 +0300, Artem Bityutskiy wrote:
> On Thu, 2012-09-20 at 15:04 +0200, Eric Dumazet wrote:
> > Try to pull 40 bytes: That's OK for TCP performance, because 40 bytes
> > is the minimum size of IP+TCP headers
> > 
> > pskb_may_pull(skb, 40);
> 
> OK, I've tried almost this (see below) and it solves my issue:
> 
> diff --git a/net/mac80211/rx.c b/net/mac80211/rx.c
> index 965e6ec..7f079d0 100644
> --- a/net/mac80211/rx.c
> +++ b/net/mac80211/rx.c
> @@ -1798,9 +1798,13 @@ ieee80211_deliver_skb(struct ieee80211_rx_data *rx)
>  
>                 if (skb) {
>                         /* deliver to local stack */
> -                       skb->protocol = eth_type_trans(skb, dev);
> -                       memset(skb->cb, 0, sizeof(skb->cb));
> -                       netif_receive_skb(skb);
> +                       if (pskb_may_pull(skb, 40)) {
> +                               skb->protocol = eth_type_trans(skb, dev);
> +                               memset(skb->cb, 0, sizeof(skb->cb));
> +                               netif_receive_skb(skb);
> +                       } else {
> +                               kfree_skb(skb);
> +                       }
>                 }
>         }
> 

Please remove this hack and try the following bugfix in raw handler

icmp_filter() should not modify skb, or else its caller should not
assume ip_hdr() is unchanged.

 net/ipv4/raw.c |   29 +++++++++++++++++------------
 1 file changed, 17 insertions(+), 12 deletions(-)

diff --git a/net/ipv4/raw.c b/net/ipv4/raw.c
index f242578..3fa8c96 100644
--- a/net/ipv4/raw.c
+++ b/net/ipv4/raw.c
@@ -128,25 +128,30 @@ found:
 }
 
 /*
- *	0 - deliver
- *	1 - block
+ *	false - deliver
+ *	true - block
  */
-static __inline__ int icmp_filter(struct sock *sk, struct sk_buff *skb)
+static bool icmp_filter(struct sock *sk, const struct sk_buff *skb)
 {
-	int type;
-
-	if (!pskb_may_pull(skb, sizeof(struct icmphdr)))
-		return 1;
-
-	type = icmp_hdr(skb)->type;
-	if (type < 32) {
+	__u8 _type;
+	const __u8 *type;
+
+	type = skb_header_pointer(skb,
+				  skb_transport_offset(skb) +
+				  offsetof(struct icmphdr, type),
+				  sizeof(_type),
+				  &_type);
+	if (!type)
+		return true;
+
+	if (*type < 32) {
 		__u32 data = raw_sk(sk)->filter.data;
 
-		return ((1 << type) & data) != 0;
+		return ((1U << *type) & data) != 0;
 	}
 
 	/* Do not block unknown ICMP types */
-	return 0;
+	return false;
 }
 
 /* IP input processing comes here for RAW socket delivery.
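
For readers unfamiliar with the pattern, the following user-space sketch shows why a read-only header lookup avoids the problem described above: the filter copies the one byte it needs into a local buffer when necessary and never changes the packet, so pointers the caller already holds stay valid. `toy_pkt`, `toy_header_pointer()` and `toy_icmp_filter()` are hypothetical analogs written for illustration; they are not the kernel's `struct sk_buff` or `skb_header_pointer()`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy packet: 'linear' bytes are directly addressable, the rest live
 * in a separate fragment. */
struct toy_pkt {
	const uint8_t *linear;
	size_t linear_len;
	const uint8_t *frag;
	size_t frag_len;
};

/* Return a pointer to 'len' bytes at 'offset'. The bytes are copied
 * into 'buf' only when the range is not entirely in the linear area.
 * The packet itself is never modified. */
static const void *toy_header_pointer(const struct toy_pkt *p, size_t offset,
				      size_t len, void *buf)
{
	size_t total = p->linear_len + p->frag_len;
	uint8_t *out = buf;
	size_t i;

	assert(buf != NULL);

	if (len > total || offset > total - len)
		return NULL;			/* range is outside the packet */
	if (offset + len <= p->linear_len)
		return p->linear + offset;	/* no copy needed */

	for (i = 0; i < len; i++) {
		size_t pos = offset + i;

		out[i] = pos < p->linear_len ? p->linear[pos]
					     : p->frag[pos - p->linear_len];
	}
	return buf;
}

/* true = block, false = deliver, mirroring the convention in the patch. */
static bool toy_icmp_filter(uint32_t filter_mask, const struct toy_pkt *p,
			    size_t icmp_offset)
{
	uint8_t _type;
	const uint8_t *type;

	type = toy_header_pointer(p, icmp_offset, sizeof(_type), &_type);
	if (!type)
		return true;

	/* Unsigned shift: type 31 would overflow a signed int. */
	if (*type < 32)
		return ((UINT32_C(1) << *type) & filter_mask) != 0;

	/* Do not block unknown ICMP types */
	return false;
}
```

The contrast with a "pull" style helper is that pulling may move or reallocate the directly addressable area, which is exactly what invalidates a header pointer cached by the caller.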

^ permalink raw reply related

* Re: mlx4_en: fix endianness with blue frame support
From: Or Gerlitz @ 2012-09-20 13:46 UTC (permalink / raw)
  To: Dan Carpenter; +Cc: cascardo, netdev, Yevgeny Petrilin, Eli Cohen
In-Reply-To: <20120918073448.GA32445@elgon.mountain>

On Tue, Sep 18, 2012 at 10:34 AM, Dan Carpenter
<dan.carpenter@oracle.com> wrote:
> Hello Thadeu Lima de Souza Cascardo,
>
> The patch c5d6136e10d6: "mlx4_en: fix endianness with blue frame
> support" from Oct 10, 2011, leads to the following warning:
> drivers/net/ethernet/mellanox/mlx4/en_tx.c:720 mlx4_en_xmit()
>          warn: potential memory corrupting cast. 4 vs 2 bytes
>
> That patch introduced a call to cpu_to_be32() and added some endian notation.
>         *(__be32 *) (&tx_desc->ctrl.vlan_tag) |= cpu_to_be32(ring->doorbell_qpn);
> But it doesn't make sense because the data type is declared as u16 in
> the header and we would be corrupting the next elements in the struct
> which are ins_vlan and fence_size.
>
> struct mlx4_wqe_ctrl_seg {
>         __be32                  owner_opcode;
>         __be16                  vlan_tag;
>         u8                      ins_vlan;
>         u8                      fence_size;
>
> I guess the reason we get away with it is that the ->doorbell_qpn is
> normally less than 65k. But doorbell_qpn is a u32 type so I think there is a risk here.

Dan,

QP numbers are 24 bits in size. Under the blue-flame setting, the QP number
is written over the "vlan_tag" field and potentially also the "ins_vlan"
field of the control segment. We can do a little cleanup here by introducing
a modified version of the mlx4_wqe_ctrl_seg structure over which the cast
is made under the blue-flame flow.
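
A rough user-space sketch of such a modified structure follows. It assumes the doorbell value carries the 24-bit QP number in its upper three bytes (that is, shifted left by 8), which this thread does not state; under that assumption the 32-bit OR covers "vlan_tag" and "ins_vlan" and leaves "fence_size" untouched. `ctrl_seg_bf`, `bf_qpn` and `bf_set_qpn()` are hypothetical names, and `htonl()` stands in for `cpu_to_be32()`.

```c
#include <assert.h>
#include <arpa/inet.h>	/* htonl() as a stand-in for cpu_to_be32() */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* First fields of the control segment as quoted above (user-space types). */
struct ctrl_seg {
	uint32_t owner_opcode;	/* big-endian */
	uint16_t vlan_tag;	/* big-endian */
	uint8_t  ins_vlan;
	uint8_t  fence_size;
};

/* Hypothetical blue-flame view of the same eight bytes. */
struct ctrl_seg_bf {
	uint32_t owner_opcode;	/* big-endian */
	uint32_t bf_qpn;	/* overlays vlan_tag, ins_vlan, fence_size */
};

/* OR a 24-bit QP number into the 32-bit word that starts at vlan_tag. */
static void bf_set_qpn(struct ctrl_seg *ctrl, uint32_t qpn)
{
	unsigned char *word_pos =
		(unsigned char *)ctrl + offsetof(struct ctrl_seg_bf, bf_qpn);
	uint32_t word;

	assert(qpn <= 0xffffff);	/* QP numbers are 24 bits */
	assert(offsetof(struct ctrl_seg, vlan_tag) ==
	       offsetof(struct ctrl_seg_bf, bf_qpn));
	assert(sizeof(struct ctrl_seg) == sizeof(struct ctrl_seg_bf));

	/* memcpy keeps the access well-defined regardless of aliasing. */
	memcpy(&word, word_pos, sizeof(word));
	word |= htonl(qpn << 8);	/* low byte is zero: fence_size kept */
	memcpy(word_pos, &word, sizeof(word));
}
```

Naming the overlay explicitly documents that the 4-byte write is intentional, which is the kind of cast a static checker otherwise flags as "4 vs 2 bytes".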

Or.

^ permalink raw reply

* Re: mlx4: dropping multicast packets at promisc leave
From: Or Gerlitz @ 2012-09-20 13:21 UTC (permalink / raw)
  To: Marcelo Ricardo Leitner; +Cc: netdev, Yevgeny Petrilin, Amir Vadai
In-Reply-To: <505A66CC.8010701@redhat.com>

On 20/09/2012 03:43, Marcelo Ricardo Leitner wrote:
> I have a report that our mlx4 driver (RHEL 6.3) is dropping multicast 
> packets when the NIC leaves promisc mode. It seems this is being caused 
> by the new steering mode that was introduced around commit 
> 1679200f91da6a054b06954c9bd3eeed29b6731f. As it seems, the new 
> steering mode needs more commands/time to leave the promisc mode, 
> which may be leading to packet drops.

Marcelo,

The commit you point to below, 6d19993 "net/mlx4_en: Re-design multicast 
attachments flow", makes sure to avoid
doing extra firmware commands and not leave a window in time where 
"correct" addresses are not attached. It's hard to say what's the case on 
that RHEL 6.3 system; it would be very helpful though if you managed to 
reproduce the problem on an upstream kernel -- BTW you didn't say on 
which kernel you are trying to reproduce this. Note that upstream also has 
commit 60d31c1475f2 "net/mlx4_core: Looking for promiscuous entries 
on the correct port" (reported by you...), so if somehow this commit 
makes the difference you could use it also on their system.

> It takes 300ms to perform the change there against my 600us. Hitting 
> something like tcpdump -c 10 in a loop helps triggering it.

Do you have any insight for this huge difference?

Or.

^ permalink raw reply

* Equal bandwidth for all users - Per Connection Queue (PCQ) on Linux ?
From: valent.turkovic @ 2012-09-20 13:22 UTC (permalink / raw)
  To: netdev

Hi all,
is there any way to dynamically distribute bandwidth
per connected client?
For example, if the maximum bandwidth available is 10/10 Mbit/s and only
two clients are connected, then they each get 1/2 (5/5 Mbit/s), and when
there are 10 clients connected and using the connection, each gets
1/10 (1/1 Mbit/s).
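
The arithmetic in this example can be written down as a small helper. This is only an illustration of the equal split being asked for, not how any Linux qdisc or PCQ is implemented; `fair_share_kbit()` is a hypothetical name.

```c
#include <assert.h>
#include <stdint.h>

/* Equal split of a link among the clients that are currently active.
 * Returns the per-client rate in kbit/s, or 0 when nobody is active. */
static uint32_t fair_share_kbit(uint32_t link_kbit, uint32_t active_clients)
{
	if (active_clients == 0)
		return 0;
	return link_kbit / active_clients;
}
```

For a 10 Mbit/s link (10000 kbit/s), two active clients get 5000 kbit/s each and ten active clients get 1000 kbit/s each; the share has to be recomputed whenever a client becomes active or idle.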

I haven't yet seen any QoS technique on Linux that allows this, but my
guess is that some Linux networking guru has this figured out.

MikroTik (Linux-based embedded OS specialized for networking) has a QoS
type called PCQ [1] that dynamically spreads bandwidth just as
explained.

[1] http://wiki.mikrotik.com/wiki/Manual:Queues_-_PCQ_Examples

-- 
follow me - www.twitter.com/valentt & http://kernelreloaded.blog385.com
linux, anime, spirituality, wireless, scuba, linuxmce smart home, zwave
ICQ: 2125241, Skype: valent.turkovic, MSN: valent.turkovic@hotmail.com

^ permalink raw reply

* Re: regression: tethering fails in 3.5 with iwlwifi
From: Eric Dumazet @ 2012-09-20 13:22 UTC (permalink / raw)
  To: artem.bityutskiy; +Cc: Eric Dumazet, Johannes Berg, linux-wireless, netdev
In-Reply-To: <1348147353.2388.19.camel@sauron.fi.intel.com>

On Thu, 2012-09-20 at 16:22 +0300, Artem Bityutskiy wrote:

> 
> OK, I've tried almost this (see below) and it solves my issue:
> 
> diff --git a/net/mac80211/rx.c b/net/mac80211/rx.c
> index 965e6ec..7f079d0 100644
> --- a/net/mac80211/rx.c
> +++ b/net/mac80211/rx.c
> @@ -1798,9 +1798,13 @@ ieee80211_deliver_skb(struct ieee80211_rx_data *rx)
>  
>                 if (skb) {
>                         /* deliver to local stack */
> -                       skb->protocol = eth_type_trans(skb, dev);
> -                       memset(skb->cb, 0, sizeof(skb->cb));
> -                       netif_receive_skb(skb);
> +                       if (pskb_may_pull(skb, 40)) {
> +                               skb->protocol = eth_type_trans(skb, dev);
> +                               memset(skb->cb, 0, sizeof(skb->cb));
> +                               netif_receive_skb(skb);
> +                       } else {
> +                               kfree_skb(skb);
> +                       }
>                 }
>         }
> 

OK, but you can't do that, or small frames will be dropped.

Anyway, it's a hack; we should find the buggy layer.

You could use dropwatch (drop_monitor) to check where the frame is dropped.

modprobe drop_monitor
dropwatch -l kas

^ permalink raw reply

