Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH v3] mac80211: port CCMP to cryptoapi's CCM driver
From: Johannes Berg @ 2013-10-11 13:44 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: linux-wireless-u79uwXL29TY76Z2rM5mHXA,
	netdev-u79uwXL29TY76Z2rM5mHXA, patches-QSEj5FYQhm4dnm+yROfE0A
In-Reply-To: <1381391720-28771-1-git-send-email-ard.biesheuvel-QSEj5FYQhm4dnm+yROfE0A@public.gmane.org>

On Thu, 2013-10-10 at 09:55 +0200, Ard Biesheuvel wrote:
> Use the generic CCM aead chaining mode driver rather than a local
> implementation that sits right on top of the core AES cipher.
> 
> This allows the use of accelerated implementations of either
> CCM as a whole or the CTR mode which it encapsulates.

Applied.

johannes

--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: XGMII patch
From: Rayagond K @ 2013-10-11 13:34 UTC (permalink / raw)
  To: Florian Fainelli; +Cc: netdev, afleming
In-Reply-To: <CAGVrzcZpZUVNg9zt3x0UsPuV1fSaEy89cg+1=t038_7FzPWjEw@mail.gmail.com>

On Fri, Oct 11, 2013 at 1:52 PM, Florian Fainelli <f.fainelli@gmail.com> wrote:
> Adding Andy in CC who is the original author
>
> Le 8 oct. 2013 07:19, "Rayagond K" <rayagond@vayavyalabs.com> a écrit :
>
>
>>
>> Hi All,
>>
>> I came across some old patch series to support XGMII in PHYLIB -
>> http://lwn.net/Articles/463344/
>>
>> These patch are not available in mainline kernel still and also PHYLIB
>> doesn't support XGMII also.
>
> This patch set does much more than adding XGMII support, it also adds a
> generic 10G driver, is that what you need or can you live with just
> extending phy_interface_t to know about XGMII and write your own PHY driver?

I need both ie generic 10G driver and extending phy_interface_t to
know about XGMII.

Currently PAL takes care of PHY driver and handles everything related
to PHY device, MAC driver has just registers to PAL and give some
callback function like read_phy, write_phy and adjust_link etc. But
PAL doesn't support XGMII yet, so this patch does that right ?

>
>>
>> Any plan for applying these patch to mainline kernel.
>
> There were some comments on the patchset and there does not seem to have
> been a follow up.

Since this patch adds generic 10G driver and extends PAL to support
XGMII, its worth to follow up comments and add this patch to mainline
kernel asap.

^ permalink raw reply

* Re: [PATCH v2 tip/core/rcu 07/13] ipv6/ip6_tunnel: Apply rcu_access_pointer() to avoid sparse false positive
From: Paul E. McKenney @ 2013-10-11 13:25 UTC (permalink / raw)
  To: Josh Triplett
  Cc: Eric Dumazet, linux-kernel, mingo, laijs, dipankar, akpm,
	mathieu.desnoyers, niv, tglx, peterz, rostedt, dhowells, edumazet,
	darren, fweisbec, sbw, David S. Miller, Alexey Kuznetsov,
	James Morris, Hideaki YOSHIFUJI, Patrick McHardy, netdev
In-Reply-To: <20131011002044.GA32546@jtriplet-mobl1>

On Thu, Oct 10, 2013 at 05:20:44PM -0700, Josh Triplett wrote:
> On Wed, Oct 09, 2013 at 05:28:33PM -0700, Paul E. McKenney wrote:
> > On Wed, Oct 09, 2013 at 05:12:40PM -0700, Eric Dumazet wrote:
> > > On Wed, 2013-10-09 at 16:40 -0700, Josh Triplett wrote:
> > > 
> > > > that.  Constructs like list_del_rcu are much clearer, and not
> > > > open-coded.  Open-coding synchronization code is almost always a Bad
> > > > Idea.
> > > 
> > > OK, so you think there is synchronization code.
> > > 
> > > I will shut up then, no need to waste time.
> > 
> > As you said earlier, we should at least get rid of the memory barrier
> > as long as we are changing the code.
> > 
> > Josh, what would you suggest as the best way to avoid the memory barrier,
> > keep sparse happy, and not be too ugly?
> 
> The more I think about it, the more I realize that assigning an __rcu
> pointer to an __rcu pointer *without* a memory barrier is a sufficiently
> uncommon case that you probably *should* just write an open-coded
> assignment.  Just please put a very clear comment right before it.

Fair enough, will do!  Given earlier email, I believe that Eric is
fine with this, and if he isn't I am sure he will let us know.  ;-)

> I'd originally thought it might make sense to have a macro similar to
> rcu_assign_pointer, but I just don't think this is a common enough case,
> and we don't want people thinking they can use this in general for __rcu
> to __rcu assignments (most of which still need a memory barrier).

Yep, it is a rather small fraction of rcu_assign_pointer() instances.

							Thanx, Paul

^ permalink raw reply

* [PATCH] issues with printk "mppe_compress[%d]: osize too small! ..."
From: Helmut Grohne @ 2013-10-11 12:57 UTC (permalink / raw)
  To: Paul Mackerras, linux-ppp, netdev@vger.kernel.org

[-- Attachment #1: Type: text/plain, Size: 1140 bytes --]

Dear kernel maintainers,

I ran into an issue that causes "mppe_compress[%d]: osize too small! …"
message to be printed. While debugging the issue, I stumbled across two
additional issues with the code in question.

1) The printk gives a "have" and a "need" value. However the "need"
   value does not represent the actual need, because the printed need
   is based on the osize variable instead of the isize variable used in
   the test. This appears to by a typo an is as old as the ppp_mppe.c
   file. The first patch attached changes the parameter to printk to
   print the actual need.

2) The message in question can be triggered by network packets. If that
   happens, there are a lot of printks and the whole system slows down
   to a crawl. Whenever network packets are processed, the printks must
   be rate limited in some way, so my second patch adds a call to
   net_ratelimit(). Note that even though checkpatch suggested to
   convert the printks to netdev_dbg, I felt that a bug fix should be
   minimal and thus not change this aspect.

Please consider these patches and CC me in replies.

Helmut

[-- Attachment #2: 0001-mppe-print-correct-osize-required-for-compression.patch --]
[-- Type: application/octet-stream, Size: 1270 bytes --]

From 77ef588c202c1e8b23f5292e6b64d7cd69a9daba Mon Sep 17 00:00:00 2001
From: Helmut Grohne <h.grohne@cygnusnetworks.de>
Date: Fri, 11 Oct 2013 14:18:50 +0200
Subject: [PATCH] mppe: print correct osize required for compression

An example message looks like this:

mppe_compress[76]: osize too small! (have: 1404 need: 1408)

The value printed as "need" is not the one that is actually needed,
because it is based on osize instead of isize as used in the test. In
the old implementation the "need" value would always 4 greater than the
"have" value.

Signed-off-by: Helmut Grohne <h.grohne@cygnusnetworks.de>
---
 drivers/net/ppp/ppp_mppe.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ppp/ppp_mppe.c b/drivers/net/ppp/ppp_mppe.c
index 9a1849a..3d595a8 100644
--- a/drivers/net/ppp/ppp_mppe.c
+++ b/drivers/net/ppp/ppp_mppe.c
@@ -385,7 +385,7 @@ mppe_compress(void *arg, unsigned char *ibuf, unsigned char *obuf,
 		/* Drop the packet if we should encrypt it, but can't. */
 		printk(KERN_DEBUG "mppe_compress[%d]: osize too small! "
 		       "(have: %d need: %d)\n", state->unit,
-		       osize, osize + MPPE_OVHD + 2);
+		       osize, isize + MPPE_OVHD + 2);
 		return -1;
 	}

-- 
1.7.10.4

[-- Attachment #3: Type: text/plain, Size: 1 bytes --]

[-- Attachment #4: 0001-mppe-rate-limit-osize-too-small.patch --]
[-- Type: application/octet-stream, Size: 2557 bytes --]

From 3332d572e21f52d65df841ba85537b65d336007e Mon Sep 17 00:00:00 2001
From: Helmut Grohne <h.grohne@cygnusnetworks.de>
Date: Fri, 11 Oct 2013 14:34:06 +0200
Subject: [PATCH] mppe: rate limit "osize too small"

These messages occur per packet and can be triggered by remote sites. As
such they must be rate limited in order to not clog the kernel log
buffers. The corresponding messages from ppp_generic.c are already rate
limited. Example:

mppe_compress[76]: osize too small! (have: 1404 need: 1408)
ppp: compressor dropped pkt
mppe_compress[76]: osize too small! (have: 1404 need: 1408)
ppp: compressor dropped pkt
mppe_compress[76]: osize too small! (have: 1404 need: 1408)
mppe_compress[76]: osize too small! (have: 1404 need: 1408)

Signed-off-by: Helmut Grohne <h.grohne@cygnusnetworks.de>
---
 drivers/net/ppp/ppp_mppe.c |   15 +++++++++------
 1 file changed, 9 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ppp/ppp_mppe.c b/drivers/net/ppp/ppp_mppe.c
index 9a1849a..7e71a7a 100644
--- a/drivers/net/ppp/ppp_mppe.c
+++ b/drivers/net/ppp/ppp_mppe.c
@@ -52,6 +52,7 @@
 #include <linux/string.h>
 #include <linux/crypto.h>
 #include <linux/mm.h>
+#include <linux/net.h>
 #include <linux/ppp_defs.h>
 #include <linux/ppp-comp.h>
 #include <linux/scatterlist.h>
@@ -383,9 +384,10 @@ mppe_compress(void *arg, unsigned char *ibuf, unsigned char *obuf,
 	/* Make sure we have enough room to generate an encrypted packet. */
 	if (osize < isize + MPPE_OVHD + 2) {
 		/* Drop the packet if we should encrypt it, but can't. */
-		printk(KERN_DEBUG "mppe_compress[%d]: osize too small! "
-		       "(have: %d need: %d)\n", state->unit,
-		       osize, osize + MPPE_OVHD + 2);
+		if (net_ratelimit())
+			printk(KERN_DEBUG "mppe_compress[%d]: osize too small! "
+			       "(have: %d need: %d)\n", state->unit,
+			       osize, osize + MPPE_OVHD + 2);
 		return -1;
 	}

@@ -497,9 +499,10 @@ mppe_decompress(void *arg, unsigned char *ibuf, int isize, unsigned char *obuf,
 	 * this is to account for possible PFC.
 	 */
 	if (osize < isize - MPPE_OVHD - 1) {
-		printk(KERN_DEBUG "mppe_decompress[%d]: osize too small! "
-		       "(have: %d need: %d)\n", state->unit,
-		       osize, isize - MPPE_OVHD - 1);
+		if (net_ratelimit())
+			printk(KERN_DEBUG "mppe_decompress[%d]: osize too small! "
+			       "(have: %d need: %d)\n", state->unit,
+			       osize, isize - MPPE_OVHD - 1);
 		return DECOMP_ERROR;
 	}
 	osize = isize - MPPE_OVHD - 2;	/* assume no PFC */
-- 
1.7.10.4

^ permalink raw reply related

* RE: [Xen-devel] [PATCH net-next v3 2/5] xen-netback: add support for IPv6 checksum offload from guest
From: Paul Durrant @ 2013-10-11 12:08 UTC (permalink / raw)
  To: Paul Durrant, Wei Liu
  Cc: netdev@vger.kernel.org, Ian Campbell, Wei Liu, David Vrabel,
	xen-devel@lists.xen.org
In-Reply-To: <9AAE0902D5BC7E449B7C8E4E778ABCD013262E@AMSPEX01CL01.citrite.net>

> -----Original Message-----
> From: xen-devel-bounces@lists.xen.org [mailto:xen-devel-
> bounces@lists.xen.org] On Behalf Of Paul Durrant
> Sent: 11 October 2013 12:10
> To: Wei Liu
> Cc: netdev@vger.kernel.org; Ian Campbell; Wei Liu; David Vrabel; xen-
> devel@lists.xen.org
> Subject: Re: [Xen-devel] [PATCH net-next v3 2/5] xen-netback: add support
> for IPv6 checksum offload from guest
> 
> > -----Original Message-----
> > From: Wei Liu [mailto:wei.liu2@citrix.com]
> > Sent: 11 October 2013 11:10
> > To: Paul Durrant
> > Cc: xen-devel@lists.xen.org; netdev@vger.kernel.org; Wei Liu; David
> Vrabel;
> > Ian Campbell
> > Subject: Re: [PATCH net-next v3 2/5] xen-netback: add support for IPv6
> > checksum offload from guest
> >
> > On Thu, Oct 10, 2013 at 04:25:30PM +0100, Paul Durrant wrote:
> > [...]
> > > -#define PKT_PROT_LEN    (ETH_HLEN + \
> > > -			 VLAN_HLEN + \
> > > -			 sizeof(struct iphdr) + MAX_IPOPTLEN + \
> > > -			 sizeof(struct tcphdr) + MAX_TCP_OPTION_SPACE)
> > > +#define PKT_PROT_LEN 128
> >
> > I'm not against changing this, but could you please add comment on why
> > 128 is chosen.
> >
> 
> Hmm. I did mod the comment, but that change seems to have got lost
> somewhere.
> 
> > >
> > [...]
> > > +	while ((off <= sizeof(struct ipv6hdr) + ntohs(ipv6h->payload_len))
> > &&
> > > +	       !done) {
> > > +		switch (nexthdr) {
> > > +		case IPPROTO_FRAGMENT: {
> > > +			struct frag_hdr *hp = (void *)(skb->data + off);
> > > +
> > > +			header_size = skb->network_header +
> > > +				off +
> > > +				sizeof(struct frag_hdr);
> > > +			maybe_pull_tail(skb, header_size);
> > > +
> > > +			nexthdr = hp->nexthdr;
> > > +			off += 8;
> >
> > Can you use sizeof(frag_hdr) instead of 8?
> >
> > > +			break;
> > > +		}
> > > +		case IPPROTO_DSTOPTS:
> > > +		case IPPROTO_HOPOPTS:
> > > +		case IPPROTO_ROUTING: {
> > > +			struct ipv6_opt_hdr *hp = (void *)(skb->data + off);
> > > +
> > > +			header_size = skb->network_header +
> > > +				off +
> > > +				sizeof(struct ipv6_opt_hdr);
> > > +			maybe_pull_tail(skb, header_size);
> > > +
> > > +			nexthdr = hp->nexthdr;
> > > +			off += ipv6_optlen(hp);
> > > +			break;
> > > +		}
> > > +		case IPPROTO_AH: {
> > > +			struct ip_auth_hdr *hp = (void *)(skb->data + off);
> > > +
> > > +			header_size = skb->network_header +
> > > +				off +
> > > +				sizeof(struct ip_auth_hdr);
> > > +			maybe_pull_tail(skb, header_size);
> > > +
> > > +			nexthdr = hp->nexthdr;
> > > +			off += (hp->hdrlen+2)<<2;
> > > +			break;
> > > +		}
> > > +		case IPPROTO_TCP:
> > > +		case IPPROTO_UDP:
> > > +			found = true;
> > > +			/* Fall through */
> > > +		default:
> > > +			done = true;
> > > +			break;
> >
> > The above 'switch' doesn't seem to cover all IPPROTO_*. Will it cause
> > the loop to exit too early? In other words, does any IPPROTO_* not
> > listed above marks the end of parsing?
> >
> 
> AFAIK, hitting anything not in that switch would mean we don't have a TCP or
> UDP packet. I think the original code structure made that clearer so I'm going
> to go back to that.
> 

Looking at this again I realize I should not parse the fragment header; there's no way anyone should be sending a fragment with partial checksum.

  Paul

> > [...]
> > >  	unsigned long now = jiffies;
> > > @@ -1428,12 +1598,7 @@ static int xenvif_tx_submit(struct xenvif *vif,
> int
> > budget)
> > >
> > >  		xenvif_fill_frags(vif, skb);
> > >
> > > -		/*
> > > -		 * If the initial fragment was < PKT_PROT_LEN then
> > > -		 * pull through some bytes from the other fragments to
> > > -		 * increase the linear region to PKT_PROT_LEN bytes.
> > > -		 */
> > > -		if (skb_headlen(skb) < PKT_PROT_LEN &&
> > skb_is_nonlinear(skb)) {
> > > +		if (skb_is_nonlinear(skb) && skb_headlen(skb) <
> > PKT_PROT_LEN) {
> >
> > This change is not necessary: the comment is useful, and swapping two
> > operands of && doesn't have any effect.
> 
> Sorry, that was too much cutting and pasting around. I'll just ditch that hunk.
> 
>   Paul
> 
> >
> > Wei.
> 
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel

^ permalink raw reply

* RE: [PATCH net-next v3 5/5] xen-netback: enable IPv6 TCP GSO to the guest
From: Paul Durrant @ 2013-10-11 11:11 UTC (permalink / raw)
  To: Wei Liu
  Cc: xen-devel@lists.xen.org, netdev@vger.kernel.org, Wei Liu,
	David Vrabel, Ian Campbell
In-Reply-To: <20131011100950.GF15556@zion.uk.xensource.com>

> -----Original Message-----
> From: Wei Liu [mailto:wei.liu2@citrix.com]
> Sent: 11 October 2013 11:10
> To: Paul Durrant
> Cc: xen-devel@lists.xen.org; netdev@vger.kernel.org; Wei Liu; David Vrabel;
> Ian Campbell
> Subject: Re: [PATCH net-next v3 5/5] xen-netback: enable IPv6 TCP GSO to
> the guest
> 
> On Thu, Oct 10, 2013 at 04:25:33PM +0100, Paul Durrant wrote:
> [...]
> >  	return max;
> > @@ -312,6 +312,7 @@ static struct xenvif_rx_meta
> *get_next_rx_buffer(struct xenvif *vif,
> >  	req = RING_GET_REQUEST(&vif->rx, vif->rx.req_cons++);
> >
> >  	meta = npo->meta + npo->meta_prod++;
> > +	meta->gso_type = 0;
> 
> Should use XEN_NETIF_GSO_TYPE_NONE here.
> 
> >  	meta->gso_size = 0;
> >  	meta->size = 0;
> [...]
> > +	if (skb_shinfo(skb)->gso_type & SKB_GSO_TCPV4) {
> > +		gso_type = XEN_NETIF_GSO_TYPE_TCPV4;
> > +		gso_size = skb_shinfo(skb)->gso_size;
> > +	} else if (skb_shinfo(skb)->gso_type & SKB_GSO_TCPV6) {
> > +		gso_type = XEN_NETIF_GSO_TYPE_TCPV6;
> > +		gso_size = skb_shinfo(skb)->gso_size;
> > +	} else {
> > +		gso_type = 0;
> 
> Ditto.
> 
> > +		gso_size = 0;
> > +	}
> > +
> [...]
> > @@ -438,10 +461,13 @@ static int xenvif_gop_skb(struct sk_buff *skb,
> >  	req = RING_GET_REQUEST(&vif->rx, vif->rx.req_cons++);
> >  	meta = npo->meta + npo->meta_prod++;
> >
> > -	if (!vif->gso_prefix)
> > -		meta->gso_size = skb_shinfo(skb)->gso_size;
> > -	else
> > +	if ((1 << gso_type) & vif->gso_mask) {
> > +		meta->gso_type = gso_type;
> > +		meta->gso_size = gso_size;
> > +	} else {
> > +		meta->gso_type = 0;
> 
> Ditto.
> 
> >  		meta->gso_size = 0;
> > +	}
> >
> >  	meta->size = 0;
> >  	meta->id = req->id;
> > @@ -587,7 +613,8 @@ void xenvif_rx_action(struct xenvif *vif)
> >
> >  		vif = netdev_priv(skb->dev);
> >
> > -		if (vif->meta[npo.meta_cons].gso_size && vif->gso_prefix) {
> > +		if ((1 << vif->meta[npo.meta_cons].gso_type) &
> > +		    vif->gso_prefix_mask) {
> 
> The logic is changed here. Why do you remove the test on gso_size?
> 
> >  			resp = RING_GET_RESPONSE(&vif->rx,
> >  						 vif->rx.rsp_prod_pvt++);
> >
> > @@ -624,7 +651,8 @@ void xenvif_rx_action(struct xenvif *vif)
> >  					vif->meta[npo.meta_cons].size,
> >  					flags);
> >
> > -		if (vif->meta[npo.meta_cons].gso_size && !vif->gso_prefix)
> {
> > +		if ((1 << vif->meta[npo.meta_cons].gso_type) &
> > +		    vif->gso_mask) {
> 
> Ditto.
> 
> >  			struct xen_netif_extra_info *gso =
> >  				(struct xen_netif_extra_info *)
> >  				RING_GET_RESPONSE(&vif->rx,
> > @@ -632,8 +660,8 @@ void xenvif_rx_action(struct xenvif *vif)
> >
> [...]
> >  	 * was turned on by a zero value in (or lack of)
> > diff --git a/include/xen/interface/io/netif.h
> b/include/xen/interface/io/netif.h
> > index d7dd8d7..41674bc 100644
> > --- a/include/xen/interface/io/netif.h
> > +++ b/include/xen/interface/io/netif.h
> > @@ -113,6 +113,7 @@ struct xen_netif_tx_request {
> >  #define  XEN_NETIF_EXTRA_FLAG_MORE
> 	(1U<<_XEN_NETIF_EXTRA_FLAG_MORE)
> >
> >  /* GSO types */
> > +#define XEN_NETIF_GSO_TYPE_NONE         (0)
> 
> Indentation.
> 

Agreed. I managed to lose a load of changes and this patch was one I had to re-construct; I'll fix.

  Paul

^ permalink raw reply

* RE: [PATCH net-next v3 2/5] xen-netback: add support for IPv6 checksum offload from guest
From: Paul Durrant @ 2013-10-11 11:09 UTC (permalink / raw)
  To: Wei Liu
  Cc: xen-devel@lists.xen.org, netdev@vger.kernel.org, Wei Liu,
	David Vrabel, Ian Campbell
In-Reply-To: <20131011100943.GE15556@zion.uk.xensource.com>

> -----Original Message-----
> From: Wei Liu [mailto:wei.liu2@citrix.com]
> Sent: 11 October 2013 11:10
> To: Paul Durrant
> Cc: xen-devel@lists.xen.org; netdev@vger.kernel.org; Wei Liu; David Vrabel;
> Ian Campbell
> Subject: Re: [PATCH net-next v3 2/5] xen-netback: add support for IPv6
> checksum offload from guest
> 
> On Thu, Oct 10, 2013 at 04:25:30PM +0100, Paul Durrant wrote:
> [...]
> > -#define PKT_PROT_LEN    (ETH_HLEN + \
> > -			 VLAN_HLEN + \
> > -			 sizeof(struct iphdr) + MAX_IPOPTLEN + \
> > -			 sizeof(struct tcphdr) + MAX_TCP_OPTION_SPACE)
> > +#define PKT_PROT_LEN 128
> 
> I'm not against changing this, but could you please add comment on why
> 128 is chosen.
> 

Hmm. I did mod the comment, but that change seems to have got lost somewhere.

> >
> [...]
> > +	while ((off <= sizeof(struct ipv6hdr) + ntohs(ipv6h->payload_len))
> &&
> > +	       !done) {
> > +		switch (nexthdr) {
> > +		case IPPROTO_FRAGMENT: {
> > +			struct frag_hdr *hp = (void *)(skb->data + off);
> > +
> > +			header_size = skb->network_header +
> > +				off +
> > +				sizeof(struct frag_hdr);
> > +			maybe_pull_tail(skb, header_size);
> > +
> > +			nexthdr = hp->nexthdr;
> > +			off += 8;
> 
> Can you use sizeof(frag_hdr) instead of 8?
> 
> > +			break;
> > +		}
> > +		case IPPROTO_DSTOPTS:
> > +		case IPPROTO_HOPOPTS:
> > +		case IPPROTO_ROUTING: {
> > +			struct ipv6_opt_hdr *hp = (void *)(skb->data + off);
> > +
> > +			header_size = skb->network_header +
> > +				off +
> > +				sizeof(struct ipv6_opt_hdr);
> > +			maybe_pull_tail(skb, header_size);
> > +
> > +			nexthdr = hp->nexthdr;
> > +			off += ipv6_optlen(hp);
> > +			break;
> > +		}
> > +		case IPPROTO_AH: {
> > +			struct ip_auth_hdr *hp = (void *)(skb->data + off);
> > +
> > +			header_size = skb->network_header +
> > +				off +
> > +				sizeof(struct ip_auth_hdr);
> > +			maybe_pull_tail(skb, header_size);
> > +
> > +			nexthdr = hp->nexthdr;
> > +			off += (hp->hdrlen+2)<<2;
> > +			break;
> > +		}
> > +		case IPPROTO_TCP:
> > +		case IPPROTO_UDP:
> > +			found = true;
> > +			/* Fall through */
> > +		default:
> > +			done = true;
> > +			break;
> 
> The above 'switch' doesn't seem to cover all IPPROTO_*. Will it cause
> the loop to exit too early? In other words, does any IPPROTO_* not
> listed above marks the end of parsing?
> 

AFAIK, hitting anything not in that switch would mean we don't have a TCP or UDP packet. I think the original code structure made that clearer so I'm going to go back to that.

> [...]
> >  	unsigned long now = jiffies;
> > @@ -1428,12 +1598,7 @@ static int xenvif_tx_submit(struct xenvif *vif, int
> budget)
> >
> >  		xenvif_fill_frags(vif, skb);
> >
> > -		/*
> > -		 * If the initial fragment was < PKT_PROT_LEN then
> > -		 * pull through some bytes from the other fragments to
> > -		 * increase the linear region to PKT_PROT_LEN bytes.
> > -		 */
> > -		if (skb_headlen(skb) < PKT_PROT_LEN &&
> skb_is_nonlinear(skb)) {
> > +		if (skb_is_nonlinear(skb) && skb_headlen(skb) <
> PKT_PROT_LEN) {
> 
> This change is not necessary: the comment is useful, and swapping two
> operands of && doesn't have any effect.

Sorry, that was too much cutting and pasting around. I'll just ditch that hunk.

  Paul

> 
> Wei.

^ permalink raw reply

* Re: [PATCH net] bridge: allow receiption on disabled port
From: Felix Fietkau @ 2013-10-11 10:18 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: netdev
In-Reply-To: <20131010193551.462fc1f8@nehalam.linuxnetplumber.net>

On 2013-10-11 4:35 AM, Stephen Hemminger wrote:
> This is what I was thinking would be better.
> 
> Don't want these packets leaking into PRE_ROUTEING chain or have
> any chance to get flooded out other ports.
> 
> Compile tested only...
> 
> I could use another goto instead but that becomes spaghetti and
> never like to jump back into a block.
[...]
>  forward:
>  	switch (p->state) {
> +	case BR_STATE_DISABLED:
> +		if (!ether_addr_equal(p->br->dev->dev_addr, dest))
> +			goto drop;
Checking against the bridge device address isn't enough, WPA EAPOL
packets are addressed to the wifi device MAC address.

- Felix

^ permalink raw reply

* Re: [PATCH net-next v3 5/5] xen-netback: enable IPv6 TCP GSO to the guest
From: Wei Liu @ 2013-10-11 10:09 UTC (permalink / raw)
  To: Paul Durrant; +Cc: xen-devel, netdev, Wei Liu, David Vrabel, Ian Campbell
In-Reply-To: <1381418733-20470-6-git-send-email-paul.durrant@citrix.com>

On Thu, Oct 10, 2013 at 04:25:33PM +0100, Paul Durrant wrote:
[...]
>  	return max;
> @@ -312,6 +312,7 @@ static struct xenvif_rx_meta *get_next_rx_buffer(struct xenvif *vif,
>  	req = RING_GET_REQUEST(&vif->rx, vif->rx.req_cons++);
>  
>  	meta = npo->meta + npo->meta_prod++;
> +	meta->gso_type = 0;

Should use XEN_NETIF_GSO_TYPE_NONE here.

>  	meta->gso_size = 0;
>  	meta->size = 0;
[...]
> +	if (skb_shinfo(skb)->gso_type & SKB_GSO_TCPV4) {
> +		gso_type = XEN_NETIF_GSO_TYPE_TCPV4;
> +		gso_size = skb_shinfo(skb)->gso_size;
> +	} else if (skb_shinfo(skb)->gso_type & SKB_GSO_TCPV6) {
> +		gso_type = XEN_NETIF_GSO_TYPE_TCPV6;
> +		gso_size = skb_shinfo(skb)->gso_size;
> +	} else {
> +		gso_type = 0;

Ditto.

> +		gso_size = 0;
> +	}
> +
[...]
> @@ -438,10 +461,13 @@ static int xenvif_gop_skb(struct sk_buff *skb,
>  	req = RING_GET_REQUEST(&vif->rx, vif->rx.req_cons++);
>  	meta = npo->meta + npo->meta_prod++;
>  
> -	if (!vif->gso_prefix)
> -		meta->gso_size = skb_shinfo(skb)->gso_size;
> -	else
> +	if ((1 << gso_type) & vif->gso_mask) {
> +		meta->gso_type = gso_type;
> +		meta->gso_size = gso_size;
> +	} else {
> +		meta->gso_type = 0;

Ditto.

>  		meta->gso_size = 0;
> +	}
>  
>  	meta->size = 0;
>  	meta->id = req->id;
> @@ -587,7 +613,8 @@ void xenvif_rx_action(struct xenvif *vif)
>  
>  		vif = netdev_priv(skb->dev);
>  
> -		if (vif->meta[npo.meta_cons].gso_size && vif->gso_prefix) {
> +		if ((1 << vif->meta[npo.meta_cons].gso_type) &
> +		    vif->gso_prefix_mask) {

The logic is changed here. Why do you remove the test on gso_size?

>  			resp = RING_GET_RESPONSE(&vif->rx,
>  						 vif->rx.rsp_prod_pvt++);
>  
> @@ -624,7 +651,8 @@ void xenvif_rx_action(struct xenvif *vif)
>  					vif->meta[npo.meta_cons].size,
>  					flags);
>  
> -		if (vif->meta[npo.meta_cons].gso_size && !vif->gso_prefix) {
> +		if ((1 << vif->meta[npo.meta_cons].gso_type) &
> +		    vif->gso_mask) {

Ditto.

>  			struct xen_netif_extra_info *gso =
>  				(struct xen_netif_extra_info *)
>  				RING_GET_RESPONSE(&vif->rx,
> @@ -632,8 +660,8 @@ void xenvif_rx_action(struct xenvif *vif)
>  
[...]
>  	 * was turned on by a zero value in (or lack of)
> diff --git a/include/xen/interface/io/netif.h b/include/xen/interface/io/netif.h
> index d7dd8d7..41674bc 100644
> --- a/include/xen/interface/io/netif.h
> +++ b/include/xen/interface/io/netif.h
> @@ -113,6 +113,7 @@ struct xen_netif_tx_request {
>  #define  XEN_NETIF_EXTRA_FLAG_MORE	(1U<<_XEN_NETIF_EXTRA_FLAG_MORE)
>  
>  /* GSO types */
> +#define XEN_NETIF_GSO_TYPE_NONE         (0)

Indentation.

Wei.

^ permalink raw reply

* Re: [PATCH net-next v3 2/5] xen-netback: add support for IPv6 checksum offload from guest
From: Wei Liu @ 2013-10-11 10:09 UTC (permalink / raw)
  To: Paul Durrant; +Cc: xen-devel, netdev, Wei Liu, David Vrabel, Ian Campbell
In-Reply-To: <1381418733-20470-3-git-send-email-paul.durrant@citrix.com>

On Thu, Oct 10, 2013 at 04:25:30PM +0100, Paul Durrant wrote:
[...]
> -#define PKT_PROT_LEN    (ETH_HLEN + \
> -			 VLAN_HLEN + \
> -			 sizeof(struct iphdr) + MAX_IPOPTLEN + \
> -			 sizeof(struct tcphdr) + MAX_TCP_OPTION_SPACE)
> +#define PKT_PROT_LEN 128

I'm not against changing this, but could you please add comment on why
128 is chosen.

>  
[...]
> +	while ((off <= sizeof(struct ipv6hdr) + ntohs(ipv6h->payload_len)) &&
> +	       !done) {
> +		switch (nexthdr) {
> +		case IPPROTO_FRAGMENT: {
> +			struct frag_hdr *hp = (void *)(skb->data + off);
> +
> +			header_size = skb->network_header +
> +				off +
> +				sizeof(struct frag_hdr);
> +			maybe_pull_tail(skb, header_size);
> +
> +			nexthdr = hp->nexthdr;
> +			off += 8;

Can you use sizeof(frag_hdr) instead of 8?

> +			break;
> +		}
> +		case IPPROTO_DSTOPTS:
> +		case IPPROTO_HOPOPTS:
> +		case IPPROTO_ROUTING: {
> +			struct ipv6_opt_hdr *hp = (void *)(skb->data + off);
> +
> +			header_size = skb->network_header +
> +				off +
> +				sizeof(struct ipv6_opt_hdr);
> +			maybe_pull_tail(skb, header_size);
> +
> +			nexthdr = hp->nexthdr;
> +			off += ipv6_optlen(hp);
> +			break;
> +		}
> +		case IPPROTO_AH: {
> +			struct ip_auth_hdr *hp = (void *)(skb->data + off);
> +
> +			header_size = skb->network_header +
> +				off +
> +				sizeof(struct ip_auth_hdr);
> +			maybe_pull_tail(skb, header_size);
> +
> +			nexthdr = hp->nexthdr;
> +			off += (hp->hdrlen+2)<<2;
> +			break;
> +		}
> +		case IPPROTO_TCP:
> +		case IPPROTO_UDP:
> +			found = true;
> +			/* Fall through */
> +		default:
> +			done = true;
> +			break;

The above 'switch' doesn't seem to cover all IPPROTO_*. Will it cause
the loop to exit too early? In other words, does any IPPROTO_* not
listed above marks the end of parsing?

[...]
>  	unsigned long now = jiffies;
> @@ -1428,12 +1598,7 @@ static int xenvif_tx_submit(struct xenvif *vif, int budget)
>  
>  		xenvif_fill_frags(vif, skb);
>  
> -		/*
> -		 * If the initial fragment was < PKT_PROT_LEN then
> -		 * pull through some bytes from the other fragments to
> -		 * increase the linear region to PKT_PROT_LEN bytes.
> -		 */
> -		if (skb_headlen(skb) < PKT_PROT_LEN && skb_is_nonlinear(skb)) {
> +		if (skb_is_nonlinear(skb) && skb_headlen(skb) < PKT_PROT_LEN) {

This change is not necessary: the comment is useful, and swapping two
operands of && doesn't have any effect.

Wei.

^ permalink raw reply

* [patch] farsync: fix info leak in ioctl
From: Dan Carpenter @ 2013-10-11  9:50 UTC (permalink / raw)
  To: Kevin Curtis; +Cc: Salva Peiró, security, netdev
In-Reply-To: <5257BFBA.7030405@ai2.upv.es>

From: Salva Peiró <speiro@ai2.upv.es>

The fst_get_iface() code fails to initialize the two padding bytes of
struct sync_serial_settings after the ->loopback member. Add an explicit
memset(0) before filling the structure to avoid the info leak.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>

---
 linux-3.4-xm/drivers/net/wan/farsync.c |    1 +
 1 file changed, 1 insertion(+)

diff --git a/linux-3.4-xm/drivers/net/wan/farsync.c b/linux-3.4-xm/drivers/net/wan/farsync.c
index 1a62318..3710427 100644
--- a/drivers/net/wan/farsync.c
+++ b/drivers/net/wan/farsync.c
@@ -1972,6 +1972,7 @@ fst_get_iface(struct fst_card_info *card, struct fst_port_info *port,
 	}
 
 	i = port->index;
+	memset(&sync, 0, sizeof(sync));
 	sync.clock_rate = FST_RDL(card, portConfig[i].lineSpeed);
 	/* Lucky card and linux use same encoding here */
 	sync.clock_type = FST_RDB(card, portConfig[i].internalClock) ==
-- 
1.7.10.4

^ permalink raw reply related

* Re: [PATCH RFC 0/2] xfrm: Remove ancient sleeping code
From: Steffen Klassert @ 2013-10-11  9:21 UTC (permalink / raw)
  To: Fan Du; +Cc: netdev
In-Reply-To: <5257A650.4090408@windriver.com>

On Fri, Oct 11, 2013 at 03:18:40PM +0800, Fan Du wrote:
> >>In xfrm policy queue, XFRM_MAX_QUEUE_LEN is 100, which means 101th skb
> >>will be dropped, how about make it configurable?
> >
> >IMO we would have yet another useless knob then. Currently we send all
> >packets by default to a blackhole as long as the state is not resolved
> >and most people are fine with it. The queueing is mostly to speed up
> >tcp handshakes,
> 
> I cannot follow on this part. Would you please mind to explain how making a
> policy queue will speed up TCP handshakes than orignal CAN_SLEEP mechanism?
> 

I have not said that queueing is any faster than the sleeping mechanism.
But it is faster than the current default that simply sends the first
packets to a blackhole and can be used regardless if we can sleep.

All I wanted to say is that there are not many packets needed to establish
tcp connections, so I think a 100 packets queue is sufficient.

^ permalink raw reply

* Re: [PATCH RFC 00/77] Re-design MSI/MSI-X interrupts enablement pattern
From: Alexander Gordeev @ 2013-10-11  8:41 UTC (permalink / raw)
  To: Mark Lord
  Cc: H. Peter Anvin, Benjamin Herrenschmidt, linux-kernel,
	Bjorn Helgaas, Ralf Baechle, Michael Ellerman, Martin Schwidefsky,
	Ingo Molnar, Tejun Heo, Dan Williams, Andy King, Jon Mason,
	Matt Porter, linux-pci, linux-mips, linuxppc-dev, linux390,
	linux-s390, x86, linux-ide, iss_storagedev, linux-nvme,
	linux-rdma, netdev, e1000-dev
In-Reply-To: <5257357E.8080506@start.ca>

On Thu, Oct 10, 2013 at 07:17:18PM -0400, Mark Lord wrote:
> Just to help us all understand "the loop" issue..
> 
> Here's an example of driver code which uses the existing MSI-X interfaces,
> for a device which can work with either 16, 8, 4, 2, or 1 MSI-X interrupt.
> This is from a new driver I'm working on right now:
> 
> 
> static int xx_alloc_msix_irqs (struct xx_dev *dev, int nvec)
> {
>         xx_disable_all_irqs(dev);
>         do {
>                 if (nvec < 2)
>                         xx_prep_for_1_msix_vector(dev);
>                 else if (nvec < 4)
>                         xx_prep_for_2_msix_vectors(dev);
>                 else if (nvec < 8)
>                         xx_prep_for_4_msix_vectors(dev);
>                 else if (nvec < 16)
>                         xx_prep_for_8_msix_vectors(dev);
>                 else
>                         xx_prep_for_16_msix_vectors(dev);
>                 nvec = pci_enable_msix(dev->pdev, dev->irqs, dev->num_vectors);
>         } while (nvec > 0);
> 
>         if (nvec) {
>                 kerr(dev->name, "pci_enable_msix() failed, err=%d", nvec);
>                 dev->num_vectors = 0;
>                 return nvec;
>         }
>         return 0;       /* success */
> }

Yeah, that is a very good example.

I would move all xx_prep_for_<pow2>_msix_vector() functions to a single
helper i.e. xx_prep_msix_vectors(dev, ndev).

Considering also (a) we do not want to waste unused platform resources
associated with MSI-Xs and pull more quota than needed and (b) fixing
couple of bugs, this function could look like this:

static int xx_alloc_msix_irqs(struct xx_dev *dev, int nvec_in)
{
	int nvec = roundup_pow_of_two(nvec_in);	/* assume 0 > nvec_in <= 16 */
	int rc;

	xx_disable_all_irqs(dev);

retry:
       	xx_prep_for_msix_vectors(dev, nvec);

	rc = pci_enable_msix(dev->pdev, dev->irqs, nvec);	/* (b) */
	if (rc > 0) {
		nvec = rounddown_pow_of_two(nvec);		/* (a) */
		goto retry;
	}

	if (rc) {
		kerr(dev->name, "pci_enable_msix() failed, err=%d", rc);
		dev->num_vectors = 0;
		return rc;
	}

	dev->num_vectors = nvec;				/* (b) */
	return 0;       /* success */
}

Now, this is a loop-free alternative:

static int xx_alloc_msix_irqs(struct xx_dev *dev, int nvec)
{
	nvec = roundup_pow_of_two(nvec);	/* assume 0 > nvec <= 16 */

	xx_disable_all_irqs(dev);

	pci_lock_msi(dev->pdev);

	rc = pci_get_msix_limit(dev->pdev, nvec);
	if (rc < 0)
		goto err;

	nvec = min(nvec, rc);		/* if limit is more than requested */
	nvec = rounddown_pow_of_two(nvec);	/* (a) */

	xx_prep_for_msix_vectors(dev, nvec);

	rc = pci_enable_msix(dev->pdev, dev->irqs, nvec);	/* (b)	*/
	if (rc < 0)
		goto err;

	pci_unlock_msi(dev->pdev);

	dev->num_vectors = nvec;		/* (b) */
	return 0;

err:
	pci_unlock_msi(dev->pdev);

        kerr(dev->name, "pci_enable_msix() failed, err=%d", rc);
        dev->num_vectors = 0;
        return rc;
}

-- 
Regards,
Alexander Gordeev
agordeev@redhat.com

^ permalink raw reply

* Re: [PATCH net 0/4] bridge: Fix problems around the PVID
From: Toshiaki Makita @ 2013-10-11  7:34 UTC (permalink / raw)
  To: vyasevic
  Cc: Toshiaki Makita, David Miller, netdev, Fernando Luis Vazquez Cao,
	Patrick McHardy
In-Reply-To: <52556FCB.70904@redhat.com>

On Wed, 2013-10-09 at 11:01 -0400, Vlad Yasevich wrote:
> On 10/01/2013 07:56 AM, Toshiaki Makita wrote:
> > On Mon, 2013-09-30 at 12:01 -0400, Vlad Yasevich wrote:
> >> On 09/30/2013 07:46 AM, Toshiaki Makita wrote:
> >>> On Fri, 2013-09-27 at 14:10 -0400, Vlad Yasevich wrote:
> >>>> On 09/27/2013 01:11 PM, Toshiaki Makita wrote:
> >>>>> On Thu, 2013-09-26 at 10:22 -0400, Vlad Yasevich wrote:
> >>>>>> On 09/26/2013 06:38 AM, Toshiaki Makita wrote:
> >>>>>>> On Tue, 2013-09-24 at 13:55 -0400, Vlad Yasevich wrote:
> >>>>>>>> On 09/24/2013 01:30 PM, Toshiaki Makita wrote:
> >>>>>>>>> On Tue, 2013-09-24 at 09:35 -0400, Vlad Yasevich wrote:
> >>>>>>>>>> On 09/24/2013 07:45 AM, Toshiaki Makita wrote:
> >>>>>>>>>>> On Mon, 2013-09-23 at 10:41 -0400, Vlad Yasevich
> >>>>>>>>>>> wrote:
> >>>>>>>>>>>> On 09/17/2013 04:12 AM, Toshiaki Makita wrote:
> >>>>>>>>>>>>> On Mon, 2013-09-16 at 13:49 -0400, Vlad Yasevich
> >>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>> On 09/13/2013 08:06 AM, Toshiaki Makita wrote:
> >>>>>>>>>>>>>>> On Thu, 2013-09-12 at 16:00 -0400, David
> >>>>>>>>>>>>>>> Miller wrote:
> >>>>>>>>>>>>>>>> From: Toshiaki Makita
> >>>>>>>>>>>>>>>> <makita.toshiaki@lab.ntt.co.jp> Date: Tue,
> >>>>>>>>>>>>>>>> 10 Sep 2013 19:27:54 +0900
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> There seem to be some undesirable
> >>>>>>>>>>>>>>>>> behaviors related with PVID. 1. It has no
> >>>>>>>>>>>>>>>>> effect assigning PVID to a port. PVID
> >>>>>>>>>>>>>>>>> cannot be applied to any frame regardless
> >>>>>>>>>>>>>>>>> of whether we set it or not. 2. FDB
> >>>>>>>>>>>>>>>>> entries learned via frames applied PVID
> >>>>>>>>>>>>>>>>> are registered with VID 0 rather than VID
> >>>>>>>>>>>>>>>>> value of PVID. 3. We can set 0 or 4095 as
> >>>>>>>>>>>>>>>>> a PVID that are not allowed in IEEE
> >>>>>>>>>>>>>>>>> 802.1Q. This leads interoperational
> >>>>>>>>>>>>>>>>> problems such as sending frames with VID
> >>>>>>>>>>>>>>>>> 4095, which is not allowed in IEEE
> >>>>>>>>>>>>>>>>> 802.1Q, and treating frames with VID 0 as
> >>>>>>>>>>>>>>>>> they belong to VLAN 0, which is expected
> >>>>>>>>>>>>>>>>> to be handled as they have no VID
> >>>>>>>>>>>>>>>>> according to IEEE 802.1Q.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Note: 2nd and 3rd problems are potential
> >>>>>>>>>>>>>>>>> and not exposed unless 1st problem is
> >>>>>>>>>>>>>>>>> fixed, because we cannot activate PVID
> >>>>>>>>>>>>>>>>> due to it.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Please work out the issues in patch #2 with
> >>>>>>>>>>>>>>>> Vlad and resubmit this series.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Thank you.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> I'm hovering between whether we should fix
> >>>>>>>>>>>>>>> the issue by changing vlan 0 interface
> >>>>>>>>>>>>>>> behavior in 8021q module or enabling a bridge
> >>>>>>>>>>>>>>> port to sending priority-tagged frames, or
> >>>>>>>>>>>>>>> another better way.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> If you could comment it, I'd appreciate it
> >>>>>>>>>>>>>>> :)
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> BTW, I think what is discussed in patch #2 is
> >>>>>>>>>>>>>>> another problem about handling priority-tags,
> >>>>>>>>>>>>>>> and it exists without this patch set
> >>>>>>>>>>>>>>> applied. It looks like that we should prepare
> >>>>>>>>>>>>>>> another patch set than this to fix that
> >>>>>>>>>>>>>>> problem.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Should I include patches that fix the
> >>>>>>>>>>>>>>> priority-tags problem in this patch set and
> >>>>>>>>>>>>>>> resubmit them all together?
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> I am thinking that we might need to do it in
> >>>>>>>>>>>>>> bridge and it looks like the simplest way to do
> >>>>>>>>>>>>>> it is to have default priority regeneration
> >>>>>>>>>>>>>> table (table 6-5 from 802.1Q doc).
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> That way I think we would conform to the spec.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> -vlad
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Unfortunately I don't think the default priority
> >>>>>>>>>>>>> regeneration table resolves the problem because
> >>>>>>>>>>>>> IEEE 802.1Q says that a VLAN-aware bridge can
> >>>>>>>>>>>>> transmit untagged or VLAN-tagged frames only (the
> >>>>>>>>>>>>> end of section 7.5 and 8.1.7).
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> No mechanism to send priority-tagged frames is
> >>>>>>>>>>>>> found as far as I can see the standard. I think
> >>>>>>>>>>>>> the regenerated priority is used for outgoing
> >>>>>>>>>>>>> PCP field only if egress policy is not untagged
> >>>>>>>>>>>>> (i.e. transmitting as VLAN-tagged), and unused if
> >>>>>>>>>>>>> untagged (Section 6.9.2 3rd/4th Paragraph).
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> If we want to transmit priority-tagged frames
> >>>>>>>>>>>>> from a bridge port, I think we need to implement
> >>>>>>>>>>>>> a new (optional) feature that is above the
> >>>>>>>>>>>>> standard, as I stated previously.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> How do you feel about adding a per-port policy
> >>>>>>>>>>>>> that enables a bridge to send priority-tagged
> >>>>>>>>>>>>> frames instead of untagged frames when egress
> >>>>>>>>>>>>> policy for the port is untagged? With this
> >>>>>>>>>>>>> change, we can transmit frames for a given vlan
> >>>>>>>>>>>>> as either all untagged, all priority-tagged or
> >>>>>>>>>>>>> all VLAN-tagged.
> >>>>>>>>>>>>
> >>>>>>>>>>>> That would work.  What I am thinking is that we do
> >>>>>>>>>>>> it by special casing the vid 0 egress policy
> >>>>>>>>>>>> specification.  Let it be untagged by default and
> >>>>>>>>>>>> if it is tagged, then we preserve the priority
> >>>>>>>>>>>> field and forward it on.
> >>>>>>>>>>>>
> >>>>>>>>>>>> This keeps the API stable and doesn't require
> >>>>>>>>>>>> user/admin from knowing exactly what happens.
> >>>>>>>>>>>> Default operation conforms to the spec and allows
> >>>>>>>>>>>> simple change to make it backward-compatible.
> >>>>>>>>>>>>
> >>>>>>>>>>>> What do you think.  I've done a simple prototype of
> >>>>>>>>>>>> this an it seems to work with the VMs I am testing
> >>>>>>>>>>>> with.
> >>>>>>>>>>>
> >>>>>>>>>>> Are you saying that - by default, set the 0th bit of
> >>>>>>>>>>> untagged_bitmap; and - if we unset the 0th bit and
> >>>>>>>>>>> set the "vid"th bit, we transmit frames classified as
> >>>>>>>>>>> belonging to VLAN "vid" as priority-tagged?
> >>>>>>>>>>>
> >>>>>>>>>>> If so, though it's attractive to keep current API,
> >>>>>>>>>>> I'm worried about if it could be a bit confusing and
> >>>>>>>>>>> not intuitive for kernel/iproute2 developers that VID
> >>>>>>>>>>> 0 has a special meaning only in the egress policy.
> >>>>>>>>>>> Wouldn't it be better to adding a new member to
> >>>>>>>>>>> struct net_port_vlans instead of using VID 0 of
> >>>>>>>>>>> untagged_bitmap?
> >>>>>>>>>>>
> >>>>>>>>>>> Or are you saying that we use a new flag in struct
> >>>>>>>>>>> net_port_vlans but use the BRIDGE_VLAN_INFO_UNTAGGED
> >>>>>>>>>>> bit with VID 0 in netlink to set the flag?
> >>>>>>>>>>>
> >>>>>>>>>>> Even in that case, I'm afraid that it might be
> >>>>>>>>>>> confusing for developers for the same reason. We are
> >>>>>>>>>>> going to prohibit to specify VID with 0 (and 4095) in
> >>>>>>>>>>> adding/deleting a FDB entry or a vlan filtering
> >>>>>>>>>>> entry, but it would allow us to use VID 0 only when a
> >>>>>>>>>>> vlan filtering entry is configured. I am thinking a
> >>>>>>>>>>> new nlattr is a straightforward approach to
> >>>>>>>>>>> configure it.
> >>>>>>>>>>
> >>>>>>>>>> By making this an explicit attribute it makes vid 0 a
> >>>>>>>>>> special case for any automatic tool that would
> >>>>>>>>>> provision such filtering.  Seeing vid 0 would mean that
> >>>>>>>>>> these tools would have to know that this would have to
> >>>>>>>>>> be translated to a different attribute instead of
> >>>>>>>>>> setting the policy values.
> >>>>>>>>>
> >>>>>>>>> Yes, I agree with you that we can do it by the way you
> >>>>>>>>> explained. What I don't understand is the advantage of
> >>>>>>>>> using vid 0 over another way such as adding a new
> >>>>>>>>> nlattr. I think we can indicate transmitting
> >>>>>>>>> priority-tags explicitly by such a nlattr. Using vid 0
> >>>>>>>>> seems to be easier to implement than a new nlattr, but,
> >>>>>>>>> for me, it looks less intuitive and more difficult to
> >>>>>>>>> maintain because we have to care about vid 0 instead of
> >>>>>>>>> simply ignoring it.
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>> The point I am trying to make is that regardless of the
> >>>>>>>> approach someone has to know what to do when enabling
> >>>>>>>> priority tagged frames.  You proposal would require the
> >>>>>>>> administrator or config tool to have that knowledge.
> >>>>>>>> Example is: Admin does: bridge vlan set priority on dev
> >>>>>>>> eth0 Automated app: if (vid == 0) /* Turn on priority
> >>>>>>>> tagged frame support */
> >>>>>>>>
> >>>>>>>> My proposal would require the bridge filtering
> >>>>>>>> implementation to have it. user tool: bridge vlan add vid 0
> >>>>>>>> tagged Automated app:  No special case.
> >>>>>>>>
> >>>>>>>> IMO its better to have 1 piece code handling the special
> >>>>>>>> case then putting it multiple places.
> >>>>>>>
> >>>>>>> Thank you for the detailed explanation. Now I understand your
> >>>>>>> intention.
> >>>>>>>
> >>>>>>> I have one question about your proposal. I guess the way to
> >>>>>>> enable priority-tagged is something like bridge vlan add vid
> >>>>>>> 10 dev eth0 bridge vlan add vid 10 dev vnet0 pvid untagged
> >>>>>>> bridge vlan add vid 0 dev vnet0 tagged where vnet0 has sub
> >>>>>>> interface vnet0.0.
> >>>>>>>
> >>>>>>> Here the admin have to know the egress policy is applied to a
> >>>>>>> frame twice in a certain order when it is transmitted from
> >>>>>>> the port vnet0 attached, that is, first, a frame with vid 10
> >>>>>>> get untagged, and then, an untagged frame get
> >>>>>>> priority-tagged.
> >>>>>>>
> >>>>>>> This behavior looks difficult to know without previous
> >>>>>>> knowledge. Any good idea to avoid such a need for the admin's
> >>>>>>> additional knowledge?
> >>>>>>
> >>>>>> To me, the fact that there is vnet0.0 (or typically, there is
> >>>>>> eth0.0 in the guest or on the remote host) already tells the
> >>>>>> admin vlan 0 has to be tagged.  The fact that we codify this in
> >>>>>> the policy makes it explicit.
> >>>>>
> >>>>> My worry is that the admin might not be able to guess how to use
> >>>>> bridge commands to enable priority-tag without any additional
> >>>>> hint in "man bridge", "bridge vlan help", etc. I actually
> >>>>> couldn't hit upon such a usage before seeing example commands you
> >>>>> gave, because I had never think the egress policy could be
> >>>>> applied twice.
> >>>>>
> >>>>>>
> >>>>>> However, I can see strong argument to be made for an addition
> >>>>>> egress policy attribute that could be for instance:
> >>>>>>
> >>>>>> bridge vlan add vid 10 dev eth0 pvid bridge vlan add vid 10 dev
> >>>>>> vnet0 pvid untagged prio_tag
> >>>>>>
> >>>>>> But this has the same connotations as wrt to egress policy.
> >>>>>> The 2 policies are applied: (1) untag the frame. (2) add
> >>>>>> priority_tag.
> >>>>>>
> >>>>>> (2) only happens if initial fame received on eth0 was priority
> >>>>>> tagged.
> >>>>>
> >>>>> If we do so, we will not be able to communicate using vlan 0
> >>>>> interface under a certain circumstance. Eth0 can be receive mixed
> >>>>> untagged and priority-tagged frames according to the network
> >>>>> element it is connected to: for example, Open vSwitch can send
> >>>>> such two kinds of frames from the same port even if original
> >>>>> incoming frames belong to the same vlan.
> >>>>
> >>>> Which priority would you assign to the frame that was received
> >>>> untagged?
> >>>
> >>> Untagged frame's priority is by default 0, so I think 0 is likely.
> >>>
> >>> 802.1Q 6.9.1 i) The received priority value and the drop_eligible
> >>> parameter value are the values in the M_UNITDATA.indication.
> >>>
> >>> M_UNITDATA is passed from ISS.
> >>>
> >>> 802.1Q 6.7.1 The priority parameter provided in a data indication
> >>> primitive shall take the value of the Default User Priority parameter
> >>> for the Port through which the MAC frame was received. The default
> >>> value of this parameter is 0, it may be set by management in which
> >>> case the capability to set it to any of the values 0 through 7 shall
> >>> be provided.
> >>>
> >>>>
> >>>>> In this situation, we can only receive frames that is
> >>>>> priority-tagged when received on eth0.
> >>>>
> >>>> Not sure I understand.  Let's look at this config: bridge vlan add
> >>>> vid 10 dev eth0 pvid bridge vlan add vid 10 dev vnet0 pvid untagged
> >>>> prio_tag
> >>>>
> >>>> Here, eth0 is allowed to receive vid 10 tagged, untagged, and
> >>>> prio_tagged (if we look at the patch 2 from this set). Now, frame
> >>>> is forwarded to vnet0. 1) if the frame had vid 10 in the tag or was
> >>>> untagged, it should probably be sent untagged. 2) if the frame had
> >>>> a priority tag, it should probably be sent as such.
> >>>>
> >>>> Now, I think a case could be made that if the frame had any
> >>>> priority markings in the vlan header, we should try to preserve
> >>>> those markings if prio_tag is turned on.  We can assume value of 0
> >>>> means not set.
> >>>
> >>> If we don't insert prio_tag when PCP is 0, we might receive mixed
> >>> priority-tagged and untagged frames on eth0.
> >>
> >> Right, and that's what you were trying to handle in your patch:
> >>
> >>> +		/* PVID is set on this port.  Any untagged or priority-tagged +
> >>> * ingress frame is considered to belong to this vlan. */
> >>
> >> So, in this case we are prepared to handle the "mixed" scenario on ingress.
> >>
> >>> Even if we are sending frames from eth0.0 with some priority other
> >>> than 0, we could receive frames with priority 0 or untagged on the
> >>> other side of the bridge.
> >>> For example, if we receive untagged arp reply on the bridge port, we
> >>> migit not be able to communicate with such an end station, because
> >>> untagged reply will not be passed to eth0.0.
> >>
> >> So the ARP request was sent tagged, but the reply came back untagged?
> >
> > Yes, it can happen.
> > These are problematic cases.
> >
> > Example 1:
> >              prio_tagged         prio_tagged
> > +------------+ ---> +------------+ ---> +----------+
> > |guest eth0.0|------|host1 Bridge|------|host2 eth0|
> > +------------+ <--- +------------+ <--- +----------+
> >               untagged            untagged
> >
> > Note: Host2 eth0, which is an interface on Linux, can receive
> > priority-tagged frames if it doesn't have vlan 0 interface (eth0.0).
> 
> Hmm..  Just to see if this works, I ran the this scenario with
> a dumb switch in the middle, and it did not work as you noted.
> I then realized that one of the kernels was rather old and after 
> updating it, behaved differently.  The communication still didn't
> work, but host2 behaved properly.
> 
> >>
> >> How does that work when the end station is attached directly to the
> >> HW switch instead of a linux bridge?
> >>
> >> The station configures eth0.0 and sends priority-tagged traffic to
> >> the HW switch.  If the HW switch sends back untagged traffic, then
> >> the untagged traffic will never reach eth0.0.
> >
> > Currently we cannot communicate using eth0.0 via directly connected
> > 802.1Q conformed switch, because we never receive priority-tagged frames
> > from the switch.
> > It is not a problem of Linux bridge and is why I wondered whether it
> > should be fixed by bridge or vlan 0 interface.
> >
> >>
> >>>
> >>>>
> >>>>> IMO, if prio_tag is configured, the bridge should send any
> >>>>> untagged frame as priority-tagged regardless of whatever it is on
> >>>>> eth0.
> >>>>
> >>>> Which priority would you use, 0?  You are not guaranteed to
> >>>> properly deliver the traffic then for a configuration such as: VM:
> >>>> eth0: 10.0.0.1/24 eth0.0: 10.0.1.1/24
> >>>
> >>> I'd like to use priority 0 for untagged frames.
> >>>
> >>> I am assuming that one of our goals is at least that eth0.0 comes to
> >>> be able to communicate with another end station. It seems to be hard
> >>> to use both eth0 and eth0.0 simultaneously.
> >>
> >> I understand, but I don't agree that we should always tag.
> >>
> >> Consider config:
> >>
> >>      hw switch <---> (eth0: Linux Bridge: eth1) <--- (em1.0:end station)
> >>
> >> If the end station sends priority-tagged traffic it should receive
> >> priority tagged traffic back.  Otherwise, untagged traffic may be
> >> dropped by the end station.  This is true whether it is connected to
> >> the hw switch or Linux bridge.
> >
> > Though such a behavior is generally not necessary as far as I can read
> > 802.1Q spec, it is essential for vlan 0 interface on Linux, I think.
> > My proposal aims to resolve it at least when we use Linux bridge.
> >
> > Example configuration:
> > 	bridge vlan add vid 10 dev eth1 pvid untagged
> > 	bridge vlan add vid 10 dev eth0
> > 	bridge vlan set prio_tag on dev eth1
> >
> > Intended behavior:
> >
> >          VID10-tagged                     prio_tagged
> > +---------+ <--- +------------------------+ <--- +-----------------+
> > |hw switch|------|eth0: Linux Bridge: eth1|------|em1.0:end station|
> > +---------+ ---> +------------------------+ ---> +-----------------+
> >          VID10-tagged                     prio_tagged
> >                                (always if egress policy untagged)
> 
> Ok, I think you've convinced me that this is the right approach. The 
> only thing that I am not crazy about is the API.  I'd almost want to
> introduce a new flag that can be set in a 'vlan add' or 'vlan set'
> command that communicates a new policy.

I'm glad that we reached a consensus on the approach :)

I agree with you that the API is flag based.
I'm guessing your intention is that 'vlan add' means a per vlan per port
policy and 'vlan set' means a per port one, that is,
	'vlan add': bridge vlan add vid 10 dev eth1 prio_tag
	'vlan set': bridge vlan set prio_tag on dev eth1

I think they can behave differently only when we set untagged to
multiple vlans on the same port.

'vlan add' example with vid 10 and 20:
	bridge vlan add vid 10 dev eth1 pvid untagged prio_tag
	bridge vlan add vid 10 dev eth0
	bridge vlan add vid 20 dev eth1 untagged
	bridge vlan add vid 20 dev eth2

         VID10-tagged                  prio_tagged (from eth0)
+---------+ ---> +------------------------+ ---> +-----------------+
|hw switch|------|eth0                eth1|------|em1.0:end station|
+---------+      |      Linux Bridge      | ---> +-----------------+
+---------+      |                        | *untagged*
|hw switch|------|eth2                    | (from eth2)
+---------+ ---> +------------------------+
         VID20-tagged


'vlan set' example with vid 10 and 20:
	bridge vlan add vid 10 dev eth1 pvid untagged
	bridge vlan add vid 10 dev eth0
	bridge vlan add vid 20 dev eth1 untagged
	bridge vlan add vid 20 dev eth2
	bridge vlan set prio_tag on dev eth1

         VID10-tagged                  prio_tagged (from eth0)
+---------+ ---> +------------------------+ ---> +-----------------+
|hw switch|------|eth0                eth1|------|em1.0:end station|
+---------+      |      Linux Bridge      | ---> +-----------------+
+---------+      |                        | prio_tagged
|hw switch|------|eth2                    | (from eth2)
+---------+ ---> +------------------------+
         VID20-tagged

Em1.0 can always receive traffic from eth1 if we adopt 'vlan set'.
However, I cannot imagine when multiple untagged vlans is required, so
cannot figure out whether 'vlan add' is useful or harmful.
Anyway, both of approaches are OK with me.

Thanks,
Toshiaki Makita

> 
> Thanks
> -vlad
> 
> >
> > Thanks,
> >
> > Toshiaki Makita
> >
> >>
> >> -vlad
> >>
> >>>
> >>> Thanks,
> >>>
> >>> Toshiaki Makita
> >>>
> >>>>
> >>>> -vlad
> >>>>
> >>>>>
> >>>>>>
> >>>>>> I think I am ok with either approach.  Explicit vid 0 policy is
> >>>>>> easier for automatic provisioning.   The flag based one is
> >>>>>> easier for admin/ manual provisioning.
> >>>>>
> >>>>> Supposing we have to add something to help or man in any case, I
> >>>>> think flag based is better. The format below seems to suitable
> >>>>> for a per-port policy. bridge vlan set prio_tag on dev vnet0
> >>>>>
> >>>>> Thanks,
> >>>>>
> >>>>> Toshiaki Makita
> >>>>>
> >>>>>>
> >>>>>> -vlad.
> >>>>>>
> >>>>>> -vlad
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>>
> >>>>>>>>
> >>>>>>>> Thanks -vlad
> >>>>>>>>
> >>>>>>>>> Thanks,
> >>>>>>>>>
> >>>>>>>>> Toshiaki Makita
> >>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> How it is implemented internally in the kernel isn't as
> >>>>>>>>>> big of an issue. We can do it as a separate flag or as
> >>>>>>>>>> part of existing policy.
> >>>>>>>>>>
> >>>>>>>>>> -vlad
> >>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> Thanks,
> >>>>>>>>>>>
> >>>>>>>>>>> Toshiaki Makita
> >>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> -vlad
> >>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Toshiaki Makita
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Toshiaki Makita
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> -- To unsubscribe from this list: send the
> >>>>>>>>>>>>>>>> line "unsubscribe netdev" in the body of a
> >>>>>>>>>>>>>>>> message to majordomo@vger.kernel.org More
> >>>>>>>>>>>>>>>> majordomo info at
> >>>>>>>>>>>>>>>> http://vger.kernel.org/majordomo-info.html
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> -- To unsubscribe from this list: send the line
> >>>>>>>>>>>>> "unsubscribe netdev" in the body of a message to
> >>>>>>>>>>>>> majordomo@vger.kernel.org More majordomo info at
> >>>>>>>>>>>>> http://vger.kernel.org/majordomo-info.html
> >>>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>>
> >>>
> >>>
> >
> >
> > --
> > To unsubscribe from this list: send the line "unsubscribe netdev" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >
> 
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH RFC 0/2] xfrm: Remove ancient sleeping code
From: Fan Du @ 2013-10-11  7:18 UTC (permalink / raw)
  To: Steffen Klassert; +Cc: netdev
In-Reply-To: <20131010085703.GR7660@secunet.com>

hehe, I didn't object this cleanup except learning from it :)

On 2013年10月10日 16:57, Steffen Klassert wrote:
> On Thu, Oct 10, 2013 at 03:02:14PM +0800, Fan Du wrote:
>>
>>
>> On 2013年10月10日 14:33, Steffen Klassert wrote:
>>> Does anyone still rely on the ancient sleeping when the SA is in
>>> acquire state? It is disabled by default since more that five years,
>>> but can cause indefinite task hangs if enabled and the needed state
>>> does not get resolved.
>>
>> I saw that "can_sleep" is set true in ip_route_connect which upper layer
>> protocol relies on it, which ensure not dropping *any* skb.
>
> 'Any' means one per task in this context. Also, we can't ensure that
> this packet reaches it's destination. So where is the difference
> between dropping the packet locally or on the network?
>
>> And acquire timer will make sure the task will not hangs indefinitely.
>>
>
> Did you try that? It makes sure that the task wakes up from time to time,
> but it goes immediately back to sleep if the needed state is not resolved.
> The only terminating contition is when the task gets a signal to exit.
>
>> In xfrm policy queue, XFRM_MAX_QUEUE_LEN is 100, which means 101th skb
>> will be dropped, how about make it configurable?
>
> IMO we would have yet another useless knob then. Currently we send all
> packets by default to a blackhole as long as the state is not resolved
> and most people are fine with it. The queueing is mostly to speed up
> tcp handshakes,

I cannot follow on this part. Would you please mind to explain how making a
policy queue will speed up TCP handshakes than orignal CAN_SLEEP mechanism?


Thanks

-- 
浮沉随浪只记今朝笑

--fan

^ permalink raw reply

* [PATCHv2 2/2] Don't compute checksum value for SCTP skb with, CHECKSUM_PARTIAL set
From: Fan Du @ 2013-10-11  7:08 UTC (permalink / raw)
  To: Neil Horman, vyasevich, steffen.klassert, davem; +Cc: netdev
In-Reply-To: <20131010131116.GA16147@hmsreliant.think-freely.org>



On 2013年10月10日 21:11, Neil Horman wrote:
> On Thu, Oct 10, 2013 at 01:51:36PM +0800, Fan Du wrote:
>> igb/ixgbe have hardware sctp checksum support, when this feature is enabled
>> and also IPsec is armed to protect sctp traffic, ugly things happened as
>> xfrm_output checks CHECKSUM_PARTIAL to do check sum operation(sum every thing
>> up and pack the 16bits result in the checksum field). The result is fail
>> establishment of sctp communication.
>>
> Shouldn't this be fixed in the xfrm code then?  E.g. check the device features
> for SCTP checksum offloading and and skip the checksum during xfrm output if its
> available?
>
> Or am I missing something?
> Neil
>
>


 From cff25810910603ff991f0c56441d6de8dc33a822 Mon Sep 17 00:00:00 2001
From: Fan Du <fan.du@windriver.com>
Date: Fri, 11 Oct 2013 14:31:57 +0800
Subject: [PATCH 2/2] Don't compute checksum value for SCTP skb with
  CHECKSUM_PARTIAL set
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

IPsec is not enabled in this scenario:

SCTP skb set CHECKSUM_PARTIAL to indicate hardware is capable of doing
SCTP checksum(crc32-c) scoping the whole SCTP packet range. However when
such kind of skb is delivered through IPv4/v6 output handler, IPv4/v6 stack
interpret CHECKSUM_PARTIAL by calling skb_checksum_help to compute 16bits
checksum value by summing everything up, the whole SCTP packet in software
manner！ After this skb reach NIC, after hardware doing its SCTP checking
business, a crc32-c value will overwrite the value IPv4/v6 stack computed
before.

This patch solves this by introducing skb_is_sctpv4/6 to optimize such case.

Signed-off-by: Fan Du <fan.du@windriver.com>
---
v2:
    Rework this problem by introducing skb_is_scktv4/6

---
  include/linux/ip.h     |    5 +++++
  include/linux/ipv6.h   |    6 ++++++
  include/linux/skbuff.h |    1 -
  net/core/skbuff.c      |    1 +
  net/ipv4/ip_output.c   |    4 +++-
  net/ipv6/ip6_output.c  |    1 +
  net/xfrm/xfrm_output.c |   20 +++++++++++++++-----
  7 files changed, 31 insertions(+), 7 deletions(-)

diff --git a/include/linux/ip.h b/include/linux/ip.h
index 492bc65..f556292 100644
--- a/include/linux/ip.h
+++ b/include/linux/ip.h
@@ -19,6 +19,7 @@

  #include <linux/skbuff.h>
  #include <uapi/linux/ip.h>
+#include <uapi/linux/in.h>

  static inline struct iphdr *ip_hdr(const struct sk_buff *skb)
  {
@@ -34,4 +35,8 @@ static inline struct iphdr *ipip_hdr(const struct sk_buff *skb)
  {
  	return (struct iphdr *)skb_transport_header(skb);
  }
+static inline int skb_is_sctpv4(const struct sk_buff *skb)
+{
+	return ip_hdr(skb)->protocol == IPPROTO_SCTP;
+}
  #endif	/* _LINUX_IP_H */
diff --git a/include/linux/ipv6.h b/include/linux/ipv6.h
index 28ea384..6e17c04 100644
--- a/include/linux/ipv6.h
+++ b/include/linux/ipv6.h
@@ -2,6 +2,7 @@
  #define _IPV6_H

  #include <uapi/linux/ipv6.h>
+#include <uapi/linux/in.h>

  #define ipv6_optlen(p)  (((p)->hdrlen+1) << 3)
  /*
@@ -387,4 +388,9 @@ static inline struct raw6_sock *raw6_sk(const struct sock *sk)
  	  ((__sk)->sk_bound_dev_if == (__dif)))				&& \
  	 net_eq(sock_net(__sk), (__net)))

+static inline int skb_is_sctpv6(const struct sk_buff *skb)
+{
+	return ipv6_hdr(skb)->nexthdr == IPPROTO_SCTP;
+}
+
  #endif /* _IPV6_H */
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 2ddb48d..b36d0cc 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -2393,7 +2393,6 @@ extern void	       skb_split(struct sk_buff *skb,
  extern int	       skb_shift(struct sk_buff *tgt, struct sk_buff *skb,
  				 int shiftlen);
  extern void	       skb_scrub_packet(struct sk_buff *skb, bool xnet);
-
  extern struct sk_buff *skb_segment(struct sk_buff *skb,
  				   netdev_features_t features);

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index d81cff1..54d6172 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -3526,3 +3526,4 @@ void skb_scrub_packet(struct sk_buff *skb, bool xnet)
  	nf_reset_trace(skb);
  }
  EXPORT_SYMBOL_GPL(skb_scrub_packet);
+
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index a04d872..8676b2c 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -587,7 +587,9 @@ slow_path_clean:

  slow_path:
  	/* for offloaded checksums cleanup checksum before fragmentation */
-	if ((skb->ip_summed == CHECKSUM_PARTIAL) && skb_checksum_help(skb))
+	if ((skb->ip_summed == CHECKSUM_PARTIAL) &&
+	    !skb_is_sctpv4(skb) &&
+	    skb_checksum_help(skb))
  		goto fail;
  	iph = ip_hdr(skb);

diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index 3a692d5..9b27d95 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -671,6 +671,7 @@ slow_path_clean:

  slow_path:
  	if ((skb->ip_summed == CHECKSUM_PARTIAL) &&
+	    !skb_is_sctpv6(skb) &&
  	    skb_checksum_help(skb))
  		goto fail;

diff --git a/net/xfrm/xfrm_output.c b/net/xfrm/xfrm_output.c
index 3bb2cdc..ddef94a 100644
--- a/net/xfrm/xfrm_output.c
+++ b/net/xfrm/xfrm_output.c
@@ -180,6 +180,14 @@ static int xfrm_output_gso(struct sk_buff *skb)
  	return 0;
  }

+static int skb_is_sctp(const struct sk_buff *skb)
+{
+	if (skb->protocol == __constant_htons(ETH_P_IP))
+		return skb_is_sctpv4(skb);
+	else
+		return skb_is_sctpv6(skb);
+}
+
  int xfrm_output(struct sk_buff *skb)
  {
  	struct net *net = dev_net(skb_dst(skb)->dev);
@@ -189,11 +197,13 @@ int xfrm_output(struct sk_buff *skb)
  		return xfrm_output_gso(skb);

  	if (skb->ip_summed == CHECKSUM_PARTIAL) {
-		err = skb_checksum_help(skb);
-		if (err) {
-			XFRM_INC_STATS(net, LINUX_MIB_XFRMOUTERROR);
-			kfree_skb(skb);
-			return err;
+		if (!skb_is_sctp(skb)) {
+			err = skb_checksum_help(skb);
+			if (err) {
+				XFRM_INC_STATS(net, LINUX_MIB_XFRMOUTERROR);
+				kfree_skb(skb);
+				return err;
+			}
  		}
  	}

-- 
1.7.9.5






-- 
浮沉随浪只记今朝笑

--fan

^ permalink raw reply related

* [PATCHv2 1/2 ] {xfrm, sctp} Stick to software crc32 even if hardware is capable of that
From: Fan Du @ 2013-10-11  7:05 UTC (permalink / raw)
  To: Neil Horman, vyasevich, steffen.klassert, davem; +Cc: netdev
In-Reply-To: <20131010131116.GA16147@hmsreliant.think-freely.org>



On 2013年10月10日 21:11, Neil Horman wrote:
> On Thu, Oct 10, 2013 at 01:51:36PM +0800, Fan Du wrote:
>> igb/ixgbe have hardware sctp checksum support, when this feature is enabled
>> and also IPsec is armed to protect sctp traffic, ugly things happened as
>> xfrm_output checks CHECKSUM_PARTIAL to do check sum operation(sum every thing
>> up and pack the 16bits result in the checksum field). The result is fail
>> establishment of sctp communication.
>>
> Shouldn't this be fixed in the xfrm code then?  E.g. check the device features
> for SCTP checksum offloading and and skip the checksum during xfrm output if its
> available?
>
> Or am I missing something?
> Neil
>
>


 From 014276de0877f11d46e1704114a7d91f19221a63 Mon Sep 17 00:00:00 2001
From: Fan Du <fan.du@windriver.com>
Date: Fri, 11 Oct 2013 14:24:33 +0800
Subject: [PATCH 1/2] {xfrm, sctp} Stick to software crc32 even if hardware is
  capable of that

igb/ixgbe have hardware sctp checksum support, when this feature is enabled
and also IPsec is armed to protect sctp traffic, ugly things happened as
xfrm_output checks CHECKSUM_PARTIAL to do check sum operation(sum every thing
up and pack the 16bits result in the checksum field). The result is fail
establishment of sctp communication.

Signed-off-by: Fan Du <fan.du@windriver.com>
---
v2:
   Leave ip_summed as CHECKSUM_PARTIAL as before, the second patch will fix this.

---
  net/sctp/output.c |   14 +++++++++++++-
  1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/net/sctp/output.c b/net/sctp/output.c
index 0ac3a65..6de6402 100644
--- a/net/sctp/output.c
+++ b/net/sctp/output.c
@@ -372,6 +372,16 @@ static void sctp_packet_set_owner_w(struct sk_buff *skb, struct sock *sk)
  	atomic_inc(&sk->sk_wmem_alloc);
  }

+static int is_xfrm_armed(struct dst_entry *dst)
+{
+#ifdef CONFIG_XFRM
+	/* If dst->xfrm is valid, this skb needs to be transformed */
+	return dst->xfrm != NULL;
+#else
+	return 0;
+#endif
+}
+
  /* All packets are sent to the network through this function from
   * sctp_outq_tail().
   *
@@ -536,7 +546,9 @@ int sctp_packet_transmit(struct sctp_packet *packet)
  	 * by CRC32-C as described in <draft-ietf-tsvwg-sctpcsum-02.txt>.
  	 */
  	if (!sctp_checksum_disable) {
-		if (!(dst->dev->features & NETIF_F_SCTP_CSUM)) {
+		if ((!(dst->dev->features & NETIF_F_SCTP_CSUM)) ||
+			is_xfrm_armed(dst)) {
+
  			__u32 crc32 = sctp_start_cksum((__u8 *)sh, cksum_buf_len);

  			/* 3) Put the resultant value into the checksum field in the
-- 
1.7.9.5




-- 
浮沉随浪只记今朝笑

--fan

^ permalink raw reply related

* Re: [PATCH net] {xfrm, sctp} Stick to software crc32 even if hardware is capable of that
From: Fan Du @ 2013-10-11  7:02 UTC (permalink / raw)
  To: Neil Horman, steffen.klassert; +Cc: vyasevich, davem, netdev
In-Reply-To: <20131010131116.GA16147@hmsreliant.think-freely.org>

On 2013年10月10日 21:11, Neil Horman wrote:
> On Thu, Oct 10, 2013 at 01:51:36PM +0800, Fan Du wrote:
>> igb/ixgbe have hardware sctp checksum support, when this feature is enabled
>> and also IPsec is armed to protect sctp traffic, ugly things happened as
>> xfrm_output checks CHECKSUM_PARTIAL to do check sum operation(sum every thing
>> up and pack the 16bits result in the checksum field). The result is fail
>> establishment of sctp communication.
>>
> Shouldn't this be fixed in the xfrm code then?  E.g. check the device features
> for SCTP checksum offloading and and skip the checksum during xfrm output if its
> available?

sigh... same as my first thought.
However why should xfrm_output check whether the skb is a SCTP one as well as whether
the associated dev is able to do hw SCTP crc32-c checksum in the first place, and then
call SCTP crc checksum value API to write the correct checksum *after* this skb has
leaven SCTP layer???

The checksum operation in xfrm_ouput fits TCP/UDP/IP layer checksum semantics, e.g.
calculate 16bits value by sum almost everything up. Make a SCTP specific exception
in this common path sound wired, as the fix can be done in SCTP codes easily with
minimum changes, so not any bit of consequence for non-SCTP traffic.

And If you/Vlad insist to do so as well as no objection from Steffen/David, I'm
happy to send this revised version as your suggested.

Anyway, I should have separated this patch for two which makes the issues more clear
for you and Vlad to review.

> Or am I missing something?
> Neil
>
>

-- 
浮沉随浪只记今朝笑

--fan

^ permalink raw reply

* Re: [PATCH net] {xfrm, sctp} Stick to software crc32 even if hardware is capable of that
From: Fan Du @ 2013-10-11  7:02 UTC (permalink / raw)
  To: Vlad Yasevich; +Cc: nhorman, steffen.klassert, davem, netdev
In-Reply-To: <5256B578.8040101@gmail.com>



On 2013年10月10日 22:11, Vlad Yasevich wrote:
> On 10/10/2013 01:51 AM, Fan Du wrote:
>> igb/ixgbe have hardware sctp checksum support, when this feature is enabled
>> and also IPsec is armed to protect sctp traffic, ugly things happened as
>> xfrm_output checks CHECKSUM_PARTIAL to do check sum operation(sum every thing
>> up and pack the 16bits result in the checksum field). The result is fail
>> establishment of sctp communication.
>>
>> And I saw another point in this part of code, when IPsec is not armed,
>> sctp communication is good, however setting setting CHECKSUM_PARTIAL will
>> make xfrm_output compute dummy checksum values which will be overwritten by
>> hardware lately.
>>
>> So this patch try to solve above two issues together.
>>
>> Signed-off-by: Fan Du <fan.du@windriver.com>
>> ---
>> note:
>> igb/ixgbe hardware is not handy on my side, so just build test only.
>>
>> ---
>> net/sctp/output.c | 27 +++++++++++++++++++--------
>> 1 file changed, 19 insertions(+), 8 deletions(-)
>>
>> diff --git a/net/sctp/output.c b/net/sctp/output.c
>> index 0ac3a65..f0b9cc5 100644
>> --- a/net/sctp/output.c
>> +++ b/net/sctp/output.c
>> @@ -372,6 +372,16 @@ static void sctp_packet_set_owner_w(struct sk_buff *skb, struct sock *sk)
>> atomic_inc(&sk->sk_wmem_alloc);
>> }
>>
>> +static int is_xfrm_armed(struct dst_entry *dst)
>> +{
>> +#ifdef CONFIG_XFRM
>> + /* If dst->xfrm is valid, this skb needs to be transformed */
>> + return dst->xfrm != NULL;
>> +#else
>> + return 0;
>> +#endif
>> +}
>> +
>> /* All packets are sent to the network through this function from
>> * sctp_outq_tail().
>> *
>> @@ -536,20 +546,21 @@ int sctp_packet_transmit(struct sctp_packet *packet)
>> * by CRC32-C as described in <draft-ietf-tsvwg-sctpcsum-02.txt>.
>> */
>> if (!sctp_checksum_disable) {
>> - if (!(dst->dev->features & NETIF_F_SCTP_CSUM)) {
>> + if ((!(dst->dev->features & NETIF_F_SCTP_CSUM)) ||
>> + is_xfrm_armed(dst)) {
>> +
>> __u32 crc32 = sctp_start_cksum((__u8 *)sh, cksum_buf_len);
>>
>> /* 3) Put the resultant value into the checksum field in the
>> * common header, and leave the rest of the bits unchanged.
>> */
>> sh->checksum = sctp_end_cksum(crc32);
>> - } else {
>> - /* no need to seed pseudo checksum for SCTP */
>> - nskb->ip_summed = CHECKSUM_PARTIAL;
>> - nskb->csum_start = (skb_transport_header(nskb) -
>> - nskb->head);
>> - nskb->csum_offset = offsetof(struct sctphdr, checksum);
>> - }
>> + } else
>> + /* Mark skb as CHECKSUM_UNNECESSARY to let hardware compute
>> + * the checksum, and also avoid xfrm_output to do unceccessary
>> + * checksum.
>> + */
>> + nskb->ip_summed = CHECKSUM_UNNECESSARY;
>> }
>
> In addition to what Niel said, the use of CHECKSUM_UNNECESSARY is
> incorrect as it will cause the nic to not compute the checksum.
> The checksum offload depends on the use of CHECKSUM_PARTIAL.

My bad, my head is cramed with IPsec recently :( We should fix this in ipv4/v6 output path.

> -vlad
>
>
>>
>> /* IP layer ECN support
>>
>
>

-- 
浮沉随浪只记今朝笑

--fan

^ permalink raw reply

* Re: pull request: batman-adv 2013-10-09b
From: Antonio Quartulli @ 2013-10-11  6:10 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, b.a.t.m.a.n
In-Reply-To: <20131009.155658.131342560291162660.davem@davemloft.net>

[-- Attachment #1: Type: text/plain, Size: 573 bytes --]

On Wed, Oct 09, 2013 at 03:56:58PM -0400, David Miller wrote:
> From: Antonio Quartulli <antonio@meshcoding.com>
> Date: Wed,  9 Oct 2013 21:32:38 +0200
> 
> > here you have my fixed pull request intended for net-next.
> > 
> > The previous build error was due to an accidental remotion of the beginning of a
> > batadv_dbg() statement during a merge conflict resolution.
> > Sorry for that.
> 
> This looks better, pulled, thanks a lot.

Hello David,

I can't find my patchset in net-next, hasn't it been pushed yet?

Regards,


-- 
Antonio Quartulli

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply

* Re: [PATCH] bridge: update mdb expiration timer upon reports.
From: David Miller @ 2013-10-11  4:57 UTC (permalink / raw)
  To: herbert; +Cc: vyasevic, netdev, xiyou.wangcong, stephen
In-Reply-To: <20131010234856.GA31951@gondor.apana.org.au>

From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Fri, 11 Oct 2013 07:48:56 +0800

> On Thu, Oct 10, 2013 at 03:57:59PM -0400, Vlad Yasevich wrote:
>> commit 9f00b2e7cf241fa389733d41b615efdaa2cb0f5b
>> 	bridge: only expire the mdb entry when query is received
>> changed the mdb expiration timer to be armed only when QUERY is
>> received.  Howerver, this causes issues in an environment where
>> the multicast server socket comes and goes very fast while a client
>> is trying to send traffic to it.
>> 
>> The root cause is a race where a sequence of LEAVE followed by REPORT
>> messages can race against QUERY messages generated in response to LEAVE.
>> The QUERY ends up starting the expiration timer, and that timer can
>> potentially expire after the new REPORT message has been received signaling
>> the new join operation.  This leads to a significant drop in multicast
>> traffic and possible complete stall. 
>> 
>> The solution is to have REPORT messages update the expiration timer
>> on entries that already exist.
>> 
>> CC: Cong Wang <xiyou.wangcong@gmail.com>
>> CC: Herbert Xu <herbert@gondor.apana.org.au>
>> CC: Stephen Hemminger <stephen@networkplumber.org>
>> Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
> 
> Good catch.  Thanks!
> 
> Acked-by: Herbert Xu <herbert@gondor.apana.org.au>

Applied, and queued up for -stable, thanks everyone.

^ permalink raw reply

* Re: [PATCH net-next] openvswitch: fix vport-netdev unregister
From: Alexei Starovoitov @ 2013-10-11  4:48 UTC (permalink / raw)
  To: Jesse Gross
  Cc: dev-yBygre7rU0TnMu66kgdUjQ@public.gmane.org, netdev, Jiri Pirko,
	David S. Miller
In-Reply-To: <CAEP_g=_meOUPrYc7S8rM4NwT_7qG1JZrHJkOg6CmbjDc-rFjAQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

On Thu, Oct 10, 2013 at 8:56 PM, Jesse Gross <jesse-l0M0P4e3n4LQT0dZR+AlfA@public.gmane.org> wrote:
> On Thu, Oct 10, 2013 at 5:47 PM, Alexei Starovoitov <ast-uqk4Ao+rVK5Wk0Htik3J/w@public.gmane.org> wrote:
>> On Thu, Oct 10, 2013 at 3:38 PM, Pravin Shelar <pshelar-l0M0P4e3n4LQT0dZR+AlfA@public.gmane.org> wrote:
>>> On Thu, Oct 10, 2013 at 1:48 PM, Alexei Starovoitov <ast-uqk4Ao+rVK5Wk0Htik3J/w@public.gmane.org> wrote:
>>>> On Thu, Oct 10, 2013 at 11:21 AM, Pravin Shelar <pshelar-l0M0P4e3n4LQT0dZR+AlfA@public.gmane.org> wrote:
>>>>> On Wed, Oct 9, 2013 at 11:26 PM, Alexei Starovoitov <ast-uqk4Ao+rVK5Wk0Htik3J/w@public.gmane.org> wrote:
>>>>>> On Wed, Oct 9, 2013 at 11:07 PM, Pravin Shelar <pshelar-l0M0P4e3n4LQT0dZR+AlfA@public.gmane.org> wrote:
>>>>>>> On Wed, Oct 9, 2013 at 9:11 PM, Alexei Starovoitov <ast-uqk4Ao+rVK5Wk0Htik3J/w@public.gmane.org> wrote:
>>>>>>>> On Wed, Oct 9, 2013 at 8:02 PM, Pravin Shelar <pshelar-l0M0P4e3n4LQT0dZR+AlfA@public.gmane.org> wrote:
>>>>>>>>> On Tue, Oct 8, 2013 at 8:07 PM, Alexei Starovoitov <ast-uqk4Ao+rVK5Wk0Htik3J/w@public.gmane.org> wrote:
>>>>>>>>>> The combination of two commits
>>>>>>>>>>
>>>>>>>>>> commit 8e4e1713e4
>>>>>>>>>> ("openvswitch: Simplify datapath locking.")
>>>>>>>>>>
>>>>>>>>>> and
>>>>>>>>>>
>>>>>>>>>> commit 2537b4dd0a
>>>>>>>>>> ("openvswitch:: link upper device for port devices")
>>>>>>>>>>
>>>>>>>>>> introduced a bug where upper_dev wasn't unlinked upon
>>>>>>>>>> netdev_unregister notification
>>>>>>>>>>
>>>>>>>>>> The following steps:
>>>>>>>>>>
>>>>>>>>>>   modprobe openvswitch
>>>>>>>>>>   ovs-dpctl add-dp test
>>>>>>>>>>   ip tuntap add dev tap1 mode tap
>>>>>>>>>>   ovs-dpctl add-if test tap1
>>>>>>>>>>   ip tuntap del dev tap1 mode tap
>>>>>>>>>>
>>>>>>>>>> are causing multiple warnings:
>>>>>>>>>> diff --git a/net/openvswitch/dp_notify.c b/net/openvswitch/dp_notify.c
>>>>>>>>>> index c323567..e9380bd 100644
>>>>>>>>>> --- a/net/openvswitch/dp_notify.c
>>>>>>>>>> +++ b/net/openvswitch/dp_notify.c
>>>>>>>>>> @@ -88,6 +88,11 @@ static int dp_device_event(struct notifier_block *unused, unsigned long event,
>>>>>>>>>>                 return NOTIFY_DONE;
>>>>>>>>>>
>>>>>>>>>>         if (event == NETDEV_UNREGISTER) {
>>>>>>>>>> +               /* rx_handler_unregister and upper_dev_unlink immediately */
>>>>>>>>>> +               if (dev->reg_state == NETREG_UNREGISTERING)
>>>>>>>>>> +                       ovs_netdev_unlink_dev(vport);
>>>>>>>>>> +
>>>>>>>>>
>>>>>>>>> Rather than doing vport destroy here, we can just unlink upper device
>>>>>>>>> and let workq do rest of work.
>>>>>>>>
>>>>>>>> isn't it what it's doing?
>>>>>>>
>>>>>>> I meant just call netdev_upper_dev_unlink() here in event handler and
>>>>>>> rest of vport destroy can be done in workq.
>>>>>>
>>>>>> netdev_upper_dev_unlink() without netdev_rx_handler_unregister() ?!
>>>>>> that's dangerous.
>>>>> why is it dangerous? ovs already had ref to net-device.
>>>>
>>>> comment from dev.c:
>>>>                 /* Notify protocols, that we are about to destroy
>>>>                    this device. They should clean all the things.
>>>>                 */
>>>>                 call_netdevice_notifiers(NETDEV_UNREGISTER, dev);
>>>>
>>>> so here you're suggesting to just netdev_upper_dev_unlink() to silence
>>>> the warning,
>>>> but then do dev_set_promisc(-1) in workqueue?
>>>>
>>> promiscuity setting is different issue. If you want to have discussion
>>> then you can post separate patch for same. Lets fix the warning here.
>>>
>>>> well, as a minimum audit of promiscuity will be wrong.
>>>> ndo_change_rx_flags will be called after ndo_uninit,
>>>> all sorts of other cleanups are done.
>>>
>>> change_rx_flags() checks for UP flag for given device.
>>>
>>>> I cannot track all possible scenarios, but it seems much safer to
>>>> cleanup everything possible
>>>> as soon as ovs received NETDEV_UNREGISTER event.
>>>>
>>>> May be all these risks are worth taking, then please explain what is
>>>> the problem with the proposed patch?
>>>>
>>> Problem is that this is causing layering issues in OVS. dp_notify is
>>> suppose to work at dp layer. your patch directly calls vport-netdev
>>> implementation function from dp_notify.
>>> I could not think of a simple approach that will do this in completely
>>> clean manner. Therefore I am trying to minimize layering issues. So
>>> just calling netdev_upper_dev_unlink() looks better than doing
>>> anything extra.
>>
>> dp_notify is per net, not per dp.
>> notifier can only be called for net_device and the first thing it does:
>>         if (!ovs_is_internal_dev(dev))
>>                 vport = ovs_netdev_get_vport(dev);
>> where ovs_netdev_get_vport() is defined in vport-netdev.c
>>
>> once it gets into workq, it checks for:
>>        if (vport->ops->type != OVS_VPORT_TYPE_NETDEV)
>>              continue;
>> and
>>         netdev_vport = netdev_vport_priv(vport);
>> where netdev_vport_priv() is defined in netdev-vport.h
>>
>> only then it proceeds with generic ovs_dp_detach_port().
>>
>> Is that the layering you talking about?
>>
>> So to minimize layering issues you want to call 'upper_dev_link' from
>> netdev_create() in vport-netdev.c
>> and 'upper_dev_unlink' directly from dp_device_event() in dp_notify.c?
>>
>> That will look better than calling ovs_netdev_unlink_dev() ?
>
> At a minimum, this function name is misleading because it does a bunch
> of other things beyond unlink the device.

sure. Please propose a different name.

> At this point, it basically seems like six of one, half dozen of the
> other. I don't think that either solution is ideal but it doesn't
> really matter all that much.
>
> However, the check dev->reg_state in netdev_destroy() looks racy to
> me, as it could already be in NETREG_UNREGISTERED even if we already
> processed this device.

you mean that netdev_destroy() will see reg_state == netreg_unregistered,
while dp_device_event() didn't see reg_state == netreg_unregistering yet?
or dp_device_event() saw it, proceeded to do unlink and
netdev_destroy() ran in parallel?
well, that's why reg_state == netreg_unregistering check in netdev_destroy()
is done with rtnl_lock() held.
reg_state cannot go into netreg_unregistered state skipping
netreg_unregistering and notifier.
therefore I don't think it's racy.

In ovs_dp_notify_wq() you're checking for both unregistering and
unregistered and that makes
sense, since workq can run after unregistering notifier called and
netdev_run_todo()
already changed the state to unregistered.
But here it's not the case.

^ permalink raw reply

* Re: Peak TCP performance
From: Kyle Hubert @ 2013-10-11  4:21 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev
In-Reply-To: <1381463075.4971.99.camel@edumazet-glaptop.roam.corp.google.com>

On Thu, Oct 10, 2013 at 11:44 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> Also, my copy of ethtool does not recognize tx-nocache-copy. However,
>> I do have control over the net device. Is there something there I can
>> set, or is tx-nocache-copy also a new feature? I'll start digging.
>>
>
> nocache-copy was added in 3.0, but I do find its not a gain for current
> cpus.
>
> You could get a fresh copy of ethtool sources :
>
> git clone git://git.kernel.org/pub/scm/network/ethtool/ethtool.git
> cd ethtool
> ./autogen.sh  ...

That did the trick. Thanks for the help! Is there somewhere I can read
up on this feature? A lot of the netdev features are opaque to me.
Also, I can set NETIF_F_NOCACHE_COPY in the netdev->features to set
this by default, yes?

This at least mirrors the performance improvement that I see when
forwarding, however I still see reserved CPU time. Is there a way to
push it even farther?

-Kyle

^ permalink raw reply

* Re: [PATCH net-next] openvswitch: fix vport-netdev unregister
From: Jesse Gross @ 2013-10-11  3:56 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Pravin Shelar, David S. Miller, Jiri Pirko, dev@openvswitch.org,
	netdev
In-Reply-To: <CAMEtUuyEw81g8+68HUMpR1RERAA9rjL=WHRE58se83cdur67eg@mail.gmail.com>

On Thu, Oct 10, 2013 at 5:47 PM, Alexei Starovoitov <ast@plumgrid.com> wrote:
> On Thu, Oct 10, 2013 at 3:38 PM, Pravin Shelar <pshelar@nicira.com> wrote:
>> On Thu, Oct 10, 2013 at 1:48 PM, Alexei Starovoitov <ast@plumgrid.com> wrote:
>>> On Thu, Oct 10, 2013 at 11:21 AM, Pravin Shelar <pshelar@nicira.com> wrote:
>>>> On Wed, Oct 9, 2013 at 11:26 PM, Alexei Starovoitov <ast@plumgrid.com> wrote:
>>>>> On Wed, Oct 9, 2013 at 11:07 PM, Pravin Shelar <pshelar@nicira.com> wrote:
>>>>>> On Wed, Oct 9, 2013 at 9:11 PM, Alexei Starovoitov <ast@plumgrid.com> wrote:
>>>>>>> On Wed, Oct 9, 2013 at 8:02 PM, Pravin Shelar <pshelar@nicira.com> wrote:
>>>>>>>> On Tue, Oct 8, 2013 at 8:07 PM, Alexei Starovoitov <ast@plumgrid.com> wrote:
>>>>>>>>> The combination of two commits
>>>>>>>>>
>>>>>>>>> commit 8e4e1713e4
>>>>>>>>> ("openvswitch: Simplify datapath locking.")
>>>>>>>>>
>>>>>>>>> and
>>>>>>>>>
>>>>>>>>> commit 2537b4dd0a
>>>>>>>>> ("openvswitch:: link upper device for port devices")
>>>>>>>>>
>>>>>>>>> introduced a bug where upper_dev wasn't unlinked upon
>>>>>>>>> netdev_unregister notification
>>>>>>>>>
>>>>>>>>> The following steps:
>>>>>>>>>
>>>>>>>>>   modprobe openvswitch
>>>>>>>>>   ovs-dpctl add-dp test
>>>>>>>>>   ip tuntap add dev tap1 mode tap
>>>>>>>>>   ovs-dpctl add-if test tap1
>>>>>>>>>   ip tuntap del dev tap1 mode tap
>>>>>>>>>
>>>>>>>>> are causing multiple warnings:
>>>>>>>>> diff --git a/net/openvswitch/dp_notify.c b/net/openvswitch/dp_notify.c
>>>>>>>>> index c323567..e9380bd 100644
>>>>>>>>> --- a/net/openvswitch/dp_notify.c
>>>>>>>>> +++ b/net/openvswitch/dp_notify.c
>>>>>>>>> @@ -88,6 +88,11 @@ static int dp_device_event(struct notifier_block *unused, unsigned long event,
>>>>>>>>>                 return NOTIFY_DONE;
>>>>>>>>>
>>>>>>>>>         if (event == NETDEV_UNREGISTER) {
>>>>>>>>> +               /* rx_handler_unregister and upper_dev_unlink immediately */
>>>>>>>>> +               if (dev->reg_state == NETREG_UNREGISTERING)
>>>>>>>>> +                       ovs_netdev_unlink_dev(vport);
>>>>>>>>> +
>>>>>>>>
>>>>>>>> Rather than doing vport destroy here, we can just unlink upper device
>>>>>>>> and let workq do rest of work.
>>>>>>>
>>>>>>> isn't it what it's doing?
>>>>>>
>>>>>> I meant just call netdev_upper_dev_unlink() here in event handler and
>>>>>> rest of vport destroy can be done in workq.
>>>>>
>>>>> netdev_upper_dev_unlink() without netdev_rx_handler_unregister() ?!
>>>>> that's dangerous.
>>>> why is it dangerous? ovs already had ref to net-device.
>>>
>>> comment from dev.c:
>>>                 /* Notify protocols, that we are about to destroy
>>>                    this device. They should clean all the things.
>>>                 */
>>>                 call_netdevice_notifiers(NETDEV_UNREGISTER, dev);
>>>
>>> so here you're suggesting to just netdev_upper_dev_unlink() to silence
>>> the warning,
>>> but then do dev_set_promisc(-1) in workqueue?
>>>
>> promiscuity setting is different issue. If you want to have discussion
>> then you can post separate patch for same. Lets fix the warning here.
>>
>>> well, as a minimum audit of promiscuity will be wrong.
>>> ndo_change_rx_flags will be called after ndo_uninit,
>>> all sorts of other cleanups are done.
>>
>> change_rx_flags() checks for UP flag for given device.
>>
>>> I cannot track all possible scenarios, but it seems much safer to
>>> cleanup everything possible
>>> as soon as ovs received NETDEV_UNREGISTER event.
>>>
>>> May be all these risks are worth taking, then please explain what is
>>> the problem with the proposed patch?
>>>
>> Problem is that this is causing layering issues in OVS. dp_notify is
>> suppose to work at dp layer. your patch directly calls vport-netdev
>> implementation function from dp_notify.
>> I could not think of a simple approach that will do this in completely
>> clean manner. Therefore I am trying to minimize layering issues. So
>> just calling netdev_upper_dev_unlink() looks better than doing
>> anything extra.
>
> dp_notify is per net, not per dp.
> notifier can only be called for net_device and the first thing it does:
>         if (!ovs_is_internal_dev(dev))
>                 vport = ovs_netdev_get_vport(dev);
> where ovs_netdev_get_vport() is defined in vport-netdev.c
>
> once it gets into workq, it checks for:
>        if (vport->ops->type != OVS_VPORT_TYPE_NETDEV)
>              continue;
> and
>         netdev_vport = netdev_vport_priv(vport);
> where netdev_vport_priv() is defined in netdev-vport.h
>
> only then it proceeds with generic ovs_dp_detach_port().
>
> Is that the layering you talking about?
>
> So to minimize layering issues you want to call 'upper_dev_link' from
> netdev_create() in vport-netdev.c
> and 'upper_dev_unlink' directly from dp_device_event() in dp_notify.c?
>
> That will look better than calling ovs_netdev_unlink_dev() ?

At a minimum, this function name is misleading because it does a bunch
of other things beyond unlink the device.

At this point, it basically seems like six of one, half dozen of the
other. I don't think that either solution is ideal but it doesn't
really matter all that much.

However, the check dev->reg_state in netdev_destroy() looks racy to
me, as it could already be in NETREG_UNREGISTERED even if we already
processed this device.

^ permalink raw reply

* Re: Peak TCP performance
From: Eric Dumazet @ 2013-10-11  3:44 UTC (permalink / raw)
  To: Kyle Hubert; +Cc: netdev
In-Reply-To: <CAJoZ4U0tpKhknMLd7kN4uQmq-E81=EbH1Q816nX_eig=tGreAQ@mail.gmail.com>

On Thu, 2013-10-10 at 23:33 -0400, Kyle Hubert wrote:

> Ah, yes, sorry. I'm on 3.0.80 (SLES SP2). And, no, I do not have that
> commit. In fact, the code looks quite different. There must have been
> a lot of changes?
> 

Probably ;)

> Also, my copy of ethtool does not recognize tx-nocache-copy. However,
> I do have control over the net device. Is there something there I can
> set, or is tx-nocache-copy also a new feature? I'll start digging.
> 

nocache-copy was added in 3.0, but I do find its not a gain for current
cpus.

You could get a fresh copy of ethtool sources :

git clone git://git.kernel.org/pub/scm/network/ethtool/ethtool.git
cd ethtool
./autogen.sh  ...

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox