Netdev List

Netdev List
 help / color / mirror / Atom feed

* [PATCH v4 2/5] net: Handle encapsulated offloads before fragmentation or handing to lower dev
From: Joseph Gasparakis @ 2012-12-08  0:14 UTC (permalink / raw)
  To: davem, shemminger, chrisw, gospo
  Cc: Alexander Duyck, netdev, linux-kernel, dmitry, saeed.bishara,
	bhutchings
In-Reply-To: <1354925658-24115-1-git-send-email-joseph.gasparakis@intel.com>

From: Alexander Duyck <alexander.h.duyck@intel.com>

This change allows the VXLAN to enable Tx checksum offloading even on
devices that do not support encapsulated checksum offloads. The
advantage to this is that it allows for the lower device to change due
to routing table changes without impacting features on the VXLAN itself.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
---
 net/core/dev.c       | 15 +++++++++++++--
 net/ipv4/ip_output.c |  4 ++++
 2 files changed, 17 insertions(+), 2 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 307142a..a4c4a1b 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2324,6 +2324,13 @@ int dev_hard_start_xmit(struct sk_buff *skb, struct net_device *dev,
 			skb->vlan_tci = 0;
 		}
 
+		/* If encapsulation offload request, verify we are testing
+		 * hardware encapsulation features instead of standard
+		 * features for the netdev
+		 */
+		if (skb->encapsulation)
+			features &= dev->hw_enc_features;
+
 		if (netif_needs_gso(skb, features)) {
 			if (unlikely(dev_gso_segment(skb, features)))
 				goto out_kfree_skb;
@@ -2339,8 +2346,12 @@ int dev_hard_start_xmit(struct sk_buff *skb, struct net_device *dev,
 			 * checksumming here.
 			 */
 			if (skb->ip_summed == CHECKSUM_PARTIAL) {
-				skb_set_transport_header(skb,
-					skb_checksum_start_offset(skb));
+				if (skb->encapsulation)
+					skb_set_inner_transport_header(skb,
+						skb_checksum_start_offset(skb));
+				else
+					skb_set_transport_header(skb,
+						skb_checksum_start_offset(skb));
 				if (!(features & NETIF_F_ALL_CSUM) &&
 				     skb_checksum_help(skb))
 					goto out_kfree_skb;
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 6537a40..3e98ed2 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -595,6 +595,10 @@ slow_path_clean:
 	}
 
 slow_path:
+	/* for offloaded checksums cleanup checksum before fragmentation */
+	if ((skb->ip_summed == CHECKSUM_PARTIAL) && skb_checksum_help(skb))
+		goto fail;
+
 	left = skb->len - hlen;		/* Space per frame */
 	ptr = hlen;		/* Where to start from */
 
-- 
1.7.11.7

^ permalink raw reply related

* [PATCH v4 1/5] net: Add support for hardware-offloaded encapsulation
From: Joseph Gasparakis @ 2012-12-08  0:14 UTC (permalink / raw)
  To: davem, shemminger, chrisw, gospo
  Cc: Joseph Gasparakis, netdev, linux-kernel, dmitry, saeed.bishara,
	bhutchings, Peter P Waskiewicz Jr, Alexander Duyck
In-Reply-To: <1354925658-24115-1-git-send-email-joseph.gasparakis@intel.com>

This patch adds support in the kernel for offloading in the NIC Tx and Rx
checksumming for encapsulated packets (such as VXLAN and IP GRE).

For Tx encapsulation offload, the driver will need to set the right bits
in netdev->hw_enc_features. The protocol driver will have to set the
skb->encapsulation bit and populate the inner headers, so the NIC driver will
use those inner headers to calculate the csum in hardware.

For Rx encapsulation offload, the driver will need to set again the
skb->encapsulation flag and the skb->ip_csum to CHECKSUM_UNNECESSARY.
In that case the protocol driver should push the decapsulated packet up
to the stack, again with CHECKSUM_UNNECESSARY. In ether case, the protocol
driver should set the skb->encapsulation flag back to zero. Fianlly the
protocol driver should have NETIF_F_RXCSUM flag set in its features.

Signed-off-by: Joseph Gasparakis <joseph.gasparakis@intel.com>
Signed-off-by: Peter P Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com>
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
---
 include/linux/ip.h        |  5 +++
 include/linux/ipv6.h      |  5 +++
 include/linux/netdevice.h |  6 +++
 include/linux/skbuff.h    | 95 ++++++++++++++++++++++++++++++++++++++++++++++-
 include/linux/tcp.h       | 10 +++++
 include/linux/udp.h       |  5 +++
 net/core/skbuff.c         |  9 +++++
 7 files changed, 134 insertions(+), 1 deletion(-)

diff --git a/include/linux/ip.h b/include/linux/ip.h
index 58b82a2..492bc65 100644
--- a/include/linux/ip.h
+++ b/include/linux/ip.h
@@ -25,6 +25,11 @@ static inline struct iphdr *ip_hdr(const struct sk_buff *skb)
 	return (struct iphdr *)skb_network_header(skb);
 }
 
+static inline struct iphdr *inner_ip_hdr(const struct sk_buff *skb)
+{
+	return (struct iphdr *)skb_inner_network_header(skb);
+}
+
 static inline struct iphdr *ipip_hdr(const struct sk_buff *skb)
 {
 	return (struct iphdr *)skb_transport_header(skb);
diff --git a/include/linux/ipv6.h b/include/linux/ipv6.h
index 12729e9..faed1e3 100644
--- a/include/linux/ipv6.h
+++ b/include/linux/ipv6.h
@@ -67,6 +67,11 @@ static inline struct ipv6hdr *ipv6_hdr(const struct sk_buff *skb)
 	return (struct ipv6hdr *)skb_network_header(skb);
 }
 
+static inline struct ipv6hdr *inner_ipv6_hdr(const struct sk_buff *skb)
+{
+	return (struct ipv6hdr *)skb_inner_network_header(skb);
+}
+
 static inline struct ipv6hdr *ipipv6_hdr(const struct sk_buff *skb)
 {
 	return (struct ipv6hdr *)skb_transport_header(skb);
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 18c5dc9..c6a14d4 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1063,6 +1063,12 @@ struct net_device {
 	netdev_features_t	wanted_features;
 	/* mask of features inheritable by VLAN devices */
 	netdev_features_t	vlan_features;
+	/* mask of features inherited by encapsulating devices
+	 * This field indicates what encapsulation offloads
+	 * the hardware is capable of doing, and drivers will
+	 * need to set them appropriately.
+	 */
+	netdev_features_t	hw_enc_features;
 
 	/* Interface index. Unique device identifier	*/
 	int			ifindex;
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index f2af494..320e976 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -376,6 +376,8 @@ typedef unsigned char *sk_buff_data_t;
  *	@mark: Generic packet mark
  *	@dropcount: total number of sk_receive_queue overflows
  *	@vlan_tci: vlan tag control information
+ *	@inner_transport_header: Inner transport layer header (encapsulation)
+ *	@inner_network_header: Network layer header (encapsulation)
  *	@transport_header: Transport layer header
  *	@network_header: Network layer header
  *	@mac_header: Link layer header
@@ -471,7 +473,13 @@ struct sk_buff {
 	__u8			wifi_acked:1;
 	__u8			no_fcs:1;
 	__u8			head_frag:1;
-	/* 8/10 bit hole (depending on ndisc_nodetype presence) */
+	/* Encapsulation protocol and NIC drivers should use
+	 * this flag to indicate to each other if the skb contains
+	 * encapsulated packet or not and maybe use the inner packet
+	 * headers if needed
+	 */
+	__u8			encapsulation:1;
+	/* 7/9 bit hole (depending on ndisc_nodetype presence) */
 	kmemcheck_bitfield_end(flags2);
 
 #ifdef CONFIG_NET_DMA
@@ -486,6 +494,8 @@ struct sk_buff {
 		__u32		avail_size;
 	};
 
+	sk_buff_data_t		inner_transport_header;
+	sk_buff_data_t		inner_network_header;
 	sk_buff_data_t		transport_header;
 	sk_buff_data_t		network_header;
 	sk_buff_data_t		mac_header;
@@ -1435,12 +1445,53 @@ static inline void skb_reserve(struct sk_buff *skb, int len)
 	skb->tail += len;
 }
 
+static inline void skb_reset_inner_headers(struct sk_buff *skb)
+{
+	skb->inner_network_header = skb->network_header;
+	skb->inner_transport_header = skb->transport_header;
+}
+
 static inline void skb_reset_mac_len(struct sk_buff *skb)
 {
 	skb->mac_len = skb->network_header - skb->mac_header;
 }
 
 #ifdef NET_SKBUFF_DATA_USES_OFFSET
+static inline unsigned char *skb_inner_transport_header(const struct sk_buff
+							*skb)
+{
+	return skb->head + skb->inner_transport_header;
+}
+
+static inline void skb_reset_inner_transport_header(struct sk_buff *skb)
+{
+	skb->inner_transport_header = skb->data - skb->head;
+}
+
+static inline void skb_set_inner_transport_header(struct sk_buff *skb,
+						   const int offset)
+{
+	skb_reset_inner_transport_header(skb);
+	skb->inner_transport_header += offset;
+}
+
+static inline unsigned char *skb_inner_network_header(const struct sk_buff *skb)
+{
+	return skb->head + skb->inner_network_header;
+}
+
+static inline void skb_reset_inner_network_header(struct sk_buff *skb)
+{
+	skb->inner_network_header = skb->data - skb->head;
+}
+
+static inline void skb_set_inner_network_header(struct sk_buff *skb,
+						const int offset)
+{
+	skb_reset_inner_network_header(skb);
+	skb->inner_network_header += offset;
+}
+
 static inline unsigned char *skb_transport_header(const struct sk_buff *skb)
 {
 	return skb->head + skb->transport_header;
@@ -1496,6 +1547,38 @@ static inline void skb_set_mac_header(struct sk_buff *skb, const int offset)
 }
 
 #else /* NET_SKBUFF_DATA_USES_OFFSET */
+static inline unsigned char *skb_inner_transport_header(const struct sk_buff
+							*skb)
+{
+	return skb->inner_transport_header;
+}
+
+static inline void skb_reset_inner_transport_header(struct sk_buff *skb)
+{
+	skb->inner_transport_header = skb->data;
+}
+
+static inline void skb_set_inner_transport_header(struct sk_buff *skb,
+						   const int offset)
+{
+	skb->inner_transport_header = skb->data + offset;
+}
+
+static inline unsigned char *skb_inner_network_header(const struct sk_buff *skb)
+{
+	return skb->inner_network_header;
+}
+
+static inline void skb_reset_inner_network_header(struct sk_buff *skb)
+{
+	skb->inner_network_header = skb->data;
+}
+
+static inline void skb_set_inner_network_header(struct sk_buff *skb,
+						const int offset)
+{
+	skb->inner_network_header = skb->data + offset;
+}
 
 static inline unsigned char *skb_transport_header(const struct sk_buff *skb)
 {
@@ -1574,11 +1657,21 @@ static inline u32 skb_network_header_len(const struct sk_buff *skb)
 	return skb->transport_header - skb->network_header;
 }
 
+static inline u32 skb_inner_network_header_len(const struct sk_buff *skb)
+{
+	return skb->inner_transport_header - skb->inner_network_header;
+}
+
 static inline int skb_network_offset(const struct sk_buff *skb)
 {
 	return skb_network_header(skb) - skb->data;
 }
 
+static inline int skb_inner_network_offset(const struct sk_buff *skb)
+{
+	return skb_inner_network_header(skb) - skb->data;
+}
+
 static inline int pskb_network_may_pull(struct sk_buff *skb, unsigned int len)
 {
 	return pskb_may_pull(skb, skb_network_offset(skb) + len);
diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 60b7aac..4e1d228 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -35,6 +35,16 @@ static inline unsigned int tcp_hdrlen(const struct sk_buff *skb)
 	return tcp_hdr(skb)->doff * 4;
 }
 
+static inline struct tcphdr *inner_tcp_hdr(const struct sk_buff *skb)
+{
+	return (struct tcphdr *)skb_inner_transport_header(skb);
+}
+
+static inline unsigned int inner_tcp_hdrlen(const struct sk_buff *skb)
+{
+	return inner_tcp_hdr(skb)->doff * 4;
+}
+
 static inline unsigned int tcp_optlen(const struct sk_buff *skb)
 {
 	return (tcp_hdr(skb)->doff - 5) * 4;
diff --git a/include/linux/udp.h b/include/linux/udp.h
index 0b67d77..9d81de1 100644
--- a/include/linux/udp.h
+++ b/include/linux/udp.h
@@ -27,6 +27,11 @@ static inline struct udphdr *udp_hdr(const struct sk_buff *skb)
 	return (struct udphdr *)skb_transport_header(skb);
 }
 
+static inline struct udphdr *inner_udp_hdr(const struct sk_buff *skb)
+{
+	return (struct udphdr *)skb_inner_transport_header(skb);
+}
+
 #define UDP_HTABLE_SIZE_MIN		(CONFIG_BASE_SMALL ? 128 : 256)
 
 static inline int udp_hashfn(struct net *net, unsigned num, unsigned mask)
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 880722e2..ccbabf5 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -682,11 +682,14 @@ static void __copy_skb_header(struct sk_buff *new, const struct sk_buff *old)
 	new->transport_header	= old->transport_header;
 	new->network_header	= old->network_header;
 	new->mac_header		= old->mac_header;
+	new->inner_transport_header = old->inner_transport_header;
+	new->inner_network_header = old->inner_transport_header;
 	skb_dst_copy(new, old);
 	new->rxhash		= old->rxhash;
 	new->ooo_okay		= old->ooo_okay;
 	new->l4_rxhash		= old->l4_rxhash;
 	new->no_fcs		= old->no_fcs;
+	new->encapsulation	= old->encapsulation;
 #ifdef CONFIG_XFRM
 	new->sp			= secpath_get(old->sp);
 #endif
@@ -892,6 +895,8 @@ static void copy_skb_header(struct sk_buff *new, const struct sk_buff *old)
 	new->network_header   += offset;
 	if (skb_mac_header_was_set(new))
 		new->mac_header	      += offset;
+	new->inner_transport_header += offset;
+	new->inner_network_header   += offset;
 #endif
 	skb_shinfo(new)->gso_size = skb_shinfo(old)->gso_size;
 	skb_shinfo(new)->gso_segs = skb_shinfo(old)->gso_segs;
@@ -1089,6 +1094,8 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail,
 	skb->network_header   += off;
 	if (skb_mac_header_was_set(skb))
 		skb->mac_header += off;
+	skb->inner_transport_header += off;
+	skb->inner_network_header += off;
 	/* Only adjust this if it actually is csum_start rather than csum */
 	if (skb->ip_summed == CHECKSUM_PARTIAL)
 		skb->csum_start += nhead;
@@ -1188,6 +1195,8 @@ struct sk_buff *skb_copy_expand(const struct sk_buff *skb,
 	n->network_header   += off;
 	if (skb_mac_header_was_set(skb))
 		n->mac_header += off;
+	n->inner_transport_header += off;
+	n->inner_network_header	   += off;
 #endif
 
 	return n;
-- 
1.7.11.7

^ permalink raw reply related

* [PATCH v4 net-next 0/5] tunneling: Add support for hardware-offloaded encapsulation
From: Joseph Gasparakis @ 2012-12-08  0:14 UTC (permalink / raw)
  To: davem, shemminger, chrisw, gospo
  Cc: Joseph Gasparakis, netdev, linux-kernel, dmitry, saeed.bishara,
	bhutchings

The series contains updates to add in the NIC Rx and Tx checksumming support
for encapsulated packets.

The sk_buff needs to somehow have information of the inner packet, and adding
three fields for the inner mac, network and transport headers was the prefered
approach. 

Not adding these fields would mean that the drivers would need to parse the
sk_buff data in hot-path, having a negative impact in the performance.

Adding in sk_buff a pointer to the skbuff of the inner packet made sense, but
would be a complicated change as assumptions needed to be made with regards to
helper functions such as skb_clone() skb_copy(). Also code for the existing
encapsulation protocols (such as VXLAN and IP GRE) had to be reworked, so the
decision was to have the simple approach of adding these three fields.

v2 Makes sure that checksumming for IP GRE does not take place if the offload
   flag is set in the skb's netdev features

v3 Fixes issues picked up by the community in v2 and is intended to provide
   ability to demo vxlan Tx offloading with Intel's ixgbe. As part of this, 
   it provides an RFC patch for ixgbe to take advantage of the offloading
   mechanism

   Now it is possible to create a vxlan interface like this:
    #ip link add vxlan0 type vxlan id 40 ttl 10 group 239.1.1.1 dev eth0

   Then turn on/off the encapsulation offload mechanism by doing:
    #ethtool -K eth0 tx-checksum-ip-generic on

   In v3 ipgre work got paused (and therefore patches not included) and I will
   come back to it when vxlan is accepted by the community.

v4 Added more detailed commit logs and code comments as per request in v3
   Also now the Rx offload encapsulation patch is included in the series.

^ permalink raw reply

* Experimental network protocol produces "dropping frag"
From: Templin, Fred L @ 2012-12-07 23:41 UTC (permalink / raw)
  To: netdev@vger.kernel.org
In-Reply-To: <1354897227.2707.10.camel@bwh-desktop.uk.solarflarecom.com>

Hello,

Under the 3.2.17 kernel, I am working with an experimental
tunneling protocol. I am using ip-proto-253 for the experiment,
and have coded the tunnel code in as a module under net/ipv4
by cloning net/ipv4/ip_gre.c. I have been testing the module
by using modprobe to reload the module without rebooting the
machines.

I was successfully able to send and receive packets via the
tunnel and verified that both directions were working using
tcpdump and wireshark which showed ip-proto-253 packets going
in both directions. Since then, however, I recompiled and
reinistalled the entire kernel and rebooted the machines.

Now, I can still send my ip-proto-253 packets, but on the
receiving side I see dmesg output that says:

  "dropping frag"

for every packet received. The input packets appear to be
accounted for when I type "ifconfig eth0" (i.e., the input
packet count is incrementing) but printk's from my tunnel
driver code are not showing up when I type dmesg and
"ifconfig tun0" on the tunnel interface does not show any
input packets. It is as if the receiver has "forgotten"
how to deal with ip-proto-253.

Any ideas on how I may have shot myself in the foot, and
ways to recover?

Thanks - Fred
fred.l.templin@boeing.com

^ permalink raw reply

* Re: [PATCH v3 3/4] vxlan: capture inner headers during encapsulation
From: Jeff Kirsher @ 2012-12-07 23:16 UTC (permalink / raw)
  To: Joseph Gasparakis
  Cc: davem, shemminger, chrisw, gospo, netdev, linux-kernel, dmitry,
	saeed.bishara, Peter P Waskiewicz Jr, Alexander Duyck
In-Reply-To: <1354845419-22483-4-git-send-email-joseph.gasparakis@intel.com>

[-- Attachment #1: Type: text/plain, Size: 585 bytes --]

On 12/06/2012 05:56 PM, Joseph Gasparakis wrote:
> Allow VXLAN to make use of Tx checksum offloading and Tx scatter-gather.
> The advantage to these two changes is that it also allows the VXLAN to
> make use of GSO.
>
> Signed-off-by: Joseph Gasparakis <joseph.gasparakis@intel.com>
> Signed-off-by: Peter P Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com>
> Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
> ---
>  drivers/net/vxlan.c |   10 +++++++++-
>  1 files changed, 9 insertions(+), 1 deletions(-)
Acked-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 899 bytes --]

^ permalink raw reply

* Re: [PATCH v3 2/4] net: Handle encapsulated offloads before fragmentation or handing to lower dev
From: Jeff Kirsher @ 2012-12-07 23:15 UTC (permalink / raw)
  To: Joseph Gasparakis
  Cc: davem, shemminger, chrisw, gospo, Alexander Duyck, netdev,
	linux-kernel, dmitry, saeed.bishara
In-Reply-To: <1354845419-22483-3-git-send-email-joseph.gasparakis@intel.com>

[-- Attachment #1: Type: text/plain, Size: 667 bytes --]

On 12/06/2012 05:56 PM, Joseph Gasparakis wrote:
> From: Alexander Duyck <alexander.h.duyck@intel.com>
>
> This change allows the VXLAN to enable Tx checksum offloading even on
> devices that do not support encapsulated checksum offloads. The
> advantage to this is that it allows for the lower device to change due
> to routing table changes without impacting features on the VXLAN itself.
>
> Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
> ---
>  net/core/dev.c       |   15 +++++++++++++--
>  net/ipv4/ip_output.c |    4 ++++
>  2 files changed, 17 insertions(+), 2 deletions(-)
Acked-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 899 bytes --]

^ permalink raw reply

* Re: [PATCH v3 1/4] net: Add support for hardware-offloaded encapsulation
From: Jeff Kirsher @ 2012-12-07 23:15 UTC (permalink / raw)
  To: Joseph Gasparakis
  Cc: davem, shemminger, chrisw, gospo, netdev, linux-kernel, dmitry,
	saeed.bishara, Peter P Waskiewicz Jr, Alexander Duyck
In-Reply-To: <1354845419-22483-2-git-send-email-joseph.gasparakis@intel.com>

[-- Attachment #1: Type: text/plain, Size: 848 bytes --]

On 12/06/2012 05:56 PM, Joseph Gasparakis wrote:
> This patch adds support in the kernel for offloading in the NIC Tx and Rx
> checksumming for encapsulated packets (such as VXLAN and IP GRE).
>
> Signed-off-by: Joseph Gasparakis <joseph.gasparakis@intel.com>
> Signed-off-by: Peter P Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com>
> Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
> ---
>  include/linux/ip.h        |    5 ++
>  include/linux/ipv6.h      |    5 ++
>  include/linux/netdevice.h |    2 +
>  include/linux/skbuff.h    |   90 ++++++++++++++++++++++++++++++++++++++++++++-
>  include/linux/tcp.h       |   10 +++++
>  include/linux/udp.h       |    5 ++
>  net/core/skbuff.c         |    9 ++++
>  7 files changed, 125 insertions(+), 1 deletions(-)
Acked-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 899 bytes --]

^ permalink raw reply

* Re: [PATCH net-next v3 0/3] Multiqueue support in virtio-net
From: Stephen Hemminger @ 2012-12-07 23:13 UTC (permalink / raw)
  To: David Miller
  Cc: krkumar2, kvm, mst, netdev, linux-kernel, virtualization,
	bhutchings, jwhan, shiyer
In-Reply-To: <20121207.153556.989159003987459765.davem@davemloft.net>

On Fri, 07 Dec 2012 15:35:56 -0500 (EST)
David Miller <davem@davemloft.net> wrote:

> From: Jason Wang <jasowang@redhat.com>
> Date: Sat,  8 Dec 2012 01:04:54 +0800
> 
> > This series is an update version (hope the final version) of multiqueue
> > (VIRTIO_NET_F_MQ) support in virtio-net driver. All previous comments were
> > addressed, the work were based on Krishna Kumar's work to let virtio-net use
> > multiple rx/tx queues to do the packets reception and transmission. Performance
> > test show the aggregate latency were increased greately but may get some
> > regression in small packet transmission. Due to this, multiqueue were disabled
> > by default. If user want to benefit form the multiqueue, ethtool -L could be
> > used to enable the feature.
> > 
> > Please review and comments.
> > 
> > A protype implementation of qemu-kvm support could by found in
> > git://github.com/jasowang/qemu-kvm-mq.git. To start a guest with two queues, you
> > could specify the queues parameters to both tap and virtio-net like:
> > 
> > ./qemu-kvm -netdev tap,queues=2,... -device virtio-net-pci,queues=2,...
> > 
> > then enable the multiqueue through ethtool by:
> > 
> > ethtool -L eth0 combined 2
> 
> It seems like most, if not all, of the feedback given for this series
> has been addressed by Jason.
> 
> Can I get some ACKs?

Other than the minor style nit in the first patch, I see no issues.
This is really needed by Virtual Routers.

Acked-by: Stephen Hemminger <shemminger@vyatta.com>

^ permalink raw reply

* Re: [PATCH net-next v3 1/3] virtio-net: separate fields of sending/receiving queue from virtnet_info
From: Stephen Hemminger @ 2012-12-07 23:11 UTC (permalink / raw)
  To: Jason Wang
  Cc: krkumar2, kvm, mst, netdev, linux-kernel, virtualization,
	bhutchings, jwhan, davem, shiyer
In-Reply-To: <1354899897-10423-2-git-send-email-jasowang@redhat.com>

Minor style issue reported by checkpatch which can be fixed after merge.
Although sizeof is actually an operator in C, it is considered correct
style to treat it as a function.


WARNING: sizeof hdr->hdr should be sizeof(hdr->hdr)
#293: FILE: drivers/net/virtio_net.c:395:
+	sg_set_buf(rq->sg, &hdr->hdr, sizeof hdr->hdr);

WARNING: sizeof hdr->mhdr should be sizeof(hdr->mhdr)
#552: FILE: drivers/net/virtio_net.c:641:
+		sg_set_buf(sq->sg, &hdr->mhdr, sizeof hdr->mhdr);

WARNING: sizeof hdr->hdr should be sizeof(hdr->hdr)
#555: FILE: drivers/net/virtio_net.c:643:
+		sg_set_buf(sq->sg, &hdr->hdr, sizeof hdr->hdr);

^ permalink raw reply

* Re: [PATCH net-next 03/10] tipc: sk_recv_queue size check only for connectionless sockets
From: Jon Maloy @ 2012-12-07 22:30 UTC (permalink / raw)
  To: Neil Horman; +Cc: Paul Gortmaker, David Miller, netdev, Ying Xue
In-Reply-To: <20121207192030.GA30339@hmsreliant.think-freely.org>

On 12/07/2012 02:20 PM, Neil Horman wrote:
> On Fri, Dec 07, 2012 at 09:28:11AM -0500, Paul Gortmaker wrote:
>> From: Ying Xue <ying.xue@windriver.com>
>>
>> The sk_receive_queue limit control is currently performed for
>> all arriving messages, disregarding socket and message type.
>> But for connected sockets this check is redundant, since the protocol
>> flow control already makes queue overflow impossible.
>>
> Can you explain where that occurs?  

It happens in the functions port_dispatcher_sigh() and tipc_send(), 
among other places. Both are to be found in the file port.c, which 
was supposed to contain the 'generic' (i.e., API independent) part 
of the send/receive code.
Now that we have only one API left, the socket API, we are 
planning to merge the code in socket.c and port.c, and get rid of 
some code overhead.

The flow control in TIPC is message based, where the sender
requires to receive an explicit acknowledge message for each 
512 message the receiver reads to user space.
If the sender has more than 1024 messages outstanding without having
received an acknowledge he will be suspended or receive EAGAIN until 
he does.
The plan going forward is to replace this mechanism with a more 
standard looking byte based flow control, while maintaining 
backwards compatibility.

> I see where the tipc dispatch function calls
> sk_add_backlog, which checks the per socket recieve queue (regardless of weather
> the receiving socket is connection oriented or connectionless), but if the
> receiver doesn't call receive very often, This just adds a check against your
> global limit, doing nothing for your per-socket limits. 

OVERLOAD_LIMIT_BASE is tested against a per-socket message counter, so it _is_
our per-socket limit. In fact, TIPC connectionless overflow control currently 
is a kind of a hybrid, based on a message counter when the socket is not locked, 
and based on sk_rcv_queue's byte limit when a message has to be added to the 
backlog.
We are planning to fix this inconsistency too.

 In fact it seems to
> repeat the same check twice, as in the worst case of the incomming message being
> TIPC_LOW_IMPORTANCE, its just going to check that the global limit is exactly
> OVERLOAD_LIMIT_BASE/2 again.

Yes, you are right. The intention is that only the first test, 
if (unlikely(recv_q_len >= (OVERLOAD_LIMIT_BASE / 2)){..}
will be run for the vast majority of messages, since we must assume
that there is no overload most of the time.
An inelegant optimization, perhaps, but not logically wrong.

///jon

> 
> Neil
> 
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

^ permalink raw reply

* Re: [RFC PATCH] af_packet: don't to defrag shared skb
From: Johannes Berg @ 2012-12-07 22:23 UTC (permalink / raw)
  To: David Miller; +Cc: eric, netdev, linux-wireless, linville, Eric Dumazet
In-Reply-To: <1354918363.9124.29.camel@jlt4.sipsolutions.net>

Oh I should say ... IP is like black magic to me ;-)

> -	if (!pskb_may_pull(skb, sizeof(struct iphdr)))
> +	if (!skb_copy_bits(skb, 0, &iph, sizeof(iph)))
>  		return skb;
>  
> -	iph = ip_hdr(skb);
> -	if (iph->ihl < 5 || iph->version != 4)
> +	if (iph.ihl < 5 || iph.version != 4)
>  		return skb;
> -	if (!pskb_may_pull(skb, iph->ihl*4))
> -		return skb;
> -	iph = ip_hdr(skb);
> -	len = ntohs(iph->tot_len);
> -	if (skb->len < len || len < (iph->ihl * 4))
> +
> +	len = ntohs(iph.tot_len);
> +	if (skb->len < len || len < (iph.ihl * 4))
>  		return skb;
>  
> -	if (ip_is_fragment(ip_hdr(skb))) {
> +	if (ip_is_fragment(&iph)) {
>  		skb = skb_share_check(skb, GFP_ATOMIC);
>  		if (skb) {
> +			if (!pskb_may_pull(skb, iph.ihl*4))
> +				return skb;

I moved this here but I have no idea what it does. Asking if we can pull
this here seems a bit pointless, and in the previous place it seemed
similarly pointless since we only use the static part of the IP header
and don't look at any options... So to me it seems this pskb_may_pull()
could just be removed, and that most likely means I'm messing with code
I don't understand :-)

johannes

^ permalink raw reply

* Re: [RFC PATCH] af_packet: don't to defrag shared skb
From: Johannes Berg @ 2012-12-07 22:12 UTC (permalink / raw)
  To: David Miller
  Cc: eric-EVVnsjFE0OfYtjvyW6yDsg, netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-wireless-u79uwXL29TY76Z2rM5mHXA,
	linville-2XuSBdqkA4R54TAoqtyWWQ, Eric Dumazet
In-Reply-To: <1354916502.9124.18.camel-8Nb76shvtaUJvtFkdXX2HixXY32XiHfO@public.gmane.org>

On Fri, 2012-12-07 at 22:41 +0100, Johannes Berg wrote:

> wpa_supplicant opens a packet socket for ETH_P_EAPOL, which indirectly
> eventually calls dev_add_pack(). But if you do the same for another
> socket, you'll get the same again, and then deliver_skb() will deliver
> only a refcounted packet to the prot_hook->func().
> 
> This seems like it could very well cause the problem?

Ok I couldn't reproduce it because suricata didn't work this way, but I
did try using tcpdump (without the fanout) and deliver_skb() *is* called
twice for each packet as I thought. It thus seems the problem is
entirely in af_packet itself. It was changed a bit by Eric Dumazet in
bc416d9768 but the original code goes back to the original defrag
support in 7736d33f4, as far as I can tell. aec27311c changed the code
to not do skb_clone(), but it seems the skb_share_check() should be
before the pskb_pull().

Well, it seems ip_check_defrag() should simply not modify the SKB before
it unshares it ... like this maybe:

diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index 448e685..8d5cc75 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -707,28 +707,27 @@ EXPORT_SYMBOL(ip_defrag);
 
 struct sk_buff *ip_check_defrag(struct sk_buff *skb, u32 user)
 {
-	const struct iphdr *iph;
+	struct iphdr iph;
 	u32 len;
 
 	if (skb->protocol != htons(ETH_P_IP))
 		return skb;
 
-	if (!pskb_may_pull(skb, sizeof(struct iphdr)))
+	if (!skb_copy_bits(skb, 0, &iph, sizeof(iph)))
 		return skb;
 
-	iph = ip_hdr(skb);
-	if (iph->ihl < 5 || iph->version != 4)
+	if (iph.ihl < 5 || iph.version != 4)
 		return skb;
-	if (!pskb_may_pull(skb, iph->ihl*4))
-		return skb;
-	iph = ip_hdr(skb);
-	len = ntohs(iph->tot_len);
-	if (skb->len < len || len < (iph->ihl * 4))
+
+	len = ntohs(iph.tot_len);
+	if (skb->len < len || len < (iph.ihl * 4))
 		return skb;
 
-	if (ip_is_fragment(ip_hdr(skb))) {
+	if (ip_is_fragment(&iph)) {
 		skb = skb_share_check(skb, GFP_ATOMIC);
 		if (skb) {
+			if (!pskb_may_pull(skb, iph.ihl*4))
+				return skb;
 			if (pskb_trim_rcsum(skb, len))
 				return skb;
 			memset(IPCB(skb), 0, sizeof(struct inet_skb_parm));

johannes

--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related

* Re: [RFC PATCH] af_packet: don't to defrag shared skb
From: Johannes Berg @ 2012-12-07 21:56 UTC (permalink / raw)
  To: Eric Leblond
  Cc: David Miller, netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-wireless-u79uwXL29TY76Z2rM5mHXA,
	linville-2XuSBdqkA4R54TAoqtyWWQ
In-Reply-To: <1354916768.4530.18.camel@tiger2>

HI,

> > That's interesting. A single one seems odd. I might have expected two,
> > but not one. Well, since you removed the crash ... I guess I'll have to
> > believe that there's just one and the second one doesn't show up because
> > we crashed before :-)
> 
> It was the case with initial code but I've suppressed the BUG() call and
> replaced it with a return ;)

Right. Well, does it actually still work then? I wonder if then you
don't see a second packet because the first one doesn't make it
through ... I'm thinking that this is just an internal af_packet problem
with multiple listeners.


> I've no simple code available to test it. I've add the problem when
> running suricata. Maybe you could use it. It is packaged in most
> distribution now.
> To enable packet fanout. Modify default /etc/suricata/suricata.yaml to
> have something like:
> af-packet:
>   - interface: wlan0
>     # Number of receive threads (>1 will enable experimental flow pinned
>     # runmode)
>     threads: 3

That'll should do, thanks.

johannes

--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [RFC PATCH] af_packet: don't to defrag shared skb
From: Eric Leblond @ 2012-12-07 21:46 UTC (permalink / raw)
  To: Johannes Berg; +Cc: David Miller, netdev, linux-wireless, linville
In-Reply-To: <1354915824.9124.11.camel@jlt4.sipsolutions.net>

Hi,

On Fri, 2012-12-07 at 22:30 +0100, Johannes Berg wrote:
> 
> > Wireless folks, please take a look.  The issue is that,
> > under the circumstances listed below, we get SKBs in
> > the AF_PACKET input path that are shared.
> 
> Ok so I took a look, but I can't see where the wireless stack is going
> wrong.
> 
> > Given the logic present in ieee80211_deliver_skb() I think
> > the mac80211 code doesn't expect this either.
> 
> This is correct, but the driver should never give us a shared skb. From
> the other mail it seems Eric is using iwlwifi, which is definitely not
> creating shared SKBs. Nothing in mac80211 creates them either.
> 
> > > I've observed this crash under the following condition:
> > >  1. a program is listening to an wifi interface (let say wlan0)
> > >  2. it is using fanout capture in flow load balancing mode
> > >  3. defrag option is on on the fanout socket
> 
> How do you set this up, and what does it do? I'd like to try to
> reproduce this.



> > >  4. the interface disconnect (radio down for example)
> > >  5. the interface reconnect (radio switched up)
> > >  6. once reconnected a single packet is seen with skb->users=2
> 
> That's interesting. A single one seems odd. I might have expected two,
> but not one. Well, since you removed the crash ... I guess I'll have to
> believe that there's just one and the second one doesn't show up because
> we crashed before :-)

It was the case with initial code but I've suppressed the BUG() call and
replaced it with a return ;)

> > So if we look at ieee80211_deliver_skb(), it has code to deal with unaligned
> > packet headers, wherein it memoves() the data into a better aligned location.
> > 
> > But if these SKBs really are skb_shared(), this packet data
> > modification is illegal.
> > 
> > I suspect that the assumptions built into this unaligned data handling
> > code, and AF_PACKET, are correct.  Meaning that we should never see
> > skb_shared() packets here.  We just have a missing skb_copy()
> > somewhere in mac80211, Johannes can you please take a look?
> 
> My first theory was related to multiple virtual interfaces, but Eric
> didn't say he was running that, but we use skb_copy() for that in
> ieee80211_prepare_and_rx_handle(). That's not necessarily the most
> efficient (another reason for drivers to use paged RX here) but clearly
> not causing the issue.
> 
> The only other theory I can come up with right now is that the skb_get()
> happens in deliver_skb via __netif_receive_skb. Keeping in mind that
> wpa_supplicant might have another packet socket open for authentication
> packets, that seems like a possibility. I'll test it once I figure out
> how to do this "defrag" option you speak of :)

I've no simple code available to test it. I've add the problem when
running suricata. Maybe you could use it. It is packaged in most
distribution now.
To enable packet fanout. Modify default /etc/suricata/suricata.yaml to
have something like:
af-packet:
  - interface: wlan0
    # Number of receive threads (>1 will enable experimental flow pinned
    # runmode)
    threads: 3

Start it with: suricata --af-packet=wlan0
Then get wlan0 interface down and up. After a few seconds, the crash
will occur.
It is a bit complicated for a simple test case. I can cook a little
example code if you want.

BR,
-- 
Eric Leblond <eric@regit.org>
Blog: https://home.regit.org/

^ permalink raw reply

* [PATCH net-next #2 1/1] r8169: workaround for missing extended GigaMAC registers
From: Francois Romieu @ 2012-12-07 21:20 UTC (permalink / raw)
  To: netdev; +Cc: David Miller, Lee Chun-Yi, Wang YanQing, Hayes Wang

GigaMAC registers have been reported left unitialized in several
situations:
- after cold boot from power-off state
- after S3 resume

Tweaking rtl_hw_phy_config takes care of both.

This patch removes an excess entry (",") at the end of the exgmac_reg
array as well.

Signed-off-by: Francois Romieu <romieu@fr.zoreil.com>
Signed-off-by: Wang YanQing <udknight@gmail.com>
Cc: Hayes Wang <hayeswang@realtek.com>
---
 drivers/net/ethernet/realtek/r8169.c | 42 ++++++++++++++++++++----------------
 1 file changed, 24 insertions(+), 18 deletions(-)

- Rebased against net-next with rtl8168e_2_hw_phy_config vs rtl8168f_hw_phy_config
  fixup as per davem's request.

- Removed a stealth "," after the last record of exgmac_reg. Not sure where it poked
  into...

diff --git a/drivers/net/ethernet/realtek/r8169.c b/drivers/net/ethernet/realtek/r8169.c
index 891feee..1747256 100644
--- a/drivers/net/ethernet/realtek/r8169.c
+++ b/drivers/net/ethernet/realtek/r8169.c
@@ -3120,6 +3120,23 @@ static void rtl8168e_1_hw_phy_config(struct rtl8169_private *tp)
 	rtl_writephy(tp, 0x0d, 0x0000);
 }
 
+static void rtl_rar_exgmac_set(struct rtl8169_private *tp, u8 *addr)
+{
+	const u16 w[] = {
+		addr[0] | (addr[1] << 8),
+		addr[2] | (addr[3] << 8),
+		addr[4] | (addr[5] << 8)
+	};
+	const struct exgmac_reg e[] = {
+		{ .addr = 0xe0, ERIAR_MASK_1111, .val = w[0] | (w[1] << 16) },
+		{ .addr = 0xe4, ERIAR_MASK_1111, .val = w[2] },
+		{ .addr = 0xf0, ERIAR_MASK_1111, .val = w[0] << 16 },
+		{ .addr = 0xf4, ERIAR_MASK_1111, .val = w[1] | (w[2] << 16) }
+	};
+
+	rtl_write_exgmac_batch(tp, e, ARRAY_SIZE(e));
+}
+
 static void rtl8168e_2_hw_phy_config(struct rtl8169_private *tp)
 {
 	static const struct phy_reg phy_reg_init[] = {
@@ -3204,6 +3221,9 @@ static void rtl8168e_2_hw_phy_config(struct rtl8169_private *tp)
 	rtl_writephy(tp, 0x1f, 0x0000);
 
 	r8168_aldps_enable_1(tp);
+
+	/* Broken BIOS workaround: feed GigaMAC registers with MAC address. */
+	rtl_rar_exgmac_set(tp, tp->dev->dev_addr);
 }
 
 static void rtl8168f_hw_phy_config(struct rtl8169_private *tp)
@@ -3740,33 +3760,19 @@ static void rtl8169_init_phy(struct net_device *dev, struct rtl8169_private *tp)
 static void rtl_rar_set(struct rtl8169_private *tp, u8 *addr)
 {
 	void __iomem *ioaddr = tp->mmio_addr;
-	u32 high;
-	u32 low;
-
-	low  = addr[0] | (addr[1] << 8) | (addr[2] << 16) | (addr[3] << 24);
-	high = addr[4] | (addr[5] << 8);
 
 	rtl_lock_work(tp);
 
 	RTL_W8(Cfg9346, Cfg9346_Unlock);
 
-	RTL_W32(MAC4, high);
+	RTL_W32(MAC4, addr[4] | addr[5] << 8);
 	RTL_R32(MAC4);
 
-	RTL_W32(MAC0, low);
+	RTL_W32(MAC0, addr[0] | addr[1] << 8 | addr[2] << 16 | addr[3] << 24);
 	RTL_R32(MAC0);
 
-	if (tp->mac_version == RTL_GIGA_MAC_VER_34) {
-		const struct exgmac_reg e[] = {
-			{ .addr = 0xe0, ERIAR_MASK_1111, .val = low },
-			{ .addr = 0xe4, ERIAR_MASK_1111, .val = high },
-			{ .addr = 0xf0, ERIAR_MASK_1111, .val = low << 16 },
-			{ .addr = 0xf4, ERIAR_MASK_1111, .val = high << 16 |
-								low  >> 16 },
-		};
-
-		rtl_write_exgmac_batch(tp, e, ARRAY_SIZE(e));
-	}
+	if (tp->mac_version == RTL_GIGA_MAC_VER_34)
+		rtl_rar_exgmac_set(tp, addr);
 
 	RTL_W8(Cfg9346, Cfg9346_Lock);
 
-- 
1.7.11.7

^ permalink raw reply related

* Re: [RFC PATCH] af_packet: don't to defrag shared skb
From: Johannes Berg @ 2012-12-07 21:41 UTC (permalink / raw)
  To: David Miller; +Cc: eric, netdev, linux-wireless, linville
In-Reply-To: <1354915824.9124.11.camel@jlt4.sipsolutions.net>

On Fri, 2012-12-07 at 22:30 +0100, Johannes Berg wrote:

> The only other theory I can come up with right now is that the skb_get()
> happens in deliver_skb via __netif_receive_skb. Keeping in mind that
> wpa_supplicant might have another packet socket open for authentication
> packets, that seems like a possibility. I'll test it once I figure out
> how to do this "defrag" option you speak of :)

Hmm now I'm venturing into the unknown (for me) and realm of
speculation...

wpa_supplicant opens a packet socket for ETH_P_EAPOL, which indirectly
eventually calls dev_add_pack(). But if you do the same for another
socket, you'll get the same again, and then deliver_skb() will deliver
only a refcounted packet to the prot_hook->func().

This seems like it could very well cause the problem?

johannes

^ permalink raw reply

* Re: [RFC PATCH] af_packet: don't to defrag shared skb
From: Johannes Berg @ 2012-12-07 21:30 UTC (permalink / raw)
  To: David Miller; +Cc: eric, netdev, linux-wireless, linville
In-Reply-To: <20121207.153134.25835204617509469.davem@davemloft.net>

> Wireless folks, please take a look.  The issue is that,
> under the circumstances listed below, we get SKBs in
> the AF_PACKET input path that are shared.

Ok so I took a look, but I can't see where the wireless stack is going
wrong.

> Given the logic present in ieee80211_deliver_skb() I think
> the mac80211 code doesn't expect this either.

This is correct, but the driver should never give us a shared skb. From
the other mail it seems Eric is using iwlwifi, which is definitely not
creating shared SKBs. Nothing in mac80211 creates them either.

> > I've observed this crash under the following condition:
> >  1. a program is listening to an wifi interface (let say wlan0)
> >  2. it is using fanout capture in flow load balancing mode
> >  3. defrag option is on on the fanout socket

How do you set this up, and what does it do? I'd like to try to
reproduce this.

> >  4. the interface disconnect (radio down for example)
> >  5. the interface reconnect (radio switched up)
> >  6. once reconnected a single packet is seen with skb->users=2

That's interesting. A single one seems odd. I might have expected two,
but not one. Well, since you removed the crash ... I guess I'll have to
believe that there's just one and the second one doesn't show up because
we crashed before :-)

> So if we look at ieee80211_deliver_skb(), it has code to deal with unaligned
> packet headers, wherein it memoves() the data into a better aligned location.
> 
> But if these SKBs really are skb_shared(), this packet data
> modification is illegal.
> 
> I suspect that the assumptions built into this unaligned data handling
> code, and AF_PACKET, are correct.  Meaning that we should never see
> skb_shared() packets here.  We just have a missing skb_copy()
> somewhere in mac80211, Johannes can you please take a look?

My first theory was related to multiple virtual interfaces, but Eric
didn't say he was running that, but we use skb_copy() for that in
ieee80211_prepare_and_rx_handle(). That's not necessarily the most
efficient (another reason for drivers to use paged RX here) but clearly
not causing the issue.

The only other theory I can come up with right now is that the skb_get()
happens in deliver_skb via __netif_receive_skb. Keeping in mind that
wpa_supplicant might have another packet socket open for authentication
packets, that seems like a possibility. I'll test it once I figure out
how to do this "defrag" option you speak of :)

johannes

^ permalink raw reply

* Re: BUG: scheduling while atomic: ifup-bonding/3711/0x00000002 -- V3.6.7
From: Jay Vosburgh @ 2012-12-07 21:00 UTC (permalink / raw)
  To: Linda Walsh; +Cc: Cong Wang, LKML, Linux Kernel Network Developers
In-Reply-To: <50C24C44.8000809@tlinx.org>

Linda Walsh <lkml@tlinx.org> wrote:

>Sorry for the delay.... my distro (Suse) has made rebooting my system
>a chore (have to often boot from rescue to get it to come up because
>they put mount libs in /usr/lib expecting they will always boot
>from their ram disk -- preventing those of use who boot directly
>from disk from doing so easily...grrr.
>
>Jay Vosburgh wrote:
>> 	The miimon functionality is used to check link state and notice
>> when slaves lose carrier.
>---
>   If I am running 'rr' on 2 channels -- specifically for the purpose
>of link speed aggregation (getting 1 20Gb channel out of 2 10Gb channels)
>I'm not sure I see how miimon would provide benefit.  -- if 1 link dies,
>the other, being on the same card is likely to be dead too, so would
>it really serve a purpose?

	Perhaps, but if the link partner experiences a failure, that may
be a different situation.  Not all failures will necessarily cause both
links to fail simultaneously.

>>   Running without it will not detect failure of
>> the bonding slaves, which is likely not what you want.  The mode,
>> balance-rr in your case, is what selects the load balance to use, and is
>> separate from the miimon.
>>   
>----
>   Wouldn't the entire link die if a slave dies -- like RAID0, 1 disk
>dies, the entire link goes down? 

	No; failure of a single slave does not cause the entire bond to
fail (unless that is the last available slave).  For round robin, a
failed slave is taken out of the set used to transmit traffic, and any
remaining slaves continue to round robin amongst themselves.

>   The other end (windows) doesn't dynamically config for a static-link
>aggregation, so I don't think it would provide benefit.

	So it (windows) has no means to disable (and discontinue use of)
one channel of the aggregation should it fail, even in a static link
aggregation?

>> 	That said, the problem you're seeing appears to be caused by two
>> things: bonding holds a lock (in addition to RTNL) when calling
>> __ethtool_get_settings, and an ixgbe function in the call path to
>> retrieve the settings, ixgbe_acquire_swfw_sync_X540, can sleep.
>>
>> 	The test patch above handles one case in bond_enslave, but there
>> is another case in bond_miimon_commit when a slave changes link state
>> from down to up, which will occur shortly after the slave is added.
>>   
>----
>   Added your 2nd patch -- no more error messages...
>
>   however -- likely unrelated, the max speed read or write I am seeing
>is about 500MB/s, and that is rare -- usually it's barely <3x a 1Gb
>network speed. (119/125 MB R/W).  I'm not at all sure it's really
>combining the links
>properly.  Anyway to verify that?

	How are you testing the throughput?  If you configure the
aggregation with just one link, how does the throughput compare to the
aggregation with both links?

	It most likely is combining links properly, but any link
aggregation scheme has tradeoffs, and the best load balance algorithm to
use depends upon the work load.  Two aggregated 10G links are not
interchangable with a single 20G link.

	For a round robin transmission scheme, issues arise because
packets are delivered at the other end out of order.  This in turn
triggers various TCP behaviors to deal with what is perceived to be
transmission errors or lost packets (TCP fast retransmit being the most
notable).  This usually results in a single TCP connection being unable
to completely saturate a round-robin aggregated set of links.

	There are a few parameters on linux that can be adjusted.  I
don't know what the windows equivalents might be.

	On linux, adjusting the net.ipv4.tcp_reordering sysctl value
will increase the tolerance for out of order delivery.  

	The sysctl is adjusted via something like

sysctl -w net.ipv4.tcp_reordering=10

	the default value is 3, and higher values increase the tolerance
for out of order delivery.  If memory serves, the setting is applied to
connections as they are created, so existing connections will not see
changes.

	Also, adjusting the packet coalescing setting for the receiving
devices may also permit higher throughput. The packet coalescing setting
is adjusted via ethtool; the current settings can be viewed via

ethtool -c eth0

	and then adjusted via something like

ethtool -C eth0 rx-usecs 30

	I've seen reports that raising the "rx-usecs" parameter at the
receiver can increase the round-robin throughput.  My recollection is
that the value used was 30, but the best settings will likely be
dependent upon your particular hardware and configuration.

>   On the windows side it shows the bond-link as a 20Gb connection, but
>I don't see anyplace for something similar on linux. 

	There isn't any such indicator; bonding does not advertise its
link speed as the sum of its slaves link speeds.

	-J

---
	-Jay Vosburgh, IBM Linux Technology Center, fubar@us.ibm.com

^ permalink raw reply

* Re: [RFC PATCH] af_packet: don't to defrag shared skb
From: Eric Leblond @ 2012-12-07 20:54 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, linux-wireless, johannes, linville
In-Reply-To: <20121207.153134.25835204617509469.davem@davemloft.net>

Hi,

On Fri, 2012-12-07 at 15:31 -0500, David Miller wrote:
> From: Eric Leblond <eric@regit.org>
> Date: Fri,  7 Dec 2012 19:56:01 +0100
> 
> Wireless folks, please take a look.  The issue is that,
> under the circumstances listed below, we get SKBs in
> the AF_PACKET input path that are shared.
> 
> Given the logic present in ieee80211_deliver_skb() I think
> the mac80211 code doesn't expect this either.
> 
> More commentary from me below:
> 
> > This patch is adding a check on skb before trying to defrag the
> > packet for the hash computation in fanout mode. The goal of this
> > patch is to avoid an kernel crash in pskb_expand_head.
> > It appears that under some specific condition there is a shared
> > skb reaching the defrag code and this lead to a crash due to the
> > following code:
> > 
> > 	if (skb_shared(skb))
> > 		BUG();
> > 
> > I've observed this crash under the following condition:
> >  1. a program is listening to an wifi interface (let say wlan0)
> >  2. it is using fanout capture in flow load balancing mode
> >  3. defrag option is on on the fanout socket
> >  4. the interface disconnect (radio down for example)
> >  5. the interface reconnect (radio switched up)
> >  6. once reconnected a single packet is seen with skb->users=2
> >  7. the kernel crash in pskb_expand_head at skbuff.c:1035
> > 
> > [BBB55:744364] [<ffffffff812a2761>] ? __pskb_pull_tail+0x43x0x26f
> > [BB8S5.744395] [<ffffffff812d29Tb>] ? ip_check_defrag+ox3a/0x14a
> > [BBB55.744422] [<ffffffffB1344459>] ? packet_rcv_fanout+ox5e/oxf9
> > [BBBS5.7444S0] [<ffffffffB12aaS9b>] ? __netif_receive_skb+ox444/ox4f9
> > [BBB55.T4447B] [<ffffffffB12aa?e1>] ? netif_receive_skb+ox6d/0x?3
> > [BBB55.T4447B] [<ffffffffB12aa?e1>] ? ieee80211_deliver_skb+0xbd/0xfa [mac80211]
> > [BBB55.T4447B] [<ffffffffB12aa?e1>] ? ieee80211_rx_h_data+0x1e0/0x21a [mac80211]
> > [BBB55.T4447B] [<ffffffffB12aa?e1>] ? ieee80211_rx_handlers+0x3d5/0x480 [mac80211]
> > [BBB55.T4447B] [<ffffffffB12aa?e1>] ? __wake_up
> > [BBB55.T4447B] [<ffffffffB12aa?e1>] ? evdev_eventr+0xc0/0xcf [evdev]
> > 
> > Signed-off-by: Eric Leblond <eric@regit.org>
> 
> So if we look at ieee80211_deliver_skb(), it has code to deal with unaligned
> packet headers, wherein it memoves() the data into a better aligned location.
> 
> But if these SKBs really are skb_shared(), this packet data
> modification is illegal.
> 
> I suspect that the assumptions built into this unaligned data handling
> code, and AF_PACKET, are correct.  Meaning that we should never see
> skb_shared() packets here.  We just have a missing skb_copy()
> somewhere in mac80211, Johannes can you please take a look?

Here's some more info that may help people knowing the code. During my
test, I've removed the BUG() and replaced with a printk to have a living
kernel. Only one single shared skb was seen for each up event.

I've also add another oops in the same code:
[BBB55:744364] [<ffffffff812a2761>] ? __pskb_pull_tail+0x43x0x26f
[BB8S5.744395] [<ffffffff812d29Tb>] ? ip_check_defrag+ox3a/0x14a
[BBB55.744422] [<ffffffffB1344459>] ? packet_rcv_fanout+ox5e/oxf9
[BBBS5.7444S0] [<ffffffffB12aaS9b>] ? __netif_receive_skb+ox444/ox4f9
[BBB55.T4447B] [<ffffffffB12aa?e1>] ? netif_receive_skb+ox6d/0x?3
[BBB55.T4447B] [<ffffffffB12aa?e1>] ? ieee80211_deliver_skb+0xbd/0xfa [mac80211]
[BBB55.T4447B] [<ffffffffB12aa?e1>] ? ieee80211_rx_h_data+0x1e0/0x21a [mac80211]
[BBB55.T4447B] [<ffffffffB12aa?e1>] ? ieee80211_rx_handlers+0x3d5/0x480 [mac80211]
[BBB55.T4447B] [<ffffffffB12aa?e1>] ? _raw_spin_lock_irqsave+0x14/0x35
[BBB55.T4447B] [<ffffffffB12aa?e1>] ? ieee80211_prepare_and_rx_handle+0x5a3/0x5db [mac80211]
...
[BBB55.T4447B] [<ffffffffB12aa?e1>] ? ttwu_dowakeup+0x2d

Picture of the oops available here:
	http://home.regit.org/~regit/wireless-oops.jpg

BR,
-- 
Eric Leblond <eric@regit.org>
Blog: https://home.regit.org/

^ permalink raw reply

* Re: [RFC PATCH] af_packet: don't to defrag shared skb
From: Johannes Berg @ 2012-12-07 20:42 UTC (permalink / raw)
  To: David Miller; +Cc: eric, netdev, linux-wireless, linville
In-Reply-To: <20121207.153134.25835204617509469.davem@davemloft.net>

On Fri, 2012-12-07 at 15:31 -0500, David Miller wrote:
> From: Eric Leblond <eric@regit.org>
> Date: Fri,  7 Dec 2012 19:56:01 +0100
> 
> Wireless folks, please take a look.  The issue is that,
> under the circumstances listed below, we get SKBs in
> the AF_PACKET input path that are shared.
> 
> Given the logic present in ieee80211_deliver_skb() I think
> the mac80211 code doesn't expect this either.

Indeed, it would certainly not like this, I'll take a look.

Eric, what's the driver you're using? I'm wondering whether paged skbs
vs. all data in the header would make a difference, hence the question.

johannes

^ permalink raw reply

* Re: [PATCH net-next v3 0/3] Multiqueue support in virtio-net
From: David Miller @ 2012-12-07 20:35 UTC (permalink / raw)
  To: jasowang
  Cc: krkumar2, kvm, mst, netdev, linux-kernel, virtualization,
	bhutchings, jwhan, shiyer
In-Reply-To: <1354899897-10423-1-git-send-email-jasowang@redhat.com>

From: Jason Wang <jasowang@redhat.com>
Date: Sat,  8 Dec 2012 01:04:54 +0800

> This series is an update version (hope the final version) of multiqueue
> (VIRTIO_NET_F_MQ) support in virtio-net driver. All previous comments were
> addressed, the work were based on Krishna Kumar's work to let virtio-net use
> multiple rx/tx queues to do the packets reception and transmission. Performance
> test show the aggregate latency were increased greately but may get some
> regression in small packet transmission. Due to this, multiqueue were disabled
> by default. If user want to benefit form the multiqueue, ethtool -L could be
> used to enable the feature.
> 
> Please review and comments.
> 
> A protype implementation of qemu-kvm support could by found in
> git://github.com/jasowang/qemu-kvm-mq.git. To start a guest with two queues, you
> could specify the queues parameters to both tap and virtio-net like:
> 
> ./qemu-kvm -netdev tap,queues=2,... -device virtio-net-pci,queues=2,...
> 
> then enable the multiqueue through ethtool by:
> 
> ethtool -L eth0 combined 2

It seems like most, if not all, of the feedback given for this series
has been addressed by Jason.

Can I get some ACKs?

Thanks.

^ permalink raw reply

* [GIT] Networking
From: David Miller @ 2012-12-07 20:35 UTC (permalink / raw)
  To: torvalds; +Cc: akpm, netdev, linux-kernel

Two stragglers:

1) The new code that adds new flushing semantics to GRO can cause
   SKB pointer list corruption, manage the lists differently to	
   avoid the OOPS.  Fix from Eric Dumazet.

2) When TCP fast open does a retransmit of data in a SYN-ACK or
   similar, we update retransmit state that we shouldn't triggering
   a WARN_ON later.  Fix from Yuchung Cheng.

Please pull, thanks a lot!

The following changes since commit 1afa471706963643ceeda7cbbe9c605a1e883d53:

  Merge tag 'mmc-fixes-for-3.7' of git://git.kernel.org/pub/scm/linux/kernel/git/cjb/mmc (2012-12-07 09:15:20 -0800)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git master

for you to fetch changes up to c3c7c254b2e8cd99b0adf288c2a1bddacd7ba255:

  net: gro: fix possible panic in skb_gro_receive() (2012-12-07 14:39:29 -0500)

----------------------------------------------------------------
Eric Dumazet (1):
      net: gro: fix possible panic in skb_gro_receive()

Yuchung Cheng (1):
      tcp: bug fix Fast Open client retransmission

 include/linux/netdevice.h |  3 +++
 include/net/tcp.h         |  1 +
 net/core/dev.c            |  2 ++
 net/core/skbuff.c         |  6 +++---
 net/ipv4/tcp_input.c      |  6 +++++-
 net/ipv4/tcp_output.c     | 15 ++++++++++-----
 6 files changed, 24 insertions(+), 9 deletions(-)

^ permalink raw reply

* Re: [RFC PATCH] af_packet: don't to defrag shared skb
From: David Miller @ 2012-12-07 20:31 UTC (permalink / raw)
  To: eric; +Cc: netdev, linux-wireless, johannes, linville
In-Reply-To: <1354906561-4695-1-git-send-email-eric@regit.org>

From: Eric Leblond <eric@regit.org>
Date: Fri,  7 Dec 2012 19:56:01 +0100

Wireless folks, please take a look.  The issue is that,
under the circumstances listed below, we get SKBs in
the AF_PACKET input path that are shared.

Given the logic present in ieee80211_deliver_skb() I think
the mac80211 code doesn't expect this either.

More commentary from me below:

> This patch is adding a check on skb before trying to defrag the
> packet for the hash computation in fanout mode. The goal of this
> patch is to avoid an kernel crash in pskb_expand_head.
> It appears that under some specific condition there is a shared
> skb reaching the defrag code and this lead to a crash due to the
> following code:
> 
> 	if (skb_shared(skb))
> 		BUG();
> 
> I've observed this crash under the following condition:
>  1. a program is listening to an wifi interface (let say wlan0)
>  2. it is using fanout capture in flow load balancing mode
>  3. defrag option is on on the fanout socket
>  4. the interface disconnect (radio down for example)
>  5. the interface reconnect (radio switched up)
>  6. once reconnected a single packet is seen with skb->users=2
>  7. the kernel crash in pskb_expand_head at skbuff.c:1035
> 
> [BBB55:744364] [<ffffffff812a2761>] ? __pskb_pull_tail+0x43x0x26f
> [BB8S5.744395] [<ffffffff812d29Tb>] ? ip_check_defrag+ox3a/0x14a
> [BBB55.744422] [<ffffffffB1344459>] ? packet_rcv_fanout+ox5e/oxf9
> [BBBS5.7444S0] [<ffffffffB12aaS9b>] ? __netif_receive_skb+ox444/ox4f9
> [BBB55.T4447B] [<ffffffffB12aa?e1>] ? netif_receive_skb+ox6d/0x?3
> [BBB55.T4447B] [<ffffffffB12aa?e1>] ? ieee80211_deliver_skb+0xbd/0xfa [mac80211]
> [BBB55.T4447B] [<ffffffffB12aa?e1>] ? ieee80211_rx_h_data+0x1e0/0x21a [mac80211]
> [BBB55.T4447B] [<ffffffffB12aa?e1>] ? ieee80211_rx_handlers+0x3d5/0x480 [mac80211]
> [BBB55.T4447B] [<ffffffffB12aa?e1>] ? __wake_up
> [BBB55.T4447B] [<ffffffffB12aa?e1>] ? evdev_eventr+0xc0/0xcf [evdev]
> 
> Signed-off-by: Eric Leblond <eric@regit.org>

So if we look at ieee80211_deliver_skb(), it has code to deal with unaligned
packet headers, wherein it memoves() the data into a better aligned location.

But if these SKBs really are skb_shared(), this packet data
modification is illegal.

I suspect that the assumptions built into this unaligned data handling
code, and AF_PACKET, are correct.  Meaning that we should never see
skb_shared() packets here.  We just have a missing skb_copy()
somewhere in mac80211, Johannes can you please take a look?

Thanks.

^ permalink raw reply

* Re: [PATCH v3 1/4] net: Add support for hardware-offloaded encapsulation
From: Joseph Gasparakis @ 2012-12-07 20:18 UTC (permalink / raw)
  To: David Miller
  Cc: joseph.gasparakis, bhutchings, alexander.h.duyck, shemminger,
	chrisw, gospo, netdev, linux-kernel, dmitry, saeed.bishara,
	peter.p.waskiewicz.jr
In-Reply-To: <20121207.145239.1408511750725975741.davem@davemloft.net>



On Fri, 7 Dec 2012, David Miller wrote:

> From: Joseph Gasparakis <joseph.gasparakis@intel.com>
> Date: Fri, 7 Dec 2012 11:52:43 -0800 (PST)
> 
> > 
> > 
> > On Fri, 7 Dec 2012, David Miller wrote:
> > 
> >> From: Joseph Gasparakis <joseph.gasparakis@intel.com>
> >> Date: Fri, 7 Dec 2012 11:41:46 -0800 (PST)
> >> 
> >> > 
> >> > 
> >> > On Fri, 7 Dec 2012, David Miller wrote:
> >> > 
> >> >> From: Joseph Gasparakis <joseph.gasparakis@intel.com>
> >> >> Date: Fri, 7 Dec 2012 10:24:17 -0800 (PST)
> >> >> 
> >> >> > So the idea here is that the driver will use the headers for checksumming 
> >> >> > if the skb->encapsulation bit is on. The bit should be set in the protocol 
> >> >> > driver.
> >> >> > 
> >> >> > To answer the second comment, the flags that we use in this series of 
> >> >> > patches is NETIF_F_IP_CSUM, NETIF_F_IPV6_CSUM and NETIF_F_SG. These are 
> >> >> > the bits that we propose will be used for checksumming of encapsulation. 
> >> >> > As per a previous comment in v2, the hw_enc_features field should be used 
> >> >> > also in the future when NICs have more encap offloads, so one could
> >> >> > indicate these features there from the driver.
> >> >> > 
> >> >> > Furthermore, I submitted a patch for Rx checksumming, where NETIF_F_RXCSUM 
> >> >> > is used, again in conjunction with skb->encapsulation flag. As I mention 
> >> >> > in my logs, the driver is expected to set the ip_summed to UNNECESSARY and 
> >> >> > turn the skb->encapsulation on, to indicate that the inner headers are 
> >> >> > already HW checksummed.
> >> >> > 
> >> >> 
> >> >> This is the kind of language that belongs in the commit message and
> >> >> code comments.
> >> >> 
> >> > Sure. I'll wait to gather some more feedback if there is any and I will 
> >> > re-spin this off adding more code comments and clarify this in the logs.
> >> 
> >> Great.
> >> 
> >> Please note that this request applies to your receive side change too.
> >> 
> > I believe the logs are sufficient in the rx patch:
> > 
> > This patch adds capability in vxlan to identify received
> > checksummed inner packets and signal them to the upper layers of
> > the stack. The driver needs to set the skb->encapsulation bit
> > and also set the skb->ip_summed to CHECKSUM_UNNECESSARY.
> > 
> > ...but I can add more comments to the code if this is what you are 
> > referring to.
> 
> Yes, please do.
> 
I will! Thanks.

^ permalink raw reply

* Re: BUG: scheduling while atomic: ifup-bonding/3711/0x00000002 -- V3.6.7
From: Linda Walsh @ 2012-12-07 20:06 UTC (permalink / raw)
  To: Jay Vosburgh; +Cc: Cong Wang, LKML, Linux Kernel Network Developers
In-Reply-To: <4913.1354154236@death.nxdomain>

Sorry for the delay.... my distro (Suse) has made rebooting my system
a chore (have to often boot from rescue to get it to come up because
they put mount libs in /usr/lib expecting they will always boot
from their ram disk -- preventing those of use who boot directly
from disk from doing so easily...grrr.

Jay Vosburgh wrote:
> 	The miimon functionality is used to check link state and notice
> when slaves lose carrier.
---
    If I am running 'rr' on 2 channels -- specifically for the purpose
of link speed aggregation (getting 1 20Gb channel out of 2 10Gb channels)
I'm not sure I see how miimon would provide benefit.  -- if 1 link dies,
the other, being on the same card is likely to be dead too, so would
it really serve a purpose?


>   Running without it will not detect failure of
> the bonding slaves, which is likely not what you want.  The mode,
> balance-rr in your case, is what selects the load balance to use, and is
> separate from the miimon.
>   
----
    Wouldn't the entire link die if a slave dies -- like RAID0, 1 disk
dies, the entire link goes down? 

    The other end (windows) doesn't dynamically config for a static-link
aggregation, so I don't think it would provide benefit.

> 	That said, the problem you're seeing appears to be caused by two
> things: bonding holds a lock (in addition to RTNL) when calling
> __ethtool_get_settings, and an ixgbe function in the call path to
> retrieve the settings, ixgbe_acquire_swfw_sync_X540, can sleep.
>
> 	The test patch above handles one case in bond_enslave, but there
> is another case in bond_miimon_commit when a slave changes link state
> from down to up, which will occur shortly after the slave is added.
>   
----
    Added your 2nd patch -- no more error messages...

    however -- likely unrelated, the max speed read or write I am seeing
is about 500MB/s, and that is rare -- usually it's barely <3x a 1Gb 
network speed. (119/125 MB R/W).  I'm not at all sure it's really 
combining the links
properly.  Anyway to verify that?


    On the windows side it shows the bond-link as a 20Gb connection, but
I don't see anyplace for something similar on linux. 

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox