Netdev List

* [net] igbvf: fix divide by zero
From: Jeff Kirsher @ 2012-06-30 10:23 UTC (permalink / raw)
  To: davem
  Cc: Mitch A Williams, netdev, gospo, sassmann, stable, daahern,
	Jeff Kirsher

From: Mitch A Williams <mitch.a.williams@intel.com>

Using ethtool -C ethX rx-usecs 0 crashes with a divide by zero.
Refactor this function to fix this issue and make it more clear
what the intent of each conditional is. Add comment regarding
using a setting of zero.

CC: stable <stable@vger.kernel.org> [3.3+]
CC: David Ahern <daahern@cisco.com>
Signed-off-by: Mitch Williams <mitch.a.williams@intel.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---
 drivers/net/ethernet/intel/igbvf/ethtool.c |   29 +++++++++++++++++-----------
 1 file changed, 18 insertions(+), 11 deletions(-)

diff --git a/drivers/net/ethernet/intel/igbvf/ethtool.c b/drivers/net/ethernet/intel/igbvf/ethtool.c
index 8ce6706..90eef07 100644
--- a/drivers/net/ethernet/intel/igbvf/ethtool.c
+++ b/drivers/net/ethernet/intel/igbvf/ethtool.c
@@ -357,21 +357,28 @@ static int igbvf_set_coalesce(struct net_device *netdev,
 	struct igbvf_adapter *adapter = netdev_priv(netdev);
 	struct e1000_hw *hw = &adapter->hw;
 
-	if ((ec->rx_coalesce_usecs > IGBVF_MAX_ITR_USECS) ||
-	    ((ec->rx_coalesce_usecs > 3) &&
-	     (ec->rx_coalesce_usecs < IGBVF_MIN_ITR_USECS)) ||
-	    (ec->rx_coalesce_usecs == 2))
-		return -EINVAL;
-
-	/* convert to rate of irq's per second */
-	if (ec->rx_coalesce_usecs && ec->rx_coalesce_usecs <= 3) {
+	if ((ec->rx_coalesce_usecs >= IGBVF_MIN_ITR_USECS) &&
+	     (ec->rx_coalesce_usecs <= IGBVF_MAX_ITR_USECS)) {
+		adapter->current_itr = ec->rx_coalesce_usecs << 2;
+		adapter->requested_itr = 1000000000 /
+					(adapter->current_itr * 256);
+	} else if ((ec->rx_coalesce_usecs == 3) ||
+		   (ec->rx_coalesce_usecs == 2)) {
 		adapter->current_itr = IGBVF_START_ITR;
 		adapter->requested_itr = ec->rx_coalesce_usecs;
-	} else {
-		adapter->current_itr = ec->rx_coalesce_usecs << 2;
+	} else if (ec->rx_coalesce_usecs == 0) {
+		/*
+		 * The user's desire is to turn off interrupt throttling
+		 * altogether, but due to HW limitations, we can't do that.
+		 * Instead we set a very small value in EITR, which would
+		 * allow ~967k interrupts per second, but allow the adapter's
+		 * internal clocking to still function properly.
+		 */
+		adapter->current_itr = 4;
 		adapter->requested_itr = 1000000000 /
 					(adapter->current_itr * 256);
-	}
+	} else
+		return -EINVAL;
 
 	writel(adapter->current_itr,
 	       hw->hw_addr + adapter->rx_ring->itr_register);
-- 
1.7.10.4

^ permalink raw reply related
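
The mapping this patch sets up between an rx-usecs request and the two ITR fields can be tried outside the driver. The following is only a sketch under stated assumptions: the three constants are assumed values (check igbvf.h for the real ones), and `struct itr_state` and `itr_from_usecs()` are hypothetical names that do not exist in the driver.

```c
#include <errno.h>
#include <stdint.h>

/* Assumed values for illustration only; the driver header is authoritative. */
#define MIN_ITR_USECS 10
#define MAX_ITR_USECS 10000
#define START_ITR     488

/* Hypothetical stand-in for the two adapter fields the patch writes. */
struct itr_state {
	uint32_t current_itr;
	uint32_t requested_itr;
};

/* Mirrors the branch order of the refactored igbvf_set_coalesce(). */
static int itr_from_usecs(uint32_t usecs, struct itr_state *st)
{
	if (usecs >= MIN_ITR_USECS && usecs <= MAX_ITR_USECS) {
		st->current_itr = usecs << 2;
	} else if (usecs == 2 || usecs == 3) {
		/* 2 and 3 are stored as-is in requested_itr, as in the patch */
		st->current_itr = START_ITR;
		st->requested_itr = usecs;
		return 0;
	} else if (usecs == 0) {
		/* smallest value used instead of 0, so the divisor is never 0 */
		st->current_itr = 4;
	} else {
		return -EINVAL;
	}

	st->requested_itr = 1000000000u / (st->current_itr * 256);
	return 0;
}
```

With usecs = 0 the sketch computes 1000000000 / (4 * 256) = 976562; a value of 1, or anything above the assumed maximum, is rejected with -EINVAL.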

* [PATCH] NFC: Prevent NULL deref when getting socket name
From: Sasha Levin @ 2012-06-30  9:56 UTC (permalink / raw)
  To: lauro.venancio, aloisio.almeida, sameo, linville
  Cc: linux-wireless, netdev, linux-kernel, Sasha Levin

llcp_sock_getname can be called without a device attached to the nfc_llcp_sock.

This would lead to the following BUG:

[  362.341807] BUG: unable to handle kernel NULL pointer dereference at           (null)
[  362.341815] IP: [<ffffffff836258e5>] llcp_sock_getname+0x75/0xc0
[  362.341818] PGD 31b35067 PUD 30631067 PMD 0
[  362.341821] Oops: 0000 [#627] PREEMPT SMP DEBUG_PAGEALLOC
[  362.341826] CPU 3
[  362.341827] Pid: 7816, comm: trinity-child55 Tainted: G      D W    3.5.0-rc4-next-20120628-sasha-00005-g9f23eb7 #479
[  362.341831] RIP: 0010:[<ffffffff836258e5>]  [<ffffffff836258e5>] llcp_sock_getname+0x75/0xc0
[  362.341832] RSP: 0018:ffff8800304fde88  EFLAGS: 00010286
[  362.341834] RAX: 0000000000000000 RBX: ffff880033cb8000 RCX: 0000000000000001
[  362.341835] RDX: ffff8800304fdec4 RSI: ffff8800304fdec8 RDI: ffff8800304fdeda
[  362.341836] RBP: ffff8800304fdea8 R08: 7ebcebcb772b7ffb R09: 5fbfcb9c35bdfd53
[  362.341838] R10: 4220020c54326244 R11: 0000000000000246 R12: ffff8800304fdec8
[  362.341839] R13: ffff8800304fdec4 R14: ffff8800304fdec8 R15: 0000000000000044
[  362.341841] FS:  00007effa376e700(0000) GS:ffff880035a00000(0000) knlGS:0000000000000000
[  362.341843] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  362.341844] CR2: 0000000000000000 CR3: 0000000030438000 CR4: 00000000000406e0
[  362.341851] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  362.341856] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[  362.341858] Process trinity-child55 (pid: 7816, threadinfo ffff8800304fc000, task ffff880031270000)
[  362.341858] Stack:
[  362.341862]  ffff8800304fdea8 ffff880035156780 0000000000000000 0000000000001000
[  362.341865]  ffff8800304fdf78 ffffffff83183b40 00000000304fdec8 0000006000000000
[  362.341868]  ffff8800304f0027 ffffffff83729649 ffff8800304fdee8 ffff8800304fdf48
[  362.341869] Call Trace:
[  362.341874]  [<ffffffff83183b40>] sys_getpeername+0xa0/0x110
[  362.341877]  [<ffffffff83729649>] ? _raw_spin_unlock_irq+0x59/0x80
[  362.341882]  [<ffffffff810f342b>] ? do_setitimer+0x23b/0x290
[  362.341886]  [<ffffffff81985ede>] ? trace_hardirqs_on_thunk+0x3a/0x3f
[  362.341889]  [<ffffffff8372a539>] system_call_fastpath+0x16/0x1b
[  362.341921] Code: 84 00 00 00 00 00 b8 b3 ff ff ff 48 85 db 74 54 66 41 c7 04 24 27 00 49 8d 7c 24 12 41 c7 45 00 60 00 00 00 48 8b 83 28 05 00 00 <8b> 00 41 89 44 24 04 0f b6 83 41 05 00 00 41 88 44 24 10 0f b6
[  362.341924] RIP  [<ffffffff836258e5>] llcp_sock_getname+0x75/0xc0
[  362.341925]  RSP <ffff8800304fde88>
[  362.341926] CR2: 0000000000000000
[  362.341928] ---[ end trace 6d450e935ee18bf3 ]---

Signed-off-by: Sasha Levin <levinsasha928@gmail.com>
---
 net/nfc/llcp/sock.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/net/nfc/llcp/sock.c b/net/nfc/llcp/sock.c
index 2c0b317..05ca5a6 100644
--- a/net/nfc/llcp/sock.c
+++ b/net/nfc/llcp/sock.c
@@ -292,7 +292,7 @@ static int llcp_sock_getname(struct socket *sock, struct sockaddr *addr,
 
 	pr_debug("%p\n", sk);
 
-	if (llcp_sock == NULL)
+	if (llcp_sock == NULL || llcp_sock->dev == NULL)
 		return -EBADFD;
 
 	addr->sa_family = AF_NFC;
-- 
1.7.8.6

^ permalink raw reply related
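
The one-line fix above is the usual "check the outer pointer, then the inner one" guard. A minimal user-space sketch of the same pattern follows; `struct fake_dev`, `struct fake_llcp_sock` and `fake_getname()` are hypothetical stand-ins, not the kernel structures or the real llcp_sock_getname().

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#ifndef EBADFD
#define EBADFD 77	/* Linux value, defined here only for portability */
#endif

/* Hypothetical stand-ins for the kernel structures. */
struct fake_dev {
	uint32_t idx;
};

struct fake_llcp_sock {
	struct fake_dev *dev;	/* may be NULL when no device is attached */
};

/* Both pointers are checked before the device index is read. */
static int fake_getname(const struct fake_llcp_sock *s, uint32_t *dev_idx)
{
	if (s == NULL || s->dev == NULL)
		return -EBADFD;

	*dev_idx = s->dev->idx;
	return 0;
}
```

Because `||` short-circuits, `s->dev` is only evaluated once `s` is known to be non-NULL.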

* Re: [PATCH] ipv4: Create and use fib_compute_spec_dst() helper.
From: Julian Anastasov @ 2012-06-30  9:25 UTC (permalink / raw)
  To: David Miller; +Cc: netdev
In-Reply-To: <20120628.184500.114483408843364230.davem@davemloft.net>


	Hello,

On Thu, 28 Jun 2012, David Miller wrote:

> ipv4: Fix bugs in fib_compute_spec_dst().

	Some more thoughts on this topic...

	I'm wondering, maybe ip_options_echo wants to put the
local IP for srr. ip_options_echo is called by ip_send_unicast_reply.
ip_send_unicast_reply supports source address spoofing for
tproxy (arg.flags & IP_REPLY_ARG_NOSRCCHECK).

	Maybe the tproxy users add local routes to redirect
the traffic to the local stack but daddr is preserved (non-local).
So, rt_flags will have RTCF_LOCAL but for srr purposes we
need a local address, right?

	There can be an optimization in ip_options_echo to
avoid fib_compute_spec_dst if daddr is not needed. It seems
it is needed only in the sopt->srr case.

	It seems ip_options_compile can be called by
ip_rcv_options (ip_rcv_finish) just after ip_route_input_noref
but before dst_input. It means, it can happen for forwarding,
not just for local delivery.

	To summarize, we cannot rely on iph->daddr being a
local address if RTCF_LOCAL is set. There is always the risk of
working with redirected or forwarded traffic. Even for the PKTINFO
case we should make sure ipi_spec_dst is a local address (original
daddr goes to ipi_addr anyway), in case ipi_spec_dst is later
used again for sending in PKTINFO.

	For now, I see only one possible optimization.
When fib_lookup returns res.fi and res.type is RTN_LOCAL
we can check fib_protocol. If fib_protocol is not
RTPROT_KERNEL we will add RTCF_MAYBE_LOCAL (new flag) to rt_flags.
It will lead to slow lookups to validate the iph->daddr
if used later as source address, like in the spec_dst case.

	For the common case of local routes created by fib_magic()
we will use iph->daddr in fib_compute_spec_dst as follows:

	if ((rt->rt_flags & (RTCF_BROADCAST | RTCF_MULTICAST |
			     RTCF_LOCAL | RTCF_MAYBE_LOCAL)) == RTCF_LOCAL)
		return ip_hdr(skb)->daddr;
	/* For mcast, forwarding and spoofing we take the slow path */

	If users add local RTPROT_KERNEL routes, later
the outgoing traffic will anyways fail in some output route lookup
because FLOWI_FLAG_ANYSRC is set in rare cases. But also
users can break srr in this way, so there is some risk.

Regards

--
Julian Anastasov <ja@ssi.bg>

^ permalink raw reply
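
The fast-path test proposed in this message comes down to masking rt_flags and comparing the result with RTCF_LOCAL alone. A small sketch of that check, with the caveat that the bit values are placeholders rather than the kernel's definitions and that `daddr_usable_as_spec_dst()` is a hypothetical helper name. Note that the masked value needs its own parentheses, because `==` binds tighter than `&` in C.

```c
#include <stdbool.h>
#include <stdint.h>

/* Placeholder bit values, not the kernel's RTCF_* definitions. */
#define F_BROADCAST   0x1u
#define F_MULTICAST   0x2u
#define F_LOCAL       0x4u
#define F_MAYBE_LOCAL 0x8u	/* stands in for the proposed new flag */

/*
 * True only when, of the four flags of interest, LOCAL is set and the
 * other three are clear. Any other combination takes the slow path.
 */
static bool daddr_usable_as_spec_dst(uint32_t rt_flags)
{
	const uint32_t mask = F_BROADCAST | F_MULTICAST |
			      F_LOCAL | F_MAYBE_LOCAL;

	return (rt_flags & mask) == F_LOCAL;
}
```

Bits outside the mask do not affect the result, so unrelated route flags leave the fast path available.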

* Re: [BUG, regression, bisected] Marvell 88E8055 NIC (sky2) fails to detect link after resume from S3
From: Michal Zatloukal @ 2012-06-30  9:01 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: netdev
In-Reply-To: <20120629164610.1f343434@nehalam.linuxnetplumber.net>

On Sat, Jun 30, 2012 at 1:46 AM, Stephen Hemminger
<shemminger@vyatta.com> wrote:
<snip>
>
> Is ubuntu still doing the stupid rmmod on suspend?

The modprobes in the posted dmesg were done manually by me to see if
it would help. It didn't.
As for Ubuntu doing it, I don't know. Any way I could tell? I've
looked into /usr/lib/pm-utils/pm-functions and the $SUSPEND_MODULES
variable is set to empty, so it doesn't look like it's doing it for
any modules at this point.

> Looks like PCI power management has turned the chip off (that is why it
> keeps reading ff to all requests).

Is there something I can try?

MZ

^ permalink raw reply

* Re: [PATCH] net: Update netdev_alloc_frag to work more efficiently with TCP and GRO
From: Eric Dumazet @ 2012-06-30  8:39 UTC (permalink / raw)
  To: Alexander Duyck; +Cc: Alexander Duyck, netdev, davem, jeffrey.t.kirsher
In-Reply-To: <4FEE3487.9080408@intel.com>

On Fri, 2012-06-29 at 16:04 -0700, Alexander Duyck wrote:

> I was wondering if there were any plans to clean this patch up and
> submit it to net-next?  If not, I can probably work on that since this
> addressed the concerns I had in my original patch.
> 

I used this patch for a while on my machines, but I am working on
something allowing fallback to order-0 allocations if memory gets
fragmented. This fallback should almost never happen, but we should have
it just in case?

^ permalink raw reply
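
The fallback mentioned in this reply (try a larger allocation first, drop to order-0 when memory is fragmented) can be sketched in user space. This is not the kernel code under discussion: `page_alloc_fn`, `alloc_frag_pages()` and the `fragmented_alloc()` stub are hypothetical names introduced purely for illustration.

```c
#include <stddef.h>

/* Hypothetical allocator signature: returns NULL on failure. */
typedef void *(*page_alloc_fn)(unsigned int order);

/*
 * Try the preferred (high) order first; if that fails, retry once with
 * order 0. On success *got_order reports which order was obtained.
 */
static void *alloc_frag_pages(page_alloc_fn alloc, unsigned int preferred_order,
			      unsigned int *got_order)
{
	void *p = alloc(preferred_order);

	if (!p && preferred_order > 0) {
		/* High-order request failed: fall back to a single page. */
		preferred_order = 0;
		p = alloc(0);
	}
	if (p)
		*got_order = preferred_order;
	return p;
}

/* Stand-in allocator simulating fragmented memory: only order 0 succeeds. */
static char fake_page[4096];

static void *fragmented_alloc(unsigned int order)
{
	return order == 0 ? fake_page : NULL;
}
```

Callers have to size their fragment carving from `*got_order` rather than from the order they asked for, since the two can differ after a fallback.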

* Re: [patch net-next v2 0/4] net: introduce and use IFF_LIFE_ADDR_CHANGE
From: David Miller @ 2012-06-30  8:08 UTC (permalink / raw)
  To: jpirko
  Cc: mst, netdev, shimoda.hiroaki, virtualization, danny.kukawka,
	edumazet
In-Reply-To: <1340982608-897-1-git-send-email-jpirko@redhat.com>

From: Jiri Pirko <jpirko@redhat.com>
Date: Fri, 29 Jun 2012 17:10:04 +0200

> three drivers updated, but this can be used in many others.
> 
> v1->v2:
> %s/LIFE/LIVE
> 
> Jiri Pirko (4):
>   net: introduce new priv_flag indicating iface capable of change mac
>     when running
>   virtio_net: use IFF_LIVE_ADDR_CHANGE priv_flag
>   team: use IFF_LIVE_ADDR_CHANGE priv_flag
>   dummy: use IFF_LIVE_ADDR_CHANGE priv_flag

Applied, thanks Jiri.

^ permalink raw reply

* Re: [PATCH V3 1/2] bonding support for IPv6 transmit hashing
From: David Miller @ 2012-06-30  8:05 UTC (permalink / raw)
  To: linux; +Cc: netdev
In-Reply-To: <4FEE99E7.9010504@8192.net>

If you're going to post multiple patches, give them unique
subject line texts describing what each change does uniquely.
Do not use identical subject lines ever, that is very unhelpful
for the people reading your changes.

From: John <linux@8192.net>
Date: Fri, 29 Jun 2012 23:17:11 -0700

> + skb_network_header_len(skb) >= sizeof(struct ipv6hdr)) {
> +		ipv6h = ipv6_hdr(skb);
> +		v6hash =
> + (ipv6h->saddr.s6_addr32[1] ^ ipv6h->daddr.s6_addr32[1]) ^
> + (ipv6h->saddr.s6_addr32[2] ^ ipv6h->daddr.s6_addr32[2]) ^
> + (ipv6h->saddr.s6_addr32[3] ^ ipv6h->daddr.s6_addr32[3]);
> +		v6hash = (v6hash >> 16) ^ (v6hash >> 8) ^ v6hash;
> + return (v6hash ^ data->h_dest[5] ^ data->h_source[5]) % count;

Either you formatted this terribly, or your email client corrupted
your patches.

^ permalink raw reply

* [PATCH V3 0/2] bonding support for IPv6 transmit hashing
From: John @ 2012-06-30  6:17 UTC (permalink / raw)
  To: netdev

Currently the "bonding" driver does not support load balancing outgoing
traffic in LACP mode for IPv6 traffic. IPv4 (and TCP or UDP over IPv4)
are currently supported; this patch adds transmit hashing for IPv6
(and TCP or UDP over IPv6), bringing IPv6 up to par with IPv4 support
in the bonding driver.

The algorithm (XORing the bottom three quads and then XORing
the bottom three bytes of that) was chosen after testing almost 400,000
unique IPv6 addresses harvested from server logs. This algorithm
had the most even distribution for both big- and little-endian
architectures while still using few instructions.

Fragmented IPv6 packets are handled the same way as fragmented
IPv4 packets, i.e., they are not balanced based on layer 4
information. Additionally, IPv6 packets with intermediate headers
are not balanced based on layer 4 information. In practice these
intermediate headers are not common and this should not cause any
problems, and the alternative (a packet-parsing loop and look-up table)
seemed slow and complicated for little gain.

This is an update to a prior patch I submitted. This version clarifies
the description, adds more thorough bounds checking, updates
functions to call bond_xmit_hash_policy_l2 rather than re-implement
the same logic, incorporates Jay's style suggestions, and is based
on net-next. The patch has been tested and performs as expected.

John

^ permalink raw reply
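
The layer 2+3 hash described in this cover letter can be written as a standalone function. This is a sketch under assumptions: `ipv6_l23_hash()` is a hypothetical name, the address words are passed as plain 32-bit values in whatever byte order the caller holds them (the patch likewise XORs the words without conversion), and `slave_count` is assumed to be non-zero.

```c
#include <stdint.h>

/*
 * XOR the low three 32-bit words of source and destination address,
 * fold the result with two shifts, mix in the last byte of each MAC
 * address, then reduce modulo the number of slaves.
 */
static uint32_t ipv6_l23_hash(const uint32_t saddr[4], const uint32_t daddr[4],
			      uint8_t src_mac_last, uint8_t dst_mac_last,
			      uint32_t slave_count)
{
	uint32_t h = (saddr[1] ^ daddr[1]) ^
		     (saddr[2] ^ daddr[2]) ^
		     (saddr[3] ^ daddr[3]);

	h = (h >> 16) ^ (h >> 8) ^ h;

	return (h ^ dst_mac_last ^ src_mac_last) % slave_count;
}
```

Since every step is an XOR of source and destination fields, swapping the two endpoints gives the same slave index.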

* [PATCH V3 1/2] bonding support for IPv6 transmit hashing
From: John @ 2012-06-30  6:17 UTC (permalink / raw)
  To: netdev

diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index f5a40b9..b138d84 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -3345,56 +3345,93 @@ static struct notifier_block bond_netdev_notifier = {
  /*---------------------------- Hashing Policies -----------------------------*/

  /*
+ * Hash for the output device based upon layer 2 data
+ */
+static int bond_xmit_hash_policy_l2(struct sk_buff *skb, int count)
+{
+	struct ethhdr *data = (struct ethhdr *)skb->data;
+
+	if (skb_headlen(skb) >= offsetof(struct ethhdr, h_proto))
+		return (data->h_dest[5] ^ data->h_source[5]) % count;
+
+	return 0;
+}
+
+/*
   * Hash for the output device based upon layer 2 and layer 3 data. If
- * the packet is not IP mimic bond_xmit_hash_policy_l2()
+ * the packet is not IP, fall back on bond_xmit_hash_policy_l2()
   */
  static int bond_xmit_hash_policy_l23(struct sk_buff *skb, int count)
  {
  	struct ethhdr *data = (struct ethhdr *)skb->data;
-	struct iphdr *iph = ip_hdr(skb);
+	struct iphdr *iph;
+	struct ipv6hdr *ipv6h;
+	u32 v6hash;

-	if (skb->protocol == htons(ETH_P_IP)) {
+	if (skb->protocol == htons(ETH_P_IP) &&
+		skb_network_header_len(skb) >= sizeof(struct iphdr)) {
+		iph = ip_hdr(skb);
  		return ((ntohl(iph->saddr ^ iph->daddr) & 0xffff) ^
  			(data->h_dest[5] ^ data->h_source[5])) % count;
-	}
-
-	return (data->h_dest[5] ^ data->h_source[5]) % count;
+	} else if (skb->protocol == htons(ETH_P_IPV6) &&
+		skb_network_header_len(skb) >= sizeof(struct ipv6hdr)) {
+		ipv6h = ipv6_hdr(skb);
+		v6hash =
+			(ipv6h->saddr.s6_addr32[1] ^ ipv6h->daddr.s6_addr32[1]) ^
+			(ipv6h->saddr.s6_addr32[2] ^ ipv6h->daddr.s6_addr32[2]) ^
+			(ipv6h->saddr.s6_addr32[3] ^ ipv6h->daddr.s6_addr32[3]);
+		v6hash = (v6hash >> 16) ^ (v6hash >> 8) ^ v6hash;
+		return (v6hash ^ data->h_dest[5] ^ data->h_source[5]) % count;
+	}
+
+	return bond_xmit_hash_policy_l2(skb, count);
  }

  /*
   * Hash for the output device based upon layer 3 and layer 4 data. If
   * the packet is a frag or not TCP or UDP, just use layer 3 data.  If it is
- * altogether not IP, mimic bond_xmit_hash_policy_l2()
+ * altogether not IP, fall back on bond_xmit_hash_policy_l2()
   */
  static int bond_xmit_hash_policy_l34(struct sk_buff *skb, int count)
  {
-	struct ethhdr *data = (struct ethhdr *)skb->data;
-	struct iphdr *iph = ip_hdr(skb);
-	__be16 *layer4hdr = (__be16 *)((u32 *)iph + iph->ihl);
-	int layer4_xor = 0;
+	u32 layer4_xor = 0;
+	struct iphdr *iph;
+	struct ipv6hdr *ipv6h;

  	if (skb->protocol == htons(ETH_P_IP)) {
+		iph = ip_hdr(skb);
  		if (!ip_is_fragment(iph) &&
-		    (iph->protocol == IPPROTO_TCP ||
-		     iph->protocol == IPPROTO_UDP)) {
+			(iph->protocol == IPPROTO_TCP ||
+			iph->protocol == IPPROTO_UDP)) {
+			__be16 *layer4hdr = (__be16 *)((u32 *)iph + iph->ihl);
+			if (iph->ihl * sizeof(u32) + sizeof(__be16) * 2 >
+				skb_headlen(skb) - skb_network_offset(skb))
+				goto short_header;
  			layer4_xor = ntohs((*layer4hdr ^ *(layer4hdr + 1)));
+		} else if (skb_network_header_len(skb) < sizeof(struct iphdr)) {
+			goto short_header;
  		}
-		return (layer4_xor ^
-			((ntohl(iph->saddr ^ iph->daddr)) & 0xffff)) % count;
-
+		return (layer4_xor ^ ((ntohl(iph->saddr ^ iph->daddr)) & 0xffff)) % count;
+	} else if (skb->protocol == htons(ETH_P_IPV6)) {
+		ipv6h = ipv6_hdr(skb);
+		if (ipv6h->nexthdr == IPPROTO_TCP || ipv6h->nexthdr == IPPROTO_UDP) {
+			__be16 *layer4hdrv6 = (__be16 *)((u8 *)ipv6h + sizeof(struct ipv6hdr));
+			if (sizeof(struct ipv6hdr) + sizeof(__be16) * 2 >
+				skb_headlen(skb) - skb_network_offset(skb))
+				goto short_header;
+			layer4_xor = (*layer4hdrv6 ^ *(layer4hdrv6 + 1));
+		} else if (skb_network_header_len(skb) < sizeof(struct ipv6hdr)) {
+			goto short_header;
+		}
+		layer4_xor ^=
+			(ipv6h->saddr.s6_addr32[1] ^ ipv6h->daddr.s6_addr32[1]) ^
+			(ipv6h->saddr.s6_addr32[2] ^ ipv6h->daddr.s6_addr32[2]) ^
+			(ipv6h->saddr.s6_addr32[3] ^ ipv6h->daddr.s6_addr32[3]);
+		return ((layer4_xor >> 16) ^ (layer4_xor >> 8) ^ layer4_xor) % count;
  	}

-	return (data->h_dest[5] ^ data->h_source[5]) % count;
-}
-
-/*
- * Hash for the output device based upon layer 2 data
- */
-static int bond_xmit_hash_policy_l2(struct sk_buff *skb, int count)
-{
-	struct ethhdr *data = (struct ethhdr *)skb->data;
-
-	return (data->h_dest[5] ^ data->h_source[5]) % count;
+short_header:
+	return bond_xmit_hash_policy_l2(skb, count);
  }

  /*-------------------------- Device entry points ----------------------------*/

^ permalink raw reply related

* [PATCH V3 2/2] bonding support for IPv6 transmit hashing
From: John @ 2012-06-30  6:17 UTC (permalink / raw)
  To: netdev

diff --git a/Documentation/networking/bonding.txt b/Documentation/networking/bonding.txt
index bfea8a3..5db14fe 100644
--- a/Documentation/networking/bonding.txt
+++ b/Documentation/networking/bonding.txt
@@ -752,12 +752,22 @@ xmit_hash_policy
  		protocol information to generate the hash.

  		Uses XOR of hardware MAC addresses and IP addresses to
-		generate the hash.  The formula is
+		generate the hash.  The IPv4 formula is

  		(((source IP XOR dest IP) AND 0xffff) XOR
  			( source MAC XOR destination MAC ))
  				modulo slave count

+		The IPv6 formula is
+
+		iphash =
+			(source ip quad 2 XOR dest IP quad 2) XOR
+			(source ip quad 3 XOR dest IP quad 3) XOR
+			(source ip quad 4 XOR dest IP quad 4)
+
+		((iphash >> 16) XOR (iphash >> 8) XOR iphash)
+			modulo slave count
+
  		This algorithm will place all traffic to a particular
  		network peer on the same slave.  For non-IP traffic,
  		the formula is the same as for the layer2 transmit
@@ -778,19 +788,30 @@ xmit_hash_policy
  		slaves, although a single connection will not span
  		multiple slaves.

-		The formula for unfragmented TCP and UDP packets is
+		The formula for unfragmented IPv4 TCP and UDP packets is

  		((source port XOR dest port) XOR
  			 ((source IP XOR dest IP) AND 0xffff)
  				modulo slave count

-		For fragmented TCP or UDP packets and all other IP
-		protocol traffic, the source and destination port
+		The formula for unfragmented IPv6 TCP and UDP packets is
+
+		iphash =
+			(source ip quad 2 XOR dest IP quad 2) XOR
+			(source ip quad 3 XOR dest IP quad 3) XOR
+			(source ip quad 4 XOR dest IP quad 4)
+
+		((source port XOR dest port) XOR
+			(iphash >> 16) XOR (iphash >> 8) XOR iphash)
+				modulo slave count
+
+		For fragmented TCP or UDP packets and all other IPv4 and
+		IPv6 protocol traffic, the source and destination port
  		information is omitted.  For non-IP traffic, the
  		formula is the same as for the layer2 transmit hash
  		policy.

-		This policy is intended to mimic the behavior of
+		The IPv4 policy is intended to mimic the behavior of
  		certain switches, notably Cisco switches with PFC2 as
  		well as some Foundry and IBM products.

^ permalink raw reply related
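
The layer 3+4 IPv6 formula added to the documentation above can be sketched as follows. `ipv6_l34_hash_doc()` is a hypothetical name, ports and address words are taken as plain host integers, and `slave_count` is assumed to be non-zero. The sketch follows the documentation text as written; the code in patch 1/2 XORs the ports into the address hash before folding, so the two can give different results for the same packet.

```c
#include <stdint.h>

/*
 * Documentation formula: fold the address hash, XOR in the port pair,
 * then reduce modulo the number of slaves.
 */
static uint32_t ipv6_l34_hash_doc(const uint32_t saddr[4],
				  const uint32_t daddr[4],
				  uint16_t sport, uint16_t dport,
				  uint32_t slave_count)
{
	uint32_t iphash = (saddr[1] ^ daddr[1]) ^
			  (saddr[2] ^ daddr[2]) ^
			  (saddr[3] ^ daddr[3]);
	uint32_t ports = (uint32_t)(sport ^ dport);

	return (ports ^ (iphash >> 16) ^ (iphash >> 8) ^ iphash) % slave_count;
}
```

For example, ports 1000 and 80 with an address hash of 3 give (952 XOR 3) = 955, which maps to slave 3 out of 4.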

* Re: [BUG, regression, bisected] Marvell 88E8055 NIC (sky2) fails to detect link after resume from S3
From: Stephen Hemminger @ 2012-06-29 23:46 UTC (permalink / raw)
  To: Michal Zatloukal; +Cc: netdev
In-Reply-To: <op.wgon84tw16tawo@esprimo>

On Fri, 29 Jun 2012 23:20:54 +0200
"Michal Zatloukal" <myxal.mxl@gmail.com> wrote:

> Hello.
> 
> I'm the reporter of Ubuntu bug 1007841
> <https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1007841> and would
> like to bring attention to it here, since it's in upstream kernel as well.
> 
> The gist of the problem is, since around 3.2 (I haven't kept up-to-date
> and mostly used 2.6.35 on the machine), whenever I wake up the laptop from
> S3 by opening the lid, the NIC loses link detection and it's reported as
> always down. Relevant dmesg output (suspend-resume twice, then attempted
> modprobe -r and modprobe, also twice):
> 
> [    3.351407] sky2: driver version 1.30
> [    3.351460] sky2 0000:04:00.0: PCI INT A -> GSI 17 (level, low) -> IRQ 17
> [    3.351477] sky2 0000:04:00.0: setting latency timer to 64
> [    3.351510] sky2 0000:04:00.0: Yukon-2 EC Ultra chip revision 3
> [    3.351610] sky2 0000:04:00.0: irq 44 for MSI/MSI-X
> [    3.360722] sky2 0000:04:00.0: eth0: addr 00:a0:d1:cd:97:e5
> [   19.233940] sky2 0000:04:00.0: eth0: enabling interface
> [   21.595880] sky2 0000:04:00.0: eth0: Link is up at 1000 Mbps, full duplex, flow control both
> [ 2547.761596] sky2 0000:04:00.0: eth0: disabling interface
> [ 2551.220040] PM: late suspend of drv:sky2 dev:0000:04:00.0 complete after 155.989 msecs
> [ 2551.532056] sky2 0000:04:00.0: Refused to change power state, currently in D3
> [ 2551.532070] sky2 0000:04:00.0: restoring config space at offset 0xf (was 0xffffffff, writing 0x10a)
> [ 2551.532074] sky2 0000:04:00.0: restoring config space at offset 0xe (was 0xffffffff, writing 0x0)
> [ 2551.532078] sky2 0000:04:00.0: restoring config space at offset 0xd (was 0xffffffff, writing 0x48)
> [ 2551.532082] sky2 0000:04:00.0: restoring config space at offset 0xc (was 0xffffffff, writing 0x0)
> [ 2551.532086] sky2 0000:04:00.0: restoring config space at offset 0xb (was 0xffffffff, writing 0x110f1734)
> [ 2551.532090] sky2 0000:04:00.0: restoring config space at offset 0xa (was 0xffffffff, writing 0x0)
> [ 2551.532094] sky2 0000:04:00.0: restoring config space at offset 0x9 (was 0xffffffff, writing 0x0)
> [ 2551.532099] sky2 0000:04:00.0: restoring config space at offset 0x8 (was 0xffffffff, writing 0x0)
> [ 2551.532103] sky2 0000:04:00.0: restoring config space at offset 0x7 (was 0xffffffff, writing 0x0)
> [ 2551.532107] sky2 0000:04:00.0: restoring config space at offset 0x6 (was 0xffffffff, writing 0x3001)
> [ 2551.532111] sky2 0000:04:00.0: restoring config space at offset 0x5 (was 0xffffffff, writing 0x0)
> [ 2551.532115] sky2 0000:04:00.0: restoring config space at offset 0x4 (was 0xffffffff, writing 0xf8000004)
> [ 2551.532119] sky2 0000:04:00.0: restoring config space at offset 0x3 (was 0xffffffff, writing 0x10)
> [ 2551.532123] sky2 0000:04:00.0: restoring config space at offset 0x2 (was 0xffffffff, writing 0x2000014)
> [ 2551.532127] sky2 0000:04:00.0: restoring config space at offset 0x1 (was 0xffffffff, writing 0x100507)
> [ 2551.532132] sky2 0000:04:00.0: restoring config space at offset 0x0 (was 0xffffffff, writing 0x436311ab)
> [ 2551.537226] sky2 0000:04:00.0: ignoring stuck error report bit
> [ 2553.916819] sky2 0000:04:00.0: eth0: phy I/O error
> [ 2553.916826] sky2 0000:04:00.0: eth0: phy I/O error
> [ 2553.916830] sky2 0000:04:00.0: eth0: phy I/O error
> [ 2553.916833] sky2 0000:04:00.0: eth0: phy I/O error
> [ 2553.916836] sky2 0000:04:00.0: eth0: phy I/O error
> [ 2553.916839] sky2 0000:04:00.0: eth0: phy I/O error
> [ 2553.916843] sky2 0000:04:00.0: eth0: phy I/O error
> [ 2553.916846] sky2 0000:04:00.0: eth0: phy I/O error
> [ 2553.916849] sky2 0000:04:00.0: eth0: phy I/O error
> [ 2553.916852] sky2 0000:04:00.0: eth0: phy I/O error
> [ 2553.916855] sky2 0000:04:00.0: eth0: phy I/O error
> [ 2553.916859] sky2 0000:04:00.0: eth0: phy I/O error
> [ 2553.916862] sky2 0000:04:00.0: eth0: phy I/O error
> [ 2553.916865] sky2 0000:04:00.0: eth0: phy I/O error
> [ 2553.916868] sky2 0000:04:00.0: eth0: phy I/O error
> [ 2553.916871] sky2 0000:04:00.0: eth0: phy I/O error
> [ 2553.916875] sky2 0000:04:00.0: eth0: phy I/O error
> [ 2553.917001] sky2 0000:04:00.0: eth0: enabling interface
> [ 2601.941407] sky2 0000:04:00.0: eth0: disabling interface
> [ 2601.941443] sky2 0000:04:00.0: eth0: phy I/O error
> [ 2601.941452] sky2 0000:04:00.0: eth0: phy I/O error
> [ 2601.941459] sky2 0000:04:00.0: eth0: phy I/O error
> [ 2601.941466] sky2 0000:04:00.0: eth0: phy I/O error
> [ 2601.941473] sky2 0000:04:00.0: eth0: phy I/O error
> [ 2601.941480] sky2 0000:04:00.0: eth0: phy I/O error
> [ 2601.941487] sky2 0000:04:00.0: eth0: phy I/O error
> [ 2601.941494] sky2 0000:04:00.0: eth0: phy I/O error
> [ 2601.941501] sky2 0000:04:00.0: eth0: phy I/O error
> [ 2601.968125] sky2 0000:04:00.0: PCI INT A disabled
> [ 2608.679627] sky2: driver version 1.30
> [ 2608.679726] sky2 0000:04:00.0: enabling device (0000 -> 0003)
> [ 2608.679746] sky2 0000:04:00.0: PCI INT A -> GSI 17 (level, low) -> IRQ 17
> [ 2608.679776] sky2 0000:04:00.0: setting latency timer to 64
> [ 2608.679827] sky2 0000:04:00.0: unsupported chip type 0xff
> [ 2608.679851] sky2 0000:04:00.0: PCI INT A disabled
> [ 2608.679866] sky2: probe of 0000:04:00.0 failed with error -95
> [26940.138170] sky2: driver version 1.30
> [26940.138220] sky2 0000:04:00.0: PCI INT A -> GSI 17 (level, low) -> IRQ 17
> [26940.138236] sky2 0000:04:00.0: setting latency timer to 64
> [26940.138258] sky2 0000:04:00.0: unsupported chip type 0xff
> [26940.138268] sky2 0000:04:00.0: PCI INT A disabled
> [26940.138273] sky2: probe of 0000:04:00.0 failed with error -95
> 
> I have done bisection and have found the offending commit to be:
> 
> commit 7afe1845dd1e7c90828c942daed7e57ffa7c38d6
> Author: Sameer Nanda <snanda@chromium.org>
> Date: Mon Jul 25 17:13:29 2011 -0700
> init: skip calibration delay if previously done
> For each CPU, do the calibration delay only once.  For subsequent calls,
> use the cached per-CPU value of loops_per_jiffy.
> 
> This saves about 200ms of resume time on dual core Intel Atom N5xx based
> systems.  This helps bring down the kernel resume time on such systems
>    from about 500ms to about 300ms.
> --- end commit info ---
> 
> My uneducated guess is that by making the resume from S3 shorter, the
> driver catches the hardware with its pants down and freaks out.
> You can find all details/files (dmesg, lspci, dmidecode, config...)
> collected by apport in the ubuntu bug linked above. Let me know if I
> should supply any more info.
> Note: Please CC me into replies, I'm not subscribed. Thank you.
> 
> Best Regards,
> Michal Zatloukal

Is ubuntu still doing the stupid rmmod on suspend?

Looks like PCI power management has turned the chip off (that is why it
keeps reading ff to all requests).

^ permalink raw reply
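
The "keeps reading ff" observation matches the quoted log, where every saved config dword reads back as 0xffffffff and the chip type as 0xff. A heuristic for spotting that condition could look like the sketch below; `config_space_all_ones()` is a hypothetical helper, not a kernel or sky2 function, and all-ones reads are only a strong hint that the device is not responding, not proof.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Returns true when every dword read from the device is all ones,
 * which is what reads from a powered-off or absent device commonly
 * look like. An empty sample is treated as "no evidence".
 */
static bool config_space_all_ones(const uint32_t *dwords, size_t n)
{
	size_t i;

	if (n == 0)
		return false;

	for (i = 0; i < n; i++)
		if (dwords[i] != 0xffffffffu)
			return false;

	return true;
}
```

Fed the "was" values from the log above, the check returns true; fed the "writing" values, it returns false.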

* [RFC PATCH 10/10] ixgbe: Add support for set_channels ethtool operation
From: Alexander Duyck @ 2012-06-30  0:17 UTC (permalink / raw)
  To: netdev
  Cc: davem, jeffrey.t.kirsher, edumazet, bhutchings, therbert,
	alexander.duyck
In-Reply-To: <20120630000652.29939.11108.stgit@gitlad.jf.intel.com>

This change adds support for the ethtool set_channels operation.

Since the ixgbe driver has to support DCB as well as the other modes the
assumption I made here is that the number of channels in DCB modes refers
to the number of queues per traffic class, not the number of queues total.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
---

 drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c |   40 ++++++++++++++++++++++
 1 files changed, 40 insertions(+), 0 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c
index 03e369f..ec49afb 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c
@@ -2774,6 +2774,45 @@ static void ixgbe_get_channels(struct net_device *dev,
 	ch->combined_count = adapter->ring_feature[RING_F_FDIR].indices;
 }
 
+static int ixgbe_set_channels(struct net_device *dev,
+			      struct ethtool_channels *ch)
+{
+	struct ixgbe_adapter *adapter = netdev_priv(dev);
+	unsigned int count = ch->combined_count;
+
+	/* verify they are not requesting separate vectors */
+	if (ch->rx_count || ch->tx_count)
+		return -EINVAL;
+
+	/* ignore other_count since it is not changeable */
+
+	/* verify we have at least one channel requested */
+	if (!count)
+		return -EINVAL;
+
+	/* verify the number of channels does not exceed hardware limits */
+	if (count > ixgbe_max_channels(adapter))
+		return -EINVAL;
+
+	/* update feature limits from largest to smallest supported values */
+	adapter->ring_feature[RING_F_FDIR].limit = count;
+
+	/* cap RSS limit at 16 */
+	if (count > IXGBE_MAX_RSS_INDICES)
+		count = IXGBE_MAX_RSS_INDICES;
+	adapter->ring_feature[RING_F_RSS].limit = count;
+
+#ifdef IXGBE_FCOE
+	/* cap FCoE limit at 8 */
+	if (count > IXGBE_FCRETA_SIZE)
+		count = IXGBE_FCRETA_SIZE;
+	adapter->ring_feature[RING_F_FCOE].limit = count;
+
+#endif
+	/* use setup TC to update any traffic class queue mapping */
+	return ixgbe_setup_tc(dev, netdev_get_num_tc(dev));
+}
+
 static const struct ethtool_ops ixgbe_ethtool_ops = {
 	.get_settings           = ixgbe_get_settings,
 	.set_settings           = ixgbe_set_settings,
@@ -2804,6 +2843,7 @@ static const struct ethtool_ops ixgbe_ethtool_ops = {
 	.set_rxnfc		= ixgbe_set_rxnfc,
 	.get_ts_info		= ixgbe_get_ts_info,
 	.get_channels		= ixgbe_get_channels,
+	.set_channels		= ixgbe_set_channels,
 };
 
 void ixgbe_set_ethtool_ops(struct net_device *netdev)

^ permalink raw reply related
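
The validation and "largest to smallest" capping in ixgbe_set_channels() can be exercised in isolation. In this sketch `struct ring_limits` and `set_channel_limits()` are hypothetical names, the hardware maximum is passed in by the caller instead of coming from ixgbe_max_channels(), and the caps of 16 and 8 are taken from the comments in the patch.

```c
#include <errno.h>

#define MAX_RSS_QUEUES  16	/* "cap RSS limit at 16" in the patch */
#define MAX_FCOE_QUEUES 8	/* "cap FCoE limit at 8" in the patch */

/* Hypothetical container for the three per-feature limits. */
struct ring_limits {
	unsigned int fdir;
	unsigned int rss;
	unsigned int fcoe;
};

static int set_channel_limits(unsigned int rx_count, unsigned int tx_count,
			      unsigned int combined, unsigned int hw_max,
			      struct ring_limits *out)
{
	/* separate rx/tx vectors are not supported */
	if (rx_count || tx_count)
		return -EINVAL;

	/* need at least one channel, and no more than the hardware allows */
	if (!combined || combined > hw_max)
		return -EINVAL;

	/* each later feature inherits the already-capped count */
	out->fdir = combined;

	if (combined > MAX_RSS_QUEUES)
		combined = MAX_RSS_QUEUES;
	out->rss = combined;

	if (combined > MAX_FCOE_QUEUES)
		combined = MAX_FCOE_QUEUES;
	out->fcoe = combined;

	return 0;
}
```

Requesting 32 combined channels on hardware that allows 64 yields limits of 32, 16 and 8; requesting 4 leaves all three at 4.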

* [RFC PATCH 09/10] ixgbe: Add support for displaying the number of Tx/Rx channels
From: Alexander Duyck @ 2012-06-30  0:16 UTC (permalink / raw)
  To: netdev
  Cc: davem, jeffrey.t.kirsher, edumazet, bhutchings, therbert,
	alexander.duyck
In-Reply-To: <20120630000652.29939.11108.stgit@gitlad.jf.intel.com>

This patch adds support for the ethtool get_channels operation.

Since the ixgbe driver has to support DCB as well as the other modes the
assumption I made here is that the number of channels in DCB modes refers
to the number of queues per traffic class, not the number of queues total.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
---

 drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c |   72 ++++++++++++++++++++++
 1 files changed, 72 insertions(+), 0 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c
index 4104ea2..03e369f 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c
@@ -2703,6 +2703,77 @@ static int ixgbe_get_ts_info(struct net_device *dev,
 	return 0;
 }
 
+static unsigned int ixgbe_max_channels(struct ixgbe_adapter *adapter)
+{
+	unsigned int max_combined;
+	u8 tcs = netdev_get_num_tc(adapter->netdev);
+
+	if (!(adapter->flags & IXGBE_FLAG_MSIX_ENABLED)) {
+		/* We only support one q_vector without MSI-X */
+		max_combined = 1;
+	} else if (adapter->flags & IXGBE_FLAG_SRIOV_ENABLED) {
+		/* SR-IOV currently only allows one queue on the PF */
+		max_combined = 1;
+	} else if (tcs > 1) {
+		/* For DCB report channels per traffic class */
+		if (adapter->hw.mac.type == ixgbe_mac_82598EB) {
+			/* 8 TC w/ 4 queues per TC */
+			max_combined = 4;
+		} else if (tcs > 4) {
+			/* 8 TC w/ 8 queues per TC */
+			max_combined = 8;
+		} else {
+			/* 4 TC w/ 16 queues per TC */
+			max_combined = 16;
+		}
+	} else if (adapter->atr_sample_rate) {
+		/* support up to 64 queues with ATR */
+		max_combined = IXGBE_MAX_FDIR_INDICES;
+	} else {
+		/* support up to 16 queues with RSS */
+		max_combined = IXGBE_MAX_RSS_INDICES;
+	}
+
+	return max_combined;
+}
+
+static void ixgbe_get_channels(struct net_device *dev,
+			       struct ethtool_channels *ch)
+{
+	struct ixgbe_adapter *adapter = netdev_priv(dev);
+
+	/* report maximum channels */
+	ch->max_combined = ixgbe_max_channels(adapter);
+
+	/* report info for other vector */
+	if (adapter->flags & IXGBE_FLAG_MSIX_ENABLED) {
+		ch->max_other = NON_Q_VECTORS;
+		ch->other_count = NON_Q_VECTORS;
+	}
+
+	/* record RSS queues */
+	ch->combined_count = adapter->ring_feature[RING_F_RSS].indices;
+
+	/* nothing else to report if RSS is disabled */
+	if (ch->combined_count == 1)
+		return;
+
+	/* we do not support ATR queueing if SR-IOV is enabled */
+	if (adapter->flags & IXGBE_FLAG_SRIOV_ENABLED)
+		return;
+
+	/* same thing goes for being DCB enabled */
+	if (netdev_get_num_tc(dev) > 1)
+		return;
+
+	/* if ATR is disabled we can exit */
+	if (!adapter->atr_sample_rate)
+		return;
+
+	/* report flow director queues as maximum channels */
+	ch->combined_count = adapter->ring_feature[RING_F_FDIR].indices;
+}
+
 static const struct ethtool_ops ixgbe_ethtool_ops = {
 	.get_settings           = ixgbe_get_settings,
 	.set_settings           = ixgbe_set_settings,
@@ -2732,6 +2803,7 @@ static const struct ethtool_ops ixgbe_ethtool_ops = {
 	.get_rxnfc		= ixgbe_get_rxnfc,
 	.set_rxnfc		= ixgbe_set_rxnfc,
 	.get_ts_info		= ixgbe_get_ts_info,
+	.get_channels		= ixgbe_get_channels,
 };
 
 void ixgbe_set_ethtool_ops(struct net_device *netdev)

^ permalink raw reply related

* [RFC PATCH 08/10] ixgbe: Update ixgbe driver to use __dev_pick_tx in ixgbe_select_queue
From: Alexander Duyck @ 2012-06-30  0:16 UTC (permalink / raw)
  To: netdev
  Cc: davem, jeffrey.t.kirsher, edumazet, bhutchings, therbert,
	alexander.duyck
In-Reply-To: <20120630000652.29939.11108.stgit@gitlad.jf.intel.com>

This change updates the ixgbe driver to use __dev_pick_tx instead of the
current logic it is using to select a queue.  The main result of this
change is that ixgbe can now fully support XPS.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
---

 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |   18 +++++++-----------
 1 files changed, 7 insertions(+), 11 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index 06641ea..0b35ec3 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -6336,18 +6336,18 @@ static inline int ixgbe_maybe_stop_tx(struct ixgbe_ring *tx_ring, u16 size)
 	return __ixgbe_maybe_stop_tx(tx_ring, size);
 }
 
+#ifdef IXGBE_FCOE
 static u16 ixgbe_select_queue(struct net_device *dev, struct sk_buff *skb)
 {
 	struct ixgbe_adapter *adapter = netdev_priv(dev);
-	int txq = skb_rx_queue_recorded(skb) ? skb_get_rx_queue(skb) :
-					       smp_processor_id();
-#ifdef IXGBE_FCOE
 	__be16 protocol = vlan_get_protocol(skb);
 
 	if (((protocol == htons(ETH_P_FCOE)) ||
 	    (protocol == htons(ETH_P_FIP))) &&
 	    (adapter->flags & IXGBE_FLAG_FCOE_ENABLED)) {
 		struct ixgbe_ring_feature *f;
+		int txq = skb_rx_queue_recorded(skb) ? skb_get_rx_queue(skb) :
+						       smp_processor_id();
 
 		f = &adapter->ring_feature[RING_F_FCOE];
 
@@ -6357,17 +6357,11 @@ static u16 ixgbe_select_queue(struct net_device *dev, struct sk_buff *skb)
 
 		return txq;
 	}
-#endif
-
-	if (adapter->flags & IXGBE_FLAG_FDIR_HASH_CAPABLE) {
-		while (unlikely(txq >= dev->real_num_tx_queues))
-			txq -= dev->real_num_tx_queues;
-		return txq;
-	}
 
-	return skb_tx_hash(dev, skb);
+	return __dev_pick_tx(dev, skb);
 }
 
+#endif
 netdev_tx_t ixgbe_xmit_frame_ring(struct sk_buff *skb,
 			  struct ixgbe_adapter *adapter,
 			  struct ixgbe_ring *tx_ring)
@@ -7013,7 +7007,9 @@ static const struct net_device_ops ixgbe_netdev_ops = {
 	.ndo_open		= ixgbe_open,
 	.ndo_stop		= ixgbe_close,
 	.ndo_start_xmit		= ixgbe_xmit_frame,
+#ifdef IXGBE_FCOE
 	.ndo_select_queue	= ixgbe_select_queue,
+#endif
 	.ndo_set_rx_mode	= ixgbe_set_rx_mode,
 	.ndo_validate_addr	= eth_validate_addr,
 	.ndo_set_mac_address	= ixgbe_set_mac,

^ permalink raw reply related

* [RFC PATCH 07/10] ixgbe: Add function for setting XPS queue mapping
From: Alexander Duyck @ 2012-06-30  0:16 UTC (permalink / raw)
  To: netdev
  Cc: davem, jeffrey.t.kirsher, edumazet, bhutchings, therbert,
	alexander.duyck
In-Reply-To: <20120630000652.29939.11108.stgit@gitlad.jf.intel.com>

This change adds support for ixgbe to configure the XPS queue mapping on
load.  The result of this change is that on open we will now be resetting
the number of Tx queues, and then setting the default configuration for XPS
based on whether ATR is enabled or disabled.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
---

 drivers/net/ethernet/intel/ixgbe/ixgbe_lib.c  |    3 +--
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |   18 ++++++++++++++++++
 2 files changed, 19 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_lib.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_lib.c
index 72386fb..a43dae0 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_lib.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_lib.c
@@ -797,8 +797,7 @@ static int ixgbe_alloc_q_vector(struct ixgbe_adapter *adapter,
 	/* setup affinity mask and node */
 	if (cpu != -1)
 		cpumask_set_cpu(cpu, &q_vector->affinity_mask);
-	else
-		cpumask_copy(&q_vector->affinity_mask, cpu_online_mask);
+
 	q_vector->numa_node = node;
 
 	/* initialize CPU for DCA */
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index dedb412..06641ea 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -4848,6 +4848,22 @@ static int ixgbe_change_mtu(struct net_device *netdev, int new_mtu)
 	return 0;
 }
 
+static void ixgbe_set_xps_mapping(struct net_device *netdev)
+{
+	struct ixgbe_adapter *adapter = netdev_priv(netdev);
+	struct ixgbe_q_vector *q_vector;
+	u16 i;
+
+	for (i = 0; i < adapter->num_tx_queues; i++) {
+		q_vector = adapter->tx_ring[i]->q_vector;
+
+		if (!q_vector)
+			continue;
+
+		netif_set_xps_queue(netdev, &q_vector->affinity_mask, i);
+	}
+}
+
 /**
  * ixgbe_open - Called when a network interface is made active
  * @netdev: network interface device structure
@@ -4894,6 +4910,8 @@ static int ixgbe_open(struct net_device *netdev)
 	if (err)
 		goto err_set_queues;
 
+	/* update the Tx mapping */
+	ixgbe_set_xps_mapping(netdev);
 
 	err = netif_set_real_num_rx_queues(netdev,
 					   adapter->num_rx_pools > 1 ? 1 :

^ permalink raw reply related

* [RFC PATCH 06/10] ixgbe: Define FCoE and Flow director limits much sooner to allow for changes
From: Alexander Duyck @ 2012-06-30  0:16 UTC (permalink / raw)
  To: netdev
  Cc: davem, jeffrey.t.kirsher, edumazet, bhutchings, therbert,
	alexander.duyck
In-Reply-To: <20120630000652.29939.11108.stgit@gitlad.jf.intel.com>

Instead of adjusting the FCoE and Flow director limits based on the number
of CPUs we can define them much sooner.  This allows the user to come
through later and adjust them once we have updated the code to support the
set_channels ethtool operation.

I am still allowing for FCoE and RSS queues to be separated if the number
of queues is less than the number of CPUs.  This essentially treats the two
groupings like they are two separate traffic classes.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
---

 drivers/net/ethernet/intel/ixgbe/ixgbe_lib.c  |    7 +------
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |   12 ++++++++----
 2 files changed, 9 insertions(+), 10 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_lib.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_lib.c
index 24acd53..72386fb 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_lib.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_lib.c
@@ -386,7 +386,6 @@ static bool ixgbe_set_dcb_sriov_queues(struct ixgbe_adapter *adapter)
 		fcoe = &adapter->ring_feature[RING_F_FCOE];
 
 		/* limit ourselves based on feature limits */
-		fcoe_i = min_t(u16, fcoe_i, num_online_cpus());
 		fcoe_i = min_t(u16, fcoe_i, fcoe->limit);
 
 		if (fcoe_i) {
@@ -562,9 +561,6 @@ static bool ixgbe_set_sriov_queues(struct ixgbe_adapter *adapter)
 		fcoe_i = min_t(u16, fcoe_i, fcoe->limit);
 
 		if (vmdq_i > 1 && fcoe_i) {
-			/* reserve no more than number of CPUs */
-			fcoe_i = min_t(u16, fcoe_i, num_online_cpus());
-
 			/* alloc queues for FCoE separately */
 			fcoe->indices = fcoe_i;
 			fcoe->offset = vmdq_i * rss_i;
@@ -623,8 +619,7 @@ static bool ixgbe_set_rss_queues(struct ixgbe_adapter *adapter)
 	if (rss_i > 1 && adapter->atr_sample_rate) {
 		f = &adapter->ring_feature[RING_F_FDIR];
 
-		f->indices = min_t(u16, num_online_cpus(), f->limit);
-		rss_i = max_t(u16, rss_i, f->indices);
+		rss_i = f->indices = f->limit;
 
 		if (!(adapter->flags & IXGBE_FLAG_FDIR_PERFECT_CAPABLE))
 			adapter->flags |= IXGBE_FLAG_FDIR_HASH_CAPABLE;
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index 5217b6d..dedb412 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -4433,7 +4433,7 @@ static int __devinit ixgbe_sw_init(struct ixgbe_adapter *adapter)
 {
 	struct ixgbe_hw *hw = &adapter->hw;
 	struct pci_dev *pdev = adapter->pdev;
-	unsigned int rss;
+	unsigned int rss, fdir;
 #ifdef CONFIG_IXGBE_DCB
 	int j;
 	struct tc_configuration *tc;
@@ -4466,8 +4466,8 @@ static int __devinit ixgbe_sw_init(struct ixgbe_adapter *adapter)
 			adapter->flags2 |= IXGBE_FLAG2_TEMP_SENSOR_CAPABLE;
 		/* Flow Director hash filters enabled */
 		adapter->atr_sample_rate = 20;
-		adapter->ring_feature[RING_F_FDIR].limit =
-							 IXGBE_MAX_FDIR_INDICES;
+		fdir = min_t(int, IXGBE_MAX_FDIR_INDICES, num_online_cpus());
+		adapter->ring_feature[RING_F_FDIR].limit = fdir;
 		adapter->fdir_pballoc = IXGBE_FDIR_PBALLOC_64K;
 #ifdef IXGBE_FCOE
 		adapter->flags |= IXGBE_FLAG_FCOE_CAPABLE;
@@ -7324,13 +7324,17 @@ static int __devinit ixgbe_probe(struct pci_dev *pdev,
 
 #ifdef IXGBE_FCOE
 	if (adapter->flags & IXGBE_FLAG_FCOE_CAPABLE) {
+		unsigned int fcoe_l;
+
 		if (hw->mac.ops.get_device_caps) {
 			hw->mac.ops.get_device_caps(hw, &device_caps);
 			if (device_caps & IXGBE_DEVICE_CAPS_FCOE_OFFLOADS)
 				adapter->flags &= ~IXGBE_FLAG_FCOE_CAPABLE;
 		}
 
-		adapter->ring_feature[RING_F_FCOE].limit = IXGBE_FCRETA_SIZE;
+
+		fcoe_l = min_t(int, IXGBE_FCRETA_SIZE, num_online_cpus());
+		adapter->ring_feature[RING_F_FCOE].limit = fcoe_l;
 
 		netdev->features |= NETIF_F_FSO |
 				    NETIF_F_FCOE_CRC;

^ permalink raw reply related

* [RFC PATCH 05/10] net: Add support for XPS without SYSFS being defined
From: Alexander Duyck @ 2012-06-30  0:16 UTC (permalink / raw)
  To: netdev
  Cc: davem, jeffrey.t.kirsher, edumazet, bhutchings, therbert,
	alexander.duyck
In-Reply-To: <20120630000652.29939.11108.stgit@gitlad.jf.intel.com>

This patch makes it so that we can support transmit packet steering without
sysfs needing to be enabled.  The reason for making this change is to make
it so that a driver can make use of the XPS even while the sysfs portion of
the interface is not present.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
---

 include/linux/netdevice.h |    1 -
 net/Kconfig               |    2 +-
 net/core/dev.c            |   26 ++++++++++++++++++++------
 net/core/net-sysfs.c      |   14 --------------
 4 files changed, 21 insertions(+), 22 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index e9e74b7..db27be2 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -2073,7 +2073,6 @@ static inline void netif_wake_subqueue(struct net_device *dev, u16 queue_index)
 }
 
 #ifdef CONFIG_XPS
-extern void netif_reset_xps_queue(struct net_device *dev, u16 index);
 extern int netif_set_xps_queue(struct net_device *dev, struct cpumask *mask,
 			       u16 index);
 #else
diff --git a/net/Kconfig b/net/Kconfig
index 245831b..fcc5657 100644
--- a/net/Kconfig
+++ b/net/Kconfig
@@ -230,7 +230,7 @@ config RFS_ACCEL
 
 config XPS
 	boolean
-	depends on SMP && SYSFS && USE_GENERIC_SMP_HELPERS
+	depends on SMP && USE_GENERIC_SMP_HELPERS
 	default y
 
 config NETPRIO_CGROUP
diff --git a/net/core/dev.c b/net/core/dev.c
index 5f0550b..894faf1 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1758,10 +1758,10 @@ static struct xps_map *remove_xps_queue(struct xps_dev_maps *dev_maps,
 	return map;
 }
 
-void netif_reset_xps_queue(struct net_device *dev, u16 index)
+static void netif_reset_xps_queues_gt(struct net_device *dev, u16 index)
 {
 	struct xps_dev_maps *dev_maps;
-	int cpu;
+	int cpu, i;
 	bool active = false;
 
 	mutex_lock(&xps_map_mutex);
@@ -1771,7 +1771,11 @@ void netif_reset_xps_queue(struct net_device *dev, u16 index)
 		goto out_no_maps;
 
 	for_each_possible_cpu(cpu) {
-		if (remove_xps_queue(dev_maps, cpu, index))
+		for (i = index; i < dev->num_tx_queues; i++) {
+			if (!remove_xps_queue(dev_maps, cpu, i))
+				break;
+		}
+		if (i == dev->num_tx_queues)
 			active = true;
 	}
 
@@ -1780,8 +1784,10 @@ void netif_reset_xps_queue(struct net_device *dev, u16 index)
 		kfree_rcu(dev_maps, rcu);
 	}
 
-	netdev_queue_numa_node_write(netdev_get_tx_queue(dev, index),
-				     NUMA_NO_NODE);
+	for (i = index; i < dev->num_tx_queues; i++)
+		netdev_queue_numa_node_write(netdev_get_tx_queue(dev, i),
+					     NUMA_NO_NODE);
+
 out_no_maps:
 	mutex_unlock(&xps_map_mutex);
 }
@@ -1967,8 +1973,12 @@ int netif_set_real_num_tx_queues(struct net_device *dev, unsigned int txq)
 		if (dev->num_tc)
 			netif_setup_tc(dev, txq);
 
-		if (txq < dev->real_num_tx_queues)
+		if (txq < dev->real_num_tx_queues) {
 			qdisc_reset_all_tx_gt(dev, txq);
+#ifdef CONFIG_XPS
+			netif_reset_xps_queues_gt(dev, txq);
+#endif
+		}
 	}
 
 	dev->real_num_tx_queues = txq;
@@ -5460,6 +5470,10 @@ static void rollback_registered_many(struct list_head *head)
 
 		/* Remove entries from kobject tree */
 		netdev_unregister_kobject(dev);
+#ifdef CONFIG_XPS
+		/* Remove XPS queueing entries */
+		netif_reset_xps_queues_gt(dev, 0);
+#endif
 	}
 
 	/* Process any work delayed until the end of the batch */
diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index 092d338..42bb496 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -963,16 +963,6 @@ static ssize_t show_xps_map(struct netdev_queue *queue,
 	return len;
 }
 
-static void xps_queue_release(struct netdev_queue *queue)
-{
-	struct net_device *dev = queue->dev;
-	unsigned long index;
-
-	index = get_netdev_queue_index(queue);
-
-	netif_reset_xps_queue(dev, index);
-}
-
 static ssize_t store_xps_map(struct netdev_queue *queue,
 		      struct netdev_queue_attribute *attribute,
 		      const char *buf, size_t len)
@@ -1019,10 +1009,6 @@ static void netdev_queue_release(struct kobject *kobj)
 {
 	struct netdev_queue *queue = to_netdev_queue(kobj);
 
-#ifdef CONFIG_XPS
-	xps_queue_release(queue);
-#endif
-
 	memset(kobj, 0, sizeof(*kobj));
 	dev_put(queue->dev);
 }

^ permalink raw reply related

* [RFC PATCH 04/10] net: Rewrite netif_set_xps_queues to address several issues
From: Alexander Duyck @ 2012-06-30  0:16 UTC (permalink / raw)
  To: netdev
  Cc: davem, jeffrey.t.kirsher, edumazet, bhutchings, therbert,
	alexander.duyck
In-Reply-To: <20120630000652.29939.11108.stgit@gitlad.jf.intel.com>

This change is meant to address several issues I found within the
netif_set_xps_queue function.

If the allocation of one of the maps to be assigned to new_dev_maps failed
we could end up with the device map in an inconsistent state since we had
already worked through a number of CPUs and removed or added the queue.  To
address that I split the process into several steps.  The first of these is
just the allocation of updated maps for CPUs that will need larger maps to
store the queue.  By doing this we can fail gracefully without actually
altering the contents of the current device map.

The second issue I found was the fact that we were always allocating a new
device map even if we were not adding any queues.  I have updated the code
so that we only allocate a new device map if we are adding queues,
otherwise if we are not adding any queues to CPUs we just skip to the
removal process.

The last change I made was to reuse the code from remove_xps_queue to remove
the queue from the CPU.  By making this change we can be consistent in how
we go about adding and removing the queues from the CPUs.
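
The allocate-first step can be sketched in userspace as follows. This is a simplified illustration, not the kernel code: struct queue_map, expand_map() and SKETCH_MIN_MAP_ALLOC are hypothetical names, calloc() stands in for kzalloc_node(), and there is no RCU or locking. The point it shows is that the helper only ever returns the existing map or a larger copy, so a failed allocation leaves the current map untouched.

```c
#include <assert.h>
#include <stdlib.h>

#define SKETCH_MIN_MAP_ALLOC 2	/* assumed starting capacity */

struct queue_map {
	unsigned int len;		/* entries in use */
	unsigned int alloc_len;		/* entries allocated */
	unsigned short queues[];	/* queue indices for one CPU */
};

/*
 * Return a map with room to store @index: the original map if the queue
 * is already present or a spare slot exists, otherwise a new copy with
 * doubled capacity.  Returns NULL on allocation failure.  The queue is
 * not appended here and the old map is not freed; both are left to the
 * caller once every allocation has succeeded.
 */
static struct queue_map *expand_map(struct queue_map *map,
				    unsigned short index)
{
	unsigned int alloc_len = SKETCH_MIN_MAP_ALLOC;
	struct queue_map *new_map;
	unsigned int i, pos;

	for (pos = 0; map && pos < map->len; pos++) {
		if (map->queues[pos] == index)
			return map;	/* already mapped */
	}

	if (map) {
		if (pos < map->alloc_len)
			return map;	/* spare slot available */
		alloc_len = map->alloc_len * 2;
	}

	new_map = calloc(1, sizeof(*new_map) +
			    alloc_len * sizeof(new_map->queues[0]));
	if (!new_map)
		return NULL;

	for (i = 0; i < pos; i++)
		new_map->queues[i] = map->queues[i];
	new_map->alloc_len = alloc_len;
	new_map->len = pos;

	return new_map;
}
```

Because the append happens only in a later pass, an error path needs to free just the maps that differ from the originals.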

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
---

 net/core/dev.c |  183 ++++++++++++++++++++++++++++++++++++--------------------
 1 files changed, 117 insertions(+), 66 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 8e259d4..5f0550b 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1786,107 +1786,158 @@ out_no_maps:
 	mutex_unlock(&xps_map_mutex);
 }
 
+static struct xps_map *expand_xps_map(struct xps_map *map,
+				      int cpu, u16 index)
+{
+	struct xps_map *new_map;
+	int alloc_len = XPS_MIN_MAP_ALLOC;
+	int i, pos;
+
+	for (pos = 0; map && pos < map->len; pos++) {
+		if (map->queues[pos] != index)
+			continue;
+		return map;
+	}
+
+	/* Need to add queue to this CPU's existing map */
+	if (map) {
+		if (pos < map->alloc_len)
+			return map;
+
+		alloc_len = map->alloc_len * 2;
+	}
+
+	/* Need to allocate new map to store queue on this CPU's map */
+	new_map = kzalloc_node(XPS_MAP_SIZE(alloc_len), GFP_KERNEL,
+			       cpu_to_node(cpu));
+	if (!new_map)
+		return NULL;
+
+	for (i = 0; i < pos; i++)
+		new_map->queues[i] = map->queues[i];
+	new_map->alloc_len = alloc_len;
+	new_map->len = pos;
+
+	return new_map;
+}
+
 int netif_set_xps_queue(struct net_device *dev, struct cpumask *mask, u16 index)
 {
-	int i, cpu, pos, map_len, alloc_len, need_set;
+	struct xps_dev_maps *dev_maps, *new_dev_maps = NULL;
 	struct xps_map *map, *new_map;
-	struct xps_dev_maps *dev_maps, *new_dev_maps;
-	int nonempty = 0;
-	int numa_node_id = -2;
 	int maps_sz = max_t(unsigned int, XPS_DEV_MAPS_SIZE, L1_CACHE_BYTES);
-
-	new_dev_maps = kzalloc(maps_sz, GFP_KERNEL);
-	if (!new_dev_maps)
-		return -ENOMEM;
+	int cpu, numa_node_id = -2;
+	bool active = false;
 
 	mutex_lock(&xps_map_mutex);
 
 	dev_maps = xmap_dereference(dev->xps_maps);
 
+	/* allocate memory for queue storage */
+	for_each_online_cpu(cpu) {
+		if (!cpumask_test_cpu(cpu, mask))
+			continue;
+
+		if (!new_dev_maps)
+			new_dev_maps = kzalloc(maps_sz, GFP_KERNEL);
+		if (!new_dev_maps)
+			return -ENOMEM;
+
+		map = dev_maps ? xmap_dereference(dev_maps->cpu_map[cpu]) :
+				 NULL;
+
+		map = expand_xps_map(map, cpu, index);
+		if (!map)
+			goto error;
+
+		RCU_INIT_POINTER(new_dev_maps->cpu_map[cpu], map);
+	}
+
+	if (!new_dev_maps)
+		goto out_no_new_maps;
+
 	for_each_possible_cpu(cpu) {
-		map = dev_maps ?
-			xmap_dereference(dev_maps->cpu_map[cpu]) : NULL;
-		new_map = map;
-		if (map) {
-			for (pos = 0; pos < map->len; pos++)
-				if (map->queues[pos] == index)
-					break;
-			map_len = map->len;
-			alloc_len = map->alloc_len;
-		} else
-			pos = map_len = alloc_len = 0;
+		if (cpumask_test_cpu(cpu, mask) && cpu_online(cpu)) {
+			/* add queue to CPU maps */
+			int pos = 0;
 
-		need_set = cpumask_test_cpu(cpu, mask) && cpu_online(cpu);
+			map = xmap_dereference(new_dev_maps->cpu_map[cpu]);
+			while ((pos < map->len) && (map->queues[pos] != index))
+				pos++;
+
+			if (pos == map->len)
+				map->queues[map->len++] = index;
 #ifdef CONFIG_NUMA
-		if (need_set) {
 			if (numa_node_id == -2)
 				numa_node_id = cpu_to_node(cpu);
 			else if (numa_node_id != cpu_to_node(cpu))
 				numa_node_id = -1;
-		}
 #endif
-		if (need_set && pos >= map_len) {
-			/* Need to add queue to this CPU's map */
-			if (map_len >= alloc_len) {
-				alloc_len = alloc_len ?
-				    2 * alloc_len : XPS_MIN_MAP_ALLOC;
-				new_map = kzalloc_node(XPS_MAP_SIZE(alloc_len),
-						       GFP_KERNEL,
-						       cpu_to_node(cpu));
-				if (!new_map)
-					goto error;
-				new_map->alloc_len = alloc_len;
-				for (i = 0; i < map_len; i++)
-					new_map->queues[i] = map->queues[i];
-				new_map->len = map_len;
-			}
-			new_map->queues[new_map->len++] = index;
-		} else if (!need_set && pos < map_len) {
-			/* Need to remove queue from this CPU's map */
-			if (map_len > 1)
-				new_map->queues[pos] =
-				    new_map->queues[--new_map->len];
-			else
-				new_map = NULL;
+		} else if (dev_maps) {
+			/* fill in the new device map from the old device map */
+			map = xmap_dereference(dev_maps->cpu_map[cpu]);
+			RCU_INIT_POINTER(new_dev_maps->cpu_map[cpu], map);
 		}
-		RCU_INIT_POINTER(new_dev_maps->cpu_map[cpu], new_map);
+
 	}
 
+	rcu_assign_pointer(dev->xps_maps, new_dev_maps);
+
 	/* Cleanup old maps */
-	for_each_possible_cpu(cpu) {
-		map = dev_maps ?
-			xmap_dereference(dev_maps->cpu_map[cpu]) : NULL;
-		if (map && xmap_dereference(new_dev_maps->cpu_map[cpu]) != map)
-			kfree_rcu(map, rcu);
-		if (new_dev_maps->cpu_map[cpu])
-			nonempty = 1;
-	}
+	if (dev_maps) {
+		for_each_possible_cpu(cpu) {
+			new_map = xmap_dereference(new_dev_maps->cpu_map[cpu]);
+			map = xmap_dereference(dev_maps->cpu_map[cpu]);
+			if (map && map != new_map)
+				kfree_rcu(map, rcu);
+		}
 
-	if (nonempty) {
-		rcu_assign_pointer(dev->xps_maps, new_dev_maps);
-	} else {
-		kfree(new_dev_maps);
-		RCU_INIT_POINTER(dev->xps_maps, NULL);
+		kfree_rcu(dev_maps, rcu);
 	}
 
-	if (dev_maps)
-		kfree_rcu(dev_maps, rcu);
+	dev_maps = new_dev_maps;
+	active = true;
 
+out_no_new_maps:
+	/* update Tx queue numa node */
 	netdev_queue_numa_node_write(netdev_get_tx_queue(dev, index),
 				     (numa_node_id >= 0) ? numa_node_id :
 				     NUMA_NO_NODE);
 
+	if (!dev_maps)
+		goto out_no_maps;
+
+	/* removes queue from unused CPUs */
+	for_each_possible_cpu(cpu) {
+		if (cpumask_test_cpu(cpu, mask) && cpu_online(cpu))
+			continue;
+
+		if (remove_xps_queue(dev_maps, cpu, index))
+			active = true;
+	}
+
+	/* free map if not active */
+	if (!active) {
+		RCU_INIT_POINTER(dev->xps_maps, NULL);
+		kfree_rcu(dev_maps, rcu);
+	}
+
+out_no_maps:
 	mutex_unlock(&xps_map_mutex);
 
 	return 0;
 error:
+	/* remove any maps that we added */
+	for_each_possible_cpu(cpu) {
+		new_map = xmap_dereference(new_dev_maps->cpu_map[cpu]);
+		map = dev_maps ? xmap_dereference(dev_maps->cpu_map[cpu]) :
+				 NULL;
+		if (new_map && new_map != map)
+			kfree(new_map);
+	}
+
 	mutex_unlock(&xps_map_mutex);
 
-	if (new_dev_maps)
-		for_each_possible_cpu(i)
-			kfree(rcu_dereference_protected(
-				new_dev_maps->cpu_map[i],
-				1));
 	kfree(new_dev_maps);
 	return -ENOMEM;
 }

^ permalink raw reply related

* [RFC PATCH 03/10] net: Rewrite netif_reset_xps_queue to allow for better code reuse
From: Alexander Duyck @ 2012-06-30  0:16 UTC (permalink / raw)
  To: netdev
  Cc: davem, jeffrey.t.kirsher, edumazet, bhutchings, therbert,
	alexander.duyck
In-Reply-To: <20120630000652.29939.11108.stgit@gitlad.jf.intel.com>

This patch does a minor refactor on netif_reset_xps_queue to address a few
items I noticed.

First is the fact that we are doing removal of queues in both
netif_reset_xps_queue and netif_set_xps_queue.  Since there is no need to
have the code in two places I am pushing it out into a separate function
and will come back in another patch and reuse the code in
netif_set_xps_queue.

The second item this change addresses is the fact that the Tx queues were
not getting their numa_node value cleared as a part of the XPS queue reset.
This patch resolves that by resetting the numa_node value if the dev_maps
value is set.
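
The per-CPU removal step that gets factored out can be sketched in isolation as follows. This is an illustration under simplifying assumptions: struct cpu_queue_map and remove_queue() are hypothetical names, the map is a fixed-size array, and the RCU-deferred free of an emptied map is reduced to a boolean return.

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_MAP_CAPACITY 8	/* assumed fixed capacity for the sketch */

struct cpu_queue_map {
	unsigned int len;
	unsigned short queues[SKETCH_MAP_CAPACITY];
};

/*
 * Remove @index from the map by overwriting it with the last entry,
 * so no shifting is needed and ordering is not preserved.  Returns
 * true if the map still holds at least one queue afterwards, which is
 * what the caller uses to decide whether the device map stays active.
 */
static bool remove_queue(struct cpu_queue_map *map, unsigned short index)
{
	unsigned int pos;

	assert(map && map->len <= SKETCH_MAP_CAPACITY);

	for (pos = 0; pos < map->len; pos++) {
		if (map->queues[pos] == index) {
			map->queues[pos] = map->queues[--map->len];
			break;
		}
	}

	return map->len > 0;
}
```

A caller loops over all CPUs and keeps the device-wide map only if at least one call returned true, mirroring the "active" flag in the patch.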

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
---

 net/core/dev.c |   56 +++++++++++++++++++++++++++++++++-----------------------
 1 files changed, 33 insertions(+), 23 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 4c0981b..8e259d4 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1733,45 +1733,55 @@ static DEFINE_MUTEX(xps_map_mutex);
 #define xmap_dereference(P)		\
 	rcu_dereference_protected((P), lockdep_is_held(&xps_map_mutex))
 
-void netif_reset_xps_queue(struct net_device *dev, u16 index)
+static struct xps_map *remove_xps_queue(struct xps_dev_maps *dev_maps,
+					int cpu, u16 index)
 {
-	struct xps_dev_maps *dev_maps;
-	struct xps_map *map;
-	int i, pos, nonempty = 0;
-
-	mutex_lock(&xps_map_mutex);
-	dev_maps = xmap_dereference(dev->xps_maps);
-
-	if (!dev_maps)
-		goto out_no_maps;
+	struct xps_map *map = NULL;
+	int pos;
 
-	for_each_possible_cpu(i) {
-		map = xmap_dereference(dev_maps->cpu_map[i]);
-		if (!map)
-			continue;
-
-		for (pos = 0; pos < map->len; pos++)
-			if (map->queues[pos] == index)
-				break;
+	if (dev_maps)
+		map = xmap_dereference(dev_maps->cpu_map[cpu]);
 
-		if (pos < map->len) {
+	for (pos = 0; map && pos < map->len; pos++) {
+		if (map->queues[pos] == index) {
 			if (map->len > 1) {
 				map->queues[pos] = map->queues[--map->len];
 			} else {
-				RCU_INIT_POINTER(dev_maps->cpu_map[i], NULL);
+				RCU_INIT_POINTER(dev_maps->cpu_map[cpu], NULL);
 				kfree_rcu(map, rcu);
 				map = NULL;
 			}
+			break;
 		}
-		if (map)
-			nonempty = 1;
 	}
 
-	if (!nonempty) {
+	return map;
+}
+
+void netif_reset_xps_queue(struct net_device *dev, u16 index)
+{
+	struct xps_dev_maps *dev_maps;
+	int cpu;
+	bool active = false;
+
+	mutex_lock(&xps_map_mutex);
+	dev_maps = xmap_dereference(dev->xps_maps);
+
+	if (!dev_maps)
+		goto out_no_maps;
+
+	for_each_possible_cpu(cpu) {
+		if (remove_xps_queue(dev_maps, cpu, index))
+			active = true;
+	}
+
+	if (!active) {
 		RCU_INIT_POINTER(dev->xps_maps, NULL);
 		kfree_rcu(dev_maps, rcu);
 	}
 
+	netdev_queue_numa_node_write(netdev_get_tx_queue(dev, index),
+				     NUMA_NO_NODE);
 out_no_maps:
 	mutex_unlock(&xps_map_mutex);
 }

^ permalink raw reply related

* [RFC PATCH 02/10] net: Add functions netif_reset_xps_queue and netif_set_xps_queue
From: Alexander Duyck @ 2012-06-30  0:16 UTC (permalink / raw)
  To: netdev
  Cc: davem, jeffrey.t.kirsher, edumazet, bhutchings, therbert,
	alexander.duyck
In-Reply-To: <20120630000652.29939.11108.stgit@gitlad.jf.intel.com>

This patch adds two functions, netif_reset_xps_queue and
netif_set_xps_queue.  The main idea behind these two functions is to provide
a mechanism through which drivers can update their defaults with regard to
XPS.

Currently no such mechanism exists and as a result we cannot use XPS for
things such as ATR which would require a basic configuration to start in
which the Tx queues are mapped to CPUs via a 1:1 mapping.  With this change
I am making it possible for drivers such as ixgbe to be able to use the XPS
feature by controlling the default configuration.
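
As a rough picture of the 1:1 default mentioned above, the sketch below builds a per-queue CPU mask in which Tx queue N maps only to CPU N. It is a userspace illustration, not driver code: default_queue_cpumask() is a hypothetical helper and a 64-bit word stands in for struct cpumask, so it assumes at most 64 CPUs.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Return a bitmask with only bit @queue set when a CPU with that number
 * exists, or an empty mask when there are more queues than CPUs.
 */
static uint64_t default_queue_cpumask(unsigned int queue,
				      unsigned int num_cpus)
{
	assert(num_cpus <= 64);

	return queue < num_cpus ? (UINT64_C(1) << queue) : 0;
}
```

A driver would hand the equivalent mask for each queue to netif_set_xps_queue(); what the driver in this series actually passes is the q_vector affinity mask shown in the ixgbe patches.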

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
---

 include/linux/netdevice.h |   13 ++++
 net/core/dev.c            |  155 +++++++++++++++++++++++++++++++++++++++++++++
 net/core/net-sysfs.c      |  148 +------------------------------------------
 3 files changed, 173 insertions(+), 143 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 3329d70..e9e74b7 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -2072,6 +2072,19 @@ static inline void netif_wake_subqueue(struct net_device *dev, u16 queue_index)
 		__netif_schedule(txq->qdisc);
 }
 
+#ifdef CONFIG_XPS
+extern void netif_reset_xps_queue(struct net_device *dev, u16 index);
+extern int netif_set_xps_queue(struct net_device *dev, struct cpumask *mask,
+			       u16 index);
+#else
+static inline int netif_set_xps_queue(struct net_device *dev,
+				      struct cpumask *mask,
+				      u16 index)
+{
+	return 0;
+}
+#endif
+
 /*
  * Returns a Tx hash for the given packet when dev->real_num_tx_queues is used
  * as a distribution range limit for the returned value.
diff --git a/net/core/dev.c b/net/core/dev.c
index b31a9ff..4c0981b 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1728,6 +1728,161 @@ static void netif_setup_tc(struct net_device *dev, unsigned int txq)
 	}
 }
 
+#ifdef CONFIG_XPS
+static DEFINE_MUTEX(xps_map_mutex);
+#define xmap_dereference(P)		\
+	rcu_dereference_protected((P), lockdep_is_held(&xps_map_mutex))
+
+void netif_reset_xps_queue(struct net_device *dev, u16 index)
+{
+	struct xps_dev_maps *dev_maps;
+	struct xps_map *map;
+	int i, pos, nonempty = 0;
+
+	mutex_lock(&xps_map_mutex);
+	dev_maps = xmap_dereference(dev->xps_maps);
+
+	if (!dev_maps)
+		goto out_no_maps;
+
+	for_each_possible_cpu(i) {
+		map = xmap_dereference(dev_maps->cpu_map[i]);
+		if (!map)
+			continue;
+
+		for (pos = 0; pos < map->len; pos++)
+			if (map->queues[pos] == index)
+				break;
+
+		if (pos < map->len) {
+			if (map->len > 1) {
+				map->queues[pos] = map->queues[--map->len];
+			} else {
+				RCU_INIT_POINTER(dev_maps->cpu_map[i], NULL);
+				kfree_rcu(map, rcu);
+				map = NULL;
+			}
+		}
+		if (map)
+			nonempty = 1;
+	}
+
+	if (!nonempty) {
+		RCU_INIT_POINTER(dev->xps_maps, NULL);
+		kfree_rcu(dev_maps, rcu);
+	}
+
+out_no_maps:
+	mutex_unlock(&xps_map_mutex);
+}
+
+int netif_set_xps_queue(struct net_device *dev, struct cpumask *mask, u16 index)
+{
+	int i, cpu, pos, map_len, alloc_len, need_set;
+	struct xps_map *map, *new_map;
+	struct xps_dev_maps *dev_maps, *new_dev_maps;
+	int nonempty = 0;
+	int numa_node_id = -2;
+	int maps_sz = max_t(unsigned int, XPS_DEV_MAPS_SIZE, L1_CACHE_BYTES);
+
+	new_dev_maps = kzalloc(maps_sz, GFP_KERNEL);
+	if (!new_dev_maps)
+		return -ENOMEM;
+
+	mutex_lock(&xps_map_mutex);
+
+	dev_maps = xmap_dereference(dev->xps_maps);
+
+	for_each_possible_cpu(cpu) {
+		map = dev_maps ?
+			xmap_dereference(dev_maps->cpu_map[cpu]) : NULL;
+		new_map = map;
+		if (map) {
+			for (pos = 0; pos < map->len; pos++)
+				if (map->queues[pos] == index)
+					break;
+			map_len = map->len;
+			alloc_len = map->alloc_len;
+		} else
+			pos = map_len = alloc_len = 0;
+
+		need_set = cpumask_test_cpu(cpu, mask) && cpu_online(cpu);
+#ifdef CONFIG_NUMA
+		if (need_set) {
+			if (numa_node_id == -2)
+				numa_node_id = cpu_to_node(cpu);
+			else if (numa_node_id != cpu_to_node(cpu))
+				numa_node_id = -1;
+		}
+#endif
+		if (need_set && pos >= map_len) {
+			/* Need to add queue to this CPU's map */
+			if (map_len >= alloc_len) {
+				alloc_len = alloc_len ?
+				    2 * alloc_len : XPS_MIN_MAP_ALLOC;
+				new_map = kzalloc_node(XPS_MAP_SIZE(alloc_len),
+						       GFP_KERNEL,
+						       cpu_to_node(cpu));
+				if (!new_map)
+					goto error;
+				new_map->alloc_len = alloc_len;
+				for (i = 0; i < map_len; i++)
+					new_map->queues[i] = map->queues[i];
+				new_map->len = map_len;
+			}
+			new_map->queues[new_map->len++] = index;
+		} else if (!need_set && pos < map_len) {
+			/* Need to remove queue from this CPU's map */
+			if (map_len > 1)
+				new_map->queues[pos] =
+				    new_map->queues[--new_map->len];
+			else
+				new_map = NULL;
+		}
+		RCU_INIT_POINTER(new_dev_maps->cpu_map[cpu], new_map);
+	}
+
+	/* Cleanup old maps */
+	for_each_possible_cpu(cpu) {
+		map = dev_maps ?
+			xmap_dereference(dev_maps->cpu_map[cpu]) : NULL;
+		if (map && xmap_dereference(new_dev_maps->cpu_map[cpu]) != map)
+			kfree_rcu(map, rcu);
+		if (new_dev_maps->cpu_map[cpu])
+			nonempty = 1;
+	}
+
+	if (nonempty) {
+		rcu_assign_pointer(dev->xps_maps, new_dev_maps);
+	} else {
+		kfree(new_dev_maps);
+		RCU_INIT_POINTER(dev->xps_maps, NULL);
+	}
+
+	if (dev_maps)
+		kfree_rcu(dev_maps, rcu);
+
+	netdev_queue_numa_node_write(netdev_get_tx_queue(dev, index),
+				     (numa_node_id >= 0) ? numa_node_id :
+				     NUMA_NO_NODE);
+
+	mutex_unlock(&xps_map_mutex);
+
+	return 0;
+error:
+	mutex_unlock(&xps_map_mutex);
+
+	if (new_dev_maps)
+		for_each_possible_cpu(i)
+			kfree(rcu_dereference_protected(
+				new_dev_maps->cpu_map[i],
+				1));
+	kfree(new_dev_maps);
+	return -ENOMEM;
+}
+EXPORT_SYMBOL(netif_set_xps_queue);
+
+#endif
 /*
  * Routine to help set real_num_tx_queues. To avoid skbs mapped to queues
  * greater then real_num_tx_queues stale skbs on the qdisc must be flushed.
diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index 7260717..092d338 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -963,54 +963,14 @@ static ssize_t show_xps_map(struct netdev_queue *queue,
 	return len;
 }
 
-static DEFINE_MUTEX(xps_map_mutex);
-#define xmap_dereference(P)		\
-	rcu_dereference_protected((P), lockdep_is_held(&xps_map_mutex))
-
 static void xps_queue_release(struct netdev_queue *queue)
 {
 	struct net_device *dev = queue->dev;
-	struct xps_dev_maps *dev_maps;
-	struct xps_map *map;
 	unsigned long index;
-	int i, pos, nonempty = 0;
 
 	index = get_netdev_queue_index(queue);
 
-	mutex_lock(&xps_map_mutex);
-	dev_maps = xmap_dereference(dev->xps_maps);
-
-	if (dev_maps) {
-		for_each_possible_cpu(i) {
-			map = xmap_dereference(dev_maps->cpu_map[i]);
-			if (!map)
-				continue;
-
-			for (pos = 0; pos < map->len; pos++)
-				if (map->queues[pos] == index)
-					break;
-
-			if (pos < map->len) {
-				if (map->len > 1)
-					map->queues[pos] =
-					    map->queues[--map->len];
-				else {
-					RCU_INIT_POINTER(dev_maps->cpu_map[i],
-					    NULL);
-					kfree_rcu(map, rcu);
-					map = NULL;
-				}
-			}
-			if (map)
-				nonempty = 1;
-		}
-
-		if (!nonempty) {
-			RCU_INIT_POINTER(dev->xps_maps, NULL);
-			kfree_rcu(dev_maps, rcu);
-		}
-	}
-	mutex_unlock(&xps_map_mutex);
+	netif_reset_xps_queue(dev, index);
 }
 
 static ssize_t store_xps_map(struct netdev_queue *queue,
@@ -1018,13 +978,9 @@ static ssize_t store_xps_map(struct netdev_queue *queue,
 		      const char *buf, size_t len)
 {
 	struct net_device *dev = queue->dev;
-	cpumask_var_t mask;
-	int err, i, cpu, pos, map_len, alloc_len, need_set;
 	unsigned long index;
-	struct xps_map *map, *new_map;
-	struct xps_dev_maps *dev_maps, *new_dev_maps;
-	int nonempty = 0;
-	int numa_node_id = -2;
+	cpumask_var_t mask;
+	int err;
 
 	if (!capable(CAP_NET_ADMIN))
 		return -EPERM;
@@ -1040,105 +996,11 @@ static ssize_t store_xps_map(struct netdev_queue *queue,
 		return err;
 	}
 
-	new_dev_maps = kzalloc(max_t(unsigned int,
-	    XPS_DEV_MAPS_SIZE, L1_CACHE_BYTES), GFP_KERNEL);
-	if (!new_dev_maps) {
-		free_cpumask_var(mask);
-		return -ENOMEM;
-	}
-
-	mutex_lock(&xps_map_mutex);
-
-	dev_maps = xmap_dereference(dev->xps_maps);
-
-	for_each_possible_cpu(cpu) {
-		map = dev_maps ?
-			xmap_dereference(dev_maps->cpu_map[cpu]) : NULL;
-		new_map = map;
-		if (map) {
-			for (pos = 0; pos < map->len; pos++)
-				if (map->queues[pos] == index)
-					break;
-			map_len = map->len;
-			alloc_len = map->alloc_len;
-		} else
-			pos = map_len = alloc_len = 0;
-
-		need_set = cpumask_test_cpu(cpu, mask) && cpu_online(cpu);
-#ifdef CONFIG_NUMA
-		if (need_set) {
-			if (numa_node_id == -2)
-				numa_node_id = cpu_to_node(cpu);
-			else if (numa_node_id != cpu_to_node(cpu))
-				numa_node_id = -1;
-		}
-#endif
-		if (need_set && pos >= map_len) {
-			/* Need to add queue to this CPU's map */
-			if (map_len >= alloc_len) {
-				alloc_len = alloc_len ?
-				    2 * alloc_len : XPS_MIN_MAP_ALLOC;
-				new_map = kzalloc_node(XPS_MAP_SIZE(alloc_len),
-						       GFP_KERNEL,
-						       cpu_to_node(cpu));
-				if (!new_map)
-					goto error;
-				new_map->alloc_len = alloc_len;
-				for (i = 0; i < map_len; i++)
-					new_map->queues[i] = map->queues[i];
-				new_map->len = map_len;
-			}
-			new_map->queues[new_map->len++] = index;
-		} else if (!need_set && pos < map_len) {
-			/* Need to remove queue from this CPU's map */
-			if (map_len > 1)
-				new_map->queues[pos] =
-				    new_map->queues[--new_map->len];
-			else
-				new_map = NULL;
-		}
-		RCU_INIT_POINTER(new_dev_maps->cpu_map[cpu], new_map);
-	}
-
-	/* Cleanup old maps */
-	for_each_possible_cpu(cpu) {
-		map = dev_maps ?
-			xmap_dereference(dev_maps->cpu_map[cpu]) : NULL;
-		if (map && xmap_dereference(new_dev_maps->cpu_map[cpu]) != map)
-			kfree_rcu(map, rcu);
-		if (new_dev_maps->cpu_map[cpu])
-			nonempty = 1;
-	}
-
-	if (nonempty) {
-		rcu_assign_pointer(dev->xps_maps, new_dev_maps);
-	} else {
-		kfree(new_dev_maps);
-		RCU_INIT_POINTER(dev->xps_maps, NULL);
-	}
-
-	if (dev_maps)
-		kfree_rcu(dev_maps, rcu);
-
-	netdev_queue_numa_node_write(queue, (numa_node_id >= 0) ? numa_node_id :
-					    NUMA_NO_NODE);
-
-	mutex_unlock(&xps_map_mutex);
+	err = netif_set_xps_queue(dev, mask, index);
 
 	free_cpumask_var(mask);
-	return len;
 
-error:
-	mutex_unlock(&xps_map_mutex);
-
-	if (new_dev_maps)
-		for_each_possible_cpu(i)
-			kfree(rcu_dereference_protected(
-				new_dev_maps->cpu_map[i],
-				1));
-	kfree(new_dev_maps);
-	free_cpumask_var(mask);
-	return -ENOMEM;
+	return err ? : len;
 }
 
 static struct netdev_queue_attribute xps_cpus_attribute =


* [RFC PATCH 01/10] net: Split core bits of dev_pick_tx into __dev_pick_tx
From: Alexander Duyck @ 2012-06-30  0:16 UTC (permalink / raw)
  To: netdev
  Cc: davem, jeffrey.t.kirsher, edumazet, bhutchings, therbert,
	alexander.duyck
In-Reply-To: <20120630000652.29939.11108.stgit@gitlad.jf.intel.com>

This change splits the core bits of dev_pick_tx into a separate function.
The main idea behind this is to make this code accessible to select queue
functions when they decide to process the standard path instead of their
own custom path in their select queue routine.
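
As a rough illustration of the intended use, here is a minimal userspace
sketch of a select queue routine that handles one custom case and falls
back to the standard path for everything else. All toy_* names and fields
are hypothetical stand-ins for the kernel structures. The sketch leaves out
the XPS lookup and the socket cache write-back that the real
__dev_pick_tx performs, and it uses a plain modulo where the kernel uses
skb_tx_hash.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel's sk_buff and net_device. */
struct toy_skb {
	int cached_queue;	/* models sk_tx_queue_get(); -1 when unset */
	bool ooo_okay;		/* models skb->ooo_okay */
	bool is_special;	/* hypothetical: traffic the driver steers itself */
	unsigned int hash;	/* models the flow hash behind skb_tx_hash() */
};

struct toy_dev {
	int real_num_tx_queues;
	int special_queue;	/* hypothetical queue reserved by the driver */
};

/* Models the standard path: keep a valid cached queue, else rehash. */
static int toy_dev_pick_tx(const struct toy_dev *dev,
			   const struct toy_skb *skb)
{
	int queue_index = skb->cached_queue;

	assert(dev->real_num_tx_queues > 0);

	if (queue_index < 0 || skb->ooo_okay ||
	    queue_index >= dev->real_num_tx_queues)
		queue_index = (int)(skb->hash %
				    (unsigned int)dev->real_num_tx_queues);

	return queue_index;
}

/* Models a driver select queue routine with one custom case. */
static int toy_select_queue(const struct toy_dev *dev,
			    const struct toy_skb *skb)
{
	if (skb->is_special)
		return dev->special_queue;

	/* Everything else takes the shared standard path. */
	return toy_dev_pick_tx(dev, skb);
}
```

The point of the split is visible in toy_select_queue: the driver only
carries its custom case and does not need its own copy of the cached
queue and rehash logic.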

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
---

 include/linux/netdevice.h |    3 +++
 net/core/dev.c            |   51 ++++++++++++++++++++++++++-------------------
 2 files changed, 33 insertions(+), 21 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 2c2ecea..3329d70 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -2082,6 +2082,9 @@ static inline u16 skb_tx_hash(const struct net_device *dev,
 	return __skb_tx_hash(dev, skb, dev->real_num_tx_queues);
 }
 
+extern int __dev_pick_tx(const struct net_device *dev,
+			 const struct sk_buff *skb);
+
 /**
  *	netif_is_multiqueue - test if device has multiple transmit queues
  *	@dev: network device
diff --git a/net/core/dev.c b/net/core/dev.c
index 57c4f9b..b31a9ff 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2301,7 +2301,8 @@ static inline u16 dev_cap_txqueue(struct net_device *dev, u16 queue_index)
 	return queue_index;
 }
 
-static inline int get_xps_queue(struct net_device *dev, struct sk_buff *skb)
+static inline int get_xps_queue(const struct net_device *dev,
+				const struct sk_buff *skb)
 {
 #ifdef CONFIG_XPS
 	struct xps_dev_maps *dev_maps;
@@ -2339,11 +2340,37 @@ static inline int get_xps_queue(struct net_device *dev, struct sk_buff *skb)
 #endif
 }
 
+int __dev_pick_tx(const struct net_device *dev, const struct sk_buff *skb)
+{
+	struct sock *sk = skb->sk;
+	int queue_index = sk_tx_queue_get(sk);
+
+	if (queue_index < 0 || skb->ooo_okay ||
+	    queue_index >= dev->real_num_tx_queues) {
+		int old_index = queue_index;
+
+		queue_index = get_xps_queue(dev, skb);
+		if (queue_index < 0)
+			queue_index = skb_tx_hash(dev, skb);
+
+		if (queue_index != old_index && sk) {
+			struct dst_entry *dst =
+			    rcu_dereference_check(sk->sk_dst_cache, 1);
+
+			if (dst && skb_dst(skb) == dst)
+				sk_tx_queue_set(sk, queue_index);
+		}
+	}
+
+	return queue_index;
+}
+EXPORT_SYMBOL(__dev_pick_tx);
+
 static struct netdev_queue *dev_pick_tx(struct net_device *dev,
 					struct sk_buff *skb)
 {
-	int queue_index;
 	const struct net_device_ops *ops = dev->netdev_ops;
+	int queue_index;
 
 	if (dev->real_num_tx_queues == 1)
 		queue_index = 0;
@@ -2351,25 +2378,7 @@ static struct netdev_queue *dev_pick_tx(struct net_device *dev,
 		queue_index = ops->ndo_select_queue(dev, skb);
 		queue_index = dev_cap_txqueue(dev, queue_index);
 	} else {
-		struct sock *sk = skb->sk;
-		queue_index = sk_tx_queue_get(sk);
-
-		if (queue_index < 0 || skb->ooo_okay ||
-		    queue_index >= dev->real_num_tx_queues) {
-			int old_index = queue_index;
-
-			queue_index = get_xps_queue(dev, skb);
-			if (queue_index < 0)
-				queue_index = skb_tx_hash(dev, skb);
-
-			if (queue_index != old_index && sk) {
-				struct dst_entry *dst =
-				    rcu_dereference_check(sk->sk_dst_cache, 1);
-
-				if (dst && skb_dst(skb) == dst)
-					sk_tx_queue_set(sk, queue_index);
-			}
-		}
+		queue_index = __dev_pick_tx(dev, skb);
 	}
 
 	skb_set_queue_mapping(skb, queue_index);


* [RFC PATCH 00/10] Make XPS usable within ixgbe
From: Alexander Duyck @ 2012-06-30  0:16 UTC (permalink / raw)
  To: netdev
  Cc: davem, jeffrey.t.kirsher, edumazet, bhutchings, therbert,
	alexander.duyck

The following patch series makes it so that the ixgbe driver can support
ATR even when the number of queues is less than the number of CPUs.  To do
this I have updated the kernel to let drivers set their own XPS
configuration, which required moving the code out of the sysfs-specific
code and into the dev-specific regions.
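
To make the "fewer queues than CPUs" case concrete, here is a minimal
userspace sketch of one possible mapping policy. It is an assumption for
illustration only and is not taken from the ixgbe patches in this series:
CPU c is served by queue (c % num_queues), and the helper returns the
bitmask of CPUs for one queue. The toy_* names are hypothetical; a driver
would pass an equivalent cpumask to netif_set_xps_queue for each queue.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_MAX_CPUS 64

/*
 * Hypothetical policy: CPU c is served by queue (c % num_queues).
 * Returns a bitmask with one bit set per CPU assigned to queue_index.
 */
static uint64_t toy_xps_mask_for_queue(unsigned int num_cpus,
				       unsigned int num_queues,
				       unsigned int queue_index)
{
	uint64_t mask = 0;
	unsigned int cpu;

	assert(num_cpus <= TOY_MAX_CPUS);
	assert(num_queues > 0 && queue_index < num_queues);

	/* Walk the CPUs that share this queue: index, index + n, ... */
	for (cpu = queue_index; cpu < num_cpus; cpu += num_queues)
		mask |= UINT64_C(1) << cpu;

	return mask;
}
```

With 8 CPUs and 4 queues, queue 0 would cover CPUs 0 and 4, queue 1
would cover CPUs 1 and 5, and so on, so every CPU has exactly one queue.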

I am still working out a few issues such as the fact that with routing I
only ever seem to be able to get the first queue that is mapped to the CPU
when XPS is enabled.

Also, I am looking for input on whether it is acceptable to only let the
set_channels/get_channels calls report/set the number of queues per traffic
class. I implemented the code this way to avoid any significant conflicts
between the DCB traffic classes code and these functions.

---

Alexander Duyck (10):
      ixgbe: Add support for set_channels ethtool operation
      ixgbe: Add support for displaying the number of Tx/Rx channels
      ixgbe: Update ixgbe driver to use __dev_pick_tx in ixgbe_select_queue
      ixgbe: Add function for setting XPS queue mapping
      ixgbe: Define FCoE and Flow director limits much sooner to allow for changes
      net: Add support for XPS without SYSFS being defined
      net: Rewrite netif_set_xps_queues to address several issues
      net: Rewrite netif_reset_xps_queue to allow for better code reuse
      net: Add functions netif_reset_xps_queue and netif_set_xps_queue
      net: Split core bits of dev_pick_tx into __dev_pick_tx

 drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c |  112 +++++++++
 drivers/net/ethernet/intel/ixgbe/ixgbe_lib.c     |   10 -
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c    |   48 +++-
 include/linux/netdevice.h                        |   15 +
 net/Kconfig                                      |    2 
 net/core/dev.c                                   |  283 ++++++++++++++++++++--
 net/core/net-sysfs.c                             |  160 ------------
 7 files changed, 428 insertions(+), 202 deletions(-)

-- 
Thanks,

Alex


* Re: AF_BUS socket address family
From: Benjamin LaHaise @ 2012-06-30  0:13 UTC (permalink / raw)
  To: Vincent Sanders; +Cc: David Miller, netdev, linux-kernel
In-Reply-To: <20120629234230.GA11480@kyllikki.org>

On Sat, Jun 30, 2012 at 12:42:30AM +0100, Vincent Sanders wrote:
> The current users are suffering from the issues outlined in my
> introductory mail all the time. These issues are caused by emulating an
> IPC system over AF_UNIX in userspace.

Nothing in your introductory statements indicates how your requirements
can't be met through a hybrid socket + shared memory solution.  The IPC 
facilities of the kernel are already quite rich, and sufficient for 
building many kinds of complex systems.  What's so different about DBus' 
requirements?

		-ben
-- 
"Thought is the essence of where you are now."


* Re: AF_BUS socket address family
From: Vincent Sanders @ 2012-06-30  0:09 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, linux-kernel
In-Reply-To: <20120629.165023.1605284574408858612.davem@davemloft.net>

On Fri, Jun 29, 2012 at 04:50:23PM -0700, David Miller wrote:
> From: Vincent Sanders <vincent.sanders@collabora.co.uk>
> Date: Sat, 30 Jun 2012 00:42:30 +0100
> 
> > Basically you are indicating you would be completely opposed to any
> > mechanism involving D-Bus IPC and the kernel? 
> 
> I would not oppose existing mechanisms, which I do not believe are
> impossible to use in your scenario.
> 

You keep saying that, yet have offered no concrete way to achieve the
semantics we require. To pass fd and credentials currently *requires*
the use of AF_UNIX, does it not? And D-Bus already emulates its IPC
over AF_UNIX because of that.

> What you really don't get is that packet drops and event losses are
> absolutely fundamental.

Not within an IPC, surely? There cannot be packet drops within AF_BUS;
we simply do not do it. Each receive queue is checked for its ability
to receive the message before it is delivered, to all of them or to none.

> 
> As long as receivers lack infinite receive queue this will always be
> the case.

Indeed, I would not question that.

> 
> Multicast operates in non-reliable transports only so that one stuck
> or malfunctioning receiver doesn't screw things over for everyone nor
> unduly burden the sender.
> 

We have addressed this within AF_BUS by the receiver and bus master
being told if not all recipients can receive the message (and therefore
it cannot be sent).
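
The all-or-none check described above can be sketched in userspace as a
two-pass loop. This is only an illustration under stated assumptions and
not the AF_BUS implementation: the toy_* names are hypothetical, queues
are counted in bytes, and a real version would have to hold the relevant
locks so that no queue can fill up between the check and the commit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical receiver with a bounded receive queue, counted in bytes. */
struct toy_receiver {
	size_t queued;
	size_t capacity;
};

/*
 * Deliver msg_len bytes to every receiver or to none.
 * Pass 1 checks that each queue has room; pass 2 commits.
 * Returns true on delivery, false if any receiver lacked room.
 */
static bool toy_bus_multicast(struct toy_receiver *rcv, size_t n,
			      size_t msg_len)
{
	size_t i;

	for (i = 0; i < n; i++) {
		assert(rcv[i].queued <= rcv[i].capacity);
		if (rcv[i].capacity - rcv[i].queued < msg_len)
			return false;	/* nothing was queued anywhere */
	}

	for (i = 0; i < n; i++)
		rcv[i].queued += msg_len;

	return true;
}
```

A false return is the point at which userspace policy takes over: no
receiver has seen the message, so ordering is still intact and the
caller can decide whether to retry, wait, or drop a stuck peer.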

The policy decision of how to handle this situation is therefore
handled by the userspace clients on a protocol level. D-Bus *already*
has to handle this situation; it's just currently done over AF_UNIX
sockets, so once it occurs the problem is harder to rectify as the
ordering constraint is broken (which causes even more issues).

I am afraid it is rather late here and I may not be able to continue
this conversation until the morning. I apologise if this is
inconvenient, but I must sleep.

-- 
Regards Vincent

