Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH] bonding: fix arp_validate on bonds inside a bridge
From: Jay Vosburgh @ 2010-04-28 19:05 UTC (permalink / raw)
  To: Jiri Bohac; +Cc: bonding-devel, netdev
In-Reply-To: <20100428125940.GB13400@midget.suse.cz>

Jiri Bohac <jbohac@suse.cz> wrote:

>bonding with arp_validate does not currently work when the
>bonding master is part of a bridge. This is because
>bond_arp_rcv() is registered as a packet type handler for ARP,
>but before netif_receive_skb() processes the ptype_base hash
>table, handle_bridge() is called and changes the skb->dev to
>point to the bridge device.
>
>This patch makes bonding_should_drop() call the bonding ARP
>handler directly if a IFF_MASTER_NEEDARP flag is set on the
>bonding master.  bond_register_arp() now only needs to set the
>IFF_MASTER_NEEDARP flag.
>
>We ne longer need special ARP handling for inactive slaves, hence
>IFF_SLAVE_NEEDARP is not needed.
>
>skb_reset_network_header() and skb_reset_transport_header() need
>to be called before the call to bonding_should_drop() because
>bond_handle_arp() needs the offsets initialized.
>
>P.S.: bonding_should_drop() should probably be renamed to
>handle_bonding() -- we already have handle_bridge() and
>handle_macvlan(), and bonding_should_drop() has long been doing
>other stuff than deciding which packets to drop...

	I agree, and I have code that I've been working on locally that
wraps the "should_drop" into a hook, similar to the bridge and macvlan
hooks that already exist.  I used different names, though.  I've got
bond_handle_frame (and bond_handle_frame_hook) bond_main.c and
handle_bonding in net/core/dev.c, where the implementation for
handle_bonding parallels that of bridge and macvlan:

#if defined(CONFIG_BONDING) || defined(CONFIG_BONDING_MODULE)
int (*bonding_handle_frame_hook)(struct sk_buff *skb) __read_mostly;
EXPORT_SYMBOL_GPL(bonding_handle_frame_hook);

static inline int handle_bonding(struct sk_buff *skb)
{
	struct net_device *dev = skb->dev;
	struct net_device *master = dev->master;

	if (master->priv_flags & IFF_MASTER_ML)
		return bonding_handle_frame_hook(skb);

	return skb_bond_should_drop(skb);
}
#else
#define handle_bonding(skb)	(skb)
#endif

	I think the structure you're using (skb_bond_should_drop calls
the hook as needed) may be better overall, as it doesn't require special
logic in the VLAN cases.  The disadvantage is that it's a little uglier
to hide the hook declaration behind #ifdef CONFIG_BONDING (more on that
below).

	The code I'm working on now doesn't just hook for ARP (I'm
working on a load-balance by subnet mode that needs to assign skb->dev
by destination to permit the slaves to operate independently from the
master; but that's another topic).  I did leave as much as possible to
the priv_flags, since that is less expensive to process than poking
around in the bonding structures (no locks for the priv_flags).

	I haven't tested your patch yet, but it looks like it should
work as advertised.

	I have a couple of minor comments, below, but nothing
substantive about the core of the changes.

Signed-off-by: Jay Vosburgh <fubar@us.ibm.com>

>Signed-off-by: Jiri Bohac <jbohac@suse.cz>
>
>diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
>index 0075514..cafd404 100644
>--- a/drivers/net/bonding/bond_main.c
>+++ b/drivers/net/bonding/bond_main.c
>@@ -1940,8 +1940,7 @@ int bond_release(struct net_device *bond_dev, struct net_device *slave_dev)
> 	}
>
> 	slave_dev->priv_flags &= ~(IFF_MASTER_8023AD | IFF_MASTER_ALB |
>-				   IFF_SLAVE_INACTIVE | IFF_BONDING |
>-				   IFF_SLAVE_NEEDARP);
>+				   IFF_SLAVE_INACTIVE | IFF_BONDING);
>
> 	kfree(slave);
>
>@@ -2612,11 +2611,12 @@ static void bond_validate_arp(struct bonding *bond, struct slave *slave, __be32
> 	}
> }
>
>-static int bond_arp_rcv(struct sk_buff *skb, struct net_device *dev, struct packet_type *pt, struct net_device *orig_dev)
>+static void bond_handle_arp(struct sk_buff *skb)
> {
> 	struct arphdr *arp;
> 	struct slave *slave;
> 	struct bonding *bond;
>+	struct net_device *dev = skb->dev->master, *orig_dev = skb->dev;
> 	unsigned char *arp_ptr;
> 	__be32 sip, tip;
>
>@@ -2637,9 +2637,8 @@ static int bond_arp_rcv(struct sk_buff *skb, struct net_device *dev, struct pack
> 	bond = netdev_priv(dev);
> 	read_lock(&bond->lock);
>
>-	pr_debug("bond_arp_rcv: bond %s skb->dev %s orig_dev %s\n",
>-		 bond->dev->name, skb->dev ? skb->dev->name : "NULL",
>-		 orig_dev ? orig_dev->name : "NULL");
>+	pr_debug("bond_handle_arp: bond: %s, master: %s, slave: %s\n",
>+		bond->dev->name, dev->name, orig_dev->name);
>
> 	slave = bond_get_slave_by_dev(bond, orig_dev);
> 	if (!slave || !slave_do_arp_validate(bond, slave))
>@@ -2684,8 +2683,7 @@ static int bond_arp_rcv(struct sk_buff *skb, struct net_device *dev, struct pack
> out_unlock:
> 	read_unlock(&bond->lock);
> out:
>-	dev_kfree_skb(skb);
>-	return NET_RX_SUCCESS;
>+	return;
> }
>
> /*
>@@ -3567,23 +3565,12 @@ static void bond_unregister_lacpdu(struct bonding *bond)
>
> void bond_register_arp(struct bonding *bond)
> {
>-	struct packet_type *pt = &bond->arp_mon_pt;
>-
>-	if (pt->type)
>-		return;
>-
>-	pt->type = htons(ETH_P_ARP);
>-	pt->dev = bond->dev;
>-	pt->func = bond_arp_rcv;
>-	dev_add_pack(pt);
>+	bond->dev->priv_flags |= IFF_MASTER_NEEDARP;
> }
>
> void bond_unregister_arp(struct bonding *bond)
> {
>-	struct packet_type *pt = &bond->arp_mon_pt;
>-
>-	dev_remove_pack(pt);
>-	pt->type = 0;
>+	bond->dev->priv_flags &= ~IFF_MASTER_NEEDARP;
> }
>
> /*---------------------------- Hashing Policies -----------------------------*/
>@@ -5041,6 +5028,7 @@ static int __init bonding_init(void)
> 	register_netdevice_notifier(&bond_netdev_notifier);
> 	register_inetaddr_notifier(&bond_inetaddr_notifier);
> 	bond_register_ipv6_notifier();
>+	bond_handle_arp_hook = bond_handle_arp;
> out:
> 	return res;
> err:
>@@ -5061,6 +5049,7 @@ static void __exit bonding_exit(void)
>
> 	rtnl_link_unregister(&bond_link_ops);
> 	unregister_pernet_subsys(&bond_net_ops);
>+	bond_handle_arp_hook = NULL;
> }
>
> module_init(bonding_init);
>diff --git a/drivers/net/bonding/bonding.h b/drivers/net/bonding/bonding.h
>index 257a7a4..57adfe5 100644
>--- a/drivers/net/bonding/bonding.h
>+++ b/drivers/net/bonding/bonding.h
>@@ -212,7 +212,6 @@ struct bonding {
> 	struct   bond_params params;
> 	struct   list_head vlan_list;
> 	struct   vlan_group *vlgrp;
>-	struct   packet_type arp_mon_pt;
> 	struct   workqueue_struct *wq;
> 	struct   delayed_work mii_work;
> 	struct   delayed_work arp_work;
>@@ -292,14 +291,12 @@ static inline void bond_set_slave_inactive_flags(struct slave *slave)
> 	if (!bond_is_lb(bond))
> 		slave->state = BOND_STATE_BACKUP;
> 	slave->dev->priv_flags |= IFF_SLAVE_INACTIVE;
>-	if (slave_do_arp_validate(bond, slave))
>-		slave->dev->priv_flags |= IFF_SLAVE_NEEDARP;
> }
>
> static inline void bond_set_slave_active_flags(struct slave *slave)
> {
> 	slave->state = BOND_STATE_ACTIVE;
>-	slave->dev->priv_flags &= ~(IFF_SLAVE_INACTIVE | IFF_SLAVE_NEEDARP);
>+	slave->dev->priv_flags &= ~IFF_SLAVE_INACTIVE;
> }
>
> static inline void bond_set_master_3ad_flags(struct bonding *bond)
>diff --git a/include/linux/if.h b/include/linux/if.h
>index 3a9f410..84ab2c8 100644
>--- a/include/linux/if.h
>+++ b/include/linux/if.h
>@@ -63,7 +63,7 @@
> #define IFF_MASTER_8023AD	0x8	/* bonding master, 802.3ad. 	*/
> #define IFF_MASTER_ALB	0x10		/* bonding master, balance-alb.	*/
> #define IFF_BONDING	0x20		/* bonding master or slave	*/
>-#define IFF_SLAVE_NEEDARP 0x40		/* need ARPs for validation	*/
>+#define IFF_MASTER_NEEDARP 0x40		/* need ARPs for validation	*/
> #define IFF_ISATAP	0x80		/* ISATAP interface (RFC4214)	*/
> #define IFF_MASTER_ARPMON 0x100		/* bonding master, ARP mon in use */
> #define IFF_WAN_HDLC	0x200		/* WAN HDLC device		*/
>diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
>index fa8b476..9f82fc6 100644
>--- a/include/linux/netdevice.h
>+++ b/include/linux/netdevice.h
>@@ -2055,6 +2055,8 @@ static inline void skb_bond_set_mac_by_master(struct sk_buff *skb,
> 	}
> }
>
>+extern void (*bond_handle_arp_hook)(struct sk_buff *skb);

	Should this be inside the the skb_bond_should_drop function to
limit its scope?  Just wondering if that's a little tidier.

> /* On bonding slaves other than the currently active slave, suppress
>  * duplicates except for 802.3ad ETH_P_SLOW, alb non-mcast/bcast, and
>  * ARP on active-backup slaves with arp_validate enabled.
>@@ -2076,11 +2078,13 @@ static inline int skb_bond_should_drop(struct sk_buff *skb,
> 			skb_bond_set_mac_by_master(skb, master);
> 		}
>
>-		if (dev->priv_flags & IFF_SLAVE_INACTIVE) {
>-			if ((dev->priv_flags & IFF_SLAVE_NEEDARP) &&
>-			    skb->protocol == __cpu_to_be16(ETH_P_ARP))
>-				return 0;
>+		/* pass ARP frames directly to bonding
>+		   before bridging or other hooks change them */
>+		if ((master->priv_flags & IFF_MASTER_NEEDARP) &&
>+		    skb->protocol == __cpu_to_be16(ETH_P_ARP))
>+			bond_handle_arp_hook(skb);
>
>+		if (dev->priv_flags & IFF_SLAVE_INACTIVE) {
> 			if (master->priv_flags & IFF_MASTER_ALB) {
> 				if (skb->pkt_type != PACKET_BROADCAST &&
> 				    skb->pkt_type != PACKET_MULTICAST)
>diff --git a/net/core/dev.c b/net/core/dev.c
>index f769098..98d85a8 100644
>--- a/net/core/dev.c
>+++ b/net/core/dev.c
>@@ -2314,6 +2314,9 @@ static inline int deliver_skb(struct sk_buff *skb,
> 	return pt_prev->func(skb, skb->dev, pt_prev, orig_dev);
> }
>
>+void (*bond_handle_arp_hook)(struct sk_buff *skb);
>+EXPORT_SYMBOL_GPL(bond_handle_arp_hook);

	Should this be hidden by

#if defined(CONFIG_BONDING) || defined(CONFIG_BONDING_MODULE)

	with some parallel changes to skb_bond_should_drop so it
vanishes if bonding is not configured?  Granted, distros will all turn
it on anyway, but the embedded size might be a bit smaller.

> #if defined(CONFIG_BRIDGE) || defined (CONFIG_BRIDGE_MODULE)
>
> #if defined(CONFIG_ATM_LANE) || defined(CONFIG_ATM_LANE_MODULE)
>@@ -2507,6 +2510,10 @@ int netif_receive_skb(struct sk_buff *skb)
> 	if (!skb->skb_iif)
> 		skb->skb_iif = skb->dev->ifindex;
>
>+	skb_reset_network_header(skb);
>+	skb_reset_transport_header(skb);
>+	skb->mac_len = skb->network_header - skb->mac_header;
>+
> 	null_or_orig = NULL;
> 	orig_dev = skb->dev;
> 	master = ACCESS_ONCE(orig_dev->master);
>@@ -2519,10 +2526,6 @@ int netif_receive_skb(struct sk_buff *skb)
>
> 	__get_cpu_var(netdev_rx_stat).total++;
>
>-	skb_reset_network_header(skb);
>-	skb_reset_transport_header(skb);
>-	skb->mac_len = skb->network_header - skb->mac_header;
>-
> 	pt_prev = NULL;
>
> 	rcu_read_lock();

	-J

---
	-Jay Vosburgh, IBM Linux Technology Center, fubar@us.ibm.com

^ permalink raw reply

* Re: [net-next-2.6 PATCH 2/2] add ndo_set_port_profile op support for enic dynamic vnics
From: Arnd Bergmann @ 2010-04-28 19:16 UTC (permalink / raw)
  To: Scott Feldman; +Cc: davem, netdev, chrisw
In-Reply-To: <C7FDCED5.2C841%scofeldm@cisco.com>

On Wednesday 28 April 2010, Scott Feldman wrote:
> > I thought you had meant that we can do the association of attached interfaces
> > through any interface, rather than tying it to the slave interface. While I'm
> > not sure I read your code correctly, it seems like you now only talk to the
> > slave interface, not to the master at all!
> > 
> > At least the check above should be 'if (enic_is_dynamic(enic)) return
> > -EOPNOTSUPP', not the other way round.
> > Moreover, if the netdev is the master here, you only allow a single slave,
> > which is not enough for larger setups (n > 1), though that could be a
> > limitation of your first version.
> 
> The code is correct.  I probably confused you with earlier patches trying to
> accommodate master/slave devices and you might have assumed enic was such a
> device.  But it's not.  For enic, there are two device IDs, let's call one
> "static" and the other "dynamic".  The only difference between the two is
> static enics load up fully ready to go just like a normal nic, whereas
> dynamic enics load up but can't yet pass traffic because they're not
> "plugged in" to the network.  To plug them in, you need to associate a
> port-profile.  The physical analogy is this: server admin tells network
> admin: plug my nic into a switch port with these characteristics.  Here, the
> port-profile describes those switch port characteristics.  Now, there is no
> master/slave relationship between static and dynamic enics.  There could be
> with a simple firmware update, but it's not there today.  Also, I want to
> point out that a single phys Cisco nic can be provisioned to expose many
> static and/or many dynamic enics to the host.  On the order of 100s.  The
> code above is to block port-profile association on static enics.  Static
> enics where already provisioned on the network when created so there is no
> need for a port-profile push from the host.

Ok, I see. I had asked this before, but never got a definite reply on this.

> > Passing just the slave device however would not work in the general case, as I
> > tried to point out in the mail you replied to. If the slave interface is owned
> > by a guest using PCI passthrough, or it sits below a stack of nested
> > interfaces
> > (vlan, bridge, tap, vhost, ...), it's impossible to know what interface is
> > responsible for setting up the slave.
> 
> For port-profile, we want to pass the device that is to be "plugged-in" to
> the network based on port-profile association.  This is the device that
> gives basic connectivity to the guest interface, regardless of how the guest
> interface is wired to the device.  It could be direct PCI pass-thru, macvtap
> stack, some yet-to-be-invented kernel-bypass stack, etc.

But if the device is already passed to the guest using pass-thru or containers,
you would no longer to query or change the port profile, because it is no
longer visible in the host, right?
 
> > Note that you cannot perform the association
> > through the slave interface itself because the remote switch would discard any
> > traffic originating from an unassociated interface.
> 
> That's not a limitation of our device/switch.

This seems to contradict what you write above, at least when you drop the
assumption that the protocol is implemented in the NIC firmware.
The switch obviously does not care about the interface name in Linux or
any of its data structures. What it cares about instead is the traffic on
the wire and which of its ports this takes place on.

When you create a new dynamic enic device or a macvtap port but not assoicate
it, the switch cannot allow this device to send or receive any traffic itself,
as you write above (not 'plugged in'). The application (or firmware, for that
matter) therefore needs to talk to the switch over an interface that is already
associated. With VDP, this is the base device that a VEPA port is created from,
i.e. the one that talks LLDP to the switch, i.e. the one that comes up at boot
time when you have no virtualization and plug into a dumb switch.

I assumed that this was a specific PF in your NIC,  but it now sounds like it
could be an internal device that is only visible in your firmware and not exposed
as a network interface in Linux, right?

Your firmware can obviously find out the right communication channel for
a associating a dynamic interface with the switch, but when this is implemnted
in software, we cannot generally know that and rely on getting access to the
interface that lets us talk to the switch. The information which interface
is getting associated however is completely useless to an implementation like
this.

	Arnd

^ permalink raw reply

* Re: sctp patches for net-2.6
From: David Miller @ 2010-04-28 19:17 UTC (permalink / raw)
  To: vladislav.yasevich; +Cc: netdev, linux-sctp
In-Reply-To: <1272480442-32673-1-git-send-email-vladislav.yasevich@hp.com>

From: Vlad Yasevich <vladislav.yasevich@hp.com>
Date: Wed, 28 Apr 2010 14:47:17 -0400

> The following are the patches for the current net-2.6 tree that
> solve some critical issues.  Please consider pushing them to
> stable as well.

All applied and queued up for -stable, thanks!

^ permalink raw reply

* Re: [PATCH net-2.6 1/3] sfc: Wait at most 10ms for the MC to finish reading out MAC statistics
From: David Miller @ 2010-04-28 19:18 UTC (permalink / raw)
  To: bhutchings; +Cc: netdev, linux-net-drivers
In-Reply-To: <1272481235.2549.14.camel@achroite.uk.solarflarecom.com>

From: Ben Hutchings <bhutchings@solarflare.com>
Date: Wed, 28 Apr 2010 20:00:35 +0100

> The original code would wait indefinitely if MAC stats DMA failed.
> 
> Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
> Cc: stable@kernel.org

Applied.

^ permalink raw reply

* Re: [PATCH net-2.6 2/3] sfc: Always close net device at the end of a disabling reset
From: David Miller @ 2010-04-28 19:18 UTC (permalink / raw)
  To: bhutchings; +Cc: netdev, linux-net-drivers
In-Reply-To: <1272481293.2549.15.camel@achroite.uk.solarflarecom.com>

From: Ben Hutchings <bhutchings@solarflare.com>
Date: Wed, 28 Apr 2010 20:01:33 +0100

> This fixes a regression introduced by commit
> eb9f6744cbfa97674c13263802259b5aa0034594 "sfc: Implement ethtool
> reset operation".
> 
> Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
> Cc: stable@kernel.org

Applied.

^ permalink raw reply

* Re: [PATCH net-2.6 3/3] sfc: Change falcon_probe_board() to fail for unsupported boards
From: David Miller @ 2010-04-28 19:18 UTC (permalink / raw)
  To: bhutchings; +Cc: netdev, linux-net-drivers
In-Reply-To: <1272481310.2549.16.camel@achroite.uk.solarflarecom.com>

From: Ben Hutchings <bhutchings@solarflare.com>
Date: Wed, 28 Apr 2010 20:01:50 +0100

> The driver needs specific PHY and board support code for each SFC4000
> board; there is no point trying to continue if it is missing.
> Currently unsupported boards can trigger an 'oops'.
> 
> Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
> Cc: stable@kernel.org

Applied.

^ permalink raw reply

* [PATCH 01/17] sfc: Ignore parity errors in the other port's SRAM
From: Ben Hutchings @ 2010-04-28 19:25 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, linux-net-drivers

From: Steve Hodgson <shodgson@solarflare.com>

Siena has a separate SRAM bank for each port.  On single-port boards
these can be merged together, so each port has an interrupt flag for
parity errors in the other port's SRAM.  Currently we do not enable
such merging and should mask this interrupt source.

Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
---
 drivers/net/sfc/nic.c |    2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/drivers/net/sfc/nic.c b/drivers/net/sfc/nic.c
index b06f8e3..664fd6c 100644
--- a/drivers/net/sfc/nic.c
+++ b/drivers/net/sfc/nic.c
@@ -1563,6 +1563,8 @@ void efx_nic_init_common(struct efx_nic *efx)
 			     FRF_AZ_ILL_ADR_INT_KER_EN, 1,
 			     FRF_AZ_RBUF_OWN_INT_KER_EN, 1,
 			     FRF_AZ_TBUF_OWN_INT_KER_EN, 1);
+	if (efx_nic_rev(efx) >= EFX_REV_SIENA_A0)
+		EFX_SET_OWORD_FIELD(temp, FRF_CZ_SRAM_PERR_INT_P_KER_EN, 1);
 	EFX_INVERT_OWORD(temp);
 	efx_writeo(efx, &temp, FR_AZ_FATAL_INTR_KER);
 
-- 
1.6.2.5


-- 
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.


^ permalink raw reply related

* Re: Poor localhost net performance on recent stable kernel
From: Andrew Morton @ 2010-04-28 19:25 UTC (permalink / raw)
  To: Kelly Burkhart; +Cc: netdev, linux-kernel
In-Reply-To: <q2gfa1e4ce71004150844of9c3aaaeh5e5cc7c20d263a41@mail.gmail.com>

On Thu, 15 Apr 2010 10:44:44 -0500
Kelly Burkhart <kelly.burkhart@gmail.com> wrote:

> Hello,
> 
> While working on upgrading distributions, I've noticed that local
> network communication is much slower on 2.6.33.2 than on our old
> kernel 2.6.16.60 (sles 10.2).
> 
> Results of netperf, UDP_RR against localhost I get around 150000 tps
> on the new kernel vs. 290000 tps with the old kernel.  The netperf
> command:
> 
> netperf -T 1 -H 127.0.0.1 -t UDP_RR -c -C -- -r 100

I ran this command on a Red Hat 2.6.18-1.2868 kernel and on 2.6.34-rc5.

2.6.18-1.2868: 43903.29 per second
2.6.34-rc5:    72506.11 per second

IIRC, localhost communications have always exhibited quite large
variations between kernel versions depending on various vagaries
of alignemnt, cacheline sharing, etc.

> TCP_RR had similar results.  The problem did not exist with TCP_STREAM.
> 
> While trying to track this down, I wrote a test program that writes
> then reads a 32 bit integer to a pipe:
> 
> static void tst_pipe0( int sleep_us )
> {
>     int pipefd[2];
>     int idx;
>     uint32_t tarr[ITERS];
> 
>     printf("tst_pipe0 -- sleep %dus\n", sleep_us);
> 
>     if (pipe(pipefd) < 0)
>         err_exit("pipe");
> 
>     for(idx=0; idx<ITERS; ++idx) {
>         uint32_t btsc;
>         uint32_t rtsc;
>         uint32_t etsc;
>         get_tscl(btsc);
>         write(pipefd[1], (char *)&btsc, sizeof(btsc));
>         read(pipefd[0], (char *)&rtsc, sizeof(rtsc));
>         get_tscl(etsc);
>         tarr[idx] = etsc-btsc;
>         do_sleep(sleep_us);
>     }
>     prt_avg(tarr, ITERS);
>     close(pipefd[0]);
>     close(pipefd[1]);
>     printf("\n");
> }
> 
> There's a dramatic difference if there's a sleep between iterations on
> the new kernel.  On the old kernel the write/read round trip takes
> 1100-1300 cycles with or without sleep.  On the new kernel, with no
> sleep the round trip is about 1400 cycles.  It doubles with a 1us
> sleep then gradually increases to 12000-14000 cycles then stabilizes
> as I increase the sleep time to 1500us.  I'm not sure if this is
> related to the netperf difference or is a completely different
> scheduling issue.
> 
> I'm running on an Intel Xeon X5570 @ 2.93GHz.  Different tick/notick,
> preemption, HZ kernel config option values doesn't substantially change
> the magnitude of the difference.
> 
> Does anyone have any ideas regarding what could be causing the netperf
> issue?  And is the pipe microbenchmark meaningful and if so what does
> it mean?

Pipes don't share much code with udp-to-localhost - this is probably
something different.

If you were using two processes then I'd cheerily blame the scheduler. 
Because blaming the scheduler for WeirdShitWhichBroke is usually
correct.  But as you're using a single process then the pipe code
itself is a more likely source for any slowdowns.

As for the strange behavior with sleeps: dunno.  There are various
adjustments made to the sleep duration when performing short sleeps -
some in-kernel, perhaps some in glibc.  Plus we've been evolving the
internal implementation for sleeps, and changes in x86 clocksources and
NOHZ could impact the accuracy of the sleep duration.  So perhaps
what's happening is that different kernels are sleeping for different
durations when asked to sleep for short durations.

If it's not that then it's probably the scheduler ;) But even the
scheduler would have trouble causing these sorts of effects if the
machine is otherwise idle.


^ permalink raw reply

* [PATCH 02/17] sfc: Consistently report short MCDI responses as EIO
From: Ben Hutchings @ 2010-04-28 19:27 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, linux-net-drivers
In-Reply-To: <1272482722.2549.17.camel@achroite.uk.solarflarecom.com>

In some cases failing functions were returning 0 which is obviously wrong.
In other cases they were returning inappropriate error codes.

Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
---
 drivers/net/sfc/mcdi.c     |   22 ++++++++++++++--------
 drivers/net/sfc/mcdi_phy.c |    6 +++---
 2 files changed, 17 insertions(+), 11 deletions(-)

diff --git a/drivers/net/sfc/mcdi.c b/drivers/net/sfc/mcdi.c
index c48669c..1344afa 100644
--- a/drivers/net/sfc/mcdi.c
+++ b/drivers/net/sfc/mcdi.c
@@ -613,7 +613,7 @@ int efx_mcdi_fwver(struct efx_nic *efx, u64 *version, u32 *build)
 	}
 
 	if (outlength < MC_CMD_GET_VERSION_V1_OUT_LEN) {
-		rc = -EMSGSIZE;
+		rc = -EIO;
 		goto fail;
 	}
 
@@ -647,8 +647,10 @@ int efx_mcdi_drv_attach(struct efx_nic *efx, bool driver_operating,
 			  outbuf, sizeof(outbuf), &outlen);
 	if (rc)
 		goto fail;
-	if (outlen < MC_CMD_DRV_ATTACH_OUT_LEN)
+	if (outlen < MC_CMD_DRV_ATTACH_OUT_LEN) {
+		rc = -EIO;
 		goto fail;
+	}
 
 	if (was_attached != NULL)
 		*was_attached = MCDI_DWORD(outbuf, DRV_ATTACH_OUT_OLD_STATE);
@@ -676,7 +678,7 @@ int efx_mcdi_get_board_cfg(struct efx_nic *efx, u8 *mac_address,
 		goto fail;
 
 	if (outlen < MC_CMD_GET_BOARD_CFG_OUT_LEN) {
-		rc = -EMSGSIZE;
+		rc = -EIO;
 		goto fail;
 	}
 
@@ -738,8 +740,10 @@ int efx_mcdi_nvram_types(struct efx_nic *efx, u32 *nvram_types_out)
 			  outbuf, sizeof(outbuf), &outlen);
 	if (rc)
 		goto fail;
-	if (outlen < MC_CMD_NVRAM_TYPES_OUT_LEN)
+	if (outlen < MC_CMD_NVRAM_TYPES_OUT_LEN) {
+		rc = -EIO;
 		goto fail;
+	}
 
 	*nvram_types_out = MCDI_DWORD(outbuf, NVRAM_TYPES_OUT_TYPES);
 	return 0;
@@ -765,8 +769,10 @@ int efx_mcdi_nvram_info(struct efx_nic *efx, unsigned int type,
 			  outbuf, sizeof(outbuf), &outlen);
 	if (rc)
 		goto fail;
-	if (outlen < MC_CMD_NVRAM_INFO_OUT_LEN)
+	if (outlen < MC_CMD_NVRAM_INFO_OUT_LEN) {
+		rc = -EIO;
 		goto fail;
+	}
 
 	*size_out = MCDI_DWORD(outbuf, NVRAM_INFO_OUT_SIZE);
 	*erase_size_out = MCDI_DWORD(outbuf, NVRAM_INFO_OUT_ERASESIZE);
@@ -968,7 +974,7 @@ static int efx_mcdi_read_assertion(struct efx_nic *efx)
 	if (rc)
 		return rc;
 	if (outlen < MC_CMD_GET_ASSERTS_OUT_LEN)
-		return -EINVAL;
+		return -EIO;
 
 	/* Print out any recorded assertion state */
 	flags = MCDI_DWORD(outbuf, GET_ASSERTS_OUT_GLOBAL_FLAGS);
@@ -1086,7 +1092,7 @@ int efx_mcdi_wol_filter_set(struct efx_nic *efx, u32 type,
 		goto fail;
 
 	if (outlen < MC_CMD_WOL_FILTER_SET_OUT_LEN) {
-		rc = -EMSGSIZE;
+		rc = -EIO;
 		goto fail;
 	}
 
@@ -1121,7 +1127,7 @@ int efx_mcdi_wol_filter_get_magic(struct efx_nic *efx, int *id_out)
 		goto fail;
 
 	if (outlen < MC_CMD_WOL_FILTER_GET_OUT_LEN) {
-		rc = -EMSGSIZE;
+		rc = -EIO;
 		goto fail;
 	}
 
diff --git a/drivers/net/sfc/mcdi_phy.c b/drivers/net/sfc/mcdi_phy.c
index 2f23546..5d34487 100644
--- a/drivers/net/sfc/mcdi_phy.c
+++ b/drivers/net/sfc/mcdi_phy.c
@@ -48,7 +48,7 @@ efx_mcdi_get_phy_cfg(struct efx_nic *efx, struct efx_mcdi_phy_cfg *cfg)
 		goto fail;
 
 	if (outlen < MC_CMD_GET_PHY_CFG_OUT_LEN) {
-		rc = -EMSGSIZE;
+		rc = -EIO;
 		goto fail;
 	}
 
@@ -111,7 +111,7 @@ static int efx_mcdi_loopback_modes(struct efx_nic *efx, u64 *loopback_modes)
 		goto fail;
 
 	if (outlen < MC_CMD_GET_LOOPBACK_MODES_OUT_LEN) {
-		rc = -EMSGSIZE;
+		rc = -EIO;
 		goto fail;
 	}
 
@@ -587,7 +587,7 @@ static int efx_mcdi_phy_test_alive(struct efx_nic *efx)
 		return rc;
 
 	if (outlen < MC_CMD_GET_PHY_STATE_OUT_LEN)
-		return -EMSGSIZE;
+		return -EIO;
 	if (MCDI_DWORD(outbuf, GET_PHY_STATE_STATE) != MC_CMD_PHY_STATE_OK)
 		return -EINVAL;
 
-- 
1.6.2.5


-- 
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.


^ permalink raw reply related

* [PATCH 03/17] sfc: Handle serious errors in exactly one interrupt handler
From: Ben Hutchings @ 2010-04-28 19:27 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, linux-net-drivers
In-Reply-To: <1272482722.2549.17.camel@achroite.uk.solarflarecom.com>

From: Steve Hodgson <shodgson@solarflare.com>

'Fatal' errors set an interrupt flag associated with a specific event
queue; only read the syndrome vector if we see that queue's flag set
(legacy interrupts) or in the interrupt handler for that queue (MSI).

Do not ignore an interrupt if the fatal error flag is set but specific
error flags are all zero.  Even if we don't schedule a reset, we must
respect the queue mask and rearm the appropriate event queues.

Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
---
 drivers/net/sfc/falcon.c     |   13 ++++++++-----
 drivers/net/sfc/net_driver.h |    2 ++
 drivers/net/sfc/nic.c        |   35 +++++++++++++++++++----------------
 3 files changed, 29 insertions(+), 21 deletions(-)

diff --git a/drivers/net/sfc/falcon.c b/drivers/net/sfc/falcon.c
index d294d66..d09ad1b 100644
--- a/drivers/net/sfc/falcon.c
+++ b/drivers/net/sfc/falcon.c
@@ -175,16 +175,19 @@ irqreturn_t falcon_legacy_interrupt_a1(int irq, void *dev_id)
 	EFX_TRACE(efx, "IRQ %d on CPU %d status " EFX_OWORD_FMT "\n",
 		  irq, raw_smp_processor_id(), EFX_OWORD_VAL(*int_ker));
 
-	/* Check to see if we have a serious error condition */
-	syserr = EFX_OWORD_FIELD(*int_ker, FSF_AZ_NET_IVEC_FATAL_INT);
-	if (unlikely(syserr))
-		return efx_nic_fatal_interrupt(efx);
-
 	/* Determine interrupting queues, clear interrupt status
 	 * register and acknowledge the device interrupt.
 	 */
 	BUILD_BUG_ON(FSF_AZ_NET_IVEC_INT_Q_WIDTH > EFX_MAX_CHANNELS);
 	queues = EFX_OWORD_FIELD(*int_ker, FSF_AZ_NET_IVEC_INT_Q);
+
+	/* Check to see if we have a serious error condition */
+	if (queues & (1U << efx->fatal_irq_level)) {
+		syserr = EFX_OWORD_FIELD(*int_ker, FSF_AZ_NET_IVEC_FATAL_INT);
+		if (unlikely(syserr))
+			return efx_nic_fatal_interrupt(efx);
+	}
+
 	EFX_ZERO_OWORD(*int_ker);
 	wmb(); /* Ensure the vector is cleared before interrupt ack */
 	falcon_irq_ack_a1(efx);
diff --git a/drivers/net/sfc/net_driver.h b/drivers/net/sfc/net_driver.h
index cb018e2..70aea3a 100644
--- a/drivers/net/sfc/net_driver.h
+++ b/drivers/net/sfc/net_driver.h
@@ -672,6 +672,7 @@ union efx_multicast_hash {
  *	This register is written with the SMP processor ID whenever an
  *	interrupt is handled.  It is used by efx_nic_test_interrupt()
  *	to verify that an interrupt has occurred.
+ * @fatal_irq_level: IRQ level (bit number) used for serious errors
  * @spi_flash: SPI flash device
  *	This field will be %NULL if no flash device is present (or for Siena).
  * @spi_eeprom: SPI EEPROM device
@@ -756,6 +757,7 @@ struct efx_nic {
 	struct efx_buffer irq_status;
 	volatile signed int last_irq_cpu;
 	unsigned long irq_zero_count;
+	unsigned fatal_irq_level;
 
 	struct efx_spi_device *spi_flash;
 	struct efx_spi_device *spi_eeprom;
diff --git a/drivers/net/sfc/nic.c b/drivers/net/sfc/nic.c
index 664fd6c..23738f8 100644
--- a/drivers/net/sfc/nic.c
+++ b/drivers/net/sfc/nic.c
@@ -1229,15 +1229,9 @@ static inline void efx_nic_interrupts(struct efx_nic *efx,
 				      bool enabled, bool force)
 {
 	efx_oword_t int_en_reg_ker;
-	unsigned int level = 0;
-
-	if (EFX_WORKAROUND_17213(efx) && !EFX_INT_MODE_USE_MSI(efx))
-		/* Set the level always even if we're generating a test
-		 * interrupt, because our legacy interrupt handler is safe */
-		level = 0x1f;
 
 	EFX_POPULATE_OWORD_3(int_en_reg_ker,
-			     FRF_AZ_KER_INT_LEVE_SEL, level,
+			     FRF_AZ_KER_INT_LEVE_SEL, efx->fatal_irq_level,
 			     FRF_AZ_KER_INT_KER, force,
 			     FRF_AZ_DRV_INT_EN_KER, enabled);
 	efx_writeo(efx, &int_en_reg_ker, FR_AZ_INT_EN_KER);
@@ -1291,8 +1285,6 @@ irqreturn_t efx_nic_fatal_interrupt(struct efx_nic *efx)
 		EFX_OWORD_FMT ": %s\n", EFX_OWORD_VAL(*int_ker),
 		EFX_OWORD_VAL(fatal_intr),
 		error ? "disabling bus mastering" : "no recognised error");
-	if (error == 0)
-		goto out;
 
 	/* If this is a memory parity error dump which blocks are offending */
 	mem_perr = EFX_OWORD_FIELD(fatal_intr, FRF_AZ_MEM_PERR_INT_KER);
@@ -1324,7 +1316,7 @@ irqreturn_t efx_nic_fatal_interrupt(struct efx_nic *efx)
 			"NIC will be disabled\n");
 		efx_schedule_reset(efx, RESET_TYPE_DISABLE);
 	}
-out:
+
 	return IRQ_HANDLED;
 }
 
@@ -1346,9 +1338,11 @@ static irqreturn_t efx_legacy_interrupt(int irq, void *dev_id)
 	queues = EFX_EXTRACT_DWORD(reg, 0, 31);
 
 	/* Check to see if we have a serious error condition */
-	syserr = EFX_OWORD_FIELD(*int_ker, FSF_AZ_NET_IVEC_FATAL_INT);
-	if (unlikely(syserr))
-		return efx_nic_fatal_interrupt(efx);
+	if (queues & (1U << efx->fatal_irq_level)) {
+		syserr = EFX_OWORD_FIELD(*int_ker, FSF_AZ_NET_IVEC_FATAL_INT);
+		if (unlikely(syserr))
+			return efx_nic_fatal_interrupt(efx);
+	}
 
 	if (queues != 0) {
 		if (EFX_WORKAROUND_15783(efx))
@@ -1413,9 +1407,11 @@ static irqreturn_t efx_msi_interrupt(int irq, void *dev_id)
 		  irq, raw_smp_processor_id(), EFX_OWORD_VAL(*int_ker));
 
 	/* Check to see if we have a serious error condition */
-	syserr = EFX_OWORD_FIELD(*int_ker, FSF_AZ_NET_IVEC_FATAL_INT);
-	if (unlikely(syserr))
-		return efx_nic_fatal_interrupt(efx);
+	if (channel->channel == efx->fatal_irq_level) {
+		syserr = EFX_OWORD_FIELD(*int_ker, FSF_AZ_NET_IVEC_FATAL_INT);
+		if (unlikely(syserr))
+			return efx_nic_fatal_interrupt(efx);
+	}
 
 	/* Schedule processing of the channel */
 	efx_schedule_channel(channel);
@@ -1553,6 +1549,13 @@ void efx_nic_init_common(struct efx_nic *efx)
 			     FRF_AZ_INT_ADR_KER, efx->irq_status.dma_addr);
 	efx_writeo(efx, &temp, FR_AZ_INT_ADR_KER);
 
+	if (EFX_WORKAROUND_17213(efx) && !EFX_INT_MODE_USE_MSI(efx))
+		/* Use an interrupt level unused by event queues */
+		efx->fatal_irq_level = 0x1f;
+	else
+		/* Use a valid MSI-X vector */
+		efx->fatal_irq_level = 0;
+
 	/* Enable all the genuinely fatal interrupts.  (They are still
 	 * masked by the overall interrupt mask, controlled by
 	 * falcon_interrupts()).
-- 
1.6.2.5


-- 
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.


^ permalink raw reply related

* [PATCH 04/17] sfc: Stop masking out XGMII faults over reconfigures
From: Ben Hutchings @ 2010-04-28 19:27 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, linux-net-drivers
In-Reply-To: <1272482722.2549.17.camel@achroite.uk.solarflarecom.com>

From: Steve Hodgson <shodgson@solarflare.com>

The aim of this code was to avoid a spurious XGMII fault over a MAC
reconfigure. It's less relevant now that the PHY reconfigure isn't
called from the MAC reconfigure.

After applying this patch, our link stress test passed 48 hours of
testing without ever resetting the PHY.

Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
---
 drivers/net/sfc/falcon_xmac.c |   20 +++++---------------
 1 files changed, 5 insertions(+), 15 deletions(-)

diff --git a/drivers/net/sfc/falcon_xmac.c b/drivers/net/sfc/falcon_xmac.c
index 8ccab2c..3d65abf 100644
--- a/drivers/net/sfc/falcon_xmac.c
+++ b/drivers/net/sfc/falcon_xmac.c
@@ -85,14 +85,14 @@ int falcon_reset_xaui(struct efx_nic *efx)
 	return -ETIMEDOUT;
 }
 
-static void falcon_mask_status_intr(struct efx_nic *efx, bool enable)
+static void falcon_ack_status_intr(struct efx_nic *efx)
 {
 	efx_oword_t reg;
 
 	if ((efx_nic_rev(efx) != EFX_REV_FALCON_B0) || LOOPBACK_INTERNAL(efx))
 		return;
 
-	/* We expect xgmii faults if the wireside link is up */
+	/* We expect xgmii faults if the wireside link is down */
 	if (!EFX_WORKAROUND_5147(efx) || !efx->link_state.up)
 		return;
 
@@ -101,14 +101,7 @@ static void falcon_mask_status_intr(struct efx_nic *efx, bool enable)
 	if (efx->xmac_poll_required)
 		return;
 
-	/* Flush the ISR */
-	if (enable)
-		efx_reado(efx, &reg, FR_AB_XM_MGT_INT_MSK);
-
-	EFX_POPULATE_OWORD_2(reg,
-			     FRF_AB_XM_MSK_RMTFLT, !enable,
-			     FRF_AB_XM_MSK_LCLFLT, !enable);
-	efx_writeo(efx, &reg, FR_AB_XM_MGT_INT_MASK);
+	efx_reado(efx, &reg, FR_AB_XM_MGT_INT_MSK);
 }
 
 static bool falcon_xgxs_link_ok(struct efx_nic *efx)
@@ -283,15 +276,13 @@ static bool falcon_xmac_check_fault(struct efx_nic *efx)
 
 static int falcon_reconfigure_xmac(struct efx_nic *efx)
 {
-	falcon_mask_status_intr(efx, false);
-
 	falcon_reconfigure_xgxs_core(efx);
 	falcon_reconfigure_xmac_core(efx);
 
 	falcon_reconfigure_mac_wrapper(efx);
 
 	efx->xmac_poll_required = !falcon_xmac_link_ok_retry(efx, 5);
-	falcon_mask_status_intr(efx, true);
+	falcon_ack_status_intr(efx);
 
 	return 0;
 }
@@ -362,9 +353,8 @@ void falcon_poll_xmac(struct efx_nic *efx)
 	    !efx->xmac_poll_required)
 		return;
 
-	falcon_mask_status_intr(efx, false);
 	efx->xmac_poll_required = !falcon_xmac_link_ok_retry(efx, 1);
-	falcon_mask_status_intr(efx, true);
+	falcon_ack_status_intr(efx);
 }
 
 struct efx_mac_operations falcon_xmac_operations = {
-- 
1.6.2.5


-- 
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.


^ permalink raw reply related

* [PATCH 05/17] sfc: Reconfigure the XAUI serdes after an EM reset
From: Ben Hutchings @ 2010-04-28 19:28 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, linux-net-drivers
In-Reply-To: <1272482722.2549.17.camel@achroite.uk.solarflarecom.com>

From: Steve Hodgson <shodgson@solarflare.com>

Fix a regression introduced in d3245b28ef2a45ec4e115062a38100bd06229289
"sfc: Refactor link configuration".

Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
---
 drivers/net/sfc/falcon.c      |    3 +++
 drivers/net/sfc/falcon_xmac.c |    2 +-
 drivers/net/sfc/nic.h         |    1 +
 3 files changed, 5 insertions(+), 1 deletions(-)

diff --git a/drivers/net/sfc/falcon.c b/drivers/net/sfc/falcon.c
index d09ad1b..f7df24d 100644
--- a/drivers/net/sfc/falcon.c
+++ b/drivers/net/sfc/falcon.c
@@ -507,6 +507,9 @@ static void falcon_reset_macs(struct efx_nic *efx)
 	/* Ensure the correct MAC is selected before statistics
 	 * are re-enabled by the caller */
 	efx_writeo(efx, &mac_ctrl, FR_AB_MAC_CTRL);
+
+	/* This can run even when the GMAC is selected */
+	falcon_setup_xaui(efx);
 }
 
 void falcon_drain_tx_fifo(struct efx_nic *efx)
diff --git a/drivers/net/sfc/falcon_xmac.c b/drivers/net/sfc/falcon_xmac.c
index 3d65abf..c84a2ce 100644
--- a/drivers/net/sfc/falcon_xmac.c
+++ b/drivers/net/sfc/falcon_xmac.c
@@ -26,7 +26,7 @@
  *************************************************************************/
 
 /* Configure the XAUI driver that is an output from Falcon */
-static void falcon_setup_xaui(struct efx_nic *efx)
+void falcon_setup_xaui(struct efx_nic *efx)
 {
 	efx_oword_t sdctl, txdrv;
 
diff --git a/drivers/net/sfc/nic.h b/drivers/net/sfc/nic.h
index 9351c03..7e3bec8 100644
--- a/drivers/net/sfc/nic.h
+++ b/drivers/net/sfc/nic.h
@@ -203,6 +203,7 @@ extern void falcon_irq_ack_a1(struct efx_nic *efx);
 extern int efx_nic_flush_queues(struct efx_nic *efx);
 extern void falcon_start_nic_stats(struct efx_nic *efx);
 extern void falcon_stop_nic_stats(struct efx_nic *efx);
+extern void falcon_setup_xaui(struct efx_nic *efx);
 extern int falcon_reset_xaui(struct efx_nic *efx);
 extern void efx_nic_init_common(struct efx_nic *efx);
 
-- 
1.6.2.5


-- 
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.


^ permalink raw reply related

* [PATCH 06/17] sfc: Extend the legacy interrupt workarounds
From: Ben Hutchings @ 2010-04-28 19:28 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, linux-net-drivers
In-Reply-To: <1272482722.2549.17.camel@achroite.uk.solarflarecom.com>

From: Steve Hodgson <shodgson@solarflare.com>

Siena has two problems with legacy interrupts:
  1. There is no synchronisation between the ISR read completion,
     and the interrupt deassert message.
  2. A downstream read at the "wrong" moment can return 0, and
     suppress generating the next interrupt.

Falcon should suffer from both of these, and it appears it does.
Enable EFX_WORKAROUND_15783 on Falcon as well.

Also, when we see queues == 0, ensure we always schedule or rearm
every event queue.

Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
---
 drivers/net/sfc/nic.c         |   23 +++++++++--------------
 drivers/net/sfc/workarounds.h |    2 +-
 2 files changed, 10 insertions(+), 15 deletions(-)

diff --git a/drivers/net/sfc/nic.c b/drivers/net/sfc/nic.c
index 23738f8..b61674c 100644
--- a/drivers/net/sfc/nic.c
+++ b/drivers/net/sfc/nic.c
@@ -1356,33 +1356,28 @@ static irqreturn_t efx_legacy_interrupt(int irq, void *dev_id)
 		}
 		result = IRQ_HANDLED;
 
-	} else if (EFX_WORKAROUND_15783(efx) &&
-		   efx->irq_zero_count++ == 0) {
+	} else if (EFX_WORKAROUND_15783(efx)) {
 		efx_qword_t *event;
 
-		/* Ensure we rearm all event queues */
+		/* We can't return IRQ_HANDLED more than once on seeing ISR=0
+		 * because this might be a shared interrupt. */
+		if (efx->irq_zero_count++ == 0)
+			result = IRQ_HANDLED;
+
+		/* Ensure we schedule or rearm all event queues */
 		efx_for_each_channel(channel, efx) {
 			event = efx_event(channel, channel->eventq_read_ptr);
 			if (efx_event_present(event))
 				efx_schedule_channel(channel);
+			else
+				efx_nic_eventq_read_ack(channel);
 		}
-
-		result = IRQ_HANDLED;
 	}
 
 	if (result == IRQ_HANDLED) {
 		efx->last_irq_cpu = raw_smp_processor_id();
 		EFX_TRACE(efx, "IRQ %d on CPU %d status " EFX_DWORD_FMT "\n",
 			  irq, raw_smp_processor_id(), EFX_DWORD_VAL(reg));
-	} else if (EFX_WORKAROUND_15783(efx)) {
-		/* We can't return IRQ_HANDLED more than once on seeing ISR0=0
-		 * because this might be a shared interrupt, but we do need to
-		 * check the channel every time and preemptively rearm it if
-		 * it's idle. */
-		efx_for_each_channel(channel, efx) {
-			if (!channel->work_pending)
-				efx_nic_eventq_read_ack(channel);
-		}
 	}
 
 	return result;
diff --git a/drivers/net/sfc/workarounds.h b/drivers/net/sfc/workarounds.h
index acd9c73..518f7fc 100644
--- a/drivers/net/sfc/workarounds.h
+++ b/drivers/net/sfc/workarounds.h
@@ -37,7 +37,7 @@
 /* Truncated IPv4 packets can confuse the TX packet parser */
 #define EFX_WORKAROUND_15592 EFX_WORKAROUND_FALCON_AB
 /* Legacy ISR read can return zero once */
-#define EFX_WORKAROUND_15783 EFX_WORKAROUND_SIENA
+#define EFX_WORKAROUND_15783 EFX_WORKAROUND_ALWAYS
 /* Legacy interrupt storm when interrupt fifo fills */
 #define EFX_WORKAROUND_17213 EFX_WORKAROUND_SIENA
 
-- 
1.6.2.5


-- 
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.


^ permalink raw reply related

* [PATCH 07/17] sfc: Log specific message for failure of NVRAM self-test
From: Ben Hutchings @ 2010-04-28 19:28 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, linux-net-drivers
In-Reply-To: <1272482722.2549.17.camel@achroite.uk.solarflarecom.com>

Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
---
 drivers/net/sfc/mcdi.c |   10 ++++++++--
 1 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/drivers/net/sfc/mcdi.c b/drivers/net/sfc/mcdi.c
index 1344afa..93cc3c1 100644
--- a/drivers/net/sfc/mcdi.c
+++ b/drivers/net/sfc/mcdi.c
@@ -932,20 +932,26 @@ int efx_mcdi_nvram_test_all(struct efx_nic *efx)
 
 	rc = efx_mcdi_nvram_types(efx, &nvram_types);
 	if (rc)
-		return rc;
+		goto fail1;
 
 	type = 0;
 	while (nvram_types != 0) {
 		if (nvram_types & 1) {
 			rc = efx_mcdi_nvram_test(efx, type);
 			if (rc)
-				return rc;
+				goto fail2;
 		}
 		type++;
 		nvram_types >>= 1;
 	}
 
 	return 0;
+
+fail2:
+	EFX_ERR(efx, "%s: failed type=%u\n", __func__, type);
+fail1:
+	EFX_ERR(efx, "%s: failed rc=%d\n", __func__, rc);
+	return rc;
 }
 
 static int efx_mcdi_read_assertion(struct efx_nic *efx)
-- 
1.6.2.5


-- 
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.


^ permalink raw reply related

* [PATCH 08/17] sfc: Read MEM_STAT for SRM_PERR as well as MEM_PERR errors
From: Ben Hutchings @ 2010-04-28 19:28 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, linux-net-drivers
In-Reply-To: <1272482722.2549.17.camel@achroite.uk.solarflarecom.com>

From: Steve Hodgson <shodgson@solarflare.com>

Parity errors in different blocks of SRAM may set one of two different
interrupt flags.

Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
---
 drivers/net/sfc/nic.c |    3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/drivers/net/sfc/nic.c b/drivers/net/sfc/nic.c
index b61674c..4105f90 100644
--- a/drivers/net/sfc/nic.c
+++ b/drivers/net/sfc/nic.c
@@ -1287,7 +1287,8 @@ irqreturn_t efx_nic_fatal_interrupt(struct efx_nic *efx)
 		error ? "disabling bus mastering" : "no recognised error");
 
 	/* If this is a memory parity error dump which blocks are offending */
-	mem_perr = EFX_OWORD_FIELD(fatal_intr, FRF_AZ_MEM_PERR_INT_KER);
+	mem_perr = (EFX_OWORD_FIELD(fatal_intr, FRF_AZ_MEM_PERR_INT_KER) ||
+		    EFX_OWORD_FIELD(fatal_intr, FRF_AZ_SRM_PERR_INT_KER));
 	if (mem_perr) {
 		efx_oword_t reg;
 		efx_reado(efx, &reg, FR_AZ_MEM_STAT);
-- 
1.6.2.5


-- 
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.


^ permalink raw reply related

* [PATCH 09/17] sfc: Enable IPv6 RSS using random key for Toeplitz hash
From: Ben Hutchings @ 2010-04-28 19:29 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, linux-net-drivers
In-Reply-To: <1272482722.2549.17.camel@achroite.uk.solarflarecom.com>

Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
---
 drivers/net/sfc/nic.h   |    2 ++
 drivers/net/sfc/siena.c |   19 +++++++++++++++++++
 2 files changed, 21 insertions(+), 0 deletions(-)

diff --git a/drivers/net/sfc/nic.h b/drivers/net/sfc/nic.h
index 7e3bec8..5825f37 100644
--- a/drivers/net/sfc/nic.h
+++ b/drivers/net/sfc/nic.h
@@ -135,12 +135,14 @@ static inline struct falcon_board *falcon_board(struct efx_nic *efx)
  * @fw_build: Firmware build number
  * @mcdi: Management-Controller-to-Driver Interface
  * @wol_filter_id: Wake-on-LAN packet filter id
+ * @ipv6_rss_key: Toeplitz hash key for IPv6 RSS
  */
 struct siena_nic_data {
 	u64 fw_version;
 	u32 fw_build;
 	struct efx_mcdi_iface mcdi;
 	int wol_filter_id;
+	u8 ipv6_rss_key[40];
 };
 
 extern void siena_print_fwver(struct efx_nic *efx, char *buf, size_t len);
diff --git a/drivers/net/sfc/siena.c b/drivers/net/sfc/siena.c
index 38dcc42..7bf93fa 100644
--- a/drivers/net/sfc/siena.c
+++ b/drivers/net/sfc/siena.c
@@ -13,6 +13,7 @@
 #include <linux/pci.h>
 #include <linux/module.h>
 #include <linux/slab.h>
+#include <linux/random.h>
 #include "net_driver.h"
 #include "bitfield.h"
 #include "efx.h"
@@ -274,6 +275,9 @@ static int siena_probe_nic(struct efx_nic *efx)
 		goto fail5;
 	}
 
+	get_random_bytes(&nic_data->ipv6_rss_key,
+			 sizeof(nic_data->ipv6_rss_key));
+
 	return 0;
 
 fail5:
@@ -293,6 +297,7 @@ fail1:
  */
 static int siena_init_nic(struct efx_nic *efx)
 {
+	struct siena_nic_data *nic_data = efx->nic_data;
 	efx_oword_t temp;
 	int rc;
 
@@ -319,6 +324,20 @@ static int siena_init_nic(struct efx_nic *efx)
 	EFX_SET_OWORD_FIELD(temp, FRF_BZ_RX_INGR_EN, 1);
 	efx_writeo(efx, &temp, FR_AZ_RX_CFG);
 
+	/* Enable IPv6 RSS */
+	BUILD_BUG_ON(sizeof(nic_data->ipv6_rss_key) !=
+		     2 * sizeof(temp) + FRF_CZ_RX_RSS_IPV6_TKEY_HI_WIDTH / 8 ||
+		     FRF_CZ_RX_RSS_IPV6_TKEY_HI_LBN != 0);
+	memcpy(&temp, nic_data->ipv6_rss_key, sizeof(temp));
+	efx_writeo(efx, &temp, FR_CZ_RX_RSS_IPV6_REG1);
+	memcpy(&temp, nic_data->ipv6_rss_key + sizeof(temp), sizeof(temp));
+	efx_writeo(efx, &temp, FR_CZ_RX_RSS_IPV6_REG2);
+	EFX_POPULATE_OWORD_2(temp, FRF_CZ_RX_RSS_IPV6_THASH_ENABLE, 1,
+			     FRF_CZ_RX_RSS_IPV6_IP_THASH_ENABLE, 1);
+	memcpy(&temp, nic_data->ipv6_rss_key + 2 * sizeof(temp),
+	       FRF_CZ_RX_RSS_IPV6_TKEY_HI_WIDTH / 8);
+	efx_writeo(efx, &temp, FR_CZ_RX_RSS_IPV6_REG3);
+
 	if (efx_nic_rx_xoff_thresh >= 0 || efx_nic_rx_xon_thresh >= 0)
 		/* No MCDI operation has been defined to set thresholds */
 		EFX_ERR(efx, "ignoring RX flow control thresholds\n");
-- 
1.6.2.5


-- 
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.


^ permalink raw reply related

* [PATCH 10/17] sfc: Update MCDI protocol definitions
From: Ben Hutchings @ 2010-04-28 19:29 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, linux-net-drivers
In-Reply-To: <1272482722.2549.17.camel@achroite.uk.solarflarecom.com>

Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
---
 drivers/net/sfc/mcdi_pcol.h |   71 +++++++++++++++++++++++++++++++++----------
 1 files changed, 55 insertions(+), 16 deletions(-)

diff --git a/drivers/net/sfc/mcdi_pcol.h b/drivers/net/sfc/mcdi_pcol.h
index bd59302..90359e6 100644
--- a/drivers/net/sfc/mcdi_pcol.h
+++ b/drivers/net/sfc/mcdi_pcol.h
@@ -863,7 +863,7 @@
  * bist output. The driver should only consume the BIST output
  * after validating OUTLEN and PHY_CFG.PHY_TYPE.
  *
- * If a driver can't succesfully parse the BIST output, it should
+ * If a driver can't successfully parse the BIST output, it should
  * still respect the pass/Fail in OUT.RESULT
  *
  * Locks required: PHY_LOCK if doing a  PHY BIST
@@ -872,7 +872,7 @@
 #define MC_CMD_POLL_BIST 0x26
 #define MC_CMD_POLL_BIST_IN_LEN 0
 #define MC_CMD_POLL_BIST_OUT_LEN UNKNOWN
-#define MC_CMD_POLL_BIST_OUT_SFT9001_LEN 40
+#define MC_CMD_POLL_BIST_OUT_SFT9001_LEN 36
 #define MC_CMD_POLL_BIST_OUT_MRSFP_LEN 8
 #define MC_CMD_POLL_BIST_OUT_RESULT_OFST 0
 #define MC_CMD_POLL_BIST_RUNNING 1
@@ -882,15 +882,14 @@
 /* Generic: */
 #define MC_CMD_POLL_BIST_OUT_PRIVATE_OFST 4
 /* SFT9001-specific: */
-/* (offset 4 unused?) */
-#define MC_CMD_POLL_BIST_OUT_SFT9001_CABLE_LENGTH_A_OFST 8
-#define MC_CMD_POLL_BIST_OUT_SFT9001_CABLE_LENGTH_B_OFST 12
-#define MC_CMD_POLL_BIST_OUT_SFT9001_CABLE_LENGTH_C_OFST 16
-#define MC_CMD_POLL_BIST_OUT_SFT9001_CABLE_LENGTH_D_OFST 20
-#define MC_CMD_POLL_BIST_OUT_SFT9001_CABLE_STATUS_A_OFST 24
-#define MC_CMD_POLL_BIST_OUT_SFT9001_CABLE_STATUS_B_OFST 28
-#define MC_CMD_POLL_BIST_OUT_SFT9001_CABLE_STATUS_C_OFST 32
-#define MC_CMD_POLL_BIST_OUT_SFT9001_CABLE_STATUS_D_OFST 36
+#define MC_CMD_POLL_BIST_OUT_SFT9001_CABLE_LENGTH_A_OFST 4
+#define MC_CMD_POLL_BIST_OUT_SFT9001_CABLE_LENGTH_B_OFST 8
+#define MC_CMD_POLL_BIST_OUT_SFT9001_CABLE_LENGTH_C_OFST 12
+#define MC_CMD_POLL_BIST_OUT_SFT9001_CABLE_LENGTH_D_OFST 16
+#define MC_CMD_POLL_BIST_OUT_SFT9001_CABLE_STATUS_A_OFST 20
+#define MC_CMD_POLL_BIST_OUT_SFT9001_CABLE_STATUS_B_OFST 24
+#define MC_CMD_POLL_BIST_OUT_SFT9001_CABLE_STATUS_C_OFST 28
+#define MC_CMD_POLL_BIST_OUT_SFT9001_CABLE_STATUS_D_OFST 32
 #define MC_CMD_POLL_BIST_SFT9001_PAIR_OK 1
 #define MC_CMD_POLL_BIST_SFT9001_PAIR_OPEN 2
 #define MC_CMD_POLL_BIST_SFT9001_INTRA_PAIR_SHORT 3
@@ -1054,9 +1053,13 @@
 /* MC_CMD_PHY_STATS:
  * Get generic PHY statistics
  *
- * This call returns the statistics for a generic PHY, by direct DMA
- * into host memory, in a sparse array (indexed by the enumerate).
- * Each value is represented by a 32bit number.
+ * This call returns the statistics for a generic PHY in a sparse
+ * array (indexed by the enumerate). Each value is represented by
+ * a 32bit number.
+ *
+ * If the DMA_ADDR is 0, then no DMA is performed, and the statistics
+ * may be read directly out of shared memory. If DMA_ADDR != 0, then
+ * the statistics are dmad to that (page-aligned location)
  *
  * Locks required: None
  * Returns: 0, ETIME
@@ -1066,7 +1069,8 @@
 #define MC_CMD_PHY_STATS_IN_LEN 8
 #define MC_CMD_PHY_STATS_IN_DMA_ADDR_LO_OFST 0
 #define MC_CMD_PHY_STATS_IN_DMA_ADDR_HI_OFST 4
-#define MC_CMD_PHY_STATS_OUT_LEN 0
+#define MC_CMD_PHY_STATS_OUT_DMA_LEN 0
+#define MC_CMD_PHY_STATS_OUT_NO_DMA_LEN (MC_CMD_PHY_NSTATS * 4)
 
 /* Unified MAC statistics enumeration */
 #define MC_CMD_MAC_GENERATION_START 0
@@ -1158,11 +1162,13 @@
 #define MC_CMD_MAC_STATS_CMD_CLEAR_WIDTH 1
 #define MC_CMD_MAC_STATS_CMD_PERIODIC_CHANGE_LBN 2
 #define MC_CMD_MAC_STATS_CMD_PERIODIC_CHANGE_WIDTH 1
-/* Fields only relevent when PERIODIC_CHANGE is set */
+/* Remaining PERIOD* fields only relevent when PERIODIC_CHANGE is set */
 #define MC_CMD_MAC_STATS_CMD_PERIODIC_ENABLE_LBN 3
 #define MC_CMD_MAC_STATS_CMD_PERIODIC_ENABLE_WIDTH 1
 #define MC_CMD_MAC_STATS_CMD_PERIODIC_CLEAR_LBN 4
 #define MC_CMD_MAC_STATS_CMD_PERIODIC_CLEAR_WIDTH 1
+#define MC_CMD_MAC_STATS_CMD_PERIODIC_NOEVENT_LBN 5
+#define MC_CMD_MAC_STATS_CMD_PERIODIC_NOEVENT_WIDTH 1
 #define MC_CMD_MAC_STATS_CMD_PERIOD_MS_LBN 16
 #define MC_CMD_MAC_STATS_CMD_PERIOD_MS_WIDTH 16
 #define MC_CMD_MAC_STATS_IN_DMA_LEN_OFST 12
@@ -1729,6 +1735,39 @@
 #define MC_CMD_MRSFP_TWEAK_OUT_IOEXP_OUTPUTS_OFST 4    /* output bits */
 #define MC_CMD_MRSFP_TWEAK_OUT_IOEXP_DIRECTION_OFST 8  /* dirs: 0=out, 1=in */
 
+/* MC_CMD_TEST_HACK: (debug (unsurprisingly))
+  * Change bits of network port state for test purposes in ways that would never be
+  * useful in normal operation and so need a special command to change. */
+#define MC_CMD_TEST_HACK 0x2f
+#define MC_CMD_TEST_HACK_IN_LEN 8
+#define MC_CMD_TEST_HACK_IN_TXPAD_OFST 0
+#define   MC_CMD_TEST_HACK_IN_TXPAD_AUTO  0 /* Let the MC manage things */
+#define   MC_CMD_TEST_HACK_IN_TXPAD_ON    1 /* Force on */
+#define   MC_CMD_TEST_HACK_IN_TXPAD_OFF   2 /* Force on */
+#define MC_CMD_TEST_HACK_IN_IPG_OFST   4 /* Takes a value in bits */
+#define   MC_CMD_TEST_HACK_IN_IPG_AUTO    0 /* The MC picks the value */
+#define MC_CMD_TEST_HACK_OUT_LEN 0
+
+/* MC_CMD_SENSOR_SET_LIMS: (debug) (mostly) adjust the sensor limits. This
+ * is a warranty-voiding operation.
+  *
+ * IN: sensor identifier (one of the enumeration starting with MC_CMD_SENSOR_CONTROLLER_TEMP
+ * followed by 4 32-bit values: min(warning) max(warning), min(fatal), max(fatal). Which
+ * of these limits are meaningful and what their interpretation is is sensor-specific.
+ *
+ * OUT: nothing
+ *
+ * Returns: ENOENT if the sensor specified does not exist, EINVAL if the limits are
+  * out of range.
+ */
+#define MC_CMD_SENSOR_SET_LIMS 0x4e
+#define MC_CMD_SENSOR_SET_LIMS_IN_LEN 20
+#define MC_CMD_SENSOR_SET_LIMS_IN_SENSOR_OFST 0
+#define MC_CMD_SENSOR_SET_LIMS_IN_LOW0_OFST 4
+#define MC_CMD_SENSOR_SET_LIMS_IN_HI0_OFST  8
+#define MC_CMD_SENSOR_SET_LIMS_IN_LOW1_OFST 12
+#define MC_CMD_SENSOR_SET_LIMS_IN_HI1_OFST  16
+
 /* Do NOT add new commands beyond 0x4f as part of 3.0 : 0x50 - 0x7f will be
  * used for post-3.0 extensions. If you run out of space, look for gaps or
  * commands that are unused in the existing range. */
-- 
1.6.2.5


-- 
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.


^ permalink raw reply related

* [PATCH 11/17] sfc: Set PERIODIC_NOEVENT flag for MC_CMD_MAC_STATS
From: Ben Hutchings @ 2010-04-28 19:29 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, linux-net-drivers
In-Reply-To: <1272482722.2549.17.camel@achroite.uk.solarflarecom.com>

From: Steve Hodgson <shodgson@solarflare.com>

When set, an event is not sent whenever periodic MAC statistics are
raised.  This avoids unnecessary wake-ups.

Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
---
 drivers/net/sfc/mcdi_mac.c |   25 +++++++++----------------
 1 files changed, 9 insertions(+), 16 deletions(-)

diff --git a/drivers/net/sfc/mcdi_mac.c b/drivers/net/sfc/mcdi_mac.c
index 06d24a1..3918263 100644
--- a/drivers/net/sfc/mcdi_mac.c
+++ b/drivers/net/sfc/mcdi_mac.c
@@ -80,7 +80,7 @@ int efx_mcdi_mac_stats(struct efx_nic *efx, dma_addr_t dma_addr,
 	u8 inbuf[MC_CMD_MAC_STATS_IN_LEN];
 	int rc;
 	efx_dword_t *cmd_ptr;
-	int period = 1000;
+	int period = enable ? 1000 : 0;
 	u32 addr_hi;
 	u32 addr_lo;
 
@@ -92,21 +92,14 @@ int efx_mcdi_mac_stats(struct efx_nic *efx, dma_addr_t dma_addr,
 	MCDI_SET_DWORD(inbuf, MAC_STATS_IN_DMA_ADDR_LO, addr_lo);
 	MCDI_SET_DWORD(inbuf, MAC_STATS_IN_DMA_ADDR_HI, addr_hi);
 	cmd_ptr = (efx_dword_t *)MCDI_PTR(inbuf, MAC_STATS_IN_CMD);
-	if (enable)
-		EFX_POPULATE_DWORD_6(*cmd_ptr,
-				     MC_CMD_MAC_STATS_CMD_DMA, 1,
-				     MC_CMD_MAC_STATS_CMD_CLEAR, clear,
-				     MC_CMD_MAC_STATS_CMD_PERIODIC_CHANGE, 1,
-				     MC_CMD_MAC_STATS_CMD_PERIODIC_ENABLE, 1,
-				     MC_CMD_MAC_STATS_CMD_PERIODIC_CLEAR, 0,
-				     MC_CMD_MAC_STATS_CMD_PERIOD_MS, period);
-	else
-		EFX_POPULATE_DWORD_5(*cmd_ptr,
-				     MC_CMD_MAC_STATS_CMD_DMA, 0,
-				     MC_CMD_MAC_STATS_CMD_CLEAR, clear,
-				     MC_CMD_MAC_STATS_CMD_PERIODIC_CHANGE, 1,
-				     MC_CMD_MAC_STATS_CMD_PERIODIC_ENABLE, 0,
-				     MC_CMD_MAC_STATS_CMD_PERIODIC_CLEAR, 0);
+	EFX_POPULATE_DWORD_7(*cmd_ptr,
+			     MC_CMD_MAC_STATS_CMD_DMA, !!enable,
+			     MC_CMD_MAC_STATS_CMD_CLEAR, clear,
+			     MC_CMD_MAC_STATS_CMD_PERIODIC_CHANGE, 1,
+			     MC_CMD_MAC_STATS_CMD_PERIODIC_ENABLE, !!enable,
+			     MC_CMD_MAC_STATS_CMD_PERIODIC_CLEAR, 0,
+			     MC_CMD_MAC_STATS_CMD_PERIODIC_NOEVENT, 1,
+			     MC_CMD_MAC_STATS_CMD_PERIOD_MS, period);
 	MCDI_SET_DWORD(inbuf, MAC_STATS_IN_DMA_LEN, dma_len);
 
 	rc = efx_mcdi_rpc(efx, MC_CMD_MAC_STATS, inbuf, sizeof(inbuf),
-- 
1.6.2.5


-- 
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.


^ permalink raw reply related

* [PATCH 12/17] sfc: Break NAPI processing after one ring-full of TX completions
From: Ben Hutchings @ 2010-04-28 19:29 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, linux-net-drivers
In-Reply-To: <1272482722.2549.17.camel@achroite.uk.solarflarecom.com>

Currently TX completions do not count towards the NAPI budget.  This
means a continuous stream of TX completions can cause the polling
function to loop indefinitely with scheduling disabled.  To avoid
this, follow the common practice of reporting the budget spent after
processing one ring-full of TX completions.

Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
---
 drivers/net/sfc/efx.c |   18 +++++++++---------
 drivers/net/sfc/nic.c |   39 ++++++++++++++++++++++++---------------
 2 files changed, 33 insertions(+), 24 deletions(-)

diff --git a/drivers/net/sfc/efx.c b/drivers/net/sfc/efx.c
index 1ad61b7..5e3f944 100644
--- a/drivers/net/sfc/efx.c
+++ b/drivers/net/sfc/efx.c
@@ -225,17 +225,17 @@ static void efx_fini_channels(struct efx_nic *efx);
  * never be concurrently called more than once on the same channel,
  * though different channels may be being processed concurrently.
  */
-static int efx_process_channel(struct efx_channel *channel, int rx_quota)
+static int efx_process_channel(struct efx_channel *channel, int budget)
 {
 	struct efx_nic *efx = channel->efx;
-	int rx_packets;
+	int spent;
 
 	if (unlikely(efx->reset_pending != RESET_TYPE_NONE ||
 		     !channel->enabled))
 		return 0;
 
-	rx_packets = efx_nic_process_eventq(channel, rx_quota);
-	if (rx_packets == 0)
+	spent = efx_nic_process_eventq(channel, budget);
+	if (spent == 0)
 		return 0;
 
 	/* Deliver last RX packet. */
@@ -249,7 +249,7 @@ static int efx_process_channel(struct efx_channel *channel, int rx_quota)
 
 	efx_fast_push_rx_descriptors(&efx->rx_queue[channel->channel]);
 
-	return rx_packets;
+	return spent;
 }
 
 /* Mark channel as finished processing
@@ -278,14 +278,14 @@ static int efx_poll(struct napi_struct *napi, int budget)
 {
 	struct efx_channel *channel =
 		container_of(napi, struct efx_channel, napi_str);
-	int rx_packets;
+	int spent;
 
 	EFX_TRACE(channel->efx, "channel %d NAPI poll executing on CPU %d\n",
 		  channel->channel, raw_smp_processor_id());
 
-	rx_packets = efx_process_channel(channel, budget);
+	spent = efx_process_channel(channel, budget);
 
-	if (rx_packets < budget) {
+	if (spent < budget) {
 		struct efx_nic *efx = channel->efx;
 
 		if (channel->used_flags & EFX_USED_BY_RX &&
@@ -318,7 +318,7 @@ static int efx_poll(struct napi_struct *napi, int budget)
 		efx_channel_processed(channel);
 	}
 
-	return rx_packets;
+	return spent;
 }
 
 /* Process the eventq of the specified channel immediately on this CPU
diff --git a/drivers/net/sfc/nic.c b/drivers/net/sfc/nic.c
index 4105f90..f3226bb 100644
--- a/drivers/net/sfc/nic.c
+++ b/drivers/net/sfc/nic.c
@@ -654,22 +654,23 @@ void efx_generate_event(struct efx_channel *channel, efx_qword_t *event)
  * The NIC batches TX completion events; the message we receive is of
  * the form "complete all TX events up to this index".
  */
-static void
+static int
 efx_handle_tx_event(struct efx_channel *channel, efx_qword_t *event)
 {
 	unsigned int tx_ev_desc_ptr;
 	unsigned int tx_ev_q_label;
 	struct efx_tx_queue *tx_queue;
 	struct efx_nic *efx = channel->efx;
+	int tx_packets = 0;
 
 	if (likely(EFX_QWORD_FIELD(*event, FSF_AZ_TX_EV_COMP))) {
 		/* Transmit completion */
 		tx_ev_desc_ptr = EFX_QWORD_FIELD(*event, FSF_AZ_TX_EV_DESC_PTR);
 		tx_ev_q_label = EFX_QWORD_FIELD(*event, FSF_AZ_TX_EV_Q_LABEL);
 		tx_queue = &efx->tx_queue[tx_ev_q_label];
-		channel->irq_mod_score +=
-			(tx_ev_desc_ptr - tx_queue->read_count) &
-			EFX_TXQ_MASK;
+		tx_packets = ((tx_ev_desc_ptr - tx_queue->read_count) &
+			      EFX_TXQ_MASK);
+		channel->irq_mod_score += tx_packets;
 		efx_xmit_done(tx_queue, tx_ev_desc_ptr);
 	} else if (EFX_QWORD_FIELD(*event, FSF_AZ_TX_EV_WQ_FF_FULL)) {
 		/* Rewrite the FIFO write pointer */
@@ -689,6 +690,8 @@ efx_handle_tx_event(struct efx_channel *channel, efx_qword_t *event)
 			EFX_QWORD_FMT"\n", channel->channel,
 			EFX_QWORD_VAL(*event));
 	}
+
+	return tx_packets;
 }
 
 /* Detect errors included in the rx_evt_pkt_ok bit. */
@@ -947,16 +950,17 @@ efx_handle_driver_event(struct efx_channel *channel, efx_qword_t *event)
 	}
 }
 
-int efx_nic_process_eventq(struct efx_channel *channel, int rx_quota)
+int efx_nic_process_eventq(struct efx_channel *channel, int budget)
 {
 	unsigned int read_ptr;
 	efx_qword_t event, *p_event;
 	int ev_code;
-	int rx_packets = 0;
+	int tx_packets = 0;
+	int spent = 0;
 
 	read_ptr = channel->eventq_read_ptr;
 
-	do {
+	for (;;) {
 		p_event = efx_event(channel, read_ptr);
 		event = *p_event;
 
@@ -970,15 +974,23 @@ int efx_nic_process_eventq(struct efx_channel *channel, int rx_quota)
 		/* Clear this event by marking it all ones */
 		EFX_SET_QWORD(*p_event);
 
+		/* Increment read pointer */
+		read_ptr = (read_ptr + 1) & EFX_EVQ_MASK;
+
 		ev_code = EFX_QWORD_FIELD(event, FSF_AZ_EV_CODE);
 
 		switch (ev_code) {
 		case FSE_AZ_EV_CODE_RX_EV:
 			efx_handle_rx_event(channel, &event);
-			++rx_packets;
+			if (++spent == budget)
+				goto out;
 			break;
 		case FSE_AZ_EV_CODE_TX_EV:
-			efx_handle_tx_event(channel, &event);
+			tx_packets += efx_handle_tx_event(channel, &event);
+			if (tx_packets >= EFX_TXQ_SIZE) {
+				spent = budget;
+				goto out;
+			}
 			break;
 		case FSE_AZ_EV_CODE_DRV_GEN_EV:
 			channel->eventq_magic = EFX_QWORD_FIELD(
@@ -1001,14 +1013,11 @@ int efx_nic_process_eventq(struct efx_channel *channel, int rx_quota)
 				" (data " EFX_QWORD_FMT ")\n", channel->channel,
 				ev_code, EFX_QWORD_VAL(event));
 		}
+	}
 
-		/* Increment read pointer */
-		read_ptr = (read_ptr + 1) & EFX_EVQ_MASK;
-
-	} while (rx_packets < rx_quota);
-
+out:
 	channel->eventq_read_ptr = read_ptr;
-	return rx_packets;
+	return spent;
 }
 

-- 
1.6.2.5


-- 
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.


^ permalink raw reply related

* [PATCH 13/17] sfc: Add necessary parentheses to macro definitions in net_driver.h
From: Ben Hutchings @ 2010-04-28 19:29 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, linux-net-drivers
In-Reply-To: <1272482722.2549.17.camel@achroite.uk.solarflarecom.com>

Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
---
 drivers/net/sfc/net_driver.h |   22 +++++++++++-----------
 1 files changed, 11 insertions(+), 11 deletions(-)

diff --git a/drivers/net/sfc/net_driver.h b/drivers/net/sfc/net_driver.h
index 70aea3a..bdea03c 100644
--- a/drivers/net/sfc/net_driver.h
+++ b/drivers/net/sfc/net_driver.h
@@ -926,8 +926,8 @@ struct efx_nic_type {
 
 /* Iterate over all used channels */
 #define efx_for_each_channel(_channel, _efx)				\
-	for (_channel = &_efx->channel[0];				\
-	     _channel < &_efx->channel[EFX_MAX_CHANNELS];		\
+	for (_channel = &((_efx)->channel[0]);				\
+	     _channel < &((_efx)->channel[EFX_MAX_CHANNELS]);		\
 	     _channel++)						\
 		if (!_channel->used_flags)				\
 			continue;					\
@@ -935,31 +935,31 @@ struct efx_nic_type {
 
 /* Iterate over all used TX queues */
 #define efx_for_each_tx_queue(_tx_queue, _efx)				\
-	for (_tx_queue = &_efx->tx_queue[0];				\
-	     _tx_queue < &_efx->tx_queue[EFX_TX_QUEUE_COUNT];		\
+	for (_tx_queue = &((_efx)->tx_queue[0]);			\
+	     _tx_queue < &((_efx)->tx_queue[EFX_TX_QUEUE_COUNT]);	\
 	     _tx_queue++)
 
 /* Iterate over all TX queues belonging to a channel */
 #define efx_for_each_channel_tx_queue(_tx_queue, _channel)		\
-	for (_tx_queue = &_channel->efx->tx_queue[0];			\
-	     _tx_queue < &_channel->efx->tx_queue[EFX_TX_QUEUE_COUNT];	\
+	for (_tx_queue = &((_channel)->efx->tx_queue[0]);		\
+	     _tx_queue < &((_channel)->efx->tx_queue[EFX_TX_QUEUE_COUNT]); \
 	     _tx_queue++)						\
-		if (_tx_queue->channel != _channel)			\
+		if (_tx_queue->channel != (_channel))			\
 			continue;					\
 		else
 
 /* Iterate over all used RX queues */
 #define efx_for_each_rx_queue(_rx_queue, _efx)				\
-	for (_rx_queue = &_efx->rx_queue[0];				\
-	     _rx_queue < &_efx->rx_queue[_efx->n_rx_queues];		\
+	for (_rx_queue = &((_efx)->rx_queue[0]);			\
+	     _rx_queue < &((_efx)->rx_queue[(_efx)->n_rx_queues]);	\
 	     _rx_queue++)
 
 /* Iterate over all RX queues belonging to a channel */
 #define efx_for_each_channel_rx_queue(_rx_queue, _channel)		\
-	for (_rx_queue = &_channel->efx->rx_queue[_channel->channel];	\
+	for (_rx_queue = &((_channel)->efx->rx_queue[(_channel)->channel]); \
 	     _rx_queue;							\
 	     _rx_queue = NULL)						\
-		if (_rx_queue->channel != _channel)			\
+		if (_rx_queue->channel != (_channel))			\
 			continue;					\
 		else
 
-- 
1.6.2.5


-- 
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.


^ permalink raw reply related

* [PATCH 14/17] sfc: Clean up efx_nic::irq_zero_count
From: Ben Hutchings @ 2010-04-28 19:30 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, linux-net-drivers
In-Reply-To: <1272482722.2549.17.camel@achroite.uk.solarflarecom.com>

There is no need for this to be unsigned long; make it unsigned int.
It does need a line in kernel-doc, so add that.

Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
---
 drivers/net/sfc/net_driver.h |    3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/drivers/net/sfc/net_driver.h b/drivers/net/sfc/net_driver.h
index bdea03c..d68331c 100644
--- a/drivers/net/sfc/net_driver.h
+++ b/drivers/net/sfc/net_driver.h
@@ -672,6 +672,7 @@ union efx_multicast_hash {
  *	This register is written with the SMP processor ID whenever an
  *	interrupt is handled.  It is used by efx_nic_test_interrupt()
  *	to verify that an interrupt has occurred.
+ * @irq_zero_count: Number of legacy IRQs seen with queue flags == 0
  * @fatal_irq_level: IRQ level (bit number) used for serious errors
  * @spi_flash: SPI flash device
  *	This field will be %NULL if no flash device is present (or for Siena).
@@ -756,7 +757,7 @@ struct efx_nic {
 
 	struct efx_buffer irq_status;
 	volatile signed int last_irq_cpu;
-	unsigned long irq_zero_count;
+	unsigned irq_zero_count;
 	unsigned fatal_irq_level;
 
 	struct efx_spi_device *spi_flash;
-- 
1.6.2.5


-- 
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.


^ permalink raw reply related

* [PATCH 15/17] sfc: Add Siena PHY BIST and cable diagnostic support
From: Ben Hutchings @ 2010-04-28 19:30 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, linux-net-drivers
In-Reply-To: <1272482722.2549.17.camel@achroite.uk.solarflarecom.com>

From: Steve Hodgson <shodgson@solarflare.com>

Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
---
 drivers/net/sfc/mcdi_phy.c |  146 +++++++++++++++++++++++++++++++++++++++++++-
 1 files changed, 144 insertions(+), 2 deletions(-)

diff --git a/drivers/net/sfc/mcdi_phy.c b/drivers/net/sfc/mcdi_phy.c
index 5d34487..6032c0e 100644
--- a/drivers/net/sfc/mcdi_phy.c
+++ b/drivers/net/sfc/mcdi_phy.c
@@ -17,6 +17,8 @@
 #include "mcdi.h"
 #include "mcdi_pcol.h"
 #include "mdio_10g.h"
+#include "nic.h"
+#include "selftest.h"
 
 struct efx_mcdi_phy_cfg {
 	u32 flags;
@@ -594,6 +596,146 @@ static int efx_mcdi_phy_test_alive(struct efx_nic *efx)
 	return 0;
 }
 
+static const char *const mcdi_sft9001_cable_diag_names[] = {
+	"cable.pairA.length",
+	"cable.pairB.length",
+	"cable.pairC.length",
+	"cable.pairD.length",
+	"cable.pairA.status",
+	"cable.pairB.status",
+	"cable.pairC.status",
+	"cable.pairD.status",
+};
+
+static int efx_mcdi_bist(struct efx_nic *efx, unsigned int bist_mode,
+			 int *results)
+{
+	unsigned int retry, i, count = 0;
+	size_t outlen;
+	u32 status;
+	u8 *buf, *ptr;
+	int rc;
+
+	buf = kzalloc(0x100, GFP_KERNEL);
+	if (buf == NULL)
+		return -ENOMEM;
+
+	BUILD_BUG_ON(MC_CMD_START_BIST_OUT_LEN != 0);
+	MCDI_SET_DWORD(buf, START_BIST_IN_TYPE, bist_mode);
+	rc = efx_mcdi_rpc(efx, MC_CMD_START_BIST, buf, MC_CMD_START_BIST_IN_LEN,
+			  NULL, 0, NULL);
+	if (rc)
+		goto out;
+
+	/* Wait up to 10s for BIST to finish */
+	for (retry = 0; retry < 100; ++retry) {
+		BUILD_BUG_ON(MC_CMD_POLL_BIST_IN_LEN != 0);
+		rc = efx_mcdi_rpc(efx, MC_CMD_POLL_BIST, NULL, 0,
+				  buf, 0x100, &outlen);
+		if (rc)
+			goto out;
+
+		status = MCDI_DWORD(buf, POLL_BIST_OUT_RESULT);
+		if (status != MC_CMD_POLL_BIST_RUNNING)
+			goto finished;
+
+		msleep(100);
+	}
+
+	rc = -ETIMEDOUT;
+	goto out;
+
+finished:
+	results[count++] = (status == MC_CMD_POLL_BIST_PASSED) ? 1 : -1;
+
+	/* SFT9001 specific cable diagnostics output */
+	if (efx->phy_type == PHY_TYPE_SFT9001B &&
+	    (bist_mode == MC_CMD_PHY_BIST_CABLE_SHORT ||
+	     bist_mode == MC_CMD_PHY_BIST_CABLE_LONG)) {
+		ptr = MCDI_PTR(buf, POLL_BIST_OUT_SFT9001_CABLE_LENGTH_A);
+		if (status == MC_CMD_POLL_BIST_PASSED &&
+		    outlen >= MC_CMD_POLL_BIST_OUT_SFT9001_LEN) {
+			for (i = 0; i < 8; i++) {
+				results[count + i] =
+					EFX_DWORD_FIELD(((efx_dword_t *)ptr)[i],
+							EFX_DWORD_0);
+			}
+		}
+		count += 8;
+	}
+	rc = count;
+
+out:
+	kfree(buf);
+
+	return rc;
+}
+
+static int efx_mcdi_phy_run_tests(struct efx_nic *efx, int *results,
+				  unsigned flags)
+{
+	struct efx_mcdi_phy_cfg *phy_cfg = efx->phy_data;
+	u32 mode;
+	int rc;
+
+	if (phy_cfg->flags & (1 << MC_CMD_GET_PHY_CFG_BIST_LBN)) {
+		rc = efx_mcdi_bist(efx, MC_CMD_PHY_BIST, results);
+		if (rc < 0)
+			return rc;
+
+		results += rc;
+	}
+
+	/* If we support both LONG and SHORT, then run each in response to
+	 * break or not. Otherwise, run the one we support */
+	mode = 0;
+	if (phy_cfg->flags & (1 << MC_CMD_GET_PHY_CFG_BIST_CABLE_SHORT_LBN)) {
+		if ((flags & ETH_TEST_FL_OFFLINE) &&
+		    (phy_cfg->flags &
+		     (1 << MC_CMD_GET_PHY_CFG_BIST_CABLE_LONG_LBN)))
+			mode = MC_CMD_PHY_BIST_CABLE_LONG;
+		else
+			mode = MC_CMD_PHY_BIST_CABLE_SHORT;
+	} else if (phy_cfg->flags &
+		   (1 << MC_CMD_GET_PHY_CFG_BIST_CABLE_LONG_LBN))
+		mode = MC_CMD_PHY_BIST_CABLE_LONG;
+
+	if (mode != 0) {
+		rc = efx_mcdi_bist(efx, mode, results);
+		if (rc < 0)
+			return rc;
+		results += rc;
+	}
+
+	return 0;
+}
+
+const char *efx_mcdi_phy_test_name(struct efx_nic *efx, unsigned int index)
+{
+	struct efx_mcdi_phy_cfg *phy_cfg = efx->phy_data;
+
+	if (phy_cfg->flags & (1 << MC_CMD_GET_PHY_CFG_BIST_LBN)) {
+		if (index == 0)
+			return "bist";
+		--index;
+	}
+
+	if (phy_cfg->flags & ((1 << MC_CMD_GET_PHY_CFG_BIST_CABLE_SHORT_LBN) |
+			      (1 << MC_CMD_GET_PHY_CFG_BIST_CABLE_LONG_LBN))) {
+		if (index == 0)
+			return "cable";
+		--index;
+
+		if (efx->phy_type == PHY_TYPE_SFT9001B) {
+			if (index < ARRAY_SIZE(mcdi_sft9001_cable_diag_names))
+				return mcdi_sft9001_cable_diag_names[index];
+			index -= ARRAY_SIZE(mcdi_sft9001_cable_diag_names);
+		}
+	}
+
+	return NULL;
+}
+
 struct efx_phy_operations efx_mcdi_phy_ops = {
 	.probe		= efx_mcdi_phy_probe,
 	.init 	 	= efx_port_dummy_op_int,
@@ -604,6 +746,6 @@ struct efx_phy_operations efx_mcdi_phy_ops = {
 	.get_settings	= efx_mcdi_phy_get_settings,
 	.set_settings	= efx_mcdi_phy_set_settings,
 	.test_alive	= efx_mcdi_phy_test_alive,
-	.run_tests	= NULL,
-	.test_name	= NULL,
+	.run_tests	= efx_mcdi_phy_run_tests,
+	.test_name	= efx_mcdi_phy_test_name,
 };
-- 
1.6.2.5


-- 
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.


^ permalink raw reply related

* [PATCH 16/17] sfc: Test only the first pair of TX queues
From: Ben Hutchings @ 2010-04-28 19:30 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, linux-net-drivers
In-Reply-To: <1272482722.2549.17.camel@achroite.uk.solarflarecom.com>

This makes no immediate difference, but we definitely do not want
to test all TX queues once we allocate a pair of TX queues to each
channel.

Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
---
 drivers/net/sfc/ethtool.c  |    2 +-
 drivers/net/sfc/selftest.c |    4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/net/sfc/ethtool.c b/drivers/net/sfc/ethtool.c
index d9f9c02..cbe9319 100644
--- a/drivers/net/sfc/ethtool.c
+++ b/drivers/net/sfc/ethtool.c
@@ -304,7 +304,7 @@ static int efx_fill_loopback_test(struct efx_nic *efx,
 {
 	struct efx_tx_queue *tx_queue;
 
-	efx_for_each_tx_queue(tx_queue, efx) {
+	efx_for_each_channel_tx_queue(tx_queue, &efx->channel[0]) {
 		efx_fill_test(test_index++, strings, data,
 			      &lb_tests->tx_sent[tx_queue->queue],
 			      EFX_TX_QUEUE_NAME(tx_queue),
diff --git a/drivers/net/sfc/selftest.c b/drivers/net/sfc/selftest.c
index 0106b1d..3a16e06 100644
--- a/drivers/net/sfc/selftest.c
+++ b/drivers/net/sfc/selftest.c
@@ -616,8 +616,8 @@ static int efx_test_loopbacks(struct efx_nic *efx, struct efx_self_tests *tests,
 			goto out;
 		}
 
-		/* Test every TX queue */
-		efx_for_each_tx_queue(tx_queue, efx) {
+		/* Test both types of TX queue */
+		efx_for_each_channel_tx_queue(tx_queue, &efx->channel[0]) {
 			state->offload_csum = (tx_queue->queue ==
 					       EFX_TX_QUEUE_OFFLOAD_CSUM);
 			rc = efx_test_loopback(tx_queue,
-- 
1.6.2.5


-- 
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.


^ permalink raw reply related

* [PATCH 17/17] sfc: Create multiple TX queues
From: Ben Hutchings @ 2010-04-28 19:30 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, linux-net-drivers
In-Reply-To: <1272482722.2549.17.camel@achroite.uk.solarflarecom.com>

Create a core TX queue and 2 hardware TX queues for each channel.
If separate_tx_channels is set, create equal numbers of RX and TX
channels instead.

Rewrite the channel and queue iteration macros accordingly.
Eliminate efx_channel::used_flags as redundant.

Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
---
 drivers/net/sfc/efx.c        |  113 ++++++++++++++++++++++-------------------
 drivers/net/sfc/efx.h        |    4 +-
 drivers/net/sfc/ethtool.c    |    4 +-
 drivers/net/sfc/net_driver.h |   61 ++++++++++------------
 drivers/net/sfc/nic.c        |   12 ++--
 drivers/net/sfc/selftest.c   |    4 +-
 drivers/net/sfc/selftest.h   |    4 +-
 drivers/net/sfc/tx.c         |   61 ++++++++++++++---------
 8 files changed, 140 insertions(+), 123 deletions(-)

diff --git a/drivers/net/sfc/efx.c b/drivers/net/sfc/efx.c
index 5e3f944..bc75ef6 100644
--- a/drivers/net/sfc/efx.c
+++ b/drivers/net/sfc/efx.c
@@ -288,7 +288,7 @@ static int efx_poll(struct napi_struct *napi, int budget)
 	if (spent < budget) {
 		struct efx_nic *efx = channel->efx;
 
-		if (channel->used_flags & EFX_USED_BY_RX &&
+		if (channel->channel < efx->n_rx_channels &&
 		    efx->irq_rx_adaptive &&
 		    unlikely(++channel->irq_count == 1000)) {
 			if (unlikely(channel->irq_mod_score <
@@ -333,7 +333,6 @@ void efx_process_channel_now(struct efx_channel *channel)
 {
 	struct efx_nic *efx = channel->efx;
 
-	BUG_ON(!channel->used_flags);
 	BUG_ON(!channel->enabled);
 
 	/* Disable interrupts and wait for ISRs to complete */
@@ -446,12 +445,12 @@ static void efx_set_channel_names(struct efx_nic *efx)
 
 	efx_for_each_channel(channel, efx) {
 		number = channel->channel;
-		if (efx->n_channels > efx->n_rx_queues) {
-			if (channel->channel < efx->n_rx_queues) {
+		if (efx->n_channels > efx->n_rx_channels) {
+			if (channel->channel < efx->n_rx_channels) {
 				type = "-rx";
 			} else {
 				type = "-tx";
-				number -= efx->n_rx_queues;
+				number -= efx->n_rx_channels;
 			}
 		}
 		snprintf(channel->name, sizeof(channel->name),
@@ -585,8 +584,6 @@ static void efx_remove_channel(struct efx_channel *channel)
 	efx_for_each_channel_tx_queue(tx_queue, channel)
 		efx_remove_tx_queue(tx_queue);
 	efx_remove_eventq(channel);
-
-	channel->used_flags = 0;
 }
 
 void efx_schedule_slow_fill(struct efx_rx_queue *rx_queue, int delay)
@@ -956,10 +953,9 @@ static void efx_fini_io(struct efx_nic *efx)
 	pci_disable_device(efx->pci_dev);
 }
 
-/* Get number of RX queues wanted.  Return number of online CPU
- * packages in the expectation that an IRQ balancer will spread
- * interrupts across them. */
-static int efx_wanted_rx_queues(void)
+/* Get number of channels wanted.  Each channel will have its own IRQ,
+ * 1 RX queue and/or 2 TX queues. */
+static int efx_wanted_channels(void)
 {
 	cpumask_var_t core_mask;
 	int count;
@@ -995,34 +991,39 @@ static void efx_probe_interrupts(struct efx_nic *efx)
 
 	if (efx->interrupt_mode == EFX_INT_MODE_MSIX) {
 		struct msix_entry xentries[EFX_MAX_CHANNELS];
-		int wanted_ints;
-		int rx_queues;
+		int n_channels;
 
-		/* We want one RX queue and interrupt per CPU package
-		 * (or as specified by the rss_cpus module parameter).
-		 * We will need one channel per interrupt.
-		 */
-		rx_queues = rss_cpus ? rss_cpus : efx_wanted_rx_queues();
-		wanted_ints = rx_queues + (separate_tx_channels ? 1 : 0);
-		wanted_ints = min(wanted_ints, max_channels);
+		n_channels = efx_wanted_channels();
+		if (separate_tx_channels)
+			n_channels *= 2;
+		n_channels = min(n_channels, max_channels);
 
-		for (i = 0; i < wanted_ints; i++)
+		for (i = 0; i < n_channels; i++)
 			xentries[i].entry = i;
-		rc = pci_enable_msix(efx->pci_dev, xentries, wanted_ints);
+		rc = pci_enable_msix(efx->pci_dev, xentries, n_channels);
 		if (rc > 0) {
 			EFX_ERR(efx, "WARNING: Insufficient MSI-X vectors"
-				" available (%d < %d).\n", rc, wanted_ints);
+				" available (%d < %d).\n", rc, n_channels);
 			EFX_ERR(efx, "WARNING: Performance may be reduced.\n");
-			EFX_BUG_ON_PARANOID(rc >= wanted_ints);
-			wanted_ints = rc;
+			EFX_BUG_ON_PARANOID(rc >= n_channels);
+			n_channels = rc;
 			rc = pci_enable_msix(efx->pci_dev, xentries,
-					     wanted_ints);
+					     n_channels);
 		}
 
 		if (rc == 0) {
-			efx->n_rx_queues = min(rx_queues, wanted_ints);
-			efx->n_channels = wanted_ints;
-			for (i = 0; i < wanted_ints; i++)
+			efx->n_channels = n_channels;
+			if (separate_tx_channels) {
+				efx->n_tx_channels =
+					max(efx->n_channels / 2, 1U);
+				efx->n_rx_channels =
+					max(efx->n_channels -
+					    efx->n_tx_channels, 1U);
+			} else {
+				efx->n_tx_channels = efx->n_channels;
+				efx->n_rx_channels = efx->n_channels;
+			}
+			for (i = 0; i < n_channels; i++)
 				efx->channel[i].irq = xentries[i].vector;
 		} else {
 			/* Fall back to single channel MSI */
@@ -1033,8 +1034,9 @@ static void efx_probe_interrupts(struct efx_nic *efx)
 
 	/* Try single interrupt MSI */
 	if (efx->interrupt_mode == EFX_INT_MODE_MSI) {
-		efx->n_rx_queues = 1;
 		efx->n_channels = 1;
+		efx->n_rx_channels = 1;
+		efx->n_tx_channels = 1;
 		rc = pci_enable_msi(efx->pci_dev);
 		if (rc == 0) {
 			efx->channel[0].irq = efx->pci_dev->irq;
@@ -1046,8 +1048,9 @@ static void efx_probe_interrupts(struct efx_nic *efx)
 
 	/* Assume legacy interrupts */
 	if (efx->interrupt_mode == EFX_INT_MODE_LEGACY) {
-		efx->n_rx_queues = 1;
 		efx->n_channels = 1 + (separate_tx_channels ? 1 : 0);
+		efx->n_rx_channels = 1;
+		efx->n_tx_channels = 1;
 		efx->legacy_irq = efx->pci_dev->irq;
 	}
 }
@@ -1068,21 +1071,24 @@ static void efx_remove_interrupts(struct efx_nic *efx)
 
 static void efx_set_channels(struct efx_nic *efx)
 {
+	struct efx_channel *channel;
 	struct efx_tx_queue *tx_queue;
 	struct efx_rx_queue *rx_queue;
+	unsigned tx_channel_offset =
+		separate_tx_channels ? efx->n_channels - efx->n_tx_channels : 0;
 
-	efx_for_each_tx_queue(tx_queue, efx) {
-		if (separate_tx_channels)
-			tx_queue->channel = &efx->channel[efx->n_channels-1];
-		else
-			tx_queue->channel = &efx->channel[0];
-		tx_queue->channel->used_flags |= EFX_USED_BY_TX;
+	efx_for_each_channel(channel, efx) {
+		if (channel->channel - tx_channel_offset < efx->n_tx_channels) {
+			channel->tx_queue = &efx->tx_queue[
+				(channel->channel - tx_channel_offset) *
+				EFX_TXQ_TYPES];
+			efx_for_each_channel_tx_queue(tx_queue, channel)
+				tx_queue->channel = channel;
+		}
 	}
 
-	efx_for_each_rx_queue(rx_queue, efx) {
+	efx_for_each_rx_queue(rx_queue, efx)
 		rx_queue->channel = &efx->channel[rx_queue->queue];
-		rx_queue->channel->used_flags |= EFX_USED_BY_RX;
-	}
 }
 
 static int efx_probe_nic(struct efx_nic *efx)
@@ -1096,11 +1102,12 @@ static int efx_probe_nic(struct efx_nic *efx)
 	if (rc)
 		return rc;
 
-	/* Determine the number of channels and RX queues by trying to hook
+	/* Determine the number of channels and queues by trying to hook
 	 * in MSI-X interrupts. */
 	efx_probe_interrupts(efx);
 
 	efx_set_channels(efx);
+	efx->net_dev->real_num_tx_queues = efx->n_tx_channels;
 
 	/* Initialise the interrupt moderation settings */
 	efx_init_irq_moderation(efx, tx_irq_mod_usec, rx_irq_mod_usec, true);
@@ -1187,11 +1194,12 @@ static void efx_start_all(struct efx_nic *efx)
 	/* Mark the port as enabled so port reconfigurations can start, then
 	 * restart the transmit interface early so the watchdog timer stops */
 	efx_start_port(efx);
-	if (efx_dev_registered(efx))
-		efx_wake_queue(efx);
 
-	efx_for_each_channel(channel, efx)
+	efx_for_each_channel(channel, efx) {
+		if (efx_dev_registered(efx))
+			efx_wake_queue(channel);
 		efx_start_channel(channel);
+	}
 
 	efx_nic_enable_interrupts(efx);
 
@@ -1282,7 +1290,9 @@ static void efx_stop_all(struct efx_nic *efx)
 	/* Stop the kernel transmit interface late, so the watchdog
 	 * timer isn't ticking over the flush */
 	if (efx_dev_registered(efx)) {
-		efx_stop_queue(efx);
+		struct efx_channel *channel;
+		efx_for_each_channel(channel, efx)
+			efx_stop_queue(channel);
 		netif_tx_lock_bh(efx->net_dev);
 		netif_tx_unlock_bh(efx->net_dev);
 	}
@@ -1537,9 +1547,8 @@ static void efx_watchdog(struct net_device *net_dev)
 {
 	struct efx_nic *efx = netdev_priv(net_dev);
 
-	EFX_ERR(efx, "TX stuck with stop_count=%d port_enabled=%d:"
-		" resetting channels\n",
-		atomic_read(&efx->netif_stop_count), efx->port_enabled);
+	EFX_ERR(efx, "TX stuck with port_enabled=%d: resetting channels\n",
+		efx->port_enabled);
 
 	efx_schedule_reset(efx, RESET_TYPE_TX_WATCHDOG);
 }
@@ -2014,22 +2023,22 @@ static int efx_init_struct(struct efx_nic *efx, struct efx_nic_type *type,
 
 	efx->net_dev = net_dev;
 	efx->rx_checksum_enabled = true;
-	spin_lock_init(&efx->netif_stop_lock);
 	spin_lock_init(&efx->stats_lock);
 	mutex_init(&efx->mac_lock);
 	efx->mac_op = type->default_mac_ops;
 	efx->phy_op = &efx_dummy_phy_operations;
 	efx->mdio.dev = net_dev;
 	INIT_WORK(&efx->mac_work, efx_mac_work);
-	atomic_set(&efx->netif_stop_count, 1);
 
 	for (i = 0; i < EFX_MAX_CHANNELS; i++) {
 		channel = &efx->channel[i];
 		channel->efx = efx;
 		channel->channel = i;
 		channel->work_pending = false;
+		spin_lock_init(&channel->tx_stop_lock);
+		atomic_set(&channel->tx_stop_count, 1);
 	}
-	for (i = 0; i < EFX_TX_QUEUE_COUNT; i++) {
+	for (i = 0; i < EFX_MAX_TX_QUEUES; i++) {
 		tx_queue = &efx->tx_queue[i];
 		tx_queue->efx = efx;
 		tx_queue->queue = i;
@@ -2201,7 +2210,7 @@ static int __devinit efx_pci_probe(struct pci_dev *pci_dev,
 	int i, rc;
 
 	/* Allocate and initialise a struct net_device and struct efx_nic */
-	net_dev = alloc_etherdev(sizeof(*efx));
+	net_dev = alloc_etherdev_mq(sizeof(*efx), EFX_MAX_CORE_TX_QUEUES);
 	if (!net_dev)
 		return -ENOMEM;
 	net_dev->features |= (type->offload_features | NETIF_F_SG |
diff --git a/drivers/net/sfc/efx.h b/drivers/net/sfc/efx.h
index 7eff0a6..ffd708c 100644
--- a/drivers/net/sfc/efx.h
+++ b/drivers/net/sfc/efx.h
@@ -35,8 +35,8 @@ efx_hard_start_xmit(struct sk_buff *skb, struct net_device *net_dev);
 extern netdev_tx_t
 efx_enqueue_skb(struct efx_tx_queue *tx_queue, struct sk_buff *skb);
 extern void efx_xmit_done(struct efx_tx_queue *tx_queue, unsigned int index);
-extern void efx_stop_queue(struct efx_nic *efx);
-extern void efx_wake_queue(struct efx_nic *efx);
+extern void efx_stop_queue(struct efx_channel *channel);
+extern void efx_wake_queue(struct efx_channel *channel);
 #define EFX_TXQ_SIZE 1024
 #define EFX_TXQ_MASK (EFX_TXQ_SIZE - 1)
 
diff --git a/drivers/net/sfc/ethtool.c b/drivers/net/sfc/ethtool.c
index cbe9319..22026bf 100644
--- a/drivers/net/sfc/ethtool.c
+++ b/drivers/net/sfc/ethtool.c
@@ -647,7 +647,7 @@ static int efx_ethtool_get_coalesce(struct net_device *net_dev,
 	efx_for_each_tx_queue(tx_queue, efx) {
 		channel = tx_queue->channel;
 		if (channel->irq_moderation < coalesce->tx_coalesce_usecs_irq) {
-			if (channel->used_flags != EFX_USED_BY_RX_TX)
+			if (channel->channel < efx->n_rx_channels)
 				coalesce->tx_coalesce_usecs_irq =
 					channel->irq_moderation;
 			else
@@ -690,7 +690,7 @@ static int efx_ethtool_set_coalesce(struct net_device *net_dev,
 
 	/* If the channel is shared only allow RX parameters to be set */
 	efx_for_each_tx_queue(tx_queue, efx) {
-		if ((tx_queue->channel->used_flags == EFX_USED_BY_RX_TX) &&
+		if ((tx_queue->channel->channel < efx->n_rx_channels) &&
 		    tx_usecs) {
 			EFX_ERR(efx, "Channel is shared. "
 				"Only RX coalescing may be set\n");
diff --git a/drivers/net/sfc/net_driver.h b/drivers/net/sfc/net_driver.h
index d68331c..2e6fd89 100644
--- a/drivers/net/sfc/net_driver.h
+++ b/drivers/net/sfc/net_driver.h
@@ -85,9 +85,13 @@ do {if (net_ratelimit()) EFX_LOG(efx, fmt, ##args); } while (0)
 #define EFX_MAX_CHANNELS 32
 #define EFX_MAX_RX_QUEUES EFX_MAX_CHANNELS
 
-#define EFX_TX_QUEUE_OFFLOAD_CSUM	0
-#define EFX_TX_QUEUE_NO_CSUM		1
-#define EFX_TX_QUEUE_COUNT		2
+/* Checksum generation is a per-queue option in hardware, so each
+ * queue visible to the networking core is backed by two hardware TX
+ * queues. */
+#define EFX_MAX_CORE_TX_QUEUES	EFX_MAX_CHANNELS
+#define EFX_TXQ_TYPE_OFFLOAD	1
+#define EFX_TXQ_TYPES		2
+#define EFX_MAX_TX_QUEUES	(EFX_TXQ_TYPES * EFX_MAX_CORE_TX_QUEUES)
 
 /**
  * struct efx_special_buffer - An Efx special buffer
@@ -187,7 +191,7 @@ struct efx_tx_buffer {
 struct efx_tx_queue {
 	/* Members which don't change on the fast path */
 	struct efx_nic *efx ____cacheline_aligned_in_smp;
-	int queue;
+	unsigned queue;
 	struct efx_channel *channel;
 	struct efx_nic *nic;
 	struct efx_tx_buffer *buffer;
@@ -306,11 +310,6 @@ struct efx_buffer {
 };
 

-/* Flags for channel->used_flags */
-#define EFX_USED_BY_RX 1
-#define EFX_USED_BY_TX 2
-#define EFX_USED_BY_RX_TX (EFX_USED_BY_RX | EFX_USED_BY_TX)
-
 enum efx_rx_alloc_method {
 	RX_ALLOC_METHOD_AUTO = 0,
 	RX_ALLOC_METHOD_SKB = 1,
@@ -327,7 +326,6 @@ enum efx_rx_alloc_method {
  * @efx: Associated Efx NIC
  * @channel: Channel instance number
  * @name: Name for channel and IRQ
- * @used_flags: Channel is used by net driver
  * @enabled: Channel enabled indicator
  * @irq: IRQ number (MSI and MSI-X only)
  * @irq_moderation: IRQ moderation value (in hardware ticks)
@@ -352,12 +350,14 @@ enum efx_rx_alloc_method {
  * @n_rx_frm_trunc: Count of RX_FRM_TRUNC errors
  * @n_rx_overlength: Count of RX_OVERLENGTH errors
  * @n_skbuff_leaks: Count of skbuffs leaked due to RX overrun
+ * @tx_queue: Pointer to first TX queue, or %NULL if not used for TX
+ * @tx_stop_count: Core TX queue stop count
+ * @tx_stop_lock: Core TX queue stop lock
  */
 struct efx_channel {
 	struct efx_nic *efx;
 	int channel;
 	char name[IFNAMSIZ + 6];
-	int used_flags;
 	bool enabled;
 	int irq;
 	unsigned int irq_moderation;
@@ -389,6 +389,9 @@ struct efx_channel {
 	struct efx_rx_buffer *rx_pkt;
 	bool rx_pkt_csummed;
 
+	struct efx_tx_queue *tx_queue;
+	atomic_t tx_stop_count;
+	spinlock_t tx_stop_lock;
 };
 
 enum efx_led_mode {
@@ -661,8 +664,9 @@ union efx_multicast_hash {
  * @rx_queue: RX DMA queues
  * @channel: Channels
  * @next_buffer_table: First available buffer table id
- * @n_rx_queues: Number of RX queues
  * @n_channels: Number of channels in use
+ * @n_rx_channels: Number of channels used for RX (= number of RX queues)
+ * @n_tx_channels: Number of channels used for TX
  * @rx_buffer_len: RX buffer length
  * @rx_buffer_order: Order (log2) of number of pages for each RX buffer
  * @int_error_count: Number of internal errors seen recently
@@ -693,8 +697,6 @@ union efx_multicast_hash {
  * @port_initialized: Port initialized?
  * @net_dev: Operating system network device. Consider holding the rtnl lock
  * @rx_checksum_enabled: RX checksumming enabled
- * @netif_stop_count: Port stop count
- * @netif_stop_lock: Port stop lock
  * @mac_stats: MAC statistics. These include all statistics the MACs
  *	can provide.  Generic code converts these into a standard
  *	&struct net_device_stats.
@@ -742,13 +744,14 @@ struct efx_nic {
 	enum nic_state state;
 	enum reset_type reset_pending;
 
-	struct efx_tx_queue tx_queue[EFX_TX_QUEUE_COUNT];
+	struct efx_tx_queue tx_queue[EFX_MAX_TX_QUEUES];
 	struct efx_rx_queue rx_queue[EFX_MAX_RX_QUEUES];
 	struct efx_channel channel[EFX_MAX_CHANNELS];
 
 	unsigned next_buffer_table;
-	int n_rx_queues;
-	int n_channels;
+	unsigned n_channels;
+	unsigned n_rx_channels;
+	unsigned n_tx_channels;
 	unsigned int rx_buffer_len;
 	unsigned int rx_buffer_order;
 
@@ -780,9 +783,6 @@ struct efx_nic {
 	struct net_device *net_dev;
 	bool rx_checksum_enabled;
 
-	atomic_t netif_stop_count;
-	spinlock_t netif_stop_lock;
-
 	struct efx_mac_stats mac_stats;
 	struct efx_buffer stats_buffer;
 	spinlock_t stats_lock;
@@ -928,31 +928,26 @@ struct efx_nic_type {
 /* Iterate over all used channels */
 #define efx_for_each_channel(_channel, _efx)				\
 	for (_channel = &((_efx)->channel[0]);				\
-	     _channel < &((_efx)->channel[EFX_MAX_CHANNELS]);		\
-	     _channel++)						\
-		if (!_channel->used_flags)				\
-			continue;					\
-		else
+	     _channel < &((_efx)->channel[(efx)->n_channels]);		\
+	     _channel++)
 
 /* Iterate over all used TX queues */
 #define efx_for_each_tx_queue(_tx_queue, _efx)				\
 	for (_tx_queue = &((_efx)->tx_queue[0]);			\
-	     _tx_queue < &((_efx)->tx_queue[EFX_TX_QUEUE_COUNT]);	\
+	     _tx_queue < &((_efx)->tx_queue[EFX_TXQ_TYPES *		\
+					    (_efx)->n_tx_channels]);	\
 	     _tx_queue++)
 
 /* Iterate over all TX queues belonging to a channel */
 #define efx_for_each_channel_tx_queue(_tx_queue, _channel)		\
-	for (_tx_queue = &((_channel)->efx->tx_queue[0]);		\
-	     _tx_queue < &((_channel)->efx->tx_queue[EFX_TX_QUEUE_COUNT]); \
-	     _tx_queue++)						\
-		if (_tx_queue->channel != (_channel))			\
-			continue;					\
-		else
+	for (_tx_queue = (_channel)->tx_queue;				\
+	     _tx_queue && _tx_queue < (_channel)->tx_queue + EFX_TXQ_TYPES; \
+	     _tx_queue++)
 
 /* Iterate over all used RX queues */
 #define efx_for_each_rx_queue(_rx_queue, _efx)				\
 	for (_rx_queue = &((_efx)->rx_queue[0]);			\
-	     _rx_queue < &((_efx)->rx_queue[(_efx)->n_rx_queues]);	\
+	     _rx_queue < &((_efx)->rx_queue[(_efx)->n_rx_channels]);	\
 	     _rx_queue++)
 
 /* Iterate over all RX queues belonging to a channel */
diff --git a/drivers/net/sfc/nic.c b/drivers/net/sfc/nic.c
index f3226bb..5d3aaec 100644
--- a/drivers/net/sfc/nic.c
+++ b/drivers/net/sfc/nic.c
@@ -418,7 +418,7 @@ void efx_nic_init_tx(struct efx_tx_queue *tx_queue)
 			      FRF_BZ_TX_NON_IP_DROP_DIS, 1);
 
 	if (efx_nic_rev(efx) >= EFX_REV_FALCON_B0) {
-		int csum = tx_queue->queue == EFX_TX_QUEUE_OFFLOAD_CSUM;
+		int csum = tx_queue->queue & EFX_TXQ_TYPE_OFFLOAD;
 		EFX_SET_OWORD_FIELD(tx_desc_ptr, FRF_BZ_TX_IP_CHKSM_DIS, !csum);
 		EFX_SET_OWORD_FIELD(tx_desc_ptr, FRF_BZ_TX_TCP_CHKSM_DIS,
 				    !csum);
@@ -431,10 +431,10 @@ void efx_nic_init_tx(struct efx_tx_queue *tx_queue)
 		efx_oword_t reg;
 
 		/* Only 128 bits in this register */
-		BUILD_BUG_ON(EFX_TX_QUEUE_COUNT >= 128);
+		BUILD_BUG_ON(EFX_MAX_TX_QUEUES > 128);
 
 		efx_reado(efx, &reg, FR_AA_TX_CHKSM_CFG);
-		if (tx_queue->queue == EFX_TX_QUEUE_OFFLOAD_CSUM)
+		if (tx_queue->queue & EFX_TXQ_TYPE_OFFLOAD)
 			clear_bit_le(tx_queue->queue, (void *)&reg);
 		else
 			set_bit_le(tx_queue->queue, (void *)&reg);
@@ -1132,7 +1132,7 @@ static void efx_poll_flush_events(struct efx_nic *efx)
 		    ev_sub_code == FSE_AZ_TX_DESCQ_FLS_DONE_EV) {
 			ev_queue = EFX_QWORD_FIELD(*event,
 						   FSF_AZ_DRIVER_EV_SUBDATA);
-			if (ev_queue < EFX_TX_QUEUE_COUNT) {
+			if (ev_queue < EFX_TXQ_TYPES * efx->n_tx_channels) {
 				tx_queue = efx->tx_queue + ev_queue;
 				tx_queue->flushed = FLUSH_DONE;
 			}
@@ -1142,7 +1142,7 @@ static void efx_poll_flush_events(struct efx_nic *efx)
 				*event, FSF_AZ_DRIVER_EV_RX_DESCQ_ID);
 			ev_failed = EFX_QWORD_FIELD(
 				*event, FSF_AZ_DRIVER_EV_RX_FLUSH_FAIL);
-			if (ev_queue < efx->n_rx_queues) {
+			if (ev_queue < efx->n_rx_channels) {
 				rx_queue = efx->rx_queue + ev_queue;
 				rx_queue->flushed =
 					ev_failed ? FLUSH_FAILED : FLUSH_DONE;
@@ -1441,7 +1441,7 @@ static void efx_setup_rss_indir_table(struct efx_nic *efx)
 	     offset < FR_BZ_RX_INDIRECTION_TBL + 0x800;
 	     offset += 0x10) {
 		EFX_POPULATE_DWORD_1(dword, FRF_BZ_IT_QUEUE,
-				     i % efx->n_rx_queues);
+				     i % efx->n_rx_channels);
 		efx_writed(efx, &dword, offset);
 		i++;
 	}
diff --git a/drivers/net/sfc/selftest.c b/drivers/net/sfc/selftest.c
index 3a16e06..371e86c 100644
--- a/drivers/net/sfc/selftest.c
+++ b/drivers/net/sfc/selftest.c
@@ -618,8 +618,8 @@ static int efx_test_loopbacks(struct efx_nic *efx, struct efx_self_tests *tests,
 
 		/* Test both types of TX queue */
 		efx_for_each_channel_tx_queue(tx_queue, &efx->channel[0]) {
-			state->offload_csum = (tx_queue->queue ==
-					       EFX_TX_QUEUE_OFFLOAD_CSUM);
+			state->offload_csum = (tx_queue->queue &
+					       EFX_TXQ_TYPE_OFFLOAD);
 			rc = efx_test_loopback(tx_queue,
 					       &tests->loopback[mode]);
 			if (rc)
diff --git a/drivers/net/sfc/selftest.h b/drivers/net/sfc/selftest.h
index 643bef7..aed495a 100644
--- a/drivers/net/sfc/selftest.h
+++ b/drivers/net/sfc/selftest.h
@@ -18,8 +18,8 @@
  */
 
 struct efx_loopback_self_tests {
-	int tx_sent[EFX_TX_QUEUE_COUNT];
-	int tx_done[EFX_TX_QUEUE_COUNT];
+	int tx_sent[EFX_TXQ_TYPES];
+	int tx_done[EFX_TXQ_TYPES];
 	int rx_good;
 	int rx_bad;
 };
diff --git a/drivers/net/sfc/tx.c b/drivers/net/sfc/tx.c
index be0e110..6bb12a8 100644
--- a/drivers/net/sfc/tx.c
+++ b/drivers/net/sfc/tx.c
@@ -30,32 +30,46 @@
  */
 #define EFX_TXQ_THRESHOLD (EFX_TXQ_MASK / 2u)
 
-/* We want to be able to nest calls to netif_stop_queue(), since each
- * channel can have an individual stop on the queue.
- */
-void efx_stop_queue(struct efx_nic *efx)
+/* We need to be able to nest calls to netif_tx_stop_queue(), partly
+ * because of the 2 hardware queues associated with each core queue,
+ * but also so that we can inhibit TX for reasons other than a full
+ * hardware queue. */
+void efx_stop_queue(struct efx_channel *channel)
 {
-	spin_lock_bh(&efx->netif_stop_lock);
+	struct efx_nic *efx = channel->efx;
+
+	if (!channel->tx_queue)
+		return;
+
+	spin_lock_bh(&channel->tx_stop_lock);
 	EFX_TRACE(efx, "stop TX queue\n");
 
-	atomic_inc(&efx->netif_stop_count);
-	netif_stop_queue(efx->net_dev);
+	atomic_inc(&channel->tx_stop_count);
+	netif_tx_stop_queue(
+		netdev_get_tx_queue(
+			efx->net_dev,
+			channel->tx_queue->queue / EFX_TXQ_TYPES));
 
-	spin_unlock_bh(&efx->netif_stop_lock);
+	spin_unlock_bh(&channel->tx_stop_lock);
 }
 
-/* Wake netif's TX queue
- * We want to be able to nest calls to netif_stop_queue(), since each
- * channel can have an individual stop on the queue.
- */
-void efx_wake_queue(struct efx_nic *efx)
+/* Decrement core TX queue stop count and wake it if the count is 0 */
+void efx_wake_queue(struct efx_channel *channel)
 {
+	struct efx_nic *efx = channel->efx;
+
+	if (!channel->tx_queue)
+		return;
+
 	local_bh_disable();
-	if (atomic_dec_and_lock(&efx->netif_stop_count,
-				&efx->netif_stop_lock)) {
+	if (atomic_dec_and_lock(&channel->tx_stop_count,
+				&channel->tx_stop_lock)) {
 		EFX_TRACE(efx, "waking TX queue\n");
-		netif_wake_queue(efx->net_dev);
-		spin_unlock(&efx->netif_stop_lock);
+		netif_tx_wake_queue(
+			netdev_get_tx_queue(
+				efx->net_dev,
+				channel->tx_queue->queue / EFX_TXQ_TYPES));
+		spin_unlock(&channel->tx_stop_lock);
 	}
 	local_bh_enable();
 }
@@ -298,7 +312,7 @@ netdev_tx_t efx_enqueue_skb(struct efx_tx_queue *tx_queue, struct sk_buff *skb)
 	rc = NETDEV_TX_BUSY;
 
 	if (tx_queue->stopped == 1)
-		efx_stop_queue(efx);
+		efx_stop_queue(tx_queue->channel);
 
  unwind:
 	/* Work backwards until we hit the original insert pointer value */
@@ -374,10 +388,9 @@ netdev_tx_t efx_hard_start_xmit(struct sk_buff *skb,
 	if (unlikely(efx->port_inhibited))
 		return NETDEV_TX_BUSY;
 
+	tx_queue = &efx->tx_queue[EFX_TXQ_TYPES * skb_get_queue_mapping(skb)];
 	if (likely(skb->ip_summed == CHECKSUM_PARTIAL))
-		tx_queue = &efx->tx_queue[EFX_TX_QUEUE_OFFLOAD_CSUM];
-	else
-		tx_queue = &efx->tx_queue[EFX_TX_QUEUE_NO_CSUM];
+		tx_queue += EFX_TXQ_TYPE_OFFLOAD;
 
 	return efx_enqueue_skb(tx_queue, skb);
 }
@@ -405,7 +418,7 @@ void efx_xmit_done(struct efx_tx_queue *tx_queue, unsigned int index)
 			netif_tx_lock(efx->net_dev);
 			if (tx_queue->stopped) {
 				tx_queue->stopped = 0;
-				efx_wake_queue(efx);
+				efx_wake_queue(tx_queue->channel);
 			}
 			netif_tx_unlock(efx->net_dev);
 		}
@@ -488,7 +501,7 @@ void efx_fini_tx_queue(struct efx_tx_queue *tx_queue)
 	/* Release queue's stop on port, if any */
 	if (tx_queue->stopped) {
 		tx_queue->stopped = 0;
-		efx_wake_queue(tx_queue->efx);
+		efx_wake_queue(tx_queue->channel);
 	}
 }
 
@@ -1120,7 +1133,7 @@ static int efx_enqueue_skb_tso(struct efx_tx_queue *tx_queue,
 
 	/* Stop the queue if it wasn't stopped before. */
 	if (tx_queue->stopped == 1)
-		efx_stop_queue(efx);
+		efx_stop_queue(tx_queue->channel);
 
  unwind:
 	/* Free the DMA mapping we were in the process of writing out */
-- 
1.6.2.5

-- 
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.


^ permalink raw reply related

* Re: [net-next-2.6 PATCH 1/2] Add netdev port-profile support (take III, was iovnl)
From: Arnd Bergmann @ 2010-04-28 19:33 UTC (permalink / raw)
  To: Scott Feldman; +Cc: davem, netdev, chrisw
In-Reply-To: <C7FDC3B8.2C7EB%scofeldm@cisco.com>

On Wednesday 28 April 2010, Scott Feldman wrote:
> On 4/28/10 6:13 AM, "Arnd Bergmann" <arnd@arndb.de> wrote:
> > 
> > We will need a few more options to cover draft VDP in addition to the protocol
> > your NIC is using. I still think it's possible to use the same interface
> > for both, but the differences are obviously showing.
> > 
> > The missing bits that I can see so far are:
> > 
> > - You only have 'get' and 'set'. We will also need a 'unset' or 'delete'
> >   option in order to get rid of a port profile association.
> 
> That's there in my patch.  If you don't specify anything after port-profile
> keyword, it's an unset.  See extra "[" and "]" above.  Will that work for
> you?

Well, this won't work if you want to have multiple slave interfaces connected
to a single master and record the port profile in the master.

> > - VDP has three different ways to 'set' a port profile: 'associate',
> >   'pre-associate with resource reservation' and 'pre-associate without
> >   resource reservation'. This could become an extra option flag.
> 
> Ok, let's add an option flag bit field.  I'm not sure how that looks from
> the iproute2 cmd line.  Would you take a stab at defining these?
> 
> > - Instead of a port profile name, VDP specifies a tuple like
> >    struct vsi_associate {
> > unsigned char VSI_Mgr_ID;       /* VSI manager ID */
> > unsigned char VSI_Type_ID[3];   /* 24 bit VSI Type ID */
> > unsigned char VSI_Type_Version; /* VSI Type version */
> >    };
> >    I'm not sure how to deal with that best, but there needs to be
> >    some parsing of these numbers.
> 
> PORT-PROFILE above is a u8* array.  You could decide to encode the tuple in
> a string, e.g. "1.12345.1", and let the receiver parse it?  Or pack it in as
> binary.  PORT-PROFILE for us is just a string identifier, e.g. "corp-net-10"
> or "joes-garage".

I guess either one would work, but I'd prefer to do the parsing at
the front-end and passing it as binary to avoid libvirt encoding
it as ascii and lldpad (or the kernel) parsing the data again, which
would be more error-prone.

Another alternative would be to make this a distinct argument (or even
three of them) and only allow passing one or the other. That would
also implicitly choose the protocol (VDP or port extender).
 
> > - VDP requires a vlan ID to be part of the association, in addition to
> >   the MAC address. In theory, we could have multiple tuples of MAC+VLAN
> >   addresses, but we can probably just associate each tuple separately
> >   and ignore that part of the standard.
> 
> I don't think I have enough information to spec this out for this item.
> Would you take a stab at how this would look in the struct and how it would
> look from the iproute2 cmd line?  (Note I'm using iproute2 cmd line for
> illustrative purposes, but the sender of the msg could be something like
> libvirt).

Just adding the VID would be done I wrote at the end of my last mail.
Adding multiple VLAN/MAC pairs is probably not necessary if we can
do multiple associations.

> > - we have a set of possible error conditions that can be returned by
> >   the switch (invalid format, insufficient resources, unknown VTID,
> >   VTID violation, VTID verison violation, out of sync). It should be
> >   possible to return each of these to the user with 'get'.
> 
> There is a status code in the get cmd as defined in my patch.  I have it as
> a u8 with some enum codes.  Can we add to the enum code list?  Or do you
> want to return a full string?  Our requirements are we return one of:
> {success, error, in-progress}.

enum is fine, but I think it would be good to use the same numbers
as the VDP standard where possible. Maybe we could use two bytes,
the first one for the overall status (success, error, in progress)
and the second one for the specific error (as above)?

> >> Since we're using netlink sockets, the receiver of the RTM_SETLINK msg can
> >> be in kernel- or user-space.  For kernel-space recipient, rtnetlink.c, the
> >> new ndo_set_port_profile netdev op is called to set the port-profile.
> >> User-space recipients can decide how they propagate the msg to the switch.
> >> There is also a RTM_GETLINK cmd to to return port-profile setting of an
> >> interface and to also return the status of the last port-profile.
> > 
> > More on a stylistic note, I'm not convinced that using RTM_SETLINK/GETLINK
> > is the right interface for this, unlike the 'ip link set DEV vf ...' stuff,
> > because it seems to suggest that this is an option of the adapter itself.
> > I actually liked the iovnl family better in this regard, because it kept
> > the protocols separate.
> 
> Wait a second...I abandoned iovnl and moved to if_link based on suggestions
> from you and others.  On 4/21/10 2:13 PM, "Arnd Bergmann" <arnd@arndb.de>
> wrote:
>
>    > Right. My preference would probably be make these a subcategory of
>    > the if_link, and use the existing RTM_NEWLINK/RTM_DELLINK commands.
> 
> My latest with if_link fits best with what we're trying to do with enic.

Sorry for the misunderstanding, I was probably not clear enough. This
was a side-discussion about how we should create and destroy virtual
interfaces on IOV capable adapters. My point was that this should be
_separate_ from the port profile association. For creating a
macvlan/macvtap device, we already use RTM_NEWLINK, which does not
associate the device with the switch, so I suggested that for creating
a virtual interface with hardware support we should do the same thing,
but leave the association somewhere else. iovnl sounds like a good
place for that.

> Note the current interface is per netdev interface (a link), where the
> netdev interface could be the adapter itself or any other netdev such as
> macvlan, bond, etc.

We certainly missed each others arguments here, see my other mail.
In order to do the port profile association from software, we
definitely need the master device (nic, PF, bond, bridge, ...) so
we have a way to communicate to the switch, not the slave device
(VF, tap, macvtap, ...) that we are trying to associate.

> > What I could imagine to unify this is something like
> > 
> >    ip port_profile set DEVICE [ { pre_associate | pre_associate_rr } ]
> >                               { name PORT-PROFILE | vsi MGR:VTID:VER }
> >                                     mac LLADDR
> >    [ vlan VID ]
> >                                     [ host_uuid HOST_UUID ]
> >                                     [ client_uuid CLIENT_UUID ]
> >                                     [ client_name CLIENT_NAME ]
> >    ip port_profile del DEVICE [ mac LLADDR [ vlan VID ] ]
> >    ip port_profile show DEVICE
> 
> If we want to break port_profile out into it's own ip cmd, I'm ok with that.
> What you have above would work for enic.  The netdev would have these ops:
> 
>     ndo_set_port_profile
>     ndo_get_port_profile
>     ndo_del_port_profile
> 
> Sounds OK?

That sounds good.

	Arnd

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox