Netdev List


* Re: [PATCH 1/1] ipv6: ignore looped-back NA while dad is running
From: David Miller @ 2011-04-15 22:44 UTC
  To: sahne; +Cc: netdev, linux-kernel
In-Reply-To: <20110414070925.GA78446@0x90.at>

From: Daniel Walter <sahne@0x90.at>
Date: Thu, 14 Apr 2011 09:09:25 +0200

> [ipv6] Ignore looped-back NAs while in Duplicate Address Detection
> 
> If we send an unsolicited NA shortly after bringing up an
> IPv6 address, the duplicate address detection algorithm
> fails and the IP address stays in tentative mode forever.
> This is due to a missing check for whether the NA was looped back to us.
> 
> Signed-off-by: Daniel Walter <dwalter@barracuda.com>

Applied to net-next-2.6
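
The fix described above comes down to one extra condition when a
Neighbour Advertisement arrives for an address that is still tentative.
The following is a minimal user-space sketch of that decision, not the
applied patch; na_means_duplicate() and its parameters are hypothetical
names chosen for illustration.

```c
#include <stdbool.h>

/*
 * Decide whether a received Neighbour Advertisement (NA) for one of our
 * own addresses should make Duplicate Address Detection (DAD) fail.
 * Hypothetical helper: the names are illustrative, not kernel API.
 */
bool na_means_duplicate(bool addr_is_tentative, bool looped_back)
{
	/* DAD only runs while the address is still tentative. */
	if (!addr_is_tentative)
		return false;

	/* Our own NA, looped back to us, says nothing about other nodes. */
	if (looped_back)
		return false;

	/* Another node on the link is advertising the same address. */
	return true;
}
```

Without the looped_back test, the second branch is missing and the
node's own unsolicited NA is mistaken for a conflicting one.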


* Re: [PATCH 1/1] ipv6: RTA_PREFSRC support for ipv6 route source address selection
From: David Miller @ 2011-04-15 22:45 UTC
  To: sahne; +Cc: netdev, linux-kernel
In-Reply-To: <20110414071057.GB78446@0x90.at>

From: Daniel Walter <sahne@0x90.at>
Date: Thu, 14 Apr 2011 09:10:57 +0200

> [ipv6] Add support for RTA_PREFSRC
> 
> This patch allows a user to select the preferred source address
> for a specific IPv6-Route. It can be set via a netlink message
> setting RTA_PREFSRC to a valid IPv6 address which must be
> up on the device the route will be bound to.
> 
> 
> Signed-off-by: Daniel Walter <dwalter@barracuda.com>

Applied to net-next-2.6

> +		err = ip6_route_get_saddr(net, rt, &fl6->daddr, 
                                                                ^^

This line had trailing whitespace, please avoid this in the future
as GIT complains about it and I have to fix it up by hand.


* Re: [PATCH] net: export skb_clone_tx_timestamp
From: David Miller @ 2011-04-15 22:46 UTC
  To: richardcochran; +Cc: netdev
In-Reply-To: <20110414173502.GA15244@riccoc20.at.omicron.at>

From: Richard Cochran <richardcochran@gmail.com>
Date: Thu, 14 Apr 2011 19:35:02 +0200

> MAC drivers compiled as modules may well want to call this function via
> the skb_tx_timestamp inline function. This patch exports the function in
> order to let this happen.
> 
> Signed-off-by: Richard Cochran <richard.cochran@omicron.at>

You can submit this patch to export this function when you also submit
a patch to an upstream driver that makes use of this interface in such
a way.

But no sooner.


* Re: [PATCH] minor cleanup to net_namespace.c.
From: David Miller @ 2011-04-15 22:48 UTC
  To: jpirko; +Cc: rlandley, linux-kernel, netdev, eric.dumazet
In-Reply-To: <20110415123751.GC2697@psychotron>

From: Jiri Pirko <jpirko@redhat.com>
Date: Fri, 15 Apr 2011 14:37:52 +0200

> Fri, Apr 15, 2011 at 02:26:25PM CEST, rlandley@parallels.com wrote:
>>From: Rob Landley <rlandley@parallels.com>
>>
>>Inline a small static function that's only ever called from one place.
>>
>>Signed-off-by: Rob Landley <rlandley@parallels.com>
 ...
> Reviewed-by: Jiri Pirko <jpirko@redhat.com>

Applied.


* Re: [PATCH] net: mlx4: convert to hw_features
From: David Miller @ 2011-04-15 22:50 UTC
  To: mirq-linux; +Cc: netdev, yevgenyp, eli
In-Reply-To: <20110415145049.D929D13A67@rere.qmqm.pl>

From: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Date: Fri, 15 Apr 2011 16:50:49 +0200 (CEST)

> Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>

Applied.


* Re: [PATCH] net: spider_net: convert to hw_features
From: David Miller @ 2011-04-15 22:51 UTC
  To: mirq-linux; +Cc: netdev, kou.ishizaki, jens
In-Reply-To: <20110415145049.D0047138DD@rere.qmqm.pl>

From: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Date: Fri, 15 Apr 2011 16:50:49 +0200 (CEST)

> Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>

Applied.


* Re: [PATCH] net: dm9000: convert to hw_features
From: David Miller @ 2011-04-15 22:51 UTC
  To: mirq-linux; +Cc: netdev, ben-linux, henry.nestler
In-Reply-To: <20110415145050.0222B13A68@rere.qmqm.pl>

From: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Date: Fri, 15 Apr 2011 16:50:49 +0200 (CEST)

> Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>

Applied.


* Re: [PATCH] net: gianfar: convert to hw_features
From: David Miller @ 2011-04-15 22:51 UTC
  To: mirq-linux; +Cc: netdev, oakad, cbouatmailru, jarkao2
In-Reply-To: <20110415145050.39D5013A69@rere.qmqm.pl>

From: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Date: Fri, 15 Apr 2011 16:50:50 +0200 (CEST)

> Note: I bet that gfar_set_features() doesn't really need a full reset.
> 
> Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>

Applied.


* Re: [PATCH] net: forcedeth: convert to hw_features
From: David Miller @ 2011-04-15 22:51 UTC
  To: mirq-linux; +Cc: netdev
In-Reply-To: <20110415145049.CC33613A65@rere.qmqm.pl>

From: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Date: Fri, 15 Apr 2011 16:50:49 +0200 (CEST)

> This also fixes a race around np->txrxctl_bits while changing RXCSUM offload.
> 
> Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>

Applied.


* Re: [PATCH 2/3] net: Add net device irq siloing feature
From: Ben Hutchings @ 2011-04-15 22:49 UTC
  To: Neil Horman
  Cc: netdev, davem, Dimitris Michailidis, Thomas Gleixner,
	David Howells, Eric Dumazet, Tom Herbert
In-Reply-To: <1302898677-3833-3-git-send-email-nhorman@tuxdriver.com>

On Fri, 2011-04-15 at 16:17 -0400, Neil Horman wrote:
> Using the irq affinity infrastructure, we can now allow net devices to call
> request_irq using a new wrapper function (request_net_irq), which will attach a
> common affinty_update handler to each requested irq.  This affinity update
> mechanism correlates each tracked irq to the flow(s) that said irq processes
> most frequently.  The highest traffic flow is noted, marked and exported to user
> space via the affinity_hint proc file for each irq. In this way, utilities like
> irqbalance are able to determine which cpu is receiving the most data from each
> rx queue on a given NIC, and set irq affinity accordingly.
[...]

Is irqbalance expected to poll the affinity hints?  How often?

Ben.

-- 
Ben Hutchings, Senior Software Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.



* Re: net: Automatic IRQ siloing for network devices
From: Ben Hutchings @ 2011-04-15 22:54 UTC
  To: Neil Horman; +Cc: netdev, davem
In-Reply-To: <1302898677-3833-1-git-send-email-nhorman@tuxdriver.com>

On Fri, 2011-04-15 at 16:17 -0400, Neil Horman wrote:
> Automatic IRQ siloing for network devices
> 
> At last year's netconf:
> http://vger.kernel.org/netconf2010.html
> 
> Tom Herbert gave a talk in which he outlined some of the things we can do to
> improve scalability and throughput in our network stack.
> 
> One of the big items on the slides was the notion of siloing irqs, which is the
> practice of setting irq affinity to a cpu or cpu set that was 'close' to the
> process that would be consuming data.  The idea was to ensure that a hard irq
> for a nic (and its subsequent softirq) would execute on the same cpu as the
> process consuming the data, increasing cache hit rates and speeding up overall
> throughput.
> 
> I had taken an idea away from that talk, and have finally gotten around to
> implementing it.  One of the problems with the above approach is that it's all
> quite manual.  I.e. to properly enact this siloing, you have to do a few things
> by hand:
> 
> 1) decide which process is the heaviest user of a given rx queue 
> 2) restrict the cpus which that task will run on
> 3) identify the irq which the rx queue in (1) maps to
> 4) manually set the affinity for the irq in (3) to cpus which match the cpus in
> (2)
[...]
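
Step (4) of the quoted list is normally done by writing a hexadecimal
CPU bitmask to /proc/irq/<irq>/smp_affinity.  A minimal C sketch of that
manual step follows; the helper names irq_affinity_path() and
set_irq_affinity() are made up for illustration, and the single "%x"
word only covers the first 32 CPUs.

```c
#include <stdio.h>

/* Build the procfs path that holds the CPU mask for one IRQ. */
int irq_affinity_path(unsigned int irq, char *buf, size_t len)
{
	int n = snprintf(buf, len, "/proc/irq/%u/smp_affinity", irq);

	/* Treat truncation as an error so a partial path is never used. */
	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Step (4): write a CPU bitmask (bit N set = CPU N allowed) for the IRQ. */
int set_irq_affinity(unsigned int irq, unsigned int cpu_mask)
{
	char path[64];
	FILE *f;
	int ok;

	if (irq_affinity_path(irq, path, sizeof(path)) < 0)
		return -1;

	f = fopen(path, "w");
	if (!f)
		return -1;	/* usually needs root; the IRQ may not exist */

	ok = fprintf(f, "%x\n", cpu_mask) > 0;
	if (fclose(f) != 0)
		ok = 0;		/* the kernel may reject the mask at flush time */

	return ok ? 0 : -1;
}
```

For example, set_irq_affinity(42, 0x5) would ask for IRQ 42 to be
delivered to CPUs 0 and 2; steps (1) to (3) still have to supply the IRQ
number and the mask.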

This presumably works well with small numbers of flows and/or large
numbers of queues.  You could scale it up somewhat by manipulating the
device's flow hash indirection table, but that usually only has 128
entries.  (Changing the indirection table is currently quite expensive,
though that could be changed.)
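
To make the 128-entry limit concrete, here is a rough model of a flow
hash indirection table in plain C.  It is a sketch under simplifying
assumptions: real NICs differ in how they index the table, and the
names rss_fill_round_robin(), rss_select_queue() and rss_set_bucket()
are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define RSS_INDIR_SIZE 128	/* table size mentioned above */

/* Spread the buckets evenly over n_queues RX queues (1 <= n_queues <= 256). */
void rss_fill_round_robin(uint8_t table[RSS_INDIR_SIZE], unsigned int n_queues)
{
	for (size_t i = 0; i < RSS_INDIR_SIZE; i++)
		table[i] = (uint8_t)(i % n_queues);
}

/* Pick the RX queue for a packet: the flow hash selects a bucket. */
unsigned int rss_select_queue(const uint8_t table[RSS_INDIR_SIZE],
			      uint32_t flow_hash)
{
	return table[flow_hash % RSS_INDIR_SIZE];
}

/* Re-point one bucket at another queue; every flow in it moves together. */
void rss_set_bucket(uint8_t table[RSS_INDIR_SIZE], unsigned int bucket,
		    uint8_t queue)
{
	table[bucket % RSS_INDIR_SIZE] = queue;
}
```

Because all flows whose hash lands in the same bucket share a queue, the
table can steer at most 128 groups of flows independently, however many
flows are active.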

I see RFS and accelerated RFS as the only reasonable way to scale to
large numbers of flows.  And as part of accelerated RFS, I already did
the work for mapping CPUs to IRQs (note, not the other way round).  If
IRQ affinity keeps changing then it will significantly undermine the
usefulness of hardware flow steering.

Now I'm not saying that your approach is useless.  There is more
hardware out there with flow hashing than with flow steering, and there
are presumably many systems with small numbers of active flows.  But I
think we need to avoid having two features that conflict and a
requirement for administrators to make a careful selection between them.

Ben.

-- 
Ben Hutchings, Senior Software Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.


* [PATCH net-next-2.6 0/3] bonding,ipv4,ipv6,vlan: Use notifiers to trigger advertisements on failover
From: Ben Hutchings @ 2011-04-15 23:41 UTC
  To: Jay Vosburgh, Andy Gospodarek, Patrick McHardy; +Cc: netdev, Ian Campbell

It is undesirable for the bonding driver to be poking into higher
level protocols, and notifiers provide a way to avoid that.

Ian added NETDEV_NOTIFY_PEERS to trigger gratuitous ARP on VM migration.
We should extend that to unsolicited NAs and propagate it through VLANs,
then treat bonding failover in the same way.
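
The notifier pattern referred to here can be illustrated outside the
kernel with a small callback chain.  This is a user-space sketch only:
the types and functions below (struct notifier, chain_register(),
chain_call(), announce_on_notify_peers()) are stand-ins invented for the
example, not the kernel's notifier API.

```c
#include <stddef.h>

/* Hypothetical event codes standing in for the NETDEV_* notifier events. */
enum netdev_event { EV_NOTIFY_PEERS, EV_OTHER };

struct notifier {
	void (*call)(struct notifier *self, enum netdev_event ev,
		     const char *dev);
	struct notifier *next;
	unsigned int sent;	/* advertisements this listener has "sent" */
};

/* Add a listener to the front of the chain. */
void chain_register(struct notifier **head, struct notifier *n)
{
	n->next = *head;
	*head = n;
}

/* Deliver one event to every registered listener. */
void chain_call(struct notifier *head, enum netdev_event ev, const char *dev)
{
	for (struct notifier *n = head; n != NULL; n = n->next)
		n->call(n, ev, dev);
}

/* Stand-in for a protocol handler (gratuitous ARP or unsolicited NA). */
void announce_on_notify_peers(struct notifier *self, enum netdev_event ev,
			      const char *dev)
{
	(void)dev;	/* a real handler would transmit on this device */
	if (ev == EV_NOTIFY_PEERS)
		self->sent++;
}
```

The device driver only raises the event; each protocol decides for
itself what to transmit, so the driver needs no knowledge of ARP or
neighbour discovery.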

Ben.

Ben Hutchings (3):
  ipv6: Send unsolicited neighbour advertisements when notified
  vlan: Propagate NETDEV_NOTIFY_PEERS notifier
  bonding,ipv4,ipv6,vlan: Handle NETDEV_BONDING_FAILOVER like
    NETDEV_NOTIFY_PEERS

 drivers/net/bonding/Makefile     |    3 -
 drivers/net/bonding/bond_ipv6.c  |  225 --------------------------------------
 drivers/net/bonding/bond_main.c  |   96 ----------------
 drivers/net/bonding/bond_sysfs.c |   80 --------------
 drivers/net/bonding/bonding.h    |   29 -----
 net/8021q/vlan.c                 |   12 ++
 net/ipv4/devinet.c               |    1 +
 net/ipv6/ndisc.c                 |   27 +++++
 8 files changed, 40 insertions(+), 433 deletions(-)
 delete mode 100644 drivers/net/bonding/bond_ipv6.c

-- 
1.7.4


-- 
Ben Hutchings, Senior Software Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.



* [PATCH net-next-2.6 1/3] ipv6: Send unsolicited neighbour advertisements when notified
From: Ben Hutchings @ 2011-04-15 23:46 UTC
  To: David Miller
  Cc: netdev, Alexey Kuznetsov, Pekka Savola (ipv6), James Morris,
	Hideaki YOSHIFUJI, Patrick McHardy

The NETDEV_NOTIFY_PEERS notifier is a request to send such
advertisements following migration to a different physical link,
e.g. virtual machine migration.

Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
---
This seems to work and should match what the bonding driver was
previously doing on failover, except that it iterates over all
addresses.  I don't know whether it's actually right though.
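
For readers unfamiliar with the destination used in the patch below:
addrconf_addr_solict_mult() derives the solicited-node multicast address
(ff02::1:ffXX:XXXX, per RFC 4291) from the low 24 bits of a unicast
address.  A user-space sketch of the same mapping, with the hypothetical
helper names solicited_node_mcast() and solicited_node_text():

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>

/* Build ff02::1:ffXX:XXXX from the low 24 bits of a unicast address. */
void solicited_node_mcast(const struct in6_addr *addr, struct in6_addr *mcast)
{
	memset(mcast, 0, sizeof(*mcast));
	mcast->s6_addr[0]  = 0xff;
	mcast->s6_addr[1]  = 0x02;
	mcast->s6_addr[11] = 0x01;
	mcast->s6_addr[12] = 0xff;
	memcpy(&mcast->s6_addr[13], &addr->s6_addr[13], 3);
}

/*
 * Text-to-text convenience wrapper.  Returns NULL if addr_text is not a
 * valid IPv6 address.  Uses a static buffer, so it is not thread-safe.
 */
const char *solicited_node_text(const char *addr_text)
{
	static char buf[INET6_ADDRSTRLEN];
	struct in6_addr addr, mcast;

	if (inet_pton(AF_INET6, addr_text, &addr) != 1)
		return NULL;

	solicited_node_mcast(&addr, &mcast);
	return inet_ntop(AF_INET6, &mcast, buf, sizeof(buf));
}
```

With the example address from RFC 4291, 4037::1:800:200e:8c6c maps to
ff02::1:ff0e:8c6c.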

Ben.

 net/ipv6/ndisc.c |   26 ++++++++++++++++++++++++++
 1 files changed, 26 insertions(+), 0 deletions(-)

diff --git a/net/ipv6/ndisc.c b/net/ipv6/ndisc.c
index 92f952d..a51fa74c 100644
--- a/net/ipv6/ndisc.c
+++ b/net/ipv6/ndisc.c
@@ -611,6 +611,29 @@ static void ndisc_send_na(struct net_device *dev, struct neighbour *neigh,
 		     inc_opt ? ND_OPT_TARGET_LL_ADDR : 0);
 }
 
+static void ndisc_send_unsol_na(struct net_device *dev)
+{
+	struct inet6_dev *idev;
+	struct inet6_ifaddr *ifa;
+	struct in6_addr mcaddr;
+
+	idev = in6_dev_get(dev);
+	if (!idev)
+		return;
+
+	read_lock_bh(&idev->lock);
+	list_for_each_entry(ifa, &idev->addr_list, if_list) {
+		addrconf_addr_solict_mult(&ifa->addr, &mcaddr);
+		ndisc_send_na(dev, NULL, &mcaddr, &ifa->addr,
+			      /*router=*/ !!idev->cnf.forwarding,
+			      /*solicited=*/ false, /*override=*/ true,
+			      /*inc_opt=*/ true);
+	}
+	read_unlock_bh(&idev->lock);
+
+	in6_dev_put(idev);
+}
+
 void ndisc_send_ns(struct net_device *dev, struct neighbour *neigh,
 		   const struct in6_addr *solicit,
 		   const struct in6_addr *daddr, const struct in6_addr *saddr)
@@ -1722,6 +1745,9 @@ static int ndisc_netdev_event(struct notifier_block *this, unsigned long event,
 		neigh_ifdown(&nd_tbl, dev);
 		fib6_run_gc(~0UL, net);
 		break;
+	case NETDEV_NOTIFY_PEERS:
+		ndisc_send_unsol_na(dev);
+		break;
 	default:
 		break;
 	}
-- 
1.7.4



-- 
Ben Hutchings, Senior Software Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.



* [PATCH net-next-2.6 2/3] vlan: Propagate NETDEV_NOTIFY_PEERS notifier
From: Ben Hutchings @ 2011-04-15 23:46 UTC
  To: David Miller, Patrick McHardy; +Cc: netdev

The NETDEV_NOTIFY_PEERS notifier indicates that a device moved to a
different physical link; this also applies to any VLAN devices on top
of it.

Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
---
 net/8021q/vlan.c |   11 +++++++++++
 1 files changed, 11 insertions(+), 0 deletions(-)

diff --git a/net/8021q/vlan.c b/net/8021q/vlan.c
index 14ef5ef..b2ff70f 100644
--- a/net/8021q/vlan.c
+++ b/net/8021q/vlan.c
@@ -499,6 +499,17 @@ static int vlan_device_event(struct notifier_block *unused, unsigned long event,
 	case NETDEV_PRE_TYPE_CHANGE:
 		/* Forbid underlaying device to change its type. */
 		return NOTIFY_BAD;
+
+	case NETDEV_NOTIFY_PEERS:
+		/* Propagate to vlan devices */
+		for (i = 0; i < VLAN_N_VID; i++) {
+			vlandev = vlan_group_get_device(grp, i);
+			if (!vlandev)
+				continue;
+
+			call_netdevice_notifiers(NETDEV_NOTIFY_PEERS, vlandev);
+		}
+		break;
 	}
 
 out:
-- 
1.7.4



-- 
Ben Hutchings, Senior Software Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.



* [PATCH net-next-2.6 3/3] bonding,ipv4,ipv6,vlan: Handle NETDEV_BONDING_FAILOVER like NETDEV_NOTIFY_PEERS
From: Ben Hutchings @ 2011-04-15 23:47 UTC
  To: David Miller, Jay Vosburgh, Andy Gospodarek, Patrick McHardy; +Cc: netdev

It is undesirable for the bonding driver to be poking into higher
level protocols, and notifiers provide a way to avoid that.  This does
mean removing the ability to configure repetition of gratuitous ARPs
and unsolicited NAs.

Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
---
 drivers/net/bonding/Makefile     |    3 -
 drivers/net/bonding/bond_ipv6.c  |  225 --------------------------------------
 drivers/net/bonding/bond_main.c  |   96 ----------------
 drivers/net/bonding/bond_sysfs.c |   80 --------------
 drivers/net/bonding/bonding.h    |   29 -----
 net/8021q/vlan.c                 |    3 +-
 net/ipv4/devinet.c               |    1 +
 net/ipv6/ndisc.c                 |    1 +
 8 files changed, 4 insertions(+), 434 deletions(-)
 delete mode 100644 drivers/net/bonding/bond_ipv6.c

diff --git a/drivers/net/bonding/Makefile b/drivers/net/bonding/Makefile
index 3c5c014..4c21bf6 100644
--- a/drivers/net/bonding/Makefile
+++ b/drivers/net/bonding/Makefile
@@ -9,6 +9,3 @@ bonding-objs := bond_main.o bond_3ad.o bond_alb.o bond_sysfs.o bond_debugfs.o
 proc-$(CONFIG_PROC_FS) += bond_procfs.o
 bonding-objs += $(proc-y)
 
-ipv6-$(subst m,y,$(CONFIG_IPV6)) += bond_ipv6.o
-bonding-objs += $(ipv6-y)
-
diff --git a/drivers/net/bonding/bond_ipv6.c b/drivers/net/bonding/bond_ipv6.c
deleted file mode 100644
index 84fbd4e..0000000
--- a/drivers/net/bonding/bond_ipv6.c
+++ /dev/null
@@ -1,225 +0,0 @@
-/*
- * Copyright(c) 2008 Hewlett-Packard Development Company, L.P.
- *
- * This program is free software; you can redistribute it and/or modify it
- * under the terms of the GNU General Public License as published by the
- * Free Software Foundation; either version 2 of the License, or
- * (at your option) any later version.
- *
- * This program is distributed in the hope that it will be useful, but
- * WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
- * or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
- * for more details.
- *
- * You should have received a copy of the GNU General Public License along
- * with this program; if not, write to the Free Software Foundation, Inc.,
- * 59 Temple Place - Suite 330, Boston, MA  02111-1307, USA.
- *
- * The full GNU General Public License is included in this distribution in the
- * file called LICENSE.
- *
- */
-
-#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
-
-#include <linux/types.h>
-#include <linux/if_vlan.h>
-#include <net/ipv6.h>
-#include <net/ndisc.h>
-#include <net/addrconf.h>
-#include <net/netns/generic.h>
-#include "bonding.h"
-
-/*
- * Assign bond->master_ipv6 to the next IPv6 address in the list, or
- * zero it out if there are none.
- */
-static void bond_glean_dev_ipv6(struct net_device *dev, struct in6_addr *addr)
-{
-	struct inet6_dev *idev;
-
-	if (!dev)
-		return;
-
-	idev = in6_dev_get(dev);
-	if (!idev)
-		return;
-
-	read_lock_bh(&idev->lock);
-	if (!list_empty(&idev->addr_list)) {
-		struct inet6_ifaddr *ifa
-			= list_first_entry(&idev->addr_list,
-					   struct inet6_ifaddr, if_list);
-		ipv6_addr_copy(addr, &ifa->addr);
-	} else
-		ipv6_addr_set(addr, 0, 0, 0, 0);
-
-	read_unlock_bh(&idev->lock);
-
-	in6_dev_put(idev);
-}
-
-static void bond_na_send(struct net_device *slave_dev,
-			 struct in6_addr *daddr,
-			 int router,
-			 unsigned short vlan_id)
-{
-	struct in6_addr mcaddr;
-	struct icmp6hdr icmp6h = {
-		.icmp6_type = NDISC_NEIGHBOUR_ADVERTISEMENT,
-	};
-	struct sk_buff *skb;
-
-	icmp6h.icmp6_router = router;
-	icmp6h.icmp6_solicited = 0;
-	icmp6h.icmp6_override = 1;
-
-	addrconf_addr_solict_mult(daddr, &mcaddr);
-
-	pr_debug("ipv6 na on slave %s: dest %pI6, src %pI6\n",
-		 slave_dev->name, &mcaddr, daddr);
-
-	skb = ndisc_build_skb(slave_dev, &mcaddr, daddr, &icmp6h, daddr,
-			      ND_OPT_TARGET_LL_ADDR);
-
-	if (!skb) {
-		pr_err("NA packet allocation failed\n");
-		return;
-	}
-
-	if (vlan_id) {
-		/* The Ethernet header is not present yet, so it is
-		 * too early to insert a VLAN tag.  Force use of an
-		 * out-of-line tag here and let dev_hard_start_xmit()
-		 * insert it if the slave hardware can't.
-		 */
-		skb = __vlan_hwaccel_put_tag(skb, vlan_id);
-		if (!skb) {
-			pr_err("failed to insert VLAN tag\n");
-			return;
-		}
-	}
-
-	ndisc_send_skb(skb, slave_dev, NULL, &mcaddr, daddr, &icmp6h);
-}
-
-/*
- * Kick out an unsolicited Neighbor Advertisement for an IPv6 address on
- * the bonding master.  This will help the switch learn our address
- * if in active-backup mode.
- *
- * Caller must hold curr_slave_lock for read or better
- */
-void bond_send_unsolicited_na(struct bonding *bond)
-{
-	struct slave *slave = bond->curr_active_slave;
-	struct vlan_entry *vlan;
-	struct inet6_dev *idev;
-	int is_router;
-
-	pr_debug("%s: bond %s slave %s\n", bond->dev->name,
-		 __func__, slave ? slave->dev->name : "NULL");
-
-	if (!slave || !bond->send_unsol_na ||
-	    test_bit(__LINK_STATE_LINKWATCH_PENDING, &slave->dev->state))
-		return;
-
-	bond->send_unsol_na--;
-
-	idev = in6_dev_get(bond->dev);
-	if (!idev)
-		return;
-
-	is_router = !!idev->cnf.forwarding;
-
-	in6_dev_put(idev);
-
-	if (!ipv6_addr_any(&bond->master_ipv6))
-		bond_na_send(slave->dev, &bond->master_ipv6, is_router, 0);
-
-	list_for_each_entry(vlan, &bond->vlan_list, vlan_list) {
-		if (!ipv6_addr_any(&vlan->vlan_ipv6)) {
-			bond_na_send(slave->dev, &vlan->vlan_ipv6, is_router,
-				     vlan->vlan_id);
-		}
-	}
-}
-
-/*
- * bond_inet6addr_event: handle inet6addr notifier chain events.
- *
- * We keep track of device IPv6 addresses primarily to use as source
- * addresses in NS probes.
- *
- * We track one IPv6 for the main device (if it has one).
- */
-static int bond_inet6addr_event(struct notifier_block *this,
-				unsigned long event,
-				void *ptr)
-{
-	struct inet6_ifaddr *ifa = ptr;
-	struct net_device *vlan_dev, *event_dev = ifa->idev->dev;
-	struct bonding *bond;
-	struct vlan_entry *vlan;
-	struct bond_net *bn = net_generic(dev_net(event_dev), bond_net_id);
-
-	list_for_each_entry(bond, &bn->dev_list, bond_list) {
-		if (bond->dev == event_dev) {
-			switch (event) {
-			case NETDEV_UP:
-				if (ipv6_addr_any(&bond->master_ipv6))
-					ipv6_addr_copy(&bond->master_ipv6,
-						       &ifa->addr);
-				return NOTIFY_OK;
-			case NETDEV_DOWN:
-				if (ipv6_addr_equal(&bond->master_ipv6,
-						    &ifa->addr))
-					bond_glean_dev_ipv6(bond->dev,
-							    &bond->master_ipv6);
-				return NOTIFY_OK;
-			default:
-				return NOTIFY_DONE;
-			}
-		}
-
-		list_for_each_entry(vlan, &bond->vlan_list, vlan_list) {
-			if (!bond->vlgrp)
-				continue;
-			vlan_dev = vlan_group_get_device(bond->vlgrp,
-							 vlan->vlan_id);
-			if (vlan_dev == event_dev) {
-				switch (event) {
-				case NETDEV_UP:
-					if (ipv6_addr_any(&vlan->vlan_ipv6))
-						ipv6_addr_copy(&vlan->vlan_ipv6,
-							       &ifa->addr);
-					return NOTIFY_OK;
-				case NETDEV_DOWN:
-					if (ipv6_addr_equal(&vlan->vlan_ipv6,
-							    &ifa->addr))
-						bond_glean_dev_ipv6(vlan_dev,
-								    &vlan->vlan_ipv6);
-					return NOTIFY_OK;
-				default:
-					return NOTIFY_DONE;
-				}
-			}
-		}
-	}
-	return NOTIFY_DONE;
-}
-
-static struct notifier_block bond_inet6addr_notifier = {
-	.notifier_call = bond_inet6addr_event,
-};
-
-void bond_register_ipv6_notifier(void)
-{
-	register_inet6addr_notifier(&bond_inet6addr_notifier);
-}
-
-void bond_unregister_ipv6_notifier(void)
-{
-	unregister_inet6addr_notifier(&bond_inet6addr_notifier);
-}
-
diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index b51e021..5cd4766 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -89,8 +89,6 @@
 
 static int max_bonds	= BOND_DEFAULT_MAX_BONDS;
 static int tx_queues	= BOND_DEFAULT_TX_QUEUES;
-static int num_grat_arp = 1;
-static int num_unsol_na = 1;
 static int miimon	= BOND_LINK_MON_INTERV;
 static int updelay;
 static int downdelay;
@@ -113,10 +111,6 @@ module_param(max_bonds, int, 0);
 MODULE_PARM_DESC(max_bonds, "Max number of bonded devices");
 module_param(tx_queues, int, 0);
 MODULE_PARM_DESC(tx_queues, "Max number of transmit queues (default = 16)");
-module_param(num_grat_arp, int, 0644);
-MODULE_PARM_DESC(num_grat_arp, "Number of gratuitous ARP packets to send on failover event");
-module_param(num_unsol_na, int, 0644);
-MODULE_PARM_DESC(num_unsol_na, "Number of unsolicited IPv6 Neighbor Advertisements packets to send on failover event");
 module_param(miimon, int, 0);
 MODULE_PARM_DESC(miimon, "Link check interval in milliseconds");
 module_param(updelay, int, 0);
@@ -234,7 +228,6 @@ struct bond_parm_tbl ad_select_tbl[] = {
 
 /*-------------------------- Forward declarations ---------------------------*/
 
-static void bond_send_gratuitous_arp(struct bonding *bond);
 static int bond_init(struct net_device *bond_dev);
 static void bond_uninit(struct net_device *bond_dev);
 
@@ -1160,14 +1153,6 @@ void bond_change_active_slave(struct bonding *bond, struct slave *new_active)
 				bond_do_fail_over_mac(bond, new_active,
 						      old_active);
 
-			if (netif_running(bond->dev)) {
-				bond->send_grat_arp = bond->params.num_grat_arp;
-				bond_send_gratuitous_arp(bond);
-
-				bond->send_unsol_na = bond->params.num_unsol_na;
-				bond_send_unsolicited_na(bond);
-			}
-
 			write_unlock_bh(&bond->curr_slave_lock);
 			read_unlock(&bond->lock);
 
@@ -2578,18 +2563,6 @@ void bond_mii_monitor(struct work_struct *work)
 	if (bond->slave_cnt == 0)
 		goto re_arm;
 
-	if (bond->send_grat_arp) {
-		read_lock(&bond->curr_slave_lock);
-		bond_send_gratuitous_arp(bond);
-		read_unlock(&bond->curr_slave_lock);
-	}
-
-	if (bond->send_unsol_na) {
-		read_lock(&bond->curr_slave_lock);
-		bond_send_unsolicited_na(bond);
-		read_unlock(&bond->curr_slave_lock);
-	}
-
 	if (bond_miimon_inspect(bond)) {
 		read_unlock(&bond->lock);
 		rtnl_lock();
@@ -2751,44 +2724,6 @@ static void bond_arp_send_all(struct bonding *bond, struct slave *slave)
 	}
 }
 
-/*
- * Kick out a gratuitous ARP for an IP on the bonding master plus one
- * for each VLAN above us.
- *
- * Caller must hold curr_slave_lock for read or better
- */
-static void bond_send_gratuitous_arp(struct bonding *bond)
-{
-	struct slave *slave = bond->curr_active_slave;
-	struct vlan_entry *vlan;
-	struct net_device *vlan_dev;
-
-	pr_debug("bond_send_grat_arp: bond %s slave %s\n",
-		 bond->dev->name, slave ? slave->dev->name : "NULL");
-
-	if (!slave || !bond->send_grat_arp ||
-	    test_bit(__LINK_STATE_LINKWATCH_PENDING, &slave->dev->state))
-		return;
-
-	bond->send_grat_arp--;
-
-	if (bond->master_ip) {
-		bond_arp_send(slave->dev, ARPOP_REPLY, bond->master_ip,
-				bond->master_ip, 0);
-	}
-
-	if (!bond->vlgrp)
-		return;
-
-	list_for_each_entry(vlan, &bond->vlan_list, vlan_list) {
-		vlan_dev = vlan_group_get_device(bond->vlgrp, vlan->vlan_id);
-		if (vlan->vlan_ip) {
-			bond_arp_send(slave->dev, ARPOP_REPLY, vlan->vlan_ip,
-				      vlan->vlan_ip, vlan->vlan_id);
-		}
-	}
-}
-
 static void bond_validate_arp(struct bonding *bond, struct slave *slave, __be32 sip, __be32 tip)
 {
 	int i;
@@ -3255,18 +3190,6 @@ void bond_activebackup_arp_mon(struct work_struct *work)
 	if (bond->slave_cnt == 0)
 		goto re_arm;
 
-	if (bond->send_grat_arp) {
-		read_lock(&bond->curr_slave_lock);
-		bond_send_gratuitous_arp(bond);
-		read_unlock(&bond->curr_slave_lock);
-	}
-
-	if (bond->send_unsol_na) {
-		read_lock(&bond->curr_slave_lock);
-		bond_send_unsolicited_na(bond);
-		read_unlock(&bond->curr_slave_lock);
-	}
-
 	if (bond_ab_arp_inspect(bond, delta_in_ticks)) {
 		read_unlock(&bond->lock);
 		rtnl_lock();
@@ -3645,9 +3568,6 @@ static int bond_close(struct net_device *bond_dev)
 
 	write_lock_bh(&bond->lock);
 
-	bond->send_grat_arp = 0;
-	bond->send_unsol_na = 0;
-
 	/* signal timers not to re-arm */
 	bond->kill_timers = 1;
 
@@ -4724,18 +4644,6 @@ static int bond_check_params(struct bond_params *params)
 		use_carrier = 1;
 	}
 
-	if (num_grat_arp < 0 || num_grat_arp > 255) {
-		pr_warning("Warning: num_grat_arp (%d) not in range 0-255 so it was reset to 1\n",
-			   num_grat_arp);
-		num_grat_arp = 1;
-	}
-
-	if (num_unsol_na < 0 || num_unsol_na > 255) {
-		pr_warning("Warning: num_unsol_na (%d) not in range 0-255 so it was reset to 1\n",
-			   num_unsol_na);
-		num_unsol_na = 1;
-	}
-
 	/* reset values for 802.3ad */
 	if (bond_mode == BOND_MODE_8023AD) {
 		if (!miimon) {
@@ -4925,8 +4833,6 @@ static int bond_check_params(struct bond_params *params)
 	params->mode = bond_mode;
 	params->xmit_policy = xmit_hashtype;
 	params->miimon = miimon;
-	params->num_grat_arp = num_grat_arp;
-	params->num_unsol_na = num_unsol_na;
 	params->arp_interval = arp_interval;
 	params->arp_validate = arp_validate_value;
 	params->updelay = updelay;
@@ -5121,7 +5027,6 @@ static int __init bonding_init(void)
 
 	register_netdevice_notifier(&bond_netdev_notifier);
 	register_inetaddr_notifier(&bond_inetaddr_notifier);
-	bond_register_ipv6_notifier();
 out:
 	return res;
 err:
@@ -5136,7 +5041,6 @@ static void __exit bonding_exit(void)
 {
 	unregister_netdevice_notifier(&bond_netdev_notifier);
 	unregister_inetaddr_notifier(&bond_inetaddr_notifier);
-	bond_unregister_ipv6_notifier();
 
 	bond_destroy_sysfs();
 	bond_destroy_debugfs();
diff --git a/drivers/net/bonding/bond_sysfs.c b/drivers/net/bonding/bond_sysfs.c
index de87aea..259ff32 100644
--- a/drivers/net/bonding/bond_sysfs.c
+++ b/drivers/net/bonding/bond_sysfs.c
@@ -874,84 +874,6 @@ static DEVICE_ATTR(ad_select, S_IRUGO | S_IWUSR,
 		   bonding_show_ad_select, bonding_store_ad_select);
 
 /*
- * Show and set the number of grat ARP to send after a failover event.
- */
-static ssize_t bonding_show_n_grat_arp(struct device *d,
-				   struct device_attribute *attr,
-				   char *buf)
-{
-	struct bonding *bond = to_bond(d);
-
-	return sprintf(buf, "%d\n", bond->params.num_grat_arp);
-}
-
-static ssize_t bonding_store_n_grat_arp(struct device *d,
-				    struct device_attribute *attr,
-				    const char *buf, size_t count)
-{
-	int new_value, ret = count;
-	struct bonding *bond = to_bond(d);
-
-	if (sscanf(buf, "%d", &new_value) != 1) {
-		pr_err("%s: no num_grat_arp value specified.\n",
-		       bond->dev->name);
-		ret = -EINVAL;
-		goto out;
-	}
-	if (new_value < 0 || new_value > 255) {
-		pr_err("%s: Invalid num_grat_arp value %d not in range 0-255; rejected.\n",
-		       bond->dev->name, new_value);
-		ret = -EINVAL;
-		goto out;
-	} else {
-		bond->params.num_grat_arp = new_value;
-	}
-out:
-	return ret;
-}
-static DEVICE_ATTR(num_grat_arp, S_IRUGO | S_IWUSR,
-		   bonding_show_n_grat_arp, bonding_store_n_grat_arp);
-
-/*
- * Show and set the number of unsolicited NA's to send after a failover event.
- */
-static ssize_t bonding_show_n_unsol_na(struct device *d,
-				       struct device_attribute *attr,
-				       char *buf)
-{
-	struct bonding *bond = to_bond(d);
-
-	return sprintf(buf, "%d\n", bond->params.num_unsol_na);
-}
-
-static ssize_t bonding_store_n_unsol_na(struct device *d,
-					struct device_attribute *attr,
-					const char *buf, size_t count)
-{
-	int new_value, ret = count;
-	struct bonding *bond = to_bond(d);
-
-	if (sscanf(buf, "%d", &new_value) != 1) {
-		pr_err("%s: no num_unsol_na value specified.\n",
-		       bond->dev->name);
-		ret = -EINVAL;
-		goto out;
-	}
-
-	if (new_value < 0 || new_value > 255) {
-		pr_err("%s: Invalid num_unsol_na value %d not in range 0-255; rejected.\n",
-		       bond->dev->name, new_value);
-		ret = -EINVAL;
-		goto out;
-	} else
-		bond->params.num_unsol_na = new_value;
-out:
-	return ret;
-}
-static DEVICE_ATTR(num_unsol_na, S_IRUGO | S_IWUSR,
-		   bonding_show_n_unsol_na, bonding_store_n_unsol_na);
-
-/*
  * Show and set the MII monitor interval.  There are two tricky bits
  * here.  First, if MII monitoring is activated, then we must disable
  * ARP monitoring.  Second, if the timer isn't running, we must
@@ -1650,8 +1572,6 @@ static struct attribute *per_bond_attrs[] = {
 	&dev_attr_lacp_rate.attr,
 	&dev_attr_ad_select.attr,
 	&dev_attr_xmit_hash_policy.attr,
-	&dev_attr_num_grat_arp.attr,
-	&dev_attr_num_unsol_na.attr,
 	&dev_attr_miimon.attr,
 	&dev_attr_primary.attr,
 	&dev_attr_primary_reselect.attr,
diff --git a/drivers/net/bonding/bonding.h b/drivers/net/bonding/bonding.h
index 90736cb..77180b1 100644
--- a/drivers/net/bonding/bonding.h
+++ b/drivers/net/bonding/bonding.h
@@ -149,8 +149,6 @@ struct bond_params {
 	int mode;
 	int xmit_policy;
 	int miimon;
-	int num_grat_arp;
-	int num_unsol_na;
 	int arp_interval;
 	int arp_validate;
 	int use_carrier;
@@ -178,9 +176,6 @@ struct vlan_entry {
 	struct list_head vlan_list;
 	__be32 vlan_ip;
 	unsigned short vlan_id;
-#if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE)
-	struct in6_addr vlan_ipv6;
-#endif
 };
 
 struct slave {
@@ -234,8 +229,6 @@ struct bonding {
 	rwlock_t lock;
 	rwlock_t curr_slave_lock;
 	s8       kill_timers;
-	s8	 send_grat_arp;
-	s8	 send_unsol_na;
 	s8	 setup_by_slave;
 	s8       igmp_retrans;
 #ifdef CONFIG_PROC_FS
@@ -260,9 +253,6 @@ struct bonding {
 	struct   delayed_work alb_work;
 	struct   delayed_work ad_work;
 	struct   delayed_work mcast_work;
-#if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE)
-	struct   in6_addr master_ipv6;
-#endif
 #ifdef CONFIG_DEBUG_FS
 	/* debugging suport via debugfs */
 	struct	 dentry *debug_dir;
@@ -459,23 +449,4 @@ extern const struct bond_parm_tbl fail_over_mac_tbl[];
 extern const struct bond_parm_tbl pri_reselect_tbl[];
 extern struct bond_parm_tbl ad_select_tbl[];
 
-#if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE)
-void bond_send_unsolicited_na(struct bonding *bond);
-void bond_register_ipv6_notifier(void);
-void bond_unregister_ipv6_notifier(void);
-#else
-static inline void bond_send_unsolicited_na(struct bonding *bond)
-{
-	return;
-}
-static inline void bond_register_ipv6_notifier(void)
-{
-	return;
-}
-static inline void bond_unregister_ipv6_notifier(void)
-{
-	return;
-}
-#endif
-
 #endif /* _LINUX_BONDING_H */
diff --git a/net/8021q/vlan.c b/net/8021q/vlan.c
index b2ff70f..969e700 100644
--- a/net/8021q/vlan.c
+++ b/net/8021q/vlan.c
@@ -501,13 +501,14 @@ static int vlan_device_event(struct notifier_block *unused, unsigned long event,
 		return NOTIFY_BAD;
 
 	case NETDEV_NOTIFY_PEERS:
+	case NETDEV_BONDING_FAILOVER:
 		/* Propagate to vlan devices */
 		for (i = 0; i < VLAN_N_VID; i++) {
 			vlandev = vlan_group_get_device(grp, i);
 			if (!vlandev)
 				continue;
 
-			call_netdevice_notifiers(NETDEV_NOTIFY_PEERS, vlandev);
+			call_netdevice_notifiers(event, vlandev);
 		}
 		break;
 	}
diff --git a/net/ipv4/devinet.c b/net/ipv4/devinet.c
index 5345b0b..acf553f 100644
--- a/net/ipv4/devinet.c
+++ b/net/ipv4/devinet.c
@@ -1203,6 +1203,7 @@ static int inetdev_event(struct notifier_block *this, unsigned long event,
 			break;
 		/* fall through */
 	case NETDEV_NOTIFY_PEERS:
+	case NETDEV_BONDING_FAILOVER:
 		/* Send gratuitous ARP to notify of link change */
 		inetdev_send_gratuitous_arp(dev, in_dev);
 		break;
diff --git a/net/ipv6/ndisc.c b/net/ipv6/ndisc.c
index a51fa74c..6f7d491 100644
--- a/net/ipv6/ndisc.c
+++ b/net/ipv6/ndisc.c
@@ -1746,6 +1746,7 @@ static int ndisc_netdev_event(struct notifier_block *this, unsigned long event,
 		fib6_run_gc(~0UL, net);
 		break;
 	case NETDEV_NOTIFY_PEERS:
+	case NETDEV_BONDING_FAILOVER:
 		ndisc_send_unsol_na(dev);
 		break;
 	default:
-- 
1.7.4


-- 
Ben Hutchings, Senior Software Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.


^ permalink raw reply related

* Steps to integrate new 40G Network driver to the kernel tree
From: Joyce Yu - System Software @ 2011-04-15 23:51 UTC (permalink / raw)
  To: netdev

Hello,

We have a new 40G network Linux driver. What are the steps to integrate  
it to the kernel tree?

Thanks,
Joyce


^ permalink raw reply

* Re: [PATCH 1/3] irq: Add registered affinity guidance infrastructure
From: Thomas Gleixner @ 2011-04-16  0:22 UTC (permalink / raw)
  To: Neil Horman
  Cc: netdev, davem, nhorman, Dimitris Michailidis, David Howells,
	Eric Dumazet, Tom Herbert
In-Reply-To: <1302898677-3833-2-git-send-email-nhorman@tuxdriver.com>

On Fri, 15 Apr 2011, Neil Horman wrote:

> From: nhorman <nhorman@devel2.think-freely.org>
> 
> This patch adds the needed data to the irq_desc struct, as well as the needed
> API calls to allow the requester of an irq to register a handler function to
> determine the affinity_hint of that irq when queried from user space.

This changelog simply sucks. It does not explain the rationale for
this churn at all.

Which problem is it solving?
Why are the current interfaces not sufficient?
....

> +#ifdef CONFIG_AFFINITY_UPDATE
> +extern int setup_affinity_data(int irq, irq_affinity_init_t, void *);

yuck, irq_affinity_init_t ???  

> +#ifdef CONFIG_AFFINITY_UPDATE
> +static inline int __must_check
> +request_affinity_irq(unsigned int irq, irq_handler_t handler,
> +		     irq_handler_t thread_fn,
> +		     unsigned long flags, const char *name, void *dev,
> +		     irq_affinity_init_t af_init, void *af_priv)

So next time we make a wrapper around request_affinity_irq() which
takes another 3 arguments?

> +{
> +	int rc;
> +
> +	rc = request_threaded_irq(irq, handler, thread_fn, flags, name, dev);
> +	if (rc)
> +		goto out;

Brilliant use case for a goto. _NOT_

> +	if (af_init)
> +		rc = setup_affinity_data(irq, af_init, af_priv);
> +	if (rc)
> +		free_irq(irq, dev);
> +
> +out:
> +	return rc;
> +}
> +#else
> +#define request_affinity_irq(irq, hnd, tfn, flg, nm, dev, init, priv) \
> +	request_threaded_irq(irq, hnd, NULL, flg, nm, dev)

Oh nice. tfn becomes magically NULL if that magic CONFIG switch is not
set.
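For illustration only, here is a user-space sketch of the two points above: the wrapper can return early instead of jumping to a label that does nothing but return, and a fallback for the disabled case can forward the thread function rather than dropping it. Every `demo_` and `stub_` name is a hypothetical stand-in so the sketch builds outside the kernel; none of it is kernel API or what the patch author wrote.

```c
#include <stddef.h>

typedef int (*demo_handler_t)(int irq, void *dev);
typedef int (*demo_affinity_init_t)(int irq, void *priv);

/* Hypothetical bookkeeping so the stubs below can be observed. */
static demo_handler_t demo_last_thread_fn;
static int demo_irq_requested;

/* Stand-in for request_threaded_irq(): records its arguments, always succeeds. */
static int stub_request_threaded_irq(int irq, demo_handler_t handler,
				     demo_handler_t thread_fn, void *dev)
{
	(void)irq; (void)handler; (void)dev;
	demo_last_thread_fn = thread_fn;
	demo_irq_requested = 1;
	return 0;
}

/* Stand-in for free_irq(). */
static void stub_free_irq(int irq, void *dev)
{
	(void)irq; (void)dev;
	demo_irq_requested = 0;
}

/* Early returns replace the "goto out" that led only to "return rc". */
static int demo_request_affinity_irq(int irq, demo_handler_t handler,
				     demo_handler_t thread_fn, void *dev,
				     demo_affinity_init_t af_init, void *af_priv)
{
	int rc = stub_request_threaded_irq(irq, handler, thread_fn, dev);

	if (rc || !af_init)
		return rc;

	rc = af_init(irq, af_priv);
	if (rc)
		stub_free_irq(irq, dev);
	return rc;
}

/* Fallback for the disabled case: thread_fn is forwarded, not replaced by NULL. */
static int demo_request_affinity_irq_disabled(int irq, demo_handler_t handler,
					      demo_handler_t thread_fn, void *dev,
					      demo_affinity_init_t af_init,
					      void *af_priv)
{
	(void)af_init; (void)af_priv;
	return stub_request_threaded_irq(irq, handler, thread_fn, dev);
}

/* Trivial handler and a failing init hook, used to exercise both paths. */
static int demo_handler(int irq, void *dev) { (void)irq; (void)dev; return 0; }
static int demo_init_fail(int irq, void *priv) { (void)irq; (void)priv; return -1; }
```

Writing the fallback as a static inline rather than a macro also keeps type checking on all eight arguments.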

  
>  struct irq_desc;
>  struct irq_data;
> +struct affin_data {

Gah. Do you think that I went to major pain to consolidate the irq
namespace just to accept another random one?

Also that's completely undocumented. Hint: 

# grep -C1 "/\*\*" $this_file

> +	void *priv;
> +	char *affinity_alg;

const perhaps ?

> +	void (*affin_update)(int irq, struct affin_data *ad);
> +	void (*affin_cleanup)(int irq, struct affin_data *ad);
> +};
> +
> +typedef int (*irq_affinity_init_t)(int, struct affin_data*, void *);

Whee. Why do you want a typedef for that ?

> --- a/kernel/irq/Kconfig
> +++ b/kernel/irq/Kconfig
> @@ -51,6 +51,17 @@ config IRQ_PREFLOW_FASTEOI
>  config IRQ_FORCED_THREADING
>         bool
>  
> +config AFFINITY_UPDATE
> +	bool "Support irq affinity direction"
> +	depends on GENERIC_HARDIRQS

Right. We need a dependency for something which is inside of a guarded
section which selects GENERIC_HARDIRQS.

> +	---help---
> +
> +	Affinity updating adds the ability for requestors of irqs to
> +	register affinity update methods against the irq in question
> +	in so doing the requestor can be informed every time user space
> +	queries an irq for its optimal affinity, giving the requstor the
> +	chance to tell user space where the irq can be optimally handled

-ENOPARSE. I still do not understand what you are trying to solve.

> @@ -64,6 +75,5 @@ config SPARSE_IRQ
>  	    out the interrupt descriptors in a more NUMA-friendly way. )
>  
>  	  If you don't know what to do here, say N.
> -

Unrelated

>  endmenu
>  endif
> diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
> index acd599a..257ea4d 100644
> --- a/kernel/irq/manage.c
> +++ b/kernel/irq/manage.c
> @@ -1159,6 +1159,17 @@ static struct irqaction *__free_irq(unsigned int irq, void *dev_id)
>  
>  	unregister_handler_proc(irq, action);
>  
> +#ifdef CONFIG_AFFINITY_UPDATE
> +	/*
> +	 * Have to do this after we unregister proc accessors
> +	 */
> +	if (desc->af_data) {
> +		if (desc->af_data->affin_cleanup)
> +			desc->af_data->affin_cleanup(irq, desc->af_data);
> +		kfree(desc->af_data);
> +		desc->af_data = NULL;
> +	}
> +#endif

Grr. Aside from the fact that I think that whole thing is silly and
overengineered, please move this out of line and keep your fricking
#ifdef mess out of the main code.

>  	/* Make sure it's not being used on another CPU: */
>  	synchronize_irq(irq);
>  
> @@ -1345,6 +1356,34 @@ int request_threaded_irq(unsigned int irq, irq_handler_t handler,
>  }
>  EXPORT_SYMBOL(request_threaded_irq);
>  
> +#ifdef CONFIG_AFFINITY_UPDATE
> +int setup_affinity_data(int irq, irq_affinity_init_t af_init, void *af_priv)

That interface is completely wrong for various reasons:

1) Namespace violation: irq_....

2) This wants to be separated into an allocation and a setter function

> +{
> +	struct affin_data *data;
> +	struct irq_desc *desc;
> +	int rc;
> +
> +	desc = irq_to_desc(irq);
> +	if (!desc)
> +		return -ENOENT;
> +
> +	data = kzalloc(sizeof(struct affin_data), GFP_KERNEL);
> +	if (!data)
> +		return -ENOMEM;
> +
> +	rc = af_init(irq, data, af_priv);
> +	if (rc) {
> +		kfree(data);
> +		return rc;
> +	}
> +
> +	desc->af_data = data;

Right, we do this unlocked of course.

> +	return 0;
> +}
> +EXPORT_SYMBOL(setup_affinity_data);

No. That wants to be EXPORT_SYMBOL_GPL if at all.
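As a rough user-space sketch of the split asked for above (separate allocation and setter, irq_ prefix, pointer published under the descriptor lock), assuming nothing about how the real code would look: all `demo_` names are hypothetical, a C11 `atomic_flag` spin loop stands in for `desc->lock`, and the -EBUSY return for an already-populated descriptor is an assumption made for the example.

```c
#include <errno.h>
#include <stdatomic.h>
#include <stdlib.h>

struct demo_affinity_data {
	void *priv;
	const char *alg_name;
};

struct demo_irq_desc {
	atomic_flag lock;			/* stand-in for desc->lock */
	struct demo_affinity_data *af_data;
};

#define DEMO_NR_IRQS 2
static struct demo_irq_desc demo_descs[DEMO_NR_IRQS] = {
	{ ATOMIC_FLAG_INIT, NULL },
	{ ATOMIC_FLAG_INIT, NULL },
};

/* Minimal spinlock built on atomic_flag; illustration only. */
static void demo_lock(atomic_flag *l)
{
	while (atomic_flag_test_and_set_explicit(l, memory_order_acquire))
		;
}

static void demo_unlock(atomic_flag *l)
{
	atomic_flag_clear_explicit(l, memory_order_release);
}

/* Stand-in for irq_to_desc(): NULL for an unknown irq number. */
static struct demo_irq_desc *demo_irq_to_desc(unsigned int irq)
{
	return irq < DEMO_NR_IRQS ? &demo_descs[irq] : NULL;
}

/* Step 1: allocation only; no descriptor is touched here. */
static struct demo_affinity_data *
demo_irq_alloc_affinity_data(void *priv, const char *alg_name)
{
	struct demo_affinity_data *data = calloc(1, sizeof(*data));

	if (!data)
		return NULL;
	data->priv = priv;
	data->alg_name = alg_name;
	return data;
}

/* Step 2: setter; the pointer is published while the lock is held. */
static int demo_irq_set_affinity_data(unsigned int irq,
				      struct demo_affinity_data *data)
{
	struct demo_irq_desc *desc = demo_irq_to_desc(irq);
	int rc = 0;

	if (!desc)
		return -ENOENT;

	demo_lock(&desc->lock);
	if (desc->af_data)
		rc = -EBUSY;	/* assumption: refuse to overwrite */
	else
		desc->af_data = data;
	demo_unlock(&desc->lock);
	return rc;
}
```

With this shape the caller owns the object until the setter succeeds, so a failed setter leaves nothing half-installed in the descriptor.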

> --- a/kernel/irq/proc.c
> +++ b/kernel/irq/proc.c
> @@ -42,6 +42,11 @@ static int irq_affinity_hint_proc_show(struct seq_file *m, void *v)
>  	if (!zalloc_cpumask_var(&mask, GFP_KERNEL))
>  		return -ENOMEM;
>  
> +#ifdef CONFIG_AFFINITY_UPDATE
> +	if (desc->af_data && desc->af_data->affin_update)
> +		desc->af_data->affin_update((long)m->private, desc->af_data);
> +#endif
> +

Yikes. How the hell is this related to the changelog and to the scope
of this function? 

This function shows the hint we agreed on and nothing else. We do not
call magic crap via proc.

Locking is not your favourite topic, right ?

>  	raw_spin_lock_irqsave(&desc->lock, flags);
>  	if (desc->affinity_hint)
>  		cpumask_copy(mask, desc->affinity_hint);
> @@ -54,6 +59,19 @@ static int irq_affinity_hint_proc_show(struct seq_file *m, void *v)
>  	return 0;
>  }
>  
> +static int irq_affinity_alg_proc_show(struct seq_file *m, void *v)
> +{
> +	char *alg = "none";
> +#ifdef CONFIG_AFFINITY_UPDATE
> +	struct irq_desc *desc = irq_to_desc((long)m->private);
> +
> +	if (desc->af_data->affinity_alg)
> +		alg = desc->af_data->affinity_alg;
> +#endif

Nice, we add the policy concept to the kernel another time. No, we
don't want policies in the kernel unless there is some reasonable
explanation.

Thanks,

	tglx

^ permalink raw reply

* Re: [PATCH net-next-2.6 3/3] bonding,ipv4,ipv6,vlan: Handle NETDEV_BONDING_FAILOVER like NETDEV_NOTIFY_PEERS
From: Jay Vosburgh @ 2011-04-16  0:30 UTC (permalink / raw)
  To: Ben Hutchings
  Cc: David Miller, Andy Gospodarek, Patrick McHardy, netdev,
	Brian Haley
In-Reply-To: <1302911271.2845.41.camel@bwh-desktop>

Ben Hutchings <bhutchings@solarflare.com> wrote:

>It is undesirable for the bonding driver to be poking into higher
>level protocols, and notifiers provide a way to avoid that.  This does
>mean removing the ability to configure repetition of gratuitous ARPs
>and unsolicited NAs.

	In principle I think this is a good thing (getting rid of some
of those dependencies, duplicated code, etc).

	However, the removal of the multiple grat ARP and NAs may be an
issue for some users.  I don't know that we can just remove this (along
with its API) without going through the feature removal process.

	As I recall, the multiple gratuitous ARP stuff was added for
Infiniband, because it is dependent on the grat ARP for a smooth
failover.

	There is also currently logic to check the linkwatch link state
to wait for the link to go up prior to sending a grat ARP; this is also
for IB.
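	For readers skimming the quoted patch below, a small user-space model of the repeat logic being removed may help. It is only a sketch: the `demo_` names are made up, `link_pending` stands in for the __LINK_STATE_LINKWATCH_PENDING test, and "sending" just bumps a counter.

```c
#include <stdbool.h>

struct demo_bond {
	int num_grat_arp;	/* configured repeat count (0-255) */
	int send_grat_arp;	/* announcements still owed */
	bool link_pending;	/* stand-in for the linkwatch-pending bit */
	int sent;		/* how many announcements were "sent" */
};

/* Send one announcement if any are owed and the link state has settled. */
static void demo_send_gratuitous_arp(struct demo_bond *bond)
{
	if (!bond->send_grat_arp || bond->link_pending)
		return;

	bond->send_grat_arp--;
	bond->sent++;
}

/* On failover: arm the counter and try to send the first one right away. */
static void demo_failover(struct demo_bond *bond)
{
	bond->send_grat_arp = bond->num_grat_arp;
	demo_send_gratuitous_arp(bond);
}

/* Each monitor pass retries until the counter reaches zero. */
static void demo_monitor_tick(struct demo_bond *bond)
{
	demo_send_gratuitous_arp(bond);
}
```

	The point of the model is that a pending link defers the announcement without consuming a repeat, which a single notifier call at failover time does not reproduce.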

	Brian Haley added the unsolicited NAs; I've added him to the cc
so perhaps he (or somebody else) can comment on the necessity of keeping
the ability to send multiple NAs.

	-J

>Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
>---
> drivers/net/bonding/Makefile     |    3 -
> drivers/net/bonding/bond_ipv6.c  |  225 --------------------------------------
> drivers/net/bonding/bond_main.c  |   96 ----------------
> drivers/net/bonding/bond_sysfs.c |   80 --------------
> drivers/net/bonding/bonding.h    |   29 -----
> net/8021q/vlan.c                 |    3 +-
> net/ipv4/devinet.c               |    1 +
> net/ipv6/ndisc.c                 |    1 +
> 8 files changed, 4 insertions(+), 434 deletions(-)
> delete mode 100644 drivers/net/bonding/bond_ipv6.c
>
>diff --git a/drivers/net/bonding/Makefile b/drivers/net/bonding/Makefile
>index 3c5c014..4c21bf6 100644
>--- a/drivers/net/bonding/Makefile
>+++ b/drivers/net/bonding/Makefile
>@@ -9,6 +9,3 @@ bonding-objs := bond_main.o bond_3ad.o bond_alb.o bond_sysfs.o bond_debugfs.o
> proc-$(CONFIG_PROC_FS) += bond_procfs.o
> bonding-objs += $(proc-y)
>
>-ipv6-$(subst m,y,$(CONFIG_IPV6)) += bond_ipv6.o
>-bonding-objs += $(ipv6-y)
>-
>diff --git a/drivers/net/bonding/bond_ipv6.c b/drivers/net/bonding/bond_ipv6.c
>deleted file mode 100644
>index 84fbd4e..0000000
>--- a/drivers/net/bonding/bond_ipv6.c
>+++ /dev/null
>@@ -1,225 +0,0 @@
>-/*
>- * Copyright(c) 2008 Hewlett-Packard Development Company, L.P.
>- *
>- * This program is free software; you can redistribute it and/or modify it
>- * under the terms of the GNU General Public License as published by the
>- * Free Software Foundation; either version 2 of the License, or
>- * (at your option) any later version.
>- *
>- * This program is distributed in the hope that it will be useful, but
>- * WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
>- * or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
>- * for more details.
>- *
>- * You should have received a copy of the GNU General Public License along
>- * with this program; if not, write to the Free Software Foundation, Inc.,
>- * 59 Temple Place - Suite 330, Boston, MA  02111-1307, USA.
>- *
>- * The full GNU General Public License is included in this distribution in the
>- * file called LICENSE.
>- *
>- */
>-
>-#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
>-
>-#include <linux/types.h>
>-#include <linux/if_vlan.h>
>-#include <net/ipv6.h>
>-#include <net/ndisc.h>
>-#include <net/addrconf.h>
>-#include <net/netns/generic.h>
>-#include "bonding.h"
>-
>-/*
>- * Assign bond->master_ipv6 to the next IPv6 address in the list, or
>- * zero it out if there are none.
>- */
>-static void bond_glean_dev_ipv6(struct net_device *dev, struct in6_addr *addr)
>-{
>-	struct inet6_dev *idev;
>-
>-	if (!dev)
>-		return;
>-
>-	idev = in6_dev_get(dev);
>-	if (!idev)
>-		return;
>-
>-	read_lock_bh(&idev->lock);
>-	if (!list_empty(&idev->addr_list)) {
>-		struct inet6_ifaddr *ifa
>-			= list_first_entry(&idev->addr_list,
>-					   struct inet6_ifaddr, if_list);
>-		ipv6_addr_copy(addr, &ifa->addr);
>-	} else
>-		ipv6_addr_set(addr, 0, 0, 0, 0);
>-
>-	read_unlock_bh(&idev->lock);
>-
>-	in6_dev_put(idev);
>-}
>-
>-static void bond_na_send(struct net_device *slave_dev,
>-			 struct in6_addr *daddr,
>-			 int router,
>-			 unsigned short vlan_id)
>-{
>-	struct in6_addr mcaddr;
>-	struct icmp6hdr icmp6h = {
>-		.icmp6_type = NDISC_NEIGHBOUR_ADVERTISEMENT,
>-	};
>-	struct sk_buff *skb;
>-
>-	icmp6h.icmp6_router = router;
>-	icmp6h.icmp6_solicited = 0;
>-	icmp6h.icmp6_override = 1;
>-
>-	addrconf_addr_solict_mult(daddr, &mcaddr);
>-
>-	pr_debug("ipv6 na on slave %s: dest %pI6, src %pI6\n",
>-		 slave_dev->name, &mcaddr, daddr);
>-
>-	skb = ndisc_build_skb(slave_dev, &mcaddr, daddr, &icmp6h, daddr,
>-			      ND_OPT_TARGET_LL_ADDR);
>-
>-	if (!skb) {
>-		pr_err("NA packet allocation failed\n");
>-		return;
>-	}
>-
>-	if (vlan_id) {
>-		/* The Ethernet header is not present yet, so it is
>-		 * too early to insert a VLAN tag.  Force use of an
>-		 * out-of-line tag here and let dev_hard_start_xmit()
>-		 * insert it if the slave hardware can't.
>-		 */
>-		skb = __vlan_hwaccel_put_tag(skb, vlan_id);
>-		if (!skb) {
>-			pr_err("failed to insert VLAN tag\n");
>-			return;
>-		}
>-	}
>-
>-	ndisc_send_skb(skb, slave_dev, NULL, &mcaddr, daddr, &icmp6h);
>-}
>-
>-/*
>- * Kick out an unsolicited Neighbor Advertisement for an IPv6 address on
>- * the bonding master.  This will help the switch learn our address
>- * if in active-backup mode.
>- *
>- * Caller must hold curr_slave_lock for read or better
>- */
>-void bond_send_unsolicited_na(struct bonding *bond)
>-{
>-	struct slave *slave = bond->curr_active_slave;
>-	struct vlan_entry *vlan;
>-	struct inet6_dev *idev;
>-	int is_router;
>-
>-	pr_debug("%s: bond %s slave %s\n", bond->dev->name,
>-		 __func__, slave ? slave->dev->name : "NULL");
>-
>-	if (!slave || !bond->send_unsol_na ||
>-	    test_bit(__LINK_STATE_LINKWATCH_PENDING, &slave->dev->state))
>-		return;
>-
>-	bond->send_unsol_na--;
>-
>-	idev = in6_dev_get(bond->dev);
>-	if (!idev)
>-		return;
>-
>-	is_router = !!idev->cnf.forwarding;
>-
>-	in6_dev_put(idev);
>-
>-	if (!ipv6_addr_any(&bond->master_ipv6))
>-		bond_na_send(slave->dev, &bond->master_ipv6, is_router, 0);
>-
>-	list_for_each_entry(vlan, &bond->vlan_list, vlan_list) {
>-		if (!ipv6_addr_any(&vlan->vlan_ipv6)) {
>-			bond_na_send(slave->dev, &vlan->vlan_ipv6, is_router,
>-				     vlan->vlan_id);
>-		}
>-	}
>-}
>-
>-/*
>- * bond_inet6addr_event: handle inet6addr notifier chain events.
>- *
>- * We keep track of device IPv6 addresses primarily to use as source
>- * addresses in NS probes.
>- *
>- * We track one IPv6 for the main device (if it has one).
>- */
>-static int bond_inet6addr_event(struct notifier_block *this,
>-				unsigned long event,
>-				void *ptr)
>-{
>-	struct inet6_ifaddr *ifa = ptr;
>-	struct net_device *vlan_dev, *event_dev = ifa->idev->dev;
>-	struct bonding *bond;
>-	struct vlan_entry *vlan;
>-	struct bond_net *bn = net_generic(dev_net(event_dev), bond_net_id);
>-
>-	list_for_each_entry(bond, &bn->dev_list, bond_list) {
>-		if (bond->dev == event_dev) {
>-			switch (event) {
>-			case NETDEV_UP:
>-				if (ipv6_addr_any(&bond->master_ipv6))
>-					ipv6_addr_copy(&bond->master_ipv6,
>-						       &ifa->addr);
>-				return NOTIFY_OK;
>-			case NETDEV_DOWN:
>-				if (ipv6_addr_equal(&bond->master_ipv6,
>-						    &ifa->addr))
>-					bond_glean_dev_ipv6(bond->dev,
>-							    &bond->master_ipv6);
>-				return NOTIFY_OK;
>-			default:
>-				return NOTIFY_DONE;
>-			}
>-		}
>-
>-		list_for_each_entry(vlan, &bond->vlan_list, vlan_list) {
>-			if (!bond->vlgrp)
>-				continue;
>-			vlan_dev = vlan_group_get_device(bond->vlgrp,
>-							 vlan->vlan_id);
>-			if (vlan_dev == event_dev) {
>-				switch (event) {
>-				case NETDEV_UP:
>-					if (ipv6_addr_any(&vlan->vlan_ipv6))
>-						ipv6_addr_copy(&vlan->vlan_ipv6,
>-							       &ifa->addr);
>-					return NOTIFY_OK;
>-				case NETDEV_DOWN:
>-					if (ipv6_addr_equal(&vlan->vlan_ipv6,
>-							    &ifa->addr))
>-						bond_glean_dev_ipv6(vlan_dev,
>-								    &vlan->vlan_ipv6);
>-					return NOTIFY_OK;
>-				default:
>-					return NOTIFY_DONE;
>-				}
>-			}
>-		}
>-	}
>-	return NOTIFY_DONE;
>-}
>-
>-static struct notifier_block bond_inet6addr_notifier = {
>-	.notifier_call = bond_inet6addr_event,
>-};
>-
>-void bond_register_ipv6_notifier(void)
>-{
>-	register_inet6addr_notifier(&bond_inet6addr_notifier);
>-}
>-
>-void bond_unregister_ipv6_notifier(void)
>-{
>-	unregister_inet6addr_notifier(&bond_inet6addr_notifier);
>-}
>-
>diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
>index b51e021..5cd4766 100644
>--- a/drivers/net/bonding/bond_main.c
>+++ b/drivers/net/bonding/bond_main.c
>@@ -89,8 +89,6 @@
>
> static int max_bonds	= BOND_DEFAULT_MAX_BONDS;
> static int tx_queues	= BOND_DEFAULT_TX_QUEUES;
>-static int num_grat_arp = 1;
>-static int num_unsol_na = 1;
> static int miimon	= BOND_LINK_MON_INTERV;
> static int updelay;
> static int downdelay;
>@@ -113,10 +111,6 @@ module_param(max_bonds, int, 0);
> MODULE_PARM_DESC(max_bonds, "Max number of bonded devices");
> module_param(tx_queues, int, 0);
> MODULE_PARM_DESC(tx_queues, "Max number of transmit queues (default = 16)");
>-module_param(num_grat_arp, int, 0644);
>-MODULE_PARM_DESC(num_grat_arp, "Number of gratuitous ARP packets to send on failover event");
>-module_param(num_unsol_na, int, 0644);
>-MODULE_PARM_DESC(num_unsol_na, "Number of unsolicited IPv6 Neighbor Advertisements packets to send on failover event");
> module_param(miimon, int, 0);
> MODULE_PARM_DESC(miimon, "Link check interval in milliseconds");
> module_param(updelay, int, 0);
>@@ -234,7 +228,6 @@ struct bond_parm_tbl ad_select_tbl[] = {
>
> /*-------------------------- Forward declarations ---------------------------*/
>
>-static void bond_send_gratuitous_arp(struct bonding *bond);
> static int bond_init(struct net_device *bond_dev);
> static void bond_uninit(struct net_device *bond_dev);
>
>@@ -1160,14 +1153,6 @@ void bond_change_active_slave(struct bonding *bond, struct slave *new_active)
> 				bond_do_fail_over_mac(bond, new_active,
> 						      old_active);
>
>-			if (netif_running(bond->dev)) {
>-				bond->send_grat_arp = bond->params.num_grat_arp;
>-				bond_send_gratuitous_arp(bond);
>-
>-				bond->send_unsol_na = bond->params.num_unsol_na;
>-				bond_send_unsolicited_na(bond);
>-			}
>-
> 			write_unlock_bh(&bond->curr_slave_lock);
> 			read_unlock(&bond->lock);
>
>@@ -2578,18 +2563,6 @@ void bond_mii_monitor(struct work_struct *work)
> 	if (bond->slave_cnt == 0)
> 		goto re_arm;
>
>-	if (bond->send_grat_arp) {
>-		read_lock(&bond->curr_slave_lock);
>-		bond_send_gratuitous_arp(bond);
>-		read_unlock(&bond->curr_slave_lock);
>-	}
>-
>-	if (bond->send_unsol_na) {
>-		read_lock(&bond->curr_slave_lock);
>-		bond_send_unsolicited_na(bond);
>-		read_unlock(&bond->curr_slave_lock);
>-	}
>-
> 	if (bond_miimon_inspect(bond)) {
> 		read_unlock(&bond->lock);
> 		rtnl_lock();
>@@ -2751,44 +2724,6 @@ static void bond_arp_send_all(struct bonding *bond, struct slave *slave)
> 	}
> }
>
>-/*
>- * Kick out a gratuitous ARP for an IP on the bonding master plus one
>- * for each VLAN above us.
>- *
>- * Caller must hold curr_slave_lock for read or better
>- */
>-static void bond_send_gratuitous_arp(struct bonding *bond)
>-{
>-	struct slave *slave = bond->curr_active_slave;
>-	struct vlan_entry *vlan;
>-	struct net_device *vlan_dev;
>-
>-	pr_debug("bond_send_grat_arp: bond %s slave %s\n",
>-		 bond->dev->name, slave ? slave->dev->name : "NULL");
>-
>-	if (!slave || !bond->send_grat_arp ||
>-	    test_bit(__LINK_STATE_LINKWATCH_PENDING, &slave->dev->state))
>-		return;
>-
>-	bond->send_grat_arp--;
>-
>-	if (bond->master_ip) {
>-		bond_arp_send(slave->dev, ARPOP_REPLY, bond->master_ip,
>-				bond->master_ip, 0);
>-	}
>-
>-	if (!bond->vlgrp)
>-		return;
>-
>-	list_for_each_entry(vlan, &bond->vlan_list, vlan_list) {
>-		vlan_dev = vlan_group_get_device(bond->vlgrp, vlan->vlan_id);
>-		if (vlan->vlan_ip) {
>-			bond_arp_send(slave->dev, ARPOP_REPLY, vlan->vlan_ip,
>-				      vlan->vlan_ip, vlan->vlan_id);
>-		}
>-	}
>-}
>-
> static void bond_validate_arp(struct bonding *bond, struct slave *slave, __be32 sip, __be32 tip)
> {
> 	int i;
>@@ -3255,18 +3190,6 @@ void bond_activebackup_arp_mon(struct work_struct *work)
> 	if (bond->slave_cnt == 0)
> 		goto re_arm;
>
>-	if (bond->send_grat_arp) {
>-		read_lock(&bond->curr_slave_lock);
>-		bond_send_gratuitous_arp(bond);
>-		read_unlock(&bond->curr_slave_lock);
>-	}
>-
>-	if (bond->send_unsol_na) {
>-		read_lock(&bond->curr_slave_lock);
>-		bond_send_unsolicited_na(bond);
>-		read_unlock(&bond->curr_slave_lock);
>-	}
>-
> 	if (bond_ab_arp_inspect(bond, delta_in_ticks)) {
> 		read_unlock(&bond->lock);
> 		rtnl_lock();
>@@ -3645,9 +3568,6 @@ static int bond_close(struct net_device *bond_dev)
>
> 	write_lock_bh(&bond->lock);
>
>-	bond->send_grat_arp = 0;
>-	bond->send_unsol_na = 0;
>-
> 	/* signal timers not to re-arm */
> 	bond->kill_timers = 1;
>
>@@ -4724,18 +4644,6 @@ static int bond_check_params(struct bond_params *params)
> 		use_carrier = 1;
> 	}
>
>-	if (num_grat_arp < 0 || num_grat_arp > 255) {
>-		pr_warning("Warning: num_grat_arp (%d) not in range 0-255 so it was reset to 1\n",
>-			   num_grat_arp);
>-		num_grat_arp = 1;
>-	}
>-
>-	if (num_unsol_na < 0 || num_unsol_na > 255) {
>-		pr_warning("Warning: num_unsol_na (%d) not in range 0-255 so it was reset to 1\n",
>-			   num_unsol_na);
>-		num_unsol_na = 1;
>-	}
>-
> 	/* reset values for 802.3ad */
> 	if (bond_mode == BOND_MODE_8023AD) {
> 		if (!miimon) {
>@@ -4925,8 +4833,6 @@ static int bond_check_params(struct bond_params *params)
> 	params->mode = bond_mode;
> 	params->xmit_policy = xmit_hashtype;
> 	params->miimon = miimon;
>-	params->num_grat_arp = num_grat_arp;
>-	params->num_unsol_na = num_unsol_na;
> 	params->arp_interval = arp_interval;
> 	params->arp_validate = arp_validate_value;
> 	params->updelay = updelay;
>@@ -5121,7 +5027,6 @@ static int __init bonding_init(void)
>
> 	register_netdevice_notifier(&bond_netdev_notifier);
> 	register_inetaddr_notifier(&bond_inetaddr_notifier);
>-	bond_register_ipv6_notifier();
> out:
> 	return res;
> err:
>@@ -5136,7 +5041,6 @@ static void __exit bonding_exit(void)
> {
> 	unregister_netdevice_notifier(&bond_netdev_notifier);
> 	unregister_inetaddr_notifier(&bond_inetaddr_notifier);
>-	bond_unregister_ipv6_notifier();
>
> 	bond_destroy_sysfs();
> 	bond_destroy_debugfs();
>diff --git a/drivers/net/bonding/bond_sysfs.c b/drivers/net/bonding/bond_sysfs.c
>index de87aea..259ff32 100644
>--- a/drivers/net/bonding/bond_sysfs.c
>+++ b/drivers/net/bonding/bond_sysfs.c
>@@ -874,84 +874,6 @@ static DEVICE_ATTR(ad_select, S_IRUGO | S_IWUSR,
> 		   bonding_show_ad_select, bonding_store_ad_select);
>
> /*
>- * Show and set the number of grat ARP to send after a failover event.
>- */
>-static ssize_t bonding_show_n_grat_arp(struct device *d,
>-				   struct device_attribute *attr,
>-				   char *buf)
>-{
>-	struct bonding *bond = to_bond(d);
>-
>-	return sprintf(buf, "%d\n", bond->params.num_grat_arp);
>-}
>-
>-static ssize_t bonding_store_n_grat_arp(struct device *d,
>-				    struct device_attribute *attr,
>-				    const char *buf, size_t count)
>-{
>-	int new_value, ret = count;
>-	struct bonding *bond = to_bond(d);
>-
>-	if (sscanf(buf, "%d", &new_value) != 1) {
>-		pr_err("%s: no num_grat_arp value specified.\n",
>-		       bond->dev->name);
>-		ret = -EINVAL;
>-		goto out;
>-	}
>-	if (new_value < 0 || new_value > 255) {
>-		pr_err("%s: Invalid num_grat_arp value %d not in range 0-255; rejected.\n",
>-		       bond->dev->name, new_value);
>-		ret = -EINVAL;
>-		goto out;
>-	} else {
>-		bond->params.num_grat_arp = new_value;
>-	}
>-out:
>-	return ret;
>-}
>-static DEVICE_ATTR(num_grat_arp, S_IRUGO | S_IWUSR,
>-		   bonding_show_n_grat_arp, bonding_store_n_grat_arp);
>-
>-/*
>- * Show and set the number of unsolicited NA's to send after a failover event.
>- */
>-static ssize_t bonding_show_n_unsol_na(struct device *d,
>-				       struct device_attribute *attr,
>-				       char *buf)
>-{
>-	struct bonding *bond = to_bond(d);
>-
>-	return sprintf(buf, "%d\n", bond->params.num_unsol_na);
>-}
>-
>-static ssize_t bonding_store_n_unsol_na(struct device *d,
>-					struct device_attribute *attr,
>-					const char *buf, size_t count)
>-{
>-	int new_value, ret = count;
>-	struct bonding *bond = to_bond(d);
>-
>-	if (sscanf(buf, "%d", &new_value) != 1) {
>-		pr_err("%s: no num_unsol_na value specified.\n",
>-		       bond->dev->name);
>-		ret = -EINVAL;
>-		goto out;
>-	}
>-
>-	if (new_value < 0 || new_value > 255) {
>-		pr_err("%s: Invalid num_unsol_na value %d not in range 0-255; rejected.\n",
>-		       bond->dev->name, new_value);
>-		ret = -EINVAL;
>-		goto out;
>-	} else
>-		bond->params.num_unsol_na = new_value;
>-out:
>-	return ret;
>-}
>-static DEVICE_ATTR(num_unsol_na, S_IRUGO | S_IWUSR,
>-		   bonding_show_n_unsol_na, bonding_store_n_unsol_na);
>-
>-/*
>  * Show and set the MII monitor interval.  There are two tricky bits
>  * here.  First, if MII monitoring is activated, then we must disable
>  * ARP monitoring.  Second, if the timer isn't running, we must
>@@ -1650,8 +1572,6 @@ static struct attribute *per_bond_attrs[] = {
> 	&dev_attr_lacp_rate.attr,
> 	&dev_attr_ad_select.attr,
> 	&dev_attr_xmit_hash_policy.attr,
>-	&dev_attr_num_grat_arp.attr,
>-	&dev_attr_num_unsol_na.attr,
> 	&dev_attr_miimon.attr,
> 	&dev_attr_primary.attr,
> 	&dev_attr_primary_reselect.attr,
>diff --git a/drivers/net/bonding/bonding.h b/drivers/net/bonding/bonding.h
>index 90736cb..77180b1 100644
>--- a/drivers/net/bonding/bonding.h
>+++ b/drivers/net/bonding/bonding.h
>@@ -149,8 +149,6 @@ struct bond_params {
> 	int mode;
> 	int xmit_policy;
> 	int miimon;
>-	int num_grat_arp;
>-	int num_unsol_na;
> 	int arp_interval;
> 	int arp_validate;
> 	int use_carrier;
>@@ -178,9 +176,6 @@ struct vlan_entry {
> 	struct list_head vlan_list;
> 	__be32 vlan_ip;
> 	unsigned short vlan_id;
>-#if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE)
>-	struct in6_addr vlan_ipv6;
>-#endif
> };
>
> struct slave {
>@@ -234,8 +229,6 @@ struct bonding {
> 	rwlock_t lock;
> 	rwlock_t curr_slave_lock;
> 	s8       kill_timers;
>-	s8	 send_grat_arp;
>-	s8	 send_unsol_na;
> 	s8	 setup_by_slave;
> 	s8       igmp_retrans;
> #ifdef CONFIG_PROC_FS
>@@ -260,9 +253,6 @@ struct bonding {
> 	struct   delayed_work alb_work;
> 	struct   delayed_work ad_work;
> 	struct   delayed_work mcast_work;
>-#if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE)
>-	struct   in6_addr master_ipv6;
>-#endif
> #ifdef CONFIG_DEBUG_FS
> 	/* debugging suport via debugfs */
> 	struct	 dentry *debug_dir;
>@@ -459,23 +449,4 @@ extern const struct bond_parm_tbl fail_over_mac_tbl[];
> extern const struct bond_parm_tbl pri_reselect_tbl[];
> extern struct bond_parm_tbl ad_select_tbl[];
>
>-#if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE)
>-void bond_send_unsolicited_na(struct bonding *bond);
>-void bond_register_ipv6_notifier(void);
>-void bond_unregister_ipv6_notifier(void);
>-#else
>-static inline void bond_send_unsolicited_na(struct bonding *bond)
>-{
>-	return;
>-}
>-static inline void bond_register_ipv6_notifier(void)
>-{
>-	return;
>-}
>-static inline void bond_unregister_ipv6_notifier(void)
>-{
>-	return;
>-}
>-#endif
>-
> #endif /* _LINUX_BONDING_H */
>diff --git a/net/8021q/vlan.c b/net/8021q/vlan.c
>index b2ff70f..969e700 100644
>--- a/net/8021q/vlan.c
>+++ b/net/8021q/vlan.c
>@@ -501,13 +501,14 @@ static int vlan_device_event(struct notifier_block *unused, unsigned long event,
> 		return NOTIFY_BAD;
>
> 	case NETDEV_NOTIFY_PEERS:
>+	case NETDEV_BONDING_FAILOVER:
> 		/* Propagate to vlan devices */
> 		for (i = 0; i < VLAN_N_VID; i++) {
> 			vlandev = vlan_group_get_device(grp, i);
> 			if (!vlandev)
> 				continue;
>
>-			call_netdevice_notifiers(NETDEV_NOTIFY_PEERS, vlandev);
>+			call_netdevice_notifiers(event, vlandev);
> 		}
> 		break;
> 	}
>diff --git a/net/ipv4/devinet.c b/net/ipv4/devinet.c
>index 5345b0b..acf553f 100644
>--- a/net/ipv4/devinet.c
>+++ b/net/ipv4/devinet.c
>@@ -1203,6 +1203,7 @@ static int inetdev_event(struct notifier_block *this, unsigned long event,
> 			break;
> 		/* fall through */
> 	case NETDEV_NOTIFY_PEERS:
>+	case NETDEV_BONDING_FAILOVER:
> 		/* Send gratuitous ARP to notify of link change */
> 		inetdev_send_gratuitous_arp(dev, in_dev);
> 		break;
>diff --git a/net/ipv6/ndisc.c b/net/ipv6/ndisc.c
>index a51fa74c..6f7d491 100644
>--- a/net/ipv6/ndisc.c
>+++ b/net/ipv6/ndisc.c
>@@ -1746,6 +1746,7 @@ static int ndisc_netdev_event(struct notifier_block *this, unsigned long event,
> 		fib6_run_gc(~0UL, net);
> 		break;
> 	case NETDEV_NOTIFY_PEERS:
>+	case NETDEV_BONDING_FAILOVER:
> 		ndisc_send_unsol_na(dev);
> 		break;
> 	default:
>-- 
>1.7.4
>
>
>-- 
>Ben Hutchings, Senior Software Engineer, Solarflare
>Not speaking for my employer; that's the marketing department's job.
>They asked us to note that Solarflare product names are trademarked.
>

---
	-Jay Vosburgh, IBM Linux Technology Center, fubar@us.ibm.com


^ permalink raw reply

* Re: net: Automatic IRQ siloing for network devices
From: Ben Hutchings @ 2011-04-16  0:50 UTC (permalink / raw)
  To: Neil Horman; +Cc: netdev, davem
In-Reply-To: <1302908069.2845.29.camel@bwh-desktop>

On Fri, 2011-04-15 at 23:54 +0100, Ben Hutchings wrote:
> On Fri, 2011-04-15 at 16:17 -0400, Neil Horman wrote:
> > Automatic IRQ siloing for network devices
> > 
> > At last year's netconf:
> > http://vger.kernel.org/netconf2010.html
> > 
> > Tom Herbert gave a talk in which he outlined some of the things we can do to
> > improve scalability and throughput in our network stack
> > 
> > One of the big items on the slides was the notion of siloing irqs, which is the
> > practice of setting irq affinity to a cpu or cpu set that was 'close' to the
> > process that would be consuming data.  The idea was to ensure that a hard irq
> > for a nic (and its subsequent softirq) would execute on the same cpu as the
> > process consuming the data, increasing cache hit rates and speeding up overall
> > throughput.
> > 
> > I had taken an idea away from that talk, and have finally gotten around to
> > implementing it.  One of the problems with the above approach is that it's all
> > quite manual.  I.e. to properly enact this siloing, you have to do a few things
> > by hand:
> > 
> > 1) decide which process is the heaviest user of a given rx queue 
> > 2) restrict the cpus which that task will run on
> > 3) identify the irq which the rx queue in (1) maps to
> > 4) manually set the affinity for the irq in (3) to cpus which match the cpus in
> > (2)
> [...]
> 
> This presumably works well with small numbers of flows and/or large
> numbers of queues.  You could scale it up somewhat by manipulating the
> device's flow hash indirection table, but that usually only has 128
> entries.  (Changing the indirection table is currently quite expensive,
> though that could be changed.)
[...]

Actually, I reckon you could do a more or less generic implementation of
accelerated RFS on top of a flow hash indirection table.  It would
require the drivers to provide a new function to update single table
entries, and some way to switch between automatic configuration by RFS
and manual configuration with ethtool.
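
A rough sketch of the single-entry update idea, in plain userspace C rather
than driver code.  Everything named here (struct indir_table, indir_init,
indir_lookup, indir_steer) is hypothetical and only models the table; the
128-entry size is the typical figure mentioned above, and real hardware
varies.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define INDIR_TABLE_SIZE 128	/* typical size; an assumption, not a spec */

/* Hypothetical model of a device's flow hash indirection table. */
struct indir_table {
	uint16_t queue[INDIR_TABLE_SIZE];	/* hash bucket -> RX queue */
};

/* Default configuration: spread buckets round-robin over n_queues. */
void indir_init(struct indir_table *t, uint16_t n_queues)
{
	assert(n_queues > 0);
	for (size_t i = 0; i < INDIR_TABLE_SIZE; i++)
		t->queue[i] = (uint16_t)(i % n_queues);
}

/* RX queue that a flow with the given hash currently lands on. */
uint16_t indir_lookup(const struct indir_table *t, uint32_t flow_hash)
{
	return t->queue[flow_hash % INDIR_TABLE_SIZE];
}

/*
 * Retarget only the bucket this flow hashes to, leaving the other
 * entries untouched.  Returns the queue the bucket pointed at before.
 */
uint16_t indir_steer(struct indir_table *t, uint32_t flow_hash,
		     uint16_t new_queue)
{
	size_t bucket = flow_hash % INDIR_TABLE_SIZE;
	uint16_t old = t->queue[bucket];

	t->queue[bucket] = new_queue;
	return old;
}
```

Note that every flow sharing a bucket moves together, so this steers hash
buckets rather than individual flows.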

Ben.

-- 
Ben Hutchings, Senior Software Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.


^ permalink raw reply

* Re: Steps to integrate new 40G Network driver to the kernel tree
From: Stephen Hemminger @ 2011-04-16  0:54 UTC (permalink / raw)
  To: Joyce Yu - System Software; +Cc: netdev
In-Reply-To: <4DA8D9F6.2070908@oracle.com>

On Fri, 15 Apr 2011 16:51:18 -0700
Joyce Yu - System Software <joyce.yu@oracle.com> wrote:

> Hello,
> 
> We have a new 40G network Linux driver. What are the steps to integrate  
> it to the kernel tree?

There is a simple, well-documented process.
See the file SubmittingPatches in the kernel source Documentation directory.

Network drivers should be submitted to netdev list.

^ permalink raw reply

* Re: Steps to integrate new 40G Network driver to the kernel tree
From: Ben Hutchings @ 2011-04-16  0:57 UTC (permalink / raw)
  To: Joyce Yu - System Software; +Cc: netdev
In-Reply-To: <4DA8D9F6.2070908@oracle.com>

On Fri, 2011-04-15 at 16:51 -0700, Joyce Yu - System Software wrote:
> Hello,
> 
> We have a new 40G network Linux driver. What are the steps to integrate  
> it to the kernel tree?

Make sure it works as an addition to David Miller's net-2.6 git tree.
Send patches here for review, in <100K chunks (that's the limit for this
list).  Update it based on review feedback, and repeat as necessary.

Documentation/SubmittingPatches has general advice on patches.

Ben.

-- 
Ben Hutchings, Senior Software Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

^ permalink raw reply

* Re: [PATCH 2/3] net: Add net device irq siloing feature
From: Neil Horman @ 2011-04-16  1:49 UTC (permalink / raw)
  To: Ben Hutchings
  Cc: netdev, davem, Dimitris Michailidis, Thomas Gleixner,
	David Howells, Eric Dumazet, Tom Herbert
In-Reply-To: <1302907743.2845.23.camel@bwh-desktop>

On Fri, Apr 15, 2011 at 11:49:03PM +0100, Ben Hutchings wrote:
> On Fri, 2011-04-15 at 16:17 -0400, Neil Horman wrote:
> > Using the irq affinity infrastructure, we can now allow net devices to call
> > request_irq using a new wrapper function (request_net_irq), which will attach a
> > common affinty_update handler to each requested irq.  This affinity update
> > mechanism correlates each tracked irq to the flow(s) that said irq processes
> > most frequently.  The highest traffic flow is noted, marked and exported to user
> > space via the affinity_hint proc file for each irq. In this way, utilities like
> > irqbalance are able to determine which cpu is receiving the most data from each
> > rx queue on a given NIC, and set irq affinity accordingly.
> [...]
> 
> Is irqbalance expected to poll the affinity hints?  How often?
> 
Yes, it's done just that for quite some time.  Intel added that ability at the
same time they added the affinity_hint proc file.  Irqbalance polls the
affinity_hint file at the same time it rebalances all irqs (every 10 seconds).
If the affinity_hint is non-zero, irqbalance just copies it to smp_affinity for
the same irq.  Up until now that's been just about dead code because only ixgbe
sets affinity_hint.  That's why I added the affinity_alg file, so irqbalance
could do something more intelligent than just a blind copy.  With the patch that
I referenced I added code to irqbalance to allow it to perform different
balancing methods based on the output of affinity_alg.
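
For illustration only (this is not irqbalance's actual source), the blind copy
amounts to roughly the following.  The function names and the proc_root
parameter are made up for this sketch; the only things taken from the real
system are the /proc/irq/<N>/affinity_hint and /proc/irq/<N>/smp_affinity
files.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stdio.h>

/* True if a procfs cpumask string such as "00000000,0000000c" has a bit set. */
bool cpumask_str_nonzero(const char *mask)
{
	for (; *mask; mask++)
		if (isxdigit((unsigned char)*mask) && *mask != '0')
			return true;
	return false;
}

/*
 * Copy <proc_root>/irq/<irq>/affinity_hint to smp_affinity for the same
 * irq when the hint is non-zero.  proc_root would normally be "/proc";
 * it is a parameter only so the sketch can be pointed elsewhere.
 * Returns 1 if the hint was copied, 0 if it was all zeroes, -1 on error.
 */
int copy_affinity_hint(const char *proc_root, unsigned int irq)
{
	char path[256], mask[2048];
	FILE *f;

	assert(proc_root != NULL);

	snprintf(path, sizeof(path), "%s/irq/%u/affinity_hint", proc_root, irq);
	f = fopen(path, "r");
	if (!f)
		return -1;
	if (!fgets(mask, sizeof(mask), f)) {
		fclose(f);
		return -1;
	}
	fclose(f);

	if (!cpumask_str_nonzero(mask))
		return 0;	/* no hint set: leave the irq alone */

	snprintf(path, sizeof(path), "%s/irq/%u/smp_affinity", proc_root, irq);
	f = fopen(path, "w");
	if (!f)
		return -1;
	if (fputs(mask, f) == EOF) {
		fclose(f);
		return -1;
	}
	return fclose(f) == 0 ? 1 : -1;
}
```

Writing smp_affinity needs root, and a real balancer would run this for every
irq on each rebalance pass rather than for a single one.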
Neil

> Ben.
> 
> -- 
> Ben Hutchings, Senior Software Engineer, Solarflare
> Not speaking for my employer; that's the marketing department's job.
> They asked us to note that Solarflare product names are trademarked.
> 
> 

^ permalink raw reply

* Re: [PATCH net-next-2.6 3/3] bonding,ipv4,ipv6,vlan: Handle NETDEV_BONDING_FAILOVER like NETDEV_NOTIFY_PEERS
From: Brian Haley @ 2011-04-16  1:51 UTC (permalink / raw)
  To: Jay Vosburgh
  Cc: Ben Hutchings, David Miller, Andy Gospodarek, Patrick McHardy,
	netdev
In-Reply-To: <22334.1302913805@death>

On 04/15/2011 08:30 PM, Jay Vosburgh wrote:
> Ben Hutchings <bhutchings@solarflare.com> wrote:
> 
>> It is undesirable for the bonding driver to be poking into higher
>> level protocols, and notifiers provide a way to avoid that.  This does
>> mean removing the ability to configure repetition of gratuitous ARPs
>> and unsolicited NAs.
> 
> 	In principle I think this is a good thing (getting rid of some
> of those dependencies, duplicated code, etc).
> 
> 	However, the removal of the multiple grat ARP and NAs may be an
> issue for some users.  I don't know that we can just remove this (along
> with its API) without going through the feature removal process.

Right, I don't know how many people are using these, they might not be
happy, especially since specifying an unknown parameter will cause a
module load to fail:

--> modprobe bonding foobar=27
FATAL: Error inserting bonding (/lib/modules/2.6.32-31-generic/kernel/drivers/net/bonding/bonding.ko): Unknown symbol in module, or unknown parameter (see dmesg)

When these params are stuffed in /etc/modprobe.d/options, a reboot to
a kernel without them will cause some swearing :)

BTW, if this is accepted you need to update the documentation as well.

> 	As I recall, the multiple gratuitous ARP stuff was added for
> Infiniband, because it is dependent on the grat ARP for a smooth
> failover.
> 
> 	There is also currently logic to check the linkwatch link state
> to wait for the link to go up prior to sending a grat ARP; this is also
> for IB.
> 
> 	Brian Haley added the unsolicited NAs; I've added him to the cc
> so perhaps he (or somebody else) can comment on the necessity of keeping
> the ability to send multiple NAs.

I added it because in an IPv6-only environment I was seeing really long
failover times on bonds.  I believe this was a customer-reported issue, so
there *might* be someone setting it, but I think my testing always showed
one was enough to wake up the switch.

Is it useful to call netdev_bonding_change() multiple times from within
bond_change_active_slave(), like MAX(arp, na) times?

One comment below...

>> -/*
>> - * Kick out a gratuitous ARP for an IP on the bonding master plus one
>> - * for each VLAN above us.
>> - *
>> - * Caller must hold curr_slave_lock for read or better
>> - */
>> -static void bond_send_gratuitous_arp(struct bonding *bond)
>> -{
>> -	struct slave *slave = bond->curr_active_slave;
>> -	struct vlan_entry *vlan;
>> -	struct net_device *vlan_dev;
>> -
>> -	pr_debug("bond_send_grat_arp: bond %s slave %s\n",
>> -		 bond->dev->name, slave ? slave->dev->name : "NULL");
>> -
>> -	if (!slave || !bond->send_grat_arp ||
>> -	    test_bit(__LINK_STATE_LINKWATCH_PENDING, &slave->dev->state))
>> -		return;
>> -
>> -	bond->send_grat_arp--;
>> -
>> -	if (bond->master_ip) {
>> -		bond_arp_send(slave->dev, ARPOP_REPLY, bond->master_ip,
>> -				bond->master_ip, 0);
>> -	}
>> -
>> -	if (!bond->vlgrp)
>> -		return;
>> -
>> -	list_for_each_entry(vlan, &bond->vlan_list, vlan_list) {
>> -		vlan_dev = vlan_group_get_device(bond->vlgrp, vlan->vlan_id);
>> -		if (vlan->vlan_ip) {
>> -			bond_arp_send(slave->dev, ARPOP_REPLY, vlan->vlan_ip,
>> -				      vlan->vlan_ip, vlan->vlan_id);
>> -		}
>> -	}
>> -}

Does your change also cover this case with multiple VLAN IDs?  Is that covered in
the vlan.c code below?

>> diff --git a/net/8021q/vlan.c b/net/8021q/vlan.c
>> index b2ff70f..969e700 100644
>> --- a/net/8021q/vlan.c
>> +++ b/net/8021q/vlan.c
>> @@ -501,13 +501,14 @@ static int vlan_device_event(struct notifier_block *unused, unsigned long event,
>> 		return NOTIFY_BAD;
>>
>> 	case NETDEV_NOTIFY_PEERS:
>> +	case NETDEV_BONDING_FAILOVER:
>> 		/* Propagate to vlan devices */
>> 		for (i = 0; i < VLAN_N_VID; i++) {
>> 			vlandev = vlan_group_get_device(grp, i);
>> 			if (!vlandev)
>> 				continue;
>>
>> -			call_netdevice_notifiers(NETDEV_NOTIFY_PEERS, vlandev);
>> +			call_netdevice_notifiers(event, vlandev);
>> 		}
>> 		break;
>> 	}

Thanks,

-Brian

^ permalink raw reply

* Re: net: Automatic IRQ siloing for network devices
From: Neil Horman @ 2011-04-16  1:59 UTC (permalink / raw)
  To: Ben Hutchings; +Cc: netdev, davem
In-Reply-To: <1302908069.2845.29.camel@bwh-desktop>

On Fri, Apr 15, 2011 at 11:54:29PM +0100, Ben Hutchings wrote:
> On Fri, 2011-04-15 at 16:17 -0400, Neil Horman wrote:
> > Automatic IRQ siloing for network devices
> > 
> > At last year's netconf:
> > http://vger.kernel.org/netconf2010.html
> > 
> > Tom Herbert gave a talk in which he outlined some of the things we can do to
> > improve scalability and throughput in our network stack.
> > 
> > One of the big items on the slides was the notion of siloing irqs, which is the
> > practice of setting irq affinity to a cpu or cpu set that was 'close' to the
> > process that would be consuming data.  The idea was to ensure that a hard irq
> > for a nic (and its subsequent softirq) would execute on the same cpu as the
> > process consuming the data, increasing cache hit rates and speeding up overall
> > throughput.
> > 
> > I had taken an idea away from that talk, and have finally gotten around to
> > implementing it.  One of the problems with the above approach is that it's all
> > quite manual.  I.e. to properly enact this siloing, you have to do a few things
> > by hand:
> > 
> > 1) decide which process is the heaviest user of a given rx queue 
> > 2) restrict the cpus which that task will run on
> > 3) identify the irq which the rx queue in (1) maps to
> > 4) manually set the affinity for the irq in (3) to cpus which match the cpus in
> > (2)
> [...]
> 
> This presumably works well with small numbers of flows and/or large
> numbers of queues.  You could scale it up somewhat by manipulating the
> device's flow hash indirection table, but that usually only has 128
> entries.  (Changing the indirection table is currently quite expensive,
> though that could be changed.)
> 
> I see RFS and accelerated RFS as the only reasonable way to scale to
> large numbers of flows.  And as part of accelerated RFS, I already did
> the work for mapping CPUs to IRQs (note, not the other way round).  If
> IRQ affinity keeps changing then it will significantly undermine the
> usefulness of hardware flow steering.
> 
> Now I'm not saying that your approach is useless.  There is more
> hardware out there with flow hashing than with flow steering, and there
> are presumably many systems with small numbers of active flows.  But I
> think we need to avoid having two features that conflict and a
> requirement for administrators to make a careful selection between them.
> 
> Ben.
> 
I hear what you're saying and I agree, there's no point in having features work
against each other.  That said, I'm not sure I agree that these features have to
work against one another, nor does a sysadmin need to make a choice between the
two.  Note the third patch in this series.  Making this work requires that
network drivers wanting to participate in this affinity algorithm opt in by
using the request_net_irq macro to attach the interrupt to the rfs affinity code
that I added.  There's no reason that a driver which supports hardware that still
uses flow steering can't opt out of this algorithm, and as a result irqbalance
will still treat those interrupts as it normally does.  And for those drivers
which do opt in, irqbalance can take care of affinity assignment, using the
provided hint.  No need for sysadmin intervention.

I'm sure there can be improvements made to this code, but I think there's less
conflict between the work you've done and this code than there appears to be at
first blush.

Best
Neil

> -- 
> Ben Hutchings, Senior Software Engineer, Solarflare
> Not speaking for my employer; that's the marketing department's job.
> They asked us to note that Solarflare product names are trademarked.
> 
> 

^ permalink raw reply

* Re: [PATCH 1/3] irq: Add registered affinity guidance infrastructure
From: Neil Horman @ 2011-04-16  2:11 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: netdev, davem, nhorman, Dimitris Michailidis, David Howells,
	Eric Dumazet, Tom Herbert
In-Reply-To: <alpine.LFD.2.00.1104160144370.2744@localhost6.localdomain6>

On Sat, Apr 16, 2011 at 02:22:58AM +0200, Thomas Gleixner wrote:
> On Fri, 15 Apr 2011, Neil Horman wrote:
> 
> > From: nhorman <nhorman@devel2.think-freely.org>
> > 
> > This patch adds the needed data to the irq_desc struct, as well as the needed
> > API calls to allow the requester of an irq to register a handler function to
> > determine the affinity_hint of that irq when queried from user space.
> 
> This changelog simply sucks. It does not explain the rationale for
> this churn at all.
> 
It seems pretty clear to me.  It allows a common function to update the value of
affinity_hint when it's queried.

> Which problem is it solving?
> Why are the current interfaces not sufficient?
Did you read the initial post that I sent with it?  Apparently not.  Apologies,
it seems my git-send-email didn't cc everyone I expected it to:
http://marc.info/?l=linux-netdev&m=130291921026187&w=2

I'll skip the rest of your email, and just try to turn some of your rant
into something more acceptable to you.
Neil

^ permalink raw reply
