* Re: [PATCH 1/3] Kernel interfaces for multiqueue aware socket
From: Junchang Wang @ 2010-12-17 6:15 UTC (permalink / raw)
To: Eric Dumazet
Cc: Fenghua Yu, David S. Miller, John Fastabend, Xinan Tang, netdev,
linux-kernel
In-Reply-To: <1292475627.2603.39.camel@edumazet-laptop>
On Thu, Dec 16, 2010 at 1:00 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Thursday, 16 December 2010 at 09:52 +0800, Junchang Wang wrote:
>> Commit 564824b0c52c34692d had been used in the experiments, but the problem
>> remained unsolved.
>>
>> SLUB was used, and both servers were equipped with 8G physical memory.
>> Is there any
>> additional information I can provide?
>>
>
> Yes, sure, you could provide a description of the bench you used, and
> data you gathered to make the conclusion that NUMA was a problem.
>
Under the current load (1 Mpps), we can hardly see any side effects
from the memory allocator. At higher rates (say, 5 Mpps with this patch set),
the problem emerges.
I'll continue this work after the patch set is done.
Thanks.
--
--Junchang
* Re: [PATCH 1/3] Kernel interfaces for multiqueue aware socket
From: Junchang Wang @ 2010-12-17 6:12 UTC (permalink / raw)
To: Eric Dumazet
Cc: Fenghua Yu, David S. Miller, Fastabend, John R, Tang, Xinan,
netdev, linux-kernel
In-Reply-To: <1292474660.2603.37.camel@edumazet-laptop>
On Thu, Dec 16, 2010 at 12:44 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
> We really need to be smarter than that, not adding raw API.
>
> Tom Herbert added RPS, RFS, XPS, in a way applications don't have to use
> special API, just run normal code.
>
> Please understand that using 8 AF_PACKET sockets bound to a given device
> is a total waste, because of the way we loop on ptype_all before entering
> AF_PACKET code, and in 12.5% of the cases deliver the packet into a queue,
> and in 87.5% of the cases reject the packet.
>
> This is absolutely not scalable to say... 64 queues.
>
> I do believe we can handle that using one AF_PACKET socket for the RX
> side, in order to not slow down the loop we have in
> __netif_receive_skb()
>
> list_for_each_entry_rcu(ptype, &ptype_all, list) {
> ...
> deliver_skb(skb, pt_prev, orig_dev);
> }
>
> (Same problem with dev_queue_xmit_nit() by the way, even worse since we
> skb_clone() packet _before_ entering af_packet code)
>
> And we can change af_packet to split the load to N skb queues or N ring
> buffers, N not being necessarily the number of NIC queues, but the number
> needed to handle the expected load.
>
> There is nothing preventing us changing af_packet/udp/tcp_listener to
> something more scalable in itself, using a set of receive queues, and
> NUMA friendly data set. We did multiqueue for a net_device like this,
> not adding N pseudo devices as we could have done.
>
Valuable comments. Thank you very much.
We'll cook a new version and resubmit it.
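
For reference, the single-socket fan-out idea quoted above can be sketched in
user space roughly as follows. This is only an illustration under stated
assumptions: NQUEUES, struct fanout, flow_hash() and the other names are
hypothetical, and the hash is a stand-in mixer, not anything from af_packet.
It compares one hash lookup per packet against N sockets that each inspect
every packet.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NQUEUES 8u   /* hypothetical fan-out width, not tied to NIC queue count */

/* Hypothetical per-socket fan-out state: one counter per software queue. */
struct fanout {
    unsigned long enqueued[NQUEUES];
};

/* Small integer mixer standing in for a real flow hash (assumption). */
static uint32_t flow_hash(uint32_t saddr, uint32_t daddr,
                          uint16_t sport, uint16_t dport)
{
    uint32_t h = saddr ^ (daddr * 2654435761u) ^
                 (((uint32_t)sport << 16) | dport);

    h ^= h >> 16;
    h *= 0x85ebca6bu;
    h ^= h >> 13;
    return h;
}

/* One lookup per packet: pick a queue, never "reject". */
static unsigned int fanout_deliver(struct fanout *f, uint32_t saddr,
                                   uint32_t daddr, uint16_t sport,
                                   uint16_t dport)
{
    unsigned int q = flow_hash(saddr, daddr, sport, dport) % NQUEUES;

    f->enqueued[q]++;
    return q;
}

/* Model of the N-socket approach: every socket looks at every packet,
 * and all but one of them reject it. */
static unsigned long nsocket_rejects(unsigned long packets,
                                     unsigned int nsockets)
{
    assert(nsockets > 0);
    return packets * (nsockets - 1);
}
```

With 8 sockets, nsocket_rejects(1000, 8) is 7000 out of 8000 checks, which is
the 87.5% reject rate mentioned in the quoted text; the fan-out path does one
hash per packet regardless of how many queues exist.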
--
--Junchang
* biosdevname v0.3.4
From: Matt Domsch @ 2010-12-17 5:06 UTC (permalink / raw)
To: linux-hotplug, netdev, K, Narendra, Hargrave, Jordan,
Rose, Charles, Co
biosdevname, now version 0.3.4.
The main visible change is that port indices now start at 1 rather
than 0, when assigned by biosdevname (such as falling back to PIRQ)
rather than explicitly assigned by BIOS. This is in keeping with how the
indices are assigned by BIOS on Dell and HP servers.
em<port> where port starts at 1
pci<slot>#<port> where port starts at 1
As a side effect, the first VMware Workstation guest NIC now appears as pci3#1
because the virtual machine BIOS exposes the device as being in a PCI
slot via PIRQ.
This also drops an explicit dependency check on a particular udev
version. That version was supposed to properly handle parallel
conflicting renames when swizzling within the ethX namespace, but as
we've discovered, that doesn't always work. The udev in RHEL5 is
older than what we were specifying, but it works just fine, so no more
check.
Furthermore, if biosdevname somehow messes up (either through its own
bug or because of a buggy BIOS), and would assign the same name to two
different devices, it won't try to assign names to either (who knows
which is correct?). You can see the duplicates when running with the
-d debug option.
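
As a rough illustration of the naming and duplicate-suppression behavior
described above (not the actual biosdevname code), a minimal sketch might look
like this. struct netdev_entry, format_name() and assign_names() are
hypothetical names introduced only for this example.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define NAME_LEN 32

/* Hypothetical record for one network device; not biosdevname's real struct. */
struct netdev_entry {
    int embedded;          /* non-zero for on-board NICs */
    int slot;              /* PCI slot number, ignored when embedded */
    int port;              /* 1-based port index */
    char name[NAME_LEN];   /* suggested name, empty string if suppressed */
};

/* em<port> for embedded devices, pci<slot>#<port> for add-in cards. */
static void format_name(struct netdev_entry *d)
{
    assert(d->port >= 1);   /* indices start at 1, not 0 */
    if (d->embedded)
        snprintf(d->name, sizeof(d->name), "em%d", d->port);
    else
        snprintf(d->name, sizeof(d->name), "pci%d#%d", d->slot, d->port);
}

/* Build all names, then blank only the names that occur more than once. */
static void assign_names(struct netdev_entry *devs, size_t n)
{
    int dup[64] = {0};
    size_t i, j;

    assert(n <= 64);
    for (i = 0; i < n; i++)
        format_name(&devs[i]);
    for (i = 0; i < n; i++)
        for (j = i + 1; j < n; j++)
            if (strcmp(devs[i].name, devs[j].name) == 0)
                dup[i] = dup[j] = 1;
    for (i = 0; i < n; i++)
        if (dup[i])
            devs[i].name[0] = '\0';
}
```

Devices whose names collide end up with an empty name, while unrelated
devices keep theirs, matching the "only suppress duplicates" item in the
shortlog.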
Grab it here:
http://linux.dell.com/files/biosdevname/permalink/biosdevname-0.3.4.tar.gz
http://linux.dell.com/files/biosdevname/permalink/biosdevname-0.3.4.tar.gz.sign
git://linux.dell.com/biosdevname.git
I built this today for Fedora rawhide (will be 15), and I encourage
other distributions to pick it up as well.
shortlog:
Matt Domsch (5):
require any udev
Return nothing if duplicate names would be assigned.
Don't assign names to unknown devices
only supress duplicates, not all names if any duplicates exist
start with port index 1, not index 0
Thanks,
Matt
--
Matt Domsch
Technology Strategist
Dell | Office of the CTO
* [PATCH net-next] ipv6: remove duplicate neigh_ifdown
From: Stephen Hemminger @ 2010-12-17 3:42 UTC (permalink / raw)
To: David Miller; +Cc: netdev
In-Reply-To: <20101216175152.6767d0a7@nehalam>
When a device is being set down, neigh_ifdown was being called
twice: once from the addrconf notifier and once from the ndisc notifier.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
--- a/net/ipv6/addrconf.c 2010-12-16 17:50:26.658169250 -0800
+++ b/net/ipv6/addrconf.c 2010-12-16 17:52:15.227220647 -0800
@@ -2672,7 +2672,6 @@ static int addrconf_ifdown(struct net_de
/* Flush routes if device is being removed or it is not loopback */
if (how || !(dev->flags & IFF_LOOPBACK))
rt6_ifdown(net, dev);
- neigh_ifdown(&nd_tbl, dev);
idev = __in6_dev_get(dev);
if (idev == NULL)
* [PATCH net-next] ipv6: fib6_ifdown cleanup
From: Stephen Hemminger @ 2010-12-17 3:42 UTC (permalink / raw)
To: David Miller; +Cc: netdev
Remove (unnecessary) casts to make code cleaner.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
--- a/net/ipv6/route.c 2010-12-16 15:09:42.336361755 -0800
+++ b/net/ipv6/route.c 2010-12-16 17:50:27.997984246 -0800
@@ -2004,11 +2004,11 @@ struct arg_dev_net {
static int fib6_ifdown(struct rt6_info *rt, void *arg)
{
- struct net_device *dev = ((struct arg_dev_net *)arg)->dev;
- struct net *net = ((struct arg_dev_net *)arg)->net;
+ const struct arg_dev_net *adn = arg;
+ const struct net_device *dev = adn->dev;
- if (((void *)rt->rt6i_dev == dev || dev == NULL) &&
- rt != net->ipv6.ip6_null_entry) {
+ if ((rt->rt6i_dev == dev || dev == NULL) &&
+ rt != adn->net->ipv6.ip6_null_entry) {
RT6_TRACE("deleted by ifdown %p\n", rt);
return -1;
}
* Re: [PATCH v2] e1000e: convert to stats64
From: Jeff Kirsher @ 2010-12-17 3:14 UTC (permalink / raw)
To: Flavio Leitner; +Cc: Eric Dumazet, netdev, e1000-devel
In-Reply-To: <20101216123131.GA3070@redhat.com>
On Thu, Dec 16, 2010 at 04:31, Flavio Leitner <fleitner@redhat.com> wrote:
> On Tue, Dec 14, 2010 at 10:29:33PM +0100, Eric Dumazet wrote:
>> On Tuesday, 14 December 2010 at 18:32 -0200, Flavio Leitner wrote:
>> > Provides accurate stats at the time user reads them.
>> >
>> > Signed-off-by: Flavio Leitner <fleitner@redhat.com>
>> > ---
>> > drivers/net/e1000e/e1000.h | 5 ++-
>> > drivers/net/e1000e/ethtool.c | 27 +++++++++-------
>> > drivers/net/e1000e/netdev.c | 68 ++++++++++++++++++++++++-----------------
>> > 3 files changed, 59 insertions(+), 41 deletions(-)
>> >
>> > diff --git a/drivers/net/e1000e/e1000.h b/drivers/net/e1000e/e1000.h
>> > index fdc67fe..5a5e944 100644
>> > --- a/drivers/net/e1000e/e1000.h
>> > +++ b/drivers/net/e1000e/e1000.h
>> > @@ -363,6 +363,8 @@ struct e1000_adapter {
>> > /* structs defined in e1000_hw.h */
>> > struct e1000_hw hw;
>> >
>> > + spinlock_t stats64_lock;
>> > + struct rtnl_link_stats64 stats64;
>>
>> I am not sure why you add this stats64 in e1000_adapter?
>>
>> Why isn't it provided by callers (automatic variable, or provided to
>> ndo_get_stats64())? I don't see accumulators, only a full rewrite of this
>> structure in e1000e_update_stats().
>
> Good point. I have modified the patch to fix that.
> thanks!
>
> From 3487bd7dacd0c23bba315270139dab6e00e5ff02 Mon Sep 17 00:00:00 2001
> From: Flavio Leitner <fleitner@redhat.com>
> Date: Thu, 16 Dec 2010 10:26:03 -0200
> Subject: [PATCH] e1000e: convert to stats64
>
> Provides accurate stats at the time user reads them.
>
> Signed-off-by: Flavio Leitner <fleitner@redhat.com>
> ---
> drivers/net/e1000e/e1000.h | 3 ++
> drivers/net/e1000e/ethtool.c | 25 ++++++++-------
> drivers/net/e1000e/netdev.c | 68 +++++++++++++++++++++++++++++++++--------
> 3 files changed, 70 insertions(+), 26 deletions(-)
>
I have dropped your previous version of the patch and applied v2 to my
tree for review and testing.
Thanks Flavio!
--
Cheers,
Jeff
* Re: [PATCH net-next] bnx2x: Add Nic partitioning mode (57712 devices)
From: Matt Domsch @ 2010-12-17 2:45 UTC (permalink / raw)
To: Eilon Greenstein
Cc: Dimitris Michailidis, Dmitry Kravkov, davem@davemloft.net,
netdev@vger.kernel.org, narendra_k@dell.com,
jordan_hargrave@dell.com
In-Reply-To: <1291906166.21210.10.camel@lb-tlvb-eilong.il.broadcom.com>
On Thu, Dec 09, 2010 at 04:49:25PM +0200, Eilon Greenstein wrote:
> On Mon, 2010-12-06 at 10:21 -0800, Dimitris Michailidis wrote:
> > Matt Domsch wrote:
> ...
> > /sys/class/net/<ifname>/dev_id indicates the physical port <ifname> is
> > associated with. At least a few drivers set up dev_id this way.
> >
> >
>
> So we are in agreement? This can satisfy all needs? If so, we will add
> this scheme to the bnx2x as well.
I don't think that's enough. Necessary, but not sufficient.
If dev_id is a field that starts over with each PCI device (e.g. is
used to distinguish multiple ports that share the same PCI
device), that's enough to handle the Chelsio case, but not the NPAR &
SR-IOV case.
If the above is true, then a value of dev_id=0 for all 1:1 PCI Device
: Port relations is fine, and the three drivers that set dev_id
non-zero are all multi-port, single PCI device controllers:
cxgb4/t4_hw.c: adap->port[i]->dev_id = j;
mlx4/en_netdev.c: dev->dev_id = port - 1;
sfc/siena.c: efx->net_dev->dev_id = EFX_OWORD_FIELD(reg, FRF_CZ_CS_PORT_NUM) - 1;
Is that truly how these three controllers work: they set dev_id when
there are multiple physical ports that a single PCI d/b/d/f drives?
My naming convention of:
pci<slot>#<port>
wants to express this relationship. If I have a card with 2 PCI
devices, and 2 physical ports on each device, I have 4 ports to
describe. The dev_ids would look like: 0,1 0,1 , so I can't use that
value directly. I can make a list of PCI devices on the same card,
look at the dev_id field of each, and run a counter:
for each slot:
    int port=1;
    for each pci device:
        for each in net/<interface>/dev_id:
            use name pci<slot>#<port>
            port++
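
The counter loop above could be written in C along these lines. This is a
sketch under assumptions: struct iface is a hypothetical flattened view of
what would be read from sysfs, and the rows are assumed to be already sorted
by slot, then PCI device, then dev_id.

```c
#include <assert.h>
#include <stdio.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical flattened view of sysfs: one row per network interface,
 * assumed sorted by slot, then PCI function, then dev_id. */
struct iface {
    int slot;       /* physical PCI slot */
    int pci_func;   /* which PCI device on the card */
    int dev_id;     /* per-PCI-device port id, restarts at 0 */
    char name[32];  /* result: pci<slot>#<port> */
};

/* Run one port counter per slot, ignoring the raw dev_id values because
 * they restart at 0 for every PCI device. */
static void name_ports(struct iface *ifs, size_t n)
{
    int port = 1;
    size_t i;

    for (i = 0; i < n; i++) {
        if (i > 0 && ifs[i].slot != ifs[i - 1].slot)
            port = 1;   /* counter restarts for each slot */
        snprintf(ifs[i].name, sizeof(ifs[i].name), "pci%d#%d",
                 ifs[i].slot, port);
        port++;
    }
}
```

For a card in slot 3 with two PCI devices and dev_ids 0,1 0,1, this yields
pci3#1 through pci3#4.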
OK? Can someone with such a card send me the output of "tree /sys", so I
can see whether the tree really looks like I expect:
/sys/devices/pci0000:00/0000:00:1c.0/0000:0b:00.0/net/eth0/dev_id = 0
/sys/devices/pci0000:00/0000:00:1c.0/0000:0b:00.0/net/eth1/dev_id = 1
That is, simply find a net/ subdir under a PCI device; each of the
directories in net/ is an interface name, with a different dev_id value.
Now for the partitioned devices (NPAR or SR-IOV). Here, we have
multiple PCI devices mapped to the same port.
My naming convention of:
pci<slot>#<port>_<partition>
wants to express this relationship.
I need a way to express which port a given partition maps to. I'm
also presuming this is a static mapping right now, that it won't
change around during runtime (ala Xsigo, which I have no solution here
for; if the mapping isn't static, this is going to get trickier).
As dev_ids are only unique per PCI device, we would need a pointer to
the "base" device. However, in the Broadcom 57712 case, there is no
such "base" device. :-( So, using dev_id here doesn't seem like the
right approach for these devices.
What if we did something like this?
/sys/devices/net_ports/port0/
/sys/devices/pci0000:00/0000:00:1c.0/0000:0b:00.0/net/eth0/port ->
/../../../../../net_ports/port0
/sys/devices/pci0000:00/0000:00:1c.0/0000:0b:00.1/net/eth1/port ->
/../../../../../net_ports/port0
In this case, the port0 "name" is simply a way to group interfaces
into ports, it's not how ports are labeled on the chassis.
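
To make the grouping concrete, here is a small sketch of how a tool might
turn such port links into pci<slot>#<port>_<partition> names. Everything in
it is an assumption for illustration: the net_ports links are only a
proposal, struct part_iface is hypothetical, and numbering both the displayed
port and the partition from 1 is a choice made here, not something settled
above.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_PORTS 16

/* Hypothetical view of one interface after reading its proposed "port" link. */
struct part_iface {
    int slot;        /* PCI slot of the card */
    int port_group;  /* index N parsed from the net_ports/portN link target */
    char name[40];   /* result: pci<slot>#<port>_<partition> */
};

/* Interfaces sharing a port group become successive partitions of that port. */
static void name_partitions(struct part_iface *ifs, size_t n)
{
    int partitions_seen[MAX_PORTS] = {0};
    size_t i;

    for (i = 0; i < n; i++) {
        int g = ifs[i].port_group;
        int partition;

        assert(g >= 0 && g < MAX_PORTS);
        partition = ++partitions_seen[g];   /* 1-based (assumption) */
        /* Group index is 0-based; the displayed port is 1-based (assumption). */
        snprintf(ifs[i].name, sizeof(ifs[i].name), "pci%d#%d_%d",
                 ifs[i].slot, g + 1, partition);
    }
}
```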
Do network drivers know how many ports they have?
What are the characteristics of network ports? Ideally, physical
location (PCI slot), and index within that physical location. These
right now I'm deriving from SMBIOS and PCI, and if not explicitly
exposed, counting devices on the same slot and assigning port numbers
that way, but I would love to have explicit information from the
drivers.
Thoughts?
Thanks,
Matt
--
Matt Domsch
Technology Strategist
Dell | Office of the CTO
* Re: [RFC] ipv6: don't flush routes when setting loopback down
From: David Miller @ 2010-12-17 2:26 UTC (permalink / raw)
To: ebiederm
Cc: shemminger, brian.haley, netdev, maheshkelkar, lorenzo, yoshfuji,
stable
In-Reply-To: <m1mxo5tcx6.fsf@fess.ebiederm.org>
From: ebiederm@xmission.com (Eric W. Biederman)
Date: Thu, 16 Dec 2010 17:18:13 -0800
> Stephen Hemminger <shemminger@vyatta.com> writes:
>
>> When loopback device is being brought down, then keep the route table
>> entries because they are special. The entries in the local table for
>> linklocal routes and ::1 address should not be purged.
>>
>> This is a sub optimal solution to the problem and should be replaced
>> by a better fix in future.
>>
>> Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
>
> Stephen thanks for this. This patch looks good to me. I just tested
> this against 2.6.37-rc6 and my simple tests show it to be working
> without problems.
>
> Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Applied to net-2.6, thanks everyone.
* RE: [PATCH] e1000e: workaround missing power down mii control bit on 82571
From: Allan, Bruce W @ 2010-12-17 1:46 UTC (permalink / raw)
To: Arthur Jones; +Cc: Ben Hutchings, Kirsher, Jeffrey T, netdev@vger.kernel.org
In-Reply-To: <20101216221421.GR18990@ajones-laptop.nbttech.com>
>-----Original Message-----
>From: Arthur Jones [mailto:arthur.jones@riverbed.com]
>Sent: Thursday, December 16, 2010 2:14 PM
>To: Allan, Bruce W
>Cc: Ben Hutchings; Kirsher, Jeffrey T; netdev@vger.kernel.org
>Subject: Re: [PATCH] e1000e: workaround missing power down mii control bit on
>82571
>
>> > It's the reset in e1000_set_settings() which ignores that we had previously
>> > powered off the Phy. I'll go through the rest of the code and fix up this
>> > and any other occurrences of similar issues properly.
>>
>> Thanks for having a look!
>>
>> We do a read-modify-write there of
>> the PHY control register. We take
>> the rest of the bits as being good,
>> but, for some reason we don't get the
>> power down bit (always reads back
>> zero). Is this a known 82571 issue?
>> On 82574, e.g., we seem to get the
>> power down bit back when we read...
>
>BTW: The 802.3 spec seems to indicate
>that this bit _should_ be readable even
>when the PHY is powered down (i.e. this
>is a PHY bug)...
>
>Arthur
>
>>
>> Are you sure you want to spread that
>> 82571 specific logic all over the driver?
>>
>> Arthur
No, not a PHY bug. One difference between 82571 and 82574 is that during a
hardware reset (which is done by the ethtool command in your example
repro case), the reset on 82571 is much more aggressive than on
82574, and that reset clears the bit automatically.
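
A minimal simulation of that behavior, and of one possible way to cope with
it, might look like the following. It is not the e1000e fix: struct sim_phy,
struct sim_adapter and the sim_* helpers are hypothetical, and a real driver
would use MDIO register accesses. Only the BMCR_PDOWN bit value (0x0800 in
the MII control register) is taken from the standard register layout.

```c
#include <assert.h>
#include <stdint.h>

#define BMCR_PDOWN  0x0800u  /* MII control register: power down */

/* Hypothetical simulated PHY; a real driver would do MDIO accesses instead. */
struct sim_phy {
    uint16_t bmcr;
    int aggressive_reset;   /* 1 models the 82571-like reset described above */
};

struct sim_adapter {
    struct sim_phy phy;
    int phy_powered_down;   /* driver's own record of the intended state */
};

/* An aggressive reset clears the power-down bit as a side effect. */
static void sim_hw_reset(struct sim_phy *phy)
{
    if (phy->aggressive_reset)
        phy->bmcr &= (uint16_t)~BMCR_PDOWN;
}

static void sim_power_down(struct sim_adapter *a)
{
    a->phy.bmcr |= BMCR_PDOWN;
    a->phy_powered_down = 1;
}

/* Reset, then restore power-down from the cached state instead of
 * trusting whatever the register reads back afterwards. */
static void sim_reset_keep_power_state(struct sim_adapter *a)
{
    sim_hw_reset(&a->phy);
    if (a->phy_powered_down)
        a->phy.bmcr |= BMCR_PDOWN;
}
```

The point of the sketch is that a read-modify-write after such a reset cannot
recover the power-down state from the register alone; the intended state has
to be remembered somewhere else.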
Bruce.
* [PATCH RFC v3 1/2] bonding: generic netlink infrastructure
From: Jay Vosburgh @ 2010-12-17 1:35 UTC (permalink / raw)
To: netdev; +Cc: Andy Gospodarek
In-Reply-To: <1292549726-15957-1-git-send-email-fubar@us.ibm.com>
Generic netlink infrastructure for bonding. Includes two
netlink operations: notification for slave link state change, and a
"get mode" netlink command.
Signed-off-by: Jay Vosburgh <fubar@us.ibm.com>
---
drivers/net/bonding/Makefile | 3 +-
drivers/net/bonding/bond_main.c | 41 +++----
drivers/net/bonding/bond_netlink.c | 212 ++++++++++++++++++++++++++++++++++++
drivers/net/bonding/bond_netlink.h | 6 +
drivers/net/bonding/bonding.h | 1 +
include/linux/if_bonding.h | 23 ++++
6 files changed, 262 insertions(+), 24 deletions(-)
create mode 100644 drivers/net/bonding/bond_netlink.c
create mode 100644 drivers/net/bonding/bond_netlink.h
diff --git a/drivers/net/bonding/Makefile b/drivers/net/bonding/Makefile
index 0e2737e..b5fba40 100644
--- a/drivers/net/bonding/Makefile
+++ b/drivers/net/bonding/Makefile
@@ -4,7 +4,8 @@
obj-$(CONFIG_BONDING) += bonding.o
-bonding-objs := bond_main.o bond_3ad.o bond_alb.o bond_sysfs.o bond_debugfs.o
+bonding-objs := bond_main.o bond_3ad.o bond_alb.o bond_sysfs.o bond_debugfs.o \
+ bond_netlink.o
ipv6-$(subst m,y,$(CONFIG_IPV6)) += bond_ipv6.o
bonding-objs += $(ipv6-y)
diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index 07011e4..ac1c2f0 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -83,6 +83,7 @@
#include "bonding.h"
#include "bond_3ad.h"
#include "bond_alb.h"
+#include "bond_netlink.h"
/*---------------------------- Module parameters ----------------------------*/
@@ -2417,6 +2418,8 @@ static void bond_miimon_commit(struct bonding *bond)
bond_alb_handle_link_change(bond, slave,
BOND_LINK_UP);
+ bond_nl_link_change(bond, slave, BOND_LINK_UP);
+
if (!bond->curr_active_slave ||
(slave == bond->primary_slave))
goto do_failover;
@@ -2444,6 +2447,8 @@ static void bond_miimon_commit(struct bonding *bond)
bond_alb_handle_link_change(bond, slave,
BOND_LINK_DOWN);
+ bond_nl_link_change(bond, slave, BOND_LINK_DOWN);
+
if (slave == bond->curr_active_slave)
goto do_failover;
@@ -2865,6 +2870,7 @@ void bond_loadbalance_arp_mon(struct work_struct *work)
bond->dev->name,
slave->dev->name);
}
+ bond_nl_link_change(bond, slave, BOND_LINK_UP);
}
} else {
/* slave->link == BOND_LINK_UP */
@@ -2892,6 +2898,9 @@ void bond_loadbalance_arp_mon(struct work_struct *work)
if (slave == oldcurrent)
do_failover = 1;
+
+ bond_nl_link_change(bond, slave,
+ BOND_LINK_DOWN);
}
}
@@ -3038,6 +3047,8 @@ static void bond_ab_arp_commit(struct bonding *bond, int delta_in_ticks)
pr_info("%s: link status definitely up for interface %s.\n",
bond->dev->name, slave->dev->name);
+ bond_nl_link_change(bond, slave, BOND_LINK_UP);
+
if (!bond->curr_active_slave ||
(slave == bond->primary_slave))
goto do_failover;
@@ -3056,6 +3067,8 @@ static void bond_ab_arp_commit(struct bonding *bond, int delta_in_ticks)
pr_info("%s: link status definitely down for interface %s, disabling it\n",
bond->dev->name, slave->dev->name);
+ bond_nl_link_change(bond, slave, BOND_LINK_DOWN);
+
if (slave == bond->curr_active_slave) {
bond->current_arp_slave = NULL;
goto do_failover;
@@ -4685,7 +4698,7 @@ static void bond_destructor(struct net_device *bond_dev)
free_netdev(bond_dev);
}
-static void bond_setup(struct net_device *bond_dev)
+void bond_setup(struct net_device *bond_dev)
{
struct bonding *bond = netdev_priv(bond_dev);
@@ -5197,24 +5210,6 @@ static int bond_init(struct net_device *bond_dev)
return 0;
}
-static int bond_validate(struct nlattr *tb[], struct nlattr *data[])
-{
- if (tb[IFLA_ADDRESS]) {
- if (nla_len(tb[IFLA_ADDRESS]) != ETH_ALEN)
- return -EINVAL;
- if (!is_valid_ether_addr(nla_data(tb[IFLA_ADDRESS])))
- return -EADDRNOTAVAIL;
- }
- return 0;
-}
-
-static struct rtnl_link_ops bond_link_ops __read_mostly = {
- .kind = "bond",
- .priv_size = sizeof(struct bonding),
- .setup = bond_setup,
- .validate = bond_validate,
-};
-
/* Create a new bond based on the specified name and bonding parameters.
* If name is NULL, obtain a suitable "bond%d" name for us.
* Caller must NOT hold rtnl_lock; we need to release it here before we
@@ -5236,7 +5231,7 @@ int bond_create(struct net *net, const char *name)
}
dev_net_set(bond_dev, net);
- bond_dev->rtnl_link_ops = &bond_link_ops;
+ bond_set_rtnl_link_ops(bond_dev);
if (!name) {
res = dev_alloc_name(bond_dev, "bond%d");
@@ -5310,7 +5305,7 @@ static int __init bonding_init(void)
if (res)
goto out;
- res = rtnl_link_register(&bond_link_ops);
+ res = bond_netlink_init();
if (res)
goto err_link;
@@ -5332,7 +5327,7 @@ static int __init bonding_init(void)
out:
return res;
err:
- rtnl_link_unregister(&bond_link_ops);
+ bond_netlink_fini();
err_link:
unregister_pernet_subsys(&bond_net_ops);
#ifdef CONFIG_NET_POLL_CONTROLLER
@@ -5351,7 +5346,7 @@ static void __exit bonding_exit(void)
bond_destroy_sysfs();
bond_destroy_debugfs();
- rtnl_link_unregister(&bond_link_ops);
+ bond_netlink_fini();
unregister_pernet_subsys(&bond_net_ops);
#ifdef CONFIG_NET_POLL_CONTROLLER
diff --git a/drivers/net/bonding/bond_netlink.c b/drivers/net/bonding/bond_netlink.c
new file mode 100644
index 0000000..b77c772
--- /dev/null
+++ b/drivers/net/bonding/bond_netlink.c
@@ -0,0 +1,212 @@
+/*
+ * Generic Netlink support for bonding
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License as published by the
+ * Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
+ * or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License
+ * for more details.
+ *
+ * You should have received a copy of the GNU General Public License along
+ * with this program; if not, write to the Free Software Foundation, Inc.,
+ * 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
+ *
+ * Copyright IBM Corporation, 2010
+ *
+ * Author: Jay Vosburgh <fubar@us.ibm.com>
+ *
+ */
+
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/types.h>
+#include <net/ip.h>
+#include <linux/slab.h>
+#include <linux/string.h>
+#include <linux/errno.h>
+#include <linux/etherdevice.h>
+#include <linux/skbuff.h>
+#include <linux/if_bonding.h>
+#include <net/net_namespace.h>
+#include <net/netns/generic.h>
+#include <net/genetlink.h>
+
+#include "bonding.h"
+
+int bond_nl_seq;
+
+struct genl_family bond_genl_family = {
+ .id = GENL_ID_GENERATE,
+ .name = "bond",
+ .version = BOND_GENL_VERSION,
+ .maxattr = BOND_GENL_ATTR_MAX,
+};
+
+struct genl_multicast_group bond_genl_mcgrp = {
+ .name = BOND_GENL_MC_GROUP,
+};
+
+static int bond_genl_validate(struct genl_info *info)
+{
+ switch (info->genlhdr->cmd) {
+ case BOND_GENL_CMD_GET_MODE:
+ if (!info->attrs[BOND_GENL_ATTR_MASTER_INDEX])
+ return -EINVAL;
+ break;
+ default:
+ return -EINVAL;
+ }
+
+ return 0;
+}
+
+/*
+ * Send netlink notification of slave link state change.
+ */
+int bond_nl_link_change(struct bonding *bond, struct slave *slave, int state)
+{
+ struct sk_buff *skb;
+ void *msg;
+ int rv;
+
+ skb = genlmsg_new(NLMSG_DEFAULT_SIZE, GFP_ATOMIC);
+ if (!skb)
+ return -ENOMEM;
+
+ msg = genlmsg_put(skb, 0, bond_nl_seq++, &bond_genl_family, 0,
+ BOND_GENL_SLAVE_LINK);
+ if (!msg)
+ goto nla_put_failure;
+
+ NLA_PUT_U32(skb, BOND_GENL_ATTR_SLAVE_INDEX, slave->dev->ifindex);
+ NLA_PUT_U32(skb, BOND_GENL_ATTR_MASTER_INDEX, bond->dev->ifindex);
+ NLA_PUT_U32(skb, BOND_GENL_ATTR_SLAVE_LINK, state);
+
+ rv = genlmsg_end(skb, msg);
+ if (rv < 0)
+ goto nla_put_failure;
+
+ return genlmsg_multicast(skb, 0, bond_genl_mcgrp.id, GFP_ATOMIC);
+
+nla_put_failure:
+ nlmsg_free(skb);
+ return -EMSGSIZE;
+}
+
+static int bond_genl_get_mode(struct sk_buff *skb, struct genl_info *info)
+{
+ struct bonding *bond;
+ struct net_device *bond_dev;
+ struct sk_buff *rep_skb;
+ void *reply;
+ u32 m_idx, mode;
+ int rv;
+
+ rv = bond_genl_validate(info);
+ if (rv)
+ return rv;
+
+ m_idx = nla_get_u32(info->attrs[BOND_GENL_ATTR_MASTER_INDEX]);
+ bond_dev = dev_get_by_index(&init_net, m_idx);
+ if (!bond_dev || !(bond_dev->flags & IFF_MASTER) ||
+ !(bond_dev->priv_flags & IFF_BONDING))
+ return -EINVAL;
+
+ bond = netdev_priv(bond_dev);
+ mode = bond->params.mode;
+
+ rep_skb = genlmsg_new(NLMSG_DEFAULT_SIZE, GFP_KERNEL);
+ if (!rep_skb)
+ return -ENOMEM;
+
+ reply = genlmsg_put_reply(rep_skb, info, &bond_genl_family, 0,
+ info->genlhdr->cmd);
+ if (!reply)
+ goto nla_put_failure;
+
+ NLA_PUT_U32(rep_skb, BOND_GENL_ATTR_MODE, mode);
+
+ genlmsg_end(rep_skb, reply);
+
+ return genlmsg_reply(rep_skb, info);
+
+nla_put_failure:
+ nlmsg_free(rep_skb);
+ return -EMSGSIZE;
+}
+
+static struct nla_policy bond_genl_policy[BOND_GENL_ATTR_MAX + 1] = {
+ [BOND_GENL_ATTR_MASTER_INDEX] = { .type = NLA_U32 },
+ [BOND_GENL_ATTR_SLAVE_INDEX] = { .type = NLA_U32 },
+ [BOND_GENL_ATTR_MODE] = { .type = NLA_U32 },
+ [BOND_GENL_ATTR_SLAVE_LINK] = { .type = NLA_U32 },
+};
+
+static struct genl_ops bond_genl_ops[] = {
+ {
+ .cmd = BOND_GENL_CMD_GET_MODE,
+ .doit = bond_genl_get_mode,
+ .policy = bond_genl_policy,
+ },
+};
+
+static int bond_validate(struct nlattr *tb[], struct nlattr *data[])
+{
+ if (tb[IFLA_ADDRESS]) {
+ if (nla_len(tb[IFLA_ADDRESS]) != ETH_ALEN)
+ return -EINVAL;
+ if (!is_valid_ether_addr(nla_data(tb[IFLA_ADDRESS])))
+ return -EADDRNOTAVAIL;
+ }
+ return 0;
+}
+
+struct rtnl_link_ops bond_link_ops __read_mostly = {
+ .kind = "bond",
+ .priv_size = sizeof(struct bonding),
+ .setup = bond_setup,
+ .validate = bond_validate,
+};
+
+void bond_set_rtnl_link_ops(struct net_device *bond_dev)
+{
+ bond_dev->rtnl_link_ops = &bond_link_ops;
+}
+
+int __init bond_netlink_init(void)
+{
+ int rv;
+
+ rv = rtnl_link_register(&bond_link_ops);
+ if (rv)
+ goto out1;
+
+ rv = genl_register_family_with_ops(&bond_genl_family,
+ bond_genl_ops,
+ ARRAY_SIZE(bond_genl_ops));
+ if (rv)
+ goto out2;
+
+ rv = genl_register_mc_group(&bond_genl_family, &bond_genl_mcgrp);
+ if (rv)
+ goto out3;
+
+ return 0;
+
+out3:
+ genl_unregister_family(&bond_genl_family);
+out2:
+ rtnl_link_unregister(&bond_link_ops);
+out1:
+ return rv;
+}
+
+void __exit bond_netlink_fini(void)
+{
+ rtnl_link_unregister(&bond_link_ops);
+ genl_unregister_family(&bond_genl_family);
+}
diff --git a/drivers/net/bonding/bond_netlink.h b/drivers/net/bonding/bond_netlink.h
new file mode 100644
index 0000000..030c2af
--- /dev/null
+++ b/drivers/net/bonding/bond_netlink.h
@@ -0,0 +1,6 @@
+
+extern int bond_nl_link_change(struct bonding *bond, struct slave *slave,
+ int state);
+extern void bond_set_rtnl_link_ops(struct net_device *bond_dev);
+extern int bond_netlink_init(void);
+extern void bond_netlink_fini(void);
diff --git a/drivers/net/bonding/bonding.h b/drivers/net/bonding/bonding.h
index 03710f8..ed09a79 100644
--- a/drivers/net/bonding/bonding.h
+++ b/drivers/net/bonding/bonding.h
@@ -389,6 +389,7 @@ void bond_destroy_debugfs(void);
void bond_debug_register(struct bonding *bond);
void bond_debug_unregister(struct bonding *bond);
void bond_debug_reregister(struct bonding *bond);
+extern void bond_setup(struct net_device *bond_dev);
struct bond_net {
struct net * net; /* Associated network namespace */
diff --git a/include/linux/if_bonding.h b/include/linux/if_bonding.h
index a17edda..b03d832 100644
--- a/include/linux/if_bonding.h
+++ b/include/linux/if_bonding.h
@@ -114,6 +114,29 @@ struct ad_info {
__u8 partner_system[ETH_ALEN];
};
+enum {
+ BOND_GENL_ATTR_UNSPEC = 0,
+ BOND_GENL_ATTR_MASTER_INDEX,
+ BOND_GENL_ATTR_SLAVE_INDEX,
+ BOND_GENL_ATTR_MODE,
+ BOND_GENL_ATTR_SLAVE_LINK,
+ __BOND_GENL_ATTR_MAX,
+};
+
+#define BOND_GENL_ATTR_MAX (__BOND_GENL_ATTR_MAX - 1)
+
+enum {
+ BOND_GENL_CMD_UNSPEC = 0,
+ BOND_GENL_CMD_GET_MODE,
+ BOND_GENL_SLAVE_LINK,
+ __BOND_GENL_MAX,
+};
+
+#define BOND_GENL_MAX (__BOND_GENL_MAX - 1)
+
+#define BOND_GENL_VERSION 1
+#define BOND_GENL_MC_GROUP "bond_mc_group"
+
#endif /* _LINUX_IF_BONDING_H */
/*
--
1.6.0.2
* [PATCH RFC v3 0/2] bonding: generic netlink, multi-link mode
From: Jay Vosburgh @ 2010-12-17 1:35 UTC (permalink / raw)
To: netdev; +Cc: Andy Gospodarek
[ v3: moved up to today's net-next-2.6, cleaned up various cruft,
checkpatch stuff ]
These patches add support to bonding for generic netlink and a new
multi-link mode. At the moment, I'm looking primarily for discussion
about the generic netlink and implementation of multi-link.
First, in patch 1, is a generic netlink infrastructure for
bonding. This patch provides a "get mode" command and a "slave link state
change" asynchronous notification via a netlink multicast group. One long
term goal is to have bonding be controlled via netlink, both for
administrative purposes (add / remove slaves, etc) and policy (slave A is
better than slave B). I'd appreciate feedback from netlink savvy folks as
to whether this is the appropriate starting point.
Second, in patch 2, is the multi-link kernel code itself, which is
at present a work in progress. Here, I'm primarily looking for comments
regarding the control interface for this mode.
As implemented, this is a new mode to bonding, controlled via
generic netlink commands from a user space daemon. Slave assignment for
outgoing traffic is handled directly by bonding (the mapping table used by
multi-link is within bonding itself, and the usual transmit hash policy is
applied to the set of slaves allowable for a given destination).
In some private discussion with Andy, he suggested that this would
be better if it utilized the recently added queue mapping facility within
bonding, and then having the queue (and thus slave) assignments performed
at the qdisc level (via a tc filter) instead of within bonding itself.
This, I believe, would require a new tc filter that implements the ability
to set a skb queue_mapping in a hash (of protocol data in the packet) or
round robin fashion. In this case, the tc filter would also incorporate
all of the netlink functionality for communicating with the user space
daemon (to permit the mappings to be updated).
Thoughts?
Lastly, a description of the multi-link system itself. This is a
reimplementation of a load balancing scheme that has been available on AIX
for some time. It operates essentially as a load balancer by subnet, with
a UDP-based protocol to exchange multi-link topology information between
participating systems. Hosts participating in multi-link have IP
addresses in a separate subnet. Interfaces enslaved to multi-link do not
lose their assigned IP address information, and may also operate
separately from multi-link.
One notable feature is that multi-link provides load balancing
facilities for network devices that cannot change their MAC address, such
as Infiniband.
For example, given two systems as follows:
host A:
bond0 10.88.0.1/16
slave eth0 10.0.0.1/16
slave eth1 10.1.0.1/16
slave eth2 10.2.0.1/16
host B:
bond0 10.88.0.2/16
slave eth0 10.0.0.2/16
slave eth1 10.1.0.2/16
slave eth2 10.2.0.2/16
in this case, host A's bond0 running multi-link would load balance
traffic from 10.88.0.1 to 10.88.0.2 across eth0, eth1 and eth2. The user
space daemon negotiates the link set to use with other participating
hosts, and communicates that to the multi-link implementation.
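
To illustrate the transmit path described above (mapping table lookup, then
the transmit hash applied to the allowable slaves), here is a small
user-space sketch. It is not the code from patch 2: struct ml_dest,
xmit_hash() and ml_select_slave() are hypothetical names, and the hash is a
simple stand-in for bonding's transmit hash policies.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ML_MAX_SLAVES 8

/* Hypothetical mapping entry: which slaves may carry traffic to one ML peer. */
struct ml_dest {
    uint32_t ml_addr;            /* peer's multi-link address, host order */
    int slaves[ML_MAX_SLAVES];   /* indices of allowable slaves */
    size_t nslaves;
};

/* Stand-in for a layer 3+4 transmit hash (assumption, not bonding's code). */
static uint32_t xmit_hash(uint32_t saddr, uint32_t daddr,
                          uint16_t sport, uint16_t dport)
{
    return (saddr ^ daddr) ^ (uint32_t)(sport ^ dport);
}

/* Return the slave index for a packet, or -1 if the destination is unknown. */
static int ml_select_slave(const struct ml_dest *tbl, size_t n,
                           uint32_t saddr, uint32_t daddr,
                           uint16_t sport, uint16_t dport)
{
    size_t i;

    for (i = 0; i < n; i++) {
        if (tbl[i].ml_addr != daddr || tbl[i].nslaves == 0)
            continue;
        return tbl[i].slaves[xmit_hash(saddr, daddr, sport, dport) %
                             tbl[i].nslaves];
    }
    return -1;
}
```

For host A in the example, the table would hold one entry for 10.88.0.2 with
slaves eth0, eth1 and eth2; packets of a given flow always hash to the same
slave, while different flows spread across the set.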
-J
---
-Jay Vosburgh, IBM Linux Technology Center, fubar@us.ibm.com
* [PATCH RFC v3 2/2] bonding: add multi-link mode
From: Jay Vosburgh @ 2010-12-17 1:35 UTC (permalink / raw)
To: netdev; +Cc: Andy Gospodarek
In-Reply-To: <1292549726-15957-1-git-send-email-fubar@us.ibm.com>
Adds multi-link mode for bonding.
This mode performs per-subnet balancing, wherein each slave is
typically a member of a discrete IP subnet, and the multi-link (ML)
addresses exist in a subnet of their own. A user space daemon runs the
ML discovery protocol, which locates other ML hosts and exchanges link
information. The daemon then informs bonding of the appropriate set of
slaves to reach a particular ML destination. The ML daemon also monitors
the links to ensure continued availability.
Note that ML slaves maintain their assigned IP addresses, and
may operate outside the scope of the bond.
Signed-off-by: Jay Vosburgh <fubar@us.ibm.com>
---
drivers/net/bonding/Makefile | 2 +-
drivers/net/bonding/bond_main.c | 34 ++-
drivers/net/bonding/bond_ml.c | 638 ++++++++++++++++++++++++++++++++++++
drivers/net/bonding/bond_ml.h | 88 +++++
drivers/net/bonding/bond_netlink.c | 134 ++++++++
drivers/net/bonding/bond_netlink.h | 5 +
drivers/net/bonding/bonding.h | 13 +
include/linux/if.h | 1 +
include/linux/if_bonding.h | 15 +
net/core/dev.c | 37 ++-
10 files changed, 955 insertions(+), 12 deletions(-)
create mode 100644 drivers/net/bonding/bond_ml.c
create mode 100644 drivers/net/bonding/bond_ml.h
diff --git a/drivers/net/bonding/Makefile b/drivers/net/bonding/Makefile
index b5fba40..ef3fab4 100644
--- a/drivers/net/bonding/Makefile
+++ b/drivers/net/bonding/Makefile
@@ -5,7 +5,7 @@
obj-$(CONFIG_BONDING) += bonding.o
bonding-objs := bond_main.o bond_3ad.o bond_alb.o bond_sysfs.o bond_debugfs.o \
- bond_netlink.o
+ bond_netlink.o bond_ml.o
ipv6-$(subst m,y,$(CONFIG_IPV6)) += bond_ipv6.o
bonding-objs += $(ipv6-y)
diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index ac1c2f0..9b93248 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -200,6 +200,7 @@ const struct bond_parm_tbl bond_mode_tbl[] = {
{ "802.3ad", BOND_MODE_8023AD},
{ "balance-tlb", BOND_MODE_TLB},
{ "balance-alb", BOND_MODE_ALB},
+{ "multi-link", BOND_MODE_ML},
{ NULL, -1},
};
@@ -257,9 +258,10 @@ static const char *bond_mode_name(int mode)
[BOND_MODE_8023AD] = "IEEE 802.3ad Dynamic link aggregation",
[BOND_MODE_TLB] = "transmit load balancing",
[BOND_MODE_ALB] = "adaptive load balancing",
+ [BOND_MODE_ML] = "multi-link",
};
- if (mode < 0 || mode > BOND_MODE_ALB)
+ if (mode < 0 || mode > BOND_MODE_ML)
return "unknown";
return names[mode];
@@ -1603,7 +1605,7 @@ int bond_enslave(struct net_device *bond_dev, struct net_device *slave_dev)
*/
memcpy(new_slave->perm_hwaddr, slave_dev->dev_addr, ETH_ALEN);
- if (!bond->params.fail_over_mac) {
+ if (!bond->params.fail_over_mac && bond->params.mode != BOND_MODE_ML) {
/*
* Set slave to master's mac address. The application already
* set the master's mac address to that of the first slave
@@ -2097,6 +2099,9 @@ static int bond_release_all(struct net_device *bond_dev)
if (bond->params.mode == BOND_MODE_8023AD)
bond_3ad_unbind_slave(slave);
+ if (bond->params.mode == BOND_MODE_ML)
+ bond_ml_unbind_slave(bond, slave);
+
slave_dev = slave->dev;
bond_detach_slave(bond, slave);
@@ -3358,6 +3363,8 @@ static void bond_info_show_master(struct seq_file *seq)
seq_printf(seq, "\tPartner Mac Address: %pM\n",
ad_info.partner_system);
}
+ } else if (bond->params.mode == BOND_MODE_ML) {
+ bond_ml_show_proc(seq, bond);
}
}
@@ -3846,6 +3853,11 @@ static int bond_open(struct net_device *bond_dev)
bond_3ad_initiate_agg_selection(bond, 1);
}
+ if (bond->params.mode == BOND_MODE_ML) {
+ INIT_DELAYED_WORK(&bond->ml_work, bond_ml_monitor);
+ queue_delayed_work(bond->wq, &bond->ml_work, 0);
+ }
+
return 0;
}
@@ -3887,6 +3899,9 @@ static int bond_close(struct net_device *bond_dev)
case BOND_MODE_ALB:
cancel_delayed_work(&bond->alb_work);
break;
+ case BOND_MODE_ML:
+ cancel_delayed_work(&bond->ml_work);
+ break;
default:
break;
}
@@ -4605,6 +4620,8 @@ static netdev_tx_t bond_start_xmit(struct sk_buff *skb, struct net_device *dev)
case BOND_MODE_ALB:
case BOND_MODE_TLB:
return bond_alb_xmit(skb, dev);
+ case BOND_MODE_ML:
+ return bond_xmit_ml(skb, dev);
default:
/* Should never happen, mode already checked */
pr_err("%s: Error: Unknown bonding mode %d\n",
@@ -4642,6 +4659,11 @@ void bond_set_mode_ops(struct bonding *bond, int mode)
/* FALLTHRU */
case BOND_MODE_TLB:
break;
+ case BOND_MODE_ML:
+ bond_set_xmit_hash_policy(bond);
+ bond_set_master_ml_flags(bond);
+ bond_ml_init(bond);
+ break;
default:
/* Should never happen, mode already checked */
pr_err("%s: Error: Unknown bonding mode %d\n",
@@ -4716,7 +4738,6 @@ void bond_setup(struct net_device *bond_dev)
ether_setup(bond_dev);
bond_dev->netdev_ops = &bond_netdev_ops;
bond_dev->ethtool_ops = &bond_ethtool_ops;
- bond_set_mode_ops(bond, bond->params.mode);
bond_dev->destructor = bond_destructor;
@@ -4729,6 +4750,8 @@ void bond_setup(struct net_device *bond_dev)
if (bond->params.arp_interval)
bond_dev->priv_flags |= IFF_MASTER_ARPMON;
+ bond_set_mode_ops(bond, bond->params.mode);
+
/* At first, we block adding VLANs. That's the only way to
* prevent problems that occur when adding VLANs over an
* empty bond. The block will be removed once non-challenged
@@ -4776,6 +4799,10 @@ static void bond_work_cancel_all(struct bonding *bond)
delayed_work_pending(&bond->ad_work))
cancel_delayed_work(&bond->ad_work);
+ if (bond->params.mode == BOND_MODE_ML &&
+ delayed_work_pending(&bond->ml_work))
+ cancel_delayed_work(&bond->ml_work);
+
if (delayed_work_pending(&bond->mcast_work))
cancel_delayed_work(&bond->mcast_work);
}
@@ -4863,6 +4890,7 @@ static int bond_check_params(struct bond_params *params)
if (xmit_hash_policy) {
if ((bond_mode != BOND_MODE_XOR) &&
+ (bond_mode != BOND_MODE_ML) &&
(bond_mode != BOND_MODE_8023AD)) {
pr_info("xmit_hash_policy param is irrelevant in mode %s\n",
bond_mode_name(bond_mode));
diff --git a/drivers/net/bonding/bond_ml.c b/drivers/net/bonding/bond_ml.c
new file mode 100644
index 0000000..264df06
--- /dev/null
+++ b/drivers/net/bonding/bond_ml.c
@@ -0,0 +1,638 @@
+/*
+ * Multi-link mode support for bonding
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License as published by the
+ * Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
+ * or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License
+ * for more details.
+ *
+ * You should have received a copy of the GNU General Public License along
+ * with this program; if not, write to the Free Software Foundation, Inc.,
+ * 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
+ *
+ * Copyright IBM Corporation, 2010
+ *
+ * Author: Jay Vosburgh <fubar@us.ibm.com>
+ */
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/types.h>
+#include <linux/skbuff.h>
+#include <linux/netdevice.h>
+#include <linux/etherdevice.h>
+#include <linux/ip.h>
+#include <linux/if_arp.h>
+#include <linux/if_ether.h>
+#include <linux/if_bonding.h>
+#include <linux/in.h>
+#include <net/arp.h>
+#include <net/route.h>
+#include <net/genetlink.h>
+
+#include "bonding.h"
+#include "bond_netlink.h"
+
+static u32 bond_ml_salt __read_mostly;
+
+static inline int bond_ml_hash(const __be32 mladdr)
+{
+ return jhash_1word(mladdr, bond_ml_salt) & (BOND_ML_HASH_SZ - 1);
+}
+
+/*
+ * Create new ml_route entry, insert into hash table.
+ *
+ * Caller holds bond->lock for write.
+ */
+static struct ml_route *bond_mlr_create(struct bonding *bond, __be32 mladdr)
+{
+ struct ml_route *mlr, *head;
+ int hash;
+
+ mlr = kzalloc(sizeof(*mlr), GFP_ATOMIC);
+ if (!mlr)
+ return NULL;
+
+ mlr->state = MLRT_EMPTY;
+ hash = bond_ml_hash(mladdr);
+
+ head = bond->ml_info.ml_rtable[hash];
+ mlr->next = head;
+ bond->ml_info.ml_rtable[hash] = mlr;
+
+ return mlr;
+}
+
+/*
+ * Destroy ml_route entry. Remove from hash table if necessary, then free.
+ * Caller responsible for freeing ml_dest table.
+ *
+ * Caller holds bond->lock for write.
+ */
+static void bond_mlr_destroy(struct bonding *bond, struct ml_route *mlr)
+{
+ struct ml_route *mlr_prev;
+ int hash;
+
+ hash = bond_ml_hash(mlr->ml_ipaddr.addr.s_addr);
+ pr_debug("bmd: ip %x h %x rt[h] %p\n", mlr->ml_ipaddr.addr.s_addr,
+ hash, bond->ml_info.ml_rtable[hash]);
+
+ if (bond->ml_info.ml_rtable[hash] == mlr) {
+ bond->ml_info.ml_rtable[hash] = mlr->next;
+ goto out;
+ }
+
+ for (mlr_prev = bond->ml_info.ml_rtable[hash]; mlr_prev;
+ mlr_prev = mlr_prev->next) {
+ if (mlr_prev->next == mlr) {
+ mlr_prev->next = mlr->next;
+ goto out;
+ }
+ }
+
+ pr_err("%s: bond_mlr_destroy: mlr %p has next, but not in table\n",
+ bond->dev->name, mlr);
+
+out:
+ kfree(mlr);
+}
+
+/*
+ * Look up ml_route entry for supplied ML IP address.
+ *
+ * Caller holds bond->lock for read or better.
+ */
+static struct ml_route *bond_ml_route_output(struct bonding *bond,
+ __be32 mladdr)
+{
+ struct ml_route *mlr;
+ int hash;
+
+ hash = bond_ml_hash(mladdr);
+ mlr = bond->ml_info.ml_rtable[hash];
+
+ while (mlr) {
+ if (mlr->state == MLRT_COMPLETE &&
+ mlr->ml_ipaddr.addr.s_addr == mladdr)
+ return mlr;
+ mlr = mlr->next;
+ }
+
+ return NULL;
+}
+
+/*
+ * Find "nth" ml_dest in supplied ml_route, where nth is zero-based. Used
+ * by TX to find suitable slave to send on. N must be less than
+ * mlr->num_dest.
+ */
+static struct ml_dest *bond_mlr_dest_output(struct ml_route *mlr, int nth)
+{
+ int b;
+
+ b = find_next_bit(&mlr->ml_dest_map, BOND_ML_NDEST, 0);
+ while (nth--)
+ b = find_next_bit(&mlr->ml_dest_map, BOND_ML_NDEST, b + 1);
+
+ return mlr->ml_dest[b];
+}
+
+/*
+ * Find ml_dest in supplied ml_route. Also match against laddr or raddr
+ * if nonzero.
+ */
+static struct ml_dest *bond_mlr_dest_find(struct ml_route *mlr,
+ __be32 laddr, __be32 raddr)
+{
+ struct ml_dest *mld;
+ int i;
+
+ for (i = 0; i < BOND_ML_NDEST; i++) {
+ mld = mlr->ml_dest[i];
+ if (!mld)
+ continue;
+ if (laddr && (laddr != mld->laddr))
+ continue;
+ if (raddr && (raddr != mld->raddr))
+ continue;
+
+ return mld;
+ }
+ return NULL;
+}
+
+static void bond_mlr_dest_free(struct bonding *bond, struct ml_route *mlr,
+ struct ml_dest *mld)
+{
+ int i;
+
+ pr_debug("dest_free: s %s l %pI4 r %pI4 ml %pI4\n",
+ mld->slave->dev->name, &mld->laddr, &mld->raddr,
+ &mlr->ml_ipaddr.addr);
+
+ for (i = 0; i < BOND_ML_NDEST; i++) {
+ if (mlr->ml_dest[i] == mld)
+ break;
+ }
+
+ if (i == BOND_ML_NDEST) {
+ pr_debug("bond_mlr_dest_free: mld not found in mlr\n");
+ return;
+ }
+
+ mlr->ml_dest[i] = NULL;
+ mlr->num_dest--;
+
+ if (mld->neigh)
+ neigh_release(mld->neigh);
+
+ kfree(mld);
+
+ clear_bit(i, &mlr->ml_dest_map);
+ if (mlr->ml_dest_map)
+ return;
+
+ mlr->state = MLRT_INCOMPLETE;
+ mlr->ml_ipaddr.flag = MLDD_IF_DOWN;
+}
+
+static struct ml_dest *bond_mlr_dest_new(struct ml_route *mlr)
+{
+ struct ml_dest *mld;
+ int n;
+
+ n = find_first_zero_bit(&mlr->ml_dest_map, BOND_ML_NDEST);
+ if (n == BOND_ML_NDEST)
+ return NULL;
+
+ mld = kzalloc(sizeof(*mld), GFP_ATOMIC);
+ if (!mld)
+ return NULL;
+
+ set_bit(n, &mlr->ml_dest_map);
+
+ mlr->num_dest++;
+ mlr->ml_dest[n] = mld;
+ return mld;
+}
+
+int bond_ml_delrt(struct bonding *bond, struct in_addr laddr,
+ struct in_addr raddr, struct in_addr mladdr,
+ struct slave *slave)
+{
+ struct ml_route *mlr;
+ struct ml_dest *mld;
+ int rv = 0;
+
+ pr_debug("ml_delrt: l %pI4 r %pI4 ml %pI4\n", &laddr, &raddr, &mladdr);
+ write_lock_bh(&bond->lock);
+
+ mlr = bond_ml_route_output(bond, mladdr.s_addr);
+ if (!mlr) {
+ rv = -ENOENT;
+ goto out;
+ }
+ mld = bond_mlr_dest_find(mlr, laddr.s_addr, raddr.s_addr);
+ if (!mld) {
+ rv = -ENOENT;
+ goto out;
+ }
+
+ bond_mlr_dest_free(bond, mlr, mld);
+
+out:
+ write_unlock_bh(&bond->lock);
+ return rv;
+}
+
+int bond_ml_addrt(struct bonding *bond, struct in_addr laddr,
+ struct in_addr raddr, struct in_addr mladdr,
+ struct slave *slave)
+{
+ struct ml_route *mlr;
+ struct ml_dest *mld;
+ struct neighbour *n;
+ int rv = 0, alloc_mlr = 0;
+
+ pr_debug("ml_addrt: %s l %pI4 r %pI4 m %pI4 s %s\n", bond->dev->name,
+ &laddr, &raddr, &mladdr, slave->dev->name);
+
+ write_lock_bh(&bond->lock);
+
+ mlr = bond_ml_route_output(bond, mladdr.s_addr);
+ if (mlr) {
+ mld = bond_mlr_dest_find(mlr, laddr.s_addr, raddr.s_addr);
+ if (mld) {
+ rv = -EEXIST;
+ goto out;
+ }
+ }
+
+ if (!mlr) {
+ mlr = bond_mlr_create(bond, mladdr.s_addr);
+ if (!mlr) {
+ rv = -ENOMEM;
+ goto out;
+ }
+ alloc_mlr++;
+ }
+
+ mld = bond_mlr_dest_new(mlr);
+ if (!mld) {
+ rv = -ENOSPC;
+ goto out;
+ }
+
+ mld->slave = bond_get_slave_by_dev(bond, slave->dev);
+ if (!mld->slave) {
+ pr_debug("%s: %s not slave\n", bond->dev->name,
+ slave->dev->name);
+ rv = -EINVAL;
+ goto out;
+ }
+
+ mld->laddr = laddr.s_addr;
+ mld->raddr = raddr.s_addr;
+
+ n = __neigh_lookup(&arp_tbl, &mld->raddr, mld->slave->dev, 1);
+ if (!n) {
+ rv = -ENOMEM;
+ goto out;
+ }
+
+ n->used = jiffies;
+ neigh_event_send(n, NULL);
+ mld->neigh = n;
+
+ mlr->state = MLRT_COMPLETE;
+ mlr->ml_ipaddr.addr.s_addr = mladdr.s_addr;
+ mlr->ml_ipaddr.flag = MLDD_IF_UP;
+
+out:
+ if (rv && alloc_mlr)
+ bond_mlr_destroy(bond, mlr);
+
+ write_unlock_bh(&bond->lock);
+ return rv;
+}
+
+void bond_ml_rt_flush(struct bonding *bond)
+{
+ int i, j;
+ struct ml_route *mlr, *next;
+ struct ml_dest *mld;
+
+ write_lock_bh(&bond->lock);
+
+ for (i = 0; i < BOND_ML_HASH_SZ; i++) {
+ mlr = bond->ml_info.ml_rtable[i];
+
+ while (mlr) {
+ for (j = 0; j < BOND_ML_NDEST; j++) {
+ mld = mlr->ml_dest[j];
+ if (mld)
+ bond_mlr_dest_free(bond, mlr, mld);
+ }
+
+ next = mlr->next;
+ bond_mlr_destroy(bond, mlr);
+ mlr = next;
+ }
+ }
+
+ write_unlock_bh(&bond->lock);
+}
+
+
+/*
+ * Send DISCOVERY message to daemon
+ *
+ * For DISCOVERY, MLADDR is the remote MLADDR we need to resolve.
+ */
+static int bond_ml_discovery(struct bonding *bond, __be32 mladdr)
+{
+ struct sk_buff *skb;
+ void *msg;
+ int rv;
+
+ skb = genlmsg_new(NLMSG_DEFAULT_SIZE, GFP_ATOMIC);
+ if (!skb)
+ return -ENOMEM;
+
+ msg = genlmsg_put(skb, 0, bond_nl_seq++, &bond_genl_family, 0,
+ BOND_GENL_ML_CMD_DISCOVERY);
+ if (!msg)
+ goto nla_put_failure;
+
+ NLA_PUT_U32(skb, BOND_GENL_ATTR_ML_MLADDR, mladdr);
+ NLA_PUT_U32(skb, BOND_GENL_ATTR_MASTER_INDEX, bond->dev->ifindex);
+
+ rv = genlmsg_end(skb, msg);
+ if (rv < 0)
+ goto nla_put_failure;
+
+ return genlmsg_multicast(skb, 0, bond_genl_mcgrp.id, GFP_ATOMIC);
+
+nla_put_failure:
+ nlmsg_free(skb);
+ return -EMSGSIZE;
+}
+
+/*
+ * Look up skb's IP destination in ML route table
+ * If exists, send the packet via the found ML destination
+ * If not, initiate ML discovery
+ */
+int bond_xmit_ml(struct sk_buff *skb, struct net_device *bond_dev)
+{
+ struct bonding *bond = netdev_priv(bond_dev);
+ struct ml_route *mlr;
+ struct ml_dest *mld;
+ struct iphdr *iph;
+ struct neighbour *n;
+ struct net_device *slave_dev;
+ int rv = 1;
+ int sl;
+
+ read_lock(&bond->lock);
+
+ if (!BOND_IS_OK(bond))
+ goto out;
+
+ switch (skb->protocol) {
+ case htons(ETH_P_IP):
+ iph = ip_hdr(skb);
+ if (!iph) {
+ pr_debug("b_x_ml: no iph\n");
+ goto out;
+ }
+
+ mlr = bond_ml_route_output(bond, iph->daddr);
+ if (!mlr) {
+ rv = bond_ml_discovery(bond, iph->daddr);
+ pr_debug("b_x_ml: %s disco s %pI4 d %pI4 rv %d\n",
+ bond->dev->name, &iph->saddr, &iph->daddr, rv);
+ goto out;
+ }
+
+ sl = bond->xmit_hash_policy(skb, mlr->num_dest);
+ mld = bond_mlr_dest_output(mlr, sl);
+ if (!mld) {
+ pr_debug("b_x_ml: no mld sl %d n_d %d\n", sl,
+ mlr->num_dest);
+ goto out;
+ }
+ if (!mld->slave) {
+ pr_debug("b_x_ml: no slave\n");
+ goto out;
+ }
+
+ n = mld->neigh;
+ if (n) {
+ slave_dev = mld->slave->dev;
+ rv = dev_hard_header(skb, slave_dev,
+ ntohs(skb->protocol), n->ha,
+ slave_dev->dev_addr, skb->len);
+ } else {
+ pr_debug("b_x_ml: no n\n");
+ }
+
+ rv = bond_dev_queue_xmit(bond, skb, mld->slave->dev);
+ break;
+
+ case htons(ETH_P_ARP):
+ pr_debug("b_x_ml: UNEXPECTED ARP\n");
+ break;
+
+ default:
+ rv = bond_dev_queue_xmit(bond, skb, bond->first_slave->dev);
+ break;
+ }
+
+out:
+ read_unlock(&bond->lock);
+ if (rv) {
+ pr_debug("xmit_ml rv %d\n", rv);
+ dev_kfree_skb(skb);
+ }
+
+ return NETDEV_TX_OK;
+}
+
+static char *mlr_state_nm(int s)
+{
+ switch (s) {
+ case MLRT_COMPLETE:
+ return "C";
+ case MLRT_INCOMPLETE:
+ return "I";
+ case MLRT_EMPTY:
+ return "E";
+ default:
+ return "?";
+ }
+}
+
+static char *mlr_ipaddr_flag_nm(int f)
+{
+ switch (f) {
+ case MLDD_IF_UP:
+ return "UP";
+ case MLDD_IF_DOWN:
+ return "DN";
+ default:
+ return "??";
+ }
+}
+
+void bond_ml_show_proc_mlr(struct seq_file *seq, struct ml_route *mlr)
+{
+ struct ml_dest *mld;
+ int j;
+
+ for (j = 0; j < BOND_ML_NDEST; j++) {
+ mld = mlr->ml_dest[j];
+ if (mld)
+ seq_printf(seq, " D %02d s %s l %pI4 r %pI4\n",
+ j, mld->slave->dev->name,
+ &mld->laddr, &mld->raddr);
+ }
+}
+
+void bond_ml_show_proc(struct seq_file *seq, struct bonding *bond)
+{
+ struct ml_route *mlr;
+ int i;
+
+ read_lock(&bond->lock);
+
+ for (i = 0; i < BOND_ML_HASH_SZ; i++) {
+ mlr = bond->ml_info.ml_rtable[i];
+
+ while (mlr) {
+ seq_printf(seq, "%02d s %s ndest %d ml_i: f %s %pI4\n",
+ i, mlr_state_nm(mlr->state), mlr->num_dest,
+ mlr_ipaddr_flag_nm(mlr->ml_ipaddr.flag),
+ &mlr->ml_ipaddr.addr.s_addr);
+
+ if (mlr->state == MLRT_COMPLETE)
+ bond_ml_show_proc_mlr(seq, mlr);
+
+ mlr = mlr->next;
+ }
+ }
+
+ read_unlock(&bond->lock);
+}
+
+static const int ml_delta_in_ticks = HZ * 10;
+
+/*
+ * ML periodic monitor
+ *
+ * Walk the ML routing table. For each entry, check its state. Ensure
+ * that ARP entries for ML routing entries are kept up to date.
+ */
+void bond_ml_monitor(struct work_struct *work)
+{
+ struct bonding *bond = container_of(work, struct bonding,
+ ml_work.work);
+ struct ml_route *mlr;
+ struct ml_dest *mld;
+ struct neighbour *n;
+ int i, j, rv;
+
+ read_lock(&bond->lock);
+
+ if (bond->kill_timers)
+ goto out;
+
+ for (i = 0; i < BOND_ML_HASH_SZ; i++) {
+ mlr = bond->ml_info.ml_rtable[i];
+
+ while (mlr) {
+ if (mlr->state == MLRT_EMPTY) {
+ mlr = mlr->next;
+ continue;
+ }
+
+ for (j = 0; j < BOND_ML_NDEST; j++) {
+ mld = mlr->ml_dest[j];
+ if (!mld)
+ continue;
+
+ n = __neigh_lookup(&arp_tbl, &mld->raddr,
+ mld->slave->dev, 1);
+ if (n) {
+ n->used = jiffies;
+ rv = neigh_event_send(n, NULL);
+ neigh_release(n);
+ } else {
+ pr_debug("bmm: no n r %pI4 s %s\n",
+ &mld->raddr,
+ mld->slave->dev->name);
+ }
+ }
+
+ mlr = mlr->next;
+ }
+ }
+
+ queue_delayed_work(bond->wq, &bond->ml_work, ml_delta_in_ticks);
+out:
+ read_unlock(&bond->lock);
+}
+
+/*
+ * Use a limited set of header_ops. At packet transmit time, we'll use
+ * the selected slave's ops to fill in the hard_header.
+ */
+static const struct header_ops bond_ml_header_ops = {
+ .create = NULL,
+ .rebuild = eth_rebuild_header,
+ .parse = eth_header_parse,
+ .cache = NULL,
+ .cache_update = NULL,
+};
+
+/*
+ * called with bond->lock held for write
+ */
+void bond_ml_unbind_slave(struct bonding *bond, struct slave *slave)
+{
+ struct ml_route *mlr;
+ struct ml_dest *mld;
+ int i, j;
+
+ for (i = 0; i < BOND_ML_HASH_SZ; i++) {
+ mlr = bond->ml_info.ml_rtable[i];
+
+ while (mlr) {
+ for (j = 0; j < BOND_ML_NDEST; j++) {
+ mld = mlr->ml_dest[j];
+ if (mld && mld->slave == slave)
+ bond_mlr_dest_free(bond, mlr, mld);
+ }
+ mlr = mlr->next;
+ }
+ }
+}
+
+void bond_ml_init(struct bonding *bond)
+{
+ struct net_device *bond_dev = bond->dev;
+
+ memset(&bond->ml_info, 0, sizeof(bond->ml_info));
+
+ bond_dev->flags |= IFF_NOARP;
+ bond_dev->flags &= ~(IFF_MULTICAST | IFF_BROADCAST);
+ bond_dev->header_ops = &bond_ml_header_ops;
+
+ get_random_bytes(&bond_ml_salt, sizeof(bond_ml_salt));
+}
diff --git a/drivers/net/bonding/bond_ml.h b/drivers/net/bonding/bond_ml.h
new file mode 100644
index 0000000..0f7e417
--- /dev/null
+++ b/drivers/net/bonding/bond_ml.h
@@ -0,0 +1,88 @@
+/*
+ *
+ */
+#ifndef __BOND_ML_H__
+#define __BOND_ML_H__
+
+#define MLDD_IF_DOWN 0xc0
+#define MLDD_IF_UP 0xc1
+
+struct ml_ipaddr {
+ u8 ip_version;
+ u8 flag;
+ u16 tick;
+ struct in_addr addr;
+};
+
+#define MLDD_BCAST_REPLY 0xf0
+#define MLDD_UCAST_REPLY 0xf1
+#define MLDD_REQUEST 0xf2
+#define MLDD_LOOKUP 0xf3
+
+struct ml_msg {
+ u8 version;
+ u8 op;
+ u16 reserved1;
+ u32 num;
+ s32 request_index;
+ s32 reply_index;
+ struct ml_ipaddr ml_ipaddr;
+ u16 req_net;
+ u16 rep_net;
+};
+
+struct ml_dest {
+ struct slave *slave;
+ struct neighbour *neigh;
+ __be32 laddr;
+ __be32 raddr;
+};
+
+#define MLRT_COMPLETE 0xa0
+#define MLRT_INCOMPLETE 0xa1
+#define MLRT_EMPTY 0xa2
+
+/*
+ * The ML protocol is limited to 16 destinations per ML route.
+ */
+#define BOND_ML_NDEST 16
+
+/*
+ * An ML route contains one peer IP address, the "ML IP" address of the
+ * peer system. Within that route are one or more destination entries
+ * that specify the various possible paths to reach the ML IP peer. Each
+ * destination entry includes the local slave and the peer interface IP
+ * address at the destination.
+ */
+struct ml_route {
+ struct ml_route *next;
+ u16 state;
+ struct ml_ipaddr ml_ipaddr;
+ int num_dest;
+ unsigned long ml_dest_map;
+ struct ml_dest *ml_dest[BOND_ML_NDEST];
+};
+
+/*
+ * Hash by ML IP address
+ */
+#define BOND_ML_HASH_SZ 32
+
+struct ml_bond_info {
+ struct ml_route *ml_rtable[BOND_ML_HASH_SZ];
+};
+
+extern int bond_xmit_ml(struct sk_buff *skb, struct net_device *bond_dev);
+extern int bond_ml_changelink(struct bonding *bond, struct bond_ml_route *bmr);
+extern void bond_ml_monitor(struct work_struct *work);
+extern void bond_ml_show_proc(struct seq_file *, struct bonding *);
+extern void bond_ml_init(struct bonding *);
+extern int bond_ml_addrt(struct bonding *, struct in_addr, struct in_addr,
+ struct in_addr, struct slave *);
+extern int bond_ml_delrt(struct bonding *, struct in_addr, struct in_addr,
+ struct in_addr, struct slave *);
+extern void bond_ml_unbind_slave(struct bonding *bond, struct slave *slave);
+extern void bond_ml_rt_flush(struct bonding *bond);
+
+
+#endif /* __BOND_ML_H__ */
diff --git a/drivers/net/bonding/bond_netlink.c b/drivers/net/bonding/bond_netlink.c
index b77c772..754c475 100644
--- a/drivers/net/bonding/bond_netlink.c
+++ b/drivers/net/bonding/bond_netlink.c
@@ -57,6 +57,21 @@ static int bond_genl_validate(struct genl_info *info)
if (!info->attrs[BOND_GENL_ATTR_MASTER_INDEX])
return -EINVAL;
break;
+ case BOND_GENL_ML_CMD_RT_ADD:
+ case BOND_GENL_ML_CMD_RT_DEL:
+ if (!info->attrs[BOND_GENL_ATTR_ML_MLADDR])
+ return -EINVAL;
+ if (!info->attrs[BOND_GENL_ATTR_ML_LADDR])
+ return -EINVAL;
+ if (!info->attrs[BOND_GENL_ATTR_ML_RADDR])
+ return -EINVAL;
+ if (!info->attrs[BOND_GENL_ATTR_ML_INDEX])
+ return -EINVAL;
+ break;
+ case BOND_GENL_ML_CMD_RT_FLUSH:
+ if (!info->attrs[BOND_GENL_ATTR_MASTER_INDEX])
+ return -EINVAL;
+ break;
default:
return -EINVAL;
}
@@ -139,11 +154,115 @@ nla_put_failure:
return -EMSGSIZE;
}
+static int bond_genl_ml_flush_route(struct sk_buff *skb, struct genl_info *info)
+{
+ struct bonding *bond;
+ struct net_device *bond_dev;
+ struct sk_buff *rep_skb = NULL;
+ void *reply;
+ u32 m_idx;
+ int rv;
+
+ rv = bond_genl_validate(info);
+ if (rv)
+ return rv;
+
+ m_idx = nla_get_u32(info->attrs[BOND_GENL_ATTR_MASTER_INDEX]);
+ bond_dev = dev_get_by_index(&init_net, m_idx);
+ if (!bond_dev || !(bond_dev->flags & IFF_MASTER) ||
+ !(bond_dev->priv_flags & IFF_BONDING)) {
+ rv = -EINVAL;
+ goto out_err;
+ }
+
+ bond = netdev_priv(bond_dev);
+ bond_ml_rt_flush(bond);
+
+ rep_skb = genlmsg_new(NLMSG_DEFAULT_SIZE, GFP_KERNEL);
+ if (!rep_skb)
+ goto out_err;
+
+ reply = genlmsg_put_reply(rep_skb, info, &bond_genl_family, 0,
+ info->genlhdr->cmd);
+ if (!reply)
+ goto out_err;
+
+ rv = genlmsg_end(rep_skb, reply);
+ if (rv < 0)
+ goto out_err;
+
+ return genlmsg_reply(rep_skb, info);
+
+out_err:
+ if (bond_dev)
+ dev_put(bond_dev);
+ if (rep_skb)
+ nlmsg_free(rep_skb);
+
+ return rv;
+}
+
+static int bond_genl_ml_chg_route(struct sk_buff *skb, struct genl_info *info)
+{
+ struct in_addr laddr, raddr, mladdr;
+ u32 l_idx;
+ struct net_device *slave_dev, *bond_dev;
+ struct bonding *bond;
+ struct slave *slave;
+ int rv, cmd;
+
+ rv = bond_genl_validate(info);
+ if (rv)
+ return rv;
+
+ laddr.s_addr = nla_get_u32(info->attrs[BOND_GENL_ATTR_ML_LADDR]);
+ raddr.s_addr = nla_get_u32(info->attrs[BOND_GENL_ATTR_ML_RADDR]);
+ mladdr.s_addr = nla_get_u32(info->attrs[BOND_GENL_ATTR_ML_MLADDR]);
+ l_idx = nla_get_u32(info->attrs[BOND_GENL_ATTR_ML_INDEX]);
+
+ cmd = info->genlhdr->cmd;
+
+ pr_debug("ml_route: cmd %d l %pI4 r %pI4 m %pI4 i %u\n",
+ cmd, &laddr, &raddr, &mladdr, l_idx);
+
+ slave_dev = dev_get_by_index(&init_net, l_idx);
+ if (!slave_dev || !(slave_dev->priv_flags & IFF_BONDING))
+ return -EINVAL;
+
+ bond_dev = slave_dev->master;
+ if (!bond_dev || !(bond_dev->priv_flags & IFF_BONDING))
+ return -EINVAL;
+
+ bond = netdev_priv(bond_dev);
+
+ slave = bond_get_slave_by_dev(bond, slave_dev);
+ if (!slave)
+ return -EINVAL;
+
+ switch (cmd) {
+ case BOND_GENL_ML_CMD_RT_ADD:
+ rv = bond_ml_addrt(bond, laddr, raddr, mladdr, slave);
+ break;
+ case BOND_GENL_ML_CMD_RT_DEL:
+ rv = bond_ml_delrt(bond, laddr, raddr, mladdr, slave);
+ break;
+ default:
+ pr_debug("bond_genl_ml_route: impossible cmd %d\n", cmd);
+ return -EINVAL;
+ }
+
+ return rv;
+}
+
static struct nla_policy bond_genl_policy[BOND_GENL_ATTR_MAX + 1] = {
[BOND_GENL_ATTR_MASTER_INDEX] = { .type = NLA_U32 },
[BOND_GENL_ATTR_SLAVE_INDEX] = { .type = NLA_U32 },
[BOND_GENL_ATTR_MODE] = { .type = NLA_U32 },
[BOND_GENL_ATTR_SLAVE_LINK] = { .type = NLA_U32 },
+ [BOND_GENL_ATTR_ML_LADDR] = { .type = NLA_U32 },
+ [BOND_GENL_ATTR_ML_RADDR] = { .type = NLA_U32 },
+ [BOND_GENL_ATTR_ML_MLADDR] = { .type = NLA_U32 },
+ [BOND_GENL_ATTR_ML_INDEX] = { .type = NLA_U32 },
};
static struct genl_ops bond_genl_ops[] = {
@@ -152,6 +271,21 @@ static struct genl_ops bond_genl_ops[] = {
.doit = bond_genl_get_mode,
.policy = bond_genl_policy,
},
+ {
+ .cmd = BOND_GENL_ML_CMD_RT_ADD,
+ .doit = bond_genl_ml_chg_route,
+ .policy = bond_genl_policy,
+ },
+ {
+ .cmd = BOND_GENL_ML_CMD_RT_DEL,
+ .doit = bond_genl_ml_chg_route,
+ .policy = bond_genl_policy,
+ },
+ {
+ .cmd = BOND_GENL_ML_CMD_RT_FLUSH,
+ .doit = bond_genl_ml_flush_route,
+ .policy = bond_genl_policy,
+ },
};
static int bond_validate(struct nlattr *tb[], struct nlattr *data[])
diff --git a/drivers/net/bonding/bond_netlink.h b/drivers/net/bonding/bond_netlink.h
index 030c2af..c979cdd 100644
--- a/drivers/net/bonding/bond_netlink.h
+++ b/drivers/net/bonding/bond_netlink.h
@@ -1,6 +1,11 @@
+extern struct genl_family bond_genl_family;
+extern struct genl_multicast_group bond_genl_mcgrp;
+extern int bond_nl_seq;
+
extern int bond_nl_link_change(struct bonding *bond, struct slave *slave,
int state);
extern void bond_set_rtnl_link_ops(struct net_device *bond_dev);
extern int bond_netlink_init(void);
extern void bond_netlink_fini(void);
+
diff --git a/drivers/net/bonding/bonding.h b/drivers/net/bonding/bonding.h
index ed09a79..e6bbd07 100644
--- a/drivers/net/bonding/bonding.h
+++ b/drivers/net/bonding/bonding.h
@@ -23,6 +23,7 @@
#include <linux/in6.h>
#include "bond_3ad.h"
#include "bond_alb.h"
+#include "bond_ml.h"
#define DRV_VERSION "3.7.0"
#define DRV_RELDATE "June 2, 2010"
@@ -246,6 +247,7 @@ struct bonding {
u16 rr_tx_counter;
struct ad_bond_info ad_info;
struct alb_bond_info alb_info;
+ struct ml_bond_info ml_info;
struct bond_params params;
struct list_head vlan_list;
struct vlan_group *vlgrp;
@@ -255,6 +257,7 @@ struct bonding {
struct delayed_work arp_work;
struct delayed_work alb_work;
struct delayed_work ad_work;
+ struct delayed_work ml_work;
struct delayed_work mcast_work;
#if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE)
struct in6_addr master_ipv6;
@@ -365,6 +368,16 @@ static inline void bond_unset_master_alb_flags(struct bonding *bond)
bond->dev->priv_flags &= ~IFF_MASTER_ALB;
}
+static inline void bond_set_master_ml_flags(struct bonding *bond)
+{
+ bond->dev->priv_flags |= IFF_MASTER_ML;
+}
+
+static inline void bond_unset_master_ml_flags(struct bonding *bond)
+{
+ bond->dev->priv_flags &= ~IFF_MASTER_ML;
+}
+
struct vlan_entry *bond_next_vlan(struct bonding *bond, struct vlan_entry *curr);
int bond_dev_queue_xmit(struct bonding *bond, struct sk_buff *skb, struct net_device *slave_dev);
int bond_create(struct net *net, const char *name);
diff --git a/include/linux/if.h b/include/linux/if.h
index 1239599..826b06f 100644
--- a/include/linux/if.h
+++ b/include/linux/if.h
@@ -77,6 +77,7 @@
#define IFF_BRIDGE_PORT 0x8000 /* device used as bridge port */
#define IFF_OVS_DATAPATH 0x10000 /* device used as Open vSwitch
* datapath port */
+#define IFF_MASTER_ML 0x20000 /* bonding master, multi-link */
#define IF_GET_IFACE 0x0001 /* for querying only */
#define IF_GET_PROTO 0x0002
diff --git a/include/linux/if_bonding.h b/include/linux/if_bonding.h
index b03d832..15c8773 100644
--- a/include/linux/if_bonding.h
+++ b/include/linux/if_bonding.h
@@ -70,6 +70,7 @@
#define BOND_MODE_8023AD 4
#define BOND_MODE_TLB 5
#define BOND_MODE_ALB 6 /* TLB + RLB (receive load balancing) */
+#define BOND_MODE_ML 7
/* each slave's link has 4 states */
#define BOND_LINK_UP 0 /* link is up and running */
@@ -114,12 +115,22 @@ struct ad_info {
__u8 partner_system[ETH_ALEN];
};
+struct bond_ml_route {
+ __u16 lif_index;
+ struct in_addr laddr;
+ struct in_addr raddr;
+};
+
enum {
BOND_GENL_ATTR_UNSPEC = 0,
BOND_GENL_ATTR_MASTER_INDEX,
BOND_GENL_ATTR_SLAVE_INDEX,
BOND_GENL_ATTR_MODE,
BOND_GENL_ATTR_SLAVE_LINK,
+ BOND_GENL_ATTR_ML_LADDR,
+ BOND_GENL_ATTR_ML_RADDR,
+ BOND_GENL_ATTR_ML_MLADDR,
+ BOND_GENL_ATTR_ML_INDEX,
__BOND_GENL_ATTR_MAX,
};
@@ -129,6 +140,10 @@ enum {
BOND_GENL_CMD_UNSPEC = 0,
BOND_GENL_CMD_GET_MODE,
BOND_GENL_SLAVE_LINK,
+ BOND_GENL_ML_CMD_RT_ADD,
+ BOND_GENL_ML_CMD_RT_DEL,
+ BOND_GENL_ML_CMD_RT_FLUSH,
+ BOND_GENL_ML_CMD_DISCOVERY,
__BOND_GENL_MAX,
};
diff --git a/net/core/dev.c b/net/core/dev.c
index d28b3a0..02b653b 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2921,10 +2921,28 @@ static inline void skb_bond_set_mac_by_master(struct sk_buff *skb,
/* On bonding slaves other than the currently active slave, suppress
* duplicates except for 802.3ad ETH_P_SLOW, alb non-mcast/bcast, and
* ARP on active-backup slaves with arp_validate enabled.
+ * Additionally, set skb->dev appropriately for the mode / action.
*/
int __skb_bond_should_drop(struct sk_buff *skb, struct net_device *master)
{
struct net_device *dev = skb->dev;
+ struct iphdr *iph;
+
+ if (master->priv_flags & IFF_MASTER_ML) {
+ if (skb->protocol == htons(ETH_P_IP)) {
+ iph = ip_hdr(skb);
+ if (!iph)
+ goto out;
+
+ /* For ML, assign to master only if traffic is for
+ * master, as slaves keep their assigned IP addresses
+ */
+ if (!ip_route_input(skb, iph->daddr, iph->saddr, 0,
+ master))
+ skb->dev = master;
+ }
+ return 0;
+ }
if (master->priv_flags & IFF_MASTER_ARPMON)
dev->last_rx = jiffies;
@@ -2941,19 +2959,22 @@ int __skb_bond_should_drop(struct sk_buff *skb, struct net_device *master)
if (dev->priv_flags & IFF_SLAVE_INACTIVE) {
if ((dev->priv_flags & IFF_SLAVE_NEEDARP) &&
skb->protocol == __cpu_to_be16(ETH_P_ARP))
- return 0;
+ goto out;
if (master->priv_flags & IFF_MASTER_ALB) {
if (skb->pkt_type != PACKET_BROADCAST &&
skb->pkt_type != PACKET_MULTICAST)
- return 0;
+ goto out;
}
if (master->priv_flags & IFF_MASTER_8023AD &&
skb->protocol == __cpu_to_be16(ETH_P_SLOW))
- return 0;
+ goto out;
return 1;
}
+
+out:
+ skb->dev = master;
return 0;
}
EXPORT_SYMBOL(__skb_bond_should_drop);
@@ -2981,6 +3002,10 @@ static int __netif_receive_skb(struct sk_buff *skb)
if (!skb->skb_iif)
skb->skb_iif = skb->dev->ifindex;
+ skb_reset_network_header(skb);
+ skb_reset_transport_header(skb);
+ skb->mac_len = skb->network_header - skb->mac_header;
+
/*
* bonding note: skbs received on inactive slaves should only
* be delivered to pkt handlers that are exact matches. Also
@@ -2997,14 +3022,10 @@ static int __netif_receive_skb(struct sk_buff *skb)
if (skb_bond_should_drop(skb, master)) {
skb->deliver_no_wcard = 1;
null_or_orig = orig_dev; /* deliver only exact match */
- } else
- skb->dev = master;
+ }
}
__this_cpu_inc(softnet_data.processed);
- skb_reset_network_header(skb);
- skb_reset_transport_header(skb);
- skb->mac_len = skb->network_header - skb->mac_header;
pt_prev = NULL;
--
1.6.0.2
* Re: [RFC] ipv6: don't flush routes when setting loopback down
From: Eric W. Biederman @ 2010-12-17 1:18 UTC (permalink / raw)
To: Stephen Hemminger
Cc: David Miller, brian.haley, netdev, maheshkelkar, lorenzo,
yoshfuji, stable
In-Reply-To: <20101216132812.2d7fd885@nehalam>
Stephen Hemminger <shemminger@vyatta.com> writes:
> When loopback device is being brought down, then keep the route table
> entries because they are special. The entries in the local table for
> linklocal routes and ::1 address should not be purged.
>
> This is a sub optimal solution to the problem and should be replaced
> by a better fix in future.
>
> Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Stephen thanks for this. This patch looks good to me. I just tested
this against 2.6.37-rc6 and my simple tests show it to be working
without problems.
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
> ---
> patch versus current net-next tree, but if this is acceptable
> it should be applied to net-2.6 as well.
>
> --- a/net/ipv6/addrconf.c 2010-12-16 10:29:34.035408392 -0800
> +++ b/net/ipv6/addrconf.c 2010-12-16 10:30:37.366834482 -0800
> @@ -2669,7 +2669,9 @@ static int addrconf_ifdown(struct net_de
>
> ASSERT_RTNL();
>
> - rt6_ifdown(net, dev);
> + /* Flush routes if device is being removed or it is not loopback */
> + if (how || !(dev->flags & IFF_LOOPBACK))
> + rt6_ifdown(net, dev);
> neigh_ifdown(&nd_tbl, dev);
>
> idev = __in6_dev_get(dev);
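The condition added in the quoted hunk can be checked in isolation. What follows is only a minimal userspace sketch, assuming `how` is nonzero when the device is being removed (as the patch comment says). `SKETCH_IFF_LOOPBACK` is a local stand-in for the kernel flag and `should_flush_routes()` is a hypothetical name, not a kernel function.

```c
#include <assert.h>
#include <stdbool.h>

/* Local stand-in for the kernel's IFF_LOOPBACK flag bit (assumption). */
#define SKETCH_IFF_LOOPBACK 0x8u

/*
 * Mirrors the test added to addrconf_ifdown(): flush routes when the
 * device is being removed (how != 0) or when it is not loopback.
 */
static bool should_flush_routes(int how, unsigned int dev_flags)
{
	return how || !(dev_flags & SKETCH_IFF_LOOPBACK);
}
```

Only the combination "loopback going down but not being removed" keeps its routes; every other case still reaches rt6_ifdown().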
^ permalink raw reply
* Re: [PATCH] net: increase skb->users instead of skb_clone()
From: Junchang Wang @ 2010-12-17 0:24 UTC (permalink / raw)
To: Eric Dumazet
Cc: Changli Gao, David S. Miller, Tom Herbert, Jiri Pirko, Fenghua Yu,
Xinan Tang, netdev
In-Reply-To: <1292510518.2883.207.camel@edumazet-laptop>
On Thu, Dec 16, 2010 at 10:41 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> But, if we have N receivers, we get only the last one win - the first N-1 will
>> call deliver_skb().
>>
>
> Yes, but you want to, because each receiver has to make a private copy
> of the skb.
>
> The big win is that if the packet is filtered out (not accepted by the
> socket filter), you end with no extra skb_clone() at all.
>
> Say you have 8 receivers, with a filter matching some hash/cpu, and only
> one af_packet socket will take the message.
>
> Before patch : 8 skb_clones()
>
> After patch : one skb_clone()
>
Now I understand. :)
The patch is fine. Thanks Changli and Eric.
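For the record, the clone counts quoted above can be reproduced with a small userspace model. This is a sketch only, not kernel code: `struct receiver`, `filter_accepts()` and the two counting helpers are hypothetical names, and the "hash modulo number of receivers" filter is an assumed stand-in for the hash/cpu socket filter Eric describes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical receiver: accepts a packet when its hash maps to `slot`. */
struct receiver {
	unsigned int slot;
};

static bool filter_accepts(const struct receiver *r, unsigned int hash,
			   size_t nr_receivers)
{
	return hash % nr_receivers == r->slot;
}

/* Before the patch: every receiver gets a clone, then runs its filter. */
static size_t clones_before(const struct receiver *rx, size_t n,
			    unsigned int hash)
{
	(void)rx;
	(void)hash;
	return n;
}

/* After the patch: the filter sees the shared skb; clone only on accept. */
static size_t clones_after(const struct receiver *rx, size_t n,
			   unsigned int hash)
{
	size_t clones = 0;

	for (size_t i = 0; i < n; i++)
		if (filter_accepts(&rx[i], hash, n))
			clones++;
	return clones;
}
```

With 8 receivers on slots 0..7, any hash gives 8 clones in the first model and 1 in the second, matching the numbers in the quoted mail.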
--
--Junchang
^ permalink raw reply
* Re: [PATCH] Staging: batman-adv: Remove batman-adv from staging
From: Sven Eckelmann @ 2010-12-16 23:39 UTC (permalink / raw)
To: Greg KH
Cc: netdev-u79uwXL29TY76Z2rM5mHXA,
b.a.t.m.a.n-ZwoEplunGu2X36UT3dwllkB+6BGkLq7r,
davem-fT/PcQaiUtIeIZ0/mPfg9Q
In-Reply-To: <20101216232301.GA19390-U8xfFu+wG4EAvxtiuMwx3w@public.gmane.org>
[-- Attachment #1: Type: Text/Plain, Size: 1093 bytes --]
Greg KH wrote:
> On Thu, Dec 16, 2010 at 11:28:17PM +0100, Sven Eckelmann wrote:
> > batman-adv is now moved to net/batman-adv/ and can be removed from
> > staging.
> >
> > Signed-off-by: Sven Eckelmann <sven-KaDOiPu9UxWEi8DpZVb4nw@public.gmane.org>
> > Cc: David S. Miller <davem-fT/PcQaiUtIeIZ0/mPfg9Q@public.gmane.org>
> > ---
> > This is actually for Greg because he has some patches in his tree for
> > 2.6.38.
>
> Wonderful, that worked great, and it's now applied.
>
> > I just wanted to thank Greg and David for her work. Thanks :)
>
> I used to have long enough hair that for one whole year my university
> thought I was female, sending me things in the mail that started out,
> "Dear Ms. Greg..."
>
> Unfortunately, those days for me are long gone, but it is nice to see
> that my feminine side still shines through :)
I am really sorry about that. Blame the bad weather for my inability to press
quite important letters at the correct moment. I never wanted to say that you
are female or that David and you are the same person. :)
thanks,
Sven
[-- Attachment #2: This is a digitally signed message part. --]
[-- Type: application/pgp-signature, Size: 836 bytes --]
^ permalink raw reply
* Re: [RFC PATCH 07/12] vlan: convert VLAN devices to use ndo_fix_features()
From: Ben Hutchings @ 2010-12-16 23:36 UTC (permalink / raw)
To: Michał Mirosław; +Cc: netdev
In-Reply-To: <34dbc2d3d83d82f506f0f073dbf00444885e4f81.1292451560.git.mirq-linux@rere.qmqm.pl>
On Wed, 2010-12-15 at 23:24 +0100, Michał Mirosław wrote:
> Note: get_flags was actually broken, because it should return the
> flags capped with vlan_features. This is now done implicitly by
> limiting netdev->hw_features.
>
> RX checksumming offload control is (and was) broken, as there was no way
> before to say whether it's done for tagged packets.
>
> Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
> ---
> net/8021q/vlan.c | 3 +-
> net/8021q/vlan_dev.c | 51 ++++++++++++++-----------------------------------
> 2 files changed, 16 insertions(+), 38 deletions(-)
>
> diff --git a/net/8021q/vlan.c b/net/8021q/vlan.c
> index 6e64f7c..583d47b 100644
> --- a/net/8021q/vlan.c
> +++ b/net/8021q/vlan.c
> @@ -329,8 +329,7 @@ static void vlan_transfer_features(struct net_device *dev,
> {
> unsigned long old_features = vlandev->features;
>
> - vlandev->features &= ~dev->vlan_features;
> - vlandev->features |= dev->features & dev->vlan_features;
> + netdev_update_features(vlandev);
> vlandev->gso_max_size = dev->gso_max_size;
>
> if (dev->features & NETIF_F_HW_VLAN_TX)
> diff --git a/net/8021q/vlan_dev.c b/net/8021q/vlan_dev.c
> index be73753..468c899 100644
> --- a/net/8021q/vlan_dev.c
> +++ b/net/8021q/vlan_dev.c
> @@ -691,8 +691,8 @@ static int vlan_dev_init(struct net_device *dev)
> (1<<__LINK_STATE_DORMANT))) |
> (1<<__LINK_STATE_PRESENT);
>
> - dev->features |= real_dev->features & real_dev->vlan_features;
> - dev->features |= NETIF_F_LLTX;
> + dev->hw_features = real_dev->vlan_features;
[...]
net_device::hw_features is supposed to represent features that can be
toggled, but the inclusion of a flag in net_device::vlan_features does
not mean the feature can be toggled.
If this is to be a straight conversion, that line should be:
dev->hw_features = real_dev->vlan_features & NETIF_F_TSO;
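As a rough illustration of the difference, here is a userspace sketch of that masking. The `SK_F_*` bit values are made up for the example (the real NETIF_F_* constants differ) and `vlan_hw_features()` is a hypothetical helper.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative bit values only; the real NETIF_F_* constants differ. */
#define SK_F_SG   (1u << 0)
#define SK_F_CSUM (1u << 1)
#define SK_F_TSO  (1u << 2)

/*
 * hw_features for a VLAN device under the "straight conversion" rule:
 * of everything the real device passes down via vlan_features, only
 * TSO is exposed as user-toggleable.
 */
static uint32_t vlan_hw_features(uint32_t real_vlan_features)
{
	return real_vlan_features & SK_F_TSO;
}
```

A real device advertising SG, checksum and TSO in vlan_features would then expose only TSO as toggleable on the VLAN device.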
Ben.
--
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.
^ permalink raw reply
* Re: [PATCH net-next-2.6] net: force a fresh timestamp for ingress packets
From: Eric Dumazet @ 2010-12-16 23:31 UTC (permalink / raw)
To: Jarek Poplawski; +Cc: David Miller, Changli Gao, netdev, Patrick McHardy
In-Reply-To: <20101216224237.GC2191@del.dom.local>
Le jeudi 16 décembre 2010 à 23:42 +0100, Jarek Poplawski a écrit :
> On Thu, Dec 16, 2010 at 11:26:03PM +0100, Eric Dumazet wrote:
> > Le jeudi 16 décembre 2010 à 23:08 +0100, Jarek Poplawski a écrit :
> >
> > > Hmm... Do you expect more people start debugging SFQ or I missed your
> > > point? ;-) Maybe such a change would be reasonable on a cloned skb?
> >
> > My point was to permit an admin to check his ingress shaping works or
> > not. In this respect, netem was a specialization of a general problem.
> >
> > We had a prior discussion on timestamping skb in RX path, when RPS came
> > in : Should we take timestamp before RPS or after. At that time we added
> > a sysctl. Not sure it was the right choice :-(
> >
> > I feel tcpdump (on TX side) should really display time of packet right
> > before being given to device, but this is probably a matter of taste.
>
> It might be reasonable unless it changes data for later users. That's
> why I mentioned cloning.
>
> >
> > Before commit 8caf153974f2 (net: sch_netem: Fix an inconsistency in
> > ingress netem timestamps.), this is what was done.
>
> Then why don't you try to let turn it off in netem, where it only
> matters?
>
Because, if you read again my patch, you'll see :
#ifdef CONFIG_NET_CLS_ACT
if (!skb->tstamp.tv64 || (G_TC_FROM(skb->tc_verd) & AT_INGRESS))
net_timestamp_set(skb);
#else
So :
If we are handling an INGRESS packet, we force a timestamp renew.
Therefore :
We dont need in netem_dequeue() to force :
-#ifdef CONFIG_NET_CLS_ACT
- /*
- * If it's at ingress let's pretend the delay is
- * from the network (tstamp will be updated).
- */
- if (G_TC_FROM(skb->tc_verd) & AT_INGRESS)
- skb->tstamp.tv64 = 0;
-#endif
Since :
We are going to renew timestamp anyway.
Conclusion :
I am eliminating dead code.
Is that OK ?
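The argument can also be checked mechanically with a small userspace model. This is a sketch under assumptions: `rx_will_restamp()` models the patched receive-path test and `netem_old_dequeue()` models the block being removed; both names are hypothetical and timestamps are reduced to a plain integer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Models the patched receive-path test: stamp when the skb has no
 * timestamp yet, or unconditionally when it came from ingress.
 */
static bool rx_will_restamp(int64_t tstamp, bool at_ingress)
{
	return tstamp == 0 || at_ingress;
}

/* Models the netem_dequeue() block being removed: zero tstamp at ingress. */
static int64_t netem_old_dequeue(int64_t tstamp, bool at_ingress)
{
	return at_ingress ? 0 : tstamp;
}
```

For every input, the restamp decision is the same with or without the netem zeroing step, which is why the block can be dropped.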
Thanks
^ permalink raw reply
* Re: [RFC PATCH 04/12] net: introduce NETIF_F_RXCSUM
From: Ben Hutchings @ 2010-12-16 23:27 UTC (permalink / raw)
To: Michał Mirosław; +Cc: netdev
In-Reply-To: <fcee74e8b71430c0e0c267ac180f5861c471dc41.1292451560.git.mirq-linux@rere.qmqm.pl>
On Wed, 2010-12-15 at 23:24 +0100, Michał Mirosław wrote:
> Introduce NETIF_F_RXCSUM to replace device-private flags for RX checksum
> offload. Integrate it with ndo_fix_features.
>
> ethtool_op_get_rx_csum() is removed altogether as nothing in-tree uses it.
Not explicitly, but what about drivers that turn TX and RX checksumming
on and off together? They are implicitly covered by the case which
you're removing:
[...]
> - case ETHTOOL_GRXCSUM:
> - rc = ethtool_get_value(dev, useraddr, ethcmd,
> - (dev->ethtool_ops->get_rx_csum ?
> - dev->ethtool_ops->get_rx_csum :
> - ethtool_op_get_rx_csum));
[...]
Ben.
--
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.
^ permalink raw reply
* Re: [RFC PATCH 03/12] net: ethtool: use ndo_fix_features for offload setting
From: Ben Hutchings @ 2010-12-16 23:23 UTC (permalink / raw)
To: Michał Mirosław; +Cc: netdev
In-Reply-To: <b4f8764846cebb2ebaad8c5c3fb86457e11cc8a4.1292451559.git.mirq-linux@rere.qmqm.pl>
On Wed, 2010-12-15 at 23:24 +0100, Michał Mirosław wrote:
> Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
> ---
> include/linux/netdevice.h | 2 +
> net/core/dev.c | 8 ++
> net/core/ethtool.c | 252 +++++++++++++++++++-------------------------
> 3 files changed, 119 insertions(+), 143 deletions(-)
>
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index 4b20944..7634cab 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -941,6 +941,8 @@ struct net_device {
> #define NETIF_F_V6_CSUM (NETIF_F_GEN_CSUM | NETIF_F_IPV6_CSUM)
> #define NETIF_F_ALL_CSUM (NETIF_F_V4_CSUM | NETIF_F_V6_CSUM)
>
> +#define NETIF_F_ALL_TSO (NETIF_F_TSO | NETIF_F_TSO6 | NETIF_F_TSO_ECN)
> +
> /*
> * If one device supports one of these features, then enable them
> * for all in netdev_increment_features.
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 1e616bb..95d0a49 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -5054,6 +5054,14 @@ unsigned long netdev_fix_features(unsigned long features, const char *name)
> features &= ~NETIF_F_TSO;
> }
>
> + /* Software GSO depends on SG. */
> + if ((features & NETIF_F_GSO) && !(features & NETIF_F_SG)) {
> + if (name)
> + printk(KERN_NOTICE "%s: Dropping NETIF_F_GSO since no "
> + "SG feature.\n", name);
> + features &= ~NETIF_F_GSO;
> + }
> +
> if (features & NETIF_F_UFO) {
> /* maybe split UFO into V4 and V6? */
> if (!((features & NETIF_F_GEN_CSUM) ||
The severity of these messages will need to be reduced if ethtool relies
on this function to propagate feature changes. (And I wonder why some
of them are ERR and some NOTICE.)
> diff --git a/net/core/ethtool.c b/net/core/ethtool.c
> index f08e7f1..b068738 100644
> --- a/net/core/ethtool.c
> +++ b/net/core/ethtool.c
> @@ -55,6 +55,7 @@ int ethtool_op_set_tx_csum(struct net_device *dev, u32 data)
>
> return 0;
> }
> +EXPORT_SYMBOL(ethtool_op_set_tx_csum);
>
> int ethtool_op_set_tx_hw_csum(struct net_device *dev, u32 data)
> {
> @@ -220,6 +221,85 @@ static int ethtool_set_features(struct net_device *dev, void __user *useraddr)
> return 0;
> }
>
> +static unsigned long ethtool_get_feature_mask(u32 eth_cmd)
> +{
> + switch (eth_cmd) {
> + case ETHTOOL_GTXCSUM:
> + case ETHTOOL_STXCSUM:
> + return NETIF_F_ALL_CSUM|NETIF_F_SCTP_CSUM;
I wonder whether this should cover NETIF_F_FCOE_CRC as well (ixgbe
currently doesn't seem to allow toggling it).
There should be spaces around the '|' operator.
> +static int __ethtool_set_tx_csum(struct net_device *dev, u32 data);
> +static int __ethtool_set_sg(struct net_device *dev, u32 data);
> +static int __ethtool_set_tso(struct net_device *dev, u32 data);
> +static int __ethtool_set_ufo(struct net_device *dev, u32 data);
> +
> +static int ethtool_set_one_feature(struct net_device *dev,
> + void __user *useraddr, u32 ethcmd)
> +{
> + struct ethtool_value edata;
> + unsigned long mask;
> +
> + if (copy_from_user(&edata, useraddr, sizeof(edata)))
> + return -EFAULT;
> +
> + mask = ethtool_get_feature_mask(ethcmd);
> + mask &= dev->hw_features | NETIF_F_SOFT_FEATURES;
> + if (mask) {
> + if (edata.data)
> + dev->wanted_features |= mask;
> + else
> + dev->wanted_features &= ~mask;
> +
> + netdev_update_features(dev);
> + return 0;
> + }
> +
> + switch (ethcmd) {
> + case ETHTOOL_STXCSUM:
> + return __ethtool_set_tx_csum(dev, edata.data);
> + case ETHTOOL_SSG:
> + return __ethtool_set_sg(dev, edata.data);
> + case ETHTOOL_STSO:
> + return __ethtool_set_tso(dev, edata.data);
> + case ETHTOOL_SUFO:
> + return __ethtool_set_ufo(dev, edata.data);
> + default:
> + return -EOPNOTSUPP;
> + }
> +}
[...]
This deserves some comments.
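Something along these lines, perhaps. The following is a commented userspace model of the control flow as I read it from the quoted hunk, not a replacement for the patch: `struct sk_dev`, the `SK_F_*` bits and `sk_set_one_feature()` are hypothetical names, and `SK_F_GSO` merely stands in for NETIF_F_SOFT_FEATURES.

```c
#include <assert.h>
#include <stdint.h>

#define SK_F_SG  (1u << 0)
#define SK_F_TSO (1u << 1)
#define SK_F_GSO (1u << 2)	/* stands in for NETIF_F_SOFT_FEATURES */

struct sk_dev {
	uint32_t hw_features;		/* driver-declared toggleable bits */
	uint32_t wanted_features;	/* what the user asked for */
};

/*
 * Returns 1 if the request was handled through the new path, 0 if the
 * caller must fall back to the legacy per-command setter.
 */
static int sk_set_one_feature(struct sk_dev *dev, uint32_t cmd_mask, int on)
{
	/* Keep only bits the new framework can toggle for this device. */
	uint32_t mask = cmd_mask & (dev->hw_features | SK_F_GSO);

	if (!mask)
		return 0;	/* driver not converted: legacy path */

	if (on)
		dev->wanted_features |= mask;
	else
		dev->wanted_features &= ~mask;
	return 1;
}
```

The point worth documenting is the fallback: an empty mask does not mean "nothing to do", it means the driver has not been converted and the old ethtool_ops setter is used instead.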
Ben.
--
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.
^ permalink raw reply
* Re: [PATCH] Staging: batman-adv: Remove batman-adv from staging
From: Greg KH @ 2010-12-16 23:23 UTC (permalink / raw)
To: Sven Eckelmann
Cc: netdev-u79uwXL29TY76Z2rM5mHXA,
b.a.t.m.a.n-ZwoEplunGu2X36UT3dwllkB+6BGkLq7r,
davem-fT/PcQaiUtIeIZ0/mPfg9Q
In-Reply-To: <1292538497-13604-1-git-send-email-sven-KaDOiPu9UxWEi8DpZVb4nw@public.gmane.org>
On Thu, Dec 16, 2010 at 11:28:17PM +0100, Sven Eckelmann wrote:
> batman-adv is now moved to net/batman-adv/ and can be removed from
> staging.
>
> Signed-off-by: Sven Eckelmann <sven-KaDOiPu9UxWEi8DpZVb4nw@public.gmane.org>
> Cc: David S. Miller <davem-fT/PcQaiUtIeIZ0/mPfg9Q@public.gmane.org>
> ---
> This is actually for Greg because he has some patches in his tree for 2.6.38.
Wonderful, that worked great, and it's now applied.
> I just wanted to thank Greg and David for her work. Thanks :)
I used to have long enough hair that for one whole year my university
thought I was female, sending me things in the mail that started out,
"Dear Ms. Greg..."
Unfortunately, those days for me are long gone, but it is nice to see
that my feminine side still shines through :)
greg k-h
^ permalink raw reply
* Re: [PATCH V7 4/8] posix clocks: hook dynamic clocks into system calls
From: Thomas Gleixner @ 2010-12-16 23:20 UTC (permalink / raw)
To: Richard Cochran
Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA,
linux-api-u79uwXL29TY76Z2rM5mHXA, netdev-u79uwXL29TY76Z2rM5mHXA,
Alan Cox, Arnd Bergmann, Christoph Lameter, David Miller,
John Stultz, Krzysztof Halasa, Peter Zijlstra, Rodolfo Giometti
In-Reply-To: <6241238a1df55033e50b151ec9d35ba957e43d53.1292512461.git.richard.cochran-3mrvs1K0uXizZXS1Dc/lvw@public.gmane.org>
On Thu, 16 Dec 2010, Richard Cochran wrote:
Can you please split this into infrastructure and conversion of
posix-timer.c ?
> diff --git a/include/linux/posix-clock.h b/include/linux/posix-clock.h
> index 1ce7fb7..7ea94f5 100644
> --- a/include/linux/posix-clock.h
> +++ b/include/linux/posix-clock.h
> @@ -108,4 +108,29 @@ struct posix_clock *posix_clock_create(struct posix_clock_operations *cops,
> */
> void posix_clock_destroy(struct posix_clock *clk);
>
> +/**
> + * clockid_to_posix_clock() - obtain clock device pointer from a clock id
> + * @id: user space clock id
> + *
> + * Returns a pointer to the clock device, or NULL if the id is not a
> + * dynamic clock id.
> + */
> +struct posix_clock *clockid_to_posix_clock(clockid_t id);
> +
> +/*
> + * The following functions are only to be called from posix-timers.c
> + */
Then they should be in a header file which is not consumable by the
general public. e.g. kernel/time/posix-clock.h
> +int pc_clock_adjtime(struct posix_clock *clk, struct timex *tx);
> +int pc_clock_gettime(struct posix_clock *clk, struct timespec *ts);
> +int pc_clock_getres(struct posix_clock *clk, struct timespec *ts);
> +int pc_clock_settime(struct posix_clock *clk, const struct timespec *ts);
> +
> +int pc_timer_create(struct posix_clock *clk, struct k_itimer *kit);
> +int pc_timer_delete(struct posix_clock *clk, struct k_itimer *kit);
> +
> +void pc_timer_gettime(struct posix_clock *clk, struct k_itimer *kit,
> + struct itimerspec *ts);
> +int pc_timer_settime(struct posix_clock *clk, struct k_itimer *kit,
> + int flags, struct itimerspec *ts, struct itimerspec *old);
> +
> #endif
> +
> +static inline bool clock_is_static(const clockid_t id)
> +{
> + if (0 == (id & ~CLOCKFD_MASK))
> + return true;
> + if (CLOCKFD == (id & CLOCKFD_MASK))
> + return false;
Please use the usual kernel notation.
> + return true;
> +}
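For illustration, the same function with the constant on the right-hand side of each comparison. This is a userspace sketch: the values 3 and 7 for the `SK_CLOCKFD*` macros are assumptions made for the example, and `sk_clockid_t` is a local typedef, not the kernel type.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed values for illustration; the patch defines its own. */
#define SK_CLOCKFD	3
#define SK_CLOCKFD_MASK	7

typedef int sk_clockid_t;

static inline bool sk_clock_is_static(sk_clockid_t id)
{
	/* Small non-negative ids are the classic static clocks. */
	if ((id & ~SK_CLOCKFD_MASK) == 0)
		return true;
	/* Ids tagged with the CLOCKFD pattern are fd-based (dynamic). */
	if ((id & SK_CLOCKFD_MASK) == SK_CLOCKFD)
		return false;
	return true;
}
```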
> +
> /* POSIX.1b interval timer structure. */
> struct k_itimer {
> struct list_head list; /* free/ allocate list */
> diff --git a/include/linux/time.h b/include/linux/time.h
> index 9f15ac7..914c48d 100644
> --- a/include/linux/time.h
> +++ b/include/linux/time.h
> @@ -299,6 +299,8 @@ struct itimerval {
> #define CLOCKS_MASK (CLOCK_REALTIME | CLOCK_MONOTONIC)
> #define CLOCKS_MONO CLOCK_MONOTONIC
>
> +#define CLOCK_INVALID -1
> +
> @@ -712,6 +720,7 @@ common_timer_get(struct k_itimer *timr, struct itimerspec *cur_setting)
> SYSCALL_DEFINE2(timer_gettime, timer_t, timer_id,
> struct itimerspec __user *, setting)
> {
> + struct posix_clock *clk_dev;
> struct k_itimer *timr;
> struct itimerspec cur_setting;
> unsigned long flags;
> @@ -720,7 +729,15 @@ SYSCALL_DEFINE2(timer_gettime, timer_t, timer_id,
> if (!timr)
> return -EINVAL;
>
> - CLOCK_DISPATCH(timr->it_clock, timer_get, (timr, &cur_setting));
> + if (likely(clock_is_static(timr->it_clock))) {
> +
> + CLOCK_DISPATCH(timr->it_clock, timer_get, (timr, &cur_setting));
> +
> + } else {
> + clk_dev = clockid_to_posix_clock(timr->it_clock);
> + if (clk_dev)
> + pc_timer_gettime(clk_dev, timr, &cur_setting);
Why this extra step ? Why can't you call
pc_timer_gettime(timr, &cur_setting);
You already established that the timer is fd based, so let the
pc_timer_* functions deal with it.
> + }
>
> unlock_timer(timr, flags);
>
> @@ -811,6 +828,7 @@ SYSCALL_DEFINE4(timer_settime, timer_t, timer_id, int, flags,
> const struct itimerspec __user *, new_setting,
> struct itimerspec __user *, old_setting)
> {
> + struct posix_clock *clk_dev;
> struct k_itimer *timr;
> struct itimerspec new_spec, old_spec;
> int error = 0;
> @@ -831,8 +849,19 @@ retry:
> if (!timr)
> return -EINVAL;
>
> - error = CLOCK_DISPATCH(timr->it_clock, timer_set,
> - (timr, flags, &new_spec, rtn));
> + if (likely(clock_is_static(timr->it_clock))) {
> +
> + error = CLOCK_DISPATCH(timr->it_clock, timer_set,
> + (timr, flags, &new_spec, rtn));
> +
> + } else {
> + clk_dev = clockid_to_posix_clock(timr->it_clock);
> + if (clk_dev)
> + error = pc_timer_settime(clk_dev, timr,
> + flags, &new_spec, rtn);
> + else
> + error = -EINVAL;
Ditto. pc_timer_settime() can return -EINVAL when there is no valid fd.
> @@ -957,26 +993,51 @@ EXPORT_SYMBOL_GPL(do_posix_clock_nonanosleep);
> SYSCALL_DEFINE2(clock_settime, const clockid_t, which_clock,
> const struct timespec __user *, tp)
> {
> + struct posix_clock *clk_dev = NULL;
> struct timespec new_tp;
>
> - if (invalid_clockid(which_clock))
> - return -EINVAL;
> + if (likely(clock_is_static(which_clock))) {
> +
> + if (invalid_clockid(which_clock))
> + return -EINVAL;
> + } else {
> + clk_dev = clockid_to_posix_clock(which_clock);
> + if (!clk_dev)
> + return -EINVAL;
> + }
It's not a problem to check the validity of that clock fd after the
copy_from_user. That's an error case and we don't care about whether
we return EINVAL here or later. And with your current code this can
happen anyway as you don't hold a reference to the fd. And we do the
same thing for posix_cpu_timers as well. See invalid_clockid()
> if (copy_from_user(&new_tp, tp, sizeof (*tp)))
> return -EFAULT;
>
> + if (unlikely(clk_dev))
> + return pc_clock_settime(clk_dev, &new_tp);
> +
> return CLOCK_DISPATCH(which_clock, clock_set, (which_clock, &new_tp));
I really start to wonder whether we should cleanup that whole
CLOCK_DISPATCH macro magic and provide real inline functions for
each of the clock functions instead.
static inline int dispatch_clock_settime(const clockid_t id, struct timespec *tp)
{
if (id >= 0) {
return posix_clocks[id].clock_set ?
posix_clocks[id].clock_set(id, tp) : -EINVAL;
}
if (posix_cpu_clock(id))
return -EINVAL;
return pc_timers_clock_set(id, tp);
}
That is a bit of work, but the code will be simpler and we just do the
normal thing of function pointer structures. Stuff which is not
implemented does not magically become called via some common
function. There is no point in doing that. We just have to fill in the
various k_clock structs with the correct pointers in the first place
and let the NULL case return a sensible error value. The data
structure does not become larger that way. It's a little bit more init
code, but that's fine if we make the code better in general. In that
case it's not even more init code, it's just filling the data
structures which we register.
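To make the idea concrete, here is a self-contained userspace model of a function-pointer table where a NULL entry yields an error. All `sk_` names are hypothetical, the table has only four slots, and CPU and dynamic clocks are deliberately left out of this sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define SK_MAX_CLOCKS 4

struct sk_timespec { long tv_sec; long tv_nsec; };

struct sk_k_clock {
	int (*clock_set)(int id, const struct sk_timespec *tp);
};

static struct sk_timespec sk_realtime;

static int sk_realtime_set(int id, const struct sk_timespec *tp)
{
	(void)id;
	sk_realtime = *tp;
	return 0;
}

/* Slot 0 implements clock_set; the other slots leave it NULL. */
static struct sk_k_clock sk_posix_clocks[SK_MAX_CLOCKS] = {
	[0] = { .clock_set = sk_realtime_set },
};

static inline int sk_dispatch_clock_settime(int id,
					    const struct sk_timespec *tp)
{
	if (id >= 0 && id < SK_MAX_CLOCKS)
		return sk_posix_clocks[id].clock_set ?
			sk_posix_clocks[id].clock_set(id, tp) : -EINVAL;
	return -EINVAL;	/* CPU and dynamic clocks are out of scope here */
}
```

No common fallback function is involved: an unimplemented operation is simply a NULL pointer, and the dispatcher turns that into an error value.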
That needs to be done in two steps:
1) cleanup CLOCK_DISPATCH
2) add the pc_timer_* extras
That will make the thing way more palatable than working around the
restrictions of CLOCK_DISPATCH and adding the hell to each syscall.
The second step would be a patch consisting of exactly nine lines:
{
if (id >= 0)
return .....
if (posix_cpu_clock_id(id))
return ....
- return -EINVAL;
+ return pc_timer_xxx(...);
}
Please don't tell me now that we could even hack this into
CLOCK_DISPATCH. *shudder*
> +
> +struct posix_clock *clockid_to_posix_clock(const clockid_t id)
> +{
> + struct posix_clock *clk = NULL;
> + struct file *fp;
> +
> + if (clock_is_static(id))
> + return NULL;
> +
> + fp = fget(CLOCKID_TO_FD(id));
> + if (!fp)
> + return NULL;
> +
> + if (fp->f_op->open == posix_clock_open)
> + clk = fp->private_data;
> +
> + fput(fp);
> + return clk;
> +}
> +
> +int pc_clock_adjtime(struct posix_clock *clk, struct timex *tx)
> +{
> + int err;
> +
> + mutex_lock(&clk->mux);
> +
> + if (clk->zombie)
Uuurgh. That explains the zombie thing. That's really horrible and we
can be more clever. When we leave the file lookup to the pc_timer_*
functions, then we can simply do:
struct posix_clock_descr {
struct file *fp;
struct posix_clock *clk;
};
static int get_posix_clock(const clockid_t id, struct posix_clock_descr *cd)
{
struct file *fp = fget(CLOCKID_TO_FD(id));
if (!fp || fp->f_op->open != posix_clock_open || !fp->private_data)
return -ENODEV;
cd->fp = fp;
cd->clk = fp->private_data;
return 0;
}
static void put_posix_clock(struct posix_clock_descr *cd)
{
fput(cd->fp);
}
int pc_timer_****(....)
{
struct posix_clock_descr cd;
ret = -EOPNOTSUPP;
if (get_posix_clock(id, &cd))
return -ENODEV;
if (cd.clk->ops->whatever)
ret = cd.clk->ops->whatever(....);
put_posix_clock(&cd);
return ret;
}
That gets rid of your lifetime problems, of clk->mutex, clock->zombie
and makes the code simpler and more robust. And it avoids the whole
mess in posix-timers.c as well.
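A userspace model of that get/put pattern, for illustration only. `struct sk_file` with a bare counter stands in for struct file, the increment and decrement play the roles of fget() and fput(), and every `sk_` name (including the sample `sk_fixed_gettime()` operation) is hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct sk_clock_ops { int (*gettime)(long *out); };
struct sk_clock { const struct sk_clock_ops *ops; };

/* Stand-in for struct file: a reference count plus private data. */
struct sk_file {
	int refcount;
	struct sk_clock *private_data;
};

struct sk_clock_descr {
	struct sk_file *fp;
	struct sk_clock *clk;
};

static int sk_get_clock(struct sk_file *fp, struct sk_clock_descr *cd)
{
	if (!fp || !fp->private_data)
		return -ENODEV;
	fp->refcount++;		/* plays the role of fget() */
	cd->fp = fp;
	cd->clk = fp->private_data;
	return 0;
}

static void sk_put_clock(struct sk_clock_descr *cd)
{
	cd->fp->refcount--;	/* plays the role of fput() */
}

/* Sample clock operation used to exercise the sketch. */
static int sk_fixed_gettime(long *out)
{
	*out = 42;
	return 0;
}

static int sk_pc_clock_gettime(struct sk_file *fp, long *out)
{
	struct sk_clock_descr cd;
	int ret = -EOPNOTSUPP;

	if (sk_get_clock(fp, &cd))
		return -ENODEV;
	if (cd.clk->ops->gettime)
		ret = cd.clk->ops->gettime(out);
	sk_put_clock(&cd);
	return ret;
}
```

The reference is held for exactly the duration of the operation, so the clock cannot disappear underneath the call. One difference from real code: fget() takes its reference before the checks can run, so a failed check after a successful fget() would also need an fput().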
> + err = -ENODEV;
> + else if (!clk->ops->clock_adjtime)
> + err = -EOPNOTSUPP;
> + else
> + err = clk->ops->clock_adjtime(clk->priv, tx);
> +
> + mutex_unlock(&clk->mux);
> + return err;
> +}
Though all this feels still a bit backwards.
The whole point of this exercise is to provide dynamic clock ids and
make them available via the standard posix timer interface and aside
of that have standard file operations for these clocks to provide
special purpose ioctls, mmap and whatever. And to make it a bit more
complex these must be removable modules.
I need to read back the previous discussions, but maybe someone can
fill me in why we can't make these dynamic things not just live in
posix_clocks[MAX_CLOCKS ... MAX_DYNAMIC_CLOCKS] (there is no
requirement to have an unlimited number of those) and just get a
mapping from the fd to the clock_id? That creates a different set of
life time problems, but no real horrible ones.
Thanks,
tglx
^ permalink raw reply
* Re: [RFC] ipv6: don't flush routes when setting loopback down
From: Eric W. Biederman @ 2010-12-16 23:17 UTC (permalink / raw)
To: Stephen Hemminger
Cc: David Miller, brian.haley, netdev, maheshkelkar, lorenzo,
yoshfuji, stable
In-Reply-To: <20101216132812.2d7fd885@nehalam>
Stephen Hemminger <shemminger@vyatta.com> writes:
> When loopback device is being brought down, then keep the route table
> entries because they are special. The entries in the local table for
> linklocal routes and ::1 address should not be purged.
>
> This is a sub optimal solution to the problem and should be replaced
> by a better fix in future.
I will test this and let you know how it goes from my side.
I would really like to get a fix for this in before -rc7,
as this bug makes the kernel unusable for me.
Eric
^ permalink raw reply
* Re: [RFC PATCH 02/12] net: Introduce new feature setting ops
From: Ben Hutchings @ 2010-12-16 23:13 UTC (permalink / raw)
To: Michał Mirosław; +Cc: netdev
In-Reply-To: <822f5776f99cab9889cd72d658d5fe50c56bb247.1292451559.git.mirq-linux@rere.qmqm.pl>
On Wed, 2010-12-15 at 23:24 +0100, Michał Mirosław wrote:
> This introduces a new framework to handle device features setting.
> It consists of:
> - new fields in struct net_device:
> + hw_features - features that hw/driver supports toggling
> + wanted_features - features that user wants enabled, when possible
> - new netdev_ops:
> + feat = ndo_fix_features(dev, feat) - API checking constraints for
> enabling features or their combinations
> + ndo_set_features(dev) - API updating hardware state to match
> changed dev->features
> - new ethtool commands:
> + ETHTOOL_GFEATURES/ETHTOOL_SFEATURES: get/set dev->wanted_features
> and trigger device reconfiguration if resulting dev->features
> changed
> [TODO: this might be extended to support device-specific flags, and
> keep NETIF_F flags from becoming part of ABI by using GET_STRINGS
> for describing the bits]
We already have ETHTOOL_{G,S}PFLAGS for that, though.
> [Note: ETHTOOL_GFEATURES and ETHTOOL_SFEATURES' data is supposed to
> be 'compatible', so that you can R/M/W without additional copying]
But if you expect userland to do that, what is the point of the 'valid'
mask? Shouldn't userland do something like:
struct ethtool_features feat = { ETHTOOL_SFEATURES };
...
if (off_tso_wanted >= 0)
feat.features[0].valid |= NETIF_F_TSO;
if (off_tso_wanted > 0)
feat.features[0].requested |= NETIF_F_TSO;
...
[...]
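Spelled out as a compilable userspace sketch: the userland half follows the fragment above, while the kernel half (`sk_apply()`) is my assumption of how a 'valid' mask would be honoured, not something taken from the patch. All `sk_`/`SK_` names and bit values are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define SK_F_TSO (1u << 0)
#define SK_F_SG  (1u << 1)

struct sk_set_block {
	uint32_t valid;		/* bits the caller wants to change */
	uint32_t requested;	/* desired value for those bits */
};

/* Userland side: tso_wanted < 0 leaves TSO alone, 0 clears, > 0 sets. */
static struct sk_set_block sk_build_request(int tso_wanted)
{
	struct sk_set_block b = { 0, 0 };

	if (tso_wanted >= 0)
		b.valid |= SK_F_TSO;
	if (tso_wanted > 0)
		b.requested |= SK_F_TSO;
	return b;
}

/* Assumed kernel side: only bits marked valid may change the wanted set. */
static uint32_t sk_apply(uint32_t wanted, struct sk_set_block b)
{
	return (wanted & ~b.valid) | (b.requested & b.valid);
}
```

With this shape userland never needs to read the current state first, which is the whole point of having a 'valid' mask.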
> --- a/include/linux/ethtool.h
> +++ b/include/linux/ethtool.h
> @@ -523,6 +523,31 @@ struct ethtool_flash {
> char data[ETHTOOL_FLASH_MAX_FILENAME];
> };
>
> +/* for returning feature sets */
> +#define ETHTOOL_DEV_FEATURE_WORDS 1
> +
> +struct ethtool_get_features_block {
> + __u32 available; /* features togglable */
> + __u32 requested; /* features requested to be enabled */
> + __u32 active; /* features currently enabled */
> + __u32 __pad[1];
> +};
> +
> +struct ethtool_set_features_block {
> + __u32 valid; /* bits valid in .requested */
> + __u32 requested; /* features requested */
> + __u32 __pad[2];
> +};
> +
> +struct ethtool_features {
> + __u32 cmd;
> + __u32 count; /* blocks */
> + union {
> + struct ethtool_get_features_block get;
> + struct ethtool_set_features_block set;
> + } features[0];
> +};
I want kernel-doc comments with a proper description of semantics.
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
[...]
> @@ -934,6 +949,14 @@ struct net_device {
> NETIF_F_SG | NETIF_F_HIGHDMA | \
> NETIF_F_FRAGLIST)
>
> + /* toggable features with no driver requirements */
> +#define NETIF_F_SOFT_FEATURES (NETIF_F_GSO | NETIF_F_GRO)
> +
> + /* ethtool-toggable features */
The verb is 'toggle' so this adjective should be either 'togglable' or
'toggleable'. Or you could choose a different adjective.
> + unsigned long hw_features;
> + /* ethtool-requested features */
> + unsigned long wanted_features;
> +
While you're at it, you could change all these 'features' fields and
parameters to u32, since we presumably won't be defining features that
can only be enabled on 64-bit architectures.
[...]
> diff --git a/net/core/ethtool.c b/net/core/ethtool.c
> index 1774178..f08e7f1 100644
> --- a/net/core/ethtool.c
> +++ b/net/core/ethtool.c
> @@ -171,6 +171,55 @@ EXPORT_SYMBOL(ethtool_ntuple_flush);
>
> /* Handlers for each ethtool command */
>
> +static int ethtool_get_features(struct net_device *dev, void __user *useraddr)
> +{
> + struct ethtool_features cmd = {
> + .cmd = ETHTOOL_GFEATURES,
> + .count = ETHTOOL_DEV_FEATURE_WORDS,
> + };
> + struct ethtool_get_features_block features[ETHTOOL_DEV_FEATURE_WORDS] = {
> + {
> + .available = dev->hw_features,
> + .requested = dev->wanted_features,
> + .active = dev->features,
> + },
> + };
> +
> + if (copy_to_user(useraddr, &cmd, sizeof(cmd)))
> + return -EFAULT;
> + useraddr += sizeof(cmd);
> + if (copy_to_user(useraddr, features, sizeof(features)))
> + return -EFAULT;
If ETHTOOL_DEV_FEATURE_WORDS increases, how do you know the user buffer
will be big enough?
> + return 0;
> +}
> +
> +static int ethtool_set_features(struct net_device *dev, void __user *useraddr)
> +{
> + struct ethtool_features cmd;
> + struct ethtool_set_features_block features[ETHTOOL_DEV_FEATURE_WORDS];
> +
> + if (copy_from_user(&cmd, useraddr, sizeof(cmd)))
> + return -EFAULT;
> + useraddr += sizeof(cmd);
> +
> + if (cmd.count > ETHTOOL_DEV_FEATURE_WORDS)
> + cmd.count = ETHTOOL_DEV_FEATURE_WORDS;
So additional feature words will be silently ignored...
> + if (copy_from_user(features, useraddr, sizeof(*features) * cmd.count))
> + return -EFAULT;
> + memset(features + cmd.count, 0,
> + sizeof(features) - sizeof(*features) * cmd.count);
> +
> + features[0].valid &= dev->hw_features | NETIF_F_SOFT_FEATURES;
[...]
...as will any other unsupported features. This is not a good idea.
(However, remembering which features are wanted does seem like a good
idea.)
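One possible stricter behaviour, as a userspace sketch only. The choice of -EINVAL is an assumption on my part, `togglable` stands for hw_features | NETIF_F_SOFT_FEATURES, and `struct sk_req` and `sk_check_set_request()` are hypothetical names.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define SK_FEATURE_WORDS 1

struct sk_req {
	uint32_t valid;
	uint32_t requested;
};

/*
 * Strict variant: reject a request the device cannot honour instead of
 * silently trimming it.
 */
static int sk_check_set_request(uint32_t count, const struct sk_req *blk,
				uint32_t togglable)
{
	if (count != SK_FEATURE_WORDS)
		return -EINVAL;		/* caller and kernel disagree on size */
	if (blk[0].valid & ~togglable)
		return -EINVAL;		/* asks to change a fixed feature */
	return 0;
}
```

The wanted set can still be remembered for togglable bits; the check only stops requests that could never take effect from being reported as successful.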
Ben.
--
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.
^ permalink raw reply
* Re: [PATCH v2 1/2] bonding: generic netlink infrastructure
From: Stephen Hemminger @ 2010-12-16 23:12 UTC (permalink / raw)
To: Jay Vosburgh; +Cc: netdev, Andy Gospodarek
In-Reply-To: <1292540003-9465-2-git-send-email-fubar@us.ibm.com>
On Thu, 16 Dec 2010 14:53:22 -0800
Jay Vosburgh <fubar@us.ibm.com> wrote:
> Generic netlink infrastructure for bonding. Includes two
> netlink operations: notification for slave link state change, and a
> "get mode" netlink command.
> ---
1. The printk() for error needs to either be dropped or turned into
a real loggable message (ie add more info and printk level).
2. Don't use C99 comments to comment out stuff
3. checkpatch has other whitespace complaints.
4. Missing Signed-off-by:
--
^ permalink raw reply
* Re: [PATCH net-next-2.6] ifb: fix a lockdep splat
From: David Miller @ 2010-12-16 22:55 UTC (permalink / raw)
To: xiaosuo; +Cc: eric.dumazet, netdev, hadi
In-Reply-To: <AANLkTinbeR68+BsHYC-0yMAkbZ20TS1EiSg=cGHKQ-f2@mail.gmail.com>
From: Changli Gao <xiaosuo@gmail.com>
Date: Thu, 16 Dec 2010 18:00:50 +0800
> On Thu, Dec 16, 2010 at 5:52 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> After recent ifb changes, we must use lockless __skb_dequeue() since
>> lock is not anymore initialized.
>>
>> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
>> Cc: Jamal Hadi Salim <hadi@cyberus.ca>
>> Cc: Changli Gao <xiaosuo@gmail.com>
>
> Acked-by: Changli Gao <xiaosuo@gmail.com>
Applied, thanks.
^ permalink raw reply