Netdev List
* caif BUG() with network namespaces
From: Woodhouse, David @ 2011-10-21 20:51 UTC (permalink / raw)
  To: Sjur Braendeland; +Cc: netdev@vger.kernel.org


When Chrome initialises its sandbox, we get a BUG:

[   63.674528] ------------[ cut here ]------------
[   63.674540] kernel BUG at net/caif/caif_dev.c:66!
[   63.674547] invalid opcode: 0000 [#1] PREEMPT SMP 
[   63.674556] Modules linked in: iwlagn serio_raw [last unloaded: battery]
[   63.674568] 
[   63.674575] Pid: 801, comm: chrome-sandbox Not tainted 3.0.0-4.1-adaptation-pc #1 Intel Corporation Cedartrail platform/To be filled by O.E.M.
[   63.674589] EIP: 0060:-[<c0baaf8c>] EFLAGS: 00210246 CPU: 1
[   63.674602] EIP is at caif_device_list+0x4c/0x50
[   63.674608] EAX: 00000000 EBX: 00000000 ECX: e1482800 EDX: 00000010
[   63.674614] ESI: e14133c0 EDI: 00000010 EBP: e141dda0 ESP: e141dd98
[   63.674620]  DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
[   63.674627] Process chrome-sandbox (pid: 801, ti=e141c000 task=e52a8ff0 task.ti=e141c000)
[   63.674632] Stack:
[   63.674636]  e1482800 00000000 e141ddd8 c0bab291 c04da339 e141ddd8 c0ad64ff 00000011
[   63.674655]  e14133c0 e141ddd8 c0b335cd e1482800 00000010 c0f612bc 00000000 fffffff3
[   63.674672]  e141de08 c046e6e7 e141dde8 c0bcda38 e141de20 e1482800 00000010 c0f58ac0
[   63.674690] Call Trace:
[   63.674700]  [<c0bab291>] caif_device_notify+0x21/0x2d0
[   63.674710]  [<c04da339>] ? pcpu_alloc_area+0x109/0x250
[   63.674720]  [<c0ad64ff>] ? inetdev_event+0x1f/0x3a0
[   63.674728]  [<c0b335cd>] ? packet_notifier+0x8d/0x180
[   63.674738]  [<c046e6e7>] notifier_call_chain+0x47/0x90
[   63.674747]  [<c0bcda38>] ? mutex_unlock+0x8/0x10
[   63.674756]  [<c046e77b>] raw_notifier_call_chain+0x1b/0x20
[   63.674766]  [<c0a7d938>] call_netdevice_notifiers+0x28/0x60
[   63.674773]  [<c04dac6a>] ? __alloc_percpu+0xa/0x10
[   63.674782]  [<c0a80c6d>] register_netdevice+0xed/0x210
[   63.674790]  [<c0bcdd0e>] ? mutex_lock+0x1e/0x30
[   63.674797]  [<c0a80da2>] register_netdev+0x12/0x20
[   63.674807]  [<c0868ec3>] loopback_net_init+0x43/0x90
[   63.674815]  [<c0a7864f>] ops_init+0x2f/0x80
[   63.674822]  [<c0a7877f>] setup_net+0x4f/0xe0
[   63.674830]  [<c0a78ccc>] copy_net_ns+0x6c/0xe0
[   63.674838]  [<c046de31>] create_new_namespaces+0xc1/0x150
[   63.674847]  [<c046df82>] copy_namespaces+0x72/0xb0
[   63.674856]  [<c0448efe>] copy_process+0x60e/0xc70
[   63.674864]  [<c04df3c1>] ? handle_mm_fault+0x141/0x250
[   63.674872]  [<c04495ef>] do_fork+0x5f/0x300

Is this already known/fixed?

https://bugs.meego.com/show_bug.cgi?id=23540

-- 
                   Sent with MeeGo's ActiveSync support.

David Woodhouse                            Open Source Technology Centre
David.Woodhouse@intel.com                              Intel Corporation





* Re: [patch net-next V2] net: introduce ethernet teaming device
From: Jay Vosburgh @ 2011-10-21 18:27 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: netdev, davem, eric.dumazet, bhutchings, shemminger, andy, tgraf,
	ebiederm, mirqus, kaber, greearb, jesse, fbl, benjamin.poirier,
	jzupka
In-Reply-To: <1319200747-2508-1-git-send-email-jpirko@redhat.com>

Jiri Pirko <jpirko@redhat.com> wrote:

>This patch introduces a new network device called team. It is meant to be
>a very fast, simple, userspace-driven alternative to the existing bonding
>driver.
>
>A userspace library called libteam, with a couple of demo apps, is available
>here:
>https://github.com/jpirko/libteam
>Note it's still in its diapers atm.
>
>team<->libteam use generic netlink for communication. That and rtnl
>are supposed to be the only ways to configure a team device, no sysfs etc.
>
>A Python binding basis for libteam was recently introduced (some work
>still needs to be done on it though). A daemon providing arpmon/miimon
>active-backup functionality will be introduced shortly.
>Everything necessary is already implemented in the kernel team driver.
>
>Signed-off-by: Jiri Pirko <jpirko@redhat.com>
>
>v1->v2:
>	- modes are made as modules. Makes team more modular and
>	  extendable.
>	- several commenters' nitpicks found on v1 were fixed
>	- several other bugs were fixed.
>	- note I ignored Eric's comment about roundrobin port selector
>	  as Eric's way may be easily implemented as another mode (mode
>	  "random") in future.
>---

[...]

>+static int team_port_add(struct team *team, struct net_device *port_dev)
>+{
>+	struct net_device *dev = team->dev;
>+	struct team_port *port;
>+	char *portname = port_dev->name;
>+	char tmp_addr[ETH_ALEN];
>+	int err;
>+
>+	if (port_dev->flags & IFF_LOOPBACK ||
>+	    port_dev->type != ARPHRD_ETHER) {
>+		netdev_err(dev, "Device %s is of an unsupported type\n",
>+			   portname);
>+		return -EINVAL;
>+	}
>+
>+	if (team_port_exists(port_dev)) {
>+		netdev_err(dev, "Device %s is already a port "
>+				"of a team device\n", portname);
>+		return -EBUSY;
>+	}
>+
>+	if (port_dev->flags & IFF_UP) {
>+		netdev_err(dev, "Device %s is up. Set it down before adding it as a team port\n",
>+			   portname);
>+		return -EBUSY;
>+	}
>+
>+	port = kzalloc(sizeof(struct team_port), GFP_KERNEL);
>+	if (!port)
>+		return -ENOMEM;
>+
>+	port->dev = port_dev;
>+	port->team = team;
>+
>+	port->orig.mtu = port_dev->mtu;
>+	err = dev_set_mtu(port_dev, dev->mtu);
>+	if (err) {
>+		netdev_dbg(dev, "Error %d calling dev_set_mtu\n", err);
>+		goto err_set_mtu;
>+	}
>+
>+	memcpy(port->orig.dev_addr, port_dev->dev_addr, ETH_ALEN);
>+	random_ether_addr(tmp_addr);
>+	err = __set_port_mac(port_dev, tmp_addr);
>+	if (err) {
>+		netdev_dbg(dev, "Device %s mac addr set failed\n",
>+			   portname);
>+		goto err_set_mac_rand;
>+	}
>+
>+	err = dev_open(port_dev);
>+	if (err) {
>+		netdev_dbg(dev, "Device %s opening failed\n",
>+			   portname);
>+		goto err_dev_open;
>+	}
>+
>+	err = team_port_set_orig_mac(port);
>+	if (err) {
>+		netdev_dbg(dev, "Device %s mac addr set failed - Device does not support addr change when it's opened\n",
>+			   portname);
>+		goto err_set_mac_opened;
>+	}

	This will exclude a number of devices that bonding currently
provides at least partial support for.

	Most of those are older 10 or 10/100 Ethernet drivers (anything
that uses eth_mac_addr for its ndo_set_mac_address, I think; there look
to be about 140 or so of those), but it also includes Infiniband (which
is excluded explicitly elsewhere).

	Another small set of Ethernet devices (those that currently need
bonding's fail_over_mac option) do permit setting the MAC while open,
but will misbehave if multiple ports are set to the same MAC.  The usual
suspects here are ehea and qeth, which are partition-aware devices for
IBM's pseries and zseries hardware, but there may be others I'm not
familiar with.

	If these will be permanent limitations of the team driver, then
this should (eventually) be in the documentation.

	Also, from looking at the code, it's not obvious if nesting of
teams is supported or not.  I'm not seeing anything in the code that
would prohibit adding a team device as a port to another team.  If
nesting of teams is undesirable, it should probably be explicitly tested
for and disallowed.

[...]

>+static int __init team_module_init(void)
>+{
>+	int err;
>+
>+	register_netdevice_notifier(&team_notifier_block);
>+
>+	err = rtnl_link_register(&team_link_ops);
>+	if (err)
>+		goto err_rtln_reg;
>+
>+	err = team_nl_init();
>+	if (err)
>+		goto err_nl_init;
>+
>+	return 0;
>+
>+err_nl_init:
>+	rtnl_link_unregister(&team_link_ops);
>+
>+err_rtln_reg:
>+	unregister_netdevice_notifier(&team_notifier_block);

	Minor nit: I suspect you meant "err_rtnl_reg" here, and in the
goto above.

	-J

---
	-Jay Vosburgh, IBM Linux Technology Center, fubar@us.ibm.com


* Re: [PATCH] route: fix ICMP redirect validation
From: Flavio Leitner @ 2011-10-21 18:13 UTC (permalink / raw)
  To: David Miller; +Cc: netdev
In-Reply-To: <20111020.161929.97626808571871075.davem@davemloft.net>

On Thu, 20 Oct 2011 16:19:29 -0400 (EDT)
David Miller <davem@davemloft.net> wrote:

> From: Flavio Leitner <fbl@redhat.com>
> Date: Thu, 20 Oct 2011 15:47:02 -0200
> 
> > I was reviewing this again and instead of doing the above, it would
> > be better to use rt_bind_peer() to update rt->peer as well.
> > 
> >                         if (!rt->peer)
> >                                 rt_bind_peer(rt, rt->rt_dst, 1);
> > 
> >                         peer = rt->peer;
> >                         if (peer) {
> >                                 peer->redirect_learned.a4 = new_gw;
> >                                 atomic_inc(&__rt_peer_genid);
> >                         }
> > 
> > 
> > but I am not sure I fully understood what you meant by doing it
> > such that only an inetpeer cache probe is necessary.
> 
> If you have the route entry available already and you're doing the
> inetpeer lookup anyways, you might as well use rt_bind_peer() since
> all of the expensive work has to be done anyways.
> 
> So yes, using rt_bind_peer() would be the best thing to do here.
>

just posted patch v3. iirc, you prefer to receive patches as new
posts rather than replies to old threads.
"Subject: [PATCH net-next v3] route: fix ICMP redirect validation"

thanks again,
fbl


* [PATCH net-next v3] route: fix ICMP redirect validation
From: Flavio Leitner @ 2011-10-21 16:44 UTC (permalink / raw)
  To: netdev; +Cc: David Miller, Flavio Leitner

The commit f39925dbde7788cfb96419c0f092b086aa325c0f
(ipv4: Cache learned redirect information in inetpeer.)
removed some ICMP packet validations which are required by
RFC 1122, section 3.2.2.2:
...
  A Redirect message SHOULD be silently discarded if the new
  gateway address it specifies is not on the same connected
  (sub-) net through which the Redirect arrived [INTRO:2,
  Appendix A], or if the source of the Redirect is not the
  current first-hop gateway for the specified destination (see
  Section 3.3.1).

Signed-off-by: Flavio Leitner <fbl@redhat.com>
---
 net/ipv4/route.c |   36 +++++++++++++++++++++++++++++++-----
 1 files changed, 31 insertions(+), 5 deletions(-)

diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 26c77e1..1082460 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -1308,7 +1308,12 @@ static void rt_del(unsigned hash, struct rtable *rt)
 void ip_rt_redirect(__be32 old_gw, __be32 daddr, __be32 new_gw,
 		    __be32 saddr, struct net_device *dev)
 {
+	int s, i;
 	struct in_device *in_dev = __in_dev_get_rcu(dev);
+	struct rtable *rt;
+	__be32 skeys[2] = { saddr, 0 };
+	int    ikeys[2] = { dev->ifindex, 0 };
+	struct flowi4 fl4;
 	struct inet_peer *peer;
 	struct net *net;
 
@@ -1331,13 +1336,34 @@ void ip_rt_redirect(__be32 old_gw, __be32 daddr, __be32 new_gw,
 			goto reject_redirect;
 	}
 
-	peer = inet_getpeer_v4(daddr, 1);
-	if (peer) {
-		peer->redirect_learned.a4 = new_gw;
+	memset(&fl4, 0, sizeof(fl4));
+	fl4.daddr = daddr;
+	for (s = 0; s < 2; s++) {
+		for (i = 0; i < 2; i++) {
+			fl4.flowi4_oif = ikeys[i];
+			fl4.saddr = skeys[s];
+			rt = __ip_route_output_key(net, &fl4);
+			if (IS_ERR(rt))
+				continue;
 
-		inet_putpeer(peer);
+			if (rt->dst.error || rt->dst.dev != dev ||
+			    rt->rt_gateway != old_gw) {
+				ip_rt_put(rt);
+				continue;
+			}
 
-		atomic_inc(&__rt_peer_genid);
+			if (!rt->peer)
+				rt_bind_peer(rt, rt->rt_dst, 1);
+
+			peer = rt->peer;
+			if (peer) {
+				peer->redirect_learned.a4 = new_gw;
+				atomic_inc(&__rt_peer_genid);
+			}
+
+			ip_rt_put(rt);
+			return;
+		}
 	}
 	return;
 
-- 
1.7.6
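
The two RFC 1122 conditions quoted in the commit message can be sketched
in plain userspace C. This is only an illustration of the rule, not the
kernel code from the patch: the struct, its fields and the function name
are hypothetical, and addresses are treated as host-order 32-bit values.
In the patch itself the second condition corresponds to the
rt->rt_gateway != old_gw test.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical view of the state a host consults for one Redirect. */
struct redirect_ctx {
	uint32_t if_addr;     /* address of the interface the Redirect arrived on */
	uint32_t if_mask;     /* netmask of that interface */
	uint32_t cur_gateway; /* current first-hop gateway for the destination */
};

/* Return true only if both RFC 1122 section 3.2.2.2 conditions hold. */
bool redirect_is_acceptable(const struct redirect_ctx *ctx,
			    uint32_t redirect_src, uint32_t new_gw)
{
	/* The new gateway must be on the connected (sub-)net of arrival. */
	if ((new_gw & ctx->if_mask) != (ctx->if_addr & ctx->if_mask))
		return false;

	/* The sender must be the current first-hop gateway. */
	if (redirect_src != ctx->cur_gateway)
		return false;

	return true;
}
```

A Redirect failing either check would be silently discarded, which is
what the "continue" branches in the patch amount to.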


* Re: IPsec performance bug
From: Kim Phillips @ 2011-10-21 17:54 UTC (permalink / raw)
  To: Yan, Zheng; +Cc: netdev, davem, hirofumi
In-Reply-To: <CAAM7YAkxP06_=y1wg4P1JhPDWhZgRM6+wbFQRG5PFkAq8vgsTw@mail.gmail.com>

On Fri, 21 Oct 2011 16:28:30 +0800
"Yan, Zheng" <zheng.z.yan@linux.intel.com> wrote:

> On Thu, Oct 20, 2011 at 10:22 AM, Kim Phillips
> <kim.phillips@freescale.com> wrote:
> > (b) any ideas how to fix?  I don't know much about routing
> > internals, but in ip_route_input_common(), if I remove the input
> > interface comparison (rth->rt_route_iif ^ iif), I get some
> > performance back, but the system becomes unstable (it's booted over
> > nfs).
> 
> Looks like xfrm4_fill_dst() resets rt->rt_route_iif to 0, which makes
> the comparison (rth->rt_route_iif ^ iif) in ip_route_input_common()
> fail.
> 
> Please try patch below. It improves the performance of 3.1-rc10
> kernel.

yes, thanks, ~50kpps performance is restored when applying this diff
to current net-next.

> (I'm not sure the patch is harmless)

the system appears to be more stable, but this is still concerning.

Kim
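
The (rth->rt_route_iif ^ iif) term discussed in this thread looks like one
part of the usual XOR-and-OR key comparison: each XOR is zero when the two
fields match, and the results are ORed so that a single test decides
whether the cached entry is a hit. A rough userspace sketch follows; the
struct and function names are hypothetical and the real route cache key
has more fields.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, reduced cache key; not the kernel's struct rtable. */
struct cache_key {
	uint32_t daddr;
	uint32_t saddr;
	int      route_iif;
};

/* True when every field matches: all XOR terms are zero. */
bool key_matches(const struct cache_key *entry,
		 uint32_t daddr, uint32_t saddr, int iif)
{
	return ((entry->daddr ^ daddr) |
		(entry->saddr ^ saddr) |
		(uint32_t)(entry->route_iif ^ iif)) == 0;
}
```

Under this reading, an entry whose route_iif has been reset to 0 can never
match a lookup with a non-zero input interface, so every packet would miss
the cache, which would be consistent with the performance drop reported
here.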


* Re: [PATCH V2 2/4] MIPS: Add board support for Loongson1B
From: Wu Zhangjin @ 2011-10-21 17:33 UTC (permalink / raw)
  To: keguang.zhang; +Cc: linux-mips, linux-kernel, ralf, r0bertz, netdev
In-Reply-To: <1319192888-21465-2-git-send-email-keguang.zhang@gmail.com>

On Fri, Oct 21, 2011 at 6:28 PM,  <keguang.zhang@gmail.com> wrote:
> From: Kelvin Cheung <keguang.zhang@gmail.com>
>
> This patch adds basic platform support for Loongson1B
> including serial port, ethernet, and interrupt handler.
>
> Loongson1B UART is compatible with NS16550A.
> Loongson1B GMAC is built around Synopsys IP Core.
>

Perhaps you'd better split out the GMAC support to its own patch and
send it to the net/ maintainer and the authors of the original files.

> diff --git a/drivers/net/stmmac/descs.h b/drivers/net/stmmac/descs.h
> index 63a03e2..4db27d0 100644
> --- a/drivers/net/stmmac/descs.h
> +++ b/drivers/net/stmmac/descs.h
> @@ -53,6 +53,38 @@ struct dma_desc {
>                        u32 reserved3:5;
>                        u32 disable_ic:1;
>                } rx;
> +#ifdef CONFIG_MACH_LOONGSON1
> +               struct {
> +                       /* RDES0 */
> +                       u32 payload_csum_error:1;
> +                       u32 crc_error:1;
> +                       u32 dribbling:1;
> +                       u32 error_gmii:1;
> +                       u32 receive_watchdog:1;
> +                       u32 frame_type:1;
> +                       u32 late_collision:1;
> +                       u32 ipc_csum_error:1;
> +                       u32 last_descriptor:1;
> +                       u32 first_descriptor:1;
> +                       u32 vlan_tag:1;
> +                       u32 overflow_error:1;
> +                       u32 length_error:1;
> +                       u32 sa_filter_fail:1;
> +                       u32 descriptor_error:1;
> +                       u32 error_summary:1;
> +                       u32 frame_length:14;
> +                       u32 da_filter_fail:1;
> +                       u32 own:1;
> +                       /* RDES1 */
> +                       u32 buffer1_size:11;
> +                       u32 buffer2_size:11;
> +                       u32 reserved1:2;
> +                       u32 second_address_chained:1;
> +                       u32 end_ring:1;
> +                       u32 reserved2:5;
> +                       u32 disable_ic:1;
> +               } erx;          /* -- enhanced -- */
> +#else
>                struct {
>                        /* RDES0 */
>                        u32 payload_csum_error:1;
> @@ -83,6 +115,7 @@ struct dma_desc {
>                        u32 reserved2:2;
>                        u32 disable_ic:1;
>                } erx;          /* -- enhanced -- */
> +#endif
>
>                /* Transmit descriptor */
>                struct {
> @@ -113,6 +146,40 @@ struct dma_desc {
>                        u32 last_segment:1;
>                        u32 interrupt:1;
>                } tx;
> +#ifdef CONFIG_MACH_LOONGSON1
> +               struct {
> +                       /* TDES0 */
> +                       u32 deferred:1;
> +                       u32 underflow_error:1;
> +                       u32 excessive_deferral:1;
> +                       u32 collision_count:4;
> +                       u32 vlan_frame:1;
> +                       u32 excessive_collisions:1;
> +                       u32 late_collision:1;
> +                       u32 no_carrier:1;
> +                       u32 loss_carrier:1;
> +                       u32 payload_error:1;
> +                       u32 frame_flushed:1;
> +                       u32 jabber_timeout:1;
> +                       u32 error_summary:1;
> +                       u32 ip_header_error:1;
> +                       u32 time_stamp_status:1;
> +                       u32 reserved1:13;
> +                       u32 own:1;
> +                       /* TDES1 */
> +                       u32 buffer1_size:11;
> +                       u32 buffer2_size:11;
> +                       u32 time_stamp_enable:1;
> +                       u32 disable_padding:1;
> +                       u32 second_address_chained:1;
> +                       u32 end_ring:1;
> +                       u32 crc_disable:1;
> +                       u32 checksum_insertion:2;
> +                       u32 first_segment:1;
> +                       u32 last_segment:1;
> +                       u32 interrupt:1;
> +               } etx;          /* -- enhanced -- */
> +#else
>                struct {
>                        /* TDES0 */
>                        u32 deferred:1;
> @@ -148,6 +215,7 @@ struct dma_desc {
>                        u32 buffer2_size:13;
>                        u32 reserved4:3;
>                } etx;          /* -- enhanced -- */
> +#endif
>        } des01;
>        unsigned int des2;
>        unsigned int des3;


If the layouts differ this much, perhaps a separate dma_desc struct could
be defined instead.

> diff --git a/drivers/net/stmmac/enh_desc.c b/drivers/net/stmmac/enh_desc.c
> index e5dfb6a..3b5e4f1 100644
> --- a/drivers/net/stmmac/enh_desc.c
> +++ b/drivers/net/stmmac/enh_desc.c
> @@ -108,6 +108,7 @@ static int enh_desc_get_tx_len(struct dma_desc *p)
>  static int enh_desc_coe_rdes0(int ipc_err, int type, int payload_err)
>  {
>        int ret = good_frame;
> +#ifndef CONFIG_MACH_LOONGSON1
>        u32 status = (type << 2 | ipc_err << 1 | payload_err) & 0x7;
>
>        /* bits 5 7 0 | Frame status
> @@ -145,6 +146,7 @@ static int enh_desc_coe_rdes0(int ipc_err, int type, int payload_err)
>                CHIP_DBG(KERN_ERR "RX Des0 status: No IPv4, IPv6 frame.\n");
>                ret = discard_frame;
>        }
> +#endif
>        return ret;
>  }

>
> @@ -232,9 +234,17 @@ static void enh_desc_init_rx_desc(struct dma_desc *p, unsigned int ring_size,
>        int i;
>        for (i = 0; i < ring_size; i++) {
>                p->des01.erx.own = 1;
> +#ifdef CONFIG_MACH_LOONGSON1
> +               p->des01.erx.buffer1_size = BUF_SIZE_2KiB - 1;
> +#else
>                p->des01.erx.buffer1_size = BUF_SIZE_8KiB - 1;
> +#endif
>                /* To support jumbo frames */
> +#ifdef CONFIG_MACH_LOONGSON1
> +               p->des01.erx.buffer2_size = BUF_SIZE_2KiB - 1;
> +#else
>                p->des01.erx.buffer2_size = BUF_SIZE_8KiB - 1;
> +#endif
>                if (i == ring_size - 1)
>                        p->des01.erx.end_ring = 1;
>                if (disable_rx_ic)
> @@ -292,9 +302,15 @@ static void enh_desc_prepare_tx_desc(struct dma_desc *p, int is_fs, int len,
>                                     int csum_flag)
>  {
>        p->des01.etx.first_segment = is_fs;
> +#ifdef CONFIG_MACH_LOONGSON1
> +       if (unlikely(len > BUF_SIZE_2KiB)) {
> +               p->des01.etx.buffer1_size = BUF_SIZE_2KiB - 1;
> +               p->des01.etx.buffer2_size = len - BUF_SIZE_2KiB + 1;
> +#else
>        if (unlikely(len > BUF_SIZE_4KiB)) {
>                p->des01.etx.buffer1_size = BUF_SIZE_4KiB;
>                p->des01.etx.buffer2_size = len - BUF_SIZE_4KiB;
> +#endif
>        } else {
>                p->des01.etx.buffer1_size = len;
>        }

Would it be possible to add two new macros, RX_BUF_SIZE and TX_BUF_SIZE,
to a header instead? That would reduce the code duplication.
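
As a rough sketch of what that could look like (not tested against the
driver): the RX_BUF_SIZE and TX_BUF_SIZE names come from the suggestion
above, while the BUF_SIZE_* values, struct tx_sizes and split_tx_len() are
assumptions made up for this userspace illustration.

```c
/* Assumed values; the driver defines its own BUF_SIZE_* constants. */
#define BUF_SIZE_2KiB 2048
#define BUF_SIZE_4KiB 4096
#define BUF_SIZE_8KiB 8192

/* Per-platform limits chosen once, instead of #ifdefs at every use. */
#ifdef CONFIG_MACH_LOONGSON1
#define RX_BUF_SIZE (BUF_SIZE_2KiB - 1)
#define TX_BUF_SIZE (BUF_SIZE_2KiB - 1)
#else
#define RX_BUF_SIZE (BUF_SIZE_8KiB - 1)
#define TX_BUF_SIZE BUF_SIZE_4KiB
#endif

/* Hypothetical stand-in for the two size fields of a TX descriptor. */
struct tx_sizes {
	unsigned int buffer1_size;
	unsigned int buffer2_size;
};

/* Split a frame length across the two buffers using the shared limit. */
struct tx_sizes split_tx_len(unsigned int len)
{
	struct tx_sizes s = { len, 0 };

	if (len > TX_BUF_SIZE) {
		s.buffer1_size = TX_BUF_SIZE;
		s.buffer2_size = len - TX_BUF_SIZE;
	}
	return s;
}
```

Note that on Loongson1 this moves the split threshold from
len > BUF_SIZE_2KiB to len > BUF_SIZE_2KiB - 1, so it is not an exact
drop-in for the quoted hunk and would need checking against the hardware
limits.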

Regards,
Wu Zhangjin

> --
> 1.7.1
>
>


* Re: [PATCH] dev: use name hash for dev_seq_ops
From: Stephen Hemminger @ 2011-10-21 17:07 UTC (permalink / raw)
  To: Mihai Maruseac
  Cc: davem, eric.dumazet, mirq-linux, therbert, jpirko, netdev,
	linux-kernel, dbaluta, Mihai Maruseac
In-Reply-To: <1319179510-10715-1-git-send-email-mmaruseac@ixiacom.com>

On Fri, 21 Oct 2011 09:45:10 +0300
Mihai Maruseac <mihai.maruseac@gmail.com> wrote:

> Instead of using the dev->next chain and trying to resync at each call to
> dev_seq_start, use the name hash, keeping the bucket and the offset in
> seq->private field.
> 
> Tests revealed the following results for ifconfig > /dev/null
> 	* 1000 interfaces:
> 		* 0.114s without patch
> 		* 0.089s with patch
> 	* 3000 interfaces:
> 		* 0.489s without patch
> 		* 0.110s with patch
> 	* 5000 interfaces:
> 		* 1.363s without patch
> 		* 0.250s with patch
> 	* 128000 interfaces (other setup):
> 		* ~100s without patch
> 		* ~30s with patch
> 
> Signed-off-by: Mihai Maruseac <mmaruseac@ixiacom.com>
> ---
>  net/core/dev.c |   84 ++++++++++++++++++++++++++++++++++++++++++++++----------
>  1 files changed, 69 insertions(+), 15 deletions(-)
> 
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 70ecb86..6edbcc5 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -4041,6 +4041,60 @@ static int dev_ifconf(struct net *net, char __user *arg)
>  }
>  
>  #ifdef CONFIG_PROC_FS
> +
> +#define BUCKET_SPACE (32 - NETDEV_HASHBITS)
> +
> +struct dev_iter_state {
> +	struct seq_net_private p;
> +	unsigned int pos; /* bucket << BUCKET_SPACE + offset */
> +};
> +
> +#define get_bucket(x) ((x) >> BUCKET_SPACE)
> +#define get_offset(x) ((x) & ((1 << BUCKET_SPACE) - 1))
> +#define set_bucket_offset(b, o) ((b) << BUCKET_SPACE | (o))
> +
> +static inline struct net_device *dev_from_same_bucket(struct seq_file *seq)
>

Why are all these functions marked inline? They are big and hardly on a
hot path, and it is better not to continue the bad practice of inlining
too much code.
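
For reference, the position encoding in the quoted patch packs the hash
bucket into the top bits and the offset within the bucket into the low
bits of one 32-bit value. A userspace sketch with ordinary (non-inline)
functions instead of macros is below; the NETDEV_HASHBITS value of 8 is
an assumption for the example.

```c
#include <stdint.h>

#define NETDEV_HASHBITS 8	/* assumed value for this sketch */
#define BUCKET_SPACE (32 - NETDEV_HASHBITS)

/* Top NETDEV_HASHBITS bits: index of the hash bucket. */
uint32_t get_bucket(uint32_t pos)
{
	return pos >> BUCKET_SPACE;
}

/* Low BUCKET_SPACE bits: offset of the device inside that bucket. */
uint32_t get_offset(uint32_t pos)
{
	return pos & ((UINT32_C(1) << BUCKET_SPACE) - 1);
}

/* Combine bucket and offset into one position value. */
uint32_t set_bucket_offset(uint32_t bucket, uint32_t offset)
{
	return (bucket << BUCKET_SPACE) | offset;
}
```

Written this way the compiler is free to inline the small helpers on its
own, without the keyword being spread over the larger functions.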


* [PATCH] rtnetlink: Add missing manual netlink notification in dev_change_net_namespaces
From: Eric W. Biederman @ 2011-10-21 16:24 UTC (permalink / raw)
  To: David Miller
  Cc: netdev, kaber, David Lamparter, Renato Westphal, Patrick McHardy
In-Reply-To: <CAChaegnHchLT0BV-_RaPT2-J3ZLin_U1x8X0KBi7ku1MArug1g@mail.gmail.com>


Renato Westphal noticed that since commit a2835763e130c343ace5320c20d33c281e7097b7
"rtnetlink: handle rtnl_link netlink notifications manually" was merged
we no longer send a netlink message when a networking device is moved
from one network namespace to another.

Fix this by adding the missing manual notification in dev_change_net_namespace.

Since all network devices that are processed by dev_change_net_namespace are
in the initialized state the complicated tests that guard the manual
rtmsg_ifinfo calls in rollback_registered and register_netdevice are
unnecessary and we can just perform a plain notification.

Cc: stable@kernel.org
Tested-by: Renato Westphal <renatowestphal@gmail.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
---
 net/core/dev.c |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index ad5d702..b7ba81a 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -6266,6 +6266,7 @@ int dev_change_net_namespace(struct net_device *dev, struct net *net, const char
 	*/
 	call_netdevice_notifiers(NETDEV_UNREGISTER, dev);
 	call_netdevice_notifiers(NETDEV_UNREGISTER_BATCH, dev);
+	rtmsg_ifinfo(RTM_DELLINK, dev, ~0U);
 
 	/*
 	 *	Flush the unicast and multicast chains
-- 
1.7.2.5


* Re: [patch net-next V2] net: introduce ethernet teaming device
From: Eric Dumazet @ 2011-10-21 15:31 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: netdev, davem, bhutchings, shemminger, fubar, andy, tgraf,
	ebiederm, mirqus, kaber, greearb, jesse, fbl, benjamin.poirier,
	jzupka
In-Reply-To: <20111021150250.GB10076@minipsycho>

Le vendredi 21 octobre 2011 à 17:02 +0200, Jiri Pirko a écrit :
> :( What do you suggest? Set these pointers one-by-one?

Yep, this is the right way to get an atomicity guarantee.
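
A minimal userspace sketch of that idea, for illustration only: each
pointer is written with a single store instead of memset()/memcpy(), and
the reader loads the pointer once into a local before testing and calling
it. The struct, the helper names and double_it() are hypothetical, and
ACCESS_ONCE() here is a stand-in modelled on the kernel macro (it relies
on the GNU __typeof__ extension).

```c
#include <stddef.h>

/* Userspace stand-in for the kernel's ACCESS_ONCE(). */
#define ACCESS_ONCE(x) (*(volatile __typeof__(x) *)&(x))

/* Hypothetical, reduced ops table. */
struct ops {
	int (*receive)(int);
	int (*transmit)(int);
};

/* Clear callbacks one pointer at a time, never byte by byte. */
void ops_clear(struct ops *o)
{
	ACCESS_ONCE(o->receive) = NULL;
	ACCESS_ONCE(o->transmit) = NULL;
}

/* Install callbacks one pointer at a time. */
void ops_set(struct ops *dst, const struct ops *src)
{
	ACCESS_ONCE(dst->receive) = src->receive;
	ACCESS_ONCE(dst->transmit) = src->transmit;
}

/* Reader: load the pointer once, then test and call the local copy. */
int ops_receive(struct ops *o, int arg, int fallback)
{
	int (*fn)(int) = ACCESS_ONCE(o->receive);

	return fn ? fn(arg) : fallback;
}

/* Example handler used only to exercise the sketch. */
int double_it(int v)
{
	return v * 2;
}
```

This only addresses tearing of individual pointers; it assumes aligned
pointer stores are single writes on the target, which C itself does not
promise, and it does not replace the synchronize_rcu() in the patch.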


* Re: [patch net-next V2] net: introduce ethernet teaming device
From: Jiri Pirko @ 2011-10-21 15:02 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: netdev, davem, bhutchings, shemminger, fubar, andy, tgraf,
	ebiederm, mirqus, kaber, greearb, jesse, fbl, benjamin.poirier,
	jzupka
In-Reply-To: <1319208237.32161.14.camel@edumazet-HP-Compaq-6005-Pro-SFF-PC>

Fri, Oct 21, 2011 at 04:43:57PM CEST, eric.dumazet@gmail.com wrote:
>Le vendredi 21 octobre 2011 à 14:39 +0200, Jiri Pirko a écrit :
>> This patch introduces a new network device called team. It is meant to be
>> a very fast, simple, userspace-driven alternative to the existing bonding
>> driver.
>> 
>> A userspace library called libteam, with a couple of demo apps, is available
>> here:
>> https://github.com/jpirko/libteam
>> Note it's still in its diapers atm.
>> 
>> team<->libteam use generic netlink for communication. That and rtnl
>> are supposed to be the only ways to configure a team device, no sysfs etc.
>> 
>> A Python binding basis for libteam was recently introduced (some work
>> still needs to be done on it though). A daemon providing arpmon/miimon
>> active-backup functionality will be introduced shortly.
>> Everything necessary is already implemented in the kernel team driver.
>> 
>> Signed-off-by: Jiri Pirko <jpirko@redhat.com>
>> 
>> v1->v2:
>> 	- modes are made as modules. Makes team more modular and
>> 	  extendable.
>> 	- several commenters' nitpicks found on v1 were fixed
>> 	- several other bugs were fixed.
>> 	- note I ignored Eric's comment about roundrobin port selector
>> 	  as Eric's way may be easily implemented as another mode (mode
>> 	  "random") in future.
>> ---
>
>Very nice work !
>
>> +
>> +
>> +/*
>> + * We can benefit from the fact that it's ensured no port is present
>> + * at the time of mode change.
>> + */
>> +static int __team_change_mode(struct team *team,
>> +			      const struct team_mode *new_mode)
>> +{
>> +	/* Check if mode was previously set and do cleanup if so */
>> +	if (team->mode_kind) {
>> +		void (*exit_op)(struct team *team) = team->mode_ops.exit;
>> +
>> +		/* Clear ops area so no callback is called any longer */
>> +		memset(&team->mode_ops, 0, sizeof(struct team_mode_ops));
>
>	Hmm, memset() has no atomicity guarantee about 'longs' or 'pointers'.
>
>	You must make sure mode_ops.receive (and other pointers) is set not
>byte per byte, but in one go.

:( What do you suggest? Set these pointers one-by-one?

>
>> +
>> +		synchronize_rcu();
>> +
>> +		if (exit_op)
>> +			exit_op(team);
>> +		team_mode_put(team->mode_kind);
>> +		team->mode_kind = NULL;
>> +		/* zero private data area */
>> +		memset(&team->mode_priv, 0,
>> +		       sizeof(struct team) - offsetof(struct team, mode_priv));
>> +	}
>> +
>> +	if (!new_mode)
>> +		return 0;
>> +
>> +	if (new_mode->ops->init) {
>> +		int err;
>> +
>> +		err = new_mode->ops->init(team);
>> +		if (err)
>> +			return err;
>> +	}
>> +
>> +	team->mode_kind = new_mode->kind;
>> +	memcpy(&team->mode_ops, new_mode->ops, sizeof(struct team_mode_ops));
>> +
>> +	return 0;
>> +}
>> +
>
>> +
>> +/************************
>> + * Rx path frame handler
>> + ************************/
>> +
>> +/* note: already called with rcu_read_lock */
>> +static rx_handler_result_t team_handle_frame(struct sk_buff **pskb)
>> +{
>> +	struct sk_buff *skb = *pskb;
>> +	struct team_port *port;
>> +	struct team *team;
>> +	rx_handler_result_t res = RX_HANDLER_ANOTHER;
>> +
>> +	skb = skb_share_check(skb, GFP_ATOMIC);
>> +	if (!skb)
>> +		return RX_HANDLER_CONSUMED;
>> +
>> +	*pskb = skb;
>> +
>> +	port = team_port_get_rcu(skb->dev);
>> +	team = port->team;
>> +
>> +	if (team->mode_ops.receive)
>
>Hmm, you need ACCESS_ONCE() here or rcu_dereference()
>
>See commit 4d97480b1806e883eb (bonding: use local function pointer of
>bond->recv_probe in bond_handle_frame) for reference

Will do

>
>> +		res = team->mode_ops.receive(team, port, skb);
>> +
>> +	if (res == RX_HANDLER_ANOTHER) {
>> +		struct team_pcpu_stats *pcpu_stats;
>> +
>> +		pcpu_stats = this_cpu_ptr(team->pcpu_stats);
>> +		u64_stats_update_begin(&pcpu_stats->syncp);
>> +		pcpu_stats->rx_packets++;
>> +		pcpu_stats->rx_bytes += skb->len;
>> +		if (skb->pkt_type == PACKET_MULTICAST)
>> +			pcpu_stats->rx_multicast++;
>> +		u64_stats_update_end(&pcpu_stats->syncp);
>> +
>> +		skb->dev = team->dev;
>> +	} else {
>> +		this_cpu_inc(team->pcpu_stats->rx_dropped);
>> +	}
>> +
>> +	return res;
>> +}
>> +
>
>
>> +
>> +static int team_port_enter(struct team *team, struct team_port *port)
>> +{
>> +	int err = 0;
>> +
>> +	dev_hold(team->dev);
>> +	port->dev->priv_flags |= IFF_TEAM_PORT;
>> +	if (team->mode_ops.port_enter) {
>> +		err = team->mode_ops.port_enter(team, port);
>> +		if (err)
>> +			netdev_err(team->dev, "Device %s failed to enter team mode\n",
>> +				   port->dev->name);
>
>		Not sure if you need to unset IFF_TEAM_PORT;

I do. I will add it.

>
>> +	}
>> +	return err;
>> +}
>> +
>
>
>> diff --git a/include/linux/if_team.h b/include/linux/if_team.h
>> new file mode 100644
>> index 0000000..21581a7
>> --- /dev/null
>> +++ b/include/linux/if_team.h
>> @@ -0,0 +1,233 @@
>> +/*
>> + * include/linux/if_team.h - Network team device driver header
>> + * Copyright (c) 2011 Jiri Pirko <jpirko@redhat.com>
>> + *
>> + * This program is free software; you can redistribute it and/or modify
>> + * it under the terms of the GNU General Public License as published by
>> + * the Free Software Foundation; either version 2 of the License, or
>> + * (at your option) any later version.
>> + */
>> +
>> +#ifndef _LINUX_IF_TEAM_H_
>> +#define _LINUX_IF_TEAM_H_
>> +
>> +#ifdef __KERNEL__
>> +
>> +struct team_pcpu_stats {
>> +	u64			rx_packets;
>> +	u64			rx_bytes;
>> +	u64			rx_multicast;
>> +	u64			tx_packets;
>> +	u64			tx_bytes;
>> +	struct u64_stats_sync	syncp;
>> +	u32			rx_dropped;
>> +	u32			tx_dropped;
>
>	"unsigned long" for these two fields ?

I copied these from some other driver (macvlan I think). I will change
this to unsigned long.

>
>> +};
>> +
>> +struct team;
>> +
>> +struct team_port {
>> +	struct net_device *dev;
>> +	struct hlist_node hlist; /* node in hash list */
>> +	struct list_head list; /* node in ordinary list */
>> +	struct team *team;
>> +	int index;
>> +
>> +	/*
>> +	 * A place for storing original values of the device before it
>> +	 * become a port.
>> +	 */
>> +	struct {
>> +		unsigned char dev_addr[MAX_ADDR_LEN];
>> +		unsigned int mtu;
>> +	} orig;
>> +
>> +	bool linkup;
>> +	u32 speed;
>> +	u8 duplex;
>> +
>> +	struct rcu_head rcu;
>> +};
>> +
>> +struct team_mode_ops {
>> +	int (*init)(struct team *team);
>> +	void (*exit)(struct team *team);
>> +	rx_handler_result_t (*receive)(struct team *team,
>> +				       struct team_port *port,
>> +				       struct sk_buff *skb);
>> +	bool (*transmit)(struct team *team, struct sk_buff *skb);
>> +	int (*port_enter)(struct team *team, struct team_port *port);
>> +	void (*port_leave)(struct team *team, struct team_port *port);
>> +	void (*port_change_mac)(struct team *team, struct team_port *port);
>> +};
>> +
>> +enum team_option_type {
>> +	TEAM_OPTION_TYPE_U32,
>> +	TEAM_OPTION_TYPE_STRING,
>> +};
>> +
>> +struct team_option {
>> +	struct list_head list;
>> +	const char *name;
>> +	enum team_option_type type;
>> +	int (*getter)(struct team *team, void *arg);
>> +	int (*setter)(struct team *team, void *arg);
>> +};
>> +
>> +struct team_mode {
>> +	struct list_head list;
>> +	const char *kind;
>> +	struct module *owner;
>> +	size_t priv_size;
>> +	const struct team_mode_ops *ops;
>> +};
>> +
>> +#define TEAM_MODE_PRIV_LONGS 4
>> +#define TEAM_MODE_PRIV_SIZE (sizeof(long) * TEAM_MODE_PRIV_LONGS)
>> +
>> +struct team {
>> +	struct net_device *dev; /* associated netdevice */
>
>
>> +	struct team_pcpu_stats __percpu *pcpu_stats;
>
>	I believe you can use net_device->anonymous_union, the one with
>ml_priv :
>	struct pcpu_lstats __percpu *lstats; /* loopback stats */
>	struct pcpu_tstats __percpu *tstats; /* tunnel stats */
>	struct pcpu_dstats __percpu *dstats; /* dummy stats */
>and add here
>	struct team_pcpu_stats __percpu *team_stats;

Is this the right way? I must say I do not like it too much :(


>
>> +
>> +	spinlock_t lock; /* used for overall locking, e.g. port lists write */
>> +
>> +	/*
>> +	 * port lists with port count
>> +	 */
>> +	int port_count;
>> +	struct hlist_head *port_hlist;
>
>	I am not sure why you want an external hash table, with 16 pointers...
>This could be embedded here to remove one dereference ?

Good point! Will change this.


Thanks Eric!

Jirka

>
>> +	struct list_head port_list;
>> +
>> +	struct list_head option_list;
>> +
>> +	const char *mode_kind;
>> +	struct team_mode_ops mode_ops;
>> +	long mode_priv[TEAM_MODE_PRIV_LONGS];
>> +};
>> +
>
>


* Re: [PATCH net] vlan:make mtu of vlan equal to physical dev
From: Herbert Xu @ 2011-10-21 14:48 UTC (permalink / raw)
  To: WeipingPan; +Cc: open list:NETWORKING [GENERAL]
In-Reply-To: <4E9E30B8.5030904@gmail.com>

On Wed, Oct 19, 2011 at 10:06:48AM +0800, WeipingPan wrote:
>
> What do you think of this patch ?

I think the current behaviour is fine as it is.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


* Re: [patch net-next V2] net: introduce ethernet teaming device
From: Eric Dumazet @ 2011-10-21 14:43 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: netdev, davem, bhutchings, shemminger, fubar, andy, tgraf,
	ebiederm, mirqus, kaber, greearb, jesse, fbl, benjamin.poirier,
	jzupka
In-Reply-To: <1319200747-2508-1-git-send-email-jpirko@redhat.com>

On Friday, 21 October 2011 at 14:39 +0200, Jiri Pirko wrote:
> This patch introduces a new network device called team. It is supposed to
> be a very fast, simple, userspace-driven alternative to the existing
> bonding driver.
> 
> A userspace library called libteam, with a couple of demo apps, is
> available here:
> https://github.com/jpirko/libteam
> Note it's still in its diapers atm.
> 
> team<->libteam use generic netlink for communication. That and rtnl are
> supposed to be the only ways to configure a team device, no sysfs etc.
> 
> A Python binding basis for libteam was recently introduced (some work
> still needs to be done on it though). A daemon providing arpmon/miimon
> active-backup functionality will be introduced shortly.
> All that's necessary is already implemented in the kernel team driver.
> 
> Signed-off-by: Jiri Pirko <jpirko@redhat.com>
> 
> v1->v2:
> 	- modes are made as modules. Makes team more modular and
> 	  extendable.
> 	- several commenters' nitpicks found on v1 were fixed
> 	- several other bugs were fixed.
> 	- note I ignored Eric's comment about roundrobin port selector
> 	  as Eric's way may be easily implemented as another mode (mode
> 	  "random") in future.
> ---

Very nice work !

> +
> +
> +/*
> + * We can benefit from the fact that it's ensured no port is present
> + * at the time of mode change.
> + */
> +static int __team_change_mode(struct team *team,
> +			      const struct team_mode *new_mode)
> +{
> +	/* Check if mode was previously set and do cleanup if so */
> +	if (team->mode_kind) {
> +		void (*exit_op)(struct team *team) = team->mode_ops.exit;
> +
> +		/* Clear ops area so no callback is called any longer */
> +		memset(&team->mode_ops, 0, sizeof(struct team_mode_ops));

	Hmm, memset() has no atomicity guarantee about 'longs' or 'pointers'.

	You must make sure mode_ops.receive (and the other pointers) are each
set in one go, not byte by byte.
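
A minimal userspace sketch of that idea, not the driver's actual code: each
callback slot is cleared with its own pointer-sized store instead of one
memset() over the whole struct. The names demo_ops, DEMO_STORE_ONCE and
demo_ops_clear are hypothetical. The volatile cast is modelled on the
kernel's ACCESS_ONCE(); it uses the GCC/Clang __typeof__ extension and
assumes the compiler emits an aligned, pointer-sized volatile store as a
single store.

```c
#include <stddef.h>

/* Hypothetical stand-in for struct team_mode_ops. */
struct demo_ops {
	int (*init)(void *priv);
	void (*exit)(void *priv);
	int (*receive)(void *priv, int frame);
};

/*
 * Volatile store to one slot, modelled on the kernel's ACCESS_ONCE().
 * Assumption: an aligned pointer-sized volatile store is not split
 * into smaller stores by the compiler.
 */
#define DEMO_STORE_ONCE(slot, val) \
	(*(volatile __typeof__(slot) *)&(slot) = (val))

/* Clear every callback with one store per pointer, not a bytewise memset(). */
void demo_ops_clear(struct demo_ops *ops)
{
	DEMO_STORE_ONCE(ops->init, NULL);
	DEMO_STORE_ONCE(ops->exit, NULL);
	DEMO_STORE_ONCE(ops->receive, NULL);
}

/* Trivial callbacks so the struct can be populated. */
int demo_init(void *priv) { (void)priv; return 0; }
void demo_exit(void *priv) { (void)priv; }
int demo_receive(void *priv, int frame) { (void)priv; return frame; }
```

With stores like these, a concurrent reader should see either the old
pointer or NULL in a given slot, never a half-written value.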

> +
> +		synchronize_rcu();
> +
> +		if (exit_op)
> +			exit_op(team);
> +		team_mode_put(team->mode_kind);
> +		team->mode_kind = NULL;
> +		/* zero private data area */
> +		memset(&team->mode_priv, 0,
> +		       sizeof(struct team) - offsetof(struct team, mode_priv));
> +	}
> +
> +	if (!new_mode)
> +		return 0;
> +
> +	if (new_mode->ops->init) {
> +		int err;
> +
> +		err = new_mode->ops->init(team);
> +		if (err)
> +			return err;
> +	}
> +
> +	team->mode_kind = new_mode->kind;
> +	memcpy(&team->mode_ops, new_mode->ops, sizeof(struct team_mode_ops));
> +
> +	return 0;
> +}
> +

> +
> +/************************
> + * Rx path frame handler
> + ************************/
> +
> +/* note: already called with rcu_read_lock */
> +static rx_handler_result_t team_handle_frame(struct sk_buff **pskb)
> +{
> +	struct sk_buff *skb = *pskb;
> +	struct team_port *port;
> +	struct team *team;
> +	rx_handler_result_t res = RX_HANDLER_ANOTHER;
> +
> +	skb = skb_share_check(skb, GFP_ATOMIC);
> +	if (!skb)
> +		return RX_HANDLER_CONSUMED;
> +
> +	*pskb = skb;
> +
> +	port = team_port_get_rcu(skb->dev);
> +	team = port->team;
> +
> +	if (team->mode_ops.receive)

Hmm, you need ACCESS_ONCE() here or rcu_dereference()

See commit 4d97480b1806e883eb (bonding: use local function pointer of
bond->recv_probe in bond_handle_frame) for reference
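
A rough userspace sketch of the local-function-pointer pattern that commit
title describes, under stated assumptions: the pointer is loaded once into a
local, and both the NULL test and the call use that local, so a concurrent
clear cannot slip in between them. All demo_* names and DEMO_READ_ONCE are
hypothetical; the macro imitates ACCESS_ONCE() using the GCC/Clang
__typeof__ extension.

```c
#include <stddef.h>

enum demo_result { DEMO_ANOTHER, DEMO_CONSUMED };

struct demo_rx_team;
typedef enum demo_result (*demo_receive_t)(struct demo_rx_team *team,
					    int frame);

/* Hypothetical stand-in for the part of struct team used on the rx path. */
struct demo_rx_team {
	demo_receive_t receive;	/* may be cleared by a concurrent mode change */
	int seen;		/* last frame handed to the mode callback */
};

/* Single volatile load, modelled on the kernel's ACCESS_ONCE(). */
#define DEMO_READ_ONCE(x) (*(volatile __typeof__(x) *)&(x))

enum demo_result demo_handle_frame(struct demo_rx_team *team, int frame)
{
	/* Load the pointer once; the test and the call share this value. */
	demo_receive_t receive = DEMO_READ_ONCE(team->receive);
	enum demo_result res = DEMO_ANOTHER;

	if (receive)
		res = receive(team, frame);
	return res;
}

/* Example mode callback that records and consumes the frame. */
enum demo_result demo_consume(struct demo_rx_team *team, int frame)
{
	team->seen = frame;
	return DEMO_CONSUMED;
}
```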

> +		res = team->mode_ops.receive(team, port, skb);
> +
> +	if (res == RX_HANDLER_ANOTHER) {
> +		struct team_pcpu_stats *pcpu_stats;
> +
> +		pcpu_stats = this_cpu_ptr(team->pcpu_stats);
> +		u64_stats_update_begin(&pcpu_stats->syncp);
> +		pcpu_stats->rx_packets++;
> +		pcpu_stats->rx_bytes += skb->len;
> +		if (skb->pkt_type == PACKET_MULTICAST)
> +			pcpu_stats->rx_multicast++;
> +		u64_stats_update_end(&pcpu_stats->syncp);
> +
> +		skb->dev = team->dev;
> +	} else {
> +		this_cpu_inc(team->pcpu_stats->rx_dropped);
> +	}
> +
> +	return res;
> +}
> +


> +
> +static int team_port_enter(struct team *team, struct team_port *port)
> +{
> +	int err = 0;
> +
> +	dev_hold(team->dev);
> +	port->dev->priv_flags |= IFF_TEAM_PORT;
> +	if (team->mode_ops.port_enter) {
> +		err = team->mode_ops.port_enter(team, port);
> +		if (err)
> +			netdev_err(team->dev, "Device %s failed to enter team mode\n",
> +				   port->dev->name);

		Not sure if you need to unset IFF_TEAM_PORT;
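
If it does turn out to be needed, the unwind could look roughly like the
userspace sketch below. This is only an illustration: demo_port,
DEMO_IFF_TEAM_PORT and the callbacks are hypothetical, and whether the real
driver needs this depends on how the caller unwinds after a failed
port_enter.

```c
/* Hypothetical flag bit standing in for IFF_TEAM_PORT. */
#define DEMO_IFF_TEAM_PORT 0x1u

/* Hypothetical stand-in for the port's net_device private flags. */
struct demo_port {
	unsigned int priv_flags;
};

typedef int (*demo_port_enter_t)(struct demo_port *port);

/* Mark the port, run the mode callback, and unmark it again on failure. */
int demo_port_enter(struct demo_port *port, demo_port_enter_t enter)
{
	int err = 0;

	port->priv_flags |= DEMO_IFF_TEAM_PORT;
	if (enter) {
		err = enter(port);
		if (err)
			/* Undo the flag so a failed enter leaves the port unmarked. */
			port->priv_flags &= ~DEMO_IFF_TEAM_PORT;
	}
	return err;
}

/* Example callbacks: one succeeds, one fails with a negative error code. */
int demo_enter_ok(struct demo_port *port) { (void)port; return 0; }
int demo_enter_fail(struct demo_port *port) { (void)port; return -22; }
```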

> +	}
> +	return err;
> +}
> +


> diff --git a/include/linux/if_team.h b/include/linux/if_team.h
> new file mode 100644
> index 0000000..21581a7
> --- /dev/null
> +++ b/include/linux/if_team.h
> @@ -0,0 +1,233 @@
> +/*
> + * include/linux/if_team.h - Network team device driver header
> + * Copyright (c) 2011 Jiri Pirko <jpirko@redhat.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; either version 2 of the License, or
> + * (at your option) any later version.
> + */
> +
> +#ifndef _LINUX_IF_TEAM_H_
> +#define _LINUX_IF_TEAM_H_
> +
> +#ifdef __KERNEL__
> +
> +struct team_pcpu_stats {
> +	u64			rx_packets;
> +	u64			rx_bytes;
> +	u64			rx_multicast;
> +	u64			tx_packets;
> +	u64			tx_bytes;
> +	struct u64_stats_sync	syncp;
> +	u32			rx_dropped;
> +	u32			tx_dropped;

	"unsigned long" for these two fields ?

> +};
> +
> +struct team;
> +
> +struct team_port {
> +	struct net_device *dev;
> +	struct hlist_node hlist; /* node in hash list */
> +	struct list_head list; /* node in ordinary list */
> +	struct team *team;
> +	int index;
> +
> +	/*
> +	 * A place for storing original values of the device before it
> +	 * becomes a port.
> +	 */
> +	struct {
> +		unsigned char dev_addr[MAX_ADDR_LEN];
> +		unsigned int mtu;
> +	} orig;
> +
> +	bool linkup;
> +	u32 speed;
> +	u8 duplex;
> +
> +	struct rcu_head rcu;
> +};
> +
> +struct team_mode_ops {
> +	int (*init)(struct team *team);
> +	void (*exit)(struct team *team);
> +	rx_handler_result_t (*receive)(struct team *team,
> +				       struct team_port *port,
> +				       struct sk_buff *skb);
> +	bool (*transmit)(struct team *team, struct sk_buff *skb);
> +	int (*port_enter)(struct team *team, struct team_port *port);
> +	void (*port_leave)(struct team *team, struct team_port *port);
> +	void (*port_change_mac)(struct team *team, struct team_port *port);
> +};
> +
> +enum team_option_type {
> +	TEAM_OPTION_TYPE_U32,
> +	TEAM_OPTION_TYPE_STRING,
> +};
> +
> +struct team_option {
> +	struct list_head list;
> +	const char *name;
> +	enum team_option_type type;
> +	int (*getter)(struct team *team, void *arg);
> +	int (*setter)(struct team *team, void *arg);
> +};
> +
> +struct team_mode {
> +	struct list_head list;
> +	const char *kind;
> +	struct module *owner;
> +	size_t priv_size;
> +	const struct team_mode_ops *ops;
> +};
> +
> +#define TEAM_MODE_PRIV_LONGS 4
> +#define TEAM_MODE_PRIV_SIZE (sizeof(long) * TEAM_MODE_PRIV_LONGS)
> +
> +struct team {
> +	struct net_device *dev; /* associated netdevice */


> +	struct team_pcpu_stats __percpu *pcpu_stats;

	I believe you can use net_device->anonymous_union, the one with
ml_priv :
	struct pcpu_lstats __percpu *lstats; /* loopback stats */
	struct pcpu_tstats __percpu *tstats; /* tunnel stats */
	struct pcpu_dstats __percpu *dstats; /* dummy stats */
and add here
	struct team_pcpu_stats __percpu *team_stats;

> +
> +	spinlock_t lock; /* used for overall locking, e.g. port lists write */
> +
> +	/*
> +	 * port lists with port count
> +	 */
> +	int port_count;
> +	struct hlist_head *port_hlist;

	I am not sure why you want an external hash table, with 16 pointers...
This could be embedded here to remove one dereference ?
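
A small userspace sketch of the embedded layout, for illustration only. The
bucket count of 16 is taken from the "16 pointers" remark above; the
demo_* types and the mask-based bucket lookup are hypothetical and are not
taken from the patch.

```c
#include <stddef.h>

#define DEMO_PORT_HASHBITS	4
#define DEMO_PORT_HASHENTRIES	(1 << DEMO_PORT_HASHBITS)	/* 16 buckets */

/* Hypothetical stand-ins for struct hlist_node / struct hlist_head. */
struct demo_hlist_node;
struct demo_hlist_head {
	struct demo_hlist_node *first;
};

struct demo_hash_team {
	int port_count;
	/* Buckets live inside the struct: no separate allocation and
	 * no extra pointer to follow on lookup. */
	struct demo_hlist_head port_hlist[DEMO_PORT_HASHENTRIES];
};

/* Initialise the embedded table; nothing to allocate, so it cannot fail. */
void demo_port_list_init(struct demo_hash_team *team)
{
	int i;

	team->port_count = 0;
	for (i = 0; i < DEMO_PORT_HASHENTRIES; i++)
		team->port_hlist[i].first = NULL;
}

/* Map a port index to its bucket (assumed power-of-two mask hashing). */
struct demo_hlist_head *demo_port_index_hash(struct demo_hash_team *team,
					     int port_index)
{
	return &team->port_hlist[port_index & (DEMO_PORT_HASHENTRIES - 1)];
}
```

With the table embedded, the init path has no allocation failure to handle
and the matching fini step has nothing to free.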

> +	struct list_head port_list;
> +
> +	struct list_head option_list;
> +
> +	const char *mode_kind;
> +	struct team_mode_ops mode_ops;
> +	long mode_priv[TEAM_MODE_PRIV_LONGS];
> +};
> +


* Re: [patch net-next V2] net: introduce ethernet teaming device
From: Jiri Pirko @ 2011-10-21 14:18 UTC (permalink / raw)
  To: Benjamin Poirier
  Cc: netdev, davem, eric.dumazet, bhutchings, shemminger, fubar, andy,
	tgraf, ebiederm, mirqus, kaber, greearb, jesse, fbl, jzupka
In-Reply-To: <20111021140005.GA29815@synalogic.ca>

Fri, Oct 21, 2011 at 04:00:05PM CEST, benjamin.poirier@gmail.com wrote:
>On 11/10/21 14:39, Jiri Pirko wrote:
>> This patch introduces a new network device called team. It is supposed to
>> be a very fast, simple, userspace-driven alternative to the existing
>> bonding driver.
>> 
>> A userspace library called libteam, with a couple of demo apps, is
>> available here:
>> https://github.com/jpirko/libteam
>> Note it's still in its diapers atm.
>> 
>> team<->libteam use generic netlink for communication. That and rtnl are
>> supposed to be the only ways to configure a team device, no sysfs etc.
>> 
>> A Python binding basis for libteam was recently introduced (some work
>> still needs to be done on it though). A daemon providing arpmon/miimon
>> active-backup functionality will be introduced shortly.
>> All that's necessary is already implemented in the kernel team driver.
>> 
>> Signed-off-by: Jiri Pirko <jpirko@redhat.com>
>> 
>> v1->v2:
>> 	- modes are made as modules. Makes team more modular and
>> 	  extendable.
>> 	- several commenters' nitpicks found on v1 were fixed
>> 	- several other bugs were fixed.
>> 	- note I ignored Eric's comment about roundrobin port selector
>> 	  as Eric's way may be easily implemented as another mode (mode
>> 	  "random") in future.
>> ---
>>  Documentation/networking/team.txt         |    2 +
>>  MAINTAINERS                               |    7 +
>>  drivers/net/Kconfig                       |    2 +
>>  drivers/net/Makefile                      |    1 +
>>  drivers/net/team/Kconfig                  |   38 +
>>  drivers/net/team/Makefile                 |    7 +
>>  drivers/net/team/team.c                   | 1593 +++++++++++++++++++++++++++++
>>  drivers/net/team/team_mode_activebackup.c |  152 +++
>>  drivers/net/team/team_mode_roundrobin.c   |  107 ++
>>  include/linux/Kbuild                      |    1 +
>>  include/linux/if.h                        |    1 +
>>  include/linux/if_team.h                   |  233 +++++
>>  12 files changed, 2144 insertions(+), 0 deletions(-)
>>  create mode 100644 Documentation/networking/team.txt
>>  create mode 100644 drivers/net/team/Kconfig
>>  create mode 100644 drivers/net/team/Makefile
>>  create mode 100644 drivers/net/team/team.c
>>  create mode 100644 drivers/net/team/team_mode_activebackup.c
>>  create mode 100644 drivers/net/team/team_mode_roundrobin.c
>>  create mode 100644 include/linux/if_team.h
>> 
>> diff --git a/Documentation/networking/team.txt b/Documentation/networking/team.txt
>> new file mode 100644
>> index 0000000..5a01368
>> --- /dev/null
>> +++ b/Documentation/networking/team.txt
>> @@ -0,0 +1,2 @@
>> +Team devices are driven from userspace via libteam library which is here:
>> +	https://github.com/jpirko/libteam
>> diff --git a/MAINTAINERS b/MAINTAINERS
>> index 5008b08..c33400d 100644
>> --- a/MAINTAINERS
>> +++ b/MAINTAINERS
>> @@ -6372,6 +6372,13 @@ W:	http://tcp-lp-mod.sourceforge.net/
>>  S:	Maintained
>>  F:	net/ipv4/tcp_lp.c
>>  
>> +TEAM DRIVER
>> +M:	Jiri Pirko <jpirko@redhat.com>
>> +L:	netdev@vger.kernel.org
>> +S:	Supported
>> +F:	drivers/net/team/
>> +F:	include/linux/if_team.h
>> +
>>  TEGRA SUPPORT
>>  M:	Colin Cross <ccross@android.com>
>>  M:	Erik Gilling <konkers@android.com>
>> diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
>> index 583f66c..b3020be 100644
>> --- a/drivers/net/Kconfig
>> +++ b/drivers/net/Kconfig
>> @@ -125,6 +125,8 @@ config IFB
>>  	  'ifb1' etc.
>>  	  Look at the iproute2 documentation directory for usage etc
>>  
>> +source "drivers/net/team/Kconfig"
>> +
>>  config MACVLAN
>>  	tristate "MAC-VLAN support (EXPERIMENTAL)"
>>  	depends on EXPERIMENTAL
>> diff --git a/drivers/net/Makefile b/drivers/net/Makefile
>> index fa877cd..4e4ebfe 100644
>> --- a/drivers/net/Makefile
>> +++ b/drivers/net/Makefile
>> @@ -17,6 +17,7 @@ obj-$(CONFIG_NET) += Space.o loopback.o
>>  obj-$(CONFIG_NETCONSOLE) += netconsole.o
>>  obj-$(CONFIG_PHYLIB) += phy/
>>  obj-$(CONFIG_RIONET) += rionet.o
>> +obj-$(CONFIG_NET_TEAM) += team/
>>  obj-$(CONFIG_TUN) += tun.o
>>  obj-$(CONFIG_VETH) += veth.o
>>  obj-$(CONFIG_VIRTIO_NET) += virtio_net.o
>> diff --git a/drivers/net/team/Kconfig b/drivers/net/team/Kconfig
>> new file mode 100644
>> index 0000000..70a43a6
>> --- /dev/null
>> +++ b/drivers/net/team/Kconfig
>> @@ -0,0 +1,38 @@
>> +menuconfig NET_TEAM
>> +	tristate "Ethernet team driver support (EXPERIMENTAL)"
>> +	depends on EXPERIMENTAL
>> +	---help---
>> +	  This allows one to create virtual interfaces that team together
>> +	  multiple ethernet devices.
>> +
>> +	  Team devices can be added using the "ip" command from the
>> +	  iproute2 package:
>> +
>> +	  "ip link add link [ address MAC ] [ NAME ] type team"
>> +
>> +	  To compile this driver as a module, choose M here: the module
>> +	  will be called team.
>> +
>> +if NET_TEAM
>> +
>> +config NET_TEAM_MODE_ROUNDROBIN
>> +	tristate "Round-robin mode support"
>> +	depends on NET_TEAM
>> +	---help---
>> +	  Basic mode where port used for transmitting packets is selected in
>> +	  round-robin fashion using packet counter.
>> +
>> +	  To compile this team mode as a module, choose M here: the module
>> +	  will be called team_mode_roundrobin.
>> +
>> +config NET_TEAM_MODE_ACTIVEBACKUP
>> +	tristate "Active-backup mode support"
>> +	depends on NET_TEAM
>> +	---help---
>> +	  Only one port is active at a time and the rest of ports are used
>> +	  for backup.
>> +
>> +	  To compile this team mode as a module, choose M here: the module
>> +	  will be called team_mode_activebackup.
>> +
>> +endif # NET_TEAM
>> diff --git a/drivers/net/team/Makefile b/drivers/net/team/Makefile
>> new file mode 100644
>> index 0000000..85f2028
>> --- /dev/null
>> +++ b/drivers/net/team/Makefile
>> @@ -0,0 +1,7 @@
>> +#
>> +# Makefile for the network team driver
>> +#
>> +
>> +obj-$(CONFIG_NET_TEAM) += team.o
>> +obj-$(CONFIG_NET_TEAM_MODE_ROUNDROBIN) += team_mode_roundrobin.o
>> +obj-$(CONFIG_NET_TEAM_MODE_ACTIVEBACKUP) += team_mode_activebackup.o
>> diff --git a/drivers/net/team/team.c b/drivers/net/team/team.c
>> new file mode 100644
>> index 0000000..398be58
>> --- /dev/null
>> +++ b/drivers/net/team/team.c
>> @@ -0,0 +1,1593 @@
>> +/*
>> + * drivers/net/team/team.c - Network team device driver
>> + * Copyright (c) 2011 Jiri Pirko <jpirko@redhat.com>
>> + *
>> + * This program is free software; you can redistribute it and/or modify
>> + * it under the terms of the GNU General Public License as published by
>> + * the Free Software Foundation; either version 2 of the License, or
>> + * (at your option) any later version.
>> + */
>> +
>> +#include <linux/kernel.h>
>> +#include <linux/types.h>
>> +#include <linux/module.h>
>> +#include <linux/init.h>
>> +#include <linux/slab.h>
>> +#include <linux/rcupdate.h>
>> +#include <linux/errno.h>
>> +#include <linux/ctype.h>
>> +#include <linux/notifier.h>
>> +#include <linux/netdevice.h>
>> +#include <linux/if_arp.h>
>> +#include <linux/socket.h>
>> +#include <linux/etherdevice.h>
>> +#include <linux/rtnetlink.h>
>> +#include <net/rtnetlink.h>
>> +#include <net/genetlink.h>
>> +#include <net/netlink.h>
>> +#include <linux/if_team.h>
>> +
>> +#define DRV_NAME "team"
>> +
>> +
>> +/**********
>> + * Helpers
>> + **********/
>> +
>> +#define team_port_exists(dev) (dev->priv_flags & IFF_TEAM_PORT)
>> +
>> +static struct team_port *team_port_get_rcu(const struct net_device *dev)
>> +{
>> +	struct team_port *port = rcu_dereference(dev->rx_handler_data);
>> +
>> +	return team_port_exists(dev) ? port : NULL;
>> +}
>> +
>> +static struct team_port *team_port_get_rtnl(const struct net_device *dev)
>> +{
>> +	struct team_port *port = rtnl_dereference(dev->rx_handler_data);
>> +
>> +	return team_port_exists(dev) ? port : NULL;
>> +}
>> +
>> +/*
>> + * Since the ability to change mac address for open port device is tested in
>> + * team_port_add, this function can be called without control of return value
>> + */
>> +static int __set_port_mac(struct net_device *port_dev,
>> +			  const unsigned char *dev_addr)
>> +{
>> +	struct sockaddr addr;
>> +
>> +	memcpy(addr.sa_data, dev_addr, ETH_ALEN);
>> +	addr.sa_family = ARPHRD_ETHER;
>> +	return dev_set_mac_address(port_dev, &addr);
>> +}
>> +
>> +int team_port_set_orig_mac(struct team_port *port)
>> +{
>> +	return __set_port_mac(port->dev, port->orig.dev_addr);
>> +}
>> +EXPORT_SYMBOL(team_port_set_orig_mac);
>> +
>> +int team_port_set_team_mac(struct team_port *port)
>> +{
>> +	return __set_port_mac(port->dev, port->team->dev->dev_addr);
>> +}
>> +EXPORT_SYMBOL(team_port_set_team_mac);
>> +
>> +
>> +/*******************
>> + * Options handling
>> + *******************/
>> +
>> +void team_options_register(struct team *team, struct team_option *option,
>> +			   size_t option_count)
>> +{
>> +	int i;
>> +
>> +	for (i = 0; i < option_count; i++, option++)
>> +		list_add_tail(&option->list, &team->option_list);
>> +}
>> +EXPORT_SYMBOL(team_options_register);
>> +
>> +static void __team_options_change_check(struct team *team,
>> +					struct team_option *changed_option);
>> +
>> +static void __team_options_unregister(struct team *team,
>> +				      struct team_option *option,
>> +				      size_t option_count)
>> +{
>> +	int i;
>> +
>> +	for (i = 0; i < option_count; i++, option++)
>> +		list_del(&option->list);
>> +}
>> +
>> +void team_options_unregister(struct team *team, struct team_option *option,
>> +			     size_t option_count)
>> +{
>> +	__team_options_unregister(team, option, option_count);
>> +	__team_options_change_check(team, NULL);
>> +}
>> +EXPORT_SYMBOL(team_options_unregister);
>> +
>> +static int team_option_get(struct team *team, struct team_option *option,
>> +			   void *arg)
>> +{
>> +	return option->getter(team, arg);
>> +}
>> +
>> +static int team_option_set(struct team *team, struct team_option *option,
>> +			   void *arg)
>> +{
>> +	int err;
>> +
>> +	err = option->setter(team, arg);
>> +	if (err)
>> +		return err;
>> +
>> +	__team_options_change_check(team, option);
>> +	return err;
>> +}
>> +
>> +/****************
>> + * Mode handling
>> + ****************/
>> +
>> +static LIST_HEAD(mode_list);
>> +static DEFINE_SPINLOCK(mode_list_lock);
>> +
>> +static struct team_mode *__find_mode(const char *kind)
>> +{
>> +	struct team_mode *mode;
>> +
>> +	list_for_each_entry(mode, &mode_list, list) {
>> +		if (strcmp(mode->kind, kind) == 0)
>> +			return mode;
>> +	}
>> +	return NULL;
>> +}
>> +
>> +static bool is_good_mode_name(const char *name)
>> +{
>> +	while (*name != '\0') {
>> +		if (!isalpha(*name) && !isdigit(*name) && *name != '_')
>> +			return false;
>> +		name++;
>> +	}
>> +	return true;
>> +}
>> +
>> +int team_mode_register(struct team_mode *mode)
>> +{
>> +	int err = 0;
>> +
>> +	if (!is_good_mode_name(mode->kind) ||
>> +	    mode->priv_size > TEAM_MODE_PRIV_SIZE)
>> +		return -EINVAL;
>> +	spin_lock(&mode_list_lock);
>> +	if (__find_mode(mode->kind)) {
>> +		err = -EEXIST;
>> +		goto unlock;
>> +	}
>> +	list_add_tail(&mode->list, &mode_list);
>> +unlock:
>> +	spin_unlock(&mode_list_lock);
>> +	return err;
>> +}
>> +EXPORT_SYMBOL(team_mode_register);
>> +
>> +int team_mode_unregister(struct team_mode *mode)
>> +{
>> +	spin_lock(&mode_list_lock);
>> +	list_del_init(&mode->list);
>> +	spin_unlock(&mode_list_lock);
>> +	return 0;
>> +}
>> +EXPORT_SYMBOL(team_mode_unregister);
>> +
>> +static struct team_mode *team_mode_get(const char *kind)
>> +{
>> +	struct team_mode *mode;
>> +
>> +	spin_lock(&mode_list_lock);
>> +	mode = __find_mode(kind);
>> +	if (!mode) {
>> +		spin_unlock(&mode_list_lock);
>> +		request_module("team-mode-%s", kind);
>> +		spin_lock(&mode_list_lock);
>> +		mode = __find_mode(kind);
>> +	}
>> +	if (mode)
>> +		if (!try_module_get(mode->owner))
>> +			mode = NULL;
>> +
>> +	spin_unlock(&mode_list_lock);
>> +	return mode;
>> +}
>> +
>> +static void team_mode_put(const char *kind)
>> +{
>> +	struct team_mode *mode;
>> +
>> +	spin_lock(&mode_list_lock);
>> +	mode = __find_mode(kind);
>> +	BUG_ON(!mode);
>> +	module_put(mode->owner);
>> +	spin_unlock(&mode_list_lock);
>> +}
>> +
>> +/*
>> + * We can benefit from the fact that it's ensured no port is present
>> + * at the time of mode change.
>> + */
>> +static int __team_change_mode(struct team *team,
>> +			      const struct team_mode *new_mode)
>> +{
>> +	/* Check if mode was previously set and do cleanup if so */
>> +	if (team->mode_kind) {
>> +		void (*exit_op)(struct team *team) = team->mode_ops.exit;
>> +
>> +		/* Clear ops area so no callback is called any longer */
>> +		memset(&team->mode_ops, 0, sizeof(struct team_mode_ops));
>> +
>> +		synchronize_rcu();
>> +
>> +		if (exit_op)
>> +			exit_op(team);
>> +		team_mode_put(team->mode_kind);
>> +		team->mode_kind = NULL;
>> +		/* zero private data area */
>> +		memset(&team->mode_priv, 0,
>> +		       sizeof(struct team) - offsetof(struct team, mode_priv));
>> +	}
>> +
>> +	if (!new_mode)
>> +		return 0;
>> +
>> +	if (new_mode->ops->init) {
>> +		int err;
>> +
>> +		err = new_mode->ops->init(team);
>> +		if (err)
>> +			return err;
>> +	}
>> +
>> +	team->mode_kind = new_mode->kind;
>> +	memcpy(&team->mode_ops, new_mode->ops, sizeof(struct team_mode_ops));
>> +
>> +	return 0;
>> +}
>> +
>> +static int team_change_mode(struct team *team, const char *kind)
>> +{
>> +	struct team_mode *new_mode;
>> +	struct net_device *dev = team->dev;
>> +	int err;
>> +
>> +	if (!list_empty(&team->port_list)) {
>> +		netdev_err(dev, "No ports can be present during mode change\n");
>> +		return -EBUSY;
>> +	}
>> +
>> +	if (team->mode_kind && strcmp(team->mode_kind, kind) == 0) {
>> +		netdev_err(dev, "Unable to change to the same mode the team is in\n");
>> +		return -EINVAL;
>> +	}
>> +
>> +	new_mode = team_mode_get(kind);
>> +	if (!new_mode) {
>> +		netdev_err(dev, "Mode \"%s\" not found\n", kind);
>> +		return -EINVAL;
>> +	}
>> +
>> +	err = __team_change_mode(team, new_mode);
>> +	if (err) {
>> +		netdev_err(dev, "Failed to change to mode \"%s\"\n", kind);
>> +		team_mode_put(kind);
>> +		return err;
>> +	}
>> +
>> +	netdev_info(dev, "Mode changed to \"%s\"\n", kind);
>> +	return 0;
>> +}
>> +
>> +
>> +/************************
>> + * Rx path frame handler
>> + ************************/
>> +
>> +/* note: already called with rcu_read_lock */
>> +static rx_handler_result_t team_handle_frame(struct sk_buff **pskb)
>> +{
>> +	struct sk_buff *skb = *pskb;
>> +	struct team_port *port;
>> +	struct team *team;
>> +	rx_handler_result_t res = RX_HANDLER_ANOTHER;
>> +
>> +	skb = skb_share_check(skb, GFP_ATOMIC);
>> +	if (!skb)
>> +		return RX_HANDLER_CONSUMED;
>> +
>> +	*pskb = skb;
>> +
>> +	port = team_port_get_rcu(skb->dev);
>> +	team = port->team;
>> +
>> +	if (team->mode_ops.receive)
>> +		res = team->mode_ops.receive(team, port, skb);
>> +
>> +	if (res == RX_HANDLER_ANOTHER) {
>> +		struct team_pcpu_stats *pcpu_stats;
>> +
>> +		pcpu_stats = this_cpu_ptr(team->pcpu_stats);
>> +		u64_stats_update_begin(&pcpu_stats->syncp);
>> +		pcpu_stats->rx_packets++;
>> +		pcpu_stats->rx_bytes += skb->len;
>> +		if (skb->pkt_type == PACKET_MULTICAST)
>> +			pcpu_stats->rx_multicast++;
>> +		u64_stats_update_end(&pcpu_stats->syncp);
>> +
>> +		skb->dev = team->dev;
>> +	} else {
>> +		this_cpu_inc(team->pcpu_stats->rx_dropped);
>> +	}
>> +
>> +	return res;
>> +}
>> +
>> +
>> +/****************
>> + * Port handling
>> + ****************/
>> +
>> +static bool team_port_find(const struct team *team,
>> +			   const struct team_port *port)
>> +{
>> +	struct team_port *cur;
>> +
>> +	list_for_each_entry(cur, &team->port_list, list)
>> +		if (cur == port)
>> +			return true;
>> +	return false;
>> +}
>> +
>> +static int team_port_list_init(struct team *team)
>> +{
>> +	int i;
>> +	struct hlist_head *hash;
>> +
>> +	hash = kmalloc(sizeof(*hash) * TEAM_PORT_HASHENTRIES, GFP_KERNEL);
>> +	if (!hash)
>> +		return -ENOMEM;
>> +
>> +	for (i = 0; i < TEAM_PORT_HASHENTRIES; i++)
>> +		INIT_HLIST_HEAD(&hash[i]);
>> +	team->port_hlist = hash;
>> +	INIT_LIST_HEAD(&team->port_list);
>> +	return 0;
>> +}
>> +
>> +static void team_port_list_fini(struct team *team)
>> +{
>> +	kfree(team->port_hlist);
>> +}
>> +
>> +/*
>> + * Add/delete port to the team port list. Write guarded by rtnl_lock.
>> + * Takes care of correct port->index setup (might be racy).
>> + */
>> +static void team_port_list_add_port(struct team *team,
>> +				    struct team_port *port)
>> +{
>> +	port->index = team->port_count++;
>> +	hlist_add_head_rcu(&port->hlist,
>> +			   team_port_index_hash(team, port->index));
>> +	list_add_tail_rcu(&port->list, &team->port_list);
>> +}
>> +
>> +static void __reconstruct_port_hlist(struct team *team, int rm_index)
>> +{
>> +	int i;
>> +	struct team_port *port;
>> +
>> +	for (i = rm_index + 1; i < team->port_count; i++) {
>> +		port = team_get_port_by_index_rcu(team, i);
>> +		hlist_del_rcu(&port->hlist);
>> +		port->index--;
>> +		hlist_add_head_rcu(&port->hlist,
>> +				   team_port_index_hash(team, port->index));
>> +	}
>> +}
>> +
>> +static void team_port_list_del_port(struct team *team,
>> +				   struct team_port *port)
>> +{
>> +	int rm_index = port->index;
>> +
>> +	hlist_del_rcu(&port->hlist);
>> +	list_del_rcu(&port->list);
>> +	__reconstruct_port_hlist(team, rm_index);
>> +	team->port_count--;
>> +}
>> +
>> +#define TEAM_VLAN_FEATURES (NETIF_F_ALL_CSUM | NETIF_F_SG | \
>> +			    NETIF_F_FRAGLIST | NETIF_F_ALL_TSO | \
>> +			    NETIF_F_HIGHDMA | NETIF_F_LRO)
>> +
>> +static void __team_compute_features(struct team *team)
>> +{
>> +	struct team_port *port;
>> +	u32 vlan_features = TEAM_VLAN_FEATURES;
>> +	unsigned short max_hard_header_len = ETH_HLEN;
>> +
>> +	list_for_each_entry(port, &team->port_list, list) {
>> +		vlan_features = netdev_increment_features(vlan_features,
>> +					port->dev->vlan_features,
>> +					TEAM_VLAN_FEATURES);
>> +
>> +		if (port->dev->hard_header_len > max_hard_header_len)
>> +			max_hard_header_len = port->dev->hard_header_len;
>> +	}
>> +
>> +	team->dev->vlan_features = vlan_features;
>> +	team->dev->hard_header_len = max_hard_header_len;
>> +
>> +	netdev_change_features(team->dev);
>> +}
>> +
>> +static void team_compute_features(struct team *team)
>> +{
>> +	spin_lock(&team->lock);
>> +	__team_compute_features(team);
>> +	spin_unlock(&team->lock);
>> +}
>> +
>> +static int team_port_enter(struct team *team, struct team_port *port)
>> +{
>> +	int err = 0;
>> +
>> +	dev_hold(team->dev);
>> +	port->dev->priv_flags |= IFF_TEAM_PORT;
>> +	if (team->mode_ops.port_enter) {
>> +		err = team->mode_ops.port_enter(team, port);
>> +		if (err)
>> +			netdev_err(team->dev, "Device %s failed to enter team mode\n",
>> +				   port->dev->name);
>> +	}
>> +	return err;
>> +}
>> +
>> +static void team_port_leave(struct team *team, struct team_port *port)
>> +{
>> +	if (team->mode_ops.port_leave)
>> +		team->mode_ops.port_leave(team, port);
>> +	port->dev->priv_flags &= ~IFF_TEAM_PORT;
>> +	dev_put(team->dev);
>> +}
>> +
>> +static void __team_port_change_check(struct team_port *port, bool linkup);
>> +
>> +static int team_port_add(struct team *team, struct net_device *port_dev)
>> +{
>> +	struct net_device *dev = team->dev;
>> +	struct team_port *port;
>> +	char *portname = port_dev->name;
>> +	char tmp_addr[ETH_ALEN];
>> +	int err;
>> +
>> +	if (port_dev->flags & IFF_LOOPBACK ||
>> +	    port_dev->type != ARPHRD_ETHER) {
>> +		netdev_err(dev, "Device %s is of an unsupported type\n",
>> +			   portname);
>> +		return -EINVAL;
>> +	}
>> +
>> +	if (team_port_exists(port_dev)) {
>> +		netdev_err(dev, "Device %s is already a port "
>> +				"of a team device\n", portname);
>> +		return -EBUSY;
>> +	}
>> +
>> +	if (port_dev->flags & IFF_UP) {
>> +		netdev_err(dev, "Device %s is up. Set it down before adding it as a team port\n",
>> +			   portname);
>> +		return -EBUSY;
>> +	}
>> +
>> +	port = kzalloc(sizeof(struct team_port), GFP_KERNEL);
>> +	if (!port)
>> +		return -ENOMEM;
>> +
>> +	port->dev = port_dev;
>> +	port->team = team;
>> +
>> +	port->orig.mtu = port_dev->mtu;
>> +	err = dev_set_mtu(port_dev, dev->mtu);
>> +	if (err) {
>> +		netdev_dbg(dev, "Error %d calling dev_set_mtu\n", err);
>> +		goto err_set_mtu;
>> +	}
>> +
>> +	memcpy(port->orig.dev_addr, port_dev->dev_addr, ETH_ALEN);
>> +	random_ether_addr(tmp_addr);
>> +	err = __set_port_mac(port_dev, tmp_addr);
>> +	if (err) {
>> +		netdev_dbg(dev, "Device %s mac addr set failed\n",
>> +			   portname);
>> +		goto err_set_mac_rand;
>> +	}
>> +
>> +	err = dev_open(port_dev);
>> +	if (err) {
>> +		netdev_dbg(dev, "Device %s opening failed\n",
>> +			   portname);
>> +		goto err_dev_open;
>> +	}
>> +
>> +	err = team_port_set_orig_mac(port);
>> +	if (err) {
>> +		netdev_dbg(dev, "Device %s mac addr set failed - Device does not support addr change when it's opened\n",
>> +			   portname);
>> +		goto err_set_mac_opened;
>> +	}
>> +
>> +	err = team_port_enter(team, port);
>> +	if (err) {
>> +		netdev_err(dev, "Device %s failed to enter team mode\n",
>> +			   portname);
>> +		goto err_port_enter;
>> +	}
>> +
>> +	err = netdev_set_master(port_dev, dev);
>> +	if (err) {
>> +		netdev_err(dev, "Device %s failed to set master\n", portname);
>> +		goto err_set_master;
>> +	}
>> +
>> +	err = netdev_rx_handler_register(port_dev, team_handle_frame,
>> +					 port);
>> +	if (err) {
>> +		netdev_err(dev, "Device %s failed to register rx_handler\n",
>> +			   portname);
>> +		goto err_handler_register;
>> +	}
>> +
>> +	team_port_list_add_port(team, port);
>> +	__team_compute_features(team);
>> +	__team_port_change_check(port, !!netif_carrier_ok(port_dev));
>> +
>> +	netdev_info(dev, "Port device %s added\n", portname);
>> +
>> +	return 0;
>> +
>> +err_handler_register:
>> +	netdev_set_master(port_dev, NULL);
>> +
>> +err_set_master:
>> +	team_port_leave(team, port);
>> +
>> +err_port_enter:
>> +err_set_mac_opened:
>> +	dev_close(port_dev);
>> +
>> +err_dev_open:
>> +	team_port_set_orig_mac(port);
>> +
>> +err_set_mac_rand:
>> +	dev_set_mtu(port_dev, port->orig.mtu);
>> +
>> +err_set_mtu:
>> +	kfree(port);
>> +
>> +	return err;
>> +}
>> +
>> +static int team_port_del(struct team *team, struct net_device *port_dev)
>> +{
>> +	struct net_device *dev = team->dev;
>> +	struct team_port *port;
>> +	char *portname = port_dev->name;
>> +
>> +	port = team_port_get_rtnl(port_dev);
>> +	if (!port || !team_port_find(team, port)) {
>> +		netdev_err(dev, "Device %s does not act as a port of this team\n",
>> +			   portname);
>> +		return -ENOENT;
>> +	}
>> +
>> +	__team_port_change_check(port, false);
>> +	team_port_list_del_port(team, port);
>> +	netdev_rx_handler_unregister(port_dev);
>> +	netdev_set_master(port_dev, NULL);
>> +	team_port_leave(team, port);
>> +	dev_close(port_dev);
>> +	team_port_set_orig_mac(port);
>> +	dev_set_mtu(port_dev, port->orig.mtu);
>> +	synchronize_rcu();
>> +	kfree(port);
>> +	netdev_info(dev, "Port device %s removed\n", portname);
>> +	__team_compute_features(team);
>> +
>> +	return 0;
>> +}
>> +
>> +
>> +/*****************
>> + * Net device ops
>> + *****************/
>> +
>> +static const char team_no_mode_kind[] = "*NOMODE*";
>> +
>> +static int team_mode_option_get(struct team *team, void *arg)
>> +{
>> +	const char **str = arg;
>> +
>> +	*str = team->mode_kind ? team->mode_kind : team_no_mode_kind;
>> +	return 0;
>> +}
>> +
>> +static int team_mode_option_set(struct team *team, void *arg)
>> +{
>> +	const char **str = arg;
>> +
>> +	return team_change_mode(team, *str);
>> +}
>> +
>> +static struct team_option team_options[] = {
>> +	{
>> +		.name = "mode",
>> +		.type = TEAM_OPTION_TYPE_STRING,
>> +		.getter = team_mode_option_get,
>> +		.setter = team_mode_option_set,
>> +	},
>> +};
>> +
>> +static int team_init(struct net_device *dev)
>> +{
>> +	struct team *team = netdev_priv(dev);
>> +	int err;
>> +
>> +	team->dev = dev;
>> +	spin_lock_init(&team->lock);
>> +
>> +	team->pcpu_stats = alloc_percpu(struct team_pcpu_stats);
>> +	if (!team->pcpu_stats)
>> +		return -ENOMEM;
>> +
>> +	err = team_port_list_init(team);
>> +	if (err)
>> +		goto err_port_list_init;
>> +
>> +	INIT_LIST_HEAD(&team->option_list);
>> +	team_options_register(team, team_options, ARRAY_SIZE(team_options));
>> +	netif_carrier_off(dev);
>> +
>> +	return 0;
>> +
>> +err_port_list_init:
>> +
>> +	free_percpu(team->pcpu_stats);
>> +
>> +	return err;
>> +}
>> +
>> +static void team_uninit(struct net_device *dev)
>> +{
>> +	struct team *team = netdev_priv(dev);
>> +	struct team_port *port;
>> +	struct team_port *tmp;
>> +
>> +	spin_lock(&team->lock);
>> +	list_for_each_entry_safe(port, tmp, &team->port_list, list)
>> +		team_port_del(team, port->dev);
>> +
>> +	__team_change_mode(team, NULL); /* cleanup */
>> +	__team_options_unregister(team, team_options, ARRAY_SIZE(team_options));
>> +	spin_unlock(&team->lock);
>> +}
>> +
>> +static void team_destructor(struct net_device *dev)
>> +{
>> +	struct team *team = netdev_priv(dev);
>> +
>> +	team_port_list_fini(team);
>> +	free_percpu(team->pcpu_stats);
>> +	free_netdev(dev);
>> +}
>> +
>> +static int team_open(struct net_device *dev)
>> +{
>> +	netif_carrier_on(dev);
>> +	return 0;
>> +}
>> +
>> +static int team_close(struct net_device *dev)
>> +{
>> +	netif_carrier_off(dev);
>> +	return 0;
>> +}
>> +
>> +/*
>> + * note: already called with rcu_read_lock
>> + */
>> +static netdev_tx_t team_xmit(struct sk_buff *skb, struct net_device *dev)
>> +{
>> +	struct team *team = netdev_priv(dev);
>> +	bool tx_success = false;
>> +	unsigned int len = skb->len;
>> +
>> +	/*
>> +	 * Ensure transmit function is called only in case there is at least
>> +	 * one port present.
>> +	 */
>> +	if (likely(!list_empty(&team->port_list) && team->mode_ops.transmit))
>> +		tx_success = team->mode_ops.transmit(team, skb);
>> +	if (tx_success) {
>> +		struct team_pcpu_stats *pcpu_stats;
>> +
>> +		pcpu_stats = this_cpu_ptr(team->pcpu_stats);
>> +		u64_stats_update_begin(&pcpu_stats->syncp);
>> +		pcpu_stats->tx_packets++;
>> +		pcpu_stats->tx_bytes += len;
>> +		u64_stats_update_end(&pcpu_stats->syncp);
>> +	} else {
>> +		this_cpu_inc(team->pcpu_stats->tx_dropped);
>> +	}
>> +
>> +	return NETDEV_TX_OK;
>> +}
>> +
>> +static void team_change_rx_flags(struct net_device *dev, int change)
>> +{
>> +	struct team *team = netdev_priv(dev);
>> +	struct team_port *port;
>> +	int inc;
>> +
>> +	rcu_read_lock();
>> +	list_for_each_entry_rcu(port, &team->port_list, list) {
>> +		if (change & IFF_PROMISC) {
>> +			inc = dev->flags & IFF_PROMISC ? 1 : -1;
>> +			dev_set_promiscuity(port->dev, inc);
>> +		}
>> +		if (change & IFF_ALLMULTI) {
>> +			inc = dev->flags & IFF_ALLMULTI ? 1 : -1;
>> +			dev_set_allmulti(port->dev, inc);
>> +		}
>> +	}
>> +	rcu_read_unlock();
>> +}
>> +
>> +static void team_set_rx_mode(struct net_device *dev)
>> +{
>> +	struct team *team = netdev_priv(dev);
>> +	struct team_port *port;
>> +
>> +	rcu_read_lock();
>> +	list_for_each_entry_rcu(port, &team->port_list, list) {
>> +		dev_uc_sync(port->dev, dev);
>> +		dev_mc_sync(port->dev, dev);
>> +	}
>> +	rcu_read_unlock();
>> +}
>> +
>> +static int team_set_mac_address(struct net_device *dev, void *p)
>> +{
>> +	struct team *team = netdev_priv(dev);
>> +	struct team_port *port;
>> +	struct sockaddr *addr = p;
>> +
>> +	memcpy(dev->dev_addr, addr->sa_data, ETH_ALEN);
>> +	rcu_read_lock();
>> +	list_for_each_entry_rcu(port, &team->port_list, list)
>> +		if (team->mode_ops.port_change_mac)
>> +			team->mode_ops.port_change_mac(team, port);
>> +	rcu_read_unlock();
>> +	return 0;
>> +}
>> +
>> +static int team_change_mtu(struct net_device *dev, int new_mtu)
>> +{
>> +	struct team *team = netdev_priv(dev);
>> +	struct team_port *port;
>> +	int err;
>> +
>> +	rcu_read_lock();
>> +	list_for_each_entry_rcu(port, &team->port_list, list) {
>> +		err = dev_set_mtu(port->dev, new_mtu);
>> +		if (err) {
>> +			netdev_err(dev, "Device %s failed to change mtu",
>> +				   port->dev->name);
>> +			goto unwind;
>> +		}
>> +	}
>> +	rcu_read_unlock();
>> +
>> +	dev->mtu = new_mtu;
>> +
>> +	return 0;
>> +
>> +unwind:
>> +	list_for_each_entry_continue_reverse(port, &team->port_list, list)
>> +		dev_set_mtu(port->dev, dev->mtu);
>
>It may be worth noting that backwards list traversal is not rcu safe.
>Under rcu_read_lock() list elements will not be freed but the list may
>be modified. Moreover, list_del_rcu() poisons ->prev pointers.
>
>In this case it doesn't really matter though. As Eric pointed out
>previously, we are under rtnl protection and no rcu is needed. Perhaps
>all the extra rcu locking should be removed?

Thanks for catching this. I will probably fix it in an incremental patch
(provided there are no more problems with this one).
list_for_each_entry_continue_reverse_rcu needs to be added to rculist.h.
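
For illustration only (this is not part of the posted patch): another way
to avoid following ->prev at all is to unwind by walking forward from the
list head again and stopping at the port that failed. Below is a minimal
userspace sketch of that idea. struct port, set_mtu() and change_mtu_all()
are hypothetical stand-ins for team_port, dev_set_mtu() and
team_change_mtu(), and the list is a plain NULL-terminated singly linked
list rather than a kernel list_head.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for a team port; only the fields the sketch needs. */
struct port {
	struct port *next;	/* forward link, the only one the sketch follows */
	int mtu;
	int max_mtu;		/* made-up limit so that set_mtu() can fail */
};

/* Stand-in for dev_set_mtu(): 0 on success, -EINVAL if mtu is too large. */
static int set_mtu(struct port *p, int mtu)
{
	if (mtu > p->max_mtu)
		return -EINVAL;
	p->mtu = mtu;
	return 0;
}

/*
 * Apply new_mtu to every port. On failure, roll back by walking forward
 * from the head again and stopping at the port that failed, so no
 * backward pointer is ever dereferenced.
 */
static int change_mtu_all(struct port *head, int old_mtu, int new_mtu)
{
	struct port *p, *failed = NULL;
	int err = 0;

	for (p = head; p; p = p->next) {
		err = set_mtu(p, new_mtu);
		if (err) {
			failed = p;
			break;
		}
	}
	if (!err)
		return 0;

	/* Forward-only unwind: restore every port before the failed one. */
	for (p = head; p && p != failed; p = p->next) {
		int rc = set_mtu(p, old_mtu);

		/* old_mtu was accepted before, so restoring it should work. */
		assert(rc == 0);
		(void)rc;
	}
	return err;
}
```

The second walk costs one more pass over the ports, but only on the error
path. Note that if the list can change between the two walks, the unwind
may not visit exactly the ports that were modified, so this sketch still
assumes writers are excluded while the MTU change is in progress.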

Regarding rtnl: as I stated for v1 of this patch, I do not want to
depend on rtnl. Team has a spinlock to serialize writers, and in this
case the team_change_mtu code is a reader.

Thanks.
Jirka


>
>-Ben
>
>> +
>> +	rcu_read_unlock();
>> +	return err;
>> +}
>> +

* Re: [patch net-next V2] net: introduce ethernet teaming device
From: Benjamin Poirier @ 2011-10-21 14:00 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: netdev, davem, eric.dumazet, bhutchings, shemminger, fubar, andy,
	tgraf, ebiederm, mirqus, kaber, greearb, jesse, fbl, jzupka
In-Reply-To: <1319200747-2508-1-git-send-email-jpirko@redhat.com>

On 11/10/21 14:39, Jiri Pirko wrote:
> This patch introduces a new network device called team. It is meant to be
> a very fast, simple, userspace-driven alternative to the existing bonding
> driver.
> 
> A userspace library called libteam, with a couple of demo apps, is available
> here:
> https://github.com/jpirko/libteam
> Note it's still in its diapers atm.
> 
> team<->libteam use generic netlink for communication. That and rtnl
> are supposed to be the only ways to configure a team device, no sysfs etc.
> 
> A Python binding basis for libteam was recently introduced (some work
> still needs to be done on it though). A daemon providing arpmon/miimon
> active-backup functionality will be introduced shortly.
> Everything necessary is already implemented in the kernel team driver.
> 
> Signed-off-by: Jiri Pirko <jpirko@redhat.com>
> 
> v1->v2:
> 	- modes are made as modules. Makes team more modular and
> 	  extendable.
> 	- several commenters' nitpicks found on v1 were fixed
> 	- several other bugs were fixed.
> 	- note I ignored Eric's comment about roundrobin port selector
> 	  as Eric's way may be easily implemented as another mode (mode
> 	  "random") in future.
> ---
>  Documentation/networking/team.txt         |    2 +
>  MAINTAINERS                               |    7 +
>  drivers/net/Kconfig                       |    2 +
>  drivers/net/Makefile                      |    1 +
>  drivers/net/team/Kconfig                  |   38 +
>  drivers/net/team/Makefile                 |    7 +
>  drivers/net/team/team.c                   | 1593 +++++++++++++++++++++++++++++
>  drivers/net/team/team_mode_activebackup.c |  152 +++
>  drivers/net/team/team_mode_roundrobin.c   |  107 ++
>  include/linux/Kbuild                      |    1 +
>  include/linux/if.h                        |    1 +
>  include/linux/if_team.h                   |  233 +++++
>  12 files changed, 2144 insertions(+), 0 deletions(-)
>  create mode 100644 Documentation/networking/team.txt
>  create mode 100644 drivers/net/team/Kconfig
>  create mode 100644 drivers/net/team/Makefile
>  create mode 100644 drivers/net/team/team.c
>  create mode 100644 drivers/net/team/team_mode_activebackup.c
>  create mode 100644 drivers/net/team/team_mode_roundrobin.c
>  create mode 100644 include/linux/if_team.h
> 
> diff --git a/Documentation/networking/team.txt b/Documentation/networking/team.txt
> new file mode 100644
> index 0000000..5a01368
> --- /dev/null
> +++ b/Documentation/networking/team.txt
> @@ -0,0 +1,2 @@
> +Team devices are driven from userspace via libteam library which is here:
> +	https://github.com/jpirko/libteam
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 5008b08..c33400d 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -6372,6 +6372,13 @@ W:	http://tcp-lp-mod.sourceforge.net/
>  S:	Maintained
>  F:	net/ipv4/tcp_lp.c
>  
> +TEAM DRIVER
> +M:	Jiri Pirko <jpirko@redhat.com>
> +L:	netdev@vger.kernel.org
> +S:	Supported
> +F:	drivers/net/team/
> +F:	include/linux/if_team.h
> +
>  TEGRA SUPPORT
>  M:	Colin Cross <ccross@android.com>
>  M:	Erik Gilling <konkers@android.com>
> diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
> index 583f66c..b3020be 100644
> --- a/drivers/net/Kconfig
> +++ b/drivers/net/Kconfig
> @@ -125,6 +125,8 @@ config IFB
>  	  'ifb1' etc.
>  	  Look at the iproute2 documentation directory for usage etc
>  
> +source "drivers/net/team/Kconfig"
> +
>  config MACVLAN
>  	tristate "MAC-VLAN support (EXPERIMENTAL)"
>  	depends on EXPERIMENTAL
> diff --git a/drivers/net/Makefile b/drivers/net/Makefile
> index fa877cd..4e4ebfe 100644
> --- a/drivers/net/Makefile
> +++ b/drivers/net/Makefile
> @@ -17,6 +17,7 @@ obj-$(CONFIG_NET) += Space.o loopback.o
>  obj-$(CONFIG_NETCONSOLE) += netconsole.o
>  obj-$(CONFIG_PHYLIB) += phy/
>  obj-$(CONFIG_RIONET) += rionet.o
> +obj-$(CONFIG_NET_TEAM) += team/
>  obj-$(CONFIG_TUN) += tun.o
>  obj-$(CONFIG_VETH) += veth.o
>  obj-$(CONFIG_VIRTIO_NET) += virtio_net.o
> diff --git a/drivers/net/team/Kconfig b/drivers/net/team/Kconfig
> new file mode 100644
> index 0000000..70a43a6
> --- /dev/null
> +++ b/drivers/net/team/Kconfig
> @@ -0,0 +1,38 @@
> +menuconfig NET_TEAM
> +	tristate "Ethernet team driver support (EXPERIMENTAL)"
> +	depends on EXPERIMENTAL
> +	---help---
> +	  This allows one to create virtual interfaces that teams together
> +	  multiple ethernet devices.
> +
> +	  Team devices can be added using the "ip" command from the
> +	  iproute2 package:
> +
> +	  "ip link add link [ address MAC ] [ NAME ] type team"
> +
> +	  To compile this driver as a module, choose M here: the module
> +	  will be called team.
> +
> +if NET_TEAM
> +
> +config NET_TEAM_MODE_ROUNDROBIN
> +	tristate "Round-robin mode support"
> +	depends on NET_TEAM
> +	---help---
> +	  Basic mode where port used for transmitting packets is selected in
> +	  round-robin fashion using packet counter.
> +
> +	  To compile this team mode as a module, choose M here: the module
> +	  will be called team_mode_roundrobin.
> +
> +config NET_TEAM_MODE_ACTIVEBACKUP
> +	tristate "Active-backup mode support"
> +	depends on NET_TEAM
> +	---help---
> +	  Only one port is active at a time and the rest of ports are used
> +	  for backup.
> +
> +	  To compile this team mode as a module, choose M here: the module
> +	  will be called team_mode_activebackup.
> +
> +endif # NET_TEAM
> diff --git a/drivers/net/team/Makefile b/drivers/net/team/Makefile
> new file mode 100644
> index 0000000..85f2028
> --- /dev/null
> +++ b/drivers/net/team/Makefile
> @@ -0,0 +1,7 @@
> +#
> +# Makefile for the network team driver
> +#
> +
> +obj-$(CONFIG_NET_TEAM) += team.o
> +obj-$(CONFIG_NET_TEAM_MODE_ROUNDROBIN) += team_mode_roundrobin.o
> +obj-$(CONFIG_NET_TEAM_MODE_ACTIVEBACKUP) += team_mode_activebackup.o
> diff --git a/drivers/net/team/team.c b/drivers/net/team/team.c
> new file mode 100644
> index 0000000..398be58
> --- /dev/null
> +++ b/drivers/net/team/team.c
> @@ -0,0 +1,1593 @@
> +/*
> + * net/drivers/team/team.c - Network team device driver
> + * Copyright (c) 2011 Jiri Pirko <jpirko@redhat.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; either version 2 of the License, or
> + * (at your option) any later version.
> + */
> +
> +#include <linux/kernel.h>
> +#include <linux/types.h>
> +#include <linux/module.h>
> +#include <linux/init.h>
> +#include <linux/slab.h>
> +#include <linux/rcupdate.h>
> +#include <linux/errno.h>
> +#include <linux/ctype.h>
> +#include <linux/notifier.h>
> +#include <linux/netdevice.h>
> +#include <linux/if_arp.h>
> +#include <linux/socket.h>
> +#include <linux/etherdevice.h>
> +#include <linux/rtnetlink.h>
> +#include <net/rtnetlink.h>
> +#include <net/genetlink.h>
> +#include <net/netlink.h>
> +#include <linux/if_team.h>
> +
> +#define DRV_NAME "team"
> +
> +
> +/**********
> + * Helpers
> + **********/
> +
> +#define team_port_exists(dev) (dev->priv_flags & IFF_TEAM_PORT)
> +
> +static struct team_port *team_port_get_rcu(const struct net_device *dev)
> +{
> +	struct team_port *port = rcu_dereference(dev->rx_handler_data);
> +
> +	return team_port_exists(dev) ? port : NULL;
> +}
> +
> +static struct team_port *team_port_get_rtnl(const struct net_device *dev)
> +{
> +	struct team_port *port = rtnl_dereference(dev->rx_handler_data);
> +
> +	return team_port_exists(dev) ? port : NULL;
> +}
> +
> +/*
> + * Since the ability to change mac address for open port device is tested in
> + * team_port_add, this function can be called without control of return value
> + */
> +static int __set_port_mac(struct net_device *port_dev,
> +			  const unsigned char *dev_addr)
> +{
> +	struct sockaddr addr;
> +
> +	memcpy(addr.sa_data, dev_addr, ETH_ALEN);
> +	addr.sa_family = ARPHRD_ETHER;
> +	return dev_set_mac_address(port_dev, &addr);
> +}
> +
> +int team_port_set_orig_mac(struct team_port *port)
> +{
> +	return __set_port_mac(port->dev, port->orig.dev_addr);
> +}
> +EXPORT_SYMBOL(team_port_set_orig_mac);
> +
> +int team_port_set_team_mac(struct team_port *port)
> +{
> +	return __set_port_mac(port->dev, port->team->dev->dev_addr);
> +}
> +EXPORT_SYMBOL(team_port_set_team_mac);
> +
> +
> +/*******************
> + * Options handling
> + *******************/
> +
> +void team_options_register(struct team *team, struct team_option *option,
> +			   size_t option_count)
> +{
> +	int i;
> +
> +	for (i = 0; i < option_count; i++, option++)
> +		list_add_tail(&option->list, &team->option_list);
> +}
> +EXPORT_SYMBOL(team_options_register);
> +
> +static void __team_options_change_check(struct team *team,
> +					struct team_option *changed_option);
> +
> +static void __team_options_unregister(struct team *team,
> +				      struct team_option *option,
> +				      size_t option_count)
> +{
> +	int i;
> +
> +	for (i = 0; i < option_count; i++, option++)
> +		list_del(&option->list);
> +}
> +
> +void team_options_unregister(struct team *team, struct team_option *option,
> +			     size_t option_count)
> +{
> +	__team_options_unregister(team, option, option_count);
> +	__team_options_change_check(team, NULL);
> +}
> +EXPORT_SYMBOL(team_options_unregister);
> +
> +static int team_option_get(struct team *team, struct team_option *option,
> +			   void *arg)
> +{
> +	return option->getter(team, arg);
> +}
> +
> +static int team_option_set(struct team *team, struct team_option *option,
> +			   void *arg)
> +{
> +	int err;
> +
> +	err = option->setter(team, arg);
> +	if (err)
> +		return err;
> +
> +	__team_options_change_check(team, option);
> +	return err;
> +}
> +
> +/****************
> + * Mode handling
> + ****************/
> +
> +static LIST_HEAD(mode_list);
> +static DEFINE_SPINLOCK(mode_list_lock);
> +
> +static struct team_mode *__find_mode(const char *kind)
> +{
> +	struct team_mode *mode;
> +
> +	list_for_each_entry(mode, &mode_list, list) {
> +		if (strcmp(mode->kind, kind) == 0)
> +			return mode;
> +	}
> +	return NULL;
> +}
> +
> +static bool is_good_mode_name(const char *name)
> +{
> +	while (*name != '\0') {
> +		if (!isalpha(*name) && !isdigit(*name) && *name != '_')
> +			return false;
> +		name++;
> +	}
> +	return true;
> +}
> +
> +int team_mode_register(struct team_mode *mode)
> +{
> +	int err = 0;
> +
> +	if (!is_good_mode_name(mode->kind) ||
> +	    mode->priv_size > TEAM_MODE_PRIV_SIZE)
> +		return -EINVAL;
> +	spin_lock(&mode_list_lock);
> +	if (__find_mode(mode->kind)) {
> +		err = -EEXIST;
> +		goto unlock;
> +	}
> +	list_add_tail(&mode->list, &mode_list);
> +unlock:
> +	spin_unlock(&mode_list_lock);
> +	return err;
> +}
> +EXPORT_SYMBOL(team_mode_register);
> +
> +int team_mode_unregister(struct team_mode *mode)
> +{
> +	spin_lock(&mode_list_lock);
> +	list_del_init(&mode->list);
> +	spin_unlock(&mode_list_lock);
> +	return 0;
> +}
> +EXPORT_SYMBOL(team_mode_unregister);
> +
> +static struct team_mode *team_mode_get(const char *kind)
> +{
> +	struct team_mode *mode;
> +
> +	spin_lock(&mode_list_lock);
> +	mode = __find_mode(kind);
> +	if (!mode) {
> +		spin_unlock(&mode_list_lock);
> +		request_module("team-mode-%s", kind);
> +		spin_lock(&mode_list_lock);
> +		mode = __find_mode(kind);
> +	}
> +	if (mode)
> +		if (!try_module_get(mode->owner))
> +			mode = NULL;
> +
> +	spin_unlock(&mode_list_lock);
> +	return mode;
> +}
> +
> +static void team_mode_put(const char *kind)
> +{
> +	struct team_mode *mode;
> +
> +	spin_lock(&mode_list_lock);
> +	mode = __find_mode(kind);
> +	BUG_ON(!mode);
> +	module_put(mode->owner);
> +	spin_unlock(&mode_list_lock);
> +}
> +
> +/*
> + * We can benefit from the fact that it's ensured no port is present
> + * at the time of mode change.
> + */
> +static int __team_change_mode(struct team *team,
> +			      const struct team_mode *new_mode)
> +{
> +	/* Check if mode was previously set and do cleanup if so */
> +	if (team->mode_kind) {
> +		void (*exit_op)(struct team *team) = team->mode_ops.exit;
> +
> +		/* Clear ops area so no callback is called any longer */
> +		memset(&team->mode_ops, 0, sizeof(struct team_mode_ops));
> +
> +		synchronize_rcu();
> +
> +		if (exit_op)
> +			exit_op(team);
> +		team_mode_put(team->mode_kind);
> +		team->mode_kind = NULL;
> +		/* zero private data area */
> +		memset(&team->mode_priv, 0,
> +		       sizeof(struct team) - offsetof(struct team, mode_priv));
> +	}
> +
> +	if (!new_mode)
> +		return 0;
> +
> +	if (new_mode->ops->init) {
> +		int err;
> +
> +		err = new_mode->ops->init(team);
> +		if (err)
> +			return err;
> +	}
> +
> +	team->mode_kind = new_mode->kind;
> +	memcpy(&team->mode_ops, new_mode->ops, sizeof(struct team_mode_ops));
> +
> +	return 0;
> +}
> +
> +static int team_change_mode(struct team *team, const char *kind)
> +{
> +	struct team_mode *new_mode;
> +	struct net_device *dev = team->dev;
> +	int err;
> +
> +	if (!list_empty(&team->port_list)) {
> +		netdev_err(dev, "No ports can be present during mode change\n");
> +		return -EBUSY;
> +	}
> +
> +	if (team->mode_kind && strcmp(team->mode_kind, kind) == 0) {
> +		netdev_err(dev, "Unable to change to the same mode the team is in\n");
> +		return -EINVAL;
> +	}
> +
> +	new_mode = team_mode_get(kind);
> +	if (!new_mode) {
> +		netdev_err(dev, "Mode \"%s\" not found\n", kind);
> +		return -EINVAL;
> +	}
> +
> +	err = __team_change_mode(team, new_mode);
> +	if (err) {
> +		netdev_err(dev, "Failed to change to mode \"%s\"\n", kind);
> +		team_mode_put(kind);
> +		return err;
> +	}
> +
> +	netdev_info(dev, "Mode changed to \"%s\"\n", kind);
> +	return 0;
> +}
> +
> +
> +/************************
> + * Rx path frame handler
> + ************************/
> +
> +/* note: already called with rcu_read_lock */
> +static rx_handler_result_t team_handle_frame(struct sk_buff **pskb)
> +{
> +	struct sk_buff *skb = *pskb;
> +	struct team_port *port;
> +	struct team *team;
> +	rx_handler_result_t res = RX_HANDLER_ANOTHER;
> +
> +	skb = skb_share_check(skb, GFP_ATOMIC);
> +	if (!skb)
> +		return RX_HANDLER_CONSUMED;
> +
> +	*pskb = skb;
> +
> +	port = team_port_get_rcu(skb->dev);
> +	team = port->team;
> +
> +	if (team->mode_ops.receive)
> +		res = team->mode_ops.receive(team, port, skb);
> +
> +	if (res == RX_HANDLER_ANOTHER) {
> +		struct team_pcpu_stats *pcpu_stats;
> +
> +		pcpu_stats = this_cpu_ptr(team->pcpu_stats);
> +		u64_stats_update_begin(&pcpu_stats->syncp);
> +		pcpu_stats->rx_packets++;
> +		pcpu_stats->rx_bytes += skb->len;
> +		if (skb->pkt_type == PACKET_MULTICAST)
> +			pcpu_stats->rx_multicast++;
> +		u64_stats_update_end(&pcpu_stats->syncp);
> +
> +		skb->dev = team->dev;
> +	} else {
> +		this_cpu_inc(team->pcpu_stats->rx_dropped);
> +	}
> +
> +	return res;
> +}
> +
> +
> +/****************
> + * Port handling
> + ****************/
> +
> +static bool team_port_find(const struct team *team,
> +			   const struct team_port *port)
> +{
> +	struct team_port *cur;
> +
> +	list_for_each_entry(cur, &team->port_list, list)
> +		if (cur == port)
> +			return true;
> +	return false;
> +}
> +
> +static int team_port_list_init(struct team *team)
> +{
> +	int i;
> +	struct hlist_head *hash;
> +
> +	hash = kmalloc(sizeof(*hash) * TEAM_PORT_HASHENTRIES, GFP_KERNEL);
> +	if (!hash)
> +		return -ENOMEM;
> +
> +	for (i = 0; i < TEAM_PORT_HASHENTRIES; i++)
> +		INIT_HLIST_HEAD(&hash[i]);
> +	team->port_hlist = hash;
> +	INIT_LIST_HEAD(&team->port_list);
> +	return 0;
> +}
> +
> +static void team_port_list_fini(struct team *team)
> +{
> +	kfree(team->port_hlist);
> +}
> +
> +/*
> + * Add/delete port to the team port list. Write guarded by rtnl_lock.
> + * Takes care of correct port->index setup (might be racy).
> + */
> +static void team_port_list_add_port(struct team *team,
> +				    struct team_port *port)
> +{
> +	port->index = team->port_count++;
> +	hlist_add_head_rcu(&port->hlist,
> +			   team_port_index_hash(team, port->index));
> +	list_add_tail_rcu(&port->list, &team->port_list);
> +}
> +
> +static void __reconstruct_port_hlist(struct team *team, int rm_index)
> +{
> +	int i;
> +	struct team_port *port;
> +
> +	for (i = rm_index + 1; i < team->port_count; i++) {
> +		port = team_get_port_by_index_rcu(team, i);
> +		hlist_del_rcu(&port->hlist);
> +		port->index--;
> +		hlist_add_head_rcu(&port->hlist,
> +				   team_port_index_hash(team, port->index));
> +	}
> +}
> +
> +static void team_port_list_del_port(struct team *team,
> +				   struct team_port *port)
> +{
> +	int rm_index = port->index;
> +
> +	hlist_del_rcu(&port->hlist);
> +	list_del_rcu(&port->list);
> +	__reconstruct_port_hlist(team, rm_index);
> +	team->port_count--;
> +}
> +
> +#define TEAM_VLAN_FEATURES (NETIF_F_ALL_CSUM | NETIF_F_SG | \
> +			    NETIF_F_FRAGLIST | NETIF_F_ALL_TSO | \
> +			    NETIF_F_HIGHDMA | NETIF_F_LRO)
> +
> +static void __team_compute_features(struct team *team)
> +{
> +	struct team_port *port;
> +	u32 vlan_features = TEAM_VLAN_FEATURES;
> +	unsigned short max_hard_header_len = ETH_HLEN;
> +
> +	list_for_each_entry(port, &team->port_list, list) {
> +		vlan_features = netdev_increment_features(vlan_features,
> +					port->dev->vlan_features,
> +					TEAM_VLAN_FEATURES);
> +
> +		if (port->dev->hard_header_len > max_hard_header_len)
> +			max_hard_header_len = port->dev->hard_header_len;
> +	}
> +
> +	team->dev->vlan_features = vlan_features;
> +	team->dev->hard_header_len = max_hard_header_len;
> +
> +	netdev_change_features(team->dev);
> +}
> +
> +static void team_compute_features(struct team *team)
> +{
> +	spin_lock(&team->lock);
> +	__team_compute_features(team);
> +	spin_unlock(&team->lock);
> +}
> +
> +static int team_port_enter(struct team *team, struct team_port *port)
> +{
> +	int err = 0;
> +
> +	dev_hold(team->dev);
> +	port->dev->priv_flags |= IFF_TEAM_PORT;
> +	if (team->mode_ops.port_enter) {
> +		err = team->mode_ops.port_enter(team, port);
> +		if (err)
> +			netdev_err(team->dev, "Device %s failed to enter team mode\n",
> +				   port->dev->name);
> +	}
> +	return err;
> +}
> +
> +static void team_port_leave(struct team *team, struct team_port *port)
> +{
> +	if (team->mode_ops.port_leave)
> +		team->mode_ops.port_leave(team, port);
> +	port->dev->priv_flags &= ~IFF_TEAM_PORT;
> +	dev_put(team->dev);
> +}
> +
> +static void __team_port_change_check(struct team_port *port, bool linkup);
> +
> +static int team_port_add(struct team *team, struct net_device *port_dev)
> +{
> +	struct net_device *dev = team->dev;
> +	struct team_port *port;
> +	char *portname = port_dev->name;
> +	char tmp_addr[ETH_ALEN];
> +	int err;
> +
> +	if (port_dev->flags & IFF_LOOPBACK ||
> +	    port_dev->type != ARPHRD_ETHER) {
> +		netdev_err(dev, "Device %s is of an unsupported type\n",
> +			   portname);
> +		return -EINVAL;
> +	}
> +
> +	if (team_port_exists(port_dev)) {
> +		netdev_err(dev, "Device %s is already a port "
> +				"of a team device\n", portname);
> +		return -EBUSY;
> +	}
> +
> +	if (port_dev->flags & IFF_UP) {
> +		netdev_err(dev, "Device %s is up. Set it down before adding it as a team port\n",
> +			   portname);
> +		return -EBUSY;
> +	}
> +
> +	port = kzalloc(sizeof(struct team_port), GFP_KERNEL);
> +	if (!port)
> +		return -ENOMEM;
> +
> +	port->dev = port_dev;
> +	port->team = team;
> +
> +	port->orig.mtu = port_dev->mtu;
> +	err = dev_set_mtu(port_dev, dev->mtu);
> +	if (err) {
> +		netdev_dbg(dev, "Error %d calling dev_set_mtu\n", err);
> +		goto err_set_mtu;
> +	}
> +
> +	memcpy(port->orig.dev_addr, port_dev->dev_addr, ETH_ALEN);
> +	random_ether_addr(tmp_addr);
> +	err = __set_port_mac(port_dev, tmp_addr);
> +	if (err) {
> +		netdev_dbg(dev, "Device %s mac addr set failed\n",
> +			   portname);
> +		goto err_set_mac_rand;
> +	}
> +
> +	err = dev_open(port_dev);
> +	if (err) {
> +		netdev_dbg(dev, "Device %s opening failed\n",
> +			   portname);
> +		goto err_dev_open;
> +	}
> +
> +	err = team_port_set_orig_mac(port);
> +	if (err) {
> +		netdev_dbg(dev, "Device %s mac addr set failed - Device does not support addr change when it's opened\n",
> +			   portname);
> +		goto err_set_mac_opened;
> +	}
> +
> +	err = team_port_enter(team, port);
> +	if (err) {
> +		netdev_err(dev, "Device %s failed to enter team mode\n",
> +			   portname);
> +		goto err_port_enter;
> +	}
> +
> +	err = netdev_set_master(port_dev, dev);
> +	if (err) {
> +		netdev_err(dev, "Device %s failed to set master\n", portname);
> +		goto err_set_master;
> +	}
> +
> +	err = netdev_rx_handler_register(port_dev, team_handle_frame,
> +					 port);
> +	if (err) {
> +		netdev_err(dev, "Device %s failed to register rx_handler\n",
> +			   portname);
> +		goto err_handler_register;
> +	}
> +
> +	team_port_list_add_port(team, port);
> +	__team_compute_features(team);
> +	__team_port_change_check(port, !!netif_carrier_ok(port_dev));
> +
> +	netdev_info(dev, "Port device %s added\n", portname);
> +
> +	return 0;
> +
> +err_handler_register:
> +	netdev_set_master(port_dev, NULL);
> +
> +err_set_master:
> +	team_port_leave(team, port);
> +
> +err_port_enter:
> +err_set_mac_opened:
> +	dev_close(port_dev);
> +
> +err_dev_open:
> +	team_port_set_orig_mac(port);
> +
> +err_set_mac_rand:
> +	dev_set_mtu(port_dev, port->orig.mtu);
> +
> +err_set_mtu:
> +	kfree(port);
> +
> +	return err;
> +}
> +
> +static int team_port_del(struct team *team, struct net_device *port_dev)
> +{
> +	struct net_device *dev = team->dev;
> +	struct team_port *port;
> +	char *portname = port_dev->name;
> +
> +	port = team_port_get_rtnl(port_dev);
> +	if (!port || !team_port_find(team, port)) {
> +		netdev_err(dev, "Device %s does not act as a port of this team\n",
> +			   portname);
> +		return -ENOENT;
> +	}
> +
> +	__team_port_change_check(port, false);
> +	team_port_list_del_port(team, port);
> +	netdev_rx_handler_unregister(port_dev);
> +	netdev_set_master(port_dev, NULL);
> +	team_port_leave(team, port);
> +	dev_close(port_dev);
> +	team_port_set_orig_mac(port);
> +	dev_set_mtu(port_dev, port->orig.mtu);
> +	synchronize_rcu();
> +	kfree(port);
> +	netdev_info(dev, "Port device %s removed\n", portname);
> +	__team_compute_features(team);
> +
> +	return 0;
> +}
> +
> +
> +/*****************
> + * Net device ops
> + *****************/
> +
> +static const char team_no_mode_kind[] = "*NOMODE*";
> +
> +static int team_mode_option_get(struct team *team, void *arg)
> +{
> +	const char **str = arg;
> +
> +	*str = team->mode_kind ? team->mode_kind : team_no_mode_kind;
> +	return 0;
> +}
> +
> +static int team_mode_option_set(struct team *team, void *arg)
> +{
> +	const char **str = arg;
> +
> +	return team_change_mode(team, *str);
> +}
> +
> +static struct team_option team_options[] = {
> +	{
> +		.name = "mode",
> +		.type = TEAM_OPTION_TYPE_STRING,
> +		.getter = team_mode_option_get,
> +		.setter = team_mode_option_set,
> +	},
> +};
> +
> +static int team_init(struct net_device *dev)
> +{
> +	struct team *team = netdev_priv(dev);
> +	int err;
> +
> +	team->dev = dev;
> +	spin_lock_init(&team->lock);
> +
> +	team->pcpu_stats = alloc_percpu(struct team_pcpu_stats);
> +	if (!team->pcpu_stats)
> +		return -ENOMEM;
> +
> +	err = team_port_list_init(team);
> +	if (err)
> +		goto err_port_list_init;
> +
> +	INIT_LIST_HEAD(&team->option_list);
> +	team_options_register(team, team_options, ARRAY_SIZE(team_options));
> +	netif_carrier_off(dev);
> +
> +	return 0;
> +
> +err_port_list_init:
> +
> +	free_percpu(team->pcpu_stats);
> +
> +	return err;
> +}
> +
> +static void team_uninit(struct net_device *dev)
> +{
> +	struct team *team = netdev_priv(dev);
> +	struct team_port *port;
> +	struct team_port *tmp;
> +
> +	spin_lock(&team->lock);
> +	list_for_each_entry_safe(port, tmp, &team->port_list, list)
> +		team_port_del(team, port->dev);
> +
> +	__team_change_mode(team, NULL); /* cleanup */
> +	__team_options_unregister(team, team_options, ARRAY_SIZE(team_options));
> +	spin_unlock(&team->lock);
> +}
> +
> +static void team_destructor(struct net_device *dev)
> +{
> +	struct team *team = netdev_priv(dev);
> +
> +	team_port_list_fini(team);
> +	free_percpu(team->pcpu_stats);
> +	free_netdev(dev);
> +}
> +
> +static int team_open(struct net_device *dev)
> +{
> +	netif_carrier_on(dev);
> +	return 0;
> +}
> +
> +static int team_close(struct net_device *dev)
> +{
> +	netif_carrier_off(dev);
> +	return 0;
> +}
> +
> +/*
> + * note: already called with rcu_read_lock
> + */
> +static netdev_tx_t team_xmit(struct sk_buff *skb, struct net_device *dev)
> +{
> +	struct team *team = netdev_priv(dev);
> +	bool tx_success = false;
> +	unsigned int len = skb->len;
> +
> +	/*
> +	 * Ensure transmit function is called only in case there is at least
> +	 * one port present.
> +	 */
> +	if (likely(!list_empty(&team->port_list) && team->mode_ops.transmit))
> +		tx_success = team->mode_ops.transmit(team, skb);
> +	if (tx_success) {
> +		struct team_pcpu_stats *pcpu_stats;
> +
> +		pcpu_stats = this_cpu_ptr(team->pcpu_stats);
> +		u64_stats_update_begin(&pcpu_stats->syncp);
> +		pcpu_stats->tx_packets++;
> +		pcpu_stats->tx_bytes += len;
> +		u64_stats_update_end(&pcpu_stats->syncp);
> +	} else {
> +		this_cpu_inc(team->pcpu_stats->tx_dropped);
> +	}
> +
> +	return NETDEV_TX_OK;
> +}
> +
> +static void team_change_rx_flags(struct net_device *dev, int change)
> +{
> +	struct team *team = netdev_priv(dev);
> +	struct team_port *port;
> +	int inc;
> +
> +	rcu_read_lock();
> +	list_for_each_entry_rcu(port, &team->port_list, list) {
> +		if (change & IFF_PROMISC) {
> +			inc = dev->flags & IFF_PROMISC ? 1 : -1;
> +			dev_set_promiscuity(port->dev, inc);
> +		}
> +		if (change & IFF_ALLMULTI) {
> +			inc = dev->flags & IFF_ALLMULTI ? 1 : -1;
> +			dev_set_allmulti(port->dev, inc);
> +		}
> +	}
> +	rcu_read_unlock();
> +}
> +
> +static void team_set_rx_mode(struct net_device *dev)
> +{
> +	struct team *team = netdev_priv(dev);
> +	struct team_port *port;
> +
> +	rcu_read_lock();
> +	list_for_each_entry_rcu(port, &team->port_list, list) {
> +		dev_uc_sync(port->dev, dev);
> +		dev_mc_sync(port->dev, dev);
> +	}
> +	rcu_read_unlock();
> +}
> +
> +static int team_set_mac_address(struct net_device *dev, void *p)
> +{
> +	struct team *team = netdev_priv(dev);
> +	struct team_port *port;
> +	struct sockaddr *addr = p;
> +
> +	memcpy(dev->dev_addr, addr->sa_data, ETH_ALEN);
> +	rcu_read_lock();
> +	list_for_each_entry_rcu(port, &team->port_list, list)
> +		if (team->mode_ops.port_change_mac)
> +			team->mode_ops.port_change_mac(team, port);
> +	rcu_read_unlock();
> +	return 0;
> +}
> +
> +static int team_change_mtu(struct net_device *dev, int new_mtu)
> +{
> +	struct team *team = netdev_priv(dev);
> +	struct team_port *port;
> +	int err;
> +
> +	rcu_read_lock();
> +	list_for_each_entry_rcu(port, &team->port_list, list) {
> +		err = dev_set_mtu(port->dev, new_mtu);
> +		if (err) {
> +			netdev_err(dev, "Device %s failed to change mtu",
> +				   port->dev->name);
> +			goto unwind;
> +		}
> +	}
> +	rcu_read_unlock();
> +
> +	dev->mtu = new_mtu;
> +
> +	return 0;
> +
> +unwind:
> +	list_for_each_entry_continue_reverse(port, &team->port_list, list)
> +		dev_set_mtu(port->dev, dev->mtu);

It may be worth noting that backwards list traversal is not rcu safe.
Under rcu_read_lock() list elements will not be freed but the list may
be modified. Moreover, list_del_rcu() poisons ->prev pointers.

In this case it doesn't really matter though. As Eric pointed out
previously, we are under rtnl protection and no rcu is needed. Perhaps
all the extra rcu locking should be removed?
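
To illustrate the ->prev point, here is a minimal userspace sketch. It is
not kernel code: struct node, DEMO_POISON and the demo_* helpers are
hypothetical stand-ins that only mimic what list_del_rcu() does to an
entry (unlink it, leave ->next intact, poison ->prev).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace stand-in for the kernel's struct list_head (assumption). */
struct node {
	struct node *next;
	struct node *prev;
};

/* Illustrative poison value; the kernel uses LIST_POISON2 for ->prev. */
#define DEMO_POISON ((struct node *)(uintptr_t)0x200)

static void demo_list_init(struct node *head)
{
	head->next = head;
	head->prev = head;
}

static void demo_list_add_tail(struct node *n, struct node *head)
{
	n->prev = head->prev;
	n->next = head;
	head->prev->next = n;
	head->prev = n;
}

/*
 * Mimics list_del_rcu(): unlink the entry but keep ->next so a reader
 * already standing on it can still walk forward; ->prev is poisoned.
 */
static void demo_list_del_rcu(struct node *n)
{
	n->next->prev = n->prev;
	n->prev->next = n->next;
	n->prev = DEMO_POISON;
}

/* Number of entries reachable by walking forward from n up to head. */
static int demo_forward_steps(const struct node *n, const struct node *head)
{
	int steps = 0;

	for (n = n->next; n != head; n = n->next)
		steps++;
	return steps;
}

/*
 * Build head -> a -> b -> c, delete b as a writer would, then look at b
 * the way a reader still holding a pointer to it would.
 * Returns 1 if b's ->prev is poisoned; stores the forward walk length.
 */
static int demo_run(int *forward_steps)
{
	struct node head, a, b, c;

	assert(forward_steps != NULL);
	demo_list_init(&head);
	demo_list_add_tail(&a, &head);
	demo_list_add_tail(&b, &head);
	demo_list_add_tail(&c, &head);

	demo_list_del_rcu(&b);

	*forward_steps = demo_forward_steps(&b, &head);
	return b.prev == DEMO_POISON;
}
```

In this sketch a forward walk from the removed entry still reaches the
list head, while a reverse walk would dereference the poison value on
its first step.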

-Ben

> +
> +	rcu_read_unlock();
> +	return err;
> +}
> +

* [patch net-next V2] net: introduce ethernet teaming device
From: Jiri Pirko @ 2011-10-21 12:39 UTC (permalink / raw)
  To: netdev
  Cc: davem, eric.dumazet, bhutchings, shemminger, fubar, andy, tgraf,
	ebiederm, mirqus, kaber, greearb, jesse, fbl, benjamin.poirier,
	jzupka

This patch introduces a new network device called team. It is supposed to
be a very fast, simple, userspace-driven alternative to the existing
bonding driver.

A userspace library called libteam, with a couple of demo apps, is
available here:
https://github.com/jpirko/libteam
Note it's still in its diapers atm.

team<->libteam use generic netlink for communication. That and rtnl are
supposed to be the only ways to configure a team device, no sysfs etc.

A Python binding basis for libteam was recently introduced (some work
still needs to be done on it though). A daemon providing arpmon/miimon
active-backup functionality will be introduced shortly.
Everything necessary is already implemented in the kernel team driver.

Signed-off-by: Jiri Pirko <jpirko@redhat.com>

v1->v2:
	- modes are made as modules. Makes team more modular and
	  extendable.
	- several commenters' nitpicks found on v1 were fixed
	- several other bugs were fixed.
	- note I ignored Eric's comment about roundrobin port selector
	  as Eric's way may be easily implemented as another mode (mode
	  "random") in future.
---
 Documentation/networking/team.txt         |    2 +
 MAINTAINERS                               |    7 +
 drivers/net/Kconfig                       |    2 +
 drivers/net/Makefile                      |    1 +
 drivers/net/team/Kconfig                  |   38 +
 drivers/net/team/Makefile                 |    7 +
 drivers/net/team/team.c                   | 1593 +++++++++++++++++++++++++++++
 drivers/net/team/team_mode_activebackup.c |  152 +++
 drivers/net/team/team_mode_roundrobin.c   |  107 ++
 include/linux/Kbuild                      |    1 +
 include/linux/if.h                        |    1 +
 include/linux/if_team.h                   |  233 +++++
 12 files changed, 2144 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/networking/team.txt
 create mode 100644 drivers/net/team/Kconfig
 create mode 100644 drivers/net/team/Makefile
 create mode 100644 drivers/net/team/team.c
 create mode 100644 drivers/net/team/team_mode_activebackup.c
 create mode 100644 drivers/net/team/team_mode_roundrobin.c
 create mode 100644 include/linux/if_team.h

diff --git a/Documentation/networking/team.txt b/Documentation/networking/team.txt
new file mode 100644
index 0000000..5a01368
--- /dev/null
+++ b/Documentation/networking/team.txt
@@ -0,0 +1,2 @@
+Team devices are driven from userspace via libteam library which is here:
+	https://github.com/jpirko/libteam
diff --git a/MAINTAINERS b/MAINTAINERS
index 5008b08..c33400d 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -6372,6 +6372,13 @@ W:	http://tcp-lp-mod.sourceforge.net/
 S:	Maintained
 F:	net/ipv4/tcp_lp.c
 
+TEAM DRIVER
+M:	Jiri Pirko <jpirko@redhat.com>
+L:	netdev@vger.kernel.org
+S:	Supported
+F:	drivers/net/team/
+F:	include/linux/if_team.h
+
 TEGRA SUPPORT
 M:	Colin Cross <ccross@android.com>
 M:	Erik Gilling <konkers@android.com>
diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
index 583f66c..b3020be 100644
--- a/drivers/net/Kconfig
+++ b/drivers/net/Kconfig
@@ -125,6 +125,8 @@ config IFB
 	  'ifb1' etc.
 	  Look at the iproute2 documentation directory for usage etc
 
+source "drivers/net/team/Kconfig"
+
 config MACVLAN
 	tristate "MAC-VLAN support (EXPERIMENTAL)"
 	depends on EXPERIMENTAL
diff --git a/drivers/net/Makefile b/drivers/net/Makefile
index fa877cd..4e4ebfe 100644
--- a/drivers/net/Makefile
+++ b/drivers/net/Makefile
@@ -17,6 +17,7 @@ obj-$(CONFIG_NET) += Space.o loopback.o
 obj-$(CONFIG_NETCONSOLE) += netconsole.o
 obj-$(CONFIG_PHYLIB) += phy/
 obj-$(CONFIG_RIONET) += rionet.o
+obj-$(CONFIG_NET_TEAM) += team/
 obj-$(CONFIG_TUN) += tun.o
 obj-$(CONFIG_VETH) += veth.o
 obj-$(CONFIG_VIRTIO_NET) += virtio_net.o
diff --git a/drivers/net/team/Kconfig b/drivers/net/team/Kconfig
new file mode 100644
index 0000000..70a43a6
--- /dev/null
+++ b/drivers/net/team/Kconfig
@@ -0,0 +1,38 @@
+menuconfig NET_TEAM
+	tristate "Ethernet team driver support (EXPERIMENTAL)"
+	depends on EXPERIMENTAL
+	---help---
+	  This allows one to create virtual interfaces that team together
+	  multiple ethernet devices.
+
+	  Team devices can be added using the "ip" command from the
+	  iproute2 package:
+
+	  "ip link add link [ address MAC ] [ NAME ] type team"
+
+	  To compile this driver as a module, choose M here: the module
+	  will be called team.
+
+if NET_TEAM
+
+config NET_TEAM_MODE_ROUNDROBIN
+	tristate "Round-robin mode support"
+	depends on NET_TEAM
+	---help---
+	  Basic mode where port used for transmitting packets is selected in
+	  round-robin fashion using packet counter.
+
+	  To compile this team mode as a module, choose M here: the module
+	  will be called team_mode_roundrobin.
+
+config NET_TEAM_MODE_ACTIVEBACKUP
+	tristate "Active-backup mode support"
+	depends on NET_TEAM
+	---help---
+	  Only one port is active at a time and the rest of ports are used
+	  for backup.
+
+	  To compile this team mode as a module, choose M here: the module
+	  will be called team_mode_activebackup.
+
+endif # NET_TEAM
diff --git a/drivers/net/team/Makefile b/drivers/net/team/Makefile
new file mode 100644
index 0000000..85f2028
--- /dev/null
+++ b/drivers/net/team/Makefile
@@ -0,0 +1,7 @@
+#
+# Makefile for the network team driver
+#
+
+obj-$(CONFIG_NET_TEAM) += team.o
+obj-$(CONFIG_NET_TEAM_MODE_ROUNDROBIN) += team_mode_roundrobin.o
+obj-$(CONFIG_NET_TEAM_MODE_ACTIVEBACKUP) += team_mode_activebackup.o
diff --git a/drivers/net/team/team.c b/drivers/net/team/team.c
new file mode 100644
index 0000000..398be58
--- /dev/null
+++ b/drivers/net/team/team.c
@@ -0,0 +1,1593 @@
+/*
+ * drivers/net/team/team.c - Network team device driver
+ * Copyright (c) 2011 Jiri Pirko <jpirko@redhat.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ */
+
+#include <linux/kernel.h>
+#include <linux/types.h>
+#include <linux/module.h>
+#include <linux/init.h>
+#include <linux/slab.h>
+#include <linux/rcupdate.h>
+#include <linux/errno.h>
+#include <linux/ctype.h>
+#include <linux/notifier.h>
+#include <linux/netdevice.h>
+#include <linux/if_arp.h>
+#include <linux/socket.h>
+#include <linux/etherdevice.h>
+#include <linux/rtnetlink.h>
+#include <net/rtnetlink.h>
+#include <net/genetlink.h>
+#include <net/netlink.h>
+#include <linux/if_team.h>
+
+#define DRV_NAME "team"
+
+
+/**********
+ * Helpers
+ **********/
+
+#define team_port_exists(dev) ((dev)->priv_flags & IFF_TEAM_PORT)
+
+static struct team_port *team_port_get_rcu(const struct net_device *dev)
+{
+	struct team_port *port = rcu_dereference(dev->rx_handler_data);
+
+	return team_port_exists(dev) ? port : NULL;
+}
+
+static struct team_port *team_port_get_rtnl(const struct net_device *dev)
+{
+	struct team_port *port = rtnl_dereference(dev->rx_handler_data);
+
+	return team_port_exists(dev) ? port : NULL;
+}
+
+/*
+ * Since the ability to change mac address for open port device is tested in
+ * team_port_add, this function can be called without checking return value
+ */
+static int __set_port_mac(struct net_device *port_dev,
+			  const unsigned char *dev_addr)
+{
+	struct sockaddr addr;
+
+	memcpy(addr.sa_data, dev_addr, ETH_ALEN);
+	addr.sa_family = ARPHRD_ETHER;
+	return dev_set_mac_address(port_dev, &addr);
+}
+
+int team_port_set_orig_mac(struct team_port *port)
+{
+	return __set_port_mac(port->dev, port->orig.dev_addr);
+}
+EXPORT_SYMBOL(team_port_set_orig_mac);
+
+int team_port_set_team_mac(struct team_port *port)
+{
+	return __set_port_mac(port->dev, port->team->dev->dev_addr);
+}
+EXPORT_SYMBOL(team_port_set_team_mac);
+
+
+/*******************
+ * Options handling
+ *******************/
+
+void team_options_register(struct team *team, struct team_option *option,
+			   size_t option_count)
+{
+	int i;
+
+	for (i = 0; i < option_count; i++, option++)
+		list_add_tail(&option->list, &team->option_list);
+}
+EXPORT_SYMBOL(team_options_register);
+
+static void __team_options_change_check(struct team *team,
+					struct team_option *changed_option);
+
+static void __team_options_unregister(struct team *team,
+				      struct team_option *option,
+				      size_t option_count)
+{
+	int i;
+
+	for (i = 0; i < option_count; i++, option++)
+		list_del(&option->list);
+}
+
+void team_options_unregister(struct team *team, struct team_option *option,
+			     size_t option_count)
+{
+	__team_options_unregister(team, option, option_count);
+	__team_options_change_check(team, NULL);
+}
+EXPORT_SYMBOL(team_options_unregister);
+
+static int team_option_get(struct team *team, struct team_option *option,
+			   void *arg)
+{
+	return option->getter(team, arg);
+}
+
+static int team_option_set(struct team *team, struct team_option *option,
+			   void *arg)
+{
+	int err;
+
+	err = option->setter(team, arg);
+	if (err)
+		return err;
+
+	__team_options_change_check(team, option);
+	return err;
+}
+
+/****************
+ * Mode handling
+ ****************/
+
+static LIST_HEAD(mode_list);
+static DEFINE_SPINLOCK(mode_list_lock);
+
+static struct team_mode *__find_mode(const char *kind)
+{
+	struct team_mode *mode;
+
+	list_for_each_entry(mode, &mode_list, list) {
+		if (strcmp(mode->kind, kind) == 0)
+			return mode;
+	}
+	return NULL;
+}
+
+static bool is_good_mode_name(const char *name)
+{
+	while (*name != '\0') {
+		if (!isalpha(*name) && !isdigit(*name) && *name != '_')
+			return false;
+		name++;
+	}
+	return true;
+}
+
+int team_mode_register(struct team_mode *mode)
+{
+	int err = 0;
+
+	if (!is_good_mode_name(mode->kind) ||
+	    mode->priv_size > TEAM_MODE_PRIV_SIZE)
+		return -EINVAL;
+	spin_lock(&mode_list_lock);
+	if (__find_mode(mode->kind)) {
+		err = -EEXIST;
+		goto unlock;
+	}
+	list_add_tail(&mode->list, &mode_list);
+unlock:
+	spin_unlock(&mode_list_lock);
+	return err;
+}
+EXPORT_SYMBOL(team_mode_register);
+
+int team_mode_unregister(struct team_mode *mode)
+{
+	spin_lock(&mode_list_lock);
+	list_del_init(&mode->list);
+	spin_unlock(&mode_list_lock);
+	return 0;
+}
+EXPORT_SYMBOL(team_mode_unregister);
+
+static struct team_mode *team_mode_get(const char *kind)
+{
+	struct team_mode *mode;
+
+	spin_lock(&mode_list_lock);
+	mode = __find_mode(kind);
+	if (!mode) {
+		spin_unlock(&mode_list_lock);
+		request_module("team-mode-%s", kind);
+		spin_lock(&mode_list_lock);
+		mode = __find_mode(kind);
+	}
+	if (mode)
+		if (!try_module_get(mode->owner))
+			mode = NULL;
+
+	spin_unlock(&mode_list_lock);
+	return mode;
+}
+
+static void team_mode_put(const char *kind)
+{
+	struct team_mode *mode;
+
+	spin_lock(&mode_list_lock);
+	mode = __find_mode(kind);
+	BUG_ON(!mode);
+	module_put(mode->owner);
+	spin_unlock(&mode_list_lock);
+}
+
+/*
+ * We can benefit from the fact that it's ensured no port is present
+ * at the time of mode change.
+ */
+static int __team_change_mode(struct team *team,
+			      const struct team_mode *new_mode)
+{
+	/* Check if mode was previously set and do cleanup if so */
+	if (team->mode_kind) {
+		void (*exit_op)(struct team *team) = team->mode_ops.exit;
+
+		/* Clear ops area so no callback is called any longer */
+		memset(&team->mode_ops, 0, sizeof(struct team_mode_ops));
+
+		synchronize_rcu();
+
+		if (exit_op)
+			exit_op(team);
+		team_mode_put(team->mode_kind);
+		team->mode_kind = NULL;
+		/* zero private data area */
+		memset(&team->mode_priv, 0,
+		       sizeof(struct team) - offsetof(struct team, mode_priv));
+	}
+
+	if (!new_mode)
+		return 0;
+
+	if (new_mode->ops->init) {
+		int err;
+
+		err = new_mode->ops->init(team);
+		if (err)
+			return err;
+	}
+
+	team->mode_kind = new_mode->kind;
+	memcpy(&team->mode_ops, new_mode->ops, sizeof(struct team_mode_ops));
+
+	return 0;
+}
+
+static int team_change_mode(struct team *team, const char *kind)
+{
+	struct team_mode *new_mode;
+	struct net_device *dev = team->dev;
+	int err;
+
+	if (!list_empty(&team->port_list)) {
+		netdev_err(dev, "No ports can be present during mode change\n");
+		return -EBUSY;
+	}
+
+	if (team->mode_kind && strcmp(team->mode_kind, kind) == 0) {
+		netdev_err(dev, "Unable to change to the same mode the team is in\n");
+		return -EINVAL;
+	}
+
+	new_mode = team_mode_get(kind);
+	if (!new_mode) {
+		netdev_err(dev, "Mode \"%s\" not found\n", kind);
+		return -EINVAL;
+	}
+
+	err = __team_change_mode(team, new_mode);
+	if (err) {
+		netdev_err(dev, "Failed to change to mode \"%s\"\n", kind);
+		team_mode_put(kind);
+		return err;
+	}
+
+	netdev_info(dev, "Mode changed to \"%s\"\n", kind);
+	return 0;
+}
+
+
+/************************
+ * Rx path frame handler
+ ************************/
+
+/* note: already called with rcu_read_lock */
+static rx_handler_result_t team_handle_frame(struct sk_buff **pskb)
+{
+	struct sk_buff *skb = *pskb;
+	struct team_port *port;
+	struct team *team;
+	rx_handler_result_t res = RX_HANDLER_ANOTHER;
+
+	skb = skb_share_check(skb, GFP_ATOMIC);
+	if (!skb)
+		return RX_HANDLER_CONSUMED;
+
+	*pskb = skb;
+
+	port = team_port_get_rcu(skb->dev);
+	team = port->team;
+
+	if (team->mode_ops.receive)
+		res = team->mode_ops.receive(team, port, skb);
+
+	if (res == RX_HANDLER_ANOTHER) {
+		struct team_pcpu_stats *pcpu_stats;
+
+		pcpu_stats = this_cpu_ptr(team->pcpu_stats);
+		u64_stats_update_begin(&pcpu_stats->syncp);
+		pcpu_stats->rx_packets++;
+		pcpu_stats->rx_bytes += skb->len;
+		if (skb->pkt_type == PACKET_MULTICAST)
+			pcpu_stats->rx_multicast++;
+		u64_stats_update_end(&pcpu_stats->syncp);
+
+		skb->dev = team->dev;
+	} else {
+		this_cpu_inc(team->pcpu_stats->rx_dropped);
+	}
+
+	return res;
+}
+
+
+/****************
+ * Port handling
+ ****************/
+
+static bool team_port_find(const struct team *team,
+			   const struct team_port *port)
+{
+	struct team_port *cur;
+
+	list_for_each_entry(cur, &team->port_list, list)
+		if (cur == port)
+			return true;
+	return false;
+}
+
+static int team_port_list_init(struct team *team)
+{
+	int i;
+	struct hlist_head *hash;
+
+	hash = kmalloc(sizeof(*hash) * TEAM_PORT_HASHENTRIES, GFP_KERNEL);
+	if (!hash)
+		return -ENOMEM;
+
+	for (i = 0; i < TEAM_PORT_HASHENTRIES; i++)
+		INIT_HLIST_HEAD(&hash[i]);
+	team->port_hlist = hash;
+	INIT_LIST_HEAD(&team->port_list);
+	return 0;
+}
+
+static void team_port_list_fini(struct team *team)
+{
+	kfree(team->port_hlist);
+}
+
+/*
+ * Add/delete port to the team port list. Write guarded by rtnl_lock.
+ * Takes care of correct port->index setup (might be racy).
+ */
+static void team_port_list_add_port(struct team *team,
+				    struct team_port *port)
+{
+	port->index = team->port_count++;
+	hlist_add_head_rcu(&port->hlist,
+			   team_port_index_hash(team, port->index));
+	list_add_tail_rcu(&port->list, &team->port_list);
+}
+
+static void __reconstruct_port_hlist(struct team *team, int rm_index)
+{
+	int i;
+	struct team_port *port;
+
+	for (i = rm_index + 1; i < team->port_count; i++) {
+		port = team_get_port_by_index_rcu(team, i);
+		hlist_del_rcu(&port->hlist);
+		port->index--;
+		hlist_add_head_rcu(&port->hlist,
+				   team_port_index_hash(team, port->index));
+	}
+}
+
+static void team_port_list_del_port(struct team *team,
+				   struct team_port *port)
+{
+	int rm_index = port->index;
+
+	hlist_del_rcu(&port->hlist);
+	list_del_rcu(&port->list);
+	__reconstruct_port_hlist(team, rm_index);
+	team->port_count--;
+}
+
+#define TEAM_VLAN_FEATURES (NETIF_F_ALL_CSUM | NETIF_F_SG | \
+			    NETIF_F_FRAGLIST | NETIF_F_ALL_TSO | \
+			    NETIF_F_HIGHDMA | NETIF_F_LRO)
+
+static void __team_compute_features(struct team *team)
+{
+	struct team_port *port;
+	u32 vlan_features = TEAM_VLAN_FEATURES;
+	unsigned short max_hard_header_len = ETH_HLEN;
+
+	list_for_each_entry(port, &team->port_list, list) {
+		vlan_features = netdev_increment_features(vlan_features,
+					port->dev->vlan_features,
+					TEAM_VLAN_FEATURES);
+
+		if (port->dev->hard_header_len > max_hard_header_len)
+			max_hard_header_len = port->dev->hard_header_len;
+	}
+
+	team->dev->vlan_features = vlan_features;
+	team->dev->hard_header_len = max_hard_header_len;
+
+	netdev_change_features(team->dev);
+}
+
+static void team_compute_features(struct team *team)
+{
+	spin_lock(&team->lock);
+	__team_compute_features(team);
+	spin_unlock(&team->lock);
+}
+
+static int team_port_enter(struct team *team, struct team_port *port)
+{
+	int err = 0;
+
+	dev_hold(team->dev);
+	port->dev->priv_flags |= IFF_TEAM_PORT;
+	if (team->mode_ops.port_enter) {
+		err = team->mode_ops.port_enter(team, port);
+		if (err)
+			netdev_err(team->dev, "Device %s failed to enter team mode\n",
+				   port->dev->name);
+	}
+	return err;
+}
+
+static void team_port_leave(struct team *team, struct team_port *port)
+{
+	if (team->mode_ops.port_leave)
+		team->mode_ops.port_leave(team, port);
+	port->dev->priv_flags &= ~IFF_TEAM_PORT;
+	dev_put(team->dev);
+}
+
+static void __team_port_change_check(struct team_port *port, bool linkup);
+
+static int team_port_add(struct team *team, struct net_device *port_dev)
+{
+	struct net_device *dev = team->dev;
+	struct team_port *port;
+	char *portname = port_dev->name;
+	char tmp_addr[ETH_ALEN];
+	int err;
+
+	if (port_dev->flags & IFF_LOOPBACK ||
+	    port_dev->type != ARPHRD_ETHER) {
+		netdev_err(dev, "Device %s is of an unsupported type\n",
+			   portname);
+		return -EINVAL;
+	}
+
+	if (team_port_exists(port_dev)) {
+		netdev_err(dev, "Device %s is already a port "
+				"of a team device\n", portname);
+		return -EBUSY;
+	}
+
+	if (port_dev->flags & IFF_UP) {
+		netdev_err(dev, "Device %s is up. Set it down before adding it as a team port\n",
+			   portname);
+		return -EBUSY;
+	}
+
+	port = kzalloc(sizeof(struct team_port), GFP_KERNEL);
+	if (!port)
+		return -ENOMEM;
+
+	port->dev = port_dev;
+	port->team = team;
+
+	port->orig.mtu = port_dev->mtu;
+	err = dev_set_mtu(port_dev, dev->mtu);
+	if (err) {
+		netdev_dbg(dev, "Error %d calling dev_set_mtu\n", err);
+		goto err_set_mtu;
+	}
+
+	memcpy(port->orig.dev_addr, port_dev->dev_addr, ETH_ALEN);
+	random_ether_addr(tmp_addr);
+	err = __set_port_mac(port_dev, tmp_addr);
+	if (err) {
+		netdev_dbg(dev, "Device %s mac addr set failed\n",
+			   portname);
+		goto err_set_mac_rand;
+	}
+
+	err = dev_open(port_dev);
+	if (err) {
+		netdev_dbg(dev, "Device %s opening failed\n",
+			   portname);
+		goto err_dev_open;
+	}
+
+	err = team_port_set_orig_mac(port);
+	if (err) {
+		netdev_dbg(dev, "Device %s mac addr set failed - Device does not support addr change when it's opened\n",
+			   portname);
+		goto err_set_mac_opened;
+	}
+
+	err = team_port_enter(team, port);
+	if (err) {
+		netdev_err(dev, "Device %s failed to enter team mode\n",
+			   portname);
+		goto err_port_enter;
+	}
+
+	err = netdev_set_master(port_dev, dev);
+	if (err) {
+		netdev_err(dev, "Device %s failed to set master\n", portname);
+		goto err_set_master;
+	}
+
+	err = netdev_rx_handler_register(port_dev, team_handle_frame,
+					 port);
+	if (err) {
+		netdev_err(dev, "Device %s failed to register rx_handler\n",
+			   portname);
+		goto err_handler_register;
+	}
+
+	team_port_list_add_port(team, port);
+	__team_compute_features(team);
+	__team_port_change_check(port, !!netif_carrier_ok(port_dev));
+
+	netdev_info(dev, "Port device %s added\n", portname);
+
+	return 0;
+
+err_handler_register:
+	netdev_set_master(port_dev, NULL);
+
+err_set_master:
+	team_port_leave(team, port);
+
+err_port_enter:
+err_set_mac_opened:
+	dev_close(port_dev);
+
+err_dev_open:
+	team_port_set_orig_mac(port);
+
+err_set_mac_rand:
+	dev_set_mtu(port_dev, port->orig.mtu);
+
+err_set_mtu:
+	kfree(port);
+
+	return err;
+}
+
+static int team_port_del(struct team *team, struct net_device *port_dev)
+{
+	struct net_device *dev = team->dev;
+	struct team_port *port;
+	char *portname = port_dev->name;
+
+	port = team_port_get_rtnl(port_dev);
+	if (!port || !team_port_find(team, port)) {
+		netdev_err(dev, "Device %s does not act as a port of this team\n",
+			   portname);
+		return -ENOENT;
+	}
+
+	__team_port_change_check(port, false);
+	team_port_list_del_port(team, port);
+	netdev_rx_handler_unregister(port_dev);
+	netdev_set_master(port_dev, NULL);
+	team_port_leave(team, port);
+	dev_close(port_dev);
+	team_port_set_orig_mac(port);
+	dev_set_mtu(port_dev, port->orig.mtu);
+	synchronize_rcu();
+	kfree(port);
+	netdev_info(dev, "Port device %s removed\n", portname);
+	__team_compute_features(team);
+
+	return 0;
+}
+
+
+/*****************
+ * Net device ops
+ *****************/
+
+static const char team_no_mode_kind[] = "*NOMODE*";
+
+static int team_mode_option_get(struct team *team, void *arg)
+{
+	const char **str = arg;
+
+	*str = team->mode_kind ? team->mode_kind : team_no_mode_kind;
+	return 0;
+}
+
+static int team_mode_option_set(struct team *team, void *arg)
+{
+	const char **str = arg;
+
+	return team_change_mode(team, *str);
+}
+
+static struct team_option team_options[] = {
+	{
+		.name = "mode",
+		.type = TEAM_OPTION_TYPE_STRING,
+		.getter = team_mode_option_get,
+		.setter = team_mode_option_set,
+	},
+};
+
+static int team_init(struct net_device *dev)
+{
+	struct team *team = netdev_priv(dev);
+	int err;
+
+	team->dev = dev;
+	spin_lock_init(&team->lock);
+
+	team->pcpu_stats = alloc_percpu(struct team_pcpu_stats);
+	if (!team->pcpu_stats)
+		return -ENOMEM;
+
+	err = team_port_list_init(team);
+	if (err)
+		goto err_port_list_init;
+
+	INIT_LIST_HEAD(&team->option_list);
+	team_options_register(team, team_options, ARRAY_SIZE(team_options));
+	netif_carrier_off(dev);
+
+	return 0;
+
+err_port_list_init:
+
+	free_percpu(team->pcpu_stats);
+
+	return err;
+}
+
+static void team_uninit(struct net_device *dev)
+{
+	struct team *team = netdev_priv(dev);
+	struct team_port *port;
+	struct team_port *tmp;
+
+	spin_lock(&team->lock);
+	list_for_each_entry_safe(port, tmp, &team->port_list, list)
+		team_port_del(team, port->dev);
+
+	__team_change_mode(team, NULL); /* cleanup */
+	__team_options_unregister(team, team_options, ARRAY_SIZE(team_options));
+	spin_unlock(&team->lock);
+}
+
+static void team_destructor(struct net_device *dev)
+{
+	struct team *team = netdev_priv(dev);
+
+	team_port_list_fini(team);
+	free_percpu(team->pcpu_stats);
+	free_netdev(dev);
+}
+
+static int team_open(struct net_device *dev)
+{
+	netif_carrier_on(dev);
+	return 0;
+}
+
+static int team_close(struct net_device *dev)
+{
+	netif_carrier_off(dev);
+	return 0;
+}
+
+/*
+ * note: already called with rcu_read_lock
+ */
+static netdev_tx_t team_xmit(struct sk_buff *skb, struct net_device *dev)
+{
+	struct team *team = netdev_priv(dev);
+	bool tx_success = false;
+	unsigned int len = skb->len;
+
+	/*
+	 * Ensure transmit function is called only in case there is at least
+	 * one port present.
+	 */
+	if (likely(!list_empty(&team->port_list) && team->mode_ops.transmit))
+		tx_success = team->mode_ops.transmit(team, skb);
+	if (tx_success) {
+		struct team_pcpu_stats *pcpu_stats;
+
+		pcpu_stats = this_cpu_ptr(team->pcpu_stats);
+		u64_stats_update_begin(&pcpu_stats->syncp);
+		pcpu_stats->tx_packets++;
+		pcpu_stats->tx_bytes += len;
+		u64_stats_update_end(&pcpu_stats->syncp);
+	} else {
+		this_cpu_inc(team->pcpu_stats->tx_dropped);
+	}
+
+	return NETDEV_TX_OK;
+}
+
+static void team_change_rx_flags(struct net_device *dev, int change)
+{
+	struct team *team = netdev_priv(dev);
+	struct team_port *port;
+	int inc;
+
+	rcu_read_lock();
+	list_for_each_entry_rcu(port, &team->port_list, list) {
+		if (change & IFF_PROMISC) {
+			inc = dev->flags & IFF_PROMISC ? 1 : -1;
+			dev_set_promiscuity(port->dev, inc);
+		}
+		if (change & IFF_ALLMULTI) {
+			inc = dev->flags & IFF_ALLMULTI ? 1 : -1;
+			dev_set_allmulti(port->dev, inc);
+		}
+	}
+	rcu_read_unlock();
+}
+
+static void team_set_rx_mode(struct net_device *dev)
+{
+	struct team *team = netdev_priv(dev);
+	struct team_port *port;
+
+	rcu_read_lock();
+	list_for_each_entry_rcu(port, &team->port_list, list) {
+		dev_uc_sync(port->dev, dev);
+		dev_mc_sync(port->dev, dev);
+	}
+	rcu_read_unlock();
+}
+
+static int team_set_mac_address(struct net_device *dev, void *p)
+{
+	struct team *team = netdev_priv(dev);
+	struct team_port *port;
+	struct sockaddr *addr = p;
+
+	memcpy(dev->dev_addr, addr->sa_data, ETH_ALEN);
+	rcu_read_lock();
+	list_for_each_entry_rcu(port, &team->port_list, list)
+		if (team->mode_ops.port_change_mac)
+			team->mode_ops.port_change_mac(team, port);
+	rcu_read_unlock();
+	return 0;
+}
+
+static int team_change_mtu(struct net_device *dev, int new_mtu)
+{
+	struct team *team = netdev_priv(dev);
+	struct team_port *port;
+	int err;
+
+	rcu_read_lock();
+	list_for_each_entry_rcu(port, &team->port_list, list) {
+		err = dev_set_mtu(port->dev, new_mtu);
+		if (err) {
+			netdev_err(dev, "Device %s failed to change mtu\n",
+				   port->dev->name);
+			goto unwind;
+		}
+	}
+	rcu_read_unlock();
+
+	dev->mtu = new_mtu;
+
+	return 0;
+
+unwind:
+	list_for_each_entry_continue_reverse(port, &team->port_list, list)
+		dev_set_mtu(port->dev, dev->mtu);
+
+	rcu_read_unlock();
+	return err;
+}
+
+static struct rtnl_link_stats64 *
+team_get_stats64(struct net_device *dev, struct rtnl_link_stats64 *stats)
+{
+	struct team *team = netdev_priv(dev);
+	struct team_pcpu_stats *p;
+	u64 rx_packets, rx_bytes, rx_multicast, tx_packets, tx_bytes;
+	u32 rx_dropped = 0, tx_dropped = 0;
+	unsigned int start;
+	int i;
+
+	for_each_possible_cpu(i) {
+		p = per_cpu_ptr(team->pcpu_stats, i);
+		do {
+			start = u64_stats_fetch_begin_bh(&p->syncp);
+			rx_packets	= p->rx_packets;
+			rx_bytes	= p->rx_bytes;
+			rx_multicast	= p->rx_multicast;
+			tx_packets	= p->tx_packets;
+			tx_bytes	= p->tx_bytes;
+		} while (u64_stats_fetch_retry_bh(&p->syncp, start));
+
+		stats->rx_packets	+= rx_packets;
+		stats->rx_bytes		+= rx_bytes;
+		stats->multicast	+= rx_multicast;
+		stats->tx_packets	+= tx_packets;
+		stats->tx_bytes		+= tx_bytes;
+		/*
+		 * rx_dropped & tx_dropped are u32, updated
+		 * without syncp protection.
+		 */
+		rx_dropped	+= p->rx_dropped;
+		tx_dropped	+= p->tx_dropped;
+	}
+	stats->rx_dropped	= rx_dropped;
+	stats->tx_dropped	= tx_dropped;
+	return stats;
+}
+
+static void team_vlan_rx_add_vid(struct net_device *dev, uint16_t vid)
+{
+	struct team *team = netdev_priv(dev);
+	struct team_port *port;
+
+	rcu_read_lock();
+	list_for_each_entry_rcu(port, &team->port_list, list) {
+		const struct net_device_ops *ops = port->dev->netdev_ops;
+
+		ops->ndo_vlan_rx_add_vid(port->dev, vid);
+	}
+	rcu_read_unlock();
+}
+
+static void team_vlan_rx_kill_vid(struct net_device *dev, uint16_t vid)
+{
+	struct team *team = netdev_priv(dev);
+	struct team_port *port;
+
+	rcu_read_lock();
+	list_for_each_entry_rcu(port, &team->port_list, list) {
+		const struct net_device_ops *ops = port->dev->netdev_ops;
+
+		ops->ndo_vlan_rx_kill_vid(port->dev, vid);
+	}
+	rcu_read_unlock();
+}
+
+static int team_add_slave(struct net_device *dev, struct net_device *port_dev)
+{
+	struct team *team = netdev_priv(dev);
+	int err;
+
+	spin_lock(&team->lock);
+	err = team_port_add(team, port_dev);
+	spin_unlock(&team->lock);
+	return err;
+}
+
+static int team_del_slave(struct net_device *dev, struct net_device *port_dev)
+{
+	struct team *team = netdev_priv(dev);
+	int err;
+
+	spin_lock(&team->lock);
+	err = team_port_del(team, port_dev);
+	spin_unlock(&team->lock);
+	return err;
+}
+
+static const struct net_device_ops team_netdev_ops = {
+	.ndo_init		= team_init,
+	.ndo_uninit		= team_uninit,
+	.ndo_open		= team_open,
+	.ndo_stop		= team_close,
+	.ndo_start_xmit		= team_xmit,
+	.ndo_change_rx_flags	= team_change_rx_flags,
+	.ndo_set_rx_mode	= team_set_rx_mode,
+	.ndo_set_mac_address	= team_set_mac_address,
+	.ndo_change_mtu		= team_change_mtu,
+	.ndo_get_stats64	= team_get_stats64,
+	.ndo_vlan_rx_add_vid	= team_vlan_rx_add_vid,
+	.ndo_vlan_rx_kill_vid	= team_vlan_rx_kill_vid,
+	.ndo_add_slave		= team_add_slave,
+	.ndo_del_slave		= team_del_slave,
+};
+
+
+/***********************
+ * rt netlink interface
+ ***********************/
+
+static void team_setup(struct net_device *dev)
+{
+	ether_setup(dev);
+
+	dev->netdev_ops = &team_netdev_ops;
+	dev->destructor	= team_destructor;
+	dev->tx_queue_len = 0;
+	dev->flags |= IFF_MULTICAST;
+	dev->priv_flags &= ~(IFF_XMIT_DST_RELEASE | IFF_TX_SKB_SHARING);
+
+	/*
+	 * Indicate we support unicast address filtering. That way core won't
+	 * bring us to promisc mode in case a unicast addr is added.
+	 * Leave this up to the underlying drivers.
+	 */
+	dev->priv_flags |= IFF_UNICAST_FLT;
+
+	dev->features |= NETIF_F_LLTX;
+	dev->features |= NETIF_F_GRO;
+	dev->hw_features = NETIF_F_HW_VLAN_TX |
+			   NETIF_F_HW_VLAN_RX |
+			   NETIF_F_HW_VLAN_FILTER;
+
+	dev->features |= dev->hw_features;
+}
+
+static int team_newlink(struct net *src_net, struct net_device *dev,
+			struct nlattr *tb[], struct nlattr *data[])
+{
+	int err;
+
+	if (tb[IFLA_ADDRESS] == NULL)
+		random_ether_addr(dev->dev_addr);
+
+	err = register_netdevice(dev);
+	if (err)
+		return err;
+
+	return 0;
+}
+
+static int team_validate(struct nlattr *tb[], struct nlattr *data[])
+{
+	if (tb[IFLA_ADDRESS]) {
+		if (nla_len(tb[IFLA_ADDRESS]) != ETH_ALEN)
+			return -EINVAL;
+		if (!is_valid_ether_addr(nla_data(tb[IFLA_ADDRESS])))
+			return -EADDRNOTAVAIL;
+	}
+	return 0;
+}
+
+static struct rtnl_link_ops team_link_ops __read_mostly = {
+	.kind		= DRV_NAME,
+	.priv_size	= sizeof(struct team),
+	.setup		= team_setup,
+	.newlink	= team_newlink,
+	.validate	= team_validate,
+};
+
+
+/***********************************
+ * Generic netlink custom interface
+ ***********************************/
+
+static struct genl_family team_nl_family = {
+	.id		= GENL_ID_GENERATE,
+	.name		= TEAM_GENL_NAME,
+	.version	= TEAM_GENL_VERSION,
+	.maxattr	= TEAM_ATTR_MAX,
+	.netnsok	= true,
+};
+
+static const struct nla_policy team_nl_policy[TEAM_ATTR_MAX + 1] = {
+	[TEAM_ATTR_UNSPEC]			= { .type = NLA_UNSPEC, },
+	[TEAM_ATTR_TEAM_IFINDEX]		= { .type = NLA_U32 },
+	[TEAM_ATTR_LIST_OPTION]			= { .type = NLA_NESTED },
+	[TEAM_ATTR_LIST_PORT]			= { .type = NLA_NESTED },
+};
+
+static const struct nla_policy
+team_nl_option_policy[TEAM_ATTR_OPTION_MAX + 1] = {
+	[TEAM_ATTR_OPTION_UNSPEC]		= { .type = NLA_UNSPEC, },
+	[TEAM_ATTR_OPTION_NAME] = {
+		.type = NLA_STRING,
+		.len = TEAM_STRING_MAX_LEN,
+	},
+	[TEAM_ATTR_OPTION_CHANGED]		= { .type = NLA_FLAG },
+	[TEAM_ATTR_OPTION_TYPE]			= { .type = NLA_U8 },
+	[TEAM_ATTR_OPTION_DATA] = {
+		.type = NLA_BINARY,
+		.len = TEAM_STRING_MAX_LEN,
+	},
+};
+
+static int team_nl_cmd_noop(struct sk_buff *skb, struct genl_info *info)
+{
+	struct sk_buff *msg;
+	void *hdr;
+	int err;
+
+	msg = nlmsg_new(NLMSG_GOODSIZE, GFP_KERNEL);
+	if (!msg)
+		return -ENOMEM;
+
+	hdr = genlmsg_put(msg, info->snd_pid, info->snd_seq,
+			  &team_nl_family, 0, TEAM_CMD_NOOP);
+	if (IS_ERR(hdr)) {
+		err = PTR_ERR(hdr);
+		goto err_msg_put;
+	}
+
+	genlmsg_end(msg, hdr);
+
+	return genlmsg_unicast(genl_info_net(info), msg, info->snd_pid);
+
+err_msg_put:
+	nlmsg_free(msg);
+
+	return err;
+}
+
+/*
+ * Netlink cmd functions should take and release the team lock using the
+ * following two functions. To ensure team_uninit is not called in between,
+ * rcu_read_lock is held the whole time.
+ */
+static struct team *team_nl_team_get(struct genl_info *info)
+{
+	struct net *net = genl_info_net(info);
+	int ifindex;
+	struct net_device *dev;
+	struct team *team;
+
+	if (!info->attrs[TEAM_ATTR_TEAM_IFINDEX])
+		return NULL;
+
+	ifindex = nla_get_u32(info->attrs[TEAM_ATTR_TEAM_IFINDEX]);
+	rcu_read_lock();
+	dev = dev_get_by_index_rcu(net, ifindex);
+	if (!dev || dev->netdev_ops != &team_netdev_ops) {
+		rcu_read_unlock();
+		return NULL;
+	}
+
+	team = netdev_priv(dev);
+	spin_lock(&team->lock);
+	return team;
+}
+
+static void team_nl_team_put(struct team *team)
+{
+	spin_unlock(&team->lock);
+	rcu_read_unlock();
+}
+
+static int team_nl_send_generic(struct genl_info *info, struct team *team,
+				int (*fill_func)(struct sk_buff *skb,
+						 struct genl_info *info,
+						 int flags, struct team *team))
+{
+	struct sk_buff *skb;
+	int err;
+
+	skb = nlmsg_new(NLMSG_GOODSIZE, GFP_KERNEL);
+	if (!skb)
+		return -ENOMEM;
+
+	err = fill_func(skb, info, NLM_F_ACK, team);
+	if (err < 0)
+		goto err_fill;
+
+	err = genlmsg_unicast(genl_info_net(info), skb, info->snd_pid);
+	return err;
+
+err_fill:
+	nlmsg_free(skb);
+	return err;
+}
+
+static int team_nl_fill_options_get_changed(struct sk_buff *skb,
+					    u32 pid, u32 seq, int flags,
+					    struct team *team,
+					    struct team_option *changed_option)
+{
+	struct nlattr *option_list;
+	void *hdr;
+	struct team_option *option;
+
+	hdr = genlmsg_put(skb, pid, seq, &team_nl_family, flags,
+			  TEAM_CMD_OPTIONS_GET);
+	if (IS_ERR(hdr))
+		return PTR_ERR(hdr);
+
+	NLA_PUT_U32(skb, TEAM_ATTR_TEAM_IFINDEX, team->dev->ifindex);
+	option_list = nla_nest_start(skb, TEAM_ATTR_LIST_OPTION);
+	if (!option_list)
+		return -EMSGSIZE;
+
+	list_for_each_entry(option, &team->option_list, list) {
+		struct nlattr *option_item;
+		long arg;
+
+		option_item = nla_nest_start(skb, TEAM_ATTR_ITEM_OPTION);
+		if (!option_item)
+			goto nla_put_failure;
+		NLA_PUT_STRING(skb, TEAM_ATTR_OPTION_NAME, option->name);
+		if (option == changed_option)
+			NLA_PUT_FLAG(skb, TEAM_ATTR_OPTION_CHANGED);
+		switch (option->type) {
+		case TEAM_OPTION_TYPE_U32:
+			NLA_PUT_U8(skb, TEAM_ATTR_OPTION_TYPE, NLA_U32);
+			team_option_get(team, option, &arg);
+			NLA_PUT_U32(skb, TEAM_ATTR_OPTION_DATA, arg);
+			break;
+		case TEAM_OPTION_TYPE_STRING:
+			NLA_PUT_U8(skb, TEAM_ATTR_OPTION_TYPE, NLA_STRING);
+			team_option_get(team, option, &arg);
+			NLA_PUT_STRING(skb, TEAM_ATTR_OPTION_DATA,
+				       (char *) arg);
+			break;
+		default:
+			BUG();
+		}
+		nla_nest_end(skb, option_item);
+	}
+
+	nla_nest_end(skb, option_list);
+	return genlmsg_end(skb, hdr);
+
+nla_put_failure:
+	genlmsg_cancel(skb, hdr);
+	return -EMSGSIZE;
+}
+
+static int team_nl_fill_options_get(struct sk_buff *skb,
+				    struct genl_info *info, int flags,
+				    struct team *team)
+{
+	return team_nl_fill_options_get_changed(skb, info->snd_pid,
+						info->snd_seq, NLM_F_ACK,
+						team, NULL);
+}
+
+static int team_nl_cmd_options_get(struct sk_buff *skb, struct genl_info *info)
+{
+	struct team *team;
+	int err;
+
+	team = team_nl_team_get(info);
+	if (!team)
+		return -EINVAL;
+
+	err = team_nl_send_generic(info, team, team_nl_fill_options_get);
+
+	team_nl_team_put(team);
+
+	return err;
+}
+
+static int team_nl_cmd_options_set(struct sk_buff *skb, struct genl_info *info)
+{
+	struct team *team;
+	int err = 0;
+	int i;
+	struct nlattr *nl_option;
+
+	team = team_nl_team_get(info);
+	if (!team)
+		return -EINVAL;
+
+	err = -EINVAL;
+	if (!info->attrs[TEAM_ATTR_LIST_OPTION]) {
+		err = -EINVAL;
+		goto team_put;
+	}
+
+	nla_for_each_nested(nl_option, info->attrs[TEAM_ATTR_LIST_OPTION], i) {
+		struct nlattr *mode_attrs[TEAM_ATTR_OPTION_MAX + 1];
+		enum team_option_type opt_type;
+		struct team_option *option;
+		char *opt_name;
+		bool opt_found = false;
+
+		if (nla_type(nl_option) != TEAM_ATTR_ITEM_OPTION) {
+			err = -EINVAL;
+			goto team_put;
+		}
+		err = nla_parse_nested(mode_attrs, TEAM_ATTR_OPTION_MAX,
+				       nl_option, team_nl_option_policy);
+		if (err)
+			goto team_put;
+		if (!mode_attrs[TEAM_ATTR_OPTION_NAME] ||
+		    !mode_attrs[TEAM_ATTR_OPTION_TYPE] ||
+		    !mode_attrs[TEAM_ATTR_OPTION_DATA]) {
+			err = -EINVAL;
+			goto team_put;
+		}
+		switch (nla_get_u8(mode_attrs[TEAM_ATTR_OPTION_TYPE])) {
+		case NLA_U32:
+			opt_type = TEAM_OPTION_TYPE_U32;
+			break;
+		case NLA_STRING:
+			opt_type = TEAM_OPTION_TYPE_STRING;
+			break;
+		default:
+			err = -EINVAL;
+			goto team_put;
+		}
+
+		opt_name = nla_data(mode_attrs[TEAM_ATTR_OPTION_NAME]);
+		list_for_each_entry(option, &team->option_list, list) {
+			long arg;
+			struct nlattr *opt_data_attr;
+
+			if (option->type != opt_type ||
+			    strcmp(option->name, opt_name))
+				continue;
+			opt_found = true;
+			opt_data_attr = mode_attrs[TEAM_ATTR_OPTION_DATA];
+			switch (opt_type) {
+			case TEAM_OPTION_TYPE_U32:
+				arg = nla_get_u32(opt_data_attr);
+				break;
+			case TEAM_OPTION_TYPE_STRING:
+				arg = (long) nla_data(opt_data_attr);
+				break;
+			default:
+				BUG();
+			}
+			err = team_option_set(team, option, &arg);
+			if (err)
+				goto team_put;
+		}
+		if (!opt_found) {
+			err = -ENOENT;
+			goto team_put;
+		}
+	}
+
+team_put:
+	team_nl_team_put(team);
+
+	return err;
+}
+
+static int team_nl_fill_port_list_get_changed(struct sk_buff *skb,
+					      u32 pid, u32 seq, int flags,
+					      struct team *team,
+					      struct team_port *changed_port)
+{
+	struct nlattr *port_list;
+	void *hdr;
+	struct team_port *port;
+
+	hdr = genlmsg_put(skb, pid, seq, &team_nl_family, flags,
+			  TEAM_CMD_PORT_LIST_GET);
+	if (IS_ERR(hdr))
+		return PTR_ERR(hdr);
+
+	NLA_PUT_U32(skb, TEAM_ATTR_TEAM_IFINDEX, team->dev->ifindex);
+	port_list = nla_nest_start(skb, TEAM_ATTR_LIST_PORT);
+	if (!port_list)
+		return -EMSGSIZE;
+
+	list_for_each_entry_rcu(port, &team->port_list, list) {
+		struct nlattr *port_item;
+
+		port_item = nla_nest_start(skb, TEAM_ATTR_ITEM_PORT);
+		if (!port_item)
+			goto nla_put_failure;
+		NLA_PUT_U32(skb, TEAM_ATTR_PORT_IFINDEX, port->dev->ifindex);
+		if (port == changed_port)
+			NLA_PUT_FLAG(skb, TEAM_ATTR_PORT_CHANGED);
+		if (port->linkup)
+			NLA_PUT_FLAG(skb, TEAM_ATTR_PORT_LINKUP);
+		NLA_PUT_U32(skb, TEAM_ATTR_PORT_SPEED, port->speed);
+		NLA_PUT_U8(skb, TEAM_ATTR_PORT_DUPLEX, port->duplex);
+		nla_nest_end(skb, port_item);
+	}
+
+	nla_nest_end(skb, port_list);
+	return genlmsg_end(skb, hdr);
+
+nla_put_failure:
+	genlmsg_cancel(skb, hdr);
+	return -EMSGSIZE;
+}
+
+static int team_nl_fill_port_list_get(struct sk_buff *skb,
+				      struct genl_info *info, int flags,
+				      struct team *team)
+{
+	return team_nl_fill_port_list_get_changed(skb, info->snd_pid,
+						  info->snd_seq, NLM_F_ACK,
+						  team, NULL);
+}
+
+static int team_nl_cmd_port_list_get(struct sk_buff *skb,
+				     struct genl_info *info)
+{
+	struct team *team;
+	int err;
+
+	team = team_nl_team_get(info);
+	if (!team)
+		return -EINVAL;
+
+	err = team_nl_send_generic(info, team, team_nl_fill_port_list_get);
+
+	team_nl_team_put(team);
+
+	return err;
+}
+
+static struct genl_ops team_nl_ops[] = {
+	{
+		.cmd = TEAM_CMD_NOOP,
+		.doit = team_nl_cmd_noop,
+		.policy = team_nl_policy,
+	},
+	{
+		.cmd = TEAM_CMD_OPTIONS_SET,
+		.doit = team_nl_cmd_options_set,
+		.policy = team_nl_policy,
+		.flags = GENL_ADMIN_PERM,
+	},
+	{
+		.cmd = TEAM_CMD_OPTIONS_GET,
+		.doit = team_nl_cmd_options_get,
+		.policy = team_nl_policy,
+		.flags = GENL_ADMIN_PERM,
+	},
+	{
+		.cmd = TEAM_CMD_PORT_LIST_GET,
+		.doit = team_nl_cmd_port_list_get,
+		.policy = team_nl_policy,
+		.flags = GENL_ADMIN_PERM,
+	},
+};
+
+static struct genl_multicast_group team_change_event_mcgrp = {
+	.name = TEAM_GENL_CHANGE_EVENT_MC_GRP_NAME,
+};
+
+static int team_nl_send_event_options_get(struct team *team,
+					  struct team_option *changed_option)
+{
+	struct sk_buff *skb;
+	int err;
+	struct net *net = dev_net(team->dev);
+
+	skb = nlmsg_new(NLMSG_GOODSIZE, GFP_KERNEL);
+	if (!skb)
+		return -ENOMEM;
+
+	err = team_nl_fill_options_get_changed(skb, 0, 0, 0, team,
+					       changed_option);
+	if (err < 0)
+		goto err_fill;
+
+	err = genlmsg_multicast_netns(net, skb, 0, team_change_event_mcgrp.id,
+				      GFP_KERNEL);
+	return err;
+
+err_fill:
+	nlmsg_free(skb);
+	return err;
+}
+
+static int team_nl_send_event_port_list_get(struct team_port *port)
+{
+	struct sk_buff *skb;
+	int err;
+	struct net *net = dev_net(port->team->dev);
+
+	skb = nlmsg_new(NLMSG_GOODSIZE, GFP_KERNEL);
+	if (!skb)
+		return -ENOMEM;
+
+	err = team_nl_fill_port_list_get_changed(skb, 0, 0, 0,
+						 port->team, port);
+	if (err < 0)
+		goto err_fill;
+
+	err = genlmsg_multicast_netns(net, skb, 0, team_change_event_mcgrp.id,
+				      GFP_KERNEL);
+	return err;
+
+err_fill:
+	nlmsg_free(skb);
+	return err;
+}
+
+static int team_nl_init(void)
+{
+	int err;
+
+	err = genl_register_family_with_ops(&team_nl_family, team_nl_ops,
+					    ARRAY_SIZE(team_nl_ops));
+	if (err)
+		return err;
+
+	err = genl_register_mc_group(&team_nl_family, &team_change_event_mcgrp);
+	if (err)
+		goto err_change_event_grp_reg;
+
+	return 0;
+
+err_change_event_grp_reg:
+	genl_unregister_family(&team_nl_family);
+
+	return err;
+}
+
+static void team_nl_fini(void)
+{
+	genl_unregister_family(&team_nl_family);
+}
+
+
+/******************
+ * Change checkers
+ ******************/
+
+static void __team_options_change_check(struct team *team,
+					struct team_option *changed_option)
+{
+	int err;
+
+	err = team_nl_send_event_options_get(team, changed_option);
+	if (err)
+		netdev_warn(team->dev, "Failed to send options change via netlink\n");
+}
+
+/* rtnl lock is held */
+static void __team_port_change_check(struct team_port *port, bool linkup)
+{
+	int err;
+
+	if (port->linkup == linkup)
+		return;
+
+	port->linkup = linkup;
+	if (linkup) {
+		struct ethtool_cmd ecmd;
+
+		err = __ethtool_get_settings(port->dev, &ecmd);
+		if (!err) {
+			port->speed = ethtool_cmd_speed(&ecmd);
+			port->duplex = ecmd.duplex;
+			goto send_event;
+		}
+	}
+	port->speed = 0;
+	port->duplex = 0;
+
+send_event:
+	err = team_nl_send_event_port_list_get(port);
+	if (err)
+		netdev_warn(port->team->dev, "Failed to send port change of device %s via netlink\n",
+			    port->dev->name);
+
+}
+
+static void team_port_change_check(struct team_port *port, bool linkup)
+{
+	struct team *team = port->team;
+
+	spin_lock(&team->lock);
+	__team_port_change_check(port, linkup);
+	spin_unlock(&team->lock);
+}
+
+/************************************
+ * Net device notifier event handler
+ ************************************/
+
+static int team_device_event(struct notifier_block *unused,
+			     unsigned long event, void *ptr)
+{
+	struct net_device *dev = (struct net_device *) ptr;
+	struct team_port *port;
+
+	port = team_port_get_rtnl(dev);
+	if (!port)
+		return NOTIFY_DONE;
+
+	switch (event) {
+	case NETDEV_UP:
+		if (netif_carrier_ok(dev))
+			team_port_change_check(port, true);
+		break;
+	case NETDEV_DOWN:
+		team_port_change_check(port, false);
+		break;
+	case NETDEV_CHANGE:
+		if (netif_running(port->dev))
+			team_port_change_check(port,
+					       !!netif_carrier_ok(port->dev));
+		break;
+	case NETDEV_UNREGISTER:
+		team_del_slave(port->team->dev, dev);
+		break;
+	case NETDEV_FEAT_CHANGE:
+		team_compute_features(port->team);
+		break;
+	case NETDEV_CHANGEMTU:
+		/* Forbid changing the MTU of the underlying device */
+		return NOTIFY_BAD;
+	case NETDEV_CHANGEADDR:
+		/* Forbid changing the address of the underlying device */
+		return NOTIFY_BAD;
+	case NETDEV_PRE_TYPE_CHANGE:
+		/* Forbid changing the type of the underlying device */
+		return NOTIFY_BAD;
+	}
+	return NOTIFY_DONE;
+}
+
+static struct notifier_block team_notifier_block __read_mostly = {
+	.notifier_call = team_device_event,
+};
+
+
+/***********************
+ * Module init and exit
+ ***********************/
+
+static int __init team_module_init(void)
+{
+	int err;
+
+	register_netdevice_notifier(&team_notifier_block);
+
+	err = rtnl_link_register(&team_link_ops);
+	if (err)
+		goto err_rtln_reg;
+
+	err = team_nl_init();
+	if (err)
+		goto err_nl_init;
+
+	return 0;
+
+err_nl_init:
+	rtnl_link_unregister(&team_link_ops);
+
+err_rtln_reg:
+	unregister_netdevice_notifier(&team_notifier_block);
+
+	return err;
+}
+
+static void __exit team_module_exit(void)
+{
+	team_nl_fini();
+	rtnl_link_unregister(&team_link_ops);
+	unregister_netdevice_notifier(&team_notifier_block);
+}
+
+module_init(team_module_init);
+module_exit(team_module_exit);
+
+MODULE_LICENSE("GPL v2");
+MODULE_AUTHOR("Jiri Pirko <jpirko@redhat.com>");
+MODULE_DESCRIPTION("Ethernet team device driver");
+MODULE_ALIAS_RTNL_LINK(DRV_NAME);
diff --git a/drivers/net/team/team_mode_activebackup.c b/drivers/net/team/team_mode_activebackup.c
new file mode 100644
index 0000000..1aa2bfb
--- /dev/null
+++ b/drivers/net/team/team_mode_activebackup.c
@@ -0,0 +1,152 @@
+/*
+ * drivers/net/team/team_mode_activebackup.c - Active-backup mode for team
+ * Copyright (c) 2011 Jiri Pirko <jpirko@redhat.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ */
+
+#include <linux/kernel.h>
+#include <linux/types.h>
+#include <linux/module.h>
+#include <linux/init.h>
+#include <linux/errno.h>
+#include <linux/netdevice.h>
+#include <net/rtnetlink.h>
+#include <linux/if_team.h>
+
+struct ab_priv {
+	struct team_port __rcu *active_port;
+};
+
+static struct ab_priv *ab_priv(struct team *team)
+{
+	return (struct ab_priv *) &team->mode_priv;
+}
+
+static rx_handler_result_t ab_receive(struct team *team, struct team_port *port,
+				      struct sk_buff *skb) {
+	struct team_port *active_port;
+
+	active_port = rcu_dereference(ab_priv(team)->active_port);
+	if (active_port != port)
+		return RX_HANDLER_EXACT;
+	return RX_HANDLER_ANOTHER;
+}
+
+static bool ab_transmit(struct team *team, struct sk_buff *skb)
+{
+	struct team_port *active_port;
+
+	active_port = rcu_dereference(ab_priv(team)->active_port);
+	if (unlikely(!active_port))
+		goto drop;
+	skb->dev = active_port->dev;
+	if (dev_queue_xmit(skb))
+		return false;
+	return true;
+
+drop:
+	dev_kfree_skb(skb);
+	return false;
+}
+
+static void ab_port_leave(struct team *team, struct team_port *port)
+{
+	if (ab_priv(team)->active_port == port)
+		rcu_assign_pointer(ab_priv(team)->active_port, NULL);
+}
+
+static void ab_port_change_mac(struct team *team, struct team_port *port)
+{
+	if (ab_priv(team)->active_port == port)
+		team_port_set_team_mac(port);
+}
+
+static int ab_active_port_get(struct team *team, void *arg)
+{
+	u32 *ifindex = arg;
+
+	*ifindex = 0;
+	if (ab_priv(team)->active_port)
+		*ifindex = ab_priv(team)->active_port->dev->ifindex;
+	return 0;
+}
+
+static int ab_active_port_set(struct team *team, void *arg)
+{
+	u32 *ifindex = arg;
+	struct team_port *port;
+
+	list_for_each_entry_rcu(port, &team->port_list, list) {
+		if (port->dev->ifindex == *ifindex) {
+			struct team_port *ac_port = ab_priv(team)->active_port;
+
+			/* rtnl_lock needs to be held when setting macs */
+			rtnl_lock();
+			if (ac_port)
+				team_port_set_orig_mac(ac_port);
+			rcu_assign_pointer(ab_priv(team)->active_port, port);
+			team_port_set_team_mac(port);
+			rtnl_unlock();
+			return 0;
+		}
+	}
+	return -ENOENT;
+}
+
+static struct team_option ab_options[] = {
+	{
+		.name = "activeport",
+		.type = TEAM_OPTION_TYPE_U32,
+		.getter = ab_active_port_get,
+		.setter = ab_active_port_set,
+	},
+};
+
+static int ab_init(struct team *team)
+{
+	team_options_register(team, ab_options, ARRAY_SIZE(ab_options));
+	return 0;
+}
+
+static void ab_exit(struct team *team)
+{
+	team_options_unregister(team, ab_options, ARRAY_SIZE(ab_options));
+}
+
+static const struct team_mode_ops ab_mode_ops = {
+	.init			= ab_init,
+	.exit			= ab_exit,
+	.receive		= ab_receive,
+	.transmit		= ab_transmit,
+	.port_leave		= ab_port_leave,
+	.port_change_mac	= ab_port_change_mac,
+};
+
+static struct team_mode ab_mode = {
+	.kind		= "activebackup",
+	.owner		= THIS_MODULE,
+	.priv_size	= sizeof(struct ab_priv),
+	.ops		= &ab_mode_ops,
+};
+
+static int __init ab_init_module(void)
+{
+	return team_mode_register(&ab_mode);
+}
+
+static void __exit ab_cleanup_module(void)
+{
+	team_mode_unregister(&ab_mode);
+}
+
+module_init(ab_init_module);
+module_exit(ab_cleanup_module);
+
+MODULE_LICENSE("GPL v2");
+MODULE_AUTHOR("Jiri Pirko <jpirko@redhat.com>");
+MODULE_DESCRIPTION("Active-backup mode for team");
+MODULE_ALIAS("team-mode-activebackup");
diff --git a/drivers/net/team/team_mode_roundrobin.c b/drivers/net/team/team_mode_roundrobin.c
new file mode 100644
index 0000000..0374052
--- /dev/null
+++ b/drivers/net/team/team_mode_roundrobin.c
@@ -0,0 +1,107 @@
+/*
+ * drivers/net/team/team_mode_roundrobin.c - Round-robin mode for team
+ * Copyright (c) 2011 Jiri Pirko <jpirko@redhat.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ */
+
+#include <linux/kernel.h>
+#include <linux/types.h>
+#include <linux/module.h>
+#include <linux/init.h>
+#include <linux/errno.h>
+#include <linux/netdevice.h>
+#include <linux/if_team.h>
+
+struct rr_priv {
+	unsigned int sent_packets;
+};
+
+static struct rr_priv *rr_priv(struct team *team)
+{
+	return (struct rr_priv *) &team->mode_priv;
+}
+
+static struct team_port *__get_first_port_up(struct team *team,
+					     struct team_port *port)
+{
+	struct team_port *cur;
+
+	if (port->linkup)
+		return port;
+	cur = port;
+	list_for_each_entry_continue_rcu(cur, &team->port_list, list)
+		if (cur->linkup)
+			return cur;
+	list_for_each_entry_rcu(cur, &team->port_list, list) {
+		if (cur == port)
+			break;
+		if (cur->linkup)
+			return cur;
+	}
+	return NULL;
+}
+
+static bool rr_transmit(struct team *team, struct sk_buff *skb)
+{
+	struct team_port *port;
+	int port_index;
+
+	port_index = rr_priv(team)->sent_packets++ % team->port_count;
+	port = team_get_port_by_index_rcu(team, port_index);
+	port = __get_first_port_up(team, port);
+	if (unlikely(!port))
+		goto drop;
+	skb->dev = port->dev;
+	if (dev_queue_xmit(skb))
+		return false;
+	return true;
+
+drop:
+	dev_kfree_skb(skb);
+	return false;
+}
+
+static int rr_port_enter(struct team *team, struct team_port *port)
+{
+	return team_port_set_team_mac(port);
+}
+
+static void rr_port_change_mac(struct team *team, struct team_port *port)
+{
+	team_port_set_team_mac(port);
+}
+
+static const struct team_mode_ops rr_mode_ops = {
+	.transmit		= rr_transmit,
+	.port_enter		= rr_port_enter,
+	.port_change_mac	= rr_port_change_mac,
+};
+
+static struct team_mode rr_mode = {
+	.kind		= "roundrobin",
+	.owner		= THIS_MODULE,
+	.priv_size	= sizeof(struct rr_priv),
+	.ops		= &rr_mode_ops,
+};
+
+static int __init rr_init_module(void)
+{
+	return team_mode_register(&rr_mode);
+}
+
+static void __exit rr_cleanup_module(void)
+{
+	team_mode_unregister(&rr_mode);
+}
+
+module_init(rr_init_module);
+module_exit(rr_cleanup_module);
+
+MODULE_LICENSE("GPL v2");
+MODULE_AUTHOR("Jiri Pirko <jpirko@redhat.com>");
+MODULE_DESCRIPTION("Round-robin mode for team");
+MODULE_ALIAS("team-mode-roundrobin");
diff --git a/include/linux/Kbuild b/include/linux/Kbuild
index 619b565..0b091b3 100644
--- a/include/linux/Kbuild
+++ b/include/linux/Kbuild
@@ -185,6 +185,7 @@ header-y += if_pppol2tp.h
 header-y += if_pppox.h
 header-y += if_slip.h
 header-y += if_strip.h
+header-y += if_team.h
 header-y += if_tr.h
 header-y += if_tun.h
 header-y += if_tunnel.h
diff --git a/include/linux/if.h b/include/linux/if.h
index db20bd4..06b6ef6 100644
--- a/include/linux/if.h
+++ b/include/linux/if.h
@@ -79,6 +79,7 @@
 #define IFF_TX_SKB_SHARING	0x10000	/* The interface supports sharing
 					 * skbs on transmit */
 #define IFF_UNICAST_FLT	0x20000		/* Supports unicast filtering	*/
+#define IFF_TEAM_PORT	0x40000		/* device used as team port */
 
 #define IF_GET_IFACE	0x0001		/* for querying only */
 #define IF_GET_PROTO	0x0002
diff --git a/include/linux/if_team.h b/include/linux/if_team.h
new file mode 100644
index 0000000..21581a7
--- /dev/null
+++ b/include/linux/if_team.h
@@ -0,0 +1,233 @@
+/*
+ * include/linux/if_team.h - Network team device driver header
+ * Copyright (c) 2011 Jiri Pirko <jpirko@redhat.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ */
+
+#ifndef _LINUX_IF_TEAM_H_
+#define _LINUX_IF_TEAM_H_
+
+#ifdef __KERNEL__
+
+struct team_pcpu_stats {
+	u64			rx_packets;
+	u64			rx_bytes;
+	u64			rx_multicast;
+	u64			tx_packets;
+	u64			tx_bytes;
+	struct u64_stats_sync	syncp;
+	u32			rx_dropped;
+	u32			tx_dropped;
+};
+
+struct team;
+
+struct team_port {
+	struct net_device *dev;
+	struct hlist_node hlist; /* node in hash list */
+	struct list_head list; /* node in ordinary list */
+	struct team *team;
+	int index;
+
+	/*
+	 * A place for storing original values of the device before it
+	 * becomes a port.
+	 */
+	struct {
+		unsigned char dev_addr[MAX_ADDR_LEN];
+		unsigned int mtu;
+	} orig;
+
+	bool linkup;
+	u32 speed;
+	u8 duplex;
+
+	struct rcu_head rcu;
+};
+
+struct team_mode_ops {
+	int (*init)(struct team *team);
+	void (*exit)(struct team *team);
+	rx_handler_result_t (*receive)(struct team *team,
+				       struct team_port *port,
+				       struct sk_buff *skb);
+	bool (*transmit)(struct team *team, struct sk_buff *skb);
+	int (*port_enter)(struct team *team, struct team_port *port);
+	void (*port_leave)(struct team *team, struct team_port *port);
+	void (*port_change_mac)(struct team *team, struct team_port *port);
+};
+
+enum team_option_type {
+	TEAM_OPTION_TYPE_U32,
+	TEAM_OPTION_TYPE_STRING,
+};
+
+struct team_option {
+	struct list_head list;
+	const char *name;
+	enum team_option_type type;
+	int (*getter)(struct team *team, void *arg);
+	int (*setter)(struct team *team, void *arg);
+};
+
+struct team_mode {
+	struct list_head list;
+	const char *kind;
+	struct module *owner;
+	size_t priv_size;
+	const struct team_mode_ops *ops;
+};
+
+#define TEAM_MODE_PRIV_LONGS 4
+#define TEAM_MODE_PRIV_SIZE (sizeof(long) * TEAM_MODE_PRIV_LONGS)
+
+struct team {
+	struct net_device *dev; /* associated netdevice */
+	struct team_pcpu_stats __percpu *pcpu_stats;
+
+	spinlock_t lock; /* used for overall locking, e.g. port lists write */
+
+	/*
+	 * port lists with port count
+	 */
+	int port_count;
+	struct hlist_head *port_hlist;
+	struct list_head port_list;
+
+	struct list_head option_list;
+
+	const char *mode_kind;
+	struct team_mode_ops mode_ops;
+	long mode_priv[TEAM_MODE_PRIV_LONGS];
+};
+
+#define TEAM_PORT_HASHBITS 4
+#define TEAM_PORT_HASHENTRIES (1 << TEAM_PORT_HASHBITS)
+
+static inline struct hlist_head *
+team_port_index_hash(const struct team *team,
+		     int port_index)
+{
+	return &team->port_hlist[port_index & (TEAM_PORT_HASHENTRIES - 1)];
+}
+
+static inline struct team_port *
+team_get_port_by_index_rcu(const struct team *team,
+			   int port_index)
+{
+	struct hlist_node *p;
+	struct team_port *port;
+	struct hlist_head *head = team_port_index_hash(team, port_index);
+
+	hlist_for_each_entry_rcu(port, p, head, hlist)
+		if (port->index == port_index)
+			return port;
+	return NULL;
+}
+
+extern int team_port_set_orig_mac(struct team_port *port);
+extern int team_port_set_team_mac(struct team_port *port);
+extern void team_options_register(struct team *team,
+				  struct team_option *option,
+				  size_t option_count);
+extern void team_options_unregister(struct team *team,
+				    struct team_option *option,
+				    size_t option_count);
+extern int team_mode_register(struct team_mode *mode);
+extern int team_mode_unregister(struct team_mode *mode);
+
+#endif /* __KERNEL__ */
+
+#define TEAM_STRING_MAX_LEN 32
+
+/**********************************
+ * NETLINK_GENERIC netlink family.
+ **********************************/
+
+enum {
+	TEAM_CMD_NOOP,
+	TEAM_CMD_OPTIONS_SET,
+	TEAM_CMD_OPTIONS_GET,
+	TEAM_CMD_PORT_LIST_GET,
+
+	__TEAM_CMD_MAX,
+	TEAM_CMD_MAX = (__TEAM_CMD_MAX - 1),
+};
+
+enum {
+	TEAM_ATTR_UNSPEC,
+	TEAM_ATTR_TEAM_IFINDEX,		/* u32 */
+	TEAM_ATTR_LIST_OPTION,		/* nest */
+	TEAM_ATTR_LIST_PORT,		/* nest */
+
+	__TEAM_ATTR_MAX,
+	TEAM_ATTR_MAX = __TEAM_ATTR_MAX - 1,
+};
+
+/* Nested layout of get/set msg:
+ *
+ *	[TEAM_ATTR_LIST_OPTION]
+ *		[TEAM_ATTR_ITEM_OPTION]
+ *			[TEAM_ATTR_OPTION_*], ...
+ *		[TEAM_ATTR_ITEM_OPTION]
+ *			[TEAM_ATTR_OPTION_*], ...
+ *		...
+ *	[TEAM_ATTR_LIST_PORT]
+ *		[TEAM_ATTR_ITEM_PORT]
+ *			[TEAM_ATTR_PORT_*], ...
+ *		[TEAM_ATTR_ITEM_PORT]
+ *			[TEAM_ATTR_PORT_*], ...
+ *		...
+ */
+
+enum {
+	TEAM_ATTR_ITEM_OPTION_UNSPEC,
+	TEAM_ATTR_ITEM_OPTION,		/* nest */
+
+	__TEAM_ATTR_ITEM_OPTION_MAX,
+	TEAM_ATTR_ITEM_OPTION_MAX = __TEAM_ATTR_ITEM_OPTION_MAX - 1,
+};
+
+enum {
+	TEAM_ATTR_OPTION_UNSPEC,
+	TEAM_ATTR_OPTION_NAME,		/* string */
+	TEAM_ATTR_OPTION_CHANGED,	/* flag */
+	TEAM_ATTR_OPTION_TYPE,		/* u8 */
+	TEAM_ATTR_OPTION_DATA,		/* dynamic */
+
+	__TEAM_ATTR_OPTION_MAX,
+	TEAM_ATTR_OPTION_MAX = __TEAM_ATTR_OPTION_MAX - 1,
+};
+
+enum {
+	TEAM_ATTR_ITEM_PORT_UNSPEC,
+	TEAM_ATTR_ITEM_PORT,		/* nest */
+
+	__TEAM_ATTR_ITEM_PORT_MAX,
+	TEAM_ATTR_ITEM_PORT_MAX = __TEAM_ATTR_ITEM_PORT_MAX - 1,
+};
+
+enum {
+	TEAM_ATTR_PORT_UNSPEC,
+	TEAM_ATTR_PORT_IFINDEX,		/* u32 */
+	TEAM_ATTR_PORT_CHANGED,		/* flag */
+	TEAM_ATTR_PORT_LINKUP,		/* flag */
+	TEAM_ATTR_PORT_SPEED,		/* u32 */
+	TEAM_ATTR_PORT_DUPLEX,		/* u8 */
+
+	__TEAM_ATTR_PORT_MAX,
+	TEAM_ATTR_PORT_MAX = __TEAM_ATTR_PORT_MAX - 1,
+};
+
+/*
+ * NETLINK_GENERIC related info
+ */
+#define TEAM_GENL_NAME "team"
+#define TEAM_GENL_VERSION 0x1
+#define TEAM_GENL_CHANGE_EVENT_MC_GRP_NAME "change_event"
+
+#endif /* _LINUX_IF_TEAM_H_ */
-- 
1.7.6

* Re: Bug#645308: tg3 broken for NetXtreme 5714S in squeeze 6.0.3 installer
From: Ben Hutchings @ 2011-10-21 12:19 UTC (permalink / raw)
  To: Matt Carlson, Michael Chan; +Cc: 645308, Marc Haber, netdev
In-Reply-To: <20111021090811.GA5993@torres.zugschlus.de>

On Fri, 2011-10-21 at 11:08 +0200, Marc Haber wrote:
> On Fri, Oct 21, 2011 at 11:00:46AM +0200, Marc Haber wrote:
> > On Thu, Oct 20, 2011 at 05:28:34AM +0100, Ben Hutchings wrote:
> > > I don't see any changes that would obviously change the way this device
> > > is reconfigured during a down/up cycle.  There were some changes to
> > > power management that should just let the PCI core do some work that the
> > > driver used to, but it's possible that the result isn't quite the same.
> > > I built a module with those reverted; source and binary attached.  Could
> > > you test that?  I checked that d-i does include an insmod command.
> > 
> > The squeeze 6.0.3 installer with the shipped tg3.ko replaced with
> > yours boots and networks just fine without any workaround and without
> > manual interaction.
> 
> I was a bit fast on that. The interface now fails right in the middle
> of installation and needs the modprobe -r, modprobe stunt to network
> again.

Matt, Michael,

The tg3 driver has regressed for the 5714S since Linux 2.6.32.  Marc
Haber found this in the backported version included in our stable
update, but also confirmed it in Linux 3.0.

Bringing the interface down and then up again (which the installer does
for some reason) can leave it unable to pass traffic (possibly after
working for a few packets) until the module is reloaded.

I asked Marc to check whether reverting the power management changes
(071697e2bcd8dff2af4d6fdd6525c2324f89553b,
d237d9ecf06a00f0ebca657958cf2a1e92940796) made a difference, but it
doesn't seem to.

There is more information in the bug log at
<http://bugs.debian.org/645308>.

Ben.

-- 
Ben Hutchings
Unix is many things to many people,
but it's never been everything to anybody.

* Re: Questions about CHECKSUM_COMPLETE
From: Ben Hutchings @ 2011-10-21 12:09 UTC (permalink / raw)
  To: fengcheng lu; +Cc: netdev
In-Reply-To: <CAF5V5FbthBZ3QFzCUVov8MZ9TgF=+B6XM3eR+cyizbzcT-6v-A@mail.gmail.com>

On Thu, 2011-10-20 at 23:05 -0400, fengcheng lu wrote:
> Hello everyone
> 
> I have one question about CHECKSUM_COMPLETE. When
> CHECKSUM_COMPLETE is set, which data does the skb->csum computed by
> hardware cover?
> 
> I thought skb->csum only covered the transport header (e.g. TCP/UDP) +
> transport payload + pseudo header. However, after reading the VLAN
> code (vlan_skb_recv in vlan_dev.c of Linux kernel 2.6.27.19), I
> got confused.
> 
> vlan_skb_recv calls skb_pull_rcsum, which updates skb->csum if
> CHECKSUM_COMPLETE is set. That implies the VLAN header is also covered
> by skb->csum, so I wonder whether skb->csum covers all the data
> apart from the Ethernet header (14 bytes).

That's right, it's supposed to cover the complete packet (as seen by
Linux, so not including an Ethernet CRC).  The hardware doesn't need to
parse L3/L4 headers to implement this.

If the hardware you're dealing with doesn't calculate the full checksum
but it does parse headers and verify checksums for specific protocols
then you can set ip_summed = CHECKSUM_UNNECESSARY if the checksums are
OK.
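
To make the skb_pull_rcsum() point concrete, here is a rough userspace
sketch of the arithmetic only. It is not kernel code: the kernel keeps a
32-bit partial sum and uses csum_partial()/csum_sub(), whereas the helper
names below (fold16, sum16, sum16_sub, demo_pull) and the frame bytes are
made up for illustration. It shows that when a sum covers the whole
buffer, pulling an even-length header off the front only requires
subtracting that header's own sum.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fold a 32-bit accumulator down to a 16-bit ones-complement sum. */
static uint16_t fold16(uint32_t sum)
{
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)sum;
}

/* Ones-complement sum of len bytes as big-endian 16-bit words, not inverted. */
static uint16_t sum16(const uint8_t *buf, size_t len)
{
	uint32_t sum = 0;
	size_t i;

	for (i = 0; i + 1 < len; i += 2)
		sum += ((uint32_t)buf[i] << 8) | buf[i + 1];
	if (len & 1)
		sum += (uint32_t)buf[len - 1] << 8;	/* pad odd tail with zero */
	return fold16(sum);
}

/* Ones-complement subtraction: a - b is a + ~b with end-around carry. */
static uint16_t sum16_sub(uint16_t a, uint16_t b)
{
	return fold16((uint32_t)a + (uint16_t)~b);
}

/* Made-up bytes: a 4-byte VLAN tag followed by 8 bytes of payload. */
static const uint8_t demo_frame[12] = {
	0x00, 0x64, 0x08, 0x00,
	0x45, 0x00, 0x00, 0x1c, 0x12, 0x34, 0x00, 0x00,
};

/* Pull hdr_len bytes off the front and check the adjusted sum. */
void demo_pull(size_t hdr_len)
{
	uint16_t full = sum16(demo_frame, sizeof(demo_frame));
	uint16_t hdr = sum16(demo_frame, hdr_len);
	uint16_t rest = sum16(demo_frame + hdr_len,
			      sizeof(demo_frame) - hdr_len);

	/*
	 * Holds for even hdr_len, provided the remaining sum is not one of
	 * the two ones-complement representations of zero (0x0000/0xffff).
	 */
	assert(sum16_sub(full, hdr) == rest);
}
```

Calling demo_pull(4) exercises the VLAN tag case: the sum over the
remaining 8 bytes is recovered without touching the payload again.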

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.


* [PATCH net-next] tcp: md5: dont write skb head in tcp_md5_hash_header()
From: Eric Dumazet @ 2011-10-21 12:01 UTC (permalink / raw)
  To: David Miller; +Cc: netdev

tcp_md5_hash_header() writes a temporary zero value into the skb header,
which might confuse other users of this area.

Since tcphdr is small (20 bytes), copy it in a temporary variable and
make the change in the copy.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 include/net/tcp.h |    2 +-
 net/ipv4/tcp.c    |   14 ++++++++------
 2 files changed, 9 insertions(+), 7 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 3edef0b..910cc29 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -1209,7 +1209,7 @@ extern void tcp_free_md5sig_pool(void);
 extern struct tcp_md5sig_pool	*tcp_get_md5sig_pool(void);
 extern void tcp_put_md5sig_pool(void);
 
-extern int tcp_md5_hash_header(struct tcp_md5sig_pool *, struct tcphdr *);
+extern int tcp_md5_hash_header(struct tcp_md5sig_pool *, const struct tcphdr *);
 extern int tcp_md5_hash_skb_data(struct tcp_md5sig_pool *, const struct sk_buff *,
 				 unsigned header_len);
 extern int tcp_md5_hash_key(struct tcp_md5sig_pool *hp,
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 704adad..eefc61e 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2994,17 +2994,19 @@ void tcp_put_md5sig_pool(void)
 EXPORT_SYMBOL(tcp_put_md5sig_pool);
 
 int tcp_md5_hash_header(struct tcp_md5sig_pool *hp,
-			struct tcphdr *th)
+			const struct tcphdr *th)
 {
 	struct scatterlist sg;
+	struct tcphdr hdr;
 	int err;
 
-	__sum16 old_checksum = th->check;
-	th->check = 0;
+	/* We are not allowed to change tcphdr, make a local copy */
+	memcpy(&hdr, th, sizeof(hdr));
+	hdr.check = 0;
+
 	/* options aren't included in the hash */
-	sg_init_one(&sg, th, sizeof(struct tcphdr));
-	err = crypto_hash_update(&hp->md5_desc, &sg, sizeof(struct tcphdr));
-	th->check = old_checksum;
+	sg_init_one(&sg, &hdr, sizeof(hdr));
+	err = crypto_hash_update(&hp->md5_desc, &sg, sizeof(hdr));
 	return err;
 }
 EXPORT_SYMBOL(tcp_md5_hash_header);

* Re: [PATCH v2 1/3] net: hold sock reference while processing tx timestamps
From: Johannes Berg @ 2011-10-21 11:44 UTC (permalink / raw)
  To: Richard Cochran; +Cc: netdev, David Miller, Eric Dumazet
In-Reply-To: <f13715e7d094e7989581f5eacf0efe9d15379e10.1319193734.git.richard.cochran@omicron.at>

On Fri, 2011-10-21 at 12:49 +0200, Richard Cochran wrote:
> The pair of functions,
> 
>  * skb_clone_tx_timestamp()
>  * skb_complete_tx_timestamp()
> 
> were designed to allow timestamping in PHY devices. The first
> function, called during the MAC driver's hard_xmit method, identifies
> PTP protocol packets, clones them, and gives them to the PHY device
> driver. The PHY driver may hold onto the packet and deliver it at a
> later time using the second function, which adds the packet to the
> socket's error queue.
> 
> As pointed out by Johannes, nothing prevents the socket from
> disappearing while the cloned packet is sitting in the PHY driver
> awaiting a timestamp. This patch fixes the issue by taking a reference
> on the socket for each such packet. In addition, the comments
> regarding the usage of these functions are expanded to highlight the
> rule that PHY drivers must use skb_complete_tx_timestamp() to release
> the packet, in order to release the socket reference, too.

It just now occurred to me that technically I think you could use a
destructor function that just points to something like this:

void tstamp_clone_free(struct sk_buff *skb)
{
	sock_put(skb->sk);
}

but just forcing drivers to use the right API seems equally good to me.
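
For anyone following along, the lifetime rule can be modelled in user
space roughly like this. It is only a sketch using C11 atomics; struct
sock_model, ref_get_not_zero() and ref_put() are invented names standing
in for the socket, atomic_inc_not_zero() on sk_refcnt, and sock_put():

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a socket: only the reference count matters. */
struct sock_model {
	atomic_int refcnt;
};

/*
 * Take a reference only if the object is still alive (count > 0).
 * Returns false if the count had already dropped to zero, in which
 * case the caller must not hand a clone to the PHY driver.
 */
static bool ref_get_not_zero(struct sock_model *sk)
{
	int old = atomic_load(&sk->refcnt);

	while (old != 0) {
		/* On failure, 'old' is reloaded with the current value. */
		if (atomic_compare_exchange_weak(&sk->refcnt, &old, old + 1))
			return true;
	}
	return false;
}

/* Drop a reference; returns true when this was the last one. */
static bool ref_put(struct sock_model *sk)
{
	int prev = atomic_fetch_sub(&sk->refcnt, 1);

	assert(prev > 0);
	return prev == 1;
}
```

Each clone handed to the PHY driver corresponds to one successful
ref_get_not_zero(), and each completion (with or without a timestamp)
to one ref_put(), so the count cannot reach zero while a clone is
still queued.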

> These functions first appeared in v2.6.36.
> 
> Reported-by: Johannes Berg <johannes@sipsolutions.net>
> Signed-off-by: Richard Cochran <richard.cochran@omicron.at>
> Cc: <stable@vger.kernel.org>


Thanks all for the discussions! :-)

Reviewed-by: Johannes Berg <johannes@sipsolutions.net>

johannes

* Re: [PATCH v2 1/3] net: hold sock reference while processing tx timestamps
From: Eric Dumazet @ 2011-10-21 11:31 UTC (permalink / raw)
  To: Richard Cochran; +Cc: netdev, David Miller, Johannes Berg, stable
In-Reply-To: <f13715e7d094e7989581f5eacf0efe9d15379e10.1319193734.git.richard.cochran@omicron.at>

On Friday, 21 October 2011 at 12:49 +0200, Richard Cochran wrote:
> The pair of functions,
> 
>  * skb_clone_tx_timestamp()
>  * skb_complete_tx_timestamp()
> 
> were designed to allow timestamping in PHY devices. The first
> function, called during the MAC driver's hard_xmit method, identifies
> PTP protocol packets, clones them, and gives them to the PHY device
> driver. The PHY driver may hold onto the packet and deliver it at a
> later time using the second function, which adds the packet to the
> socket's error queue.
> 
> As pointed out by Johannes, nothing prevents the socket from
> disappearing while the cloned packet is sitting in the PHY driver
> awaiting a timestamp. This patch fixes the issue by taking a reference
> on the socket for each such packet. In addition, the comments
> regarding the usage of these functions are expanded to highlight the
> rule that PHY drivers must use skb_complete_tx_timestamp() to release
> the packet, in order to release the socket reference, too.
> 
> These functions first appeared in v2.6.36.
> 
> Reported-by: Johannes Berg <johannes@sipsolutions.net>
> Signed-off-by: Richard Cochran <richard.cochran@omicron.at>
> Cc: <stable@vger.kernel.org>
> ---

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>

* Re: [net-next 3/5] igb: Fix for Alt MAC Address feature on 82580 and later devices
From: David Lamparter @ 2011-10-21 11:11 UTC (permalink / raw)
  To: Jeff Kirsher; +Cc: davem, Carolyn Wyborny, netdev, gospo, sassmann
In-Reply-To: <1319193121-2729-4-git-send-email-jeffrey.t.kirsher@intel.com>

On Fri, Oct 21, 2011 at 03:31:59AM -0700, Jeff Kirsher wrote:
> From: Carolyn Wyborny <carolyn.wyborny@intel.com>
> 
> In 82580 and later devices, the alternate MAC address feature is
> completely handled by the option ROM and software does not handle
> it anymore.  This patch changes the check_alt_mac_addr function to
> exit immediately if device is 82580 or later.

'Stupid' question: what happens if you put such a card with an x86 
Option ROM into a, say, PowerPC server?

Just feeling like the devil's advocate today :)


-David

* Re: [PATCH][RFC] ipconfig: Add new kernel parameter to force MAC address
From: David Lamparter @ 2011-10-21 11:04 UTC (permalink / raw)
  To: Naohiro Aota; +Cc: linux-kernel, netdev
In-Reply-To: <878vofr2t9.fsf@elisp.net>

On Fri, Oct 21, 2011 at 02:27:30AM +0900, Naohiro Aota wrote:
> However it's not possible when you are using NFS root since the
> kernel already set the random MAC address and you cannot change
> it after system boot up process.

Is there any particular reason you're not using an initramfs?
Kernel autoconfiguration & NFS-root are relics of old times; with an
initramfs you can start a proper portmap/rpcbind daemon and use full
regular NFS (v4 even).

... and you can change the MAC before bringing the interface up.


-David

* [PATCH v2 3/3] dp83640: free packet queues on remove
From: Richard Cochran @ 2011-10-21 10:49 UTC (permalink / raw)
  To: netdev; +Cc: David Miller, Eric Dumazet, Johannes Berg, stable
In-Reply-To: <cover.1319193734.git.richard.cochran@omicron.at>

If the PHY should disappear (for example, on a USB Ethernet MAC), then
the driver would leak any undelivered time stamp packets. This commit
fixes the issue by calling the appropriate functions to free any packets
left in the transmit and receive queues.

The driver first appeared in v3.0.

Signed-off-by: Richard Cochran <richard.cochran@omicron.at>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: <stable@vger.kernel.org>
---
 drivers/net/phy/dp83640.c |    7 +++++++
 1 files changed, 7 insertions(+), 0 deletions(-)

diff --git a/drivers/net/phy/dp83640.c b/drivers/net/phy/dp83640.c
index 13e5713..9663e0b 100644
--- a/drivers/net/phy/dp83640.c
+++ b/drivers/net/phy/dp83640.c
@@ -1007,6 +1007,7 @@ static void dp83640_remove(struct phy_device *phydev)
 	struct dp83640_clock *clock;
 	struct list_head *this, *next;
 	struct dp83640_private *tmp, *dp83640 = phydev->priv;
+	struct sk_buff *skb;
 
 	if (phydev->addr == BROADCAST_ADDR)
 		return;
@@ -1014,6 +1015,12 @@ static void dp83640_remove(struct phy_device *phydev)
 	enable_status_frames(phydev, false);
 	cancel_work_sync(&dp83640->ts_work);
 
+	while ((skb = skb_dequeue(&dp83640->rx_queue)) != NULL)
+		kfree_skb(skb);
+
+	while ((skb = skb_dequeue(&dp83640->tx_queue)) != NULL)
+		skb_complete_tx_timestamp(skb, NULL);
+
 	clock = dp83640_clock_get(dp83640->clock);
 
 	if (dp83640 == clock->chosen) {
-- 
1.7.2.5

* [PATCH v2 2/3] dp83640: use proper function to free transmit time stamping packets
From: Richard Cochran @ 2011-10-21 10:49 UTC (permalink / raw)
  To: netdev; +Cc: David Miller, Eric Dumazet, Johannes Berg, stable
In-Reply-To: <cover.1319193734.git.richard.cochran@omicron.at>

The previous commit enforces a new rule for handling the cloned packets
for transmit time stamping. These packets must not be freed using any other
function than skb_complete_tx_timestamp. This commit fixes the one and only
driver using this API.

The driver first appeared in v3.0.

Signed-off-by: Richard Cochran <richard.cochran@omicron.at>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: <stable@vger.kernel.org>
---
 drivers/net/phy/dp83640.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/phy/dp83640.c b/drivers/net/phy/dp83640.c
index c588a16..13e5713 100644
--- a/drivers/net/phy/dp83640.c
+++ b/drivers/net/phy/dp83640.c
@@ -1192,7 +1192,7 @@ static void dp83640_txtstamp(struct phy_device *phydev,
 
 	case HWTSTAMP_TX_ONESTEP_SYNC:
 		if (is_sync(skb, type)) {
-			kfree_skb(skb);
+			skb_complete_tx_timestamp(skb, NULL);
 			return;
 		}
 		/* fall through */
@@ -1203,7 +1203,7 @@ static void dp83640_txtstamp(struct phy_device *phydev,
 
 	case HWTSTAMP_TX_OFF:
 	default:
-		kfree_skb(skb);
+		skb_complete_tx_timestamp(skb, NULL);
 		break;
 	}
 }
-- 
1.7.2.5

* [PATCH v2 1/3] net: hold sock reference while processing tx timestamps
From: Richard Cochran @ 2011-10-21 10:49 UTC (permalink / raw)
  To: netdev; +Cc: David Miller, Eric Dumazet, Johannes Berg, stable
In-Reply-To: <cover.1319193734.git.richard.cochran@omicron.at>

The pair of functions,

 * skb_clone_tx_timestamp()
 * skb_complete_tx_timestamp()

were designed to allow timestamping in PHY devices. The first
function, called during the MAC driver's hard_xmit method, identifies
PTP protocol packets, clones them, and gives them to the PHY device
driver. The PHY driver may hold onto the packet and deliver it at a
later time using the second function, which adds the packet to the
socket's error queue.

As pointed out by Johannes, nothing prevents the socket from
disappearing while the cloned packet is sitting in the PHY driver
awaiting a timestamp. This patch fixes the issue by taking a reference
on the socket for each such packet. In addition, the comments
regarding the usage of these functions are expanded to highlight the
rule that PHY drivers must use skb_complete_tx_timestamp() to release
the packet, in order to release the socket reference, too.

These functions first appeared in v2.6.36.

Reported-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: Richard Cochran <richard.cochran@omicron.at>
Cc: <stable@vger.kernel.org>
---
 include/linux/phy.h     |    2 +-
 include/linux/skbuff.h  |    7 ++++++-
 net/core/timestamping.c |   12 ++++++++++--
 3 files changed, 17 insertions(+), 4 deletions(-)

diff --git a/include/linux/phy.h b/include/linux/phy.h
index 54fc413..79f337c 100644
--- a/include/linux/phy.h
+++ b/include/linux/phy.h
@@ -420,7 +420,7 @@ struct phy_driver {
 
 	/*
 	 * Requests a Tx timestamp for 'skb'. The phy driver promises
-	 * to deliver it to the socket's error queue as soon as a
+	 * to deliver it using skb_complete_tx_timestamp() as soon as a
 	 * timestamp becomes available. One of the PTP_CLASS_ values
 	 * is passed in 'type'.
 	 */
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 8bd383c..0f96646 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -2020,8 +2020,13 @@ static inline bool skb_defer_rx_timestamp(struct sk_buff *skb)
 /**
  * skb_complete_tx_timestamp() - deliver cloned skb with tx timestamps
  *
+ * PHY drivers may accept clones of transmitted packets for
+ * timestamping via their phy_driver.txtstamp method. These drivers
+ * must call this function to return the skb back to the stack, with
+ * or without a timestamp.
+ *
  * @skb: clone of the the original outgoing packet
- * @hwtstamps: hardware time stamps
+ * @hwtstamps: hardware time stamps, may be NULL if not available
  *
  */
 void skb_complete_tx_timestamp(struct sk_buff *skb,
diff --git a/net/core/timestamping.c b/net/core/timestamping.c
index 98a5264..82fb288 100644
--- a/net/core/timestamping.c
+++ b/net/core/timestamping.c
@@ -57,9 +57,13 @@ void skb_clone_tx_timestamp(struct sk_buff *skb)
 	case PTP_CLASS_V2_VLAN:
 		phydev = skb->dev->phydev;
 		if (likely(phydev->drv->txtstamp)) {
+			if (!atomic_inc_not_zero(&sk->sk_refcnt))
+				return;
 			clone = skb_clone(skb, GFP_ATOMIC);
-			if (!clone)
+			if (!clone) {
+				sock_put(sk);
 				return;
+			}
 			clone->sk = sk;
 			phydev->drv->txtstamp(phydev, clone, type);
 		}
@@ -77,8 +81,11 @@ void skb_complete_tx_timestamp(struct sk_buff *skb,
 	struct sock_exterr_skb *serr;
 	int err;
 
-	if (!hwtstamps)
+	if (!hwtstamps) {
+		sock_put(sk);
+		kfree_skb(skb);
 		return;
+	}
 
 	*skb_hwtstamps(skb) = *hwtstamps;
 	serr = SKB_EXT_ERR(skb);
@@ -87,6 +94,7 @@ void skb_complete_tx_timestamp(struct sk_buff *skb,
 	serr->ee.ee_origin = SO_EE_ORIGIN_TIMESTAMPING;
 	skb->sk = NULL;
 	err = sock_queue_err_skb(sk, skb);
+	sock_put(sk);
 	if (err)
 		kfree_skb(skb);
 }
-- 
1.7.2.5
