Netdev List
 help / color / mirror / Atom feed
* [RFC PATCH net-next v6 0/4] Enable virtio_net to act as a backup for a passthru device
From: Sridhar Samudrala @ 2018-04-10 18:59 UTC (permalink / raw)
  To: mst, stephen, davem, netdev, virtualization, virtio-dev,
	jesse.brandeburg, alexander.h.duyck, kubakici, sridhar.samudrala,
	jasowang, loseweigh, jiri

The main motivation for this patch is to enable cloud service providers
to provide an accelerated datapath to virtio-net enabled VMs in a 
transparent manner with no/minimal guest userspace changes. This also
enables hypervisor controlled live migration to be supported with VMs that
have direct attached SR-IOV VF devices.

Patch 1 introduces a new feature bit VIRTIO_NET_F_BACKUP that can be
used by hypervisor to indicate that virtio_net interface should act as
a backup for another device with the same MAC address.

Patch 2 introduces a bypass module that provides a generic interface for 
paravirtual drivers to listen for netdev register/unregister/link change
events from pci ethernet devices with the same MAC and takeover their
datapath. The notifier and event handling code is based on the existing
netvsc implementation. It provides 2 sets of interfaces to paravirtual 
drivers to support 2-netdev(netvsc) and 3-netdev(virtio_net) models.

Patch 3 extends virtio_net to use alternate datapath when available and
registered. When BACKUP feature is enabled, virtio_net driver creates
an additional 'bypass' netdev that acts as a master device and controls
2 slave devices.  The original virtio_net netdev is registered as
'backup' netdev and a passthru/vf device with the same MAC gets
registered as 'active' netdev. Both 'bypass' and 'backup' netdevs are
associated with the same 'pci' device.  The user accesses the network
interface via 'bypass' netdev. The 'bypass' netdev chooses 'active' netdev
as default for transmits when it is available with link up and running.

Patch 4 refactors netvsc to use the registration/notification framework
supported by bypass module.

As this patch series is initially focusing on usecases where hypervisor 
fully controls the VM networking and the guest is not expected to directly 
configure any hardware settings, it doesn't expose all the ndo/ethtool ops
that are supported by virtio_net at this time. To support additional usecases,
it should be possible to enable additional ops later by caching the state
in virtio netdev and replaying when the 'active' netdev gets registered. 
 
The hypervisor needs to enable only one datapath at any time so that packets
don't get looped back to the VM over the other datapath. When a VF is
plugged, the virtio datapath link state can be marked as down.
At the time of live migration, the hypervisor needs to unplug the VF device
from the guest on the source host and reset the MAC filter of the VF to
initiate failover of datapath to virtio before starting the migration. After
the migration is completed, the destination hypervisor sets the MAC filter
on the VF and plugs it back to the guest to switch over to VF datapath.

This patch is based on the discussion initiated by Jesse on this thread.
https://marc.info/?l=linux-virtualization&m=151189725224231&w=2

v6 RFC:
  Simplified virtio_net changes by moving all the ndo_ops of the 
  bypass_netdev and create/destroy of bypass_netdev to 'bypass' module.
  avoided 2 phase registration(driver + instances).
  introduced IFF_BYPASS/IFF_BYPASS_SLAVE dev->priv_flags 
  replaced mutex with a spinlock

v5 RFC:
  Based on Jiri's comments, moved the common functionality to a 'bypass'
  module so that the same notifier and event handlers to handle child
  register/unregister/link change events can be shared between virtio_net
  and netvsc.
  Improved error handling based on Siwei's comments.
v4:
- Based on the review comments on the v3 version of the RFC patch and
  Jakub's suggestion for the naming issue with 3 netdev solution,
  proposed 3 netdev in-driver bonding solution for virtio-net.
v3 RFC:
- Introduced 3 netdev model and pointed out a couple of issues with
  that model and proposed 2 netdev model to avoid these issues.
- Removed broadcast/multicast optimization and only use virtio as
  backup path when VF is unplugged.
v2 RFC:
- Changed VIRTIO_NET_F_MASTER to VIRTIO_NET_F_BACKUP (mst)
- made a small change to the virtio-net xmit path to only use VF datapath
  for unicasts. Broadcasts/multicasts use virtio datapath. This avoids
  east-west broadcasts to go over the PCI link.
- added suppport for the feature bit in qemu

Sridhar Samudrala (4):
  virtio_net: Introduce VIRTIO_NET_F_BACKUP feature bit
  net: Introduce generic bypass module
  virtio_net: Extend virtio to use VF datapath when available
  netvsc: refactor notifier/event handling code to use the bypass
    framework

 drivers/net/Kconfig             |   1 +
 drivers/net/hyperv/Kconfig      |   1 +
 drivers/net/hyperv/netvsc_drv.c | 219 ++++----------
 drivers/net/virtio_net.c        | 614 +++++++++++++++++++++++++++++++++++++++-
 include/net/bypass.h            |  80 ++++++
 include/uapi/linux/virtio_net.h |   3 +
 net/Kconfig                     |  18 ++
 net/core/Makefile               |   1 +
 net/core/bypass.c               | 406 ++++++++++++++++++++++++++
 9 files changed, 1184 insertions(+), 159 deletions(-)
 create mode 100644 include/net/bypass.h
 create mode 100644 net/core/bypass.c

-- 
2.14.3

^ permalink raw reply

* [RFC PATCH net-next v6 1/4] virtio_net: Introduce VIRTIO_NET_F_BACKUP feature bit
From: Sridhar Samudrala @ 2018-04-10 18:59 UTC (permalink / raw)
  To: mst, stephen, davem, netdev, virtualization, virtio-dev,
	jesse.brandeburg, alexander.h.duyck, kubakici, sridhar.samudrala,
	jasowang, loseweigh, jiri
In-Reply-To: <1523386790-12396-1-git-send-email-sridhar.samudrala@intel.com>

This feature bit can be used by hypervisor to indicate virtio_net device to
act as a backup for another device with the same MAC address.

VIRTIO_NET_F_BACKUP is defined as bit 62 as it is a device feature bit.

Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
---
 drivers/net/virtio_net.c        | 2 +-
 include/uapi/linux/virtio_net.h | 3 +++
 2 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 7b187ec7411e..befb5944f3fd 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -2962,7 +2962,7 @@ static struct virtio_device_id id_table[] = {
 	VIRTIO_NET_F_GUEST_ANNOUNCE, VIRTIO_NET_F_MQ, \
 	VIRTIO_NET_F_CTRL_MAC_ADDR, \
 	VIRTIO_NET_F_MTU, VIRTIO_NET_F_CTRL_GUEST_OFFLOADS, \
-	VIRTIO_NET_F_SPEED_DUPLEX
+	VIRTIO_NET_F_SPEED_DUPLEX, VIRTIO_NET_F_BACKUP
 
 static unsigned int features[] = {
 	VIRTNET_FEATURES,
diff --git a/include/uapi/linux/virtio_net.h b/include/uapi/linux/virtio_net.h
index 5de6ed37695b..c7c35fd1a5ed 100644
--- a/include/uapi/linux/virtio_net.h
+++ b/include/uapi/linux/virtio_net.h
@@ -57,6 +57,9 @@
 					 * Steering */
 #define VIRTIO_NET_F_CTRL_MAC_ADDR 23	/* Set MAC address */
 
+#define VIRTIO_NET_F_BACKUP	  62	/* Act as backup for another device
+					 * with the same MAC.
+					 */
 #define VIRTIO_NET_F_SPEED_DUPLEX 63	/* Device set linkspeed and duplex */
 
 #ifndef VIRTIO_NET_NO_LEGACY
-- 
2.14.3

^ permalink raw reply related

* [RFC PATCH net-next v6 3/4] virtio_net: Extend virtio to use VF datapath when available
From: Sridhar Samudrala @ 2018-04-10 18:59 UTC (permalink / raw)
  To: mst, stephen, davem, netdev, virtualization, virtio-dev,
	jesse.brandeburg, alexander.h.duyck, kubakici, sridhar.samudrala,
	jasowang, loseweigh, jiri
In-Reply-To: <1523386790-12396-1-git-send-email-sridhar.samudrala@intel.com>

This patch enables virtio_net to switch over to a VF datapath when a VF
netdev is present with the same MAC address. It allows live migration
of a VM with a direct attached VF without the need to setup a bond/team
between a VF and virtio net device in the guest.

The hypervisor needs to enable only one datapath at any time so that
packets don't get looped back to the VM over the other datapath. When a VF
is plugged, the virtio datapath link state can be marked as down. The
hypervisor needs to unplug the VF device from the guest on the source host
and reset the MAC filter of the VF to initiate failover of datapath to
virtio before starting the migration. After the migration is completed,
the destination hypervisor sets the MAC filter on the VF and plugs it back
to the guest to switch over to VF datapath.

It uses the generic bypass framework that provides 2 functions to create
and destroy a master bypass netdev. When BACKUP feature is enabled, an
additional netdev(bypass netdev) is created that acts as a master device
and tracks the state of the 2 lower netdevs. The original virtio_net netdev
is marked as 'backup' netdev and a passthru device with the same MAC is
registered as 'active' netdev.

This patch is based on the discussion initiated by Jesse on this thread.
https://marc.info/?l=linux-virtualization&m=151189725224231&w=2

Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
---
 drivers/net/Kconfig      |  1 +
 drivers/net/virtio_net.c | 36 +++++++++++++++++++++++++++++++++++-
 2 files changed, 36 insertions(+), 1 deletion(-)

diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
index 891846655000..9e2cf61fd1c1 100644
--- a/drivers/net/Kconfig
+++ b/drivers/net/Kconfig
@@ -331,6 +331,7 @@ config VETH
 config VIRTIO_NET
 	tristate "Virtio network driver"
 	depends on VIRTIO
+	depends on MAY_USE_BYPASS
 	---help---
 	  This is the virtual network driver for virtio.  It can be used with
 	  QEMU based VMMs (like KVM or Xen).  Say Y or M.
diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index befb5944f3fd..99aa52d5ac9b 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -30,8 +30,11 @@
 #include <linux/cpu.h>
 #include <linux/average.h>
 #include <linux/filter.h>
+#include <linux/netdevice.h>
+#include <linux/pci.h>
 #include <net/route.h>
 #include <net/xdp.h>
+#include <net/bypass.h>
 
 static int napi_weight = NAPI_POLL_WEIGHT;
 module_param(napi_weight, int, 0444);
@@ -206,6 +209,9 @@ struct virtnet_info {
 	u32 speed;
 
 	unsigned long guest_offloads;
+
+	/* bypass_master created when BACKUP feature enabled */
+	struct bypass_master *bypass_master;
 };
 
 struct padded_vnet_hdr {
@@ -2275,6 +2281,22 @@ static int virtnet_xdp(struct net_device *dev, struct netdev_bpf *xdp)
 	}
 }
 
+static int virtnet_get_phys_port_name(struct net_device *dev, char *buf,
+				      size_t len)
+{
+	struct virtnet_info *vi = netdev_priv(dev);
+	int ret;
+
+	if (!virtio_has_feature(vi->vdev, VIRTIO_NET_F_BACKUP))
+		return -EOPNOTSUPP;
+
+	ret = snprintf(buf, len, "_bkup");
+	if (ret >= len)
+		return -EOPNOTSUPP;
+
+	return 0;
+}
+
 static const struct net_device_ops virtnet_netdev = {
 	.ndo_open            = virtnet_open,
 	.ndo_stop   	     = virtnet_close,
@@ -2292,6 +2314,7 @@ static const struct net_device_ops virtnet_netdev = {
 	.ndo_xdp_xmit		= virtnet_xdp_xmit,
 	.ndo_xdp_flush		= virtnet_xdp_flush,
 	.ndo_features_check	= passthru_features_check,
+	.ndo_get_phys_port_name	= virtnet_get_phys_port_name,
 };
 
 static void virtnet_config_changed_work(struct work_struct *work)
@@ -2839,10 +2862,16 @@ static int virtnet_probe(struct virtio_device *vdev)
 
 	virtnet_init_settings(dev);
 
+	if (virtio_has_feature(vdev, VIRTIO_NET_F_BACKUP)) {
+		err = bypass_master_create(vi->dev, &vi->bypass_master);
+		if (err)
+			goto free_vqs;
+	}
+
 	err = register_netdev(dev);
 	if (err) {
 		pr_debug("virtio_net: registering device failed\n");
-		goto free_vqs;
+		goto free_bypass;
 	}
 
 	virtio_device_ready(vdev);
@@ -2879,6 +2908,8 @@ static int virtnet_probe(struct virtio_device *vdev)
 	vi->vdev->config->reset(vdev);
 
 	unregister_netdev(dev);
+free_bypass:
+	bypass_master_destroy(vi->bypass_master);
 free_vqs:
 	cancel_delayed_work_sync(&vi->refill);
 	free_receive_page_frags(vi);
@@ -2913,6 +2944,8 @@ static void virtnet_remove(struct virtio_device *vdev)
 
 	unregister_netdev(vi->dev);
 
+	bypass_master_destroy(vi->bypass_master);
+
 	remove_vq_common(vi);
 
 	free_netdev(vi->dev);
@@ -3010,6 +3043,7 @@ static __init int virtio_net_driver_init(void)
         ret = register_virtio_driver(&virtio_net_driver);
 	if (ret)
 		goto err_virtio;
+
 	return 0;
 err_virtio:
 	cpuhp_remove_multi_state(CPUHP_VIRT_NET_DEAD);
-- 
2.14.3

^ permalink raw reply related

* [RFC PATCH net-next v6 2/4] net: Introduce generic bypass module
From: Sridhar Samudrala @ 2018-04-10 18:59 UTC (permalink / raw)
  To: mst, stephen, davem, netdev, virtualization, virtio-dev,
	jesse.brandeburg, alexander.h.duyck, kubakici, sridhar.samudrala,
	jasowang, loseweigh, jiri
In-Reply-To: <1523386790-12396-1-git-send-email-sridhar.samudrala@intel.com>

This provides a generic interface for paravirtual drivers to listen
for netdev register/unregister/link change events from pci ethernet
devices with the same MAC and takeover their datapath. The notifier and
event handling code is based on the existing netvsc implementation.

It exposes 2 sets of interfaces to the paravirtual drivers.
1. existing netvsc driver that uses 2 netdev model. In this model, no
master netdev is created. The paravirtual driver registers each bypass
instance along with a set of ops to manage the slave events.
     bypass_master_register()
     bypass_master_unregister()
2. new virtio_net based solution that uses 3 netdev model. In this model,
the bypass module provides interfaces to create/destroy additional master
netdev and all the slave events are managed internally.
      bypass_master_create()
      bypass_master_destroy()

Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
---
 include/linux/netdevice.h |  14 +
 include/net/bypass.h      |  96 ++++++
 net/Kconfig               |  18 +
 net/core/Makefile         |   1 +
 net/core/bypass.c         | 844 ++++++++++++++++++++++++++++++++++++++++++++++
 5 files changed, 973 insertions(+)
 create mode 100644 include/net/bypass.h
 create mode 100644 net/core/bypass.c

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index cf44503ea81a..587293728f70 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1430,6 +1430,8 @@ enum netdev_priv_flags {
 	IFF_PHONY_HEADROOM		= 1<<24,
 	IFF_MACSEC			= 1<<25,
 	IFF_NO_RX_HANDLER		= 1<<26,
+	IFF_BYPASS			= 1 << 27,
+	IFF_BYPASS_SLAVE		= 1 << 28,
 };
 
 #define IFF_802_1Q_VLAN			IFF_802_1Q_VLAN
@@ -1458,6 +1460,8 @@ enum netdev_priv_flags {
 #define IFF_RXFH_CONFIGURED		IFF_RXFH_CONFIGURED
 #define IFF_MACSEC			IFF_MACSEC
 #define IFF_NO_RX_HANDLER		IFF_NO_RX_HANDLER
+#define IFF_BYPASS			IFF_BYPASS
+#define IFF_BYPASS_SLAVE		IFF_BYPASS_SLAVE
 
 /**
  *	struct net_device - The DEVICE structure.
@@ -4308,6 +4312,16 @@ static inline bool netif_is_rxfh_configured(const struct net_device *dev)
 	return dev->priv_flags & IFF_RXFH_CONFIGURED;
 }
 
+static inline bool netif_is_bypass_master(const struct net_device *dev)
+{
+	return dev->priv_flags & IFF_BYPASS;
+}
+
+static inline bool netif_is_bypass_slave(const struct net_device *dev)
+{
+	return dev->priv_flags & IFF_BYPASS_SLAVE;
+}
+
 /* This device needs to keep skb dst for qdisc enqueue or ndo_start_xmit() */
 static inline void netif_keep_dst(struct net_device *dev)
 {
diff --git a/include/net/bypass.h b/include/net/bypass.h
new file mode 100644
index 000000000000..86b02cb894cf
--- /dev/null
+++ b/include/net/bypass.h
@@ -0,0 +1,96 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2018, Intel Corporation. */
+
+#ifndef _NET_BYPASS_H
+#define _NET_BYPASS_H
+
+#include <linux/netdevice.h>
+
+struct bypass_ops {
+	int (*slave_pre_register)(struct net_device *slave_netdev,
+				  struct net_device *bypass_netdev);
+	int (*slave_join)(struct net_device *slave_netdev,
+			  struct net_device *bypass_netdev);
+	int (*slave_pre_unregister)(struct net_device *slave_netdev,
+				    struct net_device *bypass_netdev);
+	int (*slave_release)(struct net_device *slave_netdev,
+			     struct net_device *bypass_netdev);
+	int (*slave_link_change)(struct net_device *slave_netdev,
+				 struct net_device *bypass_netdev);
+	rx_handler_result_t (*handle_frame)(struct sk_buff **pskb);
+};
+
+struct bypass_master {
+	struct list_head list;
+	struct net_device __rcu *bypass_netdev;
+	struct bypass_ops __rcu *ops;
+};
+
+/* bypass state */
+struct bypass_info {
+	/* passthru netdev with same MAC */
+	struct net_device __rcu *active_netdev;
+
+	/* virtio_net netdev */
+	struct net_device __rcu *backup_netdev;
+
+	/* active netdev stats */
+	struct rtnl_link_stats64 active_stats;
+
+	/* backup netdev stats */
+	struct rtnl_link_stats64 backup_stats;
+
+	/* aggregated stats */
+	struct rtnl_link_stats64 bypass_stats;
+
+	/* spinlock while updating stats */
+	spinlock_t stats_lock;
+};
+
+#if IS_ENABLED(CONFIG_NET_BYPASS)
+
+int bypass_master_create(struct net_device *backup_netdev,
+			 struct bypass_master **pbypass_master);
+void bypass_master_destroy(struct bypass_master *bypass_master);
+
+int bypass_master_register(struct net_device *dev, struct bypass_ops *ops,
+			   struct bypass_master **pbypass_master);
+void bypass_master_unregister(struct bypass_master *bypass_master);
+
+int bypass_slave_unregister(struct net_device *slave_netdev);
+
+#else
+
+static inline
+int bypass_master_create(struct net_device *backup_netdev,
+			 struct bypass_master **pbypass_master);
+{
+	return 0;
+}
+
+static inline
+void bypass_master_destroy(struct bypass_master *bypass_master)
+{
+}
+
+static inline
+int bypass_master_register(struct net_device *dev, struct bypass_ops *ops,
+			   struct pbypass_master **pbypass_master);
+{
+	return 0;
+}
+
+static inline
+void bypass_master_unregister(struct bypass_master *bypass_master)
+{
+}
+
+static inline
+int bypass_slave_unregister(struct net_device *slave_netdev)
+{
+	return 0;
+}
+
+#endif
+
+#endif /* _NET_BYPASS_H */
diff --git a/net/Kconfig b/net/Kconfig
index 0428f12c25c2..994445f4a96a 100644
--- a/net/Kconfig
+++ b/net/Kconfig
@@ -423,6 +423,24 @@ config MAY_USE_DEVLINK
 	  on MAY_USE_DEVLINK to ensure they do not cause link errors when
 	  devlink is a loadable module and the driver using it is built-in.
 
+config NET_BYPASS
+	tristate "Bypass interface"
+	---help---
+	  This provides a generic interface for paravirtual drivers to listen
+	  for netdev register/unregister/link change events from pci ethernet
+	  devices with the same MAC and takeover their datapath. This also
+	  enables live migration of a VM with direct attached VF by failing
+	  over to the paravirtual datapath when the VF is unplugged.
+
+config MAY_USE_BYPASS
+	tristate
+	default m if NET_BYPASS=m
+	default y if NET_BYPASS=y || NET_BYPASS=n
+	help
+	  Drivers using the bypass infrastructure should have a dependency
+	  on MAY_USE_BYPASS to ensure they do not cause link errors when
+	  bypass is a loadable module and the driver using it is built-in.
+
 endif   # if NET
 
 # Used by archs to tell that they support BPF JIT compiler plus which flavour.
diff --git a/net/core/Makefile b/net/core/Makefile
index 6dbbba8c57ae..a9727ed1c8fc 100644
--- a/net/core/Makefile
+++ b/net/core/Makefile
@@ -30,3 +30,4 @@ obj-$(CONFIG_DST_CACHE) += dst_cache.o
 obj-$(CONFIG_HWBM) += hwbm.o
 obj-$(CONFIG_NET_DEVLINK) += devlink.o
 obj-$(CONFIG_GRO_CELLS) += gro_cells.o
+obj-$(CONFIG_NET_BYPASS) += bypass.o
diff --git a/net/core/bypass.c b/net/core/bypass.c
new file mode 100644
index 000000000000..b5b9cb554c3f
--- /dev/null
+++ b/net/core/bypass.c
@@ -0,0 +1,844 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2018, Intel Corporation. */
+
+/* A common module to handle registrations and notifications for paravirtual
+ * drivers to enable accelerated datapath and support VF live migration.
+ *
+ * The notifier and event handling code is based on netvsc driver.
+ */
+
+#include <linux/netdevice.h>
+#include <linux/etherdevice.h>
+#include <linux/ethtool.h>
+#include <linux/module.h>
+#include <linux/slab.h>
+#include <linux/netdevice.h>
+#include <linux/netpoll.h>
+#include <linux/rtnetlink.h>
+#include <linux/if_vlan.h>
+#include <linux/pci.h>
+#include <net/sch_generic.h>
+#include <uapi/linux/if_arp.h>
+#include <net/bypass.h>
+
+static LIST_HEAD(bypass_master_list);
+static DEFINE_SPINLOCK(bypass_lock);
+
+static int bypass_slave_pre_register(struct net_device *slave_netdev,
+				     struct net_device *bypass_netdev,
+				     struct bypass_ops *bypass_ops)
+{
+	struct bypass_info *bi;
+	bool backup;
+
+	if (bypass_ops) {
+		if (!bypass_ops->slave_pre_register)
+			return -EINVAL;
+
+		return bypass_ops->slave_pre_register(slave_netdev,
+						      bypass_netdev);
+	}
+
+	bi = netdev_priv(bypass_netdev);
+	backup = (slave_netdev->dev.parent == bypass_netdev->dev.parent);
+	if (backup ? rtnl_dereference(bi->backup_netdev) :
+			rtnl_dereference(bi->active_netdev)) {
+		netdev_err(bypass_netdev, "%s attempting to register as slave dev when %s already present\n",
+			   slave_netdev->name, backup ? "backup" : "active");
+		return -EEXIST;
+	}
+
+	/* Avoid non pci devices as active netdev */
+	if (!backup && (!slave_netdev->dev.parent ||
+			!dev_is_pci(slave_netdev->dev.parent)))
+		return -EINVAL;
+
+	return 0;
+}
+
+static int bypass_slave_join(struct net_device *slave_netdev,
+			     struct net_device *bypass_netdev,
+			     struct bypass_ops *bypass_ops)
+{
+	struct bypass_info *bi;
+	bool backup;
+
+	if (bypass_ops) {
+		if (!bypass_ops->slave_join)
+			return -EINVAL;
+
+		return bypass_ops->slave_join(slave_netdev, bypass_netdev);
+	}
+
+	bi = netdev_priv(bypass_netdev);
+	backup = (slave_netdev->dev.parent == bypass_netdev->dev.parent);
+
+	dev_hold(slave_netdev);
+
+	if (backup) {
+		rcu_assign_pointer(bi->backup_netdev, slave_netdev);
+		dev_get_stats(bi->backup_netdev, &bi->backup_stats);
+	} else {
+		rcu_assign_pointer(bi->active_netdev, slave_netdev);
+		dev_get_stats(bi->active_netdev, &bi->active_stats);
+		bypass_netdev->min_mtu = slave_netdev->min_mtu;
+		bypass_netdev->max_mtu = slave_netdev->max_mtu;
+	}
+
+	netdev_info(bypass_netdev, "bypass slave:%s joined\n",
+		    slave_netdev->name);
+
+	return 0;
+}
+
+/* Called when slave dev is injecting data into network stack.
+ * Change the associated network device from lower dev to virtio.
+ * note: already called with rcu_read_lock
+ */
+static rx_handler_result_t bypass_handle_frame(struct sk_buff **pskb)
+{
+	struct sk_buff *skb = *pskb;
+	struct net_device *ndev = rcu_dereference(skb->dev->rx_handler_data);
+
+	skb->dev = ndev;
+
+	return RX_HANDLER_ANOTHER;
+}
+
+static struct net_device *bypass_master_get_bymac(u8 *mac,
+						  struct bypass_ops **ops)
+{
+	struct bypass_master *bypass_master;
+	struct net_device *bypass_netdev;
+
+	spin_lock(&bypass_lock);
+	list_for_each_entry(bypass_master, &bypass_master_list, list) {
+		bypass_netdev = rcu_dereference(bypass_master->bypass_netdev);
+		if (ether_addr_equal(bypass_netdev->perm_addr, mac)) {
+			*ops = rcu_dereference(bypass_master->ops);
+			spin_unlock(&bypass_lock);
+			return bypass_netdev;
+		}
+	}
+	spin_unlock(&bypass_lock);
+	return NULL;
+}
+
+static int bypass_slave_register(struct net_device *slave_netdev)
+{
+	struct net_device *bypass_netdev;
+	struct bypass_ops *bypass_ops;
+	int ret, orig_mtu;
+
+	ASSERT_RTNL();
+
+	bypass_netdev = bypass_master_get_bymac(slave_netdev->perm_addr,
+						&bypass_ops);
+	if (!bypass_netdev)
+		goto done;
+
+	ret = bypass_slave_pre_register(slave_netdev, bypass_netdev,
+					bypass_ops);
+	if (ret != 0)
+		goto done;
+
+	ret = netdev_rx_handler_register(slave_netdev,
+					 bypass_ops ? bypass_ops->handle_frame :
+					 bypass_handle_frame, bypass_netdev);
+	if (ret != 0) {
+		netdev_err(slave_netdev, "can not register bypass rx handler (err = %d)\n",
+			   ret);
+		goto done;
+	}
+
+	ret = netdev_upper_dev_link(slave_netdev, bypass_netdev, NULL);
+	if (ret != 0) {
+		netdev_err(slave_netdev, "can not set master device %s (err = %d)\n",
+			   bypass_netdev->name, ret);
+		goto upper_link_failed;
+	}
+
+	slave_netdev->priv_flags |= IFF_BYPASS_SLAVE;
+
+	if (netif_running(bypass_netdev)) {
+		ret = dev_open(slave_netdev);
+		if (ret && (ret != -EBUSY)) {
+			netdev_err(bypass_netdev, "Opening slave %s failed ret:%d\n",
+				   slave_netdev->name, ret);
+			goto err_interface_up;
+		}
+	}
+
+	/* Align MTU of slave with master */
+	orig_mtu = slave_netdev->mtu;
+	ret = dev_set_mtu(slave_netdev, bypass_netdev->mtu);
+	if (ret != 0) {
+		netdev_err(bypass_netdev, "unable to change mtu of %s to %u register failed\n",
+			   slave_netdev->name, bypass_netdev->mtu);
+		goto err_set_mtu;
+	}
+
+	ret = bypass_slave_join(slave_netdev, bypass_netdev, bypass_ops);
+	if (ret != 0)
+		goto err_join;
+
+	call_netdevice_notifiers(NETDEV_JOIN, slave_netdev);
+
+	netdev_info(bypass_netdev, "bypass slave:%s registered\n",
+		    slave_netdev->name);
+
+	goto done;
+
+err_join:
+	dev_set_mtu(slave_netdev, orig_mtu);
+err_set_mtu:
+	dev_close(slave_netdev);
+err_interface_up:
+	netdev_upper_dev_unlink(slave_netdev, bypass_netdev);
+	slave_netdev->priv_flags &= ~IFF_BYPASS_SLAVE;
+upper_link_failed:
+	netdev_rx_handler_unregister(slave_netdev);
+done:
+	return NOTIFY_DONE;
+}
+
+static int bypass_slave_pre_unregister(struct net_device *slave_netdev,
+				       struct net_device *bypass_netdev,
+				       struct bypass_ops *bypass_ops)
+{
+	struct net_device *backup_netdev, *active_netdev;
+	struct bypass_info *bi;
+
+	if (bypass_ops) {
+		if (!bypass_ops->slave_pre_unregister)
+			return -EINVAL;
+
+		return bypass_ops->slave_pre_unregister(slave_netdev,
+							bypass_netdev);
+	}
+
+	bi = netdev_priv(bypass_netdev);
+	active_netdev = rtnl_dereference(bi->active_netdev);
+	backup_netdev = rtnl_dereference(bi->backup_netdev);
+
+	if (slave_netdev != active_netdev && slave_netdev != backup_netdev)
+		return -EINVAL;
+
+	return 0;
+}
+
+static int bypass_slave_release(struct net_device *slave_netdev,
+				struct net_device *bypass_netdev,
+				struct bypass_ops *bypass_ops)
+{
+	struct net_device *backup_netdev, *active_netdev;
+	struct bypass_info *bi;
+
+	if (bypass_ops) {
+		if (!bypass_ops->slave_release)
+			return -EINVAL;
+
+		return bypass_ops->slave_release(slave_netdev, bypass_netdev);
+	}
+
+	bi = netdev_priv(bypass_netdev);
+	active_netdev = rtnl_dereference(bi->active_netdev);
+	backup_netdev = rtnl_dereference(bi->backup_netdev);
+
+	if (slave_netdev == backup_netdev) {
+		RCU_INIT_POINTER(bi->backup_netdev, NULL);
+	} else {
+		RCU_INIT_POINTER(bi->active_netdev, NULL);
+		if (backup_netdev) {
+			bypass_netdev->min_mtu = backup_netdev->min_mtu;
+			bypass_netdev->max_mtu = backup_netdev->max_mtu;
+		}
+	}
+
+	dev_put(slave_netdev);
+
+	netdev_info(bypass_netdev, "bypass slave:%s released\n",
+		    slave_netdev->name);
+
+	return 0;
+}
+
+int bypass_slave_unregister(struct net_device *slave_netdev)
+{
+	struct net_device *bypass_netdev;
+	struct bypass_ops *bypass_ops;
+	int ret;
+
+	if (!netif_is_bypass_slave(slave_netdev))
+		goto done;
+
+	ASSERT_RTNL();
+
+	bypass_netdev = bypass_master_get_bymac(slave_netdev->perm_addr,
+						&bypass_ops);
+	if (!bypass_netdev)
+		goto done;
+
+	ret = bypass_slave_pre_unregister(slave_netdev, bypass_netdev,
+					  bypass_ops);
+	if (ret != 0)
+		goto done;
+
+	netdev_rx_handler_unregister(slave_netdev);
+	netdev_upper_dev_unlink(slave_netdev, bypass_netdev);
+	slave_netdev->priv_flags &= ~IFF_BYPASS_SLAVE;
+
+	bypass_slave_release(slave_netdev, bypass_netdev, bypass_ops);
+
+	netdev_info(bypass_netdev, "bypass slave:%s unregistered\n",
+		    slave_netdev->name);
+
+done:
+	return NOTIFY_DONE;
+}
+EXPORT_SYMBOL_GPL(bypass_slave_unregister);
+
+static bool bypass_xmit_ready(struct net_device *dev)
+{
+	return netif_running(dev) && netif_carrier_ok(dev);
+}
+
+static int bypass_slave_link_change(struct net_device *slave_netdev)
+{
+	struct net_device *bypass_netdev, *active_netdev, *backup_netdev;
+	struct bypass_ops *bypass_ops;
+	struct bypass_info *bi;
+
+	if (!netif_is_bypass_slave(slave_netdev))
+		goto done;
+
+	ASSERT_RTNL();
+
+	bypass_netdev = bypass_master_get_bymac(slave_netdev->perm_addr,
+						&bypass_ops);
+	if (!bypass_netdev)
+		goto done;
+
+	if (bypass_ops) {
+		if (!bypass_ops->slave_link_change)
+			goto done;
+
+		return bypass_ops->slave_link_change(slave_netdev,
+						     bypass_netdev);
+	}
+
+	if (!netif_running(bypass_netdev))
+		return 0;
+
+	bi = netdev_priv(bypass_netdev);
+
+	active_netdev = rtnl_dereference(bi->active_netdev);
+	backup_netdev = rtnl_dereference(bi->backup_netdev);
+
+	if (slave_netdev != active_netdev && slave_netdev != backup_netdev)
+		goto done;
+
+	if ((active_netdev && bypass_xmit_ready(active_netdev)) ||
+	    (backup_netdev && bypass_xmit_ready(backup_netdev))) {
+		netif_carrier_on(bypass_netdev);
+		netif_tx_wake_all_queues(bypass_netdev);
+	} else {
+		netif_carrier_off(bypass_netdev);
+		netif_tx_stop_all_queues(bypass_netdev);
+	}
+
+done:
+	return NOTIFY_DONE;
+}
+
+static bool bypass_validate_event_dev(struct net_device *dev)
+{
+	/* Skip parent events */
+	if (netif_is_bypass_master(dev))
+		return false;
+
+	/* Avoid non-Ethernet type devices */
+	if (dev->type != ARPHRD_ETHER)
+		return false;
+
+	/* Avoid Vlan dev with same MAC registering as VF */
+	if (is_vlan_dev(dev))
+		return false;
+
+	/* Avoid Bonding master dev with same MAC registering as slave dev */
+	if ((dev->priv_flags & IFF_BONDING) && (dev->flags & IFF_MASTER))
+		return false;
+
+	return true;
+}
+
+static int
+bypass_event(struct notifier_block *this, unsigned long event, void *ptr)
+{
+	struct net_device *event_dev = netdev_notifier_info_to_dev(ptr);
+
+	if (!bypass_validate_event_dev(event_dev))
+		return NOTIFY_DONE;
+
+	switch (event) {
+	case NETDEV_REGISTER:
+		return bypass_slave_register(event_dev);
+	case NETDEV_UNREGISTER:
+		return bypass_slave_unregister(event_dev);
+	case NETDEV_UP:
+	case NETDEV_DOWN:
+	case NETDEV_CHANGE:
+		return bypass_slave_link_change(event_dev);
+	default:
+		return NOTIFY_DONE;
+	}
+}
+
+static struct notifier_block bypass_notifier = {
+	.notifier_call = bypass_event,
+};
+
+int bypass_open(struct net_device *dev)
+{
+	struct bypass_info *bi = netdev_priv(dev);
+	struct net_device *active_netdev, *backup_netdev;
+	int err;
+
+	netif_carrier_off(dev);
+	netif_tx_wake_all_queues(dev);
+
+	active_netdev = rtnl_dereference(bi->active_netdev);
+	if (active_netdev) {
+		err = dev_open(active_netdev);
+		if (err)
+			goto err_active_open;
+	}
+
+	backup_netdev = rtnl_dereference(bi->backup_netdev);
+	if (backup_netdev) {
+		err = dev_open(backup_netdev);
+		if (err)
+			goto err_backup_open;
+	}
+
+	return 0;
+
+err_backup_open:
+	dev_close(active_netdev);
+err_active_open:
+	netif_tx_disable(dev);
+	return err;
+}
+EXPORT_SYMBOL_GPL(bypass_open);
+
+int bypass_close(struct net_device *dev)
+{
+	struct bypass_info *vi = netdev_priv(dev);
+	struct net_device *slave_netdev;
+
+	netif_tx_disable(dev);
+
+	slave_netdev = rtnl_dereference(vi->active_netdev);
+	if (slave_netdev)
+		dev_close(slave_netdev);
+
+	slave_netdev = rtnl_dereference(vi->backup_netdev);
+	if (slave_netdev)
+		dev_close(slave_netdev);
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(bypass_close);
+
+static netdev_tx_t bypass_drop_xmit(struct sk_buff *skb, struct net_device *dev)
+{
+	atomic_long_inc(&dev->tx_dropped);
+	dev_kfree_skb_any(skb);
+	return NETDEV_TX_OK;
+}
+
+netdev_tx_t bypass_start_xmit(struct sk_buff *skb, struct net_device *dev)
+{
+	struct bypass_info *bi = netdev_priv(dev);
+	struct net_device *xmit_dev;
+
+	/* Try xmit via active netdev followed by backup netdev */
+	xmit_dev = rcu_dereference_bh(bi->active_netdev);
+	if (!xmit_dev || !bypass_xmit_ready(xmit_dev)) {
+		xmit_dev = rcu_dereference_bh(bi->backup_netdev);
+		if (!xmit_dev || !bypass_xmit_ready(xmit_dev))
+			return bypass_drop_xmit(skb, dev);
+	}
+
+	skb->dev = xmit_dev;
+	skb->queue_mapping = qdisc_skb_cb(skb)->slave_dev_queue_mapping;
+
+	return dev_queue_xmit(skb);
+}
+EXPORT_SYMBOL_GPL(bypass_start_xmit);
+
+u16 bypass_select_queue(struct net_device *dev, struct sk_buff *skb,
+			void *accel_priv, select_queue_fallback_t fallback)
+{
+	/* This helper function exists to help dev_pick_tx get the correct
+	 * destination queue.  Using a helper function skips a call to
+	 * skb_tx_hash and will put the skbs in the queue we expect on their
+	 * way down to the bonding driver.
+	 */
+	u16 txq = skb_rx_queue_recorded(skb) ? skb_get_rx_queue(skb) : 0;
+
+	/* Save the original txq to restore before passing to the driver */
+	qdisc_skb_cb(skb)->slave_dev_queue_mapping = skb->queue_mapping;
+
+	if (unlikely(txq >= dev->real_num_tx_queues)) {
+		do {
+			txq -= dev->real_num_tx_queues;
+		} while (txq >= dev->real_num_tx_queues);
+	}
+
+	return txq;
+}
+EXPORT_SYMBOL_GPL(bypass_select_queue);
+
+/* fold stats, assuming all rtnl_link_stats64 fields are u64, but
+ * that some drivers can provide 32bit values only.
+ */
+static void bypass_fold_stats(struct rtnl_link_stats64 *_res,
+			      const struct rtnl_link_stats64 *_new,
+			      const struct rtnl_link_stats64 *_old)
+{
+	const u64 *new = (const u64 *)_new;
+	const u64 *old = (const u64 *)_old;
+	u64 *res = (u64 *)_res;
+	int i;
+
+	for (i = 0; i < sizeof(*_res) / sizeof(u64); i++) {
+		u64 nv = new[i];
+		u64 ov = old[i];
+		s64 delta = nv - ov;
+
+		/* detects if this particular field is 32bit only */
+		if (((nv | ov) >> 32) == 0)
+			delta = (s64)(s32)((u32)nv - (u32)ov);
+
+		/* filter anomalies, some drivers reset their stats
+		 * at down/up events.
+		 */
+		if (delta > 0)
+			res[i] += delta;
+	}
+}
+
+void bypass_get_stats(struct net_device *dev, struct rtnl_link_stats64 *stats)
+{
+	struct bypass_info *bi = netdev_priv(dev);
+	const struct rtnl_link_stats64 *new;
+	struct rtnl_link_stats64 temp;
+	struct net_device *slave_netdev;
+
+	spin_lock(&bi->stats_lock);
+	memcpy(stats, &bi->bypass_stats, sizeof(*stats));
+
+	rcu_read_lock();
+
+	slave_netdev = rcu_dereference(bi->active_netdev);
+	if (slave_netdev) {
+		new = dev_get_stats(slave_netdev, &temp);
+		bypass_fold_stats(stats, new, &bi->active_stats);
+		memcpy(&bi->active_stats, new, sizeof(*new));
+	}
+
+	slave_netdev = rcu_dereference(bi->backup_netdev);
+	if (slave_netdev) {
+		new = dev_get_stats(slave_netdev, &temp);
+		bypass_fold_stats(stats, new, &bi->backup_stats);
+		memcpy(&bi->backup_stats, new, sizeof(*new));
+	}
+
+	rcu_read_unlock();
+
+	memcpy(&bi->bypass_stats, stats, sizeof(*stats));
+	spin_unlock(&bi->stats_lock);
+}
+EXPORT_SYMBOL_GPL(bypass_get_stats);
+
+int bypass_change_mtu(struct net_device *dev, int new_mtu)
+{
+	struct bypass_info *bi = netdev_priv(dev);
+	struct net_device *active_netdev, *backup_netdev;
+	int ret = 0;
+
+	active_netdev = rcu_dereference(bi->active_netdev);
+	if (active_netdev) {
+		ret = dev_set_mtu(active_netdev, new_mtu);
+		if (ret)
+			return ret;
+	}
+
+	backup_netdev = rcu_dereference(bi->backup_netdev);
+	if (backup_netdev) {
+		ret = dev_set_mtu(backup_netdev, new_mtu);
+		if (ret) {
+			dev_set_mtu(active_netdev, dev->mtu);
+			return ret;
+		}
+	}
+
+	dev->mtu = new_mtu;
+	return 0;
+}
+EXPORT_SYMBOL_GPL(bypass_change_mtu);
+
+void bypass_set_rx_mode(struct net_device *dev)
+{
+	struct bypass_info *bi = netdev_priv(dev);
+	struct net_device *slave_netdev;
+
+	rcu_read_lock();
+
+	slave_netdev = rcu_dereference(bi->active_netdev);
+	if (slave_netdev) {
+		dev_uc_sync_multiple(slave_netdev, dev);
+		dev_mc_sync_multiple(slave_netdev, dev);
+	}
+
+	slave_netdev = rcu_dereference(bi->backup_netdev);
+	if (slave_netdev) {
+		dev_uc_sync_multiple(slave_netdev, dev);
+		dev_mc_sync_multiple(slave_netdev, dev);
+	}
+
+	rcu_read_unlock();
+}
+EXPORT_SYMBOL_GPL(bypass_set_rx_mode);
+
+static const struct net_device_ops bypass_netdev_ops = {
+	.ndo_open		= bypass_open,
+	.ndo_stop		= bypass_close,
+	.ndo_start_xmit		= bypass_start_xmit,
+	.ndo_select_queue	= bypass_select_queue,
+	.ndo_get_stats64	= bypass_get_stats,
+	.ndo_change_mtu		= bypass_change_mtu,
+	.ndo_set_rx_mode	= bypass_set_rx_mode,
+	.ndo_validate_addr	= eth_validate_addr,
+	.ndo_features_check	= passthru_features_check,
+};
+
+#define BYPASS_DRV_NAME "bypass"
+#define BYPASS_DRV_VERSION "0.1"
+
+static void bypass_ethtool_get_drvinfo(struct net_device *dev,
+				       struct ethtool_drvinfo *drvinfo)
+{
+	strlcpy(drvinfo->driver, BYPASS_DRV_NAME, sizeof(drvinfo->driver));
+	strlcpy(drvinfo->version, BYPASS_DRV_VERSION, sizeof(drvinfo->version));
+}
+
+int bypass_ethtool_get_link_ksettings(struct net_device *dev,
+				      struct ethtool_link_ksettings *cmd)
+{
+	struct bypass_info *bi = netdev_priv(dev);
+	struct net_device *slave_netdev;
+
+	slave_netdev = rtnl_dereference(bi->active_netdev);
+	if (!slave_netdev || !bypass_xmit_ready(slave_netdev)) {
+		slave_netdev = rtnl_dereference(bi->backup_netdev);
+		if (!slave_netdev || !bypass_xmit_ready(slave_netdev)) {
+			cmd->base.duplex = DUPLEX_UNKNOWN;
+			cmd->base.port = PORT_OTHER;
+			cmd->base.speed = SPEED_UNKNOWN;
+
+			return 0;
+		}
+	}
+
+	return __ethtool_get_link_ksettings(slave_netdev, cmd);
+}
+EXPORT_SYMBOL_GPL(bypass_ethtool_get_link_ksettings);
+
+static const struct ethtool_ops bypass_ethtool_ops = {
+	.get_drvinfo            = bypass_ethtool_get_drvinfo,
+	.get_link               = ethtool_op_get_link,
+	.get_link_ksettings     = bypass_ethtool_get_link_ksettings,
+};
+
+static void bypass_register_existing_slave(struct net_device *bypass_netdev)
+{
+	struct net *net = dev_net(bypass_netdev);
+	struct net_device *dev;
+
+	rtnl_lock();
+	for_each_netdev(net, dev) {
+		if (dev == bypass_netdev)
+			continue;
+		if (!bypass_validate_event_dev(dev))
+			continue;
+		if (ether_addr_equal(bypass_netdev->perm_addr, dev->perm_addr))
+			bypass_slave_register(dev);
+	}
+	rtnl_unlock();
+}
+
+int bypass_master_register(struct net_device *dev, struct bypass_ops *ops,
+			   struct bypass_master **pbypass_master)
+{
+	struct bypass_master *bypass_master;
+
+	bypass_master = kzalloc(sizeof(*bypass_master), GFP_KERNEL);
+	if (!bypass_master)
+		return -ENOMEM;
+
+	rcu_assign_pointer(bypass_master->ops, ops);
+	dev_hold(dev);
+	dev->priv_flags |= IFF_BYPASS;
+	rcu_assign_pointer(bypass_master->bypass_netdev, dev);
+
+	spin_lock(&bypass_lock);
+	list_add_tail(&bypass_master->list, &bypass_master_list);
+	spin_unlock(&bypass_lock);
+
+	bypass_register_existing_slave(dev);
+
+	*pbypass_master = bypass_master;
+	return 0;
+}
+EXPORT_SYMBOL_GPL(bypass_master_register);
+
+void bypass_master_unregister(struct bypass_master *bypass_master)
+{
+	struct net_device *bypass_netdev;
+
+	bypass_netdev = rcu_dereference(bypass_master->bypass_netdev);
+
+	bypass_netdev->priv_flags &= ~IFF_BYPASS;
+	dev_put(bypass_netdev);
+
+	spin_lock(&bypass_lock);
+	list_del(&bypass_master->list);
+	spin_unlock(&bypass_lock);
+
+	kfree(bypass_master);
+}
+EXPORT_SYMBOL_GPL(bypass_master_unregister);
+
+int bypass_master_create(struct net_device *backup_netdev,
+			 struct bypass_master **pbypass_master)
+{
+	struct device *dev = backup_netdev->dev.parent;
+	struct net_device *bypass_netdev;
+	int err;
+
+	/* Alloc at least 2 queues, for now we are going with 16 assuming
+	 * that most devices being bonded won't have too many queues.
+	 */
+	bypass_netdev = alloc_etherdev_mq(sizeof(struct bypass_info), 16);
+	if (!bypass_netdev) {
+		dev_err(dev, "Unable to allocate bypass_netdev!\n");
+		return -ENOMEM;
+	}
+
+	dev_net_set(bypass_netdev, dev_net(backup_netdev));
+	SET_NETDEV_DEV(bypass_netdev, dev);
+
+	bypass_netdev->netdev_ops = &bypass_netdev_ops;
+	bypass_netdev->ethtool_ops = &bypass_ethtool_ops;
+
+	/* Initialize the device options */
+	bypass_netdev->priv_flags |= IFF_UNICAST_FLT | IFF_NO_QUEUE;
+	bypass_netdev->priv_flags &= ~(IFF_XMIT_DST_RELEASE |
+				       IFF_TX_SKB_SHARING);
+
+	/* don't acquire bypass netdev's netif_tx_lock when transmitting */
+	bypass_netdev->features |= NETIF_F_LLTX;
+
+	/* Don't allow bypass devices to change network namespaces. */
+	bypass_netdev->features |= NETIF_F_NETNS_LOCAL;
+
+	bypass_netdev->hw_features = NETIF_F_HW_CSUM | NETIF_F_SG |
+				     NETIF_F_FRAGLIST | NETIF_F_ALL_TSO |
+				     NETIF_F_HIGHDMA | NETIF_F_LRO;
+
+	bypass_netdev->hw_features |= NETIF_F_GSO_ENCAP_ALL;
+	bypass_netdev->features |= bypass_netdev->hw_features;
+
+	memcpy(bypass_netdev->dev_addr, backup_netdev->dev_addr,
+	       bypass_netdev->addr_len);
+
+	bypass_netdev->min_mtu = backup_netdev->min_mtu;
+	bypass_netdev->max_mtu = backup_netdev->max_mtu;
+
+	err = register_netdev(bypass_netdev);
+	if (err < 0) {
+		dev_err(dev, "Unable to register bypass_netdev!\n");
+		goto err_register_netdev;
+	}
+
+	netif_carrier_off(bypass_netdev);
+
+	err = bypass_master_register(bypass_netdev, NULL, pbypass_master);
+	if (err < 0)
+		goto err_bypass;
+
+	return 0;
+
+err_bypass:
+	unregister_netdev(bypass_netdev);
+err_register_netdev:
+	free_netdev(bypass_netdev);
+
+	return err;
+}
+EXPORT_SYMBOL_GPL(bypass_master_create);
+
+void bypass_master_destroy(struct bypass_master *bypass_master)
+{
+	struct net_device *bypass_netdev;
+	struct net_device *slave_netdev;
+	struct bypass_info *bi;
+
+	if (!bypass_master)
+		return;
+
+	bypass_netdev = rcu_dereference(bypass_master->bypass_netdev);
+	bi = netdev_priv(bypass_netdev);
+
+	netif_device_detach(bypass_netdev);
+
+	rtnl_lock();
+
+	slave_netdev = rtnl_dereference(bi->active_netdev);
+	if (slave_netdev)
+		bypass_slave_unregister(slave_netdev);
+
+	slave_netdev = rtnl_dereference(bi->backup_netdev);
+	if (slave_netdev)
+		bypass_slave_unregister(slave_netdev);
+
+	bypass_master_unregister(bypass_master);
+
+	unregister_netdevice(bypass_netdev);
+
+	rtnl_unlock();
+
+	free_netdev(bypass_netdev);
+}
+EXPORT_SYMBOL_GPL(bypass_master_destroy);
+
+static __init int
+bypass_init(void)
+{
+	register_netdevice_notifier(&bypass_notifier);
+
+	return 0;
+}
+module_init(bypass_init);
+
+static __exit
+void bypass_exit(void)
+{
+	unregister_netdevice_notifier(&bypass_notifier);
+}
+module_exit(bypass_exit);
+
+MODULE_DESCRIPTION("Bypass infrastructure/interface for Paravirtual drivers");
+MODULE_LICENSE("GPL v2");
-- 
2.14.3

^ permalink raw reply related

* [RFC PATCH net-next v6 4/4] netvsc: refactor notifier/event handling code to use the bypass framework
From: Sridhar Samudrala @ 2018-04-10 18:59 UTC (permalink / raw)
  To: mst, stephen, davem, netdev, virtualization, virtio-dev,
	jesse.brandeburg, alexander.h.duyck, kubakici, sridhar.samudrala,
	jasowang, loseweigh, jiri
In-Reply-To: <1523386790-12396-1-git-send-email-sridhar.samudrala@intel.com>

Use the registration/notification framework supported by the generic
bypass infrastructure.

Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
---
 drivers/net/hyperv/Kconfig      |   1 +
 drivers/net/hyperv/hyperv_net.h |   2 +
 drivers/net/hyperv/netvsc_drv.c | 208 ++++++++++------------------------------
 3 files changed, 55 insertions(+), 156 deletions(-)

diff --git a/drivers/net/hyperv/Kconfig b/drivers/net/hyperv/Kconfig
index 936968d23559..cc3a721baa18 100644
--- a/drivers/net/hyperv/Kconfig
+++ b/drivers/net/hyperv/Kconfig
@@ -1,5 +1,6 @@
 config HYPERV_NET
 	tristate "Microsoft Hyper-V virtual network driver"
 	depends on HYPERV
+	depends on MAY_USE_BYPASS
 	help
 	  Select this option to enable the Hyper-V virtual network driver.
diff --git a/drivers/net/hyperv/hyperv_net.h b/drivers/net/hyperv/hyperv_net.h
index 960f06141472..5f8137bc5c1c 100644
--- a/drivers/net/hyperv/hyperv_net.h
+++ b/drivers/net/hyperv/hyperv_net.h
@@ -768,6 +768,8 @@ struct net_device_context {
 	u32 vf_alloc;
 	/* Serial number of the VF to team with */
 	u32 vf_serial;
+
+	struct bypass_master *bypass_master;
 };
 
 /* Per channel data */
diff --git a/drivers/net/hyperv/netvsc_drv.c b/drivers/net/hyperv/netvsc_drv.c
index ecc84954c511..87c2a276e62f 100644
--- a/drivers/net/hyperv/netvsc_drv.c
+++ b/drivers/net/hyperv/netvsc_drv.c
@@ -43,6 +43,7 @@
 #include <net/pkt_sched.h>
 #include <net/checksum.h>
 #include <net/ip6_checksum.h>
+#include <net/bypass.h>
 
 #include "hyperv_net.h"
 
@@ -1763,46 +1764,6 @@ static void netvsc_link_change(struct work_struct *w)
 	rtnl_unlock();
 }
 
-static struct net_device *get_netvsc_bymac(const u8 *mac)
-{
-	struct net_device *dev;
-
-	ASSERT_RTNL();
-
-	for_each_netdev(&init_net, dev) {
-		if (dev->netdev_ops != &device_ops)
-			continue;	/* not a netvsc device */
-
-		if (ether_addr_equal(mac, dev->perm_addr))
-			return dev;
-	}
-
-	return NULL;
-}
-
-static struct net_device *get_netvsc_byref(struct net_device *vf_netdev)
-{
-	struct net_device *dev;
-
-	ASSERT_RTNL();
-
-	for_each_netdev(&init_net, dev) {
-		struct net_device_context *net_device_ctx;
-
-		if (dev->netdev_ops != &device_ops)
-			continue;	/* not a netvsc device */
-
-		net_device_ctx = netdev_priv(dev);
-		if (!rtnl_dereference(net_device_ctx->nvdev))
-			continue;	/* device is removed */
-
-		if (rtnl_dereference(net_device_ctx->vf_netdev) == vf_netdev)
-			return dev;	/* a match */
-	}
-
-	return NULL;
-}
-
 /* Called when VF is injecting data into network stack.
  * Change the associated network device from VF to netvsc.
  * note: already called with rcu_read_lock
@@ -1829,39 +1790,15 @@ static int netvsc_vf_join(struct net_device *vf_netdev,
 			  struct net_device *ndev)
 {
 	struct net_device_context *ndev_ctx = netdev_priv(ndev);
-	int ret;
-
-	ret = netdev_rx_handler_register(vf_netdev,
-					 netvsc_vf_handle_frame, ndev);
-	if (ret != 0) {
-		netdev_err(vf_netdev,
-			   "can not register netvsc VF receive handler (err = %d)\n",
-			   ret);
-		goto rx_handler_failed;
-	}
-
-	ret = netdev_upper_dev_link(vf_netdev, ndev, NULL);
-	if (ret != 0) {
-		netdev_err(vf_netdev,
-			   "can not set master device %s (err = %d)\n",
-			   ndev->name, ret);
-		goto upper_link_failed;
-	}
-
-	/* set slave flag before open to prevent IPv6 addrconf */
-	vf_netdev->flags |= IFF_SLAVE;
 
 	schedule_delayed_work(&ndev_ctx->vf_takeover, VF_TAKEOVER_INT);
 
-	call_netdevice_notifiers(NETDEV_JOIN, vf_netdev);
-
 	netdev_info(vf_netdev, "joined to %s\n", ndev->name);
-	return 0;
 
-upper_link_failed:
-	netdev_rx_handler_unregister(vf_netdev);
-rx_handler_failed:
-	return ret;
+	dev_hold(vf_netdev);
+	rcu_assign_pointer(ndev_ctx->vf_netdev, vf_netdev);
+
+	return 0;
 }
 
 static void __netvsc_vf_setup(struct net_device *ndev,
@@ -1914,85 +1851,82 @@ static void netvsc_vf_setup(struct work_struct *w)
 	rtnl_unlock();
 }
 
-static int netvsc_register_vf(struct net_device *vf_netdev)
+static int netvsc_vf_pre_register(struct net_device *vf_netdev,
+				  struct net_device *ndev)
 {
-	struct net_device *ndev;
 	struct net_device_context *net_device_ctx;
 	struct netvsc_device *netvsc_dev;
 
-	if (vf_netdev->addr_len != ETH_ALEN)
-		return NOTIFY_DONE;
-
-	/*
-	 * We will use the MAC address to locate the synthetic interface to
-	 * associate with the VF interface. If we don't find a matching
-	 * synthetic interface, move on.
-	 */
-	ndev = get_netvsc_bymac(vf_netdev->perm_addr);
-	if (!ndev)
-		return NOTIFY_DONE;
-
 	net_device_ctx = netdev_priv(ndev);
 	netvsc_dev = rtnl_dereference(net_device_ctx->nvdev);
 	if (!netvsc_dev || rtnl_dereference(net_device_ctx->vf_netdev))
-		return NOTIFY_DONE;
-
-	if (netvsc_vf_join(vf_netdev, ndev) != 0)
-		return NOTIFY_DONE;
+		return -EEXIST;
 
 	netdev_info(ndev, "VF registering: %s\n", vf_netdev->name);
 
-	dev_hold(vf_netdev);
-	rcu_assign_pointer(net_device_ctx->vf_netdev, vf_netdev);
-	return NOTIFY_OK;
+	return 0;
 }
 
 /* VF up/down change detected, schedule to change data path */
-static int netvsc_vf_changed(struct net_device *vf_netdev)
+static int netvsc_vf_changed(struct net_device *vf_netdev,
+			     struct net_device *ndev)
 {
 	struct net_device_context *net_device_ctx;
 	struct netvsc_device *netvsc_dev;
-	struct net_device *ndev;
 	bool vf_is_up = netif_running(vf_netdev);
 
-	ndev = get_netvsc_byref(vf_netdev);
-	if (!ndev)
-		return NOTIFY_DONE;
-
 	net_device_ctx = netdev_priv(ndev);
 	netvsc_dev = rtnl_dereference(net_device_ctx->nvdev);
 	if (!netvsc_dev)
-		return NOTIFY_DONE;
+		return -EINVAL;
 
 	netvsc_switch_datapath(ndev, vf_is_up);
 	netdev_info(ndev, "Data path switched %s VF: %s\n",
 		    vf_is_up ? "to" : "from", vf_netdev->name);
 
-	return NOTIFY_OK;
+	return 0;
 }
 
-static int netvsc_unregister_vf(struct net_device *vf_netdev)
+static int netvsc_vf_release(struct net_device *vf_netdev,
+			     struct net_device *ndev)
 {
-	struct net_device *ndev;
 	struct net_device_context *net_device_ctx;
 
-	ndev = get_netvsc_byref(vf_netdev);
-	if (!ndev)
-		return NOTIFY_DONE;
-
 	net_device_ctx = netdev_priv(ndev);
-	cancel_delayed_work_sync(&net_device_ctx->vf_takeover);
+	if (vf_netdev != rtnl_dereference(net_device_ctx->vf_netdev))
+		return -EINVAL;
 
-	netdev_info(ndev, "VF unregistering: %s\n", vf_netdev->name);
+	cancel_delayed_work_sync(&net_device_ctx->vf_takeover);
 
-	netdev_rx_handler_unregister(vf_netdev);
-	netdev_upper_dev_unlink(vf_netdev, ndev);
 	RCU_INIT_POINTER(net_device_ctx->vf_netdev, NULL);
 	dev_put(vf_netdev);
 
-	return NOTIFY_OK;
+	return 0;
 }
 
+static int netvsc_vf_pre_unregister(struct net_device *vf_netdev,
+				    struct net_device *ndev)
+{
+	struct net_device_context *net_device_ctx;
+
+	net_device_ctx = netdev_priv(ndev);
+	if (vf_netdev != rtnl_dereference(net_device_ctx->vf_netdev))
+		return -EINVAL;
+
+	netdev_info(ndev, "VF unregistering: %s\n", vf_netdev->name);
+
+	return 0;
+}
+
+static struct bypass_ops netvsc_bypass_ops = {
+	.slave_pre_register	= netvsc_vf_pre_register,
+	.slave_join		= netvsc_vf_join,
+	.slave_pre_unregister	= netvsc_vf_pre_unregister,
+	.slave_release		= netvsc_vf_release,
+	.slave_link_change	= netvsc_vf_changed,
+	.handle_frame		= netvsc_vf_handle_frame,
+};
+
 static int netvsc_probe(struct hv_device *dev,
 			const struct hv_vmbus_device_id *dev_id)
 {
@@ -2082,8 +2016,15 @@ static int netvsc_probe(struct hv_device *dev,
 		goto register_failed;
 	}
 
+	ret = bypass_master_register(net, &netvsc_bypass_ops,
+				     &net_device_ctx->bypass_master);
+	if (ret != 0)
+		goto err_bypass;
+
 	return ret;
 
+err_bypass:
+	unregister_netdev(net);
 register_failed:
 	rndis_filter_device_remove(dev, nvdev);
 rndis_failed:
@@ -2124,13 +2065,15 @@ static int netvsc_remove(struct hv_device *dev)
 	rtnl_lock();
 	vf_netdev = rtnl_dereference(ndev_ctx->vf_netdev);
 	if (vf_netdev)
-		netvsc_unregister_vf(vf_netdev);
+		bypass_slave_unregister(vf_netdev);
 
 	if (nvdev)
 		rndis_filter_device_remove(dev, nvdev);
 
 	unregister_netdevice(net);
 
+	bypass_master_unregister(ndev_ctx->bypass_master);
+
 	rtnl_unlock();
 	rcu_read_unlock();
 
@@ -2157,54 +2100,8 @@ static struct  hv_driver netvsc_drv = {
 	.remove = netvsc_remove,
 };
 
-/*
- * On Hyper-V, every VF interface is matched with a corresponding
- * synthetic interface. The synthetic interface is presented first
- * to the guest. When the corresponding VF instance is registered,
- * we will take care of switching the data path.
- */
-static int netvsc_netdev_event(struct notifier_block *this,
-			       unsigned long event, void *ptr)
-{
-	struct net_device *event_dev = netdev_notifier_info_to_dev(ptr);
-
-	/* Skip our own events */
-	if (event_dev->netdev_ops == &device_ops)
-		return NOTIFY_DONE;
-
-	/* Avoid non-Ethernet type devices */
-	if (event_dev->type != ARPHRD_ETHER)
-		return NOTIFY_DONE;
-
-	/* Avoid Vlan dev with same MAC registering as VF */
-	if (is_vlan_dev(event_dev))
-		return NOTIFY_DONE;
-
-	/* Avoid Bonding master dev with same MAC registering as VF */
-	if ((event_dev->priv_flags & IFF_BONDING) &&
-	    (event_dev->flags & IFF_MASTER))
-		return NOTIFY_DONE;
-
-	switch (event) {
-	case NETDEV_REGISTER:
-		return netvsc_register_vf(event_dev);
-	case NETDEV_UNREGISTER:
-		return netvsc_unregister_vf(event_dev);
-	case NETDEV_UP:
-	case NETDEV_DOWN:
-		return netvsc_vf_changed(event_dev);
-	default:
-		return NOTIFY_DONE;
-	}
-}
-
-static struct notifier_block netvsc_netdev_notifier = {
-	.notifier_call = netvsc_netdev_event,
-};
-
 static void __exit netvsc_drv_exit(void)
 {
-	unregister_netdevice_notifier(&netvsc_netdev_notifier);
 	vmbus_driver_unregister(&netvsc_drv);
 }
 
@@ -2224,7 +2121,6 @@ static int __init netvsc_drv_init(void)
 	if (ret)
 		return ret;
 
-	register_netdevice_notifier(&netvsc_netdev_notifier);
 	return 0;
 }
 
-- 
2.14.3

^ permalink raw reply related

* [PATCH net 0/2] l2tp: tunnel creation fixes
From: Guillaume Nault @ 2018-04-10 19:01 UTC (permalink / raw)
  To: netdev; +Cc: James Chapman, Tom Parkin

L2TP tunnel creation is racy. We need to make sure that the tunnel
returned by l2tp_tunnel_create() isn't going to be freed while the
caller is using it. This is done in patch #1, by separating tunnel
creation from tunnel registration.

With the tunnel registration code in place, we can now check for
duplicate tunnels in a race-free way. This is done in patch #2, which
incidentally removes the last use of l2tp_tunnel_find().

Guillaume Nault (2):
  l2tp: fix races in tunnel creation
  l2tp: fix race in duplicate tunnel detection

 net/l2tp/l2tp_core.c    | 225 +++++++++++++++++-----------------------
 net/l2tp/l2tp_core.h    |   4 +-
 net/l2tp/l2tp_netlink.c |  22 ++--
 net/l2tp/l2tp_ppp.c     |   9 ++
 4 files changed, 123 insertions(+), 137 deletions(-)

-- 
2.17.0

^ permalink raw reply

* [PATCH net 2/2] l2tp: fix race in duplicate tunnel detection
From: Guillaume Nault @ 2018-04-10 19:01 UTC (permalink / raw)
  To: netdev; +Cc: James Chapman, Tom Parkin
In-Reply-To: <cover.1523385906.git.g.nault@alphalink.fr>

We can't use l2tp_tunnel_find() to prevent l2tp_nl_cmd_tunnel_create()
from creating a duplicate tunnel. A tunnel can be concurrently
registered after l2tp_tunnel_find() returns. Therefore, searching for
duplicates must be done at registration time.

Finally, remove l2tp_tunnel_find() entirely as it isn't use anywhere
anymore.

Fixes: 309795f4bec2 ("l2tp: Add netlink control API for L2TP")
Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
---
 net/l2tp/l2tp_core.c    | 35 ++++++++++++++---------------------
 net/l2tp/l2tp_core.h    |  1 -
 net/l2tp/l2tp_netlink.c |  6 ------
 3 files changed, 14 insertions(+), 28 deletions(-)

diff --git a/net/l2tp/l2tp_core.c b/net/l2tp/l2tp_core.c
index afb42d142807..0fbd3ee26165 100644
--- a/net/l2tp/l2tp_core.c
+++ b/net/l2tp/l2tp_core.c
@@ -335,26 +335,6 @@ int l2tp_session_register(struct l2tp_session *session,
 }
 EXPORT_SYMBOL_GPL(l2tp_session_register);
 
-/* Lookup a tunnel by id
- */
-struct l2tp_tunnel *l2tp_tunnel_find(const struct net *net, u32 tunnel_id)
-{
-	struct l2tp_tunnel *tunnel;
-	struct l2tp_net *pn = l2tp_pernet(net);
-
-	rcu_read_lock_bh();
-	list_for_each_entry_rcu(tunnel, &pn->l2tp_tunnel_list, list) {
-		if (tunnel->tunnel_id == tunnel_id) {
-			rcu_read_unlock_bh();
-			return tunnel;
-		}
-	}
-	rcu_read_unlock_bh();
-
-	return NULL;
-}
-EXPORT_SYMBOL_GPL(l2tp_tunnel_find);
-
 struct l2tp_tunnel *l2tp_tunnel_find_nth(const struct net *net, int nth)
 {
 	struct l2tp_net *pn = l2tp_pernet(net);
@@ -1501,6 +1481,7 @@ static int l2tp_validate_socket(const struct sock *sk, const struct net *net,
 int l2tp_tunnel_register(struct l2tp_tunnel *tunnel, struct net *net,
 			 struct l2tp_tunnel_cfg *cfg)
 {
+	struct l2tp_tunnel *tunnel_walk;
 	struct l2tp_net *pn;
 	struct socket *sock;
 	struct sock *sk;
@@ -1529,7 +1510,16 @@ int l2tp_tunnel_register(struct l2tp_tunnel *tunnel, struct net *net,
 	tunnel->l2tp_net = net;
 
 	pn = l2tp_pernet(net);
+
 	spin_lock_bh(&pn->l2tp_tunnel_list_lock);
+	list_for_each_entry(tunnel_walk, &pn->l2tp_tunnel_list, list) {
+		if (tunnel_walk->tunnel_id == tunnel->tunnel_id) {
+			spin_unlock_bh(&pn->l2tp_tunnel_list_lock);
+
+			ret = -EEXIST;
+			goto err_sock;
+		}
+	}
 	list_add_rcu(&tunnel->list, &pn->l2tp_tunnel_list);
 	spin_unlock_bh(&pn->l2tp_tunnel_list_lock);
 
@@ -1558,7 +1548,10 @@ int l2tp_tunnel_register(struct l2tp_tunnel *tunnel, struct net *net,
 	return 0;
 
 err_sock:
-	sockfd_put(sock);
+	if (tunnel->fd < 0)
+		sock_release(sock);
+	else
+		sockfd_put(sock);
 err:
 	return ret;
 }
diff --git a/net/l2tp/l2tp_core.h b/net/l2tp/l2tp_core.h
index 12f0fa82f162..ba33cbec71eb 100644
--- a/net/l2tp/l2tp_core.h
+++ b/net/l2tp/l2tp_core.h
@@ -220,7 +220,6 @@ struct l2tp_session *l2tp_session_get(const struct net *net,
 struct l2tp_session *l2tp_session_get_nth(struct l2tp_tunnel *tunnel, int nth);
 struct l2tp_session *l2tp_session_get_by_ifname(const struct net *net,
 						const char *ifname);
-struct l2tp_tunnel *l2tp_tunnel_find(const struct net *net, u32 tunnel_id);
 struct l2tp_tunnel *l2tp_tunnel_find_nth(const struct net *net, int nth);
 
 int l2tp_tunnel_create(struct net *net, int fd, int version, u32 tunnel_id,
diff --git a/net/l2tp/l2tp_netlink.c b/net/l2tp/l2tp_netlink.c
index 45db9b73eb1a..b05dbd9ffcb2 100644
--- a/net/l2tp/l2tp_netlink.c
+++ b/net/l2tp/l2tp_netlink.c
@@ -236,12 +236,6 @@ static int l2tp_nl_cmd_tunnel_create(struct sk_buff *skb, struct genl_info *info
 	if (info->attrs[L2TP_ATTR_DEBUG])
 		cfg.debug = nla_get_u32(info->attrs[L2TP_ATTR_DEBUG]);
 
-	tunnel = l2tp_tunnel_find(net, tunnel_id);
-	if (tunnel != NULL) {
-		ret = -EEXIST;
-		goto out;
-	}
-
 	ret = -EINVAL;
 	switch (cfg.encap) {
 	case L2TP_ENCAPTYPE_UDP:
-- 
2.17.0

^ permalink raw reply related

* [PATCH net 1/2] l2tp: fix races in tunnel creation
From: Guillaume Nault @ 2018-04-10 19:01 UTC (permalink / raw)
  To: netdev; +Cc: James Chapman, Tom Parkin
In-Reply-To: <cover.1523385906.git.g.nault@alphalink.fr>

l2tp_tunnel_create() inserts the new tunnel into the namespace's tunnel
list and sets the socket's ->sk_user_data field, before returning it to
the caller. Therefore, there are two ways the tunnel can be accessed
and freed, before the caller even had the opportunity to take a
reference. In practice, syzbot could crash the module by closing the
socket right after a new tunnel was returned to pppol2tp_create().

This patch moves tunnel registration out of l2tp_tunnel_create(), so
that the caller can safely hold a reference before publishing the
tunnel. This second step is done with the new l2tp_tunnel_register()
function, which is now responsible for associating the tunnel to its
socket and for inserting it into the namespace's list.

While moving the code to l2tp_tunnel_register(), a few modifications
have been done. First, the socket validation tests are done in a helper
function, for clarity. Also, modifying the socket is now done after
having inserted the tunnel to the namespace's tunnels list. This will
allow insertion to fail, without having to revert theses modifications
in the error path (a followup patch will check for duplicate tunnels
before insertion). Either the socket is a kernel socket which we
control, or it is a user-space socket for which we have a reference on
the file descriptor. In any case, the socket isn't going to be closed
from under us.

Reported-by: syzbot+fbeeb5c3b538e8545644@syzkaller.appspotmail.com
Fixes: fd558d186df2 ("l2tp: Split pppol2tp patch into separate l2tp and ppp parts")
Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
---
 net/l2tp/l2tp_core.c    | 192 ++++++++++++++++++----------------------
 net/l2tp/l2tp_core.h    |   3 +
 net/l2tp/l2tp_netlink.c |  16 +++-
 net/l2tp/l2tp_ppp.c     |   9 ++
 4 files changed, 110 insertions(+), 110 deletions(-)

diff --git a/net/l2tp/l2tp_core.c b/net/l2tp/l2tp_core.c
index 14b67dfacc4b..afb42d142807 100644
--- a/net/l2tp/l2tp_core.c
+++ b/net/l2tp/l2tp_core.c
@@ -1436,74 +1436,11 @@ int l2tp_tunnel_create(struct net *net, int fd, int version, u32 tunnel_id, u32
 {
 	struct l2tp_tunnel *tunnel = NULL;
 	int err;
-	struct socket *sock = NULL;
-	struct sock *sk = NULL;
-	struct l2tp_net *pn;
 	enum l2tp_encap_type encap = L2TP_ENCAPTYPE_UDP;
 
-	/* Get the tunnel socket from the fd, which was opened by
-	 * the userspace L2TP daemon. If not specified, create a
-	 * kernel socket.
-	 */
-	if (fd < 0) {
-		err = l2tp_tunnel_sock_create(net, tunnel_id, peer_tunnel_id,
-				cfg, &sock);
-		if (err < 0)
-			goto err;
-	} else {
-		sock = sockfd_lookup(fd, &err);
-		if (!sock) {
-			pr_err("tunl %u: sockfd_lookup(fd=%d) returned %d\n",
-			       tunnel_id, fd, err);
-			err = -EBADF;
-			goto err;
-		}
-
-		/* Reject namespace mismatches */
-		if (!net_eq(sock_net(sock->sk), net)) {
-			pr_err("tunl %u: netns mismatch\n", tunnel_id);
-			err = -EINVAL;
-			goto err;
-		}
-	}
-
-	sk = sock->sk;
-
 	if (cfg != NULL)
 		encap = cfg->encap;
 
-	/* Quick sanity checks */
-	err = -EPROTONOSUPPORT;
-	if (sk->sk_type != SOCK_DGRAM) {
-		pr_debug("tunl %hu: fd %d wrong socket type\n",
-			 tunnel_id, fd);
-		goto err;
-	}
-	switch (encap) {
-	case L2TP_ENCAPTYPE_UDP:
-		if (sk->sk_protocol != IPPROTO_UDP) {
-			pr_err("tunl %hu: fd %d wrong protocol, got %d, expected %d\n",
-			       tunnel_id, fd, sk->sk_protocol, IPPROTO_UDP);
-			goto err;
-		}
-		break;
-	case L2TP_ENCAPTYPE_IP:
-		if (sk->sk_protocol != IPPROTO_L2TP) {
-			pr_err("tunl %hu: fd %d wrong protocol, got %d, expected %d\n",
-			       tunnel_id, fd, sk->sk_protocol, IPPROTO_L2TP);
-			goto err;
-		}
-		break;
-	}
-
-	/* Check if this socket has already been prepped */
-	tunnel = l2tp_tunnel(sk);
-	if (tunnel != NULL) {
-		/* This socket has already been prepped */
-		err = -EBUSY;
-		goto err;
-	}
-
 	tunnel = kzalloc(sizeof(struct l2tp_tunnel), GFP_KERNEL);
 	if (tunnel == NULL) {
 		err = -ENOMEM;
@@ -1520,72 +1457,113 @@ int l2tp_tunnel_create(struct net *net, int fd, int version, u32 tunnel_id, u32
 	rwlock_init(&tunnel->hlist_lock);
 	tunnel->acpt_newsess = true;
 
-	/* The net we belong to */
-	tunnel->l2tp_net = net;
-	pn = l2tp_pernet(net);
-
 	if (cfg != NULL)
 		tunnel->debug = cfg->debug;
 
-	/* Mark socket as an encapsulation socket. See net/ipv4/udp.c */
 	tunnel->encap = encap;
-	if (encap == L2TP_ENCAPTYPE_UDP) {
-		struct udp_tunnel_sock_cfg udp_cfg = { };
-
-		udp_cfg.sk_user_data = tunnel;
-		udp_cfg.encap_type = UDP_ENCAP_L2TPINUDP;
-		udp_cfg.encap_rcv = l2tp_udp_encap_recv;
-		udp_cfg.encap_destroy = l2tp_udp_encap_destroy;
-
-		setup_udp_tunnel_sock(net, sock, &udp_cfg);
-	} else {
-		sk->sk_user_data = tunnel;
-	}
 
-	/* Bump the reference count. The tunnel context is deleted
-	 * only when this drops to zero. A reference is also held on
-	 * the tunnel socket to ensure that it is not released while
-	 * the tunnel is extant. Must be done before sk_destruct is
-	 * set.
-	 */
 	refcount_set(&tunnel->ref_count, 1);
-	sock_hold(sk);
-	tunnel->sock = sk;
 	tunnel->fd = fd;
 
-	/* Hook on the tunnel socket destructor so that we can cleanup
-	 * if the tunnel socket goes away.
-	 */
-	tunnel->old_sk_destruct = sk->sk_destruct;
-	sk->sk_destruct = &l2tp_tunnel_destruct;
-	lockdep_set_class_and_name(&sk->sk_lock.slock, &l2tp_socket_class, "l2tp_sock");
-
-	sk->sk_allocation = GFP_ATOMIC;
-
 	/* Init delete workqueue struct */
 	INIT_WORK(&tunnel->del_work, l2tp_tunnel_del_work);
 
-	/* Add tunnel to our list */
 	INIT_LIST_HEAD(&tunnel->list);
-	spin_lock_bh(&pn->l2tp_tunnel_list_lock);
-	list_add_rcu(&tunnel->list, &pn->l2tp_tunnel_list);
-	spin_unlock_bh(&pn->l2tp_tunnel_list_lock);
 
 	err = 0;
 err:
 	if (tunnelp)
 		*tunnelp = tunnel;
 
-	/* If tunnel's socket was created by the kernel, it doesn't
-	 *  have a file.
-	 */
-	if (sock && sock->file)
-		sockfd_put(sock);
-
 	return err;
 }
 EXPORT_SYMBOL_GPL(l2tp_tunnel_create);
 
+static int l2tp_validate_socket(const struct sock *sk, const struct net *net,
+				enum l2tp_encap_type encap)
+{
+	if (!net_eq(sock_net(sk), net))
+		return -EINVAL;
+
+	if (sk->sk_type != SOCK_DGRAM)
+		return -EPROTONOSUPPORT;
+
+	if ((encap == L2TP_ENCAPTYPE_UDP && sk->sk_protocol != IPPROTO_UDP) ||
+	    (encap == L2TP_ENCAPTYPE_IP && sk->sk_protocol != IPPROTO_L2TP))
+		return -EPROTONOSUPPORT;
+
+	if (sk->sk_user_data)
+		return -EBUSY;
+
+	return 0;
+}
+
+int l2tp_tunnel_register(struct l2tp_tunnel *tunnel, struct net *net,
+			 struct l2tp_tunnel_cfg *cfg)
+{
+	struct l2tp_net *pn;
+	struct socket *sock;
+	struct sock *sk;
+	int ret;
+
+	if (tunnel->fd < 0) {
+		ret = l2tp_tunnel_sock_create(net, tunnel->tunnel_id,
+					      tunnel->peer_tunnel_id, cfg,
+					      &sock);
+		if (ret < 0)
+			goto err;
+	} else {
+		sock = sockfd_lookup(tunnel->fd, &ret);
+		if (!sock)
+			goto err;
+
+		ret = l2tp_validate_socket(sock->sk, net, tunnel->encap);
+		if (ret < 0)
+			goto err_sock;
+	}
+
+	sk = sock->sk;
+
+	sock_hold(sk);
+	tunnel->sock = sk;
+	tunnel->l2tp_net = net;
+
+	pn = l2tp_pernet(net);
+	spin_lock_bh(&pn->l2tp_tunnel_list_lock);
+	list_add_rcu(&tunnel->list, &pn->l2tp_tunnel_list);
+	spin_unlock_bh(&pn->l2tp_tunnel_list_lock);
+
+	if (tunnel->encap == L2TP_ENCAPTYPE_UDP) {
+		struct udp_tunnel_sock_cfg udp_cfg = {
+			.sk_user_data = tunnel,
+			.encap_type = UDP_ENCAP_L2TPINUDP,
+			.encap_rcv = l2tp_udp_encap_recv,
+			.encap_destroy = l2tp_udp_encap_destroy,
+		};
+
+		setup_udp_tunnel_sock(net, sock, &udp_cfg);
+	} else {
+		sk->sk_user_data = tunnel;
+	}
+
+	tunnel->old_sk_destruct = sk->sk_destruct;
+	sk->sk_destruct = &l2tp_tunnel_destruct;
+	lockdep_set_class_and_name(&sk->sk_lock.slock, &l2tp_socket_class,
+				   "l2tp_sock");
+	sk->sk_allocation = GFP_ATOMIC;
+
+	if (tunnel->fd >= 0)
+		sockfd_put(sock);
+
+	return 0;
+
+err_sock:
+	sockfd_put(sock);
+err:
+	return ret;
+}
+EXPORT_SYMBOL_GPL(l2tp_tunnel_register);
+
 /* This function is used by the netlink TUNNEL_DELETE command.
  */
 void l2tp_tunnel_delete(struct l2tp_tunnel *tunnel)
diff --git a/net/l2tp/l2tp_core.h b/net/l2tp/l2tp_core.h
index 2718d0b284d0..12f0fa82f162 100644
--- a/net/l2tp/l2tp_core.h
+++ b/net/l2tp/l2tp_core.h
@@ -226,6 +226,9 @@ struct l2tp_tunnel *l2tp_tunnel_find_nth(const struct net *net, int nth);
 int l2tp_tunnel_create(struct net *net, int fd, int version, u32 tunnel_id,
 		       u32 peer_tunnel_id, struct l2tp_tunnel_cfg *cfg,
 		       struct l2tp_tunnel **tunnelp);
+int l2tp_tunnel_register(struct l2tp_tunnel *tunnel, struct net *net,
+			 struct l2tp_tunnel_cfg *cfg);
+
 void l2tp_tunnel_closeall(struct l2tp_tunnel *tunnel);
 void l2tp_tunnel_delete(struct l2tp_tunnel *tunnel);
 struct l2tp_session *l2tp_session_create(int priv_size,
diff --git a/net/l2tp/l2tp_netlink.c b/net/l2tp/l2tp_netlink.c
index e7ea9c4b89ff..45db9b73eb1a 100644
--- a/net/l2tp/l2tp_netlink.c
+++ b/net/l2tp/l2tp_netlink.c
@@ -251,9 +251,19 @@ static int l2tp_nl_cmd_tunnel_create(struct sk_buff *skb, struct genl_info *info
 		break;
 	}
 
-	if (ret >= 0)
-		ret = l2tp_tunnel_notify(&l2tp_nl_family, info,
-					 tunnel, L2TP_CMD_TUNNEL_CREATE);
+	if (ret < 0)
+		goto out;
+
+	l2tp_tunnel_inc_refcount(tunnel);
+	ret = l2tp_tunnel_register(tunnel, net, &cfg);
+	if (ret < 0) {
+		kfree(tunnel);
+		goto out;
+	}
+	ret = l2tp_tunnel_notify(&l2tp_nl_family, info, tunnel,
+				 L2TP_CMD_TUNNEL_CREATE);
+	l2tp_tunnel_dec_refcount(tunnel);
+
 out:
 	return ret;
 }
diff --git a/net/l2tp/l2tp_ppp.c b/net/l2tp/l2tp_ppp.c
index d6deca11da19..896bbca9bdaa 100644
--- a/net/l2tp/l2tp_ppp.c
+++ b/net/l2tp/l2tp_ppp.c
@@ -698,6 +698,15 @@ static int pppol2tp_connect(struct socket *sock, struct sockaddr *uservaddr,
 			error = l2tp_tunnel_create(sock_net(sk), fd, ver, tunnel_id, peer_tunnel_id, &tcfg, &tunnel);
 			if (error < 0)
 				goto end;
+
+			l2tp_tunnel_inc_refcount(tunnel);
+			error = l2tp_tunnel_register(tunnel, sock_net(sk),
+						     &tcfg);
+			if (error < 0) {
+				kfree(tunnel);
+				goto end;
+			}
+			drop_tunnel = true;
 		}
 	} else {
 		/* Error if we can't find the tunnel */
-- 
2.17.0

^ permalink raw reply related

* Re: [patch net] devlink: convert occ_get op to separate registration
From: Sasha Levin @ 2018-04-10 19:08 UTC (permalink / raw)
  To: David Miller
  Cc: jiri@resnulli.us, jiri@mellanox.com, netdev@vger.kernel.org,
	idosch@mellanox.com, stable@vger.kernel.org,
	gregkh@linuxfoundation.org
In-Reply-To: <20180410.103507.935086967648690369.davem@davemloft.net>

On Tue, Apr 10, 2018 at 10:35:07AM -0400, David Miller wrote:
>From: Sasha Levin <Alexander.Levin@microsoft.com>
>Date: Tue, 10 Apr 2018 13:49:40 +0000
>
>> This commit has been processed because it contains a "Fixes:" tag,
>> fixing commit: d9f9b9a4d05f devlink: Add support for resource abstraction.
>>
>> The bot has tested the following trees: v4.16.1.
>
>This is nice, but it's becomming noise.

Thank you, and sorry. I've blacklisted net/ and drivers/net/.

>I pick and choose specific networking changes to queue up for -stable
>and whether it has a Fixes tag or applies cleanly are not the primary
>considerations I take into account when deciding what goes to stable
>or not.
>
>What matters primarily are two attributes:
>
>1) What is the impact of the bug?
>
>2) How risky is the change?
>
>3) Did someone explicitly ask for the backport?  Why?

When I started the autoselection project, I found it interesting that
the networking subsystem has a unique workflow for stable commits where
a single person would review every single patch that gets merged, and
would select patches purely based on whether the patch follows the
stable rules or not, without relying on stable tags.

For the first few iterations of the neural network, I thought that the
networking commit set is valuable since if I tried to learn on commits
that were tagged for stable, the neural network would develop a bias
towards the tag which will prevent it from detecting bug fixing patches
that do not contain such tags.

However, during validation, I found that the quality of the selection on
a random set of commits wasn't what I expected it to be. I reviewed a
random sample of networking commits and found that a high percentage of
commits that are clearly "bug fixing", were not included in any stable
tree.

This changed my view on how the stable process works for different
subsystems; I realized that there are 3 main things we look for in
stable trees:

 1. They have all bug fixing commits that are relevant to that tree.
 2. There are none non-bug-fixing commits in the tree.
 3. The commits are backported correctly to that tree.

And since subsystems have different workflows for stable, I could see
how different approaches balanced these goals. For networking, it was
easy to see that a high quality of review by the maintainer prevented
and non-bug-fixing commits from creeping in, and the error rates for
backports were much smaller than any other subsystem I looked at.

What was missing was the first goal. My hypothesis was that after all,
we are only humans; with the high rate of commits that flow into a
subsystem, and the opt-in nature of the process, we are bound to
incorrectly miss commits that would otherwise be in a stable tree.

I could back up my theory with experiences other folks had [1][2], 
so one of my biggest hopes was to see how the autoselection would could
improve the quality of selection and build on the biggest advantage the
networking subsystem over other subsystems: a dedicated maintainer who
actually gives a crap and reviews these patches himself.

When the work on the neural network itself quieted down, and the first
results started trickling in, I was delighted that we can now help
select commits that were missed by manual inspection, but are clearly
bug fixing. For example, [3][4][5].

>These automated emails tell me nothing about either attribute.

The attributes you've listed are subjective, and are the final step of
selecting patches for stable trees. Consider the following additional
attributes you undoubtfully look at when going through commits:

 1. Is it fixing a bug?
 2. Is it relevant to any of the stable trees?

Only after deciding on these attributes, it makes sense to look at ones
that require actual thinking (such as the ones you've listed).

>I figure out what Fixes: tags are involved and whether the patch
>applies cleanly when I process my stable queue of networking patches.

This is another thing that doesn't always end up right. When done
manually patches tend to get lost, forgotten, or end up with a partial
backport.

Doing this is a combination of "dumb" manual labor where for each stable
tree we try to find whether the commit we fix exists in it, and which
other commits we may depend on, and "smart" labor where we try to figure
out whether the commit should actually be in that tree.

The bot tries to take the "dumb" part out of your way, by letting
you know from the start which trees this applied/built on and what
dependencies it might have. It comes for free, why not use it?

As an example, a fix for CVE-2017-16939 [6] was backported to v4.9
but not to v4.4 or v3.18 because of missing dependencies.

>For various reasons I have to do this all manually anyways.  I always
>do a full analysis of where the bug came from, and also do manual
>code inspection to find the cases where a Fixes: tag indicates a commit
>that was also put into stable branches and what impact that has on
>the backport of the stable fix for a particular stable tree.

Would you be interested in going over these reasons, and whether the bot
can automate any of these? Things like determining if a Fixes: commit is
already in stable are already automated (excluding the impact analysis,
ofcourse).

>So I would like to kindly ask that you stop sending these automated
>emails, they really are not helping my networking stable backport
>process at all, and instead are just making more emails I have to
>delete from my inbox every day.
>
>Thank you very much for your kind consideration in this matter.

What would be the appropriate way to submit these to the networking
stable branch?

Let's look at a concrete example: I've sent a mail for a dccp fix[7] a
few days ago. Could we discuss why it's not in any stable tree?



[1] https://www.spinics.net/lists/stable/msg173995.html
[2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c58d6c93680f28ac58984af61d0a7ebf4319c241
[3] https://patchwork.kernel.org/patch/10188235/
[4] https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git/commit/?h=linux-4.4.y&id=e29066778bc28eff5f63616800c6b60f12c87267
[5] https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git/commit/?h=linux-4.4.y&id=7c5deeccc664a79d5586df86b4221493f28e9ef9
[6] https://www.spinics.net/lists/netdev/msg469458.html
[7] https://lkml.org/lkml/2018/4/8/762

^ permalink raw reply

* Re: [patch net] devlink: convert occ_get op to separate registration
From: David Miller @ 2018-04-10 19:22 UTC (permalink / raw)
  To: Alexander.Levin; +Cc: jiri, jiri, netdev, idosch, stable, gregkh
In-Reply-To: <20180410190819.GI2341@sasha-vm>

From: Sasha Levin <Alexander.Levin@microsoft.com>
Date: Tue, 10 Apr 2018 19:08:20 +0000

> The bot tries to take the "dumb" part out of your way, by letting
> you know from the start which trees this applied/built on and what
> dependencies it might have. It comes for free, why not use it?

I do this already while I'm processing the -stable queue in patchwork
and it automatically falls right out of the process I use to extract
patches out of Linus's GIT tree.

I manually pull the commit out of Linus's tree, verify that it is
actually the commit I'm interested in, then I do two things:

1) I look at the exact tag that the commit landed in using
   "git describe --contains SHA1_ID"

2) I look at the exact tag that the commit the Fixes: tag
   points at landed using "git describe --contains SHA1_ID"

I double check #2 to see for cases whether the Fixes: tag
itself was backported to stable trees.

I use these two tag values to organize the -stable queue into
subdirectories.  This guides my patch applying in order to minimize
useless work.

So I have to do all of this work anyways.

Even if the bot provided these values, I would still double
check them, every single one of them.

Therefore, the net result from my perspective is that for most
patches fixing bugs on this list, instead of N list postings,
there will now be at least N * 2.

Bots are starting to overwhelm actual content from human beings
on this list, and I want to put my foot on the brake right now
before it gets even more out of control.

Thank you.

^ permalink raw reply

* Re: [PATCH] net: dsa: Discard frames from unused ports
From: David Miller @ 2018-04-10 19:37 UTC (permalink / raw)
  To: andrew; +Cc: netdev


Andrew, I'm working on -stable backports, and hit a snag when pulling
this one into v4.14

4.14 predates the dsa_master_find_slave() helper, but even if I backport
that, the data structures at this point lack the 'type' member for
the test that you added.

Can you put together a v4.14 backport, if applicable, and submit it to
stable@vger.kernel.org CC:'ing me?

Thanks.

^ permalink raw reply

* Re: [RFC net-next 1/2] net: net-porcfs: Reduce rcu lock critical section
From: Eric Dumazet @ 2018-04-10 20:35 UTC (permalink / raw)
  To: David Miller, saeedm; +Cc: netdev
In-Reply-To: <20180410.131649.583776764903333305.davem@davemloft.net>



On 04/10/2018 10:16 AM, David Miller wrote:
> 
> The tradeoff here is that now you are doing two unnecessary atomic
> operations per stats dump.
> 
> That is what the RCU lock allows us to avoid.
> 

dev_hold() and dev_put() are actually per cpu increment and decrements,
pretty cheap these days.

Problem here is that any preemption of the process holding device reference
might trigger warnings in device unregister.

^ permalink raw reply

* Re: [patch net] devlink: convert occ_get op to separate registration
From: Sasha Levin @ 2018-04-10 20:43 UTC (permalink / raw)
  To: David Miller
  Cc: jiri@resnulli.us, jiri@mellanox.com, netdev@vger.kernel.org,
	idosch@mellanox.com, stable@vger.kernel.org,
	gregkh@linuxfoundation.org
In-Reply-To: <20180410.152257.1812747859010556634.davem@davemloft.net>

On Tue, Apr 10, 2018 at 03:22:57PM -0400, David Miller wrote:
>From: Sasha Levin <Alexander.Levin@microsoft.com>
>Date: Tue, 10 Apr 2018 19:08:20 +0000
>
>> The bot tries to take the "dumb" part out of your way, by letting
>> you know from the start which trees this applied/built on and what
>> dependencies it might have. It comes for free, why not use it?
>
>I do this already while I'm processing the -stable queue in patchwork
>and it automatically falls right out of the process I use to extract
>patches out of Linus's GIT tree.
>
>I manually pull the commit out of Linus's tree, verify that it is
>actually the commit I'm interested in, then I do two things:
>
>1) I look at the exact tag that the commit landed in using
>   "git describe --contains SHA1_ID"
>
>2) I look at the exact tag that the commit the Fixes: tag
>   points at landed using "git describe --contains SHA1_ID"
>
>I double check #2 to see for cases whether the Fixes: tag
>itself was backported to stable trees.
>
>I use these two tag values to organize the -stable queue into
>subdirectories.  This guides my patch applying in order to minimize
>useless work.
>
>So I have to do all of this work anyways.
>
>Even if the bot provided these values, I would still double
>check them, every single one of them.
>
>Therefore, the net result from my perspective is that for most
>patches fixing bugs on this list, instead of N list postings,
>there will now be at least N * 2.

Gotcha. I can keep it out from the regular patch flow. How could we
address commits that might have been missed? Let's say I look at commits
that are older than ~1 month, how can I get those looked at again?

>Bots are starting to overwhelm actual content from human beings
>on this list, and I want to put my foot on the brake right now
>before it gets even more out of control.

I think we're just hitting the limitations of using a mailing list. Bots
aren't around to spam everyone for fun, right?

0day bot is spammy because patches fail to compile, syzbot is spammy
because we have tons of bugs users can hit and the stable bot is
spammy because we miss lots of commits that should be in stable.

As the kernel grows, not only the codebase is expanding but also the
tooling around it. While spammy, bots provide valuable input that in the
past would take blood, sweat and tears to acquire. We just need a better
way to consume it rather than blocking off these inputs.

Thanks!

^ permalink raw reply

* Re: [RFC PATCH net-next v6 4/4] netvsc: refactor notifier/event handling code to use the bypass framework
From: Stephen Hemminger @ 2018-04-10 21:26 UTC (permalink / raw)
  To: Sridhar Samudrala
  Cc: mst, davem, netdev, virtualization, virtio-dev, jesse.brandeburg,
	alexander.h.duyck, kubakici, jasowang, loseweigh, jiri
In-Reply-To: <1523386790-12396-5-git-send-email-sridhar.samudrala@intel.com>

On Tue, 10 Apr 2018 11:59:50 -0700
Sridhar Samudrala <sridhar.samudrala@intel.com> wrote:

> Use the registration/notification framework supported by the generic
> bypass infrastructure.
> 
> Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
> ---

Thanks for doing this.  Your current version has couple show stopper
issues.

First, the slave device is instantly taking over the slave.
This doesn't allow udev/systemd to do its device rename of the slave
device. Netvsc uses a delayed work to workaround this.

Secondly, the select queue needs to call queue selection in VF.
The bonding/teaming logic doesn't work well for UDP flows.
Commit b3bf5666a510 ("hv_netvsc: defer queue selection to VF")
fixed this performance problem.

Lastly, more indirection is bad in current climate.

I am not completely adverse to this but it needs to be fast, simple
and completely transparent.

^ permalink raw reply

* Re: [RFC v3 net-next 13/18] net/sched: Introduce the TBS Qdisc
From: Jesus Sanchez-Palencia @ 2018-04-10 21:24 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: netdev, jhs, xiyou.wangcong, jiri, vinicius.gomes, richardcochran,
	anna-maria, henrik, John Stultz, levi.pearson, edumazet, willemb,
	mlichvar
In-Reply-To: <alpine.DEB.2.21.1804101412420.2058@nanos.tec.linutronix.de>

Hi Thomas,


On 04/10/2018 05:37 AM, Thomas Gleixner wrote:

(...)


>>>
>>>  - Simple feed through (Applications are time contraints aware and set the
>>>    exact schedule). qdisc has admission control.
>>
>> This will be provided by the tbs qdisc. It will still provide a txtime sorted
>> list and hw offload, but now there will be a per-socket option that tells the
>> qdisc if the per-packet timestamp is the txtime (i.e. explicit mode, as you've
>> called it) or a deadline. The drop_if_late flag will be removed.
>>
>> When in explicit mode, packets from that socket are dequeued from the qdisc
>> during its time slice if their [(txtime - delta) < now].
>>
>>>
>>>  - Deadline aware qdisc to handle e.g. A/V streams. Applications are aware
>>>    of time constraints and provide the packet deadline. qdisc has admission
>>>    control. This can be a simple first comes, first served scheduler or
>>>    something like EDF which allows optimized utilization. The qdisc sets
>>>    the TX time depending on the deadline and feeds into the root.
>>
>> This will be provided by tbs if the socket which is transmitting packets is
>> configured for deadline mode.
> 
> You don't want the socket to decide that. The qdisc into which a socket
> feeds defines the mode and the qdisc rejects requests with the wrong mode.
> 
> Making a qdisc doing both and let the user decide what he wants it to be is
> not really going to fly. Especially if you have different users which want
> a different mode. It's clearly distinct functionality.


Ok, so just to make sure I got this right, are you suggesting that both the
'tbs' qdisc *and* the socket (i.e. through SO_TXTIME) should have a config
parameter for specifying the txtime mode? This way if there is a mismatch,
packets from that socket are rejected by the qdisc.



(...)


> 
>> Another question for this mode (but perhaps that applies to both modes) is, what
>> if the qdisc misses the deadline for *any* reason? I'm assuming it should drop
>> the packet during dequeue.
> 
> There the question is how user space is notified about that issue. The
> application which queued the packet on time does rightfully assume that
> it's going to be on the wire on time.
> 
> This is a violation of the overall scheduling plan, so you need to have
> a sane design to handle that.


In addition to the qdisc stats, we could look into using the socket's error
queue to notify the application about that.


> 
>> Putting it all together, we end up with:
>>
>> 1) a new txtime aware qdisc, tbs, to be used per queue. Its cli will look like:
>> $ tc qdisc add (...) tbs clockid CLOCK_REALTIME delta 150000 offload sorting
> 
> Why CLOCK_REALTIME? The only interesting time in a TSN network is
> CLOCK_TAI, really.


REALTIME was just an example here to show that the qdisc has to be configured
with a clockid parameter. Are you suggesting that instead both of the new qdiscs
(i.e. tbs and taprio) should always be using CLOCK_TAI implicitly?


> 
>> 2) a new cmsg-interface for setting a per-packet timestamp that will be used
>> either as a txtime or as deadline by tbs (and further the NIC driver for the
>> offlaod case): SCM_TXTIME.
>>
>> 3) a new socket option: SO_TXTIME. It will be used to enable the feature for a
>> socket, and will have as parameters a clockid and a txtime mode (deadline or
>> explicit), that defines the semantics of the timestamp set on packets using
>> SCM_TXTIME.
>>
>> 4) a new #define DYNAMIC_CLOCKID 15 added to include/uapi/linux/time.h .
> 
> Can you remind me why we would need that?


So there is a "clockid" that can be used for the full hw offload modes. On this
case, the txtimes are in reference to the NIC's PTP clock, and, as discussed, we
can't just use a clockid that was computed from the fd pointing to /dev/ptpX .


> 
>> 5) a new schedule-aware qdisc, 'tas' or 'taprio', to be used per port. Its cli
>> will look like what was proposed for taprio (base time being an absolute timestamp).
>>
>> If we all agree with the above, we will start by closing on 1-4 asap and will
>> focus on 5 next.
>>
>> How does that sound?
> 
> Backwards to be honest.
> 
> You should start with the NIC facing qdisc because that's the key part of
> all this and the design might have implications on how the qdiscs which
> feed into it need to be designed.


Ok, let's just try to close on the above first.


Thanks,
Jesus


> 
> Thanks,
> 
> 	tglx
> 

^ permalink raw reply

* [PATCH] ibmvnic: Define vnic_login_client_data name field as unsized array
From: Kees Cook @ 2018-04-10 22:26 UTC (permalink / raw)
  To: Thomas Falcon
  Cc: John Allen, Benjamin Herrenschmidt, Paul Mackerras,
	Michael Ellerman, netdev, linuxppc-dev, linux-kernel,
	Daniel Micay

The "name" field of struct vnic_login_client_data is a char array of
undefined length. This should be written as "char name[]" so the compiler
can make better decisions about the field (for example, not assuming
it's a single character). This was noticed while trying to tighten the
CONFIG_FORTIFY_SOURCE checking.

Signed-off-by: Kees Cook <keescook@chromium.org>
---
 drivers/net/ethernet/ibm/ibmvnic.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/ibm/ibmvnic.c b/drivers/net/ethernet/ibm/ibmvnic.c
index aad5658d79d5..35fbb41cd2d4 100644
--- a/drivers/net/ethernet/ibm/ibmvnic.c
+++ b/drivers/net/ethernet/ibm/ibmvnic.c
@@ -3170,7 +3170,7 @@ static int send_version_xchg(struct ibmvnic_adapter *adapter)
 struct vnic_login_client_data {
 	u8	type;
 	__be16	len;
-	char	name;
+	char	name[];
 } __packed;
 
 static int vnic_client_data_len(struct ibmvnic_adapter *adapter)
@@ -3199,21 +3199,21 @@ static void vnic_add_client_data(struct ibmvnic_adapter *adapter,
 	vlcd->type = 1;
 	len = strlen(os_name) + 1;
 	vlcd->len = cpu_to_be16(len);
-	strncpy(&vlcd->name, os_name, len);
-	vlcd = (struct vnic_login_client_data *)((char *)&vlcd->name + len);
+	strncpy(vlcd->name, os_name, len);
+	vlcd = (struct vnic_login_client_data *)(vlcd->name + len);
 
 	/* Type 2 - LPAR name */
 	vlcd->type = 2;
 	len = strlen(utsname()->nodename) + 1;
 	vlcd->len = cpu_to_be16(len);
-	strncpy(&vlcd->name, utsname()->nodename, len);
-	vlcd = (struct vnic_login_client_data *)((char *)&vlcd->name + len);
+	strncpy(vlcd->name, utsname()->nodename, len);
+	vlcd = (struct vnic_login_client_data *)(vlcd->name + len);
 
 	/* Type 3 - device name */
 	vlcd->type = 3;
 	len = strlen(adapter->netdev->name) + 1;
 	vlcd->len = cpu_to_be16(len);
-	strncpy(&vlcd->name, adapter->netdev->name, len);
+	strncpy(vlcd->name, adapter->netdev->name, len);
 }
 
 static int send_login(struct ibmvnic_adapter *adapter)
-- 
2.7.4


-- 
Kees Cook
Pixel Security

^ permalink raw reply related

* Re: [RFC bpf-next v2 3/8] bpf: add documentation for eBPF helpers (12-22)
From: Alexei Starovoitov @ 2018-04-10 22:43 UTC (permalink / raw)
  To: Quentin Monnet; +Cc: daniel, ast, netdev, oss-drivers, linux-doc, linux-man
In-Reply-To: <20180410144157.4831-4-quentin.monnet@netronome.com>

On Tue, Apr 10, 2018 at 03:41:52PM +0100, Quentin Monnet wrote:
> Add documentation for eBPF helper functions to bpf.h user header file.
> This documentation can be parsed with the Python script provided in
> another commit of the patch series, in order to provide a RST document
> that can later be converted into a man page.
> 
> The objective is to make the documentation easily understandable and
> accessible to all eBPF developers, including beginners.
> 
> This patch contains descriptions for the following helper functions, all
> writter by Alexei:
> 
> - bpf_get_current_pid_tgid()
> - bpf_get_current_uid_gid()
> - bpf_get_current_comm()
> - bpf_skb_vlan_push()
> - bpf_skb_vlan_pop()
> - bpf_skb_get_tunnel_key()
> - bpf_skb_set_tunnel_key()
> - bpf_redirect()
> - bpf_perf_event_output()
> - bpf_get_stackid()
> - bpf_get_current_task()
> 
> Cc: Alexei Starovoitov <ast@kernel.org>
> Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
> ---
>  include/uapi/linux/bpf.h | 237 +++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 237 insertions(+)
> 
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 2bc653a3a20f..f3ea8824efbc 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -580,6 +580,243 @@ union bpf_attr {
>   * 		performed again.
>   * 	Return
>   * 		0 on success, or a negative error in case of failure.
> + *
> + * u64 bpf_get_current_pid_tgid(void)
> + * 	Return
> + * 		A 64-bit integer containing the current tgid and pid, and
> + * 		created as such:
> + * 		*current_task*\ **->tgid << 32 \|**
> + * 		*current_task*\ **->pid**.
> + *
> + * u64 bpf_get_current_uid_gid(void)
> + * 	Return
> + * 		A 64-bit integer containing the current GID and UID, and
> + * 		created as such: *current_gid* **<< 32 \|** *current_uid*.
> + *
> + * int bpf_get_current_comm(char *buf, u32 size_of_buf)
> + * 	Description
> + * 		Copy the **comm** attribute of the current task into *buf* of
> + * 		*size_of_buf*. The **comm** attribute contains the name of
> + * 		the executable (excluding the path) for the current task. The
> + * 		*size_of_buf* must be strictly positive. On success, the

that reminds me that we probably should relax it to ARG_CONST_SIZE_OR_ZERO.
The programs won't be passing an actual zero into it, but it helps
a lot to tell verifier that zero is also valid, since programs
become much simpler.

> + * 		helper makes sure that the *buf* is NUL-terminated. On failure,
> + * 		it is filled with zeroes.
> + * 	Return
> + * 		0 on success, or a negative error in case of failure.
> + *
> + * int bpf_skb_vlan_push(struct sk_buff *skb, __be16 vlan_proto, u16 vlan_tci)
> + * 	Description
> + * 		Push a *vlan_tci* (VLAN tag control information) of protocol
> + * 		*vlan_proto* to the packet associated to *skb*, then update
> + * 		the checksum. Note that if *vlan_proto* is different from
> + * 		**ETH_P_8021Q** and **ETH_P_8021AD**, it is considered to
> + * 		be **ETH_P_8021Q**.
> + *
> + * 		A call to this helper is susceptible to change data from the
> + * 		packet. Therefore, at load time, all checks on pointers
> + * 		previously done by the verifier are invalidated and must be
> + * 		performed again.
> + * 	Return
> + * 		0 on success, or a negative error in case of failure.
> + *
> + * int bpf_skb_vlan_pop(struct sk_buff *skb)
> + * 	Description
> + * 		Pop a VLAN header from the packet associated to *skb*.
> + *
> + * 		A call to this helper is susceptible to change data from the
> + * 		packet. Therefore, at load time, all checks on pointers
> + * 		previously done by the verifier are invalidated and must be
> + * 		performed again.
> + * 	Return
> + * 		0 on success, or a negative error in case of failure.
> + *
> + * int bpf_skb_get_tunnel_key(struct sk_buff *skb, struct bpf_tunnel_key *key, u32 size, u64 flags)
> + * 	Description
> + * 		Get tunnel metadata. This helper takes a pointer *key* to an
> + * 		empty **struct bpf_tunnel_key** of **size**, that will be
> + * 		filled with tunnel metadata for the packet associated to *skb*.
> + * 		The *flags* can be set to **BPF_F_TUNINFO_IPV6**, which
> + * 		indicates that the tunnel is based on IPv6 protocol instead of
> + * 		IPv4.
> + *
> + * 		This is typically used on the receive path to perform a lookup
> + * 		or a packet redirection based on the value of *key*:

above is correct, but feels a bit cryptic.
May be give more concrete example for particular tunneling protocol like gre
and say that tunnel_key.remote_ip[46] is essential part of the encap and
bpf prog will make decisions based on the contents of the encap header
where bpf_tunnel_key is a single structure that generalizes parameters of
various tunneling protocols into one struct.

> + *
> + * 		::
> + *
> + * 			struct bpf_tunnel_key key = {};
> + * 			bpf_skb_get_tunnel_key(skb, &key, sizeof(key), 0);
> + * 			     lookup or redirect based on key ...
> + *
> + * 	Return
> + * 		0 on success, or a negative error in case of failure.
> + *
> + * int bpf_skb_set_tunnel_key(struct sk_buff *skb, struct bpf_tunnel_key *key, u32 size, u64 flags)
> + * 	Description
> + * 		Populate tunnel metadata for packet associated to *skb.* The
> + * 		tunnel metadata is set to the contents of *key*, of *size*. The
> + * 		*flags* can be set to a combination of the following values:
> + *
> + * 		**BPF_F_TUNINFO_IPV6**
> + * 			Indicate that the tunnel is based on IPv6 protocol
> + * 			instead of IPv4.
> + * 		**BPF_F_ZERO_CSUM_TX**
> + * 			For IPv4 packets, add a flag to tunnel metadata
> + * 			indicating that checksum computation should be skipped
> + * 			and checksum set to zeroes.
> + * 		**BPF_F_DONT_FRAGMENT**
> + * 			Add a flag to tunnel metadata indicating that the
> + * 			packet should not be fragmented.
> + * 		**BPF_F_SEQ_NUMBER**
> + * 			Add a flag to tunnel metadata indicating that a
> + * 			sequence number should be added to tunnel header before
> + * 			sending the packet. This flag was added for GRE
> + * 			encapsulation, but might be used with other protocols
> + * 			as well in the future.
> + *
> + * 		Here is a typical usage on the transmit path:
> + *
> + * 		::
> + *
> + * 			struct bpf_tunnel_key key;
> + * 			     populate key ...
> + * 			bpf_skb_set_tunnel_key(skb, &key, sizeof(key), 0);
> + * 			bpf_clone_redirect(skb, vxlan_dev_ifindex, 0);
> + *
> + * 	Return
> + * 		0 on success, or a negative error in case of failure.
> + *
> + * int bpf_redirect(u32 ifindex, u64 flags)
> + * 	Description
> + * 		Redirect the packet to another net device of index *ifindex*.
> + * 		This helper is somewhat similar to **bpf_clone_redirect**\
> + * 		(), except that the packet is not cloned, which provides
> + * 		increased performance.
> + *
> + * 		For hooks other than XDP, *flags* can be set to
> + * 		**BPF_F_INGRESS**, which indicates the packet is to be
> + * 		redirected to the ingress interface instead of (by default)
> + * 		egress. Currently, XDP does not support any flag.
> + * 	Return
> + * 		For XDP, the helper returns **XDP_REDIRECT** on success or
> + * 		**XDP_ABORT** on error. For other program types, the values
> + * 		are **TC_ACT_REDIRECT** on success or **TC_ACT_SHOT** on
> + * 		error.
> + *
> + * int bpf_perf_event_output(struct pt_reg *ctx, struct bpf_map *map, u64 flags, void *data, u64 size)
> + * 	Description
> + * 		Write perf raw sample into a perf event held by *map* of type

I'd say:
Write raw *data* blob into special bpf perf event held by ...

> + * 		**BPF_MAP_TYPE_PERF_EVENT_ARRAY**. This perf event must
> + * 		have the following attributes: **PERF_SAMPLE_RAW** as
> + * 		**sample_type**, **PERF_TYPE_SOFTWARE** as **type**, and
> + * 		**PERF_COUNT_SW_BPF_OUTPUT** as **config**.
> + *
> + * 		The *flags* are used to indicate the index in *map* for which
> + * 		the value must be put, masked with **BPF_F_INDEX_MASK**.
> + * 		Alternatively, *flags* can be set to **BPF_F_CURRENT_CPU**
> + * 		to indicate that the index of the current CPU core should be
> + * 		used.
> + *
> + * 		The value to write, of *size*, is passed through eBPF stack and
> + * 		pointed by *data*.
> + *
> + * 		The context of the program *ctx* needs also be passed to the
> + * 		helper, and will get interpreted as a pointer to a **struct
> + * 		pt_reg**.

Not quite correct.
Initially bpf_perf_event_output() was only used with 'struct pt_reg *ctx',
but then later it was generalized for all other tracing prog types,
for clsact and even for XDP.
So 'ctx' can be any of the context used by these program types.

> + *
> + * 		On user space, a program willing to read the values needs to
> + * 		call **perf_event_open**\ () on the perf event (either for
> + * 		one or for all CPUs) and to store the file descriptor into the
> + * 		*map*. This must be done before the eBPF program can send data
> + * 		into it. An example is available in file
> + * 		*samples/bpf/trace_output_user.c* in the Linux kernel source
> + * 		tree (the eBPF program counterpart is in
> + * 		*samples/bpf/trace_output_kern.c*). It looks like the
> + * 		following snippet:
> + *
> + * 		::
> + *
> + * 			volatile struct perf_event_mmap_page *header;
> + * 			struct perf_event_attr attr = {
> + * 			        .sample_type = PERF_SAMPLE_RAW,
> + * 			        .type = PERF_TYPE_SOFTWARE,
> + * 			        .config = PERF_COUNT_SW_BPF_OUTPUT,
> + * 			};
> + * 			int page_size;
> + * 			int mmap_size;
> + * 			int key = 0;
> + * 			int pmu_fd;
> + * 			void *base;
> + * 			
> + * 			if (load_bpf_file(filename))
> + * 			        return -1;
> + * 			
> + * 			pmu_fd = sys_perf_event_open(&attr,
> + * 			                             -1, // pid
> + * 			                              0, // cpu
> + * 			                             -1, // group_fd
> + * 			                              0);
> + * 			
> + * 			assert(pmu_fd >= 0);
> + * 			assert(bpf_map_update_elem(map_fd[0], &key,
> + * 			                           &pmu_fd, BPF_ANY) == 0);
> + * 			assert(ioctl(pmu_fd, PERF_EVENT_IOC_ENABLE, 0) == 0);
> + * 			
> + * 			page_size = getpagesize();
> + * 			mmap_size = page_size * (page_cnt + 1);
> + * 			
> + * 			base = mmap(NULL, mmap_size, PROT_READ | PROT_WRITE,
> + * 			            MAP_SHARED, fd, 0);
> + * 			if (base == MAP_FAILED)
> + * 			        return -1;
> + * 			
> + * 			header = base;

I think that is too much for the man page, especially above is far from
complete example.

> + *
> + * 		**bpf_perf_event_output**\ () achieves better performance
> + * 		than **bpf_trace_printk**\ () for sharing data with user
> + * 		space, and is much better suitable for streaming data from eBPF
> + * 		programs.
> + * 	Return
> + * 		0 on success, or a negative error in case of failure.
> + *
> + * int bpf_get_stackid(struct pt_reg *ctx, struct bpf_map *map, u64 flags)
> + * 	Description
> + * 		Walk a user or a kernel stack and return its id. To achieve
> + * 		this, the helper needs *ctx*, which is a pointer to the context
> + * 		on which the tracing program is executed, and a pointer to a
> + * 		*map* of type **BPF_MAP_TYPE_STACK_TRACE**.
> + *
> + * 		The last argument, *flags*, holds the number of stack frames to
> + * 		skip (from 0 to 255), masked with
> + * 		**BPF_F_SKIP_FIELD_MASK**. The next bits can be used to set
> + * 		a combination of the following flags:
> + *
> + * 		**BPF_F_USER_STACK**
> + * 			Collect a user space stack instead of a kernel stack.
> + * 		**BPF_F_FAST_STACK_CMP**
> + * 			Compare stacks by hash only.
> + * 		**BPF_F_REUSE_STACKID**
> + * 			If two different stacks hash into the same *stackid*,
> + * 			discard the old one.

we have an annoying bug here that we will be sending a patch to fix soon,
since right now there is no way for the program to know that stackid
got replaced.

> + *
> + * 		The stack id retrieved is a 32 bit long integer handle which
> + * 		can be further combined with other data (including other stack
> + * 		ids) and used as a key into maps. This can be useful for
> + * 		generating a variety of graphs (such as flame graphs or off-cpu
> + * 		graphs).
> + *
> + * 		For walking a stack, this helper is an improvement over
> + * 		**bpf_probe_read**\ (), which can be used with unrolled loops
> + * 		but is not efficient and consumes a lot of eBPF instructions.
> + * 		Instead, **bpf_get_stackid**\ () can collect up to
> + * 		**PERF_MAX_STACK_DEPTH** both kernel and user frames.

PERF_MAX_STACK_DEPTH is now controlled by sysctl knob.
Would be good to mention that this limit can and should be increased
for profiling long user stacks like java.

> + * 	Return
> + * 		The positive or null stack id on success, or a negative error
> + * 		in case of failure.
> + *
> + * u64 bpf_get_current_task(void)
> + * 	Return
> + * 		A pointer to the current task struct.
>   */
>  #define __BPF_FUNC_MAPPER(FN)		\
>  	FN(unspec),			\
> -- 
> 2.14.1
> 

^ permalink raw reply

* Re: [RFC PATCH net-next v6 4/4] netvsc: refactor notifier/event handling code to use the bypass framework
From: Samudrala, Sridhar @ 2018-04-10 22:56 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: alexander.h.duyck, virtio-dev, jiri, mst, kubakici, netdev,
	virtualization, loseweigh, davem
In-Reply-To: <20180410142608.50f15b45@xeon-e3>

On 4/10/2018 2:26 PM, Stephen Hemminger wrote:
> On Tue, 10 Apr 2018 11:59:50 -0700
> Sridhar Samudrala <sridhar.samudrala@intel.com> wrote:
>
>> Use the registration/notification framework supported by the generic
>> bypass infrastructure.
>>
>> Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
>> ---
> Thanks for doing this.  Your current version has couple show stopper
> issues.
>
> First, the slave device is instantly taking over the slave.
> This doesn't allow udev/systemd to do its device rename of the slave
> device. Netvsc uses a delayed work to workaround this.

OK. I guess you are referring to the dev_set_mtu() and dev_open() calls that are
made in bypass_slave_register() and you want to defer them to be done after
a delay.  I could avoid these calls in case of netvsc based on bypass_ops.


>
> Secondly, the select queue needs to call queue selection in VF.
> The bonding/teaming logic doesn't work well for UDP flows.
> Commit b3bf5666a510 ("hv_netvsc: defer queue selection to VF")
> fixed this performance problem.

netvsc should not be using bypass_select_queue() as  that ndo op gets used
only with 3-netdev model.
Anyway, will look into updating bypass_select_queue() based on your fix.

>
> Lastly, more indirection is bad in current climate.
>
> I am not completely adverse to this but it needs to be fast, simple
> and completely transparent.

Not sure we can avoid this indirection if we want to commonize the code,  but use
different models for virtio-net and netvsc.

On the other hand, these patches avoid calls to get_netvsc_bymac() and
get_netvsc_by_ref() that go through all the devices for all the netdev events.
netvsc lookups should be much faster.

Thanks
Sridhar

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply

* Re: [PATCH] selftests: bpf: update .gitignore with missing generated files
From: Daniel Borkmann @ 2018-04-10 23:14 UTC (permalink / raw)
  To: Anders Roxell, ast, shuah; +Cc: netdev, linux-kernel, linux-kselftest
In-Reply-To: <20180410122421.25393-1-anders.roxell@linaro.org>

On 04/10/2018 02:24 PM, Anders Roxell wrote:
> Signed-off-by: Anders Roxell <anders.roxell@linaro.org>

Applied to bpf tree, thanks Anders!

^ permalink raw reply

* Re: [PATCH bpf v4] bpf/tracing: fix a deadlock in perf_event_detach_bpf_prog
From: Daniel Borkmann @ 2018-04-10 23:14 UTC (permalink / raw)
  To: Yonghong Song, ast, netdev; +Cc: kernel-team
In-Reply-To: <20180410163732.2818747-1-yhs@fb.com>

On 04/10/2018 06:37 PM, Yonghong Song wrote:
> syzbot reported a possible deadlock in perf_event_detach_bpf_prog.
> The error details:
>   ======================================================
>   WARNING: possible circular locking dependency detected
>   4.16.0-rc7+ #3 Not tainted
>   ------------------------------------------------------
>   syz-executor7/24531 is trying to acquire lock:
>    (bpf_event_mutex){+.+.}, at: [<000000008a849b07>] perf_event_detach_bpf_prog+0x92/0x3d0 kernel/trace/bpf_trace.c:854
> 
>   but task is already holding lock:
>    (&mm->mmap_sem){++++}, at: [<0000000038768f87>] vm_mmap_pgoff+0x198/0x280 mm/util.c:353
> 
>   which lock already depends on the new lock.
> 
>   the existing dependency chain (in reverse order) is:
> 
[...]
> 
> The bug is introduced by Commit f371b304f12e ("bpf/tracing: allow
> user space to query prog array on the same tp") where copy_to_user,
> which requires mm->mmap_sem, is called inside bpf_event_mutex lock.
> At the same time, during perf_event file descriptor close,
> mm->mmap_sem is held first and then subsequent
> perf_event_detach_bpf_prog needs bpf_event_mutex lock.
> Such a senario caused a deadlock.
> 
> As suggested by Daniel, moving copy_to_user out of the
> bpf_event_mutex lock should fix the problem.
> 
> Fixes: f371b304f12e ("bpf/tracing: allow user space to query prog array on the same tp")
> Reported-by: syzbot+dc5ca0e4c9bfafaf2bae@syzkaller.appspotmail.com
> Signed-off-by: Yonghong Song <yhs@fb.com>

Applied to bpf tree, thanks Yonghong!

^ permalink raw reply

* Re: [PATCH] net/mlx5: remove some extraneous spaces in indentations
From: Jason Gunthorpe @ 2018-04-10 23:22 UTC (permalink / raw)
  To: Colin Ian King
  Cc: Saeed Mahameed, Matan Barak, Leon Romanovsky, netdev, linux-rdma,
	kernel-janitors, linux-kernel
In-Reply-To: <20180409124336.29274-1-colin.king@canonical.com>

On Mon, Apr 09, 2018 at 01:43:36PM +0100, Colin Ian King wrote:
> From: Colin Ian King <colin.king@canonical.com>
> 
> There are several lines where there is an extraneous space causing
> indentation misalignment. Remove them.
> 
> Cleans up Cocconelle warnings:
> 
> ./drivers/net/ethernet/mellanox/mlx5/core/qp.c:409:3-18: code aligned
> with following code on line 410
> ./drivers/net/ethernet/mellanox/mlx5/core/qp.c:415:3-18: code aligned
> with following code on line 416
> ./drivers/net/ethernet/mellanox/mlx5/core/qp.c:421:3-18: code aligned
> with following code on line 422
> 
> Signed-off-by: Colin Ian King <colin.king@canonical.com>
> ---
>  drivers/net/ethernet/mellanox/mlx5/core/qp.c | 18 +++++++++---------
>  1 file changed, 9 insertions(+), 9 deletions(-)

applied to for-next thanks

Jason

^ permalink raw reply

* Re: [PATCH] net/mlx5: remove some extraneous spaces in indentations
From: Jason Gunthorpe @ 2018-04-10 23:27 UTC (permalink / raw)
  To: Colin Ian King
  Cc: Saeed Mahameed, Matan Barak, Leon Romanovsky, netdev, linux-rdma,
	kernel-janitors, linux-kernel
In-Reply-To: <20180410232254.mg7stooh5tytv6m2@ziepe.ca>

On Tue, Apr 10, 2018 at 05:22:54PM -0600, Jason Gunthorpe wrote:
> On Mon, Apr 09, 2018 at 01:43:36PM +0100, Colin Ian King wrote:
> > From: Colin Ian King <colin.king@canonical.com>
> > 
> > There are several lines where there is an extraneous space causing
> > indentation misalignment. Remove them.
> > 
> > Cleans up Cocconelle warnings:
> > 
> > ./drivers/net/ethernet/mellanox/mlx5/core/qp.c:409:3-18: code aligned
> > with following code on line 410
> > ./drivers/net/ethernet/mellanox/mlx5/core/qp.c:415:3-18: code aligned
> > with following code on line 416
> > ./drivers/net/ethernet/mellanox/mlx5/core/qp.c:421:3-18: code aligned
> > with following code on line 422
> > 
> > Signed-off-by: Colin Ian King <colin.king@canonical.com>
> >  drivers/net/ethernet/mellanox/mlx5/core/qp.c | 18 +++++++++---------
> >  1 file changed, 9 insertions(+), 9 deletions(-)
> 
> applied to for-next thanks

Oh wait, this is for netdev, not rdma sorry. Never mind, DaveM should
pick it up.

Jason

^ permalink raw reply

* Re: [RFC PATCH net-next v6 4/4] netvsc: refactor notifier/event handling code to use the bypass framework
From: Michael S. Tsirkin @ 2018-04-10 23:28 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: Sridhar Samudrala, davem, netdev, virtualization, virtio-dev,
	jesse.brandeburg, alexander.h.duyck, kubakici, jasowang,
	loseweigh, jiri
In-Reply-To: <20180410142608.50f15b45@xeon-e3>

On Tue, Apr 10, 2018 at 02:26:08PM -0700, Stephen Hemminger wrote:
> On Tue, 10 Apr 2018 11:59:50 -0700
> Sridhar Samudrala <sridhar.samudrala@intel.com> wrote:
> 
> > Use the registration/notification framework supported by the generic
> > bypass infrastructure.
> > 
> > Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
> > ---
> 
> Thanks for doing this.  Your current version has couple show stopper
> issues.
> 
> First, the slave device is instantly taking over the slave.
> This doesn't allow udev/systemd to do its device rename of the slave
> device. Netvsc uses a delayed work to workaround this.

Interesting. Does this mean udev must act within a specific time window
then?

> Secondly, the select queue needs to call queue selection in VF.
> The bonding/teaming logic doesn't work well for UDP flows.
> Commit b3bf5666a510 ("hv_netvsc: defer queue selection to VF")
> fixed this performance problem.
> 
> Lastly, more indirection is bad in current climate.
> 
> I am not completely adverse to this but it needs to be fast, simple
> and completely transparent.

^ permalink raw reply

* Re: [RFC PATCH net-next v6 4/4] netvsc: refactor notifier/event handling code to use the bypass framework
From: Siwei Liu @ 2018-04-10 23:44 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Stephen Hemminger, Sridhar Samudrala, David Miller, Netdev,
	virtualization, virtio-dev, Brandeburg, Jesse, Alexander Duyck,
	Jakub Kicinski, Jason Wang, Jiri Pirko
In-Reply-To: <20180411022807-mutt-send-email-mst@kernel.org>

On Tue, Apr 10, 2018 at 4:28 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> On Tue, Apr 10, 2018 at 02:26:08PM -0700, Stephen Hemminger wrote:
>> On Tue, 10 Apr 2018 11:59:50 -0700
>> Sridhar Samudrala <sridhar.samudrala@intel.com> wrote:
>>
>> > Use the registration/notification framework supported by the generic
>> > bypass infrastructure.
>> >
>> > Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
>> > ---
>>
>> Thanks for doing this.  Your current version has couple show stopper
>> issues.
>>
>> First, the slave device is instantly taking over the slave.
>> This doesn't allow udev/systemd to do its device rename of the slave
>> device. Netvsc uses a delayed work to workaround this.
>
> Interesting. Does this mean udev must act within a specific time window
> then?

Sighs, lots of hacks. Why propgating this from driver to a common
module. We really need a clean solution.

-Siwei


>
>> Secondly, the select queue needs to call queue selection in VF.
>> The bonding/teaming logic doesn't work well for UDP flows.
>> Commit b3bf5666a510 ("hv_netvsc: defer queue selection to VF")
>> fixed this performance problem.
>>
>> Lastly, more indirection is bad in current climate.
>>
>> I am not completely adverse to this but it needs to be fast, simple
>> and completely transparent.

^ permalink raw reply

* Re: [RFC PATCH net-next v6 4/4] netvsc: refactor notifier/event handling code to use the bypass framework
From: Stephen Hemminger @ 2018-04-10 23:59 UTC (permalink / raw)
  To: Siwei Liu
  Cc: Alexander Duyck, virtio-dev, Jiri Pirko, Michael S. Tsirkin,
	Jakub Kicinski, Sridhar Samudrala, virtualization, Netdev,
	David Miller
In-Reply-To: <CADGSJ22rVsC0TDTd6OKVnwbx0ExoQ8xWXBMumKB-OFH4sX=yaQ@mail.gmail.com>

On Tue, 10 Apr 2018 16:44:47 -0700
Siwei Liu <loseweigh@gmail.com> wrote:

> On Tue, Apr 10, 2018 at 4:28 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> > On Tue, Apr 10, 2018 at 02:26:08PM -0700, Stephen Hemminger wrote:  
> >> On Tue, 10 Apr 2018 11:59:50 -0700
> >> Sridhar Samudrala <sridhar.samudrala@intel.com> wrote:
> >>  
> >> > Use the registration/notification framework supported by the generic
> >> > bypass infrastructure.
> >> >
> >> > Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
> >> > ---  
> >>
> >> Thanks for doing this.  Your current version has couple show stopper
> >> issues.
> >>
> >> First, the slave device is instantly taking over the slave.
> >> This doesn't allow udev/systemd to do its device rename of the slave
> >> device. Netvsc uses a delayed work to workaround this.  
> >
> > Interesting. Does this mean udev must act within a specific time window
> > then?  
> 
> Sighs, lots of hacks. Why propgating this from driver to a common
> module. We really need a clean solution.
> 

I had a patch to wait for udev to do the rename and go from there
but davem rejected it.

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox