* Re: [RFC PATCH net-next v5 2/4] net: Introduce generic bypass module
From: Jiri Pirko @ 2018-04-06 12:57 UTC (permalink / raw)
To: Sridhar Samudrala
Cc: alexander.h.duyck, virtio-dev, mst, kubakici, netdev,
virtualization, loseweigh, davem
In-Reply-To: <1522962503-3963-3-git-send-email-sridhar.samudrala@intel.com>
Thu, Apr 05, 2018 at 11:08:21PM CEST, sridhar.samudrala@intel.com wrote:
>This provides a generic interface for paravirtual drivers to listen
>for netdev register/unregister/link change events from pci ethernet
>devices with the same MAC and takeover their datapath. The notifier and
>event handling code is based on the existing netvsc implementation. A
>paravirtual driver can use this module by registering a set of ops and
>each instance of the device when it is probed.
>
>Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
>---
> include/net/bypass.h | 80 ++++++++++
> net/Kconfig | 18 +++
> net/core/Makefile | 1 +
> net/core/bypass.c | 406 +++++++++++++++++++++++++++++++++++++++++++++++++++
> 4 files changed, 505 insertions(+)
> create mode 100644 include/net/bypass.h
> create mode 100644 net/core/bypass.c
>
>diff --git a/include/net/bypass.h b/include/net/bypass.h
>new file mode 100644
>index 000000000000..e2dd122f951a
>--- /dev/null
>+++ b/include/net/bypass.h
>@@ -0,0 +1,80 @@
>+// SPDX-License-Identifier: GPL-2.0
>+/* Copyright (c) 2018, Intel Corporation. */
>+
>+#ifndef _NET_BYPASS_H
>+#define _NET_BYPASS_H
>+
>+#include <linux/netdevice.h>
>+
>+struct bypass_ops {
Perhaps "net_bypass_" would be better prefix for this module structs
and functions. No strong opinion though.
>+ int (*register_child)(struct net_device *bypass_netdev,
>+ struct net_device *child_netdev);
We have master/slave upper/lower netdevices. This adds "child". Consider
using some existing names. Not sure if possible without loss of meaning.
>+ int (*join_child)(struct net_device *bypass_netdev,
>+ struct net_device *child_netdev);
>+ int (*unregister_child)(struct net_device *bypass_netdev,
>+ struct net_device *child_netdev);
>+ int (*release_child)(struct net_device *bypass_netdev,
>+ struct net_device *child_netdev);
>+ int (*update_link)(struct net_device *bypass_netdev,
>+ struct net_device *child_netdev);
>+ rx_handler_result_t (*handle_frame)(struct sk_buff **pskb);
>+};
>+
>+struct bypass_instance {
>+ struct list_head list;
>+ struct net_device __rcu *bypass_netdev;
>+ struct bypass *bypass;
>+};
>+
>+struct bypass {
>+ struct list_head list;
>+ const struct bypass_ops *ops;
>+ const struct net_device_ops *netdev_ops;
>+ struct list_head instance_list;
>+ struct mutex lock;
>+};
>+
>+#if IS_ENABLED(CONFIG_NET_BYPASS)
>+
>+struct bypass *bypass_register_driver(const struct bypass_ops *ops,
>+ const struct net_device_ops *netdev_ops);
>+void bypass_unregister_driver(struct bypass *bypass);
>+
>+int bypass_register_instance(struct bypass *bypass, struct net_device *dev);
>+int bypass_unregister_instance(struct bypass *bypass, struct net_device *dev);
>+
>+int bypass_unregister_child(struct net_device *child_netdev);
>+
>+#else
>+
>+static inline
>+struct bypass *bypass_register_driver(const struct bypass_ops *ops,
>+ const struct net_device_ops *netdev_ops)
>+{
>+ return NULL;
>+}
>+
>+static inline void bypass_unregister_driver(struct bypass *bypass)
>+{
>+}
>+
>+static inline int bypass_register_instance(struct bypass *bypass,
>+ struct net_device *dev)
>+{
>+ return 0;
>+}
>+
>+static inline int bypass_unregister_instance(struct bypass *bypass,
>+ struct net_device *dev)
>+{
>+ return 0;
>+}
>+
>+static inline int bypass_unregister_child(struct net_device *child_netdev)
>+{
>+ return 0;
>+}
>+
>+#endif
>+
>+#endif /* _NET_BYPASS_H */
>diff --git a/net/Kconfig b/net/Kconfig
>index 0428f12c25c2..994445f4a96a 100644
>--- a/net/Kconfig
>+++ b/net/Kconfig
>@@ -423,6 +423,24 @@ config MAY_USE_DEVLINK
> on MAY_USE_DEVLINK to ensure they do not cause link errors when
> devlink is a loadable module and the driver using it is built-in.
>
>+config NET_BYPASS
>+ tristate "Bypass interface"
>+ ---help---
>+ This provides a generic interface for paravirtual drivers to listen
>+ for netdev register/unregister/link change events from pci ethernet
>+ devices with the same MAC and takeover their datapath. This also
>+ enables live migration of a VM with direct attached VF by failing
>+ over to the paravirtual datapath when the VF is unplugged.
>+
>+config MAY_USE_BYPASS
>+ tristate
>+ default m if NET_BYPASS=m
>+ default y if NET_BYPASS=y || NET_BYPASS=n
>+ help
>+ Drivers using the bypass infrastructure should have a dependency
>+ on MAY_USE_BYPASS to ensure they do not cause link errors when
>+ bypass is a loadable module and the driver using it is built-in.
>+
> endif # if NET
>
> # Used by archs to tell that they support BPF JIT compiler plus which flavour.
>diff --git a/net/core/Makefile b/net/core/Makefile
>index 6dbbba8c57ae..a9727ed1c8fc 100644
>--- a/net/core/Makefile
>+++ b/net/core/Makefile
>@@ -30,3 +30,4 @@ obj-$(CONFIG_DST_CACHE) += dst_cache.o
> obj-$(CONFIG_HWBM) += hwbm.o
> obj-$(CONFIG_NET_DEVLINK) += devlink.o
> obj-$(CONFIG_GRO_CELLS) += gro_cells.o
>+obj-$(CONFIG_NET_BYPASS) += bypass.o
>diff --git a/net/core/bypass.c b/net/core/bypass.c
>new file mode 100644
>index 000000000000..7bde962ec3d4
>--- /dev/null
>+++ b/net/core/bypass.c
>@@ -0,0 +1,406 @@
>+// SPDX-License-Identifier: GPL-2.0
>+/* Copyright (c) 2018, Intel Corporation. */
>+
>+/* A common module to handle registrations and notifications for paravirtual
>+ * drivers to enable accelerated datapath and support VF live migration.
>+ *
>+ * The notifier and event handling code is based on netvsc driver.
>+ */
>+
>+#include <linux/netdevice.h>
>+#include <linux/etherdevice.h>
>+#include <linux/ethtool.h>
>+#include <linux/module.h>
>+#include <linux/slab.h>
>+#include <linux/netdevice.h>
>+#include <linux/netpoll.h>
>+#include <linux/rtnetlink.h>
>+#include <linux/if_vlan.h>
>+#include <net/sch_generic.h>
>+#include <uapi/linux/if_arp.h>
>+#include <net/bypass.h>
>+
>+static LIST_HEAD(bypass_list);
>+
>+static DEFINE_MUTEX(bypass_mutex);
Why mutex? Apparently you don't need to sleep while holding a lock.
Simple spinlock would do.
>+
>+struct bypass_instance *bypass_instance_alloc(struct net_device *dev)
>+{
>+ struct bypass_instance *bypass_instance;
>+
>+ bypass_instance = kzalloc(sizeof(*bypass_instance), GFP_KERNEL);
>+ if (!bypass_instance)
>+ return NULL;
>+
>+ dev_hold(dev);
>+ rcu_assign_pointer(bypass_instance->bypass_netdev, dev);
>+
>+ return bypass_instance;
>+}
>+
>+void bypass_instance_free(struct bypass_instance *bypass_instance)
>+{
>+ struct net_device *bypass_netdev;
>+
>+ bypass_netdev = rcu_dereference(bypass_instance->bypass_netdev);
>+
>+ dev_put(bypass_netdev);
>+ kfree(bypass_instance);
>+}
>+
>+static struct bypass_instance *bypass_get_instance_bymac(u8 *mac)
>+{
>+ struct bypass_instance *bypass_instance;
>+ struct net_device *bypass_netdev;
>+ struct bypass *bypass;
>+
>+ list_for_each_entry(bypass, &bypass_list, list) {
>+ mutex_lock(&bypass->lock);
>+ list_for_each_entry(bypass_instance, &bypass->instance_list,
>+ list) {
>+ bypass_netdev =
>+ rcu_dereference(bypass_instance->bypass_netdev);
>+ if (ether_addr_equal(bypass_netdev->perm_addr, mac)) {
>+ mutex_unlock(&bypass->lock);
>+ goto out;
>+ }
>+ }
>+ mutex_unlock(&bypass->lock);
>+ }
>+
>+ bypass_instance = NULL;
>+out:
>+ return bypass_instance;
>+}
>+
>+static int bypass_register_child(struct net_device *child_netdev)
>+{
>+ struct bypass_instance *bypass_instance;
>+ struct bypass *bypass;
>+ struct net_device *bypass_netdev;
>+ int ret, orig_mtu;
>+
>+ ASSERT_RTNL();
>+
>+ mutex_lock(&bypass_mutex);
>+ bypass_instance = bypass_get_instance_bymac(child_netdev->perm_addr);
>+ if (!bypass_instance) {
>+ mutex_unlock(&bypass_mutex);
>+ goto done;
>+ }
>+
>+ bypass_netdev = rcu_dereference(bypass_instance->bypass_netdev);
>+ bypass = bypass_instance->bypass;
>+ mutex_unlock(&bypass_mutex);
>+
>+ if (!bypass->ops->register_child)
>+ goto done;
>+
>+ ret = bypass->ops->register_child(bypass_netdev, child_netdev);
>+ if (ret != 0)
>+ goto done;
>+
>+ ret = netdev_rx_handler_register(child_netdev,
>+ bypass->ops->handle_frame,
>+ bypass_netdev);
>+ if (ret != 0) {
>+ netdev_err(child_netdev,
>+ "can not register bypass rx handler (err = %d)\n",
>+ ret);
>+ goto rx_handler_failed;
>+ }
>+
>+ ret = netdev_upper_dev_link(child_netdev, bypass_netdev, NULL);
>+ if (ret != 0) {
>+ netdev_err(child_netdev,
No line-wrap is needed here and in other cases like this.
>+ "can not set master device %s (err = %d)\n",
>+ bypass_netdev->name, ret);
>+ goto upper_link_failed;
>+ }
>+
>+ child_netdev->flags |= IFF_SLAVE;
Don't reuse IFF_SLAVE. That is bonding-specific thing. I know that
netvsc uses it, it is wrong.
Please rather introduce:
IFF_BYPASS for master
and IFF_BYPASS_SLAVE for slaves.
>+
>+ if (netif_running(bypass_netdev)) {
>+ ret = dev_open(child_netdev);
>+ if (ret && (ret != -EBUSY)) {
>+ netdev_err(bypass_netdev,
>+ "Opening child %s failed ret:%d\n",
>+ child_netdev->name, ret);
>+ goto err_interface_up;
>+ }
>+ }
>+
>+ /* Align MTU of child with master */
>+ orig_mtu = child_netdev->mtu;
>+ ret = dev_set_mtu(child_netdev, bypass_netdev->mtu);
>+ if (ret != 0) {
>+ netdev_err(bypass_netdev,
>+ "unable to change mtu of %s to %u register failed\n",
>+ child_netdev->name, bypass_netdev->mtu);
>+ goto err_set_mtu;
>+ }
>+
>+ ret = bypass->ops->join_child(bypass_netdev, child_netdev);
>+ if (ret != 0)
>+ goto err_join;
>+
>+ call_netdevice_notifiers(NETDEV_JOIN, child_netdev);
>+
>+ goto done;
>+
>+err_join:
>+ dev_set_mtu(child_netdev, orig_mtu);
>+err_set_mtu:
>+ dev_close(child_netdev);
>+err_interface_up:
>+ netdev_upper_dev_unlink(child_netdev, bypass_netdev);
>+ child_netdev->flags &= ~IFF_SLAVE;
>+upper_link_failed:
>+ netdev_rx_handler_unregister(child_netdev);
>+rx_handler_failed:
>+ bypass->ops->unregister_child(bypass_netdev, child_netdev);
>+
>+done:
>+ return NOTIFY_DONE;
>+}
>+
>+int bypass_unregister_child(struct net_device *child_netdev)
>+{
>+ struct bypass_instance *bypass_instance;
>+ struct net_device *bypass_netdev;
>+ struct bypass *bypass;
>+ int ret;
>+
>+ ASSERT_RTNL();
>+
>+ mutex_lock(&bypass_mutex);
>+ bypass_instance = bypass_get_instance_bymac(child_netdev->perm_addr);
>+ if (!bypass_instance) {
>+ mutex_unlock(&bypass_mutex);
>+ goto done;
>+ }
>+
>+ bypass_netdev = rcu_dereference(bypass_instance->bypass_netdev);
>+ bypass = bypass_instance->bypass;
>+ mutex_unlock(&bypass_mutex);
>+
>+ ret = bypass->ops->release_child(bypass_netdev, child_netdev);
>+ if (ret != 0)
>+ goto done;
>+
>+ netdev_rx_handler_unregister(child_netdev);
>+ netdev_upper_dev_unlink(child_netdev, bypass_netdev);
>+ child_netdev->flags &= ~IFF_SLAVE;
>+
>+ if (!bypass->ops->unregister_child)
>+ goto done;
>+
>+ bypass->ops->unregister_child(bypass_netdev, child_netdev);
>+
>+done:
>+ return NOTIFY_DONE;
>+}
>+EXPORT_SYMBOL(bypass_unregister_child);
Please use "EXPORT_SYMBOL_GPL" for all exported symbols.
>+
>+static int bypass_update_link(struct net_device *child_netdev)
>+{
>+ struct bypass_instance *bypass_instance;
>+ struct net_device *bypass_netdev;
>+ struct bypass *bypass;
>+
>+ ASSERT_RTNL();
>+
>+ mutex_lock(&bypass_mutex);
>+ bypass_instance = bypass_get_instance_bymac(child_netdev->perm_addr);
You don't really need this lookup. The kernel knows about the master
device, you can just use netdev_master_upper_dev_get_rcu() to get it.
>+ if (!bypass_instance) {
>+ mutex_unlock(&bypass_mutex);
>+ goto done;
>+ }
>+
>+ bypass_netdev = rcu_dereference(bypass_instance->bypass_netdev);
>+ bypass = bypass_instance->bypass;
>+ mutex_unlock(&bypass_mutex);
>+
>+ if (!bypass->ops->update_link)
>+ goto done;
>+
>+ bypass->ops->update_link(bypass_netdev, child_netdev);
>+
>+done:
>+ return NOTIFY_DONE;
>+}
>+
>+static bool bypass_validate_child_dev(struct net_device *dev)
>+{
>+ /* Avoid non-Ethernet type devices */
>+ if (dev->type != ARPHRD_ETHER)
>+ return false;
>+
>+ /* Avoid Vlan dev with same MAC registering as VF */
>+ if (is_vlan_dev(dev))
>+ return false;
>+
>+ /* Avoid Bonding master dev with same MAC registering as child dev */
>+ if ((dev->priv_flags & IFF_BONDING) && (dev->flags & IFF_MASTER))
>+ return false;
>+
>+ return true;
>+}
>+
>+static int
>+bypass_event(struct notifier_block *this, unsigned long event, void *ptr)
>+{
>+ struct net_device *event_dev = netdev_notifier_info_to_dev(ptr);
>+ struct bypass *bypass;
>+
>+ /* Skip Parent events */
>+ mutex_lock(&bypass_mutex);
>+ list_for_each_entry(bypass, &bypass_list, list) {
>+ if (event_dev->netdev_ops == bypass->netdev_ops) {
>+ mutex_unlock(&bypass_mutex);
>+ return NOTIFY_DONE;
>+ }
What you need instead of this is an identification helper
netif_is_bypass_master()
similar to
netif_is_team_master()
netif_is_bridge_master()
etc
>+ }
>+ mutex_unlock(&bypass_mutex);
>+
>+ if (!bypass_validate_child_dev(event_dev))
>+ return NOTIFY_DONE;
>+
>+ switch (event) {
>+ case NETDEV_REGISTER:
>+ return bypass_register_child(event_dev);
>+ case NETDEV_UNREGISTER:
>+ return bypass_unregister_child(event_dev);
>+ case NETDEV_UP:
>+ case NETDEV_DOWN:
>+ case NETDEV_CHANGE:
>+ return bypass_update_link(event_dev);
>+ default:
>+ return NOTIFY_DONE;
>+ }
>+}
>+
>+static struct notifier_block bypass_notifier = {
>+ .notifier_call = bypass_event,
>+};
>+
>+static void bypass_register_existing_child(struct net_device *bypass_netdev)
>+{
>+ struct net *net = dev_net(bypass_netdev);
>+ struct net_device *dev;
>+
>+ rtnl_lock();
>+ for_each_netdev(net, dev) {
>+ if (dev == bypass_netdev)
>+ continue;
>+ if (!bypass_validate_child_dev(dev))
>+ continue;
>+ if (ether_addr_equal(bypass_netdev->perm_addr, dev->perm_addr))
>+ bypass_register_child(dev);
>+ }
>+ rtnl_unlock();
>+}
>+
>+int bypass_register_instance(struct bypass *bypass, struct net_device *dev)
>+{
>+ struct bypass_instance *bypass_instance;
No need to allocate this instace here. You can just have is embedded
inside netdevice priv and pass pointer to it here. You can pass the
pointer back to the driver when you call ops as the driver can get priv
back by it.
I would also call it "struct bypass_master" and this function
"bypass_master_register".
It should contain the ops pointer too.
>+ struct net_device *bypass_netdev;
>+ int ret = 0;
>+
>+ mutex_lock(&bypass->lock);
>+ list_for_each_entry(bypass_instance, &bypass->instance_list, list) {
>+ bypass_netdev = rcu_dereference(bypass_instance->bypass_netdev);
>+ if (bypass_netdev == dev) {
This means the driver registered one netdev twice. That is a bug in
driver, so WARN_ON would be nice here to point that out.
>+ ret = -EEXIST;
>+ goto done;
>+ }
>+ }
>+
>+ bypass_instance = bypass_instance_alloc(dev);
>+ if (!bypass_instance) {
>+ ret = -ENOMEM;
>+ goto done;
>+ }
>+
>+ bypass_instance->bypass = bypass;
>+ list_add_tail(&bypass_instance->list, &bypass->instance_list);
>+
>+done:
>+ mutex_unlock(&bypass->lock);
>+ bypass_register_existing_child(dev);
>+ return ret;
>+}
>+EXPORT_SYMBOL(bypass_register_instance);
>+
>+int bypass_unregister_instance(struct bypass *bypass, struct net_device *dev)
>+{
>+ struct bypass_instance *bypass_instance;
>+ struct net_device *bypass_netdev;
>+ int ret = 0;
>+
>+ mutex_lock(&bypass->lock);
>+ list_for_each_entry(bypass_instance, &bypass->instance_list, list) {
>+ bypass_netdev = rcu_dereference(bypass_instance->bypass_netdev);
>+ if (bypass_netdev == dev) {
>+ list_del(&bypass_instance->list);
>+ bypass_instance_free(bypass_instance);
>+ goto done;
>+ }
>+ }
>+
>+ ret = -ENOENT;
>+done:
>+ mutex_unlock(&bypass->lock);
>+ return ret;
>+}
>+EXPORT_SYMBOL(bypass_unregister_instance);
>+
>+struct bypass *bypass_register_driver(const struct bypass_ops *ops,
>+ const struct net_device_ops *netdev_ops)
I don't see why you need a list of drivers. What you need is just a list
of instances - bypass masters (probably to call them like that in the
code as well). Well, you can use the common netdevice list for that
purpose with the identification helper I mentioned above. Then you need
no lists and no mutexes/spinlocks.
>+{
>+ struct bypass *bypass;
>+
>+ bypass = kzalloc(sizeof(*bypass), GFP_KERNEL);
>+ if (!bypass)
>+ return NULL;
>+
>+ bypass->ops = ops;
>+ bypass->netdev_ops = netdev_ops;
>+ INIT_LIST_HEAD(&bypass->instance_list);
>+
>+ mutex_lock(&bypass_mutex);
>+ list_add_tail(&bypass->list, &bypass_list);
>+ mutex_unlock(&bypass_mutex);
>+
>+ return bypass;
>+}
>+EXPORT_SYMBOL_GPL(bypass_register_driver);
>+
>+void bypass_unregister_driver(struct bypass *bypass)
>+{
>+ mutex_lock(&bypass_mutex);
>+ list_del(&bypass->list);
>+ mutex_unlock(&bypass_mutex);
>+
>+ kfree(bypass);
>+}
>+EXPORT_SYMBOL_GPL(bypass_unregister_driver);
>+
>+static __init int
>+bypass_init(void)
>+{
>+ register_netdevice_notifier(&bypass_notifier);
>+
>+ return 0;
>+}
>+module_init(bypass_init);
>+
>+static __exit
>+void bypass_exit(void)
>+{
>+ unregister_netdevice_notifier(&bypass_notifier);
>+}
>+module_exit(bypass_exit);
>+
>+MODULE_DESCRIPTION("Bypass infrastructure/interface for Paravirtual drivers");
>+MODULE_LICENSE("GPL v2");
>--
>2.14.3
>
^ permalink raw reply
* Re: [RFC PATCH net-next v5 3/4] virtio_net: Extend virtio to use VF datapath when available
From: Jiri Pirko @ 2018-04-06 12:48 UTC (permalink / raw)
To: Sridhar Samudrala
Cc: alexander.h.duyck, virtio-dev, mst, kubakici, netdev,
virtualization, loseweigh, davem
In-Reply-To: <1522962503-3963-4-git-send-email-sridhar.samudrala@intel.com>
Thu, Apr 05, 2018 at 11:08:22PM CEST, sridhar.samudrala@intel.com wrote:
>This patch enables virtio_net to switch over to a VF datapath when a VF
>netdev is present with the same MAC address. It allows live migration
>of a VM with a direct attached VF without the need to setup a bond/team
>between a VF and virtio net device in the guest.
>
>The hypervisor needs to enable only one datapath at any time so that
>packets don't get looped back to the VM over the other datapath. When a VF
>is plugged, the virtio datapath link state can be marked as down. The
>hypervisor needs to unplug the VF device from the guest on the source host
>and reset the MAC filter of the VF to initiate failover of datapath to
>virtio before starting the migration. After the migration is completed,
>the destination hypervisor sets the MAC filter on the VF and plugs it back
>to the guest to switch over to VF datapath.
>
>When BACKUP feature is enabled, an additional netdev(bypass netdev) is
>created that acts as a master device and tracks the state of the 2 lower
>netdevs. The original virtio_net netdev is marked as 'backup' netdev and a
>passthru device with the same MAC is registered as 'active' netdev.
>
>This patch is based on the discussion initiated by Jesse on this thread.
>https://marc.info/?l=linux-virtualization&m=151189725224231&w=2
>
>Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
>---
> drivers/net/Kconfig | 1 +
> drivers/net/virtio_net.c | 612 ++++++++++++++++++++++++++++++++++++++++++++++-
> 2 files changed, 612 insertions(+), 1 deletion(-)
>
>diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
>index 891846655000..9e2cf61fd1c1 100644
>--- a/drivers/net/Kconfig
>+++ b/drivers/net/Kconfig
>@@ -331,6 +331,7 @@ config VETH
> config VIRTIO_NET
> tristate "Virtio network driver"
> depends on VIRTIO
>+ depends on MAY_USE_BYPASS
> ---help---
> This is the virtual network driver for virtio. It can be used with
> QEMU based VMMs (like KVM or Xen). Say Y or M.
>diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
>index befb5944f3fd..86b2f8f2947d 100644
>--- a/drivers/net/virtio_net.c
>+++ b/drivers/net/virtio_net.c
>@@ -30,8 +30,11 @@
> #include <linux/cpu.h>
> #include <linux/average.h>
> #include <linux/filter.h>
>+#include <linux/netdevice.h>
>+#include <linux/pci.h>
> #include <net/route.h>
> #include <net/xdp.h>
>+#include <net/bypass.h>
>
> static int napi_weight = NAPI_POLL_WEIGHT;
> module_param(napi_weight, int, 0444);
>@@ -206,6 +209,9 @@ struct virtnet_info {
> u32 speed;
>
> unsigned long guest_offloads;
>+
>+ /* upper netdev created when BACKUP feature enabled */
>+ struct net_device __rcu *bypass_netdev;
> };
>
> struct padded_vnet_hdr {
>@@ -2275,6 +2281,22 @@ static int virtnet_xdp(struct net_device *dev, struct netdev_bpf *xdp)
> }
> }
>
>+static int virtnet_get_phys_port_name(struct net_device *dev, char *buf,
>+ size_t len)
>+{
>+ struct virtnet_info *vi = netdev_priv(dev);
>+ int ret;
>+
>+ if (!virtio_has_feature(vi->vdev, VIRTIO_NET_F_BACKUP))
>+ return -EOPNOTSUPP;
>+
>+ ret = snprintf(buf, len, "_bkup");
>+ if (ret >= len)
>+ return -EOPNOTSUPP;
>+
>+ return 0;
>+}
>+
> static const struct net_device_ops virtnet_netdev = {
> .ndo_open = virtnet_open,
> .ndo_stop = virtnet_close,
>@@ -2292,6 +2314,7 @@ static const struct net_device_ops virtnet_netdev = {
> .ndo_xdp_xmit = virtnet_xdp_xmit,
> .ndo_xdp_flush = virtnet_xdp_flush,
> .ndo_features_check = passthru_features_check,
>+ .ndo_get_phys_port_name = virtnet_get_phys_port_name,
> };
>
> static void virtnet_config_changed_work(struct work_struct *work)
>@@ -2689,6 +2712,576 @@ static int virtnet_validate(struct virtio_device *vdev)
> return 0;
> }
>
>+/* START of functions supporting VIRTIO_NET_F_BACKUP feature.
>+ * When BACKUP feature is enabled, an additional netdev(bypass netdev)
>+ * is created that acts as a master device and tracks the state of the
>+ * 2 lower netdevs. The original virtio_net netdev is registered as
>+ * 'backup' netdev and a passthru device with the same MAC is registered
>+ * as 'active' netdev.
>+ */
>+
>+/* bypass state maintained when BACKUP feature is enabled */
>+struct virtnet_bypass_info {
>+ /* passthru netdev with same MAC */
>+ struct net_device __rcu *active_netdev;
>+
>+ /* virtio_net netdev */
>+ struct net_device __rcu *backup_netdev;
>+
>+ /* active netdev stats */
>+ struct rtnl_link_stats64 active_stats;
>+
>+ /* backup netdev stats */
>+ struct rtnl_link_stats64 backup_stats;
>+
>+ /* aggregated stats */
>+ struct rtnl_link_stats64 bypass_stats;
>+
>+ /* spinlock while updating stats */
>+ spinlock_t stats_lock;
>+};
>+
>+static int virtnet_bypass_open(struct net_device *dev)
>+{
>+ struct virtnet_bypass_info *vbi = netdev_priv(dev);
>+ struct net_device *active_netdev, *backup_netdev;
>+ int err;
>+
>+ netif_carrier_off(dev);
>+ netif_tx_wake_all_queues(dev);
>+
>+ active_netdev = rtnl_dereference(vbi->active_netdev);
>+ if (active_netdev) {
>+ err = dev_open(active_netdev);
>+ if (err)
>+ goto err_active_open;
>+ }
>+
>+ backup_netdev = rtnl_dereference(vbi->backup_netdev);
>+ if (backup_netdev) {
>+ err = dev_open(backup_netdev);
>+ if (err)
>+ goto err_backup_open;
>+ }
This should be moved to bypass module.
See "***" below.
>+
>+ return 0;
>+
>+err_backup_open:
>+ dev_close(active_netdev);
>+err_active_open:
>+ netif_tx_disable(dev);
>+ return err;
>+}
>+
>+static int virtnet_bypass_close(struct net_device *dev)
>+{
>+ struct virtnet_bypass_info *vi = netdev_priv(dev);
>+ struct net_device *child_netdev;
>+
>+ netif_tx_disable(dev);
>+
>+ child_netdev = rtnl_dereference(vi->active_netdev);
>+ if (child_netdev)
>+ dev_close(child_netdev);
>+
>+ child_netdev = rtnl_dereference(vi->backup_netdev);
>+ if (child_netdev)
>+ dev_close(child_netdev);
This should be moved to bypass module.
>+
>+ return 0;
>+}
>+
>+static netdev_tx_t virtnet_bypass_drop_xmit(struct sk_buff *skb,
>+ struct net_device *dev)
>+{
>+ atomic_long_inc(&dev->tx_dropped);
>+ dev_kfree_skb_any(skb);
>+ return NETDEV_TX_OK;
>+}
>+
>+static bool virtnet_bypass_xmit_ready(struct net_device *dev)
>+{
>+ return netif_running(dev) && netif_carrier_ok(dev);
>+}
>+
>+static netdev_tx_t virtnet_bypass_start_xmit(struct sk_buff *skb,
>+ struct net_device *dev)
>+{
>+ struct virtnet_bypass_info *vbi = netdev_priv(dev);
>+ struct net_device *xmit_dev;
>+
>+ /* Try xmit via active netdev followed by backup netdev */
>+ xmit_dev = rcu_dereference_bh(vbi->active_netdev);
>+ if (!xmit_dev || !virtnet_bypass_xmit_ready(xmit_dev)) {
>+ xmit_dev = rcu_dereference_bh(vbi->backup_netdev);
This should be moved to bypass module.
>+ if (!xmit_dev || !virtnet_bypass_xmit_ready(xmit_dev))
>+ return virtnet_bypass_drop_xmit(skb, dev);
>+ }
>+
>+ skb->dev = xmit_dev;
>+ skb->queue_mapping = qdisc_skb_cb(skb)->slave_dev_queue_mapping;
>+
>+ return dev_queue_xmit(skb);
>+}
>+
>+static u16 virtnet_bypass_select_queue(struct net_device *dev,
>+ struct sk_buff *skb, void *accel_priv,
>+ select_queue_fallback_t fallback)
>+{
>+ /* This helper function exists to help dev_pick_tx get the correct
>+ * destination queue. Using a helper function skips a call to
>+ * skb_tx_hash and will put the skbs in the queue we expect on their
>+ * way down to the bonding driver.
>+ */
>+ u16 txq = skb_rx_queue_recorded(skb) ? skb_get_rx_queue(skb) : 0;
>+
>+ /* Save the original txq to restore before passing to the driver */
>+ qdisc_skb_cb(skb)->slave_dev_queue_mapping = skb->queue_mapping;
>+
>+ if (unlikely(txq >= dev->real_num_tx_queues)) {
>+ do {
>+ txq -= dev->real_num_tx_queues;
>+ } while (txq >= dev->real_num_tx_queues);
>+ }
>+
>+ return txq;
>+}
>+
>+/* fold stats, assuming all rtnl_link_stats64 fields are u64, but
>+ * that some drivers can provide 32bit values only.
>+ */
>+static void virtnet_bypass_fold_stats(struct rtnl_link_stats64 *_res,
>+ const struct rtnl_link_stats64 *_new,
>+ const struct rtnl_link_stats64 *_old)
>+{
>+ const u64 *new = (const u64 *)_new;
>+ const u64 *old = (const u64 *)_old;
>+ u64 *res = (u64 *)_res;
>+ int i;
>+
>+ for (i = 0; i < sizeof(*_res) / sizeof(u64); i++) {
>+ u64 nv = new[i];
>+ u64 ov = old[i];
>+ s64 delta = nv - ov;
>+
>+ /* detects if this particular field is 32bit only */
>+ if (((nv | ov) >> 32) == 0)
>+ delta = (s64)(s32)((u32)nv - (u32)ov);
>+
>+ /* filter anomalies, some drivers reset their stats
>+ * at down/up events.
>+ */
>+ if (delta > 0)
>+ res[i] += delta;
>+ }
>+}
>+
>+static void virtnet_bypass_get_stats(struct net_device *dev,
>+ struct rtnl_link_stats64 *stats)
>+{
>+ struct virtnet_bypass_info *vbi = netdev_priv(dev);
>+ const struct rtnl_link_stats64 *new;
>+ struct rtnl_link_stats64 temp;
>+ struct net_device *child_netdev;
>+
>+ spin_lock(&vbi->stats_lock);
>+ memcpy(stats, &vbi->bypass_stats, sizeof(*stats));
>+
>+ rcu_read_lock();
>+
>+ child_netdev = rcu_dereference(vbi->active_netdev);
>+ if (child_netdev) {
>+ new = dev_get_stats(child_netdev, &temp);
>+ virtnet_bypass_fold_stats(stats, new, &vbi->active_stats);
>+ memcpy(&vbi->active_stats, new, sizeof(*new));
>+ }
>+
>+ child_netdev = rcu_dereference(vbi->backup_netdev);
>+ if (child_netdev) {
>+ new = dev_get_stats(child_netdev, &temp);
>+ virtnet_bypass_fold_stats(stats, new, &vbi->backup_stats);
>+ memcpy(&vbi->backup_stats, new, sizeof(*new));
>+ }
>+
>+ rcu_read_unlock();
>+
>+ memcpy(&vbi->bypass_stats, stats, sizeof(*stats));
>+ spin_unlock(&vbi->stats_lock);
>+}
This should be moved to bypass module.
>+
>+static int virtnet_bypass_change_mtu(struct net_device *dev, int new_mtu)
>+{
>+ struct virtnet_bypass_info *vbi = netdev_priv(dev);
>+ struct net_device *active_netdev, *backup_netdev;
>+ int ret = 0;
>+
>+ active_netdev = rcu_dereference(vbi->active_netdev);
>+ if (active_netdev) {
>+ ret = dev_set_mtu(active_netdev, new_mtu);
>+ if (ret)
>+ return ret;
>+ }
>+
>+ backup_netdev = rcu_dereference(vbi->backup_netdev);
>+ if (backup_netdev) {
>+ ret = dev_set_mtu(backup_netdev, new_mtu);
>+ if (ret) {
>+ dev_set_mtu(active_netdev, dev->mtu);
>+ return ret;
>+ }
>+ }
>+
>+ dev->mtu = new_mtu;
>+ return 0;
>+}
This should be moved to bypass module.
>+
>+static void virtnet_bypass_set_rx_mode(struct net_device *dev)
>+{
>+ struct virtnet_bypass_info *vbi = netdev_priv(dev);
>+ struct net_device *child_netdev;
>+
>+ rcu_read_lock();
>+
>+ child_netdev = rcu_dereference(vbi->active_netdev);
>+ if (child_netdev) {
>+ dev_uc_sync_multiple(child_netdev, dev);
>+ dev_mc_sync_multiple(child_netdev, dev);
>+ }
>+
>+ child_netdev = rcu_dereference(vbi->backup_netdev);
>+ if (child_netdev) {
>+ dev_uc_sync_multiple(child_netdev, dev);
>+ dev_mc_sync_multiple(child_netdev, dev);
>+ }
>+
>+ rcu_read_unlock();
>+}
This should be moved to bypass module.
>+
>+static const struct net_device_ops virtnet_bypass_netdev_ops = {
>+ .ndo_open = virtnet_bypass_open,
>+ .ndo_stop = virtnet_bypass_close,
>+ .ndo_start_xmit = virtnet_bypass_start_xmit,
>+ .ndo_select_queue = virtnet_bypass_select_queue,
>+ .ndo_get_stats64 = virtnet_bypass_get_stats,
>+ .ndo_change_mtu = virtnet_bypass_change_mtu,
>+ .ndo_set_rx_mode = virtnet_bypass_set_rx_mode,
>+ .ndo_validate_addr = eth_validate_addr,
>+ .ndo_features_check = passthru_features_check,
>+};
>+
>+static int
>+virtnet_bypass_ethtool_get_link_ksettings(struct net_device *dev,
>+ struct ethtool_link_ksettings *cmd)
>+{
>+ struct virtnet_bypass_info *vbi = netdev_priv(dev);
>+ struct net_device *child_netdev;
>+
>+ child_netdev = rtnl_dereference(vbi->active_netdev);
>+ if (!child_netdev || !virtnet_bypass_xmit_ready(child_netdev)) {
>+ child_netdev = rtnl_dereference(vbi->backup_netdev);
>+ if (!child_netdev || !virtnet_bypass_xmit_ready(child_netdev)) {
>+ cmd->base.duplex = DUPLEX_UNKNOWN;
>+ cmd->base.port = PORT_OTHER;
>+ cmd->base.speed = SPEED_UNKNOWN;
>+
>+ return 0;
>+ }
>+ }
>+
>+ return __ethtool_get_link_ksettings(child_netdev, cmd);
>+}
>+
>+#define BYPASS_DRV_NAME "virtnet_bypass"
>+#define BYPASS_DRV_VERSION "0.1"
>+
>+static void virtnet_bypass_ethtool_get_drvinfo(struct net_device *dev,
>+ struct ethtool_drvinfo *drvinfo)
>+{
>+ strlcpy(drvinfo->driver, BYPASS_DRV_NAME, sizeof(drvinfo->driver));
>+ strlcpy(drvinfo->version, BYPASS_DRV_VERSION, sizeof(drvinfo->version));
>+}
>+
>+static const struct ethtool_ops virtnet_bypass_ethtool_ops = {
>+ .get_drvinfo = virtnet_bypass_ethtool_get_drvinfo,
>+ .get_link = ethtool_op_get_link,
>+ .get_link_ksettings = virtnet_bypass_ethtool_get_link_ksettings,
>+};
>+
>+static int virtnet_bypass_join_child(struct net_device *bypass_netdev,
>+ struct net_device *child_netdev)
>+{
>+ struct virtnet_bypass_info *vbi;
>+ bool backup;
>+
>+ vbi = netdev_priv(bypass_netdev);
>+ backup = (child_netdev->dev.parent == bypass_netdev->dev.parent);
>+ if (backup ? rtnl_dereference(vbi->backup_netdev) :
>+ rtnl_dereference(vbi->active_netdev)) {
>+ netdev_info(bypass_netdev,
>+ "%s attempting to join bypass dev when %s already present\n",
>+ child_netdev->name, backup ? "backup" : "active");
Bypass module should check if there is already some other netdev
enslaved and refuse right there.
The active/backup terminology is quite confusing. From the bonding world
that means active is the one which is currently used for tx of the
packets. And it depends on link and other things what netdev is declared
active. However here, it is different. Backup is always the virtio_net
instance even when it is active. Odd. Please change the terminology.
For "active" I suggest to use name "stolen".
*** Also, the 2 slave netdev pointers should be stored in the bypass
module instance, not in the drivers.
>+ return -EEXIST;
>+ }
>+
>+ dev_hold(child_netdev);
>+
>+ if (backup) {
>+ rcu_assign_pointer(vbi->backup_netdev, child_netdev);
>+ dev_get_stats(vbi->backup_netdev, &vbi->backup_stats);
>+ } else {
>+ rcu_assign_pointer(vbi->active_netdev, child_netdev);
>+ dev_get_stats(vbi->active_netdev, &vbi->active_stats);
>+ bypass_netdev->min_mtu = child_netdev->min_mtu;
>+ bypass_netdev->max_mtu = child_netdev->max_mtu;
>+ }
>+
>+ netdev_info(bypass_netdev, "child:%s joined\n", child_netdev->name);
>+
>+ return 0;
>+}
>+
>+static int virtnet_bypass_register_child(struct net_device *bypass_netdev,
>+ struct net_device *child_netdev)
>+{
>+ struct virtnet_bypass_info *vbi;
>+ bool backup;
>+
>+ vbi = netdev_priv(bypass_netdev);
>+ backup = (child_netdev->dev.parent == bypass_netdev->dev.parent);
>+ if (backup ? rtnl_dereference(vbi->backup_netdev) :
>+ rtnl_dereference(vbi->active_netdev)) {
>+ netdev_info(bypass_netdev,
>+ "%s attempting to register bypass dev when %s already present\n",
>+ child_netdev->name, backup ? "backup" : "active");
>+ return -EEXIST;
>+ }
>+
>+ /* Avoid non pci devices as active netdev */
>+ if (!backup && (!child_netdev->dev.parent ||
>+ !dev_is_pci(child_netdev->dev.parent)))
>+ return -EINVAL;
>+
>+ netdev_info(bypass_netdev, "child:%s registered\n", child_netdev->name);
>+
>+ return 0;
>+}
>+
>+static int virtnet_bypass_release_child(struct net_device *bypass_netdev,
>+ struct net_device *child_netdev)
>+{
>+ struct net_device *backup_netdev, *active_netdev;
>+ struct virtnet_bypass_info *vbi;
>+
>+ vbi = netdev_priv(bypass_netdev);
>+ active_netdev = rtnl_dereference(vbi->active_netdev);
>+ backup_netdev = rtnl_dereference(vbi->backup_netdev);
>+
>+ if (child_netdev != active_netdev && child_netdev != backup_netdev)
>+ return -EINVAL;
>+
>+ netdev_info(bypass_netdev, "child:%s released\n", child_netdev->name);
>+
>+ return 0;
>+}
>+
>+static int virtnet_bypass_unregister_child(struct net_device *bypass_netdev,
>+ struct net_device *child_netdev)
>+{
>+ struct net_device *backup_netdev, *active_netdev;
>+ struct virtnet_bypass_info *vbi;
>+
>+ vbi = netdev_priv(bypass_netdev);
>+ active_netdev = rtnl_dereference(vbi->active_netdev);
>+ backup_netdev = rtnl_dereference(vbi->backup_netdev);
>+
>+ if (child_netdev != active_netdev && child_netdev != backup_netdev)
>+ return -EINVAL;
>+
>+ if (child_netdev == backup_netdev) {
>+ RCU_INIT_POINTER(vbi->backup_netdev, NULL);
>+ } else {
>+ RCU_INIT_POINTER(vbi->active_netdev, NULL);
>+ if (backup_netdev) {
>+ bypass_netdev->min_mtu = backup_netdev->min_mtu;
>+ bypass_netdev->max_mtu = backup_netdev->max_mtu;
>+ }
>+ }
>+
>+ dev_put(child_netdev);
>+
>+ netdev_info(bypass_netdev, "child:%s unregistered\n",
>+ child_netdev->name);
>+
>+ return 0;
>+}
>+
>+static int virtnet_bypass_update_link(struct net_device *bypass_netdev,
>+ struct net_device *child_netdev)
>+{
>+ struct net_device *active_netdev, *backup_netdev;
>+ struct virtnet_bypass_info *vbi;
>+
>+ if (!netif_running(bypass_netdev))
>+ return 0;
>+
>+ vbi = netdev_priv(bypass_netdev);
>+
>+ active_netdev = rtnl_dereference(vbi->active_netdev);
>+ backup_netdev = rtnl_dereference(vbi->backup_netdev);
>+
>+ if (child_netdev != active_netdev && child_netdev != backup_netdev)
>+ return -EINVAL;
>+
>+ if ((active_netdev && virtnet_bypass_xmit_ready(active_netdev)) ||
>+ (backup_netdev && virtnet_bypass_xmit_ready(backup_netdev))) {
>+ netif_carrier_on(bypass_netdev);
>+ netif_tx_wake_all_queues(bypass_netdev);
>+ } else {
>+ netif_carrier_off(bypass_netdev);
>+ netif_tx_stop_all_queues(bypass_netdev);
>+ }
>+
>+ return 0;
>+}
>+
>+/* Called when child dev is injecting data into network stack.
>+ * Change the associated network device from lower dev to virtio.
>+ * note: already called with rcu_read_lock
>+ */
>+static rx_handler_result_t virtnet_bypass_handle_frame(struct sk_buff **pskb)
>+{
>+ struct sk_buff *skb = *pskb;
>+ struct net_device *ndev = rcu_dereference(skb->dev->rx_handler_data);
>+
>+ skb->dev = ndev;
>+
>+ return RX_HANDLER_ANOTHER;
>+}
Hmm, you have the rx_handler defined in drivers and you register it in
bypass module. It is odd because here you assume that the bypass module
passed "ndev" and rx_handler_data. Also, you don't need advanced
features rx_handler provides.
Instead, just register a common rx_handler defined in the bypass module
and have simple skb_rx callback here (static void).
>+
>+static const struct bypass_ops virtnet_bypass_ops = {
>+ .register_child = virtnet_bypass_register_child,
>+ .join_child = virtnet_bypass_join_child,
>+ .unregister_child = virtnet_bypass_unregister_child,
>+ .release_child = virtnet_bypass_release_child,
>+ .update_link = virtnet_bypass_update_link,
>+ .handle_frame = virtnet_bypass_handle_frame,
>+};
>+
>+static struct bypass *virtnet_bypass;
>+
>+static int virtnet_bypass_create(struct virtnet_info *vi)
>+{
>+ struct net_device *backup_netdev = vi->dev;
>+ struct device *dev = &vi->vdev->dev;
>+ struct net_device *bypass_netdev;
>+ int res;
>+
>+ /* Alloc at least 2 queues, for now we are going with 16 assuming
>+ * that most devices being bonded won't have too many queues.
>+ */
>+ bypass_netdev = alloc_etherdev_mq(sizeof(struct virtnet_bypass_info),
>+ 16);
>+ if (!bypass_netdev) {
>+ dev_err(dev, "Unable to allocate bypass_netdev!\n");
>+ return -ENOMEM;
>+ }
>+
>+ dev_net_set(bypass_netdev, dev_net(backup_netdev));
>+ SET_NETDEV_DEV(bypass_netdev, dev);
>+
>+ bypass_netdev->netdev_ops = &virtnet_bypass_netdev_ops;
>+ bypass_netdev->ethtool_ops = &virtnet_bypass_ethtool_ops;
>+
>+ /* Initialize the device options */
>+ bypass_netdev->flags |= IFF_MASTER;
I think I pointed that out already. Don't use "IFF_MASTER". That is
specific to bonding. As I suggested in the reply to the patch #2, you
should introduce IFF_BYPASS. Also, this flag should be set by the bypass
module. Just create the netdev and do things specific to virtio and then
call to bypass module, pass the netdev so it can do the rest. I think
that the flags, features etc would be also fine to set there.
>+ bypass_netdev->priv_flags |= IFF_UNICAST_FLT | IFF_NO_QUEUE;
>+ bypass_netdev->priv_flags &= ~(IFF_XMIT_DST_RELEASE |
>+ IFF_TX_SKB_SHARING);
>+
>+ /* don't acquire bypass netdev's netif_tx_lock when transmitting */
>+ bypass_netdev->features |= NETIF_F_LLTX;
>+
>+ /* Don't allow bypass devices to change network namespaces. */
>+ bypass_netdev->features |= NETIF_F_NETNS_LOCAL;
>+
>+ bypass_netdev->hw_features = NETIF_F_HW_CSUM | NETIF_F_SG |
>+ NETIF_F_FRAGLIST | NETIF_F_ALL_TSO |
>+ NETIF_F_HIGHDMA | NETIF_F_LRO;
>+
>+ bypass_netdev->hw_features |= NETIF_F_GSO_ENCAP_ALL;
>+ bypass_netdev->features |= bypass_netdev->hw_features;
>+
>+ /* For now treat bypass netdev as VLAN challenged since we
>+ * cannot assume VLAN functionality with a VF
Why? I don't see such drivers. But to be 100% correct, you should check
the NETIF_F_VLAN_CHALLENGED feature in bypass module during VF enslave
and forbid to do so if it is on.
>+ */
>+ bypass_netdev->features |= NETIF_F_VLAN_CHALLENGED;
>+
>+ memcpy(bypass_netdev->dev_addr, backup_netdev->dev_addr,
>+ bypass_netdev->addr_len);
>+
>+ bypass_netdev->min_mtu = backup_netdev->min_mtu;
>+ bypass_netdev->max_mtu = backup_netdev->max_mtu;
>+
>+ res = register_netdev(bypass_netdev);
>+ if (res < 0) {
>+ dev_err(dev, "Unable to register bypass_netdev!\n");
>+ goto err_register_netdev;
>+ }
>+
>+ netif_carrier_off(bypass_netdev);
>+
>+ res = bypass_register_instance(virtnet_bypass, bypass_netdev);
>+ if (res < 0)
>+ goto err_bypass;
>+
>+ rcu_assign_pointer(vi->bypass_netdev, bypass_netdev);
>+
>+ return 0;
>+
>+err_bypass:
>+ unregister_netdev(bypass_netdev);
>+err_register_netdev:
>+ free_netdev(bypass_netdev);
>+
>+ return res;
>+}
>+
>+static void virtnet_bypass_destroy(struct virtnet_info *vi)
>+{
>+ struct net_device *bypass_netdev;
>+ struct virtnet_bypass_info *vbi;
>+ struct net_device *child_netdev;
>+
>+ bypass_netdev = rcu_dereference(vi->bypass_netdev);
>+ /* no device found, nothing to free */
>+ if (!bypass_netdev)
>+ return;
>+
>+ vbi = netdev_priv(bypass_netdev);
>+
>+ netif_device_detach(bypass_netdev);
>+
>+ rtnl_lock();
>+
>+ child_netdev = rtnl_dereference(vbi->active_netdev);
>+ if (child_netdev)
>+ bypass_unregister_child(child_netdev);
>+
>+ child_netdev = rtnl_dereference(vbi->backup_netdev);
>+ if (child_netdev)
>+ bypass_unregister_child(child_netdev);
>+
>+ unregister_netdevice(bypass_netdev);
>+
>+ bypass_unregister_instance(virtnet_bypass, bypass_netdev);
>+
>+ rtnl_unlock();
>+
>+ free_netdev(bypass_netdev);
>+}
>+
>+/* END of functions supporting VIRTIO_NET_F_BACKUP feature. */
>+
> static int virtnet_probe(struct virtio_device *vdev)
> {
> int i, err = -ENOMEM;
>@@ -2839,10 +3432,15 @@ static int virtnet_probe(struct virtio_device *vdev)
>
> virtnet_init_settings(dev);
>
>+ if (virtio_has_feature(vdev, VIRTIO_NET_F_BACKUP)) {
>+ if (virtnet_bypass_create(vi) != 0)
You need to do:
err = virtnet_bypass_create(vi);
if (err)
otherwise you ignore err and virtnet_probe would return 0;
>+ goto free_vqs;
>+ }
>+
> err = register_netdev(dev);
> if (err) {
> pr_debug("virtio_net: registering device failed\n");
>- goto free_vqs;
>+ goto free_bypass;
> }
>
> virtio_device_ready(vdev);
>@@ -2879,6 +3477,8 @@ static int virtnet_probe(struct virtio_device *vdev)
> vi->vdev->config->reset(vdev);
>
> unregister_netdev(dev);
>+free_bypass:
>+ virtnet_bypass_destroy(vi);
> free_vqs:
> cancel_delayed_work_sync(&vi->refill);
> free_receive_page_frags(vi);
>@@ -2913,6 +3513,8 @@ static void virtnet_remove(struct virtio_device *vdev)
>
> unregister_netdev(vi->dev);
>
>+ virtnet_bypass_destroy(vi);
>+
> remove_vq_common(vi);
>
> free_netdev(vi->dev);
>@@ -2996,6 +3598,11 @@ static __init int virtio_net_driver_init(void)
> {
> int ret;
>
>+ virtnet_bypass = bypass_register_driver(&virtnet_bypass_ops,
>+ &virtnet_bypass_netdev_ops);
>+ if (!virtnet_bypass)
>+ return -ENOMEM;
If CONFIG_NET_BYPASS is undefined, you will always return -ENOMEM here.
>+
> ret = cpuhp_setup_state_multi(CPUHP_AP_ONLINE_DYN, "virtio/net:online",
> virtnet_cpu_online,
> virtnet_cpu_down_prep);
>@@ -3010,12 +3617,14 @@ static __init int virtio_net_driver_init(void)
> ret = register_virtio_driver(&virtio_net_driver);
> if (ret)
> goto err_virtio;
>+
> return 0;
> err_virtio:
> cpuhp_remove_multi_state(CPUHP_VIRT_NET_DEAD);
> err_dead:
> cpuhp_remove_multi_state(virtionet_online);
> out:
>+ bypass_unregister_driver(virtnet_bypass);
> return ret;
> }
> module_init(virtio_net_driver_init);
>@@ -3025,6 +3634,7 @@ static __exit void virtio_net_driver_exit(void)
> unregister_virtio_driver(&virtio_net_driver);
> cpuhp_remove_multi_state(CPUHP_VIRT_NET_DEAD);
> cpuhp_remove_multi_state(virtionet_online);
>+ bypass_unregister_driver(virtnet_bypass);
> }
> module_exit(virtio_net_driver_exit);
>
>--
>2.14.3
>
^ permalink raw reply
* [PATCH v31 4/4] virtio-balloon: VIRTIO_BALLOON_F_PAGE_POISON
From: Wei Wang @ 2018-04-06 12:17 UTC (permalink / raw)
To: virtio-dev, linux-kernel, virtualization, kvm, linux-mm, mst,
mhocko, akpm
Cc: yang.zhang.wz, riel, quan.xu0, liliang.opensource, huangzhichao,
pbonzini, nilal
In-Reply-To: <1523017045-18315-1-git-send-email-wei.w.wang@intel.com>
The VIRTIO_BALLOON_F_PAGE_POISON feature bit is used to indicate if the
guest is using page poisoning. Guest writes to the poison_val config
field to tell host about the page poisoning value in use.
Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Suggested-by: Michael S. Tsirkin <mst@redhat.com>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
---
drivers/virtio/virtio_balloon.c | 10 ++++++++++
include/uapi/linux/virtio_balloon.h | 3 +++
2 files changed, 13 insertions(+)
diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index aef73ee..47b02d5 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -714,6 +714,7 @@ static struct file_system_type balloon_fs = {
static int virtballoon_probe(struct virtio_device *vdev)
{
struct virtio_balloon *vb;
+ __u32 poison_val;
int err;
if (!vdev->config->get) {
@@ -759,6 +760,11 @@ static int virtballoon_probe(struct virtio_device *vdev)
vb->stop_cmd_id = cpu_to_virtio32(vb->vdev,
VIRTIO_BALLOON_FREE_PAGE_REPORT_STOP_ID);
INIT_WORK(&vb->report_free_page_work, report_free_page_func);
+ if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_PAGE_POISON)) {
+ memset(&poison_val, PAGE_POISON, sizeof(poison_val));
+ virtio_cwrite(vb->vdev, struct virtio_balloon_config,
+ poison_val, &poison_val);
+ }
}
vb->nb.notifier_call = virtballoon_oom_notify;
@@ -877,6 +883,9 @@ static int virtballoon_restore(struct virtio_device *vdev)
static int virtballoon_validate(struct virtio_device *vdev)
{
+ if (!page_poisoning_enabled())
+ __virtio_clear_bit(vdev, VIRTIO_BALLOON_F_PAGE_POISON);
+
__virtio_clear_bit(vdev, VIRTIO_F_IOMMU_PLATFORM);
return 0;
}
@@ -886,6 +895,7 @@ static unsigned int features[] = {
VIRTIO_BALLOON_F_STATS_VQ,
VIRTIO_BALLOON_F_DEFLATE_ON_OOM,
VIRTIO_BALLOON_F_FREE_PAGE_HINT,
+ VIRTIO_BALLOON_F_PAGE_POISON,
};
static struct virtio_driver virtio_balloon_driver = {
diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h
index b2d86c2..8b93581 100644
--- a/include/uapi/linux/virtio_balloon.h
+++ b/include/uapi/linux/virtio_balloon.h
@@ -35,6 +35,7 @@
#define VIRTIO_BALLOON_F_STATS_VQ 1 /* Memory Stats virtqueue */
#define VIRTIO_BALLOON_F_DEFLATE_ON_OOM 2 /* Deflate balloon on OOM */
#define VIRTIO_BALLOON_F_FREE_PAGE_HINT 3 /* VQ to report free pages */
+#define VIRTIO_BALLOON_F_PAGE_POISON 4 /* Guest is using page poisoning */
/* Size of a PFN in the balloon interface. */
#define VIRTIO_BALLOON_PFN_SHIFT 12
@@ -47,6 +48,8 @@ struct virtio_balloon_config {
__u32 actual;
/* Free page report command id, readonly by guest */
__u32 free_page_report_cmd_id;
+ /* Stores PAGE_POISON if page poisoning is in use */
+ __u32 poison_val;
};
#define VIRTIO_BALLOON_S_SWAP_IN 0 /* Amount of memory swapped in */
--
2.7.4
^ permalink raw reply related
* [PATCH v31 3/4] mm/page_poison: expose page_poisoning_enabled to kernel modules
From: Wei Wang @ 2018-04-06 12:17 UTC (permalink / raw)
To: virtio-dev, linux-kernel, virtualization, kvm, linux-mm, mst,
mhocko, akpm
Cc: yang.zhang.wz, riel, quan.xu0, liliang.opensource, huangzhichao,
pbonzini, nilal
In-Reply-To: <1523017045-18315-1-git-send-email-wei.w.wang@intel.com>
In some usages, e.g. virtio-balloon, a kernel module needs to know if
page poisoning is in use. This patch exposes the page_poisoning_enabled
function to kernel modules.
Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Andrew Morton <akpm@linux-foundation.org>
---
mm/page_poison.c | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/mm/page_poison.c b/mm/page_poison.c
index e83fd44..762b472 100644
--- a/mm/page_poison.c
+++ b/mm/page_poison.c
@@ -17,6 +17,11 @@ static int early_page_poison_param(char *buf)
}
early_param("page_poison", early_page_poison_param);
+/**
+ * page_poisoning_enabled - check if page poisoning is enabled
+ *
+ * Return true if page poisoning is enabled, or false if not.
+ */
bool page_poisoning_enabled(void)
{
/*
@@ -29,6 +34,7 @@ bool page_poisoning_enabled(void)
(!IS_ENABLED(CONFIG_ARCH_SUPPORTS_DEBUG_PAGEALLOC) &&
debug_pagealloc_enabled()));
}
+EXPORT_SYMBOL_GPL(page_poisoning_enabled);
static void poison_page(struct page *page)
{
--
2.7.4
^ permalink raw reply related
* [PATCH v31 2/4] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_HINT
From: Wei Wang @ 2018-04-06 12:17 UTC (permalink / raw)
To: virtio-dev, linux-kernel, virtualization, kvm, linux-mm, mst,
mhocko, akpm
Cc: yang.zhang.wz, riel, quan.xu0, liliang.opensource, huangzhichao,
pbonzini, nilal
In-Reply-To: <1523017045-18315-1-git-send-email-wei.w.wang@intel.com>
Negotiation of the VIRTIO_BALLOON_F_FREE_PAGE_HINT feature indicates the
support of reporting hints of guest free pages to host via virtio-balloon.
Host requests the guest to report free page hints by sending a new cmd
id to the guest via the free_page_report_cmd_id configuration register.
When the guest starts to report, the first element added to the free page
vq is the cmd id given by host. When the guest finishes the reporting
of all the free pages, VIRTIO_BALLOON_FREE_PAGE_REPORT_STOP_ID is added
to the vq to tell host that the reporting is done. Host polls the free
page vq after sending the starting cmd id, so the guest doesn't need to
kick after filling an element to the vq.
Host may also requests the guest to stop the reporting in advance by
sending the stop cmd id to the guest via the configuration register.
Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Signed-off-by: Liang Li <liang.z.li@intel.com>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
---
drivers/virtio/virtio_balloon.c | 272 +++++++++++++++++++++++++++++++-----
include/uapi/linux/virtio_balloon.h | 4 +
2 files changed, 240 insertions(+), 36 deletions(-)
diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index dfe5684..aef73ee 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -51,9 +51,22 @@ MODULE_PARM_DESC(oom_pages, "pages to free on OOM");
static struct vfsmount *balloon_mnt;
#endif
+enum virtio_balloon_vq {
+ VIRTIO_BALLOON_VQ_INFLATE,
+ VIRTIO_BALLOON_VQ_DEFLATE,
+ VIRTIO_BALLOON_VQ_STATS,
+ VIRTIO_BALLOON_VQ_FREE_PAGE,
+ VIRTIO_BALLOON_VQ_MAX
+};
+
struct virtio_balloon {
struct virtio_device *vdev;
- struct virtqueue *inflate_vq, *deflate_vq, *stats_vq;
+ struct virtqueue *inflate_vq, *deflate_vq, *stats_vq, *free_page_vq;
+
+ /* Balloon's own wq for cpu-intensive work items */
+ struct workqueue_struct *balloon_wq;
+ /* The free page reporting work item submitted to the balloon wq */
+ struct work_struct report_free_page_work;
/* The balloon servicing is delegated to a freezable workqueue. */
struct work_struct update_balloon_stats_work;
@@ -63,6 +76,13 @@ struct virtio_balloon {
spinlock_t stop_update_lock;
bool stop_update;
+ /* The new cmd id received from host */
+ uint32_t cmd_id_received;
+ /* The cmd id that is in use */
+ __virtio32 cmd_id_use;
+ /* Buffer to store the stop sign */
+ __virtio32 stop_cmd_id;
+
/* Waiting for host to ack the pages we released. */
wait_queue_head_t acked;
@@ -320,17 +340,6 @@ static void stats_handle_request(struct virtio_balloon *vb)
virtqueue_kick(vq);
}
-static void virtballoon_changed(struct virtio_device *vdev)
-{
- struct virtio_balloon *vb = vdev->priv;
- unsigned long flags;
-
- spin_lock_irqsave(&vb->stop_update_lock, flags);
- if (!vb->stop_update)
- queue_work(system_freezable_wq, &vb->update_balloon_size_work);
- spin_unlock_irqrestore(&vb->stop_update_lock, flags);
-}
-
static inline s64 towards_target(struct virtio_balloon *vb)
{
s64 target;
@@ -347,6 +356,34 @@ static inline s64 towards_target(struct virtio_balloon *vb)
return target - vb->num_pages;
}
+static void virtballoon_changed(struct virtio_device *vdev)
+{
+ struct virtio_balloon *vb = vdev->priv;
+ unsigned long flags;
+ s64 diff = towards_target(vb);
+
+ if (diff) {
+ spin_lock_irqsave(&vb->stop_update_lock, flags);
+ if (!vb->stop_update)
+ queue_work(system_freezable_wq,
+ &vb->update_balloon_size_work);
+ spin_unlock_irqrestore(&vb->stop_update_lock, flags);
+ }
+
+ if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_FREE_PAGE_HINT)) {
+ virtio_cread(vdev, struct virtio_balloon_config,
+ free_page_report_cmd_id, &vb->cmd_id_received);
+ if (vb->cmd_id_received !=
+ VIRTIO_BALLOON_FREE_PAGE_REPORT_STOP_ID) {
+ spin_lock_irqsave(&vb->stop_update_lock, flags);
+ if (!vb->stop_update)
+ queue_work(vb->balloon_wq,
+ &vb->report_free_page_work);
+ spin_unlock_irqrestore(&vb->stop_update_lock, flags);
+ }
+ }
+}
+
static void update_balloon_size(struct virtio_balloon *vb)
{
u32 actual = vb->num_pages;
@@ -421,42 +458,178 @@ static void update_balloon_size_func(struct work_struct *work)
static int init_vqs(struct virtio_balloon *vb)
{
- struct virtqueue *vqs[3];
- vq_callback_t *callbacks[] = { balloon_ack, balloon_ack, stats_request };
- static const char * const names[] = { "inflate", "deflate", "stats" };
- int err, nvqs;
+ struct virtqueue *vqs[VIRTIO_BALLOON_VQ_MAX];
+ vq_callback_t *callbacks[VIRTIO_BALLOON_VQ_MAX];
+ const char *names[VIRTIO_BALLOON_VQ_MAX];
+ struct scatterlist sg;
+ int ret;
/*
- * We expect two virtqueues: inflate and deflate, and
- * optionally stat.
+ * Inflateq and deflateq are used unconditionally. The names[]
+ * will be NULL if the related feature is not enabled, which will
+ * cause no allocation for the corresponding virtqueue in find_vqs.
*/
- nvqs = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ) ? 3 : 2;
- err = virtio_find_vqs(vb->vdev, nvqs, vqs, callbacks, names, NULL);
- if (err)
- return err;
+ callbacks[VIRTIO_BALLOON_VQ_INFLATE] = balloon_ack;
+ names[VIRTIO_BALLOON_VQ_INFLATE] = "inflate";
+ callbacks[VIRTIO_BALLOON_VQ_DEFLATE] = balloon_ack;
+ names[VIRTIO_BALLOON_VQ_DEFLATE] = "deflate";
+ names[VIRTIO_BALLOON_VQ_STATS] = NULL;
+ names[VIRTIO_BALLOON_VQ_FREE_PAGE] = NULL;
- vb->inflate_vq = vqs[0];
- vb->deflate_vq = vqs[1];
if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
- struct scatterlist sg;
- unsigned int num_stats;
- vb->stats_vq = vqs[2];
+ names[VIRTIO_BALLOON_VQ_STATS] = "stats";
+ callbacks[VIRTIO_BALLOON_VQ_STATS] = stats_request;
+ }
+ if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_FREE_PAGE_HINT)) {
+ names[VIRTIO_BALLOON_VQ_FREE_PAGE] = "free_page_vq";
+ callbacks[VIRTIO_BALLOON_VQ_FREE_PAGE] = NULL;
+ }
+
+ ret = vb->vdev->config->find_vqs(vb->vdev, VIRTIO_BALLOON_VQ_MAX,
+ vqs, callbacks, names, NULL, NULL);
+ if (ret)
+ return ret;
+
+ vb->inflate_vq = vqs[VIRTIO_BALLOON_VQ_INFLATE];
+ vb->deflate_vq = vqs[VIRTIO_BALLOON_VQ_DEFLATE];
+ if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
+ vb->stats_vq = vqs[VIRTIO_BALLOON_VQ_STATS];
/*
* Prime this virtqueue with one buffer so the hypervisor can
* use it to signal us later (it can't be broken yet!).
*/
- num_stats = update_balloon_stats(vb);
-
- sg_init_one(&sg, vb->stats, sizeof(vb->stats[0]) * num_stats);
- if (virtqueue_add_outbuf(vb->stats_vq, &sg, 1, vb, GFP_KERNEL)
- < 0)
- BUG();
+ sg_init_one(&sg, vb->stats, sizeof(vb->stats));
+ ret = virtqueue_add_outbuf(vb->stats_vq, &sg, 1, vb,
+ GFP_KERNEL);
+ if (ret) {
+ dev_warn(&vb->vdev->dev, "%s: add stat_vq failed\n",
+ __func__);
+ return ret;
+ }
virtqueue_kick(vb->stats_vq);
}
+
+ if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_FREE_PAGE_HINT))
+ vb->free_page_vq = vqs[VIRTIO_BALLOON_VQ_FREE_PAGE];
+
return 0;
}
+static int add_one_sg(struct virtqueue *vq, unsigned long pfn, uint32_t len)
+{
+ struct scatterlist sg;
+ unsigned int unused;
+
+ sg_init_table(&sg, 1);
+ sg_set_page(&sg, pfn_to_page(pfn), len, 0);
+
+ /* Detach all the used buffers from the vq */
+ while (virtqueue_get_buf(vq, &unused))
+ ;
+
+ /*
+ * Since this is an optimization feature, losing a couple of free
+ * pages to report isn't important. We simply return without adding
+ * this page hint if the vq is full.
+ * We are adding one entry each time, which essentially results in no
+ * memory allocation, so the GFP_KERNEL flag below can be ignored.
+ * Host works by polling the free page vq for hints after sending the
+ * starting cmd id, so the driver doesn't need to kick after filling
+ * the vq.
+ * Lastly, there is always one entry reserved for the cmd id to use.
+ *
+ * TODO: The current implementation could be further improved by
+ * stopping the reporting when the vq is full and continuing the
+ * reporting when host notifies the driver that entries have been
+ * used.
+ */
+ if (vq->num_free > 1)
+ return virtqueue_add_inbuf(vq, &sg, 1, vq, GFP_KERNEL);
+
+ return 0;
+}
+
+static int virtio_balloon_send_free_pages(void *opaque, unsigned long pfn,
+ unsigned long nr_pages)
+{
+ struct virtio_balloon *vb = (struct virtio_balloon *)opaque;
+ uint32_t len = nr_pages << PAGE_SHIFT;
+
+ /*
+ * If a stop id or a new cmd id was just received from host, stop
+ * the reporting, and return -EINTR to indicate an active stop.
+ *
+ * Ideally, we could have cmd_id_received accessed under locks, which
+ * ensures that no more entries are added to the vq once a stop cmd
+ * id is received from host. This requires host to wait for the
+ * driver's ACK about finishing the update of cmd_id_received. But
+ * this is not how host side works, because host doesn't work on an
+ * assumption that the driver would always be responsive. So
+ * theorically, there are possibilities that some entries may stay in
+ * the vq when host has exited the optimization. This isn't an issue,
+ * because entries simply contain guest physical addresses. There is no
+ * allocated memory that needs to be freed or dma mapped pages that
+ * need to be unmapped since virtio-balloon works with
+ * VIRTIO_F_IOMMU_PLATFORM disabled.
+ */
+ if (virtio32_to_cpu(vb->vdev, vb->cmd_id_use) != vb->cmd_id_received)
+ return -EINTR;
+
+ return add_one_sg(vb->free_page_vq, pfn, len);
+}
+
+static int send_start_cmd_id(struct virtio_balloon *vb, uint32_t cmd_id)
+{
+ struct scatterlist sg;
+ struct virtqueue *vq = vb->free_page_vq;
+
+ vb->cmd_id_use = cpu_to_virtio32(vb->vdev, cmd_id);
+ sg_init_one(&sg, &vb->cmd_id_use, sizeof(vb->cmd_id_use));
+ return virtqueue_add_outbuf(vq, &sg, 1, vb, GFP_KERNEL);
+}
+
+static int send_stop_cmd_id(struct virtio_balloon *vb)
+{
+ struct scatterlist sg;
+ struct virtqueue *vq = vb->free_page_vq;
+
+ sg_init_one(&sg, &vb->stop_cmd_id, sizeof(vb->stop_cmd_id));
+ return virtqueue_add_outbuf(vq, &sg, 1, vb, GFP_KERNEL);
+}
+
+static void report_free_page_func(struct work_struct *work)
+{
+ struct virtio_balloon *vb;
+ struct virtqueue *vq;
+ unsigned int unused;
+ int ret;
+
+ vb = container_of(work, struct virtio_balloon, report_free_page_work);
+ vq = vb->free_page_vq;
+
+ /* Start by sending the received cmd id to host with an outbuf. */
+ ret = send_start_cmd_id(vb, vb->cmd_id_received);
+ if (unlikely(ret))
+ goto err;
+
+ ret = walk_free_mem_block(vb, 0, &virtio_balloon_send_free_pages);
+ if (unlikely(ret == -EIO))
+ goto err;
+
+ /* End by sending a stop id to host with an outbuf. */
+ ret = send_stop_cmd_id(vb);
+ if (likely(!ret)) {
+ /* Ending: detach all the used buffers from the vq. */
+ while (vq->num_free != virtqueue_get_vring_size(vq))
+ virtqueue_get_buf(vq, &unused);
+ return;
+ }
+err:
+ dev_err(&vb->vdev->dev, "%s: free page vq failure, ret=%d\n",
+ __func__, ret);
+}
+
#ifdef CONFIG_BALLOON_COMPACTION
/*
* virtballoon_migratepage - perform the balloon page migration on behalf of
@@ -570,18 +743,36 @@ static int virtballoon_probe(struct virtio_device *vdev)
if (err)
goto out_free_vb;
+ if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_FREE_PAGE_HINT)) {
+ /*
+ * There is always one entry reserved for cmd id, so the ring
+ * size needs to be at least two to report free page hints.
+ */
+ if (virtqueue_get_vring_size(vb->free_page_vq) < 2)
+ goto out_free_vb;
+ vb->balloon_wq = alloc_workqueue("balloon-wq",
+ WQ_FREEZABLE | WQ_CPU_INTENSIVE, 0);
+ if (!vb->balloon_wq) {
+ err = -ENOMEM;
+ goto out_del_vqs;
+ }
+ vb->stop_cmd_id = cpu_to_virtio32(vb->vdev,
+ VIRTIO_BALLOON_FREE_PAGE_REPORT_STOP_ID);
+ INIT_WORK(&vb->report_free_page_work, report_free_page_func);
+ }
+
vb->nb.notifier_call = virtballoon_oom_notify;
vb->nb.priority = VIRTBALLOON_OOM_NOTIFY_PRIORITY;
err = register_oom_notifier(&vb->nb);
if (err < 0)
- goto out_del_vqs;
+ goto out_del_balloon_wq;
#ifdef CONFIG_BALLOON_COMPACTION
balloon_mnt = kern_mount(&balloon_fs);
if (IS_ERR(balloon_mnt)) {
err = PTR_ERR(balloon_mnt);
unregister_oom_notifier(&vb->nb);
- goto out_del_vqs;
+ goto out_del_balloon_wq;
}
vb->vb_dev_info.migratepage = virtballoon_migratepage;
@@ -591,7 +782,7 @@ static int virtballoon_probe(struct virtio_device *vdev)
kern_unmount(balloon_mnt);
unregister_oom_notifier(&vb->nb);
vb->vb_dev_info.inode = NULL;
- goto out_del_vqs;
+ goto out_del_balloon_wq;
}
vb->vb_dev_info.inode->i_mapping->a_ops = &balloon_aops;
#endif
@@ -602,6 +793,9 @@ static int virtballoon_probe(struct virtio_device *vdev)
virtballoon_changed(vdev);
return 0;
+out_del_balloon_wq:
+ if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_FREE_PAGE_HINT))
+ destroy_workqueue(vb->balloon_wq);
out_del_vqs:
vdev->config->del_vqs(vdev);
out_free_vb:
@@ -635,6 +829,11 @@ static void virtballoon_remove(struct virtio_device *vdev)
cancel_work_sync(&vb->update_balloon_size_work);
cancel_work_sync(&vb->update_balloon_stats_work);
+ if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_FREE_PAGE_HINT)) {
+ cancel_work_sync(&vb->report_free_page_work);
+ destroy_workqueue(vb->balloon_wq);
+ }
+
remove_common(vb);
#ifdef CONFIG_BALLOON_COMPACTION
if (vb->vb_dev_info.inode)
@@ -686,6 +885,7 @@ static unsigned int features[] = {
VIRTIO_BALLOON_F_MUST_TELL_HOST,
VIRTIO_BALLOON_F_STATS_VQ,
VIRTIO_BALLOON_F_DEFLATE_ON_OOM,
+ VIRTIO_BALLOON_F_FREE_PAGE_HINT,
};
static struct virtio_driver virtio_balloon_driver = {
diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h
index 4e8b830..b2d86c2 100644
--- a/include/uapi/linux/virtio_balloon.h
+++ b/include/uapi/linux/virtio_balloon.h
@@ -34,15 +34,19 @@
#define VIRTIO_BALLOON_F_MUST_TELL_HOST 0 /* Tell before reclaiming pages */
#define VIRTIO_BALLOON_F_STATS_VQ 1 /* Memory Stats virtqueue */
#define VIRTIO_BALLOON_F_DEFLATE_ON_OOM 2 /* Deflate balloon on OOM */
+#define VIRTIO_BALLOON_F_FREE_PAGE_HINT 3 /* VQ to report free pages */
/* Size of a PFN in the balloon interface. */
#define VIRTIO_BALLOON_PFN_SHIFT 12
+#define VIRTIO_BALLOON_FREE_PAGE_REPORT_STOP_ID 0
struct virtio_balloon_config {
/* Number of pages host wants Guest to give up. */
__u32 num_pages;
/* Number of pages we've actually got in balloon. */
__u32 actual;
+ /* Free page report command id, readonly by guest */
+ __u32 free_page_report_cmd_id;
};
#define VIRTIO_BALLOON_S_SWAP_IN 0 /* Amount of memory swapped in */
--
2.7.4
^ permalink raw reply related
* [PATCH v31 1/4] mm: support reporting free page blocks
From: Wei Wang @ 2018-04-06 12:17 UTC (permalink / raw)
To: virtio-dev, linux-kernel, virtualization, kvm, linux-mm, mst,
mhocko, akpm
Cc: yang.zhang.wz, riel, quan.xu0, liliang.opensource, huangzhichao,
pbonzini, nilal
In-Reply-To: <1523017045-18315-1-git-send-email-wei.w.wang@intel.com>
This patch adds support to walk through the free page blocks in the
system and report them via a callback function. Some page blocks may
leave the free list after zone->lock is released, so it is the caller's
responsibility to either detect or prevent the use of such pages.
One use example of this patch is to accelerate live migration by skipping
the transfer of free pages reported from the guest. A popular method used
by the hypervisor to track which part of memory is written during live
migration is to write-protect all the guest memory. So, those pages that
are reported as free pages but are written after the report function
returns will be captured by the hypervisor, and they will be added to the
next round of memory transfer.
Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Signed-off-by: Liang Li <liang.z.li@intel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Michal Hocko <mhocko@kernel.org>
---
include/linux/mm.h | 6 ++++
mm/page_alloc.c | 97 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 103 insertions(+)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index f945dff..4698df1 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1951,6 +1951,12 @@ extern void free_area_init_node(int nid, unsigned long * zones_size,
unsigned long zone_start_pfn, unsigned long *zholes_size);
extern void free_initmem(void);
+extern int walk_free_mem_block(void *opaque,
+ int min_order,
+ int (*report_pfn_range)(void *opaque,
+ unsigned long pfn,
+ unsigned long num));
+
/*
* Free reserved pages within range [PAGE_ALIGN(start), end & PAGE_MASK)
* into the buddy system. The freed pages will be poisoned with pattern
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 4ea0182..83e50fd 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4912,6 +4912,103 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask)
show_swap_cache_info();
}
+/*
+ * Walk through a free page list and report the found pfn range via the
+ * callback.
+ *
+ * Return 0 if it completes the reporting. Otherwise, return the non-zero
+ * value returned from the callback.
+ */
+static int walk_free_page_list(void *opaque,
+ struct zone *zone,
+ int order,
+ enum migratetype mt,
+ int (*report_pfn_range)(void *,
+ unsigned long,
+ unsigned long))
+{
+ struct page *page;
+ struct list_head *list;
+ unsigned long pfn, flags;
+ int ret = 0;
+
+ spin_lock_irqsave(&zone->lock, flags);
+ list = &zone->free_area[order].free_list[mt];
+ list_for_each_entry(page, list, lru) {
+ pfn = page_to_pfn(page);
+ ret = report_pfn_range(opaque, pfn, 1 << order);
+ if (ret)
+ break;
+ }
+ spin_unlock_irqrestore(&zone->lock, flags);
+
+ return ret;
+}
+
+/**
+ * walk_free_mem_block - Walk through the free page blocks in the system
+ * @opaque: the context passed from the caller
+ * @min_order: the minimum order of free lists to check
+ * @report_pfn_range: the callback to report the pfn range of the free pages
+ *
+ * If the callback returns a non-zero value, stop iterating the list of free
+ * page blocks. Otherwise, continue to report.
+ *
+ * Please note that there are no locking guarantees for the callback and
+ * that the reported pfn range might be freed or disappear after the
+ * callback returns so the caller has to be very careful how it is used.
+ *
+ * The callback itself must not sleep or perform any operations which would
+ * require any memory allocations directly (not even GFP_NOWAIT/GFP_ATOMIC)
+ * or via any lock dependency. It is generally advisable to implement
+ * the callback as simple as possible and defer any heavy lifting to a
+ * different context.
+ *
+ * There is no guarantee that each free range will be reported only once
+ * during one walk_free_mem_block invocation.
+ *
+ * pfn_to_page on the given range is strongly discouraged and if there is
+ * an absolute need for that make sure to contact MM people to discuss
+ * potential problems.
+ *
+ * The function itself might sleep so it cannot be called from atomic
+ * contexts.
+ *
+ * In general low orders tend to be very volatile and so it makes more
+ * sense to query larger ones first for various optimizations which like
+ * ballooning etc... This will reduce the overhead as well.
+ *
+ * Return 0 if it completes the reporting. Otherwise, return the non-zero
+ * value returned from the callback.
+ */
+int walk_free_mem_block(void *opaque,
+ int min_order,
+ int (*report_pfn_range)(void *opaque,
+ unsigned long pfn,
+ unsigned long num))
+{
+ struct zone *zone;
+ int order;
+ enum migratetype mt;
+ int ret;
+
+ for_each_populated_zone(zone) {
+ for (order = MAX_ORDER - 1; order >= min_order; order--) {
+ for (mt = 0; mt < MIGRATE_TYPES; mt++) {
+ ret = walk_free_page_list(opaque, zone,
+ order, mt,
+ report_pfn_range);
+ if (ret)
+ return ret;
+ }
+ cond_resched();
+ }
+ }
+
+ return 0;
+}
+EXPORT_SYMBOL_GPL(walk_free_mem_block);
+
static void zoneref_set_zone(struct zone *zone, struct zoneref *zoneref)
{
zoneref->zone = zone;
--
2.7.4
^ permalink raw reply related
* [PATCH v31 0/4] Virtio-balloon: support free page reporting
From: Wei Wang @ 2018-04-06 12:17 UTC (permalink / raw)
To: virtio-dev, linux-kernel, virtualization, kvm, linux-mm, mst,
mhocko, akpm
Cc: yang.zhang.wz, riel, quan.xu0, liliang.opensource, huangzhichao,
pbonzini, nilal
This patch series is separated from the previous "Virtio-balloon
Enhancement" series. The new feature, VIRTIO_BALLOON_F_FREE_PAGE_HINT,
implemented by this series enables the virtio-balloon driver to report
hints of guest free pages to the host. It can be used to accelerate live
migration of VMs. Here is an introduction of this usage:
Live migration needs to transfer the VM's memory from the source machine
to the destination round by round. For the 1st round, all the VM's memory
is transferred. From the 2nd round, only the pieces of memory that were
written by the guest (after the 1st round) are transferred. One method
that is popularly used by the hypervisor to track which part of memory is
written is to write-protect all the guest memory.
This feature enables the optimization by skipping the transfer of guest
free pages during VM live migration. It is not concerned that the memory
pages are used after they are given to the hypervisor as a hint of the
free pages, because they will be tracked by the hypervisor and transferred
in the subsequent round if they are used and written.
* Tests
- Test Environment
Host: Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz
Guest: 8G RAM, 4 vCPU
Migration setup: migrate_set_speed 100G, migrate_set_downtime 2 second
- Test Results
- Idle Guest Live Migration Time (results are averaged over 10 runs):
- Optimization v.s. Legacy = 271ms vs 1769ms --> ~86% reduction
- Guest with Linux Compilation Workload (make bzImage -j4):
- Live Migration Time (average)
Optimization v.s. Legacy = 1265ms v.s. 2634ms --> ~51% reduction
- Linux Compilation Time
Optimization v.s. Legacy = 4min56s v.s. 5min3s
--> no obvious difference
ChangeLog:
v30->v31:
- virtio_balloon:
- virtio_balloon_send_free_pages: return -EINTR rather than 1 to
indicate an active stop requested by host; and add more
comments to explain about access to cmd_id_received without
locks
- add_one_sg: add TODO to comments about possible improvement.
v29->v30:
- mm/walk_free_mem_block: add cond_sched() for each order
v28->v29:
- mm/page_poison: only expose page_poison_enabled(), rather than more
changes did in v28, as we are not 100% confident about that for now.
- virtio-balloon: use a separate buffer for the stop cmd, instead of
having the start and stop cmd use the same buffer. This avoids the
corner case that the start cmd is overridden by the stop cmd when
the host has a delay in reading the start cmd.
v27->v28:
- mm/page_poison: Move PAGE_POISON to page_poison.c and add a function
to expose page poison val to kernel modules.
v26->v27:
- add a new patch to expose page_poisoning_enabled to kernel modules
- virtio-balloon: set poison_val to 0xaaaaaaaa, instead of 0xaa
v25->v26: virtio-balloon changes only
- remove kicking free page vq since the host now polls the vq after
initiating the reporting
- report_free_page_func: detach all the used buffers after sending
the stop cmd id. This avoids leaving the detaching burden (i.e.
overhead) to the next cmd id. Detaching here isn't considered
overhead since the stop cmd id has been sent, and host has already
moved formard.
v24->v25:
- mm: change walk_free_mem_block to return 0 (instead of true) on
completing the report, and return a non-zero value from the
callabck, which stops the reporting.
- virtio-balloon:
- use enum instead of define for VIRTIO_BALLOON_VQ_INFLATE etc.
- avoid __virtio_clear_bit when bailing out;
- a new method to avoid reporting the some cmd id to host twice
- destroy_workqueue can cancel free page work when the feature is
negotiated;
- fail probe when the free page vq size is less than 2.
v23->v24:
- change feature name VIRTIO_BALLOON_F_FREE_PAGE_VQ to
VIRTIO_BALLOON_F_FREE_PAGE_HINT
- kick when vq->num_free < half full, instead of "= half full"
- replace BUG_ON with bailing out
- check vb->balloon_wq in probe(), if null, bail out
- add a new feature bit for page poisoning
- solve the corner case that one cmd id being sent to host twice
v22->v23:
- change to kick the device when the vq is half-way full;
- open-code batch_free_page_sg into add_one_sg;
- change cmd_id from "uint32_t" to "__virtio32";
- reserver one entry in the vq for the driver to send cmd_id, instead
of busywaiting for an available entry;
- add "stop_update" check before queue_work for prudence purpose for
now, will have a separate patch to discuss this flag check later;
- init_vqs: change to put some variables on stack to have simpler
implementation;
- add destroy_workqueue(vb->balloon_wq);
v21->v22:
- add_one_sg: some code and comment re-arrangement
- send_cmd_id: handle a cornercase
For previous ChangeLog, please reference
https://lwn.net/Articles/743660/
Wei Wang (4):
mm: support reporting free page blocks
virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_HINT
mm/page_poison: expose page_poisoning_enabled to kernel modules
virtio-balloon: VIRTIO_BALLOON_F_PAGE_POISON
drivers/virtio/virtio_balloon.c | 282 +++++++++++++++++++++++++++++++-----
include/linux/mm.h | 6 +
include/uapi/linux/virtio_balloon.h | 7 +
mm/page_alloc.c | 97 +++++++++++++
mm/page_poison.c | 6 +
5 files changed, 362 insertions(+), 36 deletions(-)
--
2.7.4
^ permalink raw reply
* Re: [RFC] virtio: Use DMA MAP API for devices without an IOMMU
From: Benjamin Herrenschmidt @ 2018-04-06 8:37 UTC (permalink / raw)
To: Christoph Hellwig, Anshuman Khandual
Cc: robh, Michael S. Tsirkin, linux-kernel, virtualization, joe,
linuxppc-dev, elfring, david
In-Reply-To: <20180406071634.GA31108@infradead.org>
On Fri, 2018-04-06 at 00:16 -0700, Christoph Hellwig wrote:
> On Fri, Apr 06, 2018 at 08:23:10AM +0530, Anshuman Khandual wrote:
> > On 04/06/2018 02:48 AM, Benjamin Herrenschmidt wrote:
> > > On Thu, 2018-04-05 at 21:34 +0300, Michael S. Tsirkin wrote:
> > > > > In this specific case, because that would make qemu expect an iommu,
> > > > > and there isn't one.
> > > >
> > > >
> > > > I think that you can set iommu_platform in qemu without an iommu.
> > >
> > > No I mean the platform has one but it's not desirable for it to be used
> > > due to the performance hit.
> >
> > Also the only requirement is to bounce the I/O buffers through SWIOTLB
> > implemented as DMA API which the virtio core understands. There is no
> > need for an IOMMU to be involved for the device representation in this
> > case IMHO.
>
> This whole virtio translation issue is a mess. I think we need to
> switch it to the dma API, and then quirk the legacy case to always
> use the direct mapping inside the dma API.
Fine with using a dma API always on the Linux side, but we do want to
special case virtio still at the arch and qemu side to have a "direct
mapping" mode. Not sure how (special flags on PCI devices) to avoid
actually going through an emulated IOMMU on the qemu side, because that
slows things down, esp. with vhost.
IE, we can't I think just treat it the same as a physical device.
Cheers,
Ben.
^ permalink raw reply
* Re: [RFC] virtio: Use DMA MAP API for devices without an IOMMU
From: Christoph Hellwig @ 2018-04-06 7:16 UTC (permalink / raw)
To: Anshuman Khandual
Cc: robh, Michael S. Tsirkin, Benjamin Herrenschmidt, linux-kernel,
virtualization, joe, linuxppc-dev, elfring, david
In-Reply-To: <70cb433f-a8f7-5199-3c96-a760c7624804@linux.vnet.ibm.com>
On Fri, Apr 06, 2018 at 08:23:10AM +0530, Anshuman Khandual wrote:
> On 04/06/2018 02:48 AM, Benjamin Herrenschmidt wrote:
> > On Thu, 2018-04-05 at 21:34 +0300, Michael S. Tsirkin wrote:
> >>> In this specific case, because that would make qemu expect an iommu,
> >>> and there isn't one.
> >>
> >>
> >> I think that you can set iommu_platform in qemu without an iommu.
> >
> > No I mean the platform has one but it's not desirable for it to be used
> > due to the performance hit.
>
> Also the only requirement is to bounce the I/O buffers through SWIOTLB
> implemented as DMA API which the virtio core understands. There is no
> need for an IOMMU to be involved for the device representation in this
> case IMHO.
This whole virtio translation issue is a mess. I think we need to
switch it to the dma API, and then quirk the legacy case to always
use the direct mapping inside the dma API.
^ permalink raw reply
* Re: [RFC] virtio: Use DMA MAP API for devices without an IOMMU
From: Anshuman Khandual @ 2018-04-06 2:53 UTC (permalink / raw)
To: Benjamin Herrenschmidt, Michael S. Tsirkin
Cc: robh, linux-kernel, virtualization, joe, linuxppc-dev, elfring,
david
In-Reply-To: <1522963113.21446.211.camel@kernel.crashing.org>
On 04/06/2018 02:48 AM, Benjamin Herrenschmidt wrote:
> On Thu, 2018-04-05 at 21:34 +0300, Michael S. Tsirkin wrote:
>>> In this specific case, because that would make qemu expect an iommu,
>>> and there isn't one.
>>
>>
>> I think that you can set iommu_platform in qemu without an iommu.
>
> No I mean the platform has one but it's not desirable for it to be used
> due to the performance hit.
Also the only requirement is to bounce the I/O buffers through SWIOTLB
implemented as DMA API which the virtio core understands. There is no
need for an IOMMU to be involved for the device representation in this
case IMHO.
^ permalink raw reply
* RE: [PATCH v30 2/4] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_HINT
From: Wang, Wei W @ 2018-04-06 0:11 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: yang.zhang.wz@gmail.com, virtio-dev@lists.oasis-open.org,
riel@redhat.com, quan.xu0@gmail.com, kvm@vger.kernel.org,
nilal@redhat.com, liliang.opensource@gmail.com,
linux-kernel@vger.kernel.org, mhocko@kernel.org,
linux-mm@kvack.org, huangzhichao@huawei.com, pbonzini@redhat.com,
akpm@linux-foundation.org,
virtualization@lists.linux-foundation.org
In-Reply-To: <20180405212850-mutt-send-email-mst@kernel.org>
On Friday, April 6, 2018 2:30 AM, Michael S. Tsirkin wrote:
> On Thu, Apr 05, 2018 at 03:47:28PM +0000, Wang, Wei W wrote:
> > On Thursday, April 5, 2018 10:04 PM, Michael S. Tsirkin wrote:
> > > On Thu, Apr 05, 2018 at 02:05:03AM +0000, Wang, Wei W wrote:
> > > > On Thursday, April 5, 2018 9:12 AM, Michael S. Tsirkin wrote:
> > > > > On Thu, Apr 05, 2018 at 12:30:27AM +0000, Wang, Wei W wrote:
> > > > > > On Wednesday, April 4, 2018 10:08 PM, Michael S. Tsirkin wrote:
> > > > > > > On Wed, Apr 04, 2018 at 10:07:51AM +0800, Wei Wang wrote:
> > > > > > > > On 04/04/2018 02:47 AM, Michael S. Tsirkin wrote:
> > > > > > > > > On Wed, Apr 04, 2018 at 12:10:03AM +0800, Wei Wang wrote:
> > > > > > I'm afraid the driver couldn't be aware if the added hints are
> > > > > > stale or not,
> > > > >
> > > > >
> > > > > No - I mean that driver has code that compares two values and
> > > > > stops reporting. Can one of the values be stale?
> > > >
> > > > The driver compares "vb->cmd_id_use != vb->cmd_id_received" to
> > > > decide if it needs to stop reporting hints, and cmd_id_received is
> > > > what the driver reads from host (host notifies the driver to read
> > > > for the latest value). If host sends a new cmd id, it will notify
> > > > the guest to read again. I'm not sure how that could be a stale
> > > > cmd id (or maybe I misunderstood your point here?)
> > > >
> > > > Best,
> > > > Wei
> > >
> > > The comparison is done in one thread, the update in another one.
> >
> > I think this isn't something that could be solved by adding a lock,
> > unless host waits for the driver's ACK about finishing the update
> > (this is not agreed in the QEMU part discussion).
> >
> > Actually virtio_balloon has F_IOMMU_PLATFORM disabled, maybe we
> don't
> > need to worry about that using DMA api case (we only have gpa added to
> > the vq, and having some entries stay in the vq seems fine). For this
> > feature, I think it would not work with F_IOMMU enabled either.
>
> Adding a code comment explaining all this might be a good idea.
Thanks, will do.
Best,
Wei
^ permalink raw reply
* Re: [RFC] virtio: Use DMA MAP API for devices without an IOMMU
From: Benjamin Herrenschmidt @ 2018-04-05 21:18 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: robh, linux-kernel, virtualization, joe, david, linuxppc-dev,
elfring, Anshuman Khandual
In-Reply-To: <20180405213042-mutt-send-email-mst@kernel.org>
On Thu, 2018-04-05 at 21:34 +0300, Michael S. Tsirkin wrote:
> > In this specific case, because that would make qemu expect an iommu,
> > and there isn't one.
>
>
> I think that you can set iommu_platform in qemu without an iommu.
No I mean the platform has one but it's not desirable for it to be used
due to the performance hit.
Cheers,
Ben.
>
> > Anshuman, you need to provide more background here. I don't have time
> > right now it's late, but explain about the fact that this is for a
> > specific type of secure VM which has only a limited pool of (insecure)
> > memory that can be shared with qemu, so all IOs need to bounce via that
> > pool, which can be achieved by using swiotlb.
> >
> > Note: this isn't urgent, we can discuss alternative approaches, this is
> > just to start the conversation.
> >
> > Cheers,
> > Ben.
^ permalink raw reply
* [RFC PATCH net-next v5 4/4] netvsc: refactor notifier/event handling code to use the bypass framework
From: Sridhar Samudrala @ 2018-04-05 21:08 UTC (permalink / raw)
To: mst, stephen, davem, netdev, virtualization, virtio-dev,
jesse.brandeburg, alexander.h.duyck, kubakici, sridhar.samudrala,
jasowang, loseweigh, jiri
In-Reply-To: <1522962503-3963-1-git-send-email-sridhar.samudrala@intel.com>
Use the registration/notification framework supported by the generic
bypass infrastructure.
Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
---
drivers/net/hyperv/Kconfig | 1 +
drivers/net/hyperv/netvsc_drv.c | 219 ++++++++++++----------------------------
2 files changed, 63 insertions(+), 157 deletions(-)
diff --git a/drivers/net/hyperv/Kconfig b/drivers/net/hyperv/Kconfig
index 936968d23559..cc3a721baa18 100644
--- a/drivers/net/hyperv/Kconfig
+++ b/drivers/net/hyperv/Kconfig
@@ -1,5 +1,6 @@
config HYPERV_NET
tristate "Microsoft Hyper-V virtual network driver"
depends on HYPERV
+ depends on MAY_USE_BYPASS
help
Select this option to enable the Hyper-V virtual network driver.
diff --git a/drivers/net/hyperv/netvsc_drv.c b/drivers/net/hyperv/netvsc_drv.c
index ecc84954c511..b53a2be99bd2 100644
--- a/drivers/net/hyperv/netvsc_drv.c
+++ b/drivers/net/hyperv/netvsc_drv.c
@@ -43,6 +43,7 @@
#include <net/pkt_sched.h>
#include <net/checksum.h>
#include <net/ip6_checksum.h>
+#include <net/bypass.h>
#include "hyperv_net.h"
@@ -1763,46 +1764,6 @@ static void netvsc_link_change(struct work_struct *w)
rtnl_unlock();
}
-static struct net_device *get_netvsc_bymac(const u8 *mac)
-{
- struct net_device *dev;
-
- ASSERT_RTNL();
-
- for_each_netdev(&init_net, dev) {
- if (dev->netdev_ops != &device_ops)
- continue; /* not a netvsc device */
-
- if (ether_addr_equal(mac, dev->perm_addr))
- return dev;
- }
-
- return NULL;
-}
-
-static struct net_device *get_netvsc_byref(struct net_device *vf_netdev)
-{
- struct net_device *dev;
-
- ASSERT_RTNL();
-
- for_each_netdev(&init_net, dev) {
- struct net_device_context *net_device_ctx;
-
- if (dev->netdev_ops != &device_ops)
- continue; /* not a netvsc device */
-
- net_device_ctx = netdev_priv(dev);
- if (!rtnl_dereference(net_device_ctx->nvdev))
- continue; /* device is removed */
-
- if (rtnl_dereference(net_device_ctx->vf_netdev) == vf_netdev)
- return dev; /* a match */
- }
-
- return NULL;
-}
-
/* Called when VF is injecting data into network stack.
* Change the associated network device from VF to netvsc.
* note: already called with rcu_read_lock
@@ -1825,43 +1786,19 @@ static rx_handler_result_t netvsc_vf_handle_frame(struct sk_buff **pskb)
return RX_HANDLER_ANOTHER;
}
-static int netvsc_vf_join(struct net_device *vf_netdev,
- struct net_device *ndev)
+static int netvsc_vf_join(struct net_device *ndev,
+ struct net_device *vf_netdev)
{
struct net_device_context *ndev_ctx = netdev_priv(ndev);
- int ret;
-
- ret = netdev_rx_handler_register(vf_netdev,
- netvsc_vf_handle_frame, ndev);
- if (ret != 0) {
- netdev_err(vf_netdev,
- "can not register netvsc VF receive handler (err = %d)\n",
- ret);
- goto rx_handler_failed;
- }
-
- ret = netdev_upper_dev_link(vf_netdev, ndev, NULL);
- if (ret != 0) {
- netdev_err(vf_netdev,
- "can not set master device %s (err = %d)\n",
- ndev->name, ret);
- goto upper_link_failed;
- }
-
- /* set slave flag before open to prevent IPv6 addrconf */
- vf_netdev->flags |= IFF_SLAVE;
schedule_delayed_work(&ndev_ctx->vf_takeover, VF_TAKEOVER_INT);
- call_netdevice_notifiers(NETDEV_JOIN, vf_netdev);
-
netdev_info(vf_netdev, "joined to %s\n", ndev->name);
- return 0;
-upper_link_failed:
- netdev_rx_handler_unregister(vf_netdev);
-rx_handler_failed:
- return ret;
+ dev_hold(vf_netdev);
+ rcu_assign_pointer(ndev_ctx->vf_netdev, vf_netdev);
+
+ return 0;
}
static void __netvsc_vf_setup(struct net_device *ndev,
@@ -1914,85 +1851,84 @@ static void netvsc_vf_setup(struct work_struct *w)
rtnl_unlock();
}
-static int netvsc_register_vf(struct net_device *vf_netdev)
+static int netvsc_register_vf(struct net_device *ndev,
+ struct net_device *vf_netdev)
{
- struct net_device *ndev;
struct net_device_context *net_device_ctx;
struct netvsc_device *netvsc_dev;
- if (vf_netdev->addr_len != ETH_ALEN)
- return NOTIFY_DONE;
-
- /*
- * We will use the MAC address to locate the synthetic interface to
- * associate with the VF interface. If we don't find a matching
- * synthetic interface, move on.
- */
- ndev = get_netvsc_bymac(vf_netdev->perm_addr);
- if (!ndev)
- return NOTIFY_DONE;
-
net_device_ctx = netdev_priv(ndev);
netvsc_dev = rtnl_dereference(net_device_ctx->nvdev);
if (!netvsc_dev || rtnl_dereference(net_device_ctx->vf_netdev))
- return NOTIFY_DONE;
-
- if (netvsc_vf_join(vf_netdev, ndev) != 0)
- return NOTIFY_DONE;
+ return -EEXIST;
netdev_info(ndev, "VF registering: %s\n", vf_netdev->name);
- dev_hold(vf_netdev);
- rcu_assign_pointer(net_device_ctx->vf_netdev, vf_netdev);
- return NOTIFY_OK;
+ return 0;
}
/* VF up/down change detected, schedule to change data path */
-static int netvsc_vf_changed(struct net_device *vf_netdev)
+static int netvsc_vf_changed(struct net_device *ndev,
+ struct net_device *vf_netdev)
{
struct net_device_context *net_device_ctx;
struct netvsc_device *netvsc_dev;
- struct net_device *ndev;
bool vf_is_up = netif_running(vf_netdev);
- ndev = get_netvsc_byref(vf_netdev);
- if (!ndev)
- return NOTIFY_DONE;
-
net_device_ctx = netdev_priv(ndev);
netvsc_dev = rtnl_dereference(net_device_ctx->nvdev);
if (!netvsc_dev)
- return NOTIFY_DONE;
+ return -EINVAL;
netvsc_switch_datapath(ndev, vf_is_up);
netdev_info(ndev, "Data path switched %s VF: %s\n",
vf_is_up ? "to" : "from", vf_netdev->name);
- return NOTIFY_OK;
+ return 0;
}
-static int netvsc_unregister_vf(struct net_device *vf_netdev)
+static int netvsc_release_vf(struct net_device *ndev,
+ struct net_device *vf_netdev)
{
- struct net_device *ndev;
struct net_device_context *net_device_ctx;
- ndev = get_netvsc_byref(vf_netdev);
- if (!ndev)
- return NOTIFY_DONE;
-
net_device_ctx = netdev_priv(ndev);
+ if (vf_netdev != rtnl_dereference(net_device_ctx->vf_netdev))
+ return -EINVAL;
+
cancel_delayed_work_sync(&net_device_ctx->vf_takeover);
+ return 0;
+}
+
+static int netvsc_unregister_vf(struct net_device *ndev,
+ struct net_device *vf_netdev)
+{
+ struct net_device_context *net_device_ctx;
+
+ net_device_ctx = netdev_priv(ndev);
+ if (vf_netdev != rtnl_dereference(net_device_ctx->vf_netdev))
+ return -EINVAL;
+
netdev_info(ndev, "VF unregistering: %s\n", vf_netdev->name);
- netdev_rx_handler_unregister(vf_netdev);
- netdev_upper_dev_unlink(vf_netdev, ndev);
RCU_INIT_POINTER(net_device_ctx->vf_netdev, NULL);
dev_put(vf_netdev);
- return NOTIFY_OK;
+ return 0;
}
+static const struct bypass_ops netvsc_bypass_ops = {
+ .register_child = netvsc_register_vf,
+ .join_child = netvsc_vf_join,
+ .unregister_child = netvsc_unregister_vf,
+ .release_child = netvsc_release_vf,
+ .update_link = netvsc_vf_changed,
+ .handle_frame = netvsc_vf_handle_frame,
+};
+
+static struct bypass *netvsc_bypass;
+
static int netvsc_probe(struct hv_device *dev,
const struct hv_vmbus_device_id *dev_id)
{
@@ -2082,8 +2018,14 @@ static int netvsc_probe(struct hv_device *dev,
goto register_failed;
}
+ ret = bypass_register_instance(netvsc_bypass, net);
+ if (ret != 0)
+ goto err_bypass;
+
return ret;
+err_bypass:
+ unregister_netdev(net);
register_failed:
rndis_filter_device_remove(dev, nvdev);
rndis_failed:
@@ -2124,13 +2066,15 @@ static int netvsc_remove(struct hv_device *dev)
rtnl_lock();
vf_netdev = rtnl_dereference(ndev_ctx->vf_netdev);
if (vf_netdev)
- netvsc_unregister_vf(vf_netdev);
+ bypass_unregister_child(vf_netdev);
if (nvdev)
rndis_filter_device_remove(dev, nvdev);
unregister_netdevice(net);
+ bypass_unregister_instance(netvsc_bypass, net);
+
rtnl_unlock();
rcu_read_unlock();
@@ -2157,61 +2101,21 @@ static struct hv_driver netvsc_drv = {
.remove = netvsc_remove,
};
-/*
- * On Hyper-V, every VF interface is matched with a corresponding
- * synthetic interface. The synthetic interface is presented first
- * to the guest. When the corresponding VF instance is registered,
- * we will take care of switching the data path.
- */
-static int netvsc_netdev_event(struct notifier_block *this,
- unsigned long event, void *ptr)
-{
- struct net_device *event_dev = netdev_notifier_info_to_dev(ptr);
-
- /* Skip our own events */
- if (event_dev->netdev_ops == &device_ops)
- return NOTIFY_DONE;
-
- /* Avoid non-Ethernet type devices */
- if (event_dev->type != ARPHRD_ETHER)
- return NOTIFY_DONE;
-
- /* Avoid Vlan dev with same MAC registering as VF */
- if (is_vlan_dev(event_dev))
- return NOTIFY_DONE;
-
- /* Avoid Bonding master dev with same MAC registering as VF */
- if ((event_dev->priv_flags & IFF_BONDING) &&
- (event_dev->flags & IFF_MASTER))
- return NOTIFY_DONE;
-
- switch (event) {
- case NETDEV_REGISTER:
- return netvsc_register_vf(event_dev);
- case NETDEV_UNREGISTER:
- return netvsc_unregister_vf(event_dev);
- case NETDEV_UP:
- case NETDEV_DOWN:
- return netvsc_vf_changed(event_dev);
- default:
- return NOTIFY_DONE;
- }
-}
-
-static struct notifier_block netvsc_netdev_notifier = {
- .notifier_call = netvsc_netdev_event,
-};
-
static void __exit netvsc_drv_exit(void)
{
- unregister_netdevice_notifier(&netvsc_netdev_notifier);
vmbus_driver_unregister(&netvsc_drv);
+ bypass_unregister_driver(netvsc_bypass);
}
static int __init netvsc_drv_init(void)
{
int ret;
+ netvsc_bypass = bypass_register_driver(&netvsc_bypass_ops,
+ &device_ops);
+ if (!netvsc_bypass)
+ return -ENOMEM;
+
if (ring_size < RING_SIZE_MIN) {
ring_size = RING_SIZE_MIN;
pr_info("Increased ring_size to %u (min allowed)\n",
@@ -2221,10 +2125,11 @@ static int __init netvsc_drv_init(void)
netvsc_ring_reciprocal = reciprocal_value(netvsc_ring_bytes);
ret = vmbus_driver_register(&netvsc_drv);
- if (ret)
+ if (ret) {
+ bypass_unregister_driver(netvsc_bypass);
return ret;
+ }
- register_netdevice_notifier(&netvsc_netdev_notifier);
return 0;
}
--
2.14.3
^ permalink raw reply related
* [RFC PATCH net-next v5 3/4] virtio_net: Extend virtio to use VF datapath when available
From: Sridhar Samudrala @ 2018-04-05 21:08 UTC (permalink / raw)
To: mst, stephen, davem, netdev, virtualization, virtio-dev,
jesse.brandeburg, alexander.h.duyck, kubakici, sridhar.samudrala,
jasowang, loseweigh, jiri
In-Reply-To: <1522962503-3963-1-git-send-email-sridhar.samudrala@intel.com>
This patch enables virtio_net to switch over to a VF datapath when a VF
netdev is present with the same MAC address. It allows live migration
of a VM with a direct attached VF without the need to setup a bond/team
between a VF and virtio net device in the guest.
The hypervisor needs to enable only one datapath at any time so that
packets don't get looped back to the VM over the other datapath. When a VF
is plugged, the virtio datapath link state can be marked as down. The
hypervisor needs to unplug the VF device from the guest on the source host
and reset the MAC filter of the VF to initiate failover of datapath to
virtio before starting the migration. After the migration is completed,
the destination hypervisor sets the MAC filter on the VF and plugs it back
to the guest to switch over to VF datapath.
When BACKUP feature is enabled, an additional netdev(bypass netdev) is
created that acts as a master device and tracks the state of the 2 lower
netdevs. The original virtio_net netdev is marked as 'backup' netdev and a
passthru device with the same MAC is registered as 'active' netdev.
This patch is based on the discussion initiated by Jesse on this thread.
https://marc.info/?l=linux-virtualization&m=151189725224231&w=2
Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
---
drivers/net/Kconfig | 1 +
drivers/net/virtio_net.c | 612 ++++++++++++++++++++++++++++++++++++++++++++++-
2 files changed, 612 insertions(+), 1 deletion(-)
diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
index 891846655000..9e2cf61fd1c1 100644
--- a/drivers/net/Kconfig
+++ b/drivers/net/Kconfig
@@ -331,6 +331,7 @@ config VETH
config VIRTIO_NET
tristate "Virtio network driver"
depends on VIRTIO
+ depends on MAY_USE_BYPASS
---help---
This is the virtual network driver for virtio. It can be used with
QEMU based VMMs (like KVM or Xen). Say Y or M.
diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index befb5944f3fd..86b2f8f2947d 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -30,8 +30,11 @@
#include <linux/cpu.h>
#include <linux/average.h>
#include <linux/filter.h>
+#include <linux/netdevice.h>
+#include <linux/pci.h>
#include <net/route.h>
#include <net/xdp.h>
+#include <net/bypass.h>
static int napi_weight = NAPI_POLL_WEIGHT;
module_param(napi_weight, int, 0444);
@@ -206,6 +209,9 @@ struct virtnet_info {
u32 speed;
unsigned long guest_offloads;
+
+ /* upper netdev created when BACKUP feature enabled */
+ struct net_device __rcu *bypass_netdev;
};
struct padded_vnet_hdr {
@@ -2275,6 +2281,22 @@ static int virtnet_xdp(struct net_device *dev, struct netdev_bpf *xdp)
}
}
+static int virtnet_get_phys_port_name(struct net_device *dev, char *buf,
+ size_t len)
+{
+ struct virtnet_info *vi = netdev_priv(dev);
+ int ret;
+
+ if (!virtio_has_feature(vi->vdev, VIRTIO_NET_F_BACKUP))
+ return -EOPNOTSUPP;
+
+ ret = snprintf(buf, len, "_bkup");
+ if (ret >= len)
+ return -EOPNOTSUPP;
+
+ return 0;
+}
+
static const struct net_device_ops virtnet_netdev = {
.ndo_open = virtnet_open,
.ndo_stop = virtnet_close,
@@ -2292,6 +2314,7 @@ static const struct net_device_ops virtnet_netdev = {
.ndo_xdp_xmit = virtnet_xdp_xmit,
.ndo_xdp_flush = virtnet_xdp_flush,
.ndo_features_check = passthru_features_check,
+ .ndo_get_phys_port_name = virtnet_get_phys_port_name,
};
static void virtnet_config_changed_work(struct work_struct *work)
@@ -2689,6 +2712,576 @@ static int virtnet_validate(struct virtio_device *vdev)
return 0;
}
+/* START of functions supporting VIRTIO_NET_F_BACKUP feature.
+ * When BACKUP feature is enabled, an additional netdev(bypass netdev)
+ * is created that acts as a master device and tracks the state of the
+ * 2 lower netdevs. The original virtio_net netdev is registered as
+ * 'backup' netdev and a passthru device with the same MAC is registered
+ * as 'active' netdev.
+ */
+
+/* bypass state maintained when BACKUP feature is enabled */
+struct virtnet_bypass_info {
+ /* passthru netdev with same MAC */
+ struct net_device __rcu *active_netdev;
+
+ /* virtio_net netdev */
+ struct net_device __rcu *backup_netdev;
+
+ /* active netdev stats */
+ struct rtnl_link_stats64 active_stats;
+
+ /* backup netdev stats */
+ struct rtnl_link_stats64 backup_stats;
+
+ /* aggregated stats */
+ struct rtnl_link_stats64 bypass_stats;
+
+ /* spinlock while updating stats */
+ spinlock_t stats_lock;
+};
+
+static int virtnet_bypass_open(struct net_device *dev)
+{
+ struct virtnet_bypass_info *vbi = netdev_priv(dev);
+ struct net_device *active_netdev, *backup_netdev;
+ int err;
+
+ netif_carrier_off(dev);
+ netif_tx_wake_all_queues(dev);
+
+ active_netdev = rtnl_dereference(vbi->active_netdev);
+ if (active_netdev) {
+ err = dev_open(active_netdev);
+ if (err)
+ goto err_active_open;
+ }
+
+ backup_netdev = rtnl_dereference(vbi->backup_netdev);
+ if (backup_netdev) {
+ err = dev_open(backup_netdev);
+ if (err)
+ goto err_backup_open;
+ }
+
+ return 0;
+
+err_backup_open:
+ dev_close(active_netdev);
+err_active_open:
+ netif_tx_disable(dev);
+ return err;
+}
+
+static int virtnet_bypass_close(struct net_device *dev)
+{
+ struct virtnet_bypass_info *vi = netdev_priv(dev);
+ struct net_device *child_netdev;
+
+ netif_tx_disable(dev);
+
+ child_netdev = rtnl_dereference(vi->active_netdev);
+ if (child_netdev)
+ dev_close(child_netdev);
+
+ child_netdev = rtnl_dereference(vi->backup_netdev);
+ if (child_netdev)
+ dev_close(child_netdev);
+
+ return 0;
+}
+
+static netdev_tx_t virtnet_bypass_drop_xmit(struct sk_buff *skb,
+ struct net_device *dev)
+{
+ atomic_long_inc(&dev->tx_dropped);
+ dev_kfree_skb_any(skb);
+ return NETDEV_TX_OK;
+}
+
+static bool virtnet_bypass_xmit_ready(struct net_device *dev)
+{
+ return netif_running(dev) && netif_carrier_ok(dev);
+}
+
+static netdev_tx_t virtnet_bypass_start_xmit(struct sk_buff *skb,
+ struct net_device *dev)
+{
+ struct virtnet_bypass_info *vbi = netdev_priv(dev);
+ struct net_device *xmit_dev;
+
+ /* Try xmit via active netdev followed by backup netdev */
+ xmit_dev = rcu_dereference_bh(vbi->active_netdev);
+ if (!xmit_dev || !virtnet_bypass_xmit_ready(xmit_dev)) {
+ xmit_dev = rcu_dereference_bh(vbi->backup_netdev);
+ if (!xmit_dev || !virtnet_bypass_xmit_ready(xmit_dev))
+ return virtnet_bypass_drop_xmit(skb, dev);
+ }
+
+ skb->dev = xmit_dev;
+ skb->queue_mapping = qdisc_skb_cb(skb)->slave_dev_queue_mapping;
+
+ return dev_queue_xmit(skb);
+}
+
+static u16 virtnet_bypass_select_queue(struct net_device *dev,
+ struct sk_buff *skb, void *accel_priv,
+ select_queue_fallback_t fallback)
+{
+ /* This helper function exists to help dev_pick_tx get the correct
+ * destination queue. Using a helper function skips a call to
+ * skb_tx_hash and will put the skbs in the queue we expect on their
+ * way down to the bonding driver.
+ */
+ u16 txq = skb_rx_queue_recorded(skb) ? skb_get_rx_queue(skb) : 0;
+
+ /* Save the original txq to restore before passing to the driver */
+ qdisc_skb_cb(skb)->slave_dev_queue_mapping = skb->queue_mapping;
+
+ if (unlikely(txq >= dev->real_num_tx_queues)) {
+ do {
+ txq -= dev->real_num_tx_queues;
+ } while (txq >= dev->real_num_tx_queues);
+ }
+
+ return txq;
+}
+
+/* fold stats, assuming all rtnl_link_stats64 fields are u64, but
+ * that some drivers can provide 32bit values only.
+ */
+static void virtnet_bypass_fold_stats(struct rtnl_link_stats64 *_res,
+ const struct rtnl_link_stats64 *_new,
+ const struct rtnl_link_stats64 *_old)
+{
+ const u64 *new = (const u64 *)_new;
+ const u64 *old = (const u64 *)_old;
+ u64 *res = (u64 *)_res;
+ int i;
+
+ for (i = 0; i < sizeof(*_res) / sizeof(u64); i++) {
+ u64 nv = new[i];
+ u64 ov = old[i];
+ s64 delta = nv - ov;
+
+ /* detects if this particular field is 32bit only */
+ if (((nv | ov) >> 32) == 0)
+ delta = (s64)(s32)((u32)nv - (u32)ov);
+
+ /* filter anomalies, some drivers reset their stats
+ * at down/up events.
+ */
+ if (delta > 0)
+ res[i] += delta;
+ }
+}
+
+static void virtnet_bypass_get_stats(struct net_device *dev,
+ struct rtnl_link_stats64 *stats)
+{
+ struct virtnet_bypass_info *vbi = netdev_priv(dev);
+ const struct rtnl_link_stats64 *new;
+ struct rtnl_link_stats64 temp;
+ struct net_device *child_netdev;
+
+ spin_lock(&vbi->stats_lock);
+ memcpy(stats, &vbi->bypass_stats, sizeof(*stats));
+
+ rcu_read_lock();
+
+ child_netdev = rcu_dereference(vbi->active_netdev);
+ if (child_netdev) {
+ new = dev_get_stats(child_netdev, &temp);
+ virtnet_bypass_fold_stats(stats, new, &vbi->active_stats);
+ memcpy(&vbi->active_stats, new, sizeof(*new));
+ }
+
+ child_netdev = rcu_dereference(vbi->backup_netdev);
+ if (child_netdev) {
+ new = dev_get_stats(child_netdev, &temp);
+ virtnet_bypass_fold_stats(stats, new, &vbi->backup_stats);
+ memcpy(&vbi->backup_stats, new, sizeof(*new));
+ }
+
+ rcu_read_unlock();
+
+ memcpy(&vbi->bypass_stats, stats, sizeof(*stats));
+ spin_unlock(&vbi->stats_lock);
+}
+
+static int virtnet_bypass_change_mtu(struct net_device *dev, int new_mtu)
+{
+ struct virtnet_bypass_info *vbi = netdev_priv(dev);
+ struct net_device *active_netdev, *backup_netdev;
+ int ret = 0;
+
+ active_netdev = rcu_dereference(vbi->active_netdev);
+ if (active_netdev) {
+ ret = dev_set_mtu(active_netdev, new_mtu);
+ if (ret)
+ return ret;
+ }
+
+ backup_netdev = rcu_dereference(vbi->backup_netdev);
+ if (backup_netdev) {
+ ret = dev_set_mtu(backup_netdev, new_mtu);
+ if (ret) {
+ dev_set_mtu(active_netdev, dev->mtu);
+ return ret;
+ }
+ }
+
+ dev->mtu = new_mtu;
+ return 0;
+}
+
+static void virtnet_bypass_set_rx_mode(struct net_device *dev)
+{
+ struct virtnet_bypass_info *vbi = netdev_priv(dev);
+ struct net_device *child_netdev;
+
+ rcu_read_lock();
+
+ child_netdev = rcu_dereference(vbi->active_netdev);
+ if (child_netdev) {
+ dev_uc_sync_multiple(child_netdev, dev);
+ dev_mc_sync_multiple(child_netdev, dev);
+ }
+
+ child_netdev = rcu_dereference(vbi->backup_netdev);
+ if (child_netdev) {
+ dev_uc_sync_multiple(child_netdev, dev);
+ dev_mc_sync_multiple(child_netdev, dev);
+ }
+
+ rcu_read_unlock();
+}
+
+static const struct net_device_ops virtnet_bypass_netdev_ops = {
+ .ndo_open = virtnet_bypass_open,
+ .ndo_stop = virtnet_bypass_close,
+ .ndo_start_xmit = virtnet_bypass_start_xmit,
+ .ndo_select_queue = virtnet_bypass_select_queue,
+ .ndo_get_stats64 = virtnet_bypass_get_stats,
+ .ndo_change_mtu = virtnet_bypass_change_mtu,
+ .ndo_set_rx_mode = virtnet_bypass_set_rx_mode,
+ .ndo_validate_addr = eth_validate_addr,
+ .ndo_features_check = passthru_features_check,
+};
+
+static int
+virtnet_bypass_ethtool_get_link_ksettings(struct net_device *dev,
+ struct ethtool_link_ksettings *cmd)
+{
+ struct virtnet_bypass_info *vbi = netdev_priv(dev);
+ struct net_device *child_netdev;
+
+ child_netdev = rtnl_dereference(vbi->active_netdev);
+ if (!child_netdev || !virtnet_bypass_xmit_ready(child_netdev)) {
+ child_netdev = rtnl_dereference(vbi->backup_netdev);
+ if (!child_netdev || !virtnet_bypass_xmit_ready(child_netdev)) {
+ cmd->base.duplex = DUPLEX_UNKNOWN;
+ cmd->base.port = PORT_OTHER;
+ cmd->base.speed = SPEED_UNKNOWN;
+
+ return 0;
+ }
+ }
+
+ return __ethtool_get_link_ksettings(child_netdev, cmd);
+}
+
+#define BYPASS_DRV_NAME "virtnet_bypass"
+#define BYPASS_DRV_VERSION "0.1"
+
+static void virtnet_bypass_ethtool_get_drvinfo(struct net_device *dev,
+ struct ethtool_drvinfo *drvinfo)
+{
+ strlcpy(drvinfo->driver, BYPASS_DRV_NAME, sizeof(drvinfo->driver));
+ strlcpy(drvinfo->version, BYPASS_DRV_VERSION, sizeof(drvinfo->version));
+}
+
+static const struct ethtool_ops virtnet_bypass_ethtool_ops = {
+ .get_drvinfo = virtnet_bypass_ethtool_get_drvinfo,
+ .get_link = ethtool_op_get_link,
+ .get_link_ksettings = virtnet_bypass_ethtool_get_link_ksettings,
+};
+
+static int virtnet_bypass_join_child(struct net_device *bypass_netdev,
+ struct net_device *child_netdev)
+{
+ struct virtnet_bypass_info *vbi;
+ bool backup;
+
+ vbi = netdev_priv(bypass_netdev);
+ backup = (child_netdev->dev.parent == bypass_netdev->dev.parent);
+ if (backup ? rtnl_dereference(vbi->backup_netdev) :
+ rtnl_dereference(vbi->active_netdev)) {
+ netdev_info(bypass_netdev,
+ "%s attempting to join bypass dev when %s already present\n",
+ child_netdev->name, backup ? "backup" : "active");
+ return -EEXIST;
+ }
+
+ dev_hold(child_netdev);
+
+ if (backup) {
+ rcu_assign_pointer(vbi->backup_netdev, child_netdev);
+ dev_get_stats(vbi->backup_netdev, &vbi->backup_stats);
+ } else {
+ rcu_assign_pointer(vbi->active_netdev, child_netdev);
+ dev_get_stats(vbi->active_netdev, &vbi->active_stats);
+ bypass_netdev->min_mtu = child_netdev->min_mtu;
+ bypass_netdev->max_mtu = child_netdev->max_mtu;
+ }
+
+ netdev_info(bypass_netdev, "child:%s joined\n", child_netdev->name);
+
+ return 0;
+}
+
+static int virtnet_bypass_register_child(struct net_device *bypass_netdev,
+ struct net_device *child_netdev)
+{
+ struct virtnet_bypass_info *vbi;
+ bool backup;
+
+ vbi = netdev_priv(bypass_netdev);
+ backup = (child_netdev->dev.parent == bypass_netdev->dev.parent);
+ if (backup ? rtnl_dereference(vbi->backup_netdev) :
+ rtnl_dereference(vbi->active_netdev)) {
+ netdev_info(bypass_netdev,
+ "%s attempting to register bypass dev when %s already present\n",
+ child_netdev->name, backup ? "backup" : "active");
+ return -EEXIST;
+ }
+
+ /* Avoid non pci devices as active netdev */
+ if (!backup && (!child_netdev->dev.parent ||
+ !dev_is_pci(child_netdev->dev.parent)))
+ return -EINVAL;
+
+ netdev_info(bypass_netdev, "child:%s registered\n", child_netdev->name);
+
+ return 0;
+}
+
+static int virtnet_bypass_release_child(struct net_device *bypass_netdev,
+ struct net_device *child_netdev)
+{
+ struct net_device *backup_netdev, *active_netdev;
+ struct virtnet_bypass_info *vbi;
+
+ vbi = netdev_priv(bypass_netdev);
+ active_netdev = rtnl_dereference(vbi->active_netdev);
+ backup_netdev = rtnl_dereference(vbi->backup_netdev);
+
+ if (child_netdev != active_netdev && child_netdev != backup_netdev)
+ return -EINVAL;
+
+ netdev_info(bypass_netdev, "child:%s released\n", child_netdev->name);
+
+ return 0;
+}
+
+static int virtnet_bypass_unregister_child(struct net_device *bypass_netdev,
+ struct net_device *child_netdev)
+{
+ struct net_device *backup_netdev, *active_netdev;
+ struct virtnet_bypass_info *vbi;
+
+ vbi = netdev_priv(bypass_netdev);
+ active_netdev = rtnl_dereference(vbi->active_netdev);
+ backup_netdev = rtnl_dereference(vbi->backup_netdev);
+
+ if (child_netdev != active_netdev && child_netdev != backup_netdev)
+ return -EINVAL;
+
+ if (child_netdev == backup_netdev) {
+ RCU_INIT_POINTER(vbi->backup_netdev, NULL);
+ } else {
+ RCU_INIT_POINTER(vbi->active_netdev, NULL);
+ if (backup_netdev) {
+ bypass_netdev->min_mtu = backup_netdev->min_mtu;
+ bypass_netdev->max_mtu = backup_netdev->max_mtu;
+ }
+ }
+
+ dev_put(child_netdev);
+
+ netdev_info(bypass_netdev, "child:%s unregistered\n",
+ child_netdev->name);
+
+ return 0;
+}
+
+static int virtnet_bypass_update_link(struct net_device *bypass_netdev,
+ struct net_device *child_netdev)
+{
+ struct net_device *active_netdev, *backup_netdev;
+ struct virtnet_bypass_info *vbi;
+
+ if (!netif_running(bypass_netdev))
+ return 0;
+
+ vbi = netdev_priv(bypass_netdev);
+
+ active_netdev = rtnl_dereference(vbi->active_netdev);
+ backup_netdev = rtnl_dereference(vbi->backup_netdev);
+
+ if (child_netdev != active_netdev && child_netdev != backup_netdev)
+ return -EINVAL;
+
+ if ((active_netdev && virtnet_bypass_xmit_ready(active_netdev)) ||
+ (backup_netdev && virtnet_bypass_xmit_ready(backup_netdev))) {
+ netif_carrier_on(bypass_netdev);
+ netif_tx_wake_all_queues(bypass_netdev);
+ } else {
+ netif_carrier_off(bypass_netdev);
+ netif_tx_stop_all_queues(bypass_netdev);
+ }
+
+ return 0;
+}
+
+/* Called when child dev is injecting data into network stack.
+ * Change the associated network device from lower dev to virtio.
+ * note: already called with rcu_read_lock
+ */
+static rx_handler_result_t virtnet_bypass_handle_frame(struct sk_buff **pskb)
+{
+ struct sk_buff *skb = *pskb;
+ struct net_device *ndev = rcu_dereference(skb->dev->rx_handler_data);
+
+ skb->dev = ndev;
+
+ return RX_HANDLER_ANOTHER;
+}
+
+static const struct bypass_ops virtnet_bypass_ops = {
+ .register_child = virtnet_bypass_register_child,
+ .join_child = virtnet_bypass_join_child,
+ .unregister_child = virtnet_bypass_unregister_child,
+ .release_child = virtnet_bypass_release_child,
+ .update_link = virtnet_bypass_update_link,
+ .handle_frame = virtnet_bypass_handle_frame,
+};
+
+static struct bypass *virtnet_bypass;
+
+static int virtnet_bypass_create(struct virtnet_info *vi)
+{
+ struct net_device *backup_netdev = vi->dev;
+ struct device *dev = &vi->vdev->dev;
+ struct net_device *bypass_netdev;
+ int res;
+
+ /* Alloc at least 2 queues, for now we are going with 16 assuming
+ * that most devices being bonded won't have too many queues.
+ */
+ bypass_netdev = alloc_etherdev_mq(sizeof(struct virtnet_bypass_info),
+ 16);
+ if (!bypass_netdev) {
+ dev_err(dev, "Unable to allocate bypass_netdev!\n");
+ return -ENOMEM;
+ }
+
+ dev_net_set(bypass_netdev, dev_net(backup_netdev));
+ SET_NETDEV_DEV(bypass_netdev, dev);
+
+ bypass_netdev->netdev_ops = &virtnet_bypass_netdev_ops;
+ bypass_netdev->ethtool_ops = &virtnet_bypass_ethtool_ops;
+
+ /* Initialize the device options */
+ bypass_netdev->flags |= IFF_MASTER;
+ bypass_netdev->priv_flags |= IFF_UNICAST_FLT | IFF_NO_QUEUE;
+ bypass_netdev->priv_flags &= ~(IFF_XMIT_DST_RELEASE |
+ IFF_TX_SKB_SHARING);
+
+ /* don't acquire bypass netdev's netif_tx_lock when transmitting */
+ bypass_netdev->features |= NETIF_F_LLTX;
+
+ /* Don't allow bypass devices to change network namespaces. */
+ bypass_netdev->features |= NETIF_F_NETNS_LOCAL;
+
+ bypass_netdev->hw_features = NETIF_F_HW_CSUM | NETIF_F_SG |
+ NETIF_F_FRAGLIST | NETIF_F_ALL_TSO |
+ NETIF_F_HIGHDMA | NETIF_F_LRO;
+
+ bypass_netdev->hw_features |= NETIF_F_GSO_ENCAP_ALL;
+ bypass_netdev->features |= bypass_netdev->hw_features;
+
+ /* For now treat bypass netdev as VLAN challenged since we
+ * cannot assume VLAN functionality with a VF
+ */
+ bypass_netdev->features |= NETIF_F_VLAN_CHALLENGED;
+
+ memcpy(bypass_netdev->dev_addr, backup_netdev->dev_addr,
+ bypass_netdev->addr_len);
+
+ bypass_netdev->min_mtu = backup_netdev->min_mtu;
+ bypass_netdev->max_mtu = backup_netdev->max_mtu;
+
+ res = register_netdev(bypass_netdev);
+ if (res < 0) {
+ dev_err(dev, "Unable to register bypass_netdev!\n");
+ goto err_register_netdev;
+ }
+
+ netif_carrier_off(bypass_netdev);
+
+ res = bypass_register_instance(virtnet_bypass, bypass_netdev);
+ if (res < 0)
+ goto err_bypass;
+
+ rcu_assign_pointer(vi->bypass_netdev, bypass_netdev);
+
+ return 0;
+
+err_bypass:
+ unregister_netdev(bypass_netdev);
+err_register_netdev:
+ free_netdev(bypass_netdev);
+
+ return res;
+}
+
+static void virtnet_bypass_destroy(struct virtnet_info *vi)
+{
+ struct net_device *bypass_netdev;
+ struct virtnet_bypass_info *vbi;
+ struct net_device *child_netdev;
+
+ bypass_netdev = rcu_dereference(vi->bypass_netdev);
+ /* no device found, nothing to free */
+ if (!bypass_netdev)
+ return;
+
+ vbi = netdev_priv(bypass_netdev);
+
+ netif_device_detach(bypass_netdev);
+
+ rtnl_lock();
+
+ child_netdev = rtnl_dereference(vbi->active_netdev);
+ if (child_netdev)
+ bypass_unregister_child(child_netdev);
+
+ child_netdev = rtnl_dereference(vbi->backup_netdev);
+ if (child_netdev)
+ bypass_unregister_child(child_netdev);
+
+ unregister_netdevice(bypass_netdev);
+
+ bypass_unregister_instance(virtnet_bypass, bypass_netdev);
+
+ rtnl_unlock();
+
+ free_netdev(bypass_netdev);
+}
+
+/* END of functions supporting VIRTIO_NET_F_BACKUP feature. */
+
static int virtnet_probe(struct virtio_device *vdev)
{
int i, err = -ENOMEM;
@@ -2839,10 +3432,15 @@ static int virtnet_probe(struct virtio_device *vdev)
virtnet_init_settings(dev);
+ if (virtio_has_feature(vdev, VIRTIO_NET_F_BACKUP)) {
+ if (virtnet_bypass_create(vi) != 0)
+ goto free_vqs;
+ }
+
err = register_netdev(dev);
if (err) {
pr_debug("virtio_net: registering device failed\n");
- goto free_vqs;
+ goto free_bypass;
}
virtio_device_ready(vdev);
@@ -2879,6 +3477,8 @@ static int virtnet_probe(struct virtio_device *vdev)
vi->vdev->config->reset(vdev);
unregister_netdev(dev);
+free_bypass:
+ virtnet_bypass_destroy(vi);
free_vqs:
cancel_delayed_work_sync(&vi->refill);
free_receive_page_frags(vi);
@@ -2913,6 +3513,8 @@ static void virtnet_remove(struct virtio_device *vdev)
unregister_netdev(vi->dev);
+ virtnet_bypass_destroy(vi);
+
remove_vq_common(vi);
free_netdev(vi->dev);
@@ -2996,6 +3598,11 @@ static __init int virtio_net_driver_init(void)
{
int ret;
+ virtnet_bypass = bypass_register_driver(&virtnet_bypass_ops,
+ &virtnet_bypass_netdev_ops);
+ if (!virtnet_bypass)
+ return -ENOMEM;
+
ret = cpuhp_setup_state_multi(CPUHP_AP_ONLINE_DYN, "virtio/net:online",
virtnet_cpu_online,
virtnet_cpu_down_prep);
@@ -3010,12 +3617,14 @@ static __init int virtio_net_driver_init(void)
ret = register_virtio_driver(&virtio_net_driver);
if (ret)
goto err_virtio;
+
return 0;
err_virtio:
cpuhp_remove_multi_state(CPUHP_VIRT_NET_DEAD);
err_dead:
cpuhp_remove_multi_state(virtionet_online);
out:
+ bypass_unregister_driver(virtnet_bypass);
return ret;
}
module_init(virtio_net_driver_init);
@@ -3025,6 +3634,7 @@ static __exit void virtio_net_driver_exit(void)
unregister_virtio_driver(&virtio_net_driver);
cpuhp_remove_multi_state(CPUHP_VIRT_NET_DEAD);
cpuhp_remove_multi_state(virtionet_online);
+ bypass_unregister_driver(virtnet_bypass);
}
module_exit(virtio_net_driver_exit);
--
2.14.3
^ permalink raw reply related
* [RFC PATCH net-next v5 2/4] net: Introduce generic bypass module
From: Sridhar Samudrala @ 2018-04-05 21:08 UTC (permalink / raw)
To: mst, stephen, davem, netdev, virtualization, virtio-dev,
jesse.brandeburg, alexander.h.duyck, kubakici, sridhar.samudrala,
jasowang, loseweigh, jiri
In-Reply-To: <1522962503-3963-1-git-send-email-sridhar.samudrala@intel.com>
This provides a generic interface for paravirtual drivers to listen
for netdev register/unregister/link change events from pci ethernet
devices with the same MAC and takeover their datapath. The notifier and
event handling code is based on the existing netvsc implementation. A
paravirtual driver can use this module by registering a set of ops and
each instance of the device when it is probed.
Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
---
include/net/bypass.h | 80 ++++++++++
net/Kconfig | 18 +++
net/core/Makefile | 1 +
net/core/bypass.c | 406 +++++++++++++++++++++++++++++++++++++++++++++++++++
4 files changed, 505 insertions(+)
create mode 100644 include/net/bypass.h
create mode 100644 net/core/bypass.c
diff --git a/include/net/bypass.h b/include/net/bypass.h
new file mode 100644
index 000000000000..e2dd122f951a
--- /dev/null
+++ b/include/net/bypass.h
@@ -0,0 +1,80 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2018, Intel Corporation. */
+
+#ifndef _NET_BYPASS_H
+#define _NET_BYPASS_H
+
+#include <linux/netdevice.h>
+
+struct bypass_ops {
+ int (*register_child)(struct net_device *bypass_netdev,
+ struct net_device *child_netdev);
+ int (*join_child)(struct net_device *bypass_netdev,
+ struct net_device *child_netdev);
+ int (*unregister_child)(struct net_device *bypass_netdev,
+ struct net_device *child_netdev);
+ int (*release_child)(struct net_device *bypass_netdev,
+ struct net_device *child_netdev);
+ int (*update_link)(struct net_device *bypass_netdev,
+ struct net_device *child_netdev);
+ rx_handler_result_t (*handle_frame)(struct sk_buff **pskb);
+};
+
+struct bypass_instance {
+ struct list_head list;
+ struct net_device __rcu *bypass_netdev;
+ struct bypass *bypass;
+};
+
+struct bypass {
+ struct list_head list;
+ const struct bypass_ops *ops;
+ const struct net_device_ops *netdev_ops;
+ struct list_head instance_list;
+ struct mutex lock;
+};
+
+#if IS_ENABLED(CONFIG_NET_BYPASS)
+
+struct bypass *bypass_register_driver(const struct bypass_ops *ops,
+ const struct net_device_ops *netdev_ops);
+void bypass_unregister_driver(struct bypass *bypass);
+
+int bypass_register_instance(struct bypass *bypass, struct net_device *dev);
+int bypass_unregister_instance(struct bypass *bypass, struct net_device *dev);
+
+int bypass_unregister_child(struct net_device *child_netdev);
+
+#else
+
+static inline
+struct bypass *bypass_register_driver(const struct bypass_ops *ops,
+ const struct net_device_ops *netdev_ops)
+{
+ return NULL;
+}
+
+static inline void bypass_unregister_driver(struct bypass *bypass)
+{
+}
+
+static inline int bypass_register_instance(struct bypass *bypass,
+ struct net_device *dev)
+{
+ return 0;
+}
+
+static inline int bypass_unregister_instance(struct bypass *bypass,
+ struct net_device *dev)
+{
+ return 0;
+}
+
+static inline int bypass_unregister_child(struct net_device *child_netdev)
+{
+ return 0;
+}
+
+#endif
+
+#endif /* _NET_BYPASS_H */
diff --git a/net/Kconfig b/net/Kconfig
index 0428f12c25c2..994445f4a96a 100644
--- a/net/Kconfig
+++ b/net/Kconfig
@@ -423,6 +423,24 @@ config MAY_USE_DEVLINK
on MAY_USE_DEVLINK to ensure they do not cause link errors when
devlink is a loadable module and the driver using it is built-in.
+config NET_BYPASS
+ tristate "Bypass interface"
+ ---help---
+ This provides a generic interface for paravirtual drivers to listen
+ for netdev register/unregister/link change events from pci ethernet
+ devices with the same MAC and takeover their datapath. This also
+ enables live migration of a VM with direct attached VF by failing
+ over to the paravirtual datapath when the VF is unplugged.
+
+config MAY_USE_BYPASS
+ tristate
+ default m if NET_BYPASS=m
+ default y if NET_BYPASS=y || NET_BYPASS=n
+ help
+ Drivers using the bypass infrastructure should have a dependency
+ on MAY_USE_BYPASS to ensure they do not cause link errors when
+ bypass is a loadable module and the driver using it is built-in.
+
endif # if NET
# Used by archs to tell that they support BPF JIT compiler plus which flavour.
diff --git a/net/core/Makefile b/net/core/Makefile
index 6dbbba8c57ae..a9727ed1c8fc 100644
--- a/net/core/Makefile
+++ b/net/core/Makefile
@@ -30,3 +30,4 @@ obj-$(CONFIG_DST_CACHE) += dst_cache.o
obj-$(CONFIG_HWBM) += hwbm.o
obj-$(CONFIG_NET_DEVLINK) += devlink.o
obj-$(CONFIG_GRO_CELLS) += gro_cells.o
+obj-$(CONFIG_NET_BYPASS) += bypass.o
diff --git a/net/core/bypass.c b/net/core/bypass.c
new file mode 100644
index 000000000000..7bde962ec3d4
--- /dev/null
+++ b/net/core/bypass.c
@@ -0,0 +1,406 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2018, Intel Corporation. */
+
+/* A common module to handle registrations and notifications for paravirtual
+ * drivers to enable accelerated datapath and support VF live migration.
+ *
+ * The notifier and event handling code is based on netvsc driver.
+ */
+
+#include <linux/netdevice.h>
+#include <linux/etherdevice.h>
+#include <linux/ethtool.h>
+#include <linux/module.h>
+#include <linux/slab.h>
+#include <linux/netdevice.h>
+#include <linux/netpoll.h>
+#include <linux/rtnetlink.h>
+#include <linux/if_vlan.h>
+#include <net/sch_generic.h>
+#include <uapi/linux/if_arp.h>
+#include <net/bypass.h>
+
+static LIST_HEAD(bypass_list);
+
+static DEFINE_MUTEX(bypass_mutex);
+
+struct bypass_instance *bypass_instance_alloc(struct net_device *dev)
+{
+ struct bypass_instance *bypass_instance;
+
+ bypass_instance = kzalloc(sizeof(*bypass_instance), GFP_KERNEL);
+ if (!bypass_instance)
+ return NULL;
+
+ dev_hold(dev);
+ rcu_assign_pointer(bypass_instance->bypass_netdev, dev);
+
+ return bypass_instance;
+}
+
+void bypass_instance_free(struct bypass_instance *bypass_instance)
+{
+ struct net_device *bypass_netdev;
+
+ bypass_netdev = rcu_dereference(bypass_instance->bypass_netdev);
+
+ dev_put(bypass_netdev);
+ kfree(bypass_instance);
+}
+
+static struct bypass_instance *bypass_get_instance_bymac(u8 *mac)
+{
+ struct bypass_instance *bypass_instance;
+ struct net_device *bypass_netdev;
+ struct bypass *bypass;
+
+ list_for_each_entry(bypass, &bypass_list, list) {
+ mutex_lock(&bypass->lock);
+ list_for_each_entry(bypass_instance, &bypass->instance_list,
+ list) {
+ bypass_netdev =
+ rcu_dereference(bypass_instance->bypass_netdev);
+ if (ether_addr_equal(bypass_netdev->perm_addr, mac)) {
+ mutex_unlock(&bypass->lock);
+ goto out;
+ }
+ }
+ mutex_unlock(&bypass->lock);
+ }
+
+ bypass_instance = NULL;
+out:
+ return bypass_instance;
+}
+
+static int bypass_register_child(struct net_device *child_netdev)
+{
+ struct bypass_instance *bypass_instance;
+ struct bypass *bypass;
+ struct net_device *bypass_netdev;
+ int ret, orig_mtu;
+
+ ASSERT_RTNL();
+
+ mutex_lock(&bypass_mutex);
+ bypass_instance = bypass_get_instance_bymac(child_netdev->perm_addr);
+ if (!bypass_instance) {
+ mutex_unlock(&bypass_mutex);
+ goto done;
+ }
+
+ bypass_netdev = rcu_dereference(bypass_instance->bypass_netdev);
+ bypass = bypass_instance->bypass;
+ mutex_unlock(&bypass_mutex);
+
+ if (!bypass->ops->register_child)
+ goto done;
+
+ ret = bypass->ops->register_child(bypass_netdev, child_netdev);
+ if (ret != 0)
+ goto done;
+
+ ret = netdev_rx_handler_register(child_netdev,
+ bypass->ops->handle_frame,
+ bypass_netdev);
+ if (ret != 0) {
+ netdev_err(child_netdev,
+ "can not register bypass rx handler (err = %d)\n",
+ ret);
+ goto rx_handler_failed;
+ }
+
+ ret = netdev_upper_dev_link(child_netdev, bypass_netdev, NULL);
+ if (ret != 0) {
+ netdev_err(child_netdev,
+ "can not set master device %s (err = %d)\n",
+ bypass_netdev->name, ret);
+ goto upper_link_failed;
+ }
+
+ child_netdev->flags |= IFF_SLAVE;
+
+ if (netif_running(bypass_netdev)) {
+ ret = dev_open(child_netdev);
+ if (ret && (ret != -EBUSY)) {
+ netdev_err(bypass_netdev,
+ "Opening child %s failed ret:%d\n",
+ child_netdev->name, ret);
+ goto err_interface_up;
+ }
+ }
+
+ /* Align MTU of child with master */
+ orig_mtu = child_netdev->mtu;
+ ret = dev_set_mtu(child_netdev, bypass_netdev->mtu);
+ if (ret != 0) {
+ netdev_err(bypass_netdev,
+ "unable to change mtu of %s to %u register failed\n",
+ child_netdev->name, bypass_netdev->mtu);
+ goto err_set_mtu;
+ }
+
+ ret = bypass->ops->join_child(bypass_netdev, child_netdev);
+ if (ret != 0)
+ goto err_join;
+
+ call_netdevice_notifiers(NETDEV_JOIN, child_netdev);
+
+ goto done;
+
+err_join:
+ dev_set_mtu(child_netdev, orig_mtu);
+err_set_mtu:
+ dev_close(child_netdev);
+err_interface_up:
+ netdev_upper_dev_unlink(child_netdev, bypass_netdev);
+ child_netdev->flags &= ~IFF_SLAVE;
+upper_link_failed:
+ netdev_rx_handler_unregister(child_netdev);
+rx_handler_failed:
+ bypass->ops->unregister_child(bypass_netdev, child_netdev);
+
+done:
+ return NOTIFY_DONE;
+}
+
+int bypass_unregister_child(struct net_device *child_netdev)
+{
+ struct bypass_instance *bypass_instance;
+ struct net_device *bypass_netdev;
+ struct bypass *bypass;
+ int ret;
+
+ ASSERT_RTNL();
+
+ mutex_lock(&bypass_mutex);
+ bypass_instance = bypass_get_instance_bymac(child_netdev->perm_addr);
+ if (!bypass_instance) {
+ mutex_unlock(&bypass_mutex);
+ goto done;
+ }
+
+ bypass_netdev = rcu_dereference(bypass_instance->bypass_netdev);
+ bypass = bypass_instance->bypass;
+ mutex_unlock(&bypass_mutex);
+
+ ret = bypass->ops->release_child(bypass_netdev, child_netdev);
+ if (ret != 0)
+ goto done;
+
+ netdev_rx_handler_unregister(child_netdev);
+ netdev_upper_dev_unlink(child_netdev, bypass_netdev);
+ child_netdev->flags &= ~IFF_SLAVE;
+
+ if (!bypass->ops->unregister_child)
+ goto done;
+
+ bypass->ops->unregister_child(bypass_netdev, child_netdev);
+
+done:
+ return NOTIFY_DONE;
+}
+EXPORT_SYMBOL(bypass_unregister_child);
+
+static int bypass_update_link(struct net_device *child_netdev)
+{
+ struct bypass_instance *bypass_instance;
+ struct net_device *bypass_netdev;
+ struct bypass *bypass;
+
+ ASSERT_RTNL();
+
+ mutex_lock(&bypass_mutex);
+ bypass_instance = bypass_get_instance_bymac(child_netdev->perm_addr);
+ if (!bypass_instance) {
+ mutex_unlock(&bypass_mutex);
+ goto done;
+ }
+
+ bypass_netdev = rcu_dereference(bypass_instance->bypass_netdev);
+ bypass = bypass_instance->bypass;
+ mutex_unlock(&bypass_mutex);
+
+ if (!bypass->ops->update_link)
+ goto done;
+
+ bypass->ops->update_link(bypass_netdev, child_netdev);
+
+done:
+ return NOTIFY_DONE;
+}
+
+static bool bypass_validate_child_dev(struct net_device *dev)
+{
+ /* Avoid non-Ethernet type devices */
+ if (dev->type != ARPHRD_ETHER)
+ return false;
+
+ /* Avoid Vlan dev with same MAC registering as VF */
+ if (is_vlan_dev(dev))
+ return false;
+
+ /* Avoid Bonding master dev with same MAC registering as child dev */
+ if ((dev->priv_flags & IFF_BONDING) && (dev->flags & IFF_MASTER))
+ return false;
+
+ return true;
+}
+
+static int
+bypass_event(struct notifier_block *this, unsigned long event, void *ptr)
+{
+ struct net_device *event_dev = netdev_notifier_info_to_dev(ptr);
+ struct bypass *bypass;
+
+ /* Skip Parent events */
+ mutex_lock(&bypass_mutex);
+ list_for_each_entry(bypass, &bypass_list, list) {
+ if (event_dev->netdev_ops == bypass->netdev_ops) {
+ mutex_unlock(&bypass_mutex);
+ return NOTIFY_DONE;
+ }
+ }
+ mutex_unlock(&bypass_mutex);
+
+ if (!bypass_validate_child_dev(event_dev))
+ return NOTIFY_DONE;
+
+ switch (event) {
+ case NETDEV_REGISTER:
+ return bypass_register_child(event_dev);
+ case NETDEV_UNREGISTER:
+ return bypass_unregister_child(event_dev);
+ case NETDEV_UP:
+ case NETDEV_DOWN:
+ case NETDEV_CHANGE:
+ return bypass_update_link(event_dev);
+ default:
+ return NOTIFY_DONE;
+ }
+}
+
+static struct notifier_block bypass_notifier = {
+ .notifier_call = bypass_event,
+};
+
+static void bypass_register_existing_child(struct net_device *bypass_netdev)
+{
+ struct net *net = dev_net(bypass_netdev);
+ struct net_device *dev;
+
+ rtnl_lock();
+ for_each_netdev(net, dev) {
+ if (dev == bypass_netdev)
+ continue;
+ if (!bypass_validate_child_dev(dev))
+ continue;
+ if (ether_addr_equal(bypass_netdev->perm_addr, dev->perm_addr))
+ bypass_register_child(dev);
+ }
+ rtnl_unlock();
+}
+
+int bypass_register_instance(struct bypass *bypass, struct net_device *dev)
+{
+ struct bypass_instance *bypass_instance;
+ struct net_device *bypass_netdev;
+ int ret = 0;
+
+ mutex_lock(&bypass->lock);
+ list_for_each_entry(bypass_instance, &bypass->instance_list, list) {
+ bypass_netdev = rcu_dereference(bypass_instance->bypass_netdev);
+ if (bypass_netdev == dev) {
+ ret = -EEXIST;
+ goto done;
+ }
+ }
+
+ bypass_instance = bypass_instance_alloc(dev);
+ if (!bypass_instance) {
+ ret = -ENOMEM;
+ goto done;
+ }
+
+ bypass_instance->bypass = bypass;
+ list_add_tail(&bypass_instance->list, &bypass->instance_list);
+
+done:
+ mutex_unlock(&bypass->lock);
+ bypass_register_existing_child(dev);
+ return ret;
+}
+EXPORT_SYMBOL(bypass_register_instance);
+
+int bypass_unregister_instance(struct bypass *bypass, struct net_device *dev)
+{
+ struct bypass_instance *bypass_instance;
+ struct net_device *bypass_netdev;
+ int ret = 0;
+
+ mutex_lock(&bypass->lock);
+ list_for_each_entry(bypass_instance, &bypass->instance_list, list) {
+ bypass_netdev = rcu_dereference(bypass_instance->bypass_netdev);
+ if (bypass_netdev == dev) {
+ list_del(&bypass_instance->list);
+ bypass_instance_free(bypass_instance);
+ goto done;
+ }
+ }
+
+ ret = -ENOENT;
+done:
+ mutex_unlock(&bypass->lock);
+ return ret;
+}
+EXPORT_SYMBOL(bypass_unregister_instance);
+
+struct bypass *bypass_register_driver(const struct bypass_ops *ops,
+ const struct net_device_ops *netdev_ops)
+{
+ struct bypass *bypass;
+
+ bypass = kzalloc(sizeof(*bypass), GFP_KERNEL);
+ if (!bypass)
+ return NULL;
+
+ bypass->ops = ops;
+ bypass->netdev_ops = netdev_ops;
+ INIT_LIST_HEAD(&bypass->instance_list);
+
+ mutex_lock(&bypass_mutex);
+ list_add_tail(&bypass->list, &bypass_list);
+ mutex_unlock(&bypass_mutex);
+
+ return bypass;
+}
+EXPORT_SYMBOL_GPL(bypass_register_driver);
+
+void bypass_unregister_driver(struct bypass *bypass)
+{
+ mutex_lock(&bypass_mutex);
+ list_del(&bypass->list);
+ mutex_unlock(&bypass_mutex);
+
+ kfree(bypass);
+}
+EXPORT_SYMBOL_GPL(bypass_unregister_driver);
+
+static __init int
+bypass_init(void)
+{
+ register_netdevice_notifier(&bypass_notifier);
+
+ return 0;
+}
+module_init(bypass_init);
+
+static __exit
+void bypass_exit(void)
+{
+ unregister_netdevice_notifier(&bypass_notifier);
+}
+module_exit(bypass_exit);
+
+MODULE_DESCRIPTION("Bypass infrastructure/interface for Paravirtual drivers");
+MODULE_LICENSE("GPL v2");
--
2.14.3
^ permalink raw reply related
* [RFC PATCH net-next v5 1/4] virtio_net: Introduce VIRTIO_NET_F_BACKUP feature bit
From: Sridhar Samudrala @ 2018-04-05 21:08 UTC (permalink / raw)
To: mst, stephen, davem, netdev, virtualization, virtio-dev,
jesse.brandeburg, alexander.h.duyck, kubakici, sridhar.samudrala,
jasowang, loseweigh, jiri
In-Reply-To: <1522962503-3963-1-git-send-email-sridhar.samudrala@intel.com>
This feature bit can be used by hypervisor to indicate virtio_net device to
act as a backup for another device with the same MAC address.
VIRTIO_NET_F_BACKUP is defined as bit 62 as it is a device feature bit.
Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
---
drivers/net/virtio_net.c | 2 +-
include/uapi/linux/virtio_net.h | 3 +++
2 files changed, 4 insertions(+), 1 deletion(-)
diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 7b187ec7411e..befb5944f3fd 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -2962,7 +2962,7 @@ static struct virtio_device_id id_table[] = {
VIRTIO_NET_F_GUEST_ANNOUNCE, VIRTIO_NET_F_MQ, \
VIRTIO_NET_F_CTRL_MAC_ADDR, \
VIRTIO_NET_F_MTU, VIRTIO_NET_F_CTRL_GUEST_OFFLOADS, \
- VIRTIO_NET_F_SPEED_DUPLEX
+ VIRTIO_NET_F_SPEED_DUPLEX, VIRTIO_NET_F_BACKUP
static unsigned int features[] = {
VIRTNET_FEATURES,
diff --git a/include/uapi/linux/virtio_net.h b/include/uapi/linux/virtio_net.h
index 5de6ed37695b..c7c35fd1a5ed 100644
--- a/include/uapi/linux/virtio_net.h
+++ b/include/uapi/linux/virtio_net.h
@@ -57,6 +57,9 @@
* Steering */
#define VIRTIO_NET_F_CTRL_MAC_ADDR 23 /* Set MAC address */
+#define VIRTIO_NET_F_BACKUP 62 /* Act as backup for another device
+ * with the same MAC.
+ */
#define VIRTIO_NET_F_SPEED_DUPLEX 63 /* Device set linkspeed and duplex */
#ifndef VIRTIO_NET_NO_LEGACY
--
2.14.3
^ permalink raw reply related
* [RFC PATCH net-next v5 0/4] Enable virtio_net to act as a backup for a passthru device
From: Sridhar Samudrala @ 2018-04-05 21:08 UTC (permalink / raw)
To: mst, stephen, davem, netdev, virtualization, virtio-dev,
jesse.brandeburg, alexander.h.duyck, kubakici, sridhar.samudrala,
jasowang, loseweigh, jiri
The main motivation for this patch is to enable cloud service providers
to provide an accelerated datapath to virtio-net enabled VMs in a
transparent manner with no/minimal guest userspace changes. This also
enables hypervisor controlled live migration to be supported with VMs that
have direct attached SR-IOV VF devices.
Patch 1 introduces a new feature bit VIRTIO_NET_F_BACKUP that can be
used by hypervisor to indicate that virtio_net interface should act as
a backup for another device with the same MAC address.
Patch 2 introduces a bypass module that provides a generic interface for
paravirtual drivers to listen for netdev register/unregister/link change
events from pci ethernet devices with the same MAC and takeover their
datapath. The notifier and event handling code is based on the existing
netvsc implementation. A paravirtual driver can use this module by
registering a set of ops and each instance of the device when it is probed.
Patch 3 extends virtio_net to use alternate datapath when available and
registered. When BACKUP feature is enabled, virtio_net driver creates
an additional 'bypass' netdev that acts as a master device and controls
2 slave devices. The original virtio_net netdev is registered as
'backup' netdev and a passthru/vf device with the same MAC gets
registered as 'active' netdev. Both 'bypass' and 'backup' netdevs are
associated with the same 'pci' device. The user accesses the network
interface via 'bypass' netdev. The 'bypass' netdev chooses 'active' netdev
as default for transmits when it is available with link up and running.
Patch 4 refactors netvsc to use the registration/notification framework
supported by bypass module.
As this patch series is initially focusing on usecases where hypervisor
fully controls the VM networking and the guest is not expected to directly
configure any hardware settings, it doesn't expose all the ndo/ethtool ops
that are supported by virtio_net at this time. To support additional usecases,
it should be possible to enable additional ops later by caching the state
in virtio netdev and replaying when the 'active' netdev gets registered.
The hypervisor needs to enable only one datapath at any time so that packets
don't get looped back to the VM over the other datapath. When a VF is
plugged, the virtio datapath link state can be marked as down.
At the time of live migration, the hypervisor needs to unplug the VF device
from the guest on the source host and reset the MAC filter of the VF to
initiate failover of datapath to virtio before starting the migration. After
the migration is completed, the destination hypervisor sets the MAC filter
on the VF and plugs it back to the guest to switch over to VF datapath.
This patch is based on the discussion initiated by Jesse on this thread.
https://marc.info/?l=linux-virtualization&m=151189725224231&w=2
v5 RFC:
Based on Jiri's comments, moved the common functionality to a 'bypass'
module so that the same notifier and event handlers to handle child
register/unregister/link change events can be shared between virtio_net
and netvsc.
Improved error handling based on Siwei's comments.
v4:
- Based on the review comments on the v3 version of the RFC patch and
Jakub's suggestion for the naming issue with 3 netdev solution,
proposed 3 netdev in-driver bonding solution for virtio-net.
v3 RFC:
- Introduced 3 netdev model and pointed out a couple of issues with
that model and proposed 2 netdev model to avoid these issues.
- Removed broadcast/multicast optimization and only use virtio as
backup path when VF is unplugged.
v2 RFC:
- Changed VIRTIO_NET_F_MASTER to VIRTIO_NET_F_BACKUP (mst)
- made a small change to the virtio-net xmit path to only use VF datapath
for unicasts. Broadcasts/multicasts use virtio datapath. This avoids
east-west broadcasts to go over the PCI link.
- added suppport for the feature bit in qemu
Sridhar Samudrala (4):
virtio_net: Introduce VIRTIO_NET_F_BACKUP feature bit
net: Introduce generic bypass module
virtio_net: Extend virtio to use VF datapath when available
netvsc: refactor notifier/event handling code to use the bypass
framework
drivers/net/Kconfig | 1 +
drivers/net/hyperv/Kconfig | 1 +
drivers/net/hyperv/netvsc_drv.c | 219 ++++----------
drivers/net/virtio_net.c | 614 +++++++++++++++++++++++++++++++++++++++-
include/net/bypass.h | 80 ++++++
include/uapi/linux/virtio_net.h | 3 +
net/Kconfig | 18 ++
net/core/Makefile | 1 +
net/core/bypass.c | 406 ++++++++++++++++++++++++++
9 files changed, 1184 insertions(+), 159 deletions(-)
create mode 100644 include/net/bypass.h
create mode 100644 net/core/bypass.c
--
2.14.3
^ permalink raw reply
* Re: [RFC] virtio: Use DMA MAP API for devices without an IOMMU
From: Michael S. Tsirkin @ 2018-04-05 18:34 UTC (permalink / raw)
To: Benjamin Herrenschmidt
Cc: robh, linux-kernel, virtualization, joe, david, linuxppc-dev,
elfring, Anshuman Khandual
In-Reply-To: <1522940983.21446.205.camel@kernel.crashing.org>
On Fri, Apr 06, 2018 at 01:09:43AM +1000, Benjamin Herrenschmidt wrote:
> On Thu, 2018-04-05 at 17:54 +0300, Michael S. Tsirkin wrote:
> > On Thu, Apr 05, 2018 at 08:09:30PM +0530, Anshuman Khandual wrote:
> > > On 04/05/2018 04:26 PM, Anshuman Khandual wrote:
> > > > There are certian platforms which would like to use SWIOTLB based DMA API
> > > > for bouncing purpose without actually requiring an IOMMU back end. But the
> > > > virtio core does not allow such mechanism. Right now DMA MAP API is only
> > > > selected for devices which have an IOMMU and then the QEMU/host back end
> > > > will process all incoming SG buffer addresses as IOVA instead of simple
> > > > GPA which is the case for simple bounce buffers after being processed with
> > > > SWIOTLB API. To enable this usage, it introduces an architecture specific
> > > > function which will just make virtio core front end select DMA operations
> > > > structure.
> > > >
> > > > Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
> > >
> > > + "Michael S. Tsirkin" <mst@redhat.com>
> >
> > I'm confused by this.
> >
> > static bool vring_use_dma_api(struct virtio_device *vdev)
> > {
> > if (!virtio_has_iommu_quirk(vdev))
> > return true;
> >
> >
> > Why doesn't setting VIRTIO_F_IOMMU_PLATFORM on the
> > hypervisor side sufficient?
>
> In this specific case, because that would make qemu expect an iommu,
> and there isn't one.
I think that you can set iommu_platform in qemu without an iommu.
> Anshuman, you need to provide more background here. I don't have time
> right now it's late, but explain about the fact that this is for a
> specific type of secure VM which has only a limited pool of (insecure)
> memory that can be shared with qemu, so all IOs need to bounce via that
> pool, which can be achieved by using swiotlb.
>
> Note: this isn't urgent, we can discuss alternative approaches, this is
> just to start the conversation.
>
> Cheers,
> Ben.
^ permalink raw reply
* Re: [PATCH v30 2/4] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_HINT
From: Michael S. Tsirkin @ 2018-04-05 18:30 UTC (permalink / raw)
To: Wang, Wei W
Cc: yang.zhang.wz@gmail.com, virtio-dev@lists.oasis-open.org,
riel@redhat.com, quan.xu0@gmail.com, kvm@vger.kernel.org,
nilal@redhat.com, liliang.opensource@gmail.com,
linux-kernel@vger.kernel.org, mhocko@kernel.org,
linux-mm@kvack.org, huangzhichao@huawei.com, pbonzini@redhat.com,
akpm@linux-foundation.org,
virtualization@lists.linux-foundation.org
In-Reply-To: <286AC319A985734F985F78AFA26841F7394A889E@shsmsx102.ccr.corp.intel.com>
On Thu, Apr 05, 2018 at 03:47:28PM +0000, Wang, Wei W wrote:
> On Thursday, April 5, 2018 10:04 PM, Michael S. Tsirkin wrote:
> > On Thu, Apr 05, 2018 at 02:05:03AM +0000, Wang, Wei W wrote:
> > > On Thursday, April 5, 2018 9:12 AM, Michael S. Tsirkin wrote:
> > > > On Thu, Apr 05, 2018 at 12:30:27AM +0000, Wang, Wei W wrote:
> > > > > On Wednesday, April 4, 2018 10:08 PM, Michael S. Tsirkin wrote:
> > > > > > On Wed, Apr 04, 2018 at 10:07:51AM +0800, Wei Wang wrote:
> > > > > > > On 04/04/2018 02:47 AM, Michael S. Tsirkin wrote:
> > > > > > > > On Wed, Apr 04, 2018 at 12:10:03AM +0800, Wei Wang wrote:
> > > > > I'm afraid the driver couldn't be aware if the added hints are
> > > > > stale or not,
> > > >
> > > >
> > > > No - I mean that driver has code that compares two values and stops
> > > > reporting. Can one of the values be stale?
> > >
> > > The driver compares "vb->cmd_id_use != vb->cmd_id_received" to decide
> > > if it needs to stop reporting hints, and cmd_id_received is what the
> > > driver reads from host (host notifies the driver to read for the
> > > latest value). If host sends a new cmd id, it will notify the guest to
> > > read again. I'm not sure how that could be a stale cmd id (or maybe I
> > > misunderstood your point here?)
> > >
> > > Best,
> > > Wei
> >
> > The comparison is done in one thread, the update in another one.
>
> I think this isn't something that could be solved by adding a lock,
> unless host waits for the driver's ACK about finishing the update
> (this is not agreed in the QEMU part discussion).
>
> Actually virtio_balloon has F_IOMMU_PLATFORM disabled, maybe we don't
> need to worry about that using DMA api case (we only have gpa added to
> the vq, and having some entries stay in the vq seems fine). For this
> feature, I think it would not work with F_IOMMU enabled either.
Adding a code comment explaining all this might be a good idea.
> If there is any further need (I couldn't think of a need so far), I
> think we could consider to let host inject a vq interrupt at some
> point, and then the driver handler can do the virtqueue_get_buf work.
>
> Best,
> Wei
^ permalink raw reply
* Re: [Intel-gfx] [PATCH 08/13] drm/virtio: Stop updating plane->crtc
From: Daniel Vetter @ 2018-04-05 16:46 UTC (permalink / raw)
To: Ville Syrjala; +Cc: David Airlie, intel-gfx, dri-devel, virtualization
In-Reply-To: <20180405151400.11326-8-ville.syrjala@linux.intel.com>
On Thu, Apr 05, 2018 at 06:13:55PM +0300, Ville Syrjala wrote:
> From: Ville Syrjälä <ville.syrjala@linux.intel.com>
>
> We want to get rid of plane->crtc on atomic drivers. Stop setting it.
>
> v2: s/fb/crtc/ in the commit message (Gerd)
>
> Cc: David Airlie <airlied@linux.ie>
> Cc: Gerd Hoffmann <kraxel@redhat.com>
> Cc: virtualization@lists.linux-foundation.org
> Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
> Reviewed-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>
> ---
> drivers/gpu/drm/virtio/virtgpu_display.c | 2 --
> 1 file changed, 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/virtio/virtgpu_display.c b/drivers/gpu/drm/virtio/virtgpu_display.c
> index 8cc8c34d67f5..42e842ceb53c 100644
> --- a/drivers/gpu/drm/virtio/virtgpu_display.c
> +++ b/drivers/gpu/drm/virtio/virtgpu_display.c
> @@ -302,8 +302,6 @@ static int vgdev_output_init(struct virtio_gpu_device *vgdev, int index)
> drm_crtc_init_with_planes(dev, crtc, primary, cursor,
> &virtio_gpu_crtc_funcs, NULL);
> drm_crtc_helper_add(crtc, &virtio_gpu_crtc_helper_funcs);
> - primary->crtc = crtc;
> - cursor->crtc = crtc;
>
> drm_connector_init(dev, connector, &virtio_gpu_connector_funcs,
> DRM_MODE_CONNECTOR_VIRTUAL);
> --
> 2.16.1
>
> _______________________________________________
> Intel-gfx mailing list
> Intel-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/intel-gfx
--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
^ permalink raw reply
* RE: [PATCH v30 2/4] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_HINT
From: Wang, Wei W @ 2018-04-05 15:47 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: yang.zhang.wz@gmail.com, virtio-dev@lists.oasis-open.org,
riel@redhat.com, quan.xu0@gmail.com, kvm@vger.kernel.org,
nilal@redhat.com, liliang.opensource@gmail.com,
linux-kernel@vger.kernel.org, mhocko@kernel.org,
linux-mm@kvack.org, huangzhichao@huawei.com, pbonzini@redhat.com,
akpm@linux-foundation.org,
virtualization@lists.linux-foundation.org
In-Reply-To: <20180405170248-mutt-send-email-mst@kernel.org>
On Thursday, April 5, 2018 10:04 PM, Michael S. Tsirkin wrote:
> On Thu, Apr 05, 2018 at 02:05:03AM +0000, Wang, Wei W wrote:
> > On Thursday, April 5, 2018 9:12 AM, Michael S. Tsirkin wrote:
> > > On Thu, Apr 05, 2018 at 12:30:27AM +0000, Wang, Wei W wrote:
> > > > On Wednesday, April 4, 2018 10:08 PM, Michael S. Tsirkin wrote:
> > > > > On Wed, Apr 04, 2018 at 10:07:51AM +0800, Wei Wang wrote:
> > > > > > On 04/04/2018 02:47 AM, Michael S. Tsirkin wrote:
> > > > > > > On Wed, Apr 04, 2018 at 12:10:03AM +0800, Wei Wang wrote:
> > > > I'm afraid the driver couldn't be aware if the added hints are
> > > > stale or not,
> > >
> > >
> > > No - I mean that driver has code that compares two values and stops
> > > reporting. Can one of the values be stale?
> >
> > The driver compares "vb->cmd_id_use != vb->cmd_id_received" to decide
> > if it needs to stop reporting hints, and cmd_id_received is what the
> > driver reads from host (host notifies the driver to read for the
> > latest value). If host sends a new cmd id, it will notify the guest to
> > read again. I'm not sure how that could be a stale cmd id (or maybe I
> > misunderstood your point here?)
> >
> > Best,
> > Wei
>
> The comparison is done in one thread, the update in another one.
I think this isn't something that could be solved by adding a lock, unless host waits for the driver's ACK about finishing the update (this is not agreed in the QEMU part discussion).
Actually virtio_balloon has F_IOMMU_PLATFORM disabled, maybe we don't need to worry about that using DMA api case (we only have gpa added to the vq, and having some entries stay in the vq seems fine). For this feature, I think it would not work with F_IOMMU enabled either.
If there is any further need (I couldn't think of a need so far), I think we could consider to let host inject a vq interrupt at some point, and then the driver handler can do the virtqueue_get_buf work.
Best,
Wei
^ permalink raw reply
* [PATCH 08/13] drm/virtio: Stop updating plane->crtc
From: Ville Syrjala @ 2018-04-05 15:13 UTC (permalink / raw)
To: dri-devel; +Cc: David Airlie, intel-gfx, virtualization
In-Reply-To: <20180405151400.11326-1-ville.syrjala@linux.intel.com>
From: Ville Syrjälä <ville.syrjala@linux.intel.com>
We want to get rid of plane->crtc on atomic drivers. Stop setting it.
v2: s/fb/crtc/ in the commit message (Gerd)
Cc: David Airlie <airlied@linux.ie>
Cc: Gerd Hoffmann <kraxel@redhat.com>
Cc: virtualization@lists.linux-foundation.org
Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Reviewed-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
---
drivers/gpu/drm/virtio/virtgpu_display.c | 2 --
1 file changed, 2 deletions(-)
diff --git a/drivers/gpu/drm/virtio/virtgpu_display.c b/drivers/gpu/drm/virtio/virtgpu_display.c
index 8cc8c34d67f5..42e842ceb53c 100644
--- a/drivers/gpu/drm/virtio/virtgpu_display.c
+++ b/drivers/gpu/drm/virtio/virtgpu_display.c
@@ -302,8 +302,6 @@ static int vgdev_output_init(struct virtio_gpu_device *vgdev, int index)
drm_crtc_init_with_planes(dev, crtc, primary, cursor,
&virtio_gpu_crtc_funcs, NULL);
drm_crtc_helper_add(crtc, &virtio_gpu_crtc_helper_funcs);
- primary->crtc = crtc;
- cursor->crtc = crtc;
drm_connector_init(dev, connector, &virtio_gpu_connector_funcs,
DRM_MODE_CONNECTOR_VIRTUAL);
--
2.16.1
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
^ permalink raw reply related
* Re: [RFC] virtio: Use DMA MAP API for devices without an IOMMU
From: Benjamin Herrenschmidt @ 2018-04-05 15:09 UTC (permalink / raw)
To: Michael S. Tsirkin, Anshuman Khandual
Cc: robh, linux-kernel, virtualization, joe, linuxppc-dev, elfring,
david
In-Reply-To: <20180405175326-mutt-send-email-mst@kernel.org>
On Thu, 2018-04-05 at 17:54 +0300, Michael S. Tsirkin wrote:
> On Thu, Apr 05, 2018 at 08:09:30PM +0530, Anshuman Khandual wrote:
> > On 04/05/2018 04:26 PM, Anshuman Khandual wrote:
> > > There are certian platforms which would like to use SWIOTLB based DMA API
> > > for bouncing purpose without actually requiring an IOMMU back end. But the
> > > virtio core does not allow such mechanism. Right now DMA MAP API is only
> > > selected for devices which have an IOMMU and then the QEMU/host back end
> > > will process all incoming SG buffer addresses as IOVA instead of simple
> > > GPA which is the case for simple bounce buffers after being processed with
> > > SWIOTLB API. To enable this usage, it introduces an architecture specific
> > > function which will just make virtio core front end select DMA operations
> > > structure.
> > >
> > > Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
> >
> > + "Michael S. Tsirkin" <mst@redhat.com>
>
> I'm confused by this.
>
> static bool vring_use_dma_api(struct virtio_device *vdev)
> {
> if (!virtio_has_iommu_quirk(vdev))
> return true;
>
>
> Why doesn't setting VIRTIO_F_IOMMU_PLATFORM on the
> hypervisor side sufficient?
In this specific case, because that would make qemu expect an iommu,
and there isn't one.
Anshuman, you need to provide more background here. I don't have time
right now it's late, but explain about the fact that this is for a
specific type of secure VM which has only a limited pool of (insecure)
memory that can be shared with qemu, so all IOs need to bounce via that
pool, which can be achieved by using swiotlb.
Note: this isn't urgent, we can discuss alternative approaches, this is
just to start the conversation.
Cheers,
Ben.
^ permalink raw reply
* Re: [RFC] virtio: Use DMA MAP API for devices without an IOMMU
From: Michael S. Tsirkin @ 2018-04-05 14:54 UTC (permalink / raw)
To: Anshuman Khandual
Cc: robh, linux-kernel, virtualization, joe, linuxppc-dev, elfring,
david
In-Reply-To: <3e1b113b-79ca-b700-5be9-10c66d74aabe@linux.vnet.ibm.com>
On Thu, Apr 05, 2018 at 08:09:30PM +0530, Anshuman Khandual wrote:
> On 04/05/2018 04:26 PM, Anshuman Khandual wrote:
> > There are certian platforms which would like to use SWIOTLB based DMA API
> > for bouncing purpose without actually requiring an IOMMU back end. But the
> > virtio core does not allow such mechanism. Right now DMA MAP API is only
> > selected for devices which have an IOMMU and then the QEMU/host back end
> > will process all incoming SG buffer addresses as IOVA instead of simple
> > GPA which is the case for simple bounce buffers after being processed with
> > SWIOTLB API. To enable this usage, it introduces an architecture specific
> > function which will just make virtio core front end select DMA operations
> > structure.
> >
> > Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
>
> + "Michael S. Tsirkin" <mst@redhat.com>
I'm confused by this.
static bool vring_use_dma_api(struct virtio_device *vdev)
{
if (!virtio_has_iommu_quirk(vdev))
return true;
Why doesn't setting VIRTIO_F_IOMMU_PLATFORM on the
hypervisor side sufficient?
^ permalink raw reply
* Re: [RFC] virtio: Use DMA MAP API for devices without an IOMMU
From: Anshuman Khandual @ 2018-04-05 14:39 UTC (permalink / raw)
To: virtualization, linux-kernel
Cc: robh, Michael S. Tsirkin, joe, linuxppc-dev, elfring, david
In-Reply-To: <20180405105631.9514-1-khandual@linux.vnet.ibm.com>
On 04/05/2018 04:26 PM, Anshuman Khandual wrote:
> There are certian platforms which would like to use SWIOTLB based DMA API
> for bouncing purpose without actually requiring an IOMMU back end. But the
> virtio core does not allow such mechanism. Right now DMA MAP API is only
> selected for devices which have an IOMMU and then the QEMU/host back end
> will process all incoming SG buffer addresses as IOVA instead of simple
> GPA which is the case for simple bounce buffers after being processed with
> SWIOTLB API. To enable this usage, it introduces an architecture specific
> function which will just make virtio core front end select DMA operations
> structure.
>
> Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
+ "Michael S. Tsirkin" <mst@redhat.com>
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox