* Re: [PATCH 1/4] iommu: Add virtio-iommu driver
From: Jean-Philippe Brucker @ 2018-04-11 18:33 UTC (permalink / raw)
To: Robin Murphy, iommu@lists.linux-foundation.org,
kvm@vger.kernel.org, virtualization@lists.linux-foundation.org,
virtio-dev@lists.oasis-open.org, kvmarm@lists.cs.columbia.edu
Cc: jayachandran.nair@cavium.com, Lorenzo Pieralisi, Tomasz.Nowicki,
tnowicki@caviumnetworks.com, mst@redhat.com, Marc Zyngier,
Will Deacon, jintack@cs.columbia.edu, eric.auger@redhat.com,
joro@8bytes.org, eric.auger.pro@gmail.com
In-Reply-To: <d0cfe602-f6e8-2d6e-0942-23012f3facef@arm.com>
On 23/03/18 14:48, Robin Murphy wrote:
[..]
>> + * Virtio driver for the paravirtualized IOMMU
>> + *
>> + * Copyright (C) 2018 ARM Limited
>> + * Author: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
>> + *
>> + * SPDX-License-Identifier: GPL-2.0
>
> This wants to be a // comment at the very top of the file (thankfully
> the policy is now properly documented in-tree since
> Documentation/process/license-rules.rst got merged)
Ok
[...]
>> +
>> +/*
>> + * viommu_del_mappings - remove mappings from the internal tree
>> + *
>> + * @vdomain: the domain
>> + * @iova: start of the range
>> + * @size: size of the range. A size of 0 corresponds to the entire address
>> + * space.
>> + * @out_mapping: if not NULL, the first removed mapping is returned in there.
>> + * This allows the caller to reuse the buffer for the unmap request. When
>> + * the returned size is greater than zero, if a mapping is returned, the
>> + * caller must free it.
>
> This "free multiple mappings except maybe hand one of them off to the
> caller" interface is really unintuitive. AFAICS it's only used by
> viommu_unmap() to grab mapping->req, but that doesn't seem to care about
> mapping itself, so I wonder whether it wouldn't make more sense to just
> have a global kmem_cache of struct virtio_iommu_req_unmap for that and
> avoid a lot of complexity...
Well it's a small complication for what I hoped would be a meanignful
performance difference, but more below.
>> +
>> +/* IOMMU API */
>> +
>> +static bool viommu_capable(enum iommu_cap cap)
>> +{
>> + return false;
>> +}
>
> The .capable callback is optional, so it's only worth implementing once
> you want it to do something beyond the default behaviour.
>
Ah, right
[...]
>> +static size_t viommu_unmap(struct iommu_domain *domain, unsigned long iova,
>> + size_t size)
>> +{
>> + int ret = 0;
>> + size_t unmapped;
>> + struct viommu_mapping *mapping = NULL;
>> + struct viommu_domain *vdomain = to_viommu_domain(domain);
>> +
>> + unmapped = viommu_del_mappings(vdomain, iova, size, &mapping);
>> + if (unmapped < size) {
>> + ret = -EINVAL;
>> + goto out_free;
>> + }
>> +
>> + /* Device already removed all mappings after detach. */
>> + if (!vdomain->endpoints)
>> + goto out_free;
>> +
>> + if (WARN_ON(!mapping))
>> + return 0;
>> +
>> + mapping->req.unmap = (struct virtio_iommu_req_unmap) {
>> + .head.type = VIRTIO_IOMMU_T_UNMAP,
>> + .domain = cpu_to_le32(vdomain->id),
>> + .virt_start = cpu_to_le64(iova),
>> + .virt_end = cpu_to_le64(iova + unmapped - 1),
>> + };
>
> ...In fact, the kmem_cache idea might be moot since it looks like with a
> bit of restructuring you could get away with just a single per-viommu
> virtio_iommu_req_unmap structure; this lot could be passed around on the
> stack until request_lock is taken, at which point it would be copied
> into the 'real' DMA-able structure. The equivalent might apply to
> viommu_map() too - now that I'm looking at it, it seems like it would
> take pretty minimal effort to encapsulate the whole business cleanly in
> viommu_send_req_sync(), which could do something like this instead of
> going through viommu_send_reqs_sync():
>
> ...
> spin_lock_irqsave(&viommu->request_lock, flags);
> viommu_copy_req(viommu->dma_req, req);
> ret = _viommu_send_reqs_sync(viommu, viommu->dma_req, 1, &sent);
> spin_unlock_irqrestore(&viommu->request_lock, flags);
> ...
I'll have to come back to that sorry, still conducting some experiments
with map/unmap.
I'd rather avoid introducing a single dma_req per viommu, because I'd
like to move to the iotlb_range_add/iotlb_sync interface as soon as
possible, and the request logic changes a lot when multiple threads are
susceptible to interleave map/unmap requests.
I ran some tests, and adding a kmem_cache (or simply using kmemdup, it
doesn't make a noticeable difference at our scale) reduces netperf
stream/maerts throughput by 1.1%/1.4% (+/- 0.5% over 30 tests). That's
for a virtio-net device (1 tx/rx vq), and with a vfio device the
difference isn't measurable. At this point I'm not fussy about such
small difference, so don't mind simplifying viommu_del_mapping.
[...]
>> + /*
>> + * Last step creates a default domain and attaches to it. Everything
>> + * must be ready.
>> + */
>> + group = iommu_group_get_for_dev(dev);
>> + if (!IS_ERR(group))
>> + iommu_group_put(group);
>
> Since you create the sysfs IOMMU device, maybe also create the links for
> the masters?
Ok
>> +
>> + return PTR_ERR_OR_ZERO(group);
>> +}
>> +
>> +static void viommu_remove_device(struct device *dev)
>> +{
>
> You need to remove dev from its group, too (basically, .remove_device
> should always undo everything .add_device did)
>
> It would also be good practice to verify that dev->iommu_fwspec exists
> and is one of yours before touching anything, although having checked
> the core code I see we do currently just about get away with it thanks
> to the horrible per-bus ops.
Ok
>
>> + kfree(dev->iommu_fwspec->iommu_priv);
>> +}
>> +
>> +static struct iommu_group *viommu_device_group(struct device *dev)
>> +{
>> + if (dev_is_pci(dev))
>> + return pci_device_group(dev);
>> + else
>> + return generic_device_group(dev);
>> +}
>> +
>> +static int viommu_of_xlate(struct device *dev, struct of_phandle_args *args)
>> +{
>> + return iommu_fwspec_add_ids(dev, args->args, 1);
>
> I'm sure a DT binding somewhere needs to document the appropriate value
> and meaning for #iommu-cells - I guess that probably falls under the
> virtio-mmio binding?
Yes I guess mmio.txt would be the best place for this.
[...]
>> +/*
>> + * Virtio-iommu definition v0.6
>> + *
>> + * Copyright (C) 2018 ARM Ltd.
>> + *
>> + * SPDX-License-Identifier: BSD-3-Clause
>
> Again, at the top, although in /* */ here since it's a header.
Right
Thanks for the review,
Jean
^ permalink raw reply
* Re: [RFC PATCH net-next v6 2/4] net: Introduce generic bypass module
From: Jiri Pirko @ 2018-04-11 15:51 UTC (permalink / raw)
To: Sridhar Samudrala
Cc: alexander.h.duyck, virtio-dev, mst, kubakici, netdev,
virtualization, loseweigh, davem
In-Reply-To: <1523386790-12396-3-git-send-email-sridhar.samudrala@intel.com>
Tue, Apr 10, 2018 at 08:59:48PM CEST, sridhar.samudrala@intel.com wrote:
>This provides a generic interface for paravirtual drivers to listen
>for netdev register/unregister/link change events from pci ethernet
>devices with the same MAC and takeover their datapath. The notifier and
>event handling code is based on the existing netvsc implementation.
>
>It exposes 2 sets of interfaces to the paravirtual drivers.
>1. existing netvsc driver that uses 2 netdev model. In this model, no
>master netdev is created. The paravirtual driver registers each bypass
>instance along with a set of ops to manage the slave events.
> bypass_master_register()
> bypass_master_unregister()
>2. new virtio_net based solution that uses 3 netdev model. In this model,
>the bypass module provides interfaces to create/destroy additional master
>netdev and all the slave events are managed internally.
> bypass_master_create()
> bypass_master_destroy()
>
>Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
>---
> include/linux/netdevice.h | 14 +
> include/net/bypass.h | 96 ++++++
> net/Kconfig | 18 +
> net/core/Makefile | 1 +
> net/core/bypass.c | 844 ++++++++++++++++++++++++++++++++++++++++++++++
> 5 files changed, 973 insertions(+)
> create mode 100644 include/net/bypass.h
> create mode 100644 net/core/bypass.c
>
>diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
>index cf44503ea81a..587293728f70 100644
>--- a/include/linux/netdevice.h
>+++ b/include/linux/netdevice.h
>@@ -1430,6 +1430,8 @@ enum netdev_priv_flags {
> IFF_PHONY_HEADROOM = 1<<24,
> IFF_MACSEC = 1<<25,
> IFF_NO_RX_HANDLER = 1<<26,
>+ IFF_BYPASS = 1 << 27,
>+ IFF_BYPASS_SLAVE = 1 << 28,
I wonder, why you don't follow the existing coding style... Also, please
add these to into the comment above.
> };
>
> #define IFF_802_1Q_VLAN IFF_802_1Q_VLAN
>@@ -1458,6 +1460,8 @@ enum netdev_priv_flags {
> #define IFF_RXFH_CONFIGURED IFF_RXFH_CONFIGURED
> #define IFF_MACSEC IFF_MACSEC
> #define IFF_NO_RX_HANDLER IFF_NO_RX_HANDLER
>+#define IFF_BYPASS IFF_BYPASS
>+#define IFF_BYPASS_SLAVE IFF_BYPASS_SLAVE
>
> /**
> * struct net_device - The DEVICE structure.
>@@ -4308,6 +4312,16 @@ static inline bool netif_is_rxfh_configured(const struct net_device *dev)
> return dev->priv_flags & IFF_RXFH_CONFIGURED;
> }
>
>+static inline bool netif_is_bypass_master(const struct net_device *dev)
>+{
>+ return dev->priv_flags & IFF_BYPASS;
>+}
>+
>+static inline bool netif_is_bypass_slave(const struct net_device *dev)
>+{
>+ return dev->priv_flags & IFF_BYPASS_SLAVE;
>+}
>+
> /* This device needs to keep skb dst for qdisc enqueue or ndo_start_xmit() */
> static inline void netif_keep_dst(struct net_device *dev)
> {
>diff --git a/include/net/bypass.h b/include/net/bypass.h
>new file mode 100644
>index 000000000000..86b02cb894cf
>--- /dev/null
>+++ b/include/net/bypass.h
>@@ -0,0 +1,96 @@
>+// SPDX-License-Identifier: GPL-2.0
>+/* Copyright (c) 2018, Intel Corporation. */
>+
>+#ifndef _NET_BYPASS_H
>+#define _NET_BYPASS_H
>+
>+#include <linux/netdevice.h>
>+
>+struct bypass_ops {
>+ int (*slave_pre_register)(struct net_device *slave_netdev,
>+ struct net_device *bypass_netdev);
>+ int (*slave_join)(struct net_device *slave_netdev,
>+ struct net_device *bypass_netdev);
>+ int (*slave_pre_unregister)(struct net_device *slave_netdev,
>+ struct net_device *bypass_netdev);
>+ int (*slave_release)(struct net_device *slave_netdev,
>+ struct net_device *bypass_netdev);
>+ int (*slave_link_change)(struct net_device *slave_netdev,
>+ struct net_device *bypass_netdev);
>+ rx_handler_result_t (*handle_frame)(struct sk_buff **pskb);
>+};
>+
>+struct bypass_master {
>+ struct list_head list;
>+ struct net_device __rcu *bypass_netdev;
>+ struct bypass_ops __rcu *ops;
>+};
>+
>+/* bypass state */
>+struct bypass_info {
>+ /* passthru netdev with same MAC */
>+ struct net_device __rcu *active_netdev;
You still use "active"/"backup" names which is highly misleading as
it has completely different meaning that in bond for example.
I noted that in my previous review already. Please change it.
>+
>+ /* virtio_net netdev */
>+ struct net_device __rcu *backup_netdev;
>+
>+ /* active netdev stats */
>+ struct rtnl_link_stats64 active_stats;
>+
>+ /* backup netdev stats */
>+ struct rtnl_link_stats64 backup_stats;
>+
>+ /* aggregated stats */
>+ struct rtnl_link_stats64 bypass_stats;
>+
>+ /* spinlock while updating stats */
>+ spinlock_t stats_lock;
>+};
>+
>+#if IS_ENABLED(CONFIG_NET_BYPASS)
>+
>+int bypass_master_create(struct net_device *backup_netdev,
>+ struct bypass_master **pbypass_master);
>+void bypass_master_destroy(struct bypass_master *bypass_master);
>+
>+int bypass_master_register(struct net_device *dev, struct bypass_ops *ops,
>+ struct bypass_master **pbypass_master);
>+void bypass_master_unregister(struct bypass_master *bypass_master);
>+
>+int bypass_slave_unregister(struct net_device *slave_netdev);
>+
>+#else
>+
>+static inline
>+int bypass_master_create(struct net_device *backup_netdev,
>+ struct bypass_master **pbypass_master);
>+{
>+ return 0;
>+}
>+
>+static inline
>+void bypass_master_destroy(struct bypass_master *bypass_master)
>+{
>+}
>+
>+static inline
>+int bypass_master_register(struct net_device *dev, struct bypass_ops *ops,
>+ struct pbypass_master **pbypass_master);
>+{
>+ return 0;
>+}
>+
>+static inline
>+void bypass_master_unregister(struct bypass_master *bypass_master)
>+{
>+}
>+
>+static inline
>+int bypass_slave_unregister(struct net_device *slave_netdev)
>+{
>+ return 0;
>+}
>+
>+#endif
>+
>+#endif /* _NET_BYPASS_H */
>diff --git a/net/Kconfig b/net/Kconfig
>index 0428f12c25c2..994445f4a96a 100644
>--- a/net/Kconfig
>+++ b/net/Kconfig
>@@ -423,6 +423,24 @@ config MAY_USE_DEVLINK
> on MAY_USE_DEVLINK to ensure they do not cause link errors when
> devlink is a loadable module and the driver using it is built-in.
>
>+config NET_BYPASS
>+ tristate "Bypass interface"
>+ ---help---
>+ This provides a generic interface for paravirtual drivers to listen
>+ for netdev register/unregister/link change events from pci ethernet
>+ devices with the same MAC and takeover their datapath. This also
>+ enables live migration of a VM with direct attached VF by failing
>+ over to the paravirtual datapath when the VF is unplugged.
>+
>+config MAY_USE_BYPASS
>+ tristate
>+ default m if NET_BYPASS=m
>+ default y if NET_BYPASS=y || NET_BYPASS=n
>+ help
>+ Drivers using the bypass infrastructure should have a dependency
>+ on MAY_USE_BYPASS to ensure they do not cause link errors when
>+ bypass is a loadable module and the driver using it is built-in.
>+
> endif # if NET
>
> # Used by archs to tell that they support BPF JIT compiler plus which flavour.
>diff --git a/net/core/Makefile b/net/core/Makefile
>index 6dbbba8c57ae..a9727ed1c8fc 100644
>--- a/net/core/Makefile
>+++ b/net/core/Makefile
>@@ -30,3 +30,4 @@ obj-$(CONFIG_DST_CACHE) += dst_cache.o
> obj-$(CONFIG_HWBM) += hwbm.o
> obj-$(CONFIG_NET_DEVLINK) += devlink.o
> obj-$(CONFIG_GRO_CELLS) += gro_cells.o
>+obj-$(CONFIG_NET_BYPASS) += bypass.o
>diff --git a/net/core/bypass.c b/net/core/bypass.c
>new file mode 100644
>index 000000000000..b5b9cb554c3f
>--- /dev/null
>+++ b/net/core/bypass.c
>@@ -0,0 +1,844 @@
>+// SPDX-License-Identifier: GPL-2.0
>+/* Copyright (c) 2018, Intel Corporation. */
>+
>+/* A common module to handle registrations and notifications for paravirtual
>+ * drivers to enable accelerated datapath and support VF live migration.
>+ *
>+ * The notifier and event handling code is based on netvsc driver.
>+ */
>+
>+#include <linux/netdevice.h>
>+#include <linux/etherdevice.h>
>+#include <linux/ethtool.h>
>+#include <linux/module.h>
>+#include <linux/slab.h>
>+#include <linux/netdevice.h>
>+#include <linux/netpoll.h>
>+#include <linux/rtnetlink.h>
>+#include <linux/if_vlan.h>
>+#include <linux/pci.h>
>+#include <net/sch_generic.h>
>+#include <uapi/linux/if_arp.h>
>+#include <net/bypass.h>
>+
>+static LIST_HEAD(bypass_master_list);
>+static DEFINE_SPINLOCK(bypass_lock);
>+
>+static int bypass_slave_pre_register(struct net_device *slave_netdev,
>+ struct net_device *bypass_netdev,
>+ struct bypass_ops *bypass_ops)
>+{
>+ struct bypass_info *bi;
>+ bool backup;
>+
>+ if (bypass_ops) {
>+ if (!bypass_ops->slave_pre_register)
>+ return -EINVAL;
>+
>+ return bypass_ops->slave_pre_register(slave_netdev,
>+ bypass_netdev);
>+ }
>+
>+ bi = netdev_priv(bypass_netdev);
>+ backup = (slave_netdev->dev.parent == bypass_netdev->dev.parent);
>+ if (backup ? rtnl_dereference(bi->backup_netdev) :
>+ rtnl_dereference(bi->active_netdev)) {
>+ netdev_err(bypass_netdev, "%s attempting to register as slave dev when %s already present\n",
>+ slave_netdev->name, backup ? "backup" : "active");
>+ return -EEXIST;
>+ }
>+
>+ /* Avoid non pci devices as active netdev */
>+ if (!backup && (!slave_netdev->dev.parent ||
>+ !dev_is_pci(slave_netdev->dev.parent)))
>+ return -EINVAL;
>+
>+ return 0;
>+}
>+
>+static int bypass_slave_join(struct net_device *slave_netdev,
>+ struct net_device *bypass_netdev,
>+ struct bypass_ops *bypass_ops)
>+{
>+ struct bypass_info *bi;
>+ bool backup;
>+
>+ if (bypass_ops) {
>+ if (!bypass_ops->slave_join)
>+ return -EINVAL;
>+
>+ return bypass_ops->slave_join(slave_netdev, bypass_netdev);
>+ }
>+
>+ bi = netdev_priv(bypass_netdev);
>+ backup = (slave_netdev->dev.parent == bypass_netdev->dev.parent);
>+
>+ dev_hold(slave_netdev);
>+
>+ if (backup) {
>+ rcu_assign_pointer(bi->backup_netdev, slave_netdev);
>+ dev_get_stats(bi->backup_netdev, &bi->backup_stats);
>+ } else {
>+ rcu_assign_pointer(bi->active_netdev, slave_netdev);
>+ dev_get_stats(bi->active_netdev, &bi->active_stats);
>+ bypass_netdev->min_mtu = slave_netdev->min_mtu;
>+ bypass_netdev->max_mtu = slave_netdev->max_mtu;
>+ }
>+
>+ netdev_info(bypass_netdev, "bypass slave:%s joined\n",
>+ slave_netdev->name);
>+
>+ return 0;
>+}
>+
>+/* Called when slave dev is injecting data into network stack.
>+ * Change the associated network device from lower dev to virtio.
>+ * note: already called with rcu_read_lock
>+ */
>+static rx_handler_result_t bypass_handle_frame(struct sk_buff **pskb)
>+{
>+ struct sk_buff *skb = *pskb;
>+ struct net_device *ndev = rcu_dereference(skb->dev->rx_handler_data);
>+
>+ skb->dev = ndev;
>+
>+ return RX_HANDLER_ANOTHER;
>+}
>+
>+static struct net_device *bypass_master_get_bymac(u8 *mac,
>+ struct bypass_ops **ops)
>+{
>+ struct bypass_master *bypass_master;
>+ struct net_device *bypass_netdev;
>+
>+ spin_lock(&bypass_lock);
>+ list_for_each_entry(bypass_master, &bypass_master_list, list) {
As I wrote the last time, you don't need this list, spinlock.
You can do just something like:
for_each_net(net) {
for_each_netdev(net, dev) {
if (netif_is_bypass_master(dev)) {
>+ bypass_netdev = rcu_dereference(bypass_master->bypass_netdev);
>+ if (ether_addr_equal(bypass_netdev->perm_addr, mac)) {
>+ *ops = rcu_dereference(bypass_master->ops);
I don't see how rcu_dereference is ok here.
1) I don't see rcu_read_lock taken
2) Looks like bypass_master->ops has the same value across the whole
existence.
>+ spin_unlock(&bypass_lock);
>+ return bypass_netdev;
>+ }
>+ }
>+ spin_unlock(&bypass_lock);
>+ return NULL;
>+}
>+
>+static int bypass_slave_register(struct net_device *slave_netdev)
>+{
>+ struct net_device *bypass_netdev;
>+ struct bypass_ops *bypass_ops;
>+ int ret, orig_mtu;
>+
>+ ASSERT_RTNL();
>+
>+ bypass_netdev = bypass_master_get_bymac(slave_netdev->perm_addr,
>+ &bypass_ops);
For master, could you use word "master" in the variables so it is clear?
Also, "dev" is fine instead of "netdev".
Something like "bpmaster_dev"
>+ if (!bypass_netdev)
>+ goto done;
>+
>+ ret = bypass_slave_pre_register(slave_netdev, bypass_netdev,
>+ bypass_ops);
>+ if (ret != 0)
Just "if (ret)" will do. You have this on more places.
>+ goto done;
>+
>+ ret = netdev_rx_handler_register(slave_netdev,
>+ bypass_ops ? bypass_ops->handle_frame :
>+ bypass_handle_frame, bypass_netdev);
>+ if (ret != 0) {
>+ netdev_err(slave_netdev, "can not register bypass rx handler (err = %d)\n",
>+ ret);
>+ goto done;
>+ }
>+
>+ ret = netdev_upper_dev_link(slave_netdev, bypass_netdev, NULL);
>+ if (ret != 0) {
>+ netdev_err(slave_netdev, "can not set master device %s (err = %d)\n",
>+ bypass_netdev->name, ret);
>+ goto upper_link_failed;
>+ }
>+
>+ slave_netdev->priv_flags |= IFF_BYPASS_SLAVE;
>+
>+ if (netif_running(bypass_netdev)) {
>+ ret = dev_open(slave_netdev);
>+ if (ret && (ret != -EBUSY)) {
>+ netdev_err(bypass_netdev, "Opening slave %s failed ret:%d\n",
>+ slave_netdev->name, ret);
>+ goto err_interface_up;
>+ }
>+ }
>+
>+ /* Align MTU of slave with master */
>+ orig_mtu = slave_netdev->mtu;
>+ ret = dev_set_mtu(slave_netdev, bypass_netdev->mtu);
>+ if (ret != 0) {
>+ netdev_err(bypass_netdev, "unable to change mtu of %s to %u register failed\n",
>+ slave_netdev->name, bypass_netdev->mtu);
>+ goto err_set_mtu;
>+ }
>+
>+ ret = bypass_slave_join(slave_netdev, bypass_netdev, bypass_ops);
>+ if (ret != 0)
>+ goto err_join;
>+
>+ call_netdevice_notifiers(NETDEV_JOIN, slave_netdev);
>+
>+ netdev_info(bypass_netdev, "bypass slave:%s registered\n",
>+ slave_netdev->name);
>+
>+ goto done;
>+
>+err_join:
>+ dev_set_mtu(slave_netdev, orig_mtu);
>+err_set_mtu:
>+ dev_close(slave_netdev);
>+err_interface_up:
>+ netdev_upper_dev_unlink(slave_netdev, bypass_netdev);
>+ slave_netdev->priv_flags &= ~IFF_BYPASS_SLAVE;
>+upper_link_failed:
>+ netdev_rx_handler_unregister(slave_netdev);
>+done:
>+ return NOTIFY_DONE;
>+}
>+
>+static int bypass_slave_pre_unregister(struct net_device *slave_netdev,
>+ struct net_device *bypass_netdev,
>+ struct bypass_ops *bypass_ops)
>+{
>+ struct net_device *backup_netdev, *active_netdev;
>+ struct bypass_info *bi;
>+
>+ if (bypass_ops) {
>+ if (!bypass_ops->slave_pre_unregister)
>+ return -EINVAL;
>+
>+ return bypass_ops->slave_pre_unregister(slave_netdev,
>+ bypass_netdev);
>+ }
>+
>+ bi = netdev_priv(bypass_netdev);
>+ active_netdev = rtnl_dereference(bi->active_netdev);
>+ backup_netdev = rtnl_dereference(bi->backup_netdev);
>+
>+ if (slave_netdev != active_netdev && slave_netdev != backup_netdev)
>+ return -EINVAL;
>+
>+ return 0;
>+}
>+
>+static int bypass_slave_release(struct net_device *slave_netdev,
>+ struct net_device *bypass_netdev,
>+ struct bypass_ops *bypass_ops)
>+{
>+ struct net_device *backup_netdev, *active_netdev;
>+ struct bypass_info *bi;
>+
>+ if (bypass_ops) {
>+ if (!bypass_ops->slave_release)
>+ return -EINVAL;
I think it would be good to make the API to the driver more strict and
have a separate set of ops for "active" and "backup" netdevices.
That should stop people thinking about extending this to more slaves in
the future.
>+
>+ return bypass_ops->slave_release(slave_netdev, bypass_netdev);
>+ }
>+
>+ bi = netdev_priv(bypass_netdev);
>+ active_netdev = rtnl_dereference(bi->active_netdev);
>+ backup_netdev = rtnl_dereference(bi->backup_netdev);
>+
>+ if (slave_netdev == backup_netdev) {
>+ RCU_INIT_POINTER(bi->backup_netdev, NULL);
>+ } else {
>+ RCU_INIT_POINTER(bi->active_netdev, NULL);
>+ if (backup_netdev) {
>+ bypass_netdev->min_mtu = backup_netdev->min_mtu;
>+ bypass_netdev->max_mtu = backup_netdev->max_mtu;
>+ }
>+ }
>+
>+ dev_put(slave_netdev);
>+
>+ netdev_info(bypass_netdev, "bypass slave:%s released\n",
>+ slave_netdev->name);
>+
>+ return 0;
>+}
>+
>+int bypass_slave_unregister(struct net_device *slave_netdev)
>+{
>+ struct net_device *bypass_netdev;
>+ struct bypass_ops *bypass_ops;
>+ int ret;
>+
>+ if (!netif_is_bypass_slave(slave_netdev))
>+ goto done;
>+
>+ ASSERT_RTNL();
>+
>+ bypass_netdev = bypass_master_get_bymac(slave_netdev->perm_addr,
>+ &bypass_ops);
>+ if (!bypass_netdev)
>+ goto done;
>+
>+ ret = bypass_slave_pre_unregister(slave_netdev, bypass_netdev,
>+ bypass_ops);
>+ if (ret != 0)
>+ goto done;
>+
>+ netdev_rx_handler_unregister(slave_netdev);
>+ netdev_upper_dev_unlink(slave_netdev, bypass_netdev);
>+ slave_netdev->priv_flags &= ~IFF_BYPASS_SLAVE;
>+
>+ bypass_slave_release(slave_netdev, bypass_netdev, bypass_ops);
>+
>+ netdev_info(bypass_netdev, "bypass slave:%s unregistered\n",
>+ slave_netdev->name);
>+
>+done:
>+ return NOTIFY_DONE;
>+}
>+EXPORT_SYMBOL_GPL(bypass_slave_unregister);
>+
>+static bool bypass_xmit_ready(struct net_device *dev)
>+{
>+ return netif_running(dev) && netif_carrier_ok(dev);
>+}
>+
>+static int bypass_slave_link_change(struct net_device *slave_netdev)
>+{
>+ struct net_device *bypass_netdev, *active_netdev, *backup_netdev;
>+ struct bypass_ops *bypass_ops;
>+ struct bypass_info *bi;
>+
>+ if (!netif_is_bypass_slave(slave_netdev))
>+ goto done;
>+
>+ ASSERT_RTNL();
>+
>+ bypass_netdev = bypass_master_get_bymac(slave_netdev->perm_addr,
>+ &bypass_ops);
>+ if (!bypass_netdev)
>+ goto done;
>+
>+ if (bypass_ops) {
>+ if (!bypass_ops->slave_link_change)
>+ goto done;
>+
>+ return bypass_ops->slave_link_change(slave_netdev,
>+ bypass_netdev);
>+ }
>+
>+ if (!netif_running(bypass_netdev))
>+ return 0;
>+
>+ bi = netdev_priv(bypass_netdev);
>+
>+ active_netdev = rtnl_dereference(bi->active_netdev);
>+ backup_netdev = rtnl_dereference(bi->backup_netdev);
>+
>+ if (slave_netdev != active_netdev && slave_netdev != backup_netdev)
>+ goto done;
You don't need this check. "if (!netif_is_bypass_slave(slave_netdev))"
above is enough.
>+
>+ if ((active_netdev && bypass_xmit_ready(active_netdev)) ||
>+ (backup_netdev && bypass_xmit_ready(backup_netdev))) {
>+ netif_carrier_on(bypass_netdev);
>+ netif_tx_wake_all_queues(bypass_netdev);
>+ } else {
>+ netif_carrier_off(bypass_netdev);
>+ netif_tx_stop_all_queues(bypass_netdev);
>+ }
>+
>+done:
>+ return NOTIFY_DONE;
>+}
>+
>+static bool bypass_validate_event_dev(struct net_device *dev)
>+{
>+ /* Skip parent events */
>+ if (netif_is_bypass_master(dev))
>+ return false;
>+
>+ /* Avoid non-Ethernet type devices */
>+ if (dev->type != ARPHRD_ETHER)
>+ return false;
>+
>+ /* Avoid Vlan dev with same MAC registering as VF */
>+ if (is_vlan_dev(dev))
>+ return false;
>+
>+ /* Avoid Bonding master dev with same MAC registering as slave dev */
>+ if ((dev->priv_flags & IFF_BONDING) && (dev->flags & IFF_MASTER))
Yeah, this is certainly incorrect. One thing is, you should be using the
helpers netif_is_bond_master().
But what about the rest? macsec, macvlan, team, bridge, ovs and others?
You need to do it not by blacklisting, but with whitelisting. You need
to whitelist VF devices. My port flavours patchset might help with this.
>+ return false;
>+
>+ return true;
>+}
>+
>+static int
>+bypass_event(struct notifier_block *this, unsigned long event, void *ptr)
>+{
>+ struct net_device *event_dev = netdev_notifier_info_to_dev(ptr);
>+
>+ if (!bypass_validate_event_dev(event_dev))
>+ return NOTIFY_DONE;
>+
>+ switch (event) {
>+ case NETDEV_REGISTER:
>+ return bypass_slave_register(event_dev);
>+ case NETDEV_UNREGISTER:
>+ return bypass_slave_unregister(event_dev);
>+ case NETDEV_UP:
>+ case NETDEV_DOWN:
>+ case NETDEV_CHANGE:
>+ return bypass_slave_link_change(event_dev);
>+ default:
>+ return NOTIFY_DONE;
>+ }
>+}
>+
>+static struct notifier_block bypass_notifier = {
>+ .notifier_call = bypass_event,
>+};
>+
>+int bypass_open(struct net_device *dev)
>+{
>+ struct bypass_info *bi = netdev_priv(dev);
>+ struct net_device *active_netdev, *backup_netdev;
>+ int err;
>+
>+ netif_carrier_off(dev);
>+ netif_tx_wake_all_queues(dev);
>+
>+ active_netdev = rtnl_dereference(bi->active_netdev);
>+ if (active_netdev) {
>+ err = dev_open(active_netdev);
>+ if (err)
>+ goto err_active_open;
>+ }
>+
>+ backup_netdev = rtnl_dereference(bi->backup_netdev);
>+ if (backup_netdev) {
>+ err = dev_open(backup_netdev);
>+ if (err)
>+ goto err_backup_open;
>+ }
>+
>+ return 0;
>+
>+err_backup_open:
>+ dev_close(active_netdev);
>+err_active_open:
>+ netif_tx_disable(dev);
>+ return err;
>+}
>+EXPORT_SYMBOL_GPL(bypass_open);
>+
>+int bypass_close(struct net_device *dev)
>+{
>+ struct bypass_info *vi = netdev_priv(dev);
This should be probably "bi"
>+ struct net_device *slave_netdev;
>+
>+ netif_tx_disable(dev);
>+
>+ slave_netdev = rtnl_dereference(vi->active_netdev);
>+ if (slave_netdev)
>+ dev_close(slave_netdev);
>+
>+ slave_netdev = rtnl_dereference(vi->backup_netdev);
>+ if (slave_netdev)
>+ dev_close(slave_netdev);
>+
>+ return 0;
>+}
>+EXPORT_SYMBOL_GPL(bypass_close);
>+
>+static netdev_tx_t bypass_drop_xmit(struct sk_buff *skb, struct net_device *dev)
>+{
>+ atomic_long_inc(&dev->tx_dropped);
>+ dev_kfree_skb_any(skb);
>+ return NETDEV_TX_OK;
>+}
>+
>+netdev_tx_t bypass_start_xmit(struct sk_buff *skb, struct net_device *dev)
>+{
>+ struct bypass_info *bi = netdev_priv(dev);
If you rename the other variable to "bpmaster_dev", it would be nice to
rename this to bpinfo or something more descriptive. "bi" is too short
to know what that is right away.
>+ struct net_device *xmit_dev;
Don't mix "dev" and "netdev" in one .c file. Just use "dev" for all.
>+
>+ /* Try xmit via active netdev followed by backup netdev */
>+ xmit_dev = rcu_dereference_bh(bi->active_netdev);
>+ if (!xmit_dev || !bypass_xmit_ready(xmit_dev)) {
>+ xmit_dev = rcu_dereference_bh(bi->backup_netdev);
>+ if (!xmit_dev || !bypass_xmit_ready(xmit_dev))
>+ return bypass_drop_xmit(skb, dev);
>+ }
>+
>+ skb->dev = xmit_dev;
>+ skb->queue_mapping = qdisc_skb_cb(skb)->slave_dev_queue_mapping;
>+
>+ return dev_queue_xmit(skb);
>+}
>+EXPORT_SYMBOL_GPL(bypass_start_xmit);
>+
>+u16 bypass_select_queue(struct net_device *dev, struct sk_buff *skb,
>+ void *accel_priv, select_queue_fallback_t fallback)
>+{
>+ /* This helper function exists to help dev_pick_tx get the correct
>+ * destination queue. Using a helper function skips a call to
>+ * skb_tx_hash and will put the skbs in the queue we expect on their
>+ * way down to the bonding driver.
>+ */
>+ u16 txq = skb_rx_queue_recorded(skb) ? skb_get_rx_queue(skb) : 0;
>+
>+ /* Save the original txq to restore before passing to the driver */
>+ qdisc_skb_cb(skb)->slave_dev_queue_mapping = skb->queue_mapping;
>+
>+ if (unlikely(txq >= dev->real_num_tx_queues)) {
>+ do {
>+ txq -= dev->real_num_tx_queues;
>+ } while (txq >= dev->real_num_tx_queues);
>+ }
>+
>+ return txq;
>+}
>+EXPORT_SYMBOL_GPL(bypass_select_queue);
>+
>+/* fold stats, assuming all rtnl_link_stats64 fields are u64, but
>+ * that some drivers can provide 32bit values only.
>+ */
>+static void bypass_fold_stats(struct rtnl_link_stats64 *_res,
>+ const struct rtnl_link_stats64 *_new,
>+ const struct rtnl_link_stats64 *_old)
>+{
>+ const u64 *new = (const u64 *)_new;
>+ const u64 *old = (const u64 *)_old;
>+ u64 *res = (u64 *)_res;
>+ int i;
>+
>+ for (i = 0; i < sizeof(*_res) / sizeof(u64); i++) {
>+ u64 nv = new[i];
>+ u64 ov = old[i];
>+ s64 delta = nv - ov;
>+
>+ /* detects if this particular field is 32bit only */
>+ if (((nv | ov) >> 32) == 0)
>+ delta = (s64)(s32)((u32)nv - (u32)ov);
>+
>+ /* filter anomalies, some drivers reset their stats
>+ * at down/up events.
>+ */
>+ if (delta > 0)
>+ res[i] += delta;
>+ }
>+}
>+
>+void bypass_get_stats(struct net_device *dev, struct rtnl_link_stats64 *stats)
>+{
>+ struct bypass_info *bi = netdev_priv(dev);
You can WARN_ON and return in case the dev is not bypass master, just
to catch buggy drivers. Same with other helpers.
>+ const struct rtnl_link_stats64 *new;
>+ struct rtnl_link_stats64 temp;
>+ struct net_device *slave_netdev;
>+
>+ spin_lock(&bi->stats_lock);
>+ memcpy(stats, &bi->bypass_stats, sizeof(*stats));
>+
>+ rcu_read_lock();
>+
>+ slave_netdev = rcu_dereference(bi->active_netdev);
>+ if (slave_netdev) {
>+ new = dev_get_stats(slave_netdev, &temp);
>+ bypass_fold_stats(stats, new, &bi->active_stats);
>+ memcpy(&bi->active_stats, new, sizeof(*new));
>+ }
>+
>+ slave_netdev = rcu_dereference(bi->backup_netdev);
>+ if (slave_netdev) {
>+ new = dev_get_stats(slave_netdev, &temp);
>+ bypass_fold_stats(stats, new, &bi->backup_stats);
>+ memcpy(&bi->backup_stats, new, sizeof(*new));
>+ }
>+
>+ rcu_read_unlock();
>+
>+ memcpy(&bi->bypass_stats, stats, sizeof(*stats));
>+ spin_unlock(&bi->stats_lock);
>+}
>+EXPORT_SYMBOL_GPL(bypass_get_stats);
>+
>+int bypass_change_mtu(struct net_device *dev, int new_mtu)
>+{
>+ struct bypass_info *bi = netdev_priv(dev);
>+ struct net_device *active_netdev, *backup_netdev;
>+ int ret = 0;
Pointless initialization.
>+
>+ active_netdev = rcu_dereference(bi->active_netdev);
>+ if (active_netdev) {
>+ ret = dev_set_mtu(active_netdev, new_mtu);
>+ if (ret)
>+ return ret;
>+ }
>+
>+ backup_netdev = rcu_dereference(bi->backup_netdev);
>+ if (backup_netdev) {
>+ ret = dev_set_mtu(backup_netdev, new_mtu);
>+ if (ret) {
>+ dev_set_mtu(active_netdev, dev->mtu);
>+ return ret;
>+ }
>+ }
>+
>+ dev->mtu = new_mtu;
>+ return 0;
>+}
>+EXPORT_SYMBOL_GPL(bypass_change_mtu);
>+
>+void bypass_set_rx_mode(struct net_device *dev)
>+{
>+ struct bypass_info *bi = netdev_priv(dev);
>+ struct net_device *slave_netdev;
>+
>+ rcu_read_lock();
>+
>+ slave_netdev = rcu_dereference(bi->active_netdev);
>+ if (slave_netdev) {
>+ dev_uc_sync_multiple(slave_netdev, dev);
>+ dev_mc_sync_multiple(slave_netdev, dev);
>+ }
>+
>+ slave_netdev = rcu_dereference(bi->backup_netdev);
>+ if (slave_netdev) {
>+ dev_uc_sync_multiple(slave_netdev, dev);
>+ dev_mc_sync_multiple(slave_netdev, dev);
>+ }
>+
>+ rcu_read_unlock();
>+}
>+EXPORT_SYMBOL_GPL(bypass_set_rx_mode);
>+
>+static const struct net_device_ops bypass_netdev_ops = {
>+ .ndo_open = bypass_open,
>+ .ndo_stop = bypass_close,
>+ .ndo_start_xmit = bypass_start_xmit,
>+ .ndo_select_queue = bypass_select_queue,
>+ .ndo_get_stats64 = bypass_get_stats,
>+ .ndo_change_mtu = bypass_change_mtu,
>+ .ndo_set_rx_mode = bypass_set_rx_mode,
>+ .ndo_validate_addr = eth_validate_addr,
>+ .ndo_features_check = passthru_features_check,
>+};
>+
>+#define BYPASS_DRV_NAME "bypass"
>+#define BYPASS_DRV_VERSION "0.1"
>+
>+static void bypass_ethtool_get_drvinfo(struct net_device *dev,
>+ struct ethtool_drvinfo *drvinfo)
>+{
>+ strlcpy(drvinfo->driver, BYPASS_DRV_NAME, sizeof(drvinfo->driver));
>+ strlcpy(drvinfo->version, BYPASS_DRV_VERSION, sizeof(drvinfo->version));
>+}
>+
>+int bypass_ethtool_get_link_ksettings(struct net_device *dev,
>+ struct ethtool_link_ksettings *cmd)
>+{
>+ struct bypass_info *bi = netdev_priv(dev);
>+ struct net_device *slave_netdev;
>+
>+ slave_netdev = rtnl_dereference(bi->active_netdev);
>+ if (!slave_netdev || !bypass_xmit_ready(slave_netdev)) {
>+ slave_netdev = rtnl_dereference(bi->backup_netdev);
>+ if (!slave_netdev || !bypass_xmit_ready(slave_netdev)) {
>+ cmd->base.duplex = DUPLEX_UNKNOWN;
>+ cmd->base.port = PORT_OTHER;
>+ cmd->base.speed = SPEED_UNKNOWN;
>+
>+ return 0;
>+ }
>+ }
>+
>+ return __ethtool_get_link_ksettings(slave_netdev, cmd);
>+}
>+EXPORT_SYMBOL_GPL(bypass_ethtool_get_link_ksettings);
>+
>+static const struct ethtool_ops bypass_ethtool_ops = {
>+ .get_drvinfo = bypass_ethtool_get_drvinfo,
>+ .get_link = ethtool_op_get_link,
>+ .get_link_ksettings = bypass_ethtool_get_link_ksettings,
>+};
>+
>+static void bypass_register_existing_slave(struct net_device *bypass_netdev)
>+{
>+ struct net *net = dev_net(bypass_netdev);
>+ struct net_device *dev;
>+
>+ rtnl_lock();
>+ for_each_netdev(net, dev) {
>+ if (dev == bypass_netdev)
>+ continue;
>+ if (!bypass_validate_event_dev(dev))
>+ continue;
>+ if (ether_addr_equal(bypass_netdev->perm_addr, dev->perm_addr))
>+ bypass_slave_register(dev);
>+ }
>+ rtnl_unlock();
>+}
>+
>+int bypass_master_register(struct net_device *dev, struct bypass_ops *ops,
>+ struct bypass_master **pbypass_master)
>+{
>+ struct bypass_master *bypass_master;
>+
>+ bypass_master = kzalloc(sizeof(*bypass_master), GFP_KERNEL);
>+ if (!bypass_master)
>+ return -ENOMEM;
>+
>+ rcu_assign_pointer(bypass_master->ops, ops);
>+ dev_hold(dev);
>+ dev->priv_flags |= IFF_BYPASS;
>+ rcu_assign_pointer(bypass_master->bypass_netdev, dev);
>+
>+ spin_lock(&bypass_lock);
>+ list_add_tail(&bypass_master->list, &bypass_master_list);
>+ spin_unlock(&bypass_lock);
>+
>+ bypass_register_existing_slave(dev);
>+
>+ *pbypass_master = bypass_master;
>+ return 0;
>+}
>+EXPORT_SYMBOL_GPL(bypass_master_register);
>+
>+void bypass_master_unregister(struct bypass_master *bypass_master)
>+{
>+ struct net_device *bypass_netdev;
>+
>+ bypass_netdev = rcu_dereference(bypass_master->bypass_netdev);
>+
>+ bypass_netdev->priv_flags &= ~IFF_BYPASS;
>+ dev_put(bypass_netdev);
>+
>+ spin_lock(&bypass_lock);
>+ list_del(&bypass_master->list);
>+ spin_unlock(&bypass_lock);
>+
>+ kfree(bypass_master);
>+}
>+EXPORT_SYMBOL_GPL(bypass_master_unregister);
>+
>+int bypass_master_create(struct net_device *backup_netdev,
>+ struct bypass_master **pbypass_master)
>+{
>+ struct device *dev = backup_netdev->dev.parent;
>+ struct net_device *bypass_netdev;
>+ int err;
>+
>+ /* Alloc at least 2 queues, for now we are going with 16 assuming
>+ * that most devices being bonded won't have too many queues.
>+ */
>+ bypass_netdev = alloc_etherdev_mq(sizeof(struct bypass_info), 16);
>+ if (!bypass_netdev) {
>+ dev_err(dev, "Unable to allocate bypass_netdev!\n");
>+ return -ENOMEM;
>+ }
>+
>+ dev_net_set(bypass_netdev, dev_net(backup_netdev));
>+ SET_NETDEV_DEV(bypass_netdev, dev);
>+
>+ bypass_netdev->netdev_ops = &bypass_netdev_ops;
>+ bypass_netdev->ethtool_ops = &bypass_ethtool_ops;
>+
>+ /* Initialize the device options */
>+ bypass_netdev->priv_flags |= IFF_UNICAST_FLT | IFF_NO_QUEUE;
>+ bypass_netdev->priv_flags &= ~(IFF_XMIT_DST_RELEASE |
>+ IFF_TX_SKB_SHARING);
>+
>+ /* don't acquire bypass netdev's netif_tx_lock when transmitting */
>+ bypass_netdev->features |= NETIF_F_LLTX;
>+
>+ /* Don't allow bypass devices to change network namespaces. */
>+ bypass_netdev->features |= NETIF_F_NETNS_LOCAL;
>+
>+ bypass_netdev->hw_features = NETIF_F_HW_CSUM | NETIF_F_SG |
>+ NETIF_F_FRAGLIST | NETIF_F_ALL_TSO |
>+ NETIF_F_HIGHDMA | NETIF_F_LRO;
>+
>+ bypass_netdev->hw_features |= NETIF_F_GSO_ENCAP_ALL;
>+ bypass_netdev->features |= bypass_netdev->hw_features;
>+
>+ memcpy(bypass_netdev->dev_addr, backup_netdev->dev_addr,
>+ bypass_netdev->addr_len);
>+
>+ bypass_netdev->min_mtu = backup_netdev->min_mtu;
>+ bypass_netdev->max_mtu = backup_netdev->max_mtu;
>+
>+ err = register_netdev(bypass_netdev);
>+ if (err < 0) {
>+ dev_err(dev, "Unable to register bypass_netdev!\n");
>+ goto err_register_netdev;
>+ }
>+
>+ netif_carrier_off(bypass_netdev);
>+
>+ err = bypass_master_register(bypass_netdev, NULL, pbypass_master);
>+ if (err < 0)
just "if (err)" would do.
>+ goto err_bypass;
>+
>+ return 0;
>+
>+err_bypass:
>+ unregister_netdev(bypass_netdev);
>+err_register_netdev:
>+ free_netdev(bypass_netdev);
>+
>+ return err;
>+}
>+EXPORT_SYMBOL_GPL(bypass_master_create);
>+
>+void bypass_master_destroy(struct bypass_master *bypass_master)
>+{
>+ struct net_device *bypass_netdev;
>+ struct net_device *slave_netdev;
>+ struct bypass_info *bi;
>+
>+ if (!bypass_master)
>+ return;
>+
>+ bypass_netdev = rcu_dereference(bypass_master->bypass_netdev);
>+ bi = netdev_priv(bypass_netdev);
>+
>+ netif_device_detach(bypass_netdev);
>+
>+ rtnl_lock();
>+
>+ slave_netdev = rtnl_dereference(bi->active_netdev);
>+ if (slave_netdev)
>+ bypass_slave_unregister(slave_netdev);
>+
>+ slave_netdev = rtnl_dereference(bi->backup_netdev);
>+ if (slave_netdev)
>+ bypass_slave_unregister(slave_netdev);
>+
>+ bypass_master_unregister(bypass_master);
>+
>+ unregister_netdevice(bypass_netdev);
>+
>+ rtnl_unlock();
>+
>+ free_netdev(bypass_netdev);
>+}
>+EXPORT_SYMBOL_GPL(bypass_master_destroy);
>+
>+static __init int
>+bypass_init(void)
>+{
>+ register_netdevice_notifier(&bypass_notifier);
>+
>+ return 0;
>+}
>+module_init(bypass_init);
>+
>+static __exit
>+void bypass_exit(void)
>+{
>+ unregister_netdevice_notifier(&bypass_notifier);
>+}
>+module_exit(bypass_exit);
>+
>+MODULE_DESCRIPTION("Bypass infrastructure/interface for Paravirtual drivers");
>+MODULE_LICENSE("GPL v2");
>--
>2.14.3
>
^ permalink raw reply
* Re: [PATCH v3 0/2] vhost: fix vhost_vq_access_ok() log check
From: David Miller @ 2018-04-11 14:55 UTC (permalink / raw)
To: mst
Cc: kvm, netdev, syzkaller-bugs, linux-kernel, virtualization,
stefanha, torvalds
In-Reply-To: <20180411162345-mutt-send-email-mst@kernel.org>
From: "Michael S. Tsirkin" <mst@redhat.com>
Date: Wed, 11 Apr 2018 16:24:02 +0300
> On Wed, Apr 11, 2018 at 10:35:39AM +0800, Stefan Hajnoczi wrote:
>> v3:
>> * Rebased onto net/master and resolved conflict [DaveM]
>>
>> v2:
>> * Rewrote the conditional to make the vq access check clearer [Linus]
>> * Added Patch 2 to make the return type consistent and harder to misuse [Linus]
>>
>> The first patch fixes the vhost virtqueue access check which was recently
>> broken. The second patch replaces the int return type with bool to prevent
>> future bugs.
>
> Acked-by: Michael S. Tsirkin <mst@redhat.com>
>
> We need the 1st one on stable I think.
Series applied and patch #1 queued up for -stable.
^ permalink raw reply
* Re: [RFC PATCH net-next v5 3/4] virtio_net: Extend virtio to use VF datapath when available
From: Jiri Pirko @ 2018-04-11 14:51 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: alexander.h.duyck, virtio-dev, kubakici, Samudrala, Sridhar,
virtualization, loseweigh, netdev, davem
In-Reply-To: <20180411174157-mutt-send-email-mst@kernel.org>
Wed, Apr 11, 2018 at 04:45:07PM CEST, mst@redhat.com wrote:
>On Wed, Apr 11, 2018 at 10:03:32AM +0200, Jiri Pirko wrote:
>> Wed, Apr 11, 2018 at 08:24:43AM CEST, sridhar.samudrala@intel.com wrote:
>> >On 4/10/2018 11:03 PM, Jiri Pirko wrote:
>> >> Tue, Apr 10, 2018 at 05:59:02PM CEST, sridhar.samudrala@intel.com wrote:
>> >> > On 4/10/2018 8:43 AM, Jiri Pirko wrote:
>> >> > > Tue, Apr 10, 2018 at 05:27:48PM CEST, sridhar.samudrala@intel.com wrote:
>> >> > > > On 4/10/2018 8:22 AM, Jiri Pirko wrote:
>> >> > > > > Tue, Apr 10, 2018 at 05:13:40PM CEST, sridhar.samudrala@intel.com wrote:
>> >> > > > > > On 4/10/2018 3:55 AM, Jiri Pirko wrote:
>> >> > > > > > > Mon, Apr 09, 2018 at 08:47:06PM CEST, sridhar.samudrala@intel.com wrote:
>> >> > > > > > > > On 4/9/2018 1:07 AM, Jiri Pirko wrote:
>> >> > > > > > > > > Sat, Apr 07, 2018 at 12:59:14AM CEST, sridhar.samudrala@intel.com wrote:
>> >> > > > > > > > > > On 4/6/2018 5:48 AM, Jiri Pirko wrote:
>> >> > > > > > > > > > > Thu, Apr 05, 2018 at 11:08:22PM CEST, sridhar.samudrala@intel.com wrote:
>> >> > > > > > > > > [...]
>> >> > > > > > > > >
>> >> > > > > > > > > > > > +static int virtnet_bypass_join_child(struct net_device *bypass_netdev,
>> >> > > > > > > > > > > > + struct net_device *child_netdev)
>> >> > > > > > > > > > > > +{
>> >> > > > > > > > > > > > + struct virtnet_bypass_info *vbi;
>> >> > > > > > > > > > > > + bool backup;
>> >> > > > > > > > > > > > +
>> >> > > > > > > > > > > > + vbi = netdev_priv(bypass_netdev);
>> >> > > > > > > > > > > > + backup = (child_netdev->dev.parent == bypass_netdev->dev.parent);
>> >> > > > > > > > > > > > + if (backup ? rtnl_dereference(vbi->backup_netdev) :
>> >> > > > > > > > > > > > + rtnl_dereference(vbi->active_netdev)) {
>> >> > > > > > > > > > > > + netdev_info(bypass_netdev,
>> >> > > > > > > > > > > > + "%s attempting to join bypass dev when %s already present\n",
>> >> > > > > > > > > > > > + child_netdev->name, backup ? "backup" : "active");
>> >> > > > > > > > > > > Bypass module should check if there is already some other netdev
>> >> > > > > > > > > > > enslaved and refuse right there.
>> >> > > > > > > > > > This will work for virtio-net with 3 netdev model, but this check has to be done by netvsc
>> >> > > > > > > > > > as its bypass_netdev is same as the backup_netdev.
>> >> > > > > > > > > > Will add a flag while registering with the bypass module to indicate if the driver is doing
>> >> > > > > > > > > > a 2 netdev or 3 netdev model and based on that flag this check can be done in bypass module
>> >> > > > > > > > > > for 3 netdev scenario.
>> >> > > > > > > > > Just let me undestand it clearly. What I expect the difference would be
>> >> > > > > > > > > between 2netdev and3 netdev model is this:
>> >> > > > > > > > > 2netdev:
>> >> > > > > > > > > bypass_master
>> >> > > > > > > > > /
>> >> > > > > > > > > /
>> >> > > > > > > > > VF_slave
>> >> > > > > > > > >
>> >> > > > > > > > > 3netdev:
>> >> > > > > > > > > bypass_master
>> >> > > > > > > > > / \
>> >> > > > > > > > > / \
>> >> > > > > > > > > VF_slave backup_slave
>> >> > > > > > > > >
>> >> > > > > > > > > Is that correct? If not, how does it look like?
>> >> > > > > > > > >
>> >> > > > > > > > >
>> >> > > > > > > > Looks correct.
>> >> > > > > > > > VF_slave and backup_slave are the original netdevs and are present in both the models.
>> >> > > > > > > > In the 3 netdev model, bypass_master netdev is created and VF_slave and backup_slave are
>> >> > > > > > > > marked as the 2 slaves of this new netdev.
>> >> > > > > > > You say it looks correct and in another sentence you provide completely
>> >> > > > > > > different description. Could you please look again?
>> >> > > > > > >
>> >> > > > > > To be exact, 2 netdev model with netvsc looks like this.
>> >> > > > > >
>> >> > > > > > netvsc_netdev
>> >> > > > > > /
>> >> > > > > > /
>> >> > > > > > VF_slave
>> >> > > > > >
>> >> > > > > > With virtio_net, 3 netdev model
>> >> > > > > >
>> >> > > > > > bypass_netdev
>> >> > > > > > / \
>> >> > > > > > / \
>> >> > > > > > VF_slave virtio_net netdev
>> >> > > > > Could you also mark the original netdev which is there now? is it
>> >> > > > > bypass_netdev or virtio_net_netdev ?
>> >> > > > bypass_netdev
>> >> > > > / \
>> >> > > > / \
>> >> > > > VF_slave virtio_net netdev (original)
>> >> > > That does not make sense.
>> >> > > 1) You diverge from the behaviour of the netvsc, where the original
>> >> > > netdev is a master of the VF
>> >> > > 2) If the original netdev is a slave, you cannot have any IP address
>> >> > > configured on it (well you could, but the rx_handler would eat every
>> >> > > incoming packet). So you will break the user bacause he would have to
>> >> > > move the configuration to the new master device.
>> >> > > This only makes sense that the original netdev becomes the master for both
>> >> > > netvsc and virtio_net.
>> >> > Forgot to mention that bypass_netdev takes over the name of the original
>> >> > netdev and
>> >> > virtio_net netdev will get the backup name.
>> >> What do you mean by "name"?
>> >
>> >bypass_netdev also is associated with the same pci device as the original virtio_net
>> >netdev via SET_NETDEV_DEV(). Also, we added ndo_get_phys_port_name() to virtio_net
>> >that will return _bkup when BACKUP feature is enabled.
>>
>> Okay.
>>
>> >
>> >So for ex: if virtio_net inteface was getting 'ens12' as the name assigned by udev
>> >without BACKUP feature, when BACKUP feature is enabled, the bypass_netdev will be
>> >named 'ens12' and the original virtio_net will get named as ens12n_bkup.
>>
>> Got it.
>>
>> I don't like the bypass_master to look differently in netvsc and
>> virtio_net :/ The best would be to convert netvsc to 3 netdev model and
>> treat them the same. The more I think about it, the more the 2 netdev
>> model feels wrong.
>
>If you believe that, then this patchset is a step in the right
>direction.
>
>With something like this patchset applied, converting netvsc to a 3
>device model will presumably be just a flag flip away. Afterwards
If done properly. Yes.
>we'll be able to drop dead code handling the bypass_master flag.
>
>>
>> >
>> >
>> >>
>> >> > So the userspace network configuration doesn't need to change.
>> >> >
>> >> >
>> >
^ permalink raw reply
* Re: [RFC PATCH net-next v5 3/4] virtio_net: Extend virtio to use VF datapath when available
From: Michael S. Tsirkin @ 2018-04-11 14:45 UTC (permalink / raw)
To: Jiri Pirko
Cc: alexander.h.duyck, virtio-dev, kubakici, Samudrala, Sridhar,
virtualization, loseweigh, netdev, davem
In-Reply-To: <20180411080332.GL2028@nanopsycho>
On Wed, Apr 11, 2018 at 10:03:32AM +0200, Jiri Pirko wrote:
> Wed, Apr 11, 2018 at 08:24:43AM CEST, sridhar.samudrala@intel.com wrote:
> >On 4/10/2018 11:03 PM, Jiri Pirko wrote:
> >> Tue, Apr 10, 2018 at 05:59:02PM CEST, sridhar.samudrala@intel.com wrote:
> >> > On 4/10/2018 8:43 AM, Jiri Pirko wrote:
> >> > > Tue, Apr 10, 2018 at 05:27:48PM CEST, sridhar.samudrala@intel.com wrote:
> >> > > > On 4/10/2018 8:22 AM, Jiri Pirko wrote:
> >> > > > > Tue, Apr 10, 2018 at 05:13:40PM CEST, sridhar.samudrala@intel.com wrote:
> >> > > > > > On 4/10/2018 3:55 AM, Jiri Pirko wrote:
> >> > > > > > > Mon, Apr 09, 2018 at 08:47:06PM CEST, sridhar.samudrala@intel.com wrote:
> >> > > > > > > > On 4/9/2018 1:07 AM, Jiri Pirko wrote:
> >> > > > > > > > > Sat, Apr 07, 2018 at 12:59:14AM CEST, sridhar.samudrala@intel.com wrote:
> >> > > > > > > > > > On 4/6/2018 5:48 AM, Jiri Pirko wrote:
> >> > > > > > > > > > > Thu, Apr 05, 2018 at 11:08:22PM CEST, sridhar.samudrala@intel.com wrote:
> >> > > > > > > > > [...]
> >> > > > > > > > >
> >> > > > > > > > > > > > +static int virtnet_bypass_join_child(struct net_device *bypass_netdev,
> >> > > > > > > > > > > > + struct net_device *child_netdev)
> >> > > > > > > > > > > > +{
> >> > > > > > > > > > > > + struct virtnet_bypass_info *vbi;
> >> > > > > > > > > > > > + bool backup;
> >> > > > > > > > > > > > +
> >> > > > > > > > > > > > + vbi = netdev_priv(bypass_netdev);
> >> > > > > > > > > > > > + backup = (child_netdev->dev.parent == bypass_netdev->dev.parent);
> >> > > > > > > > > > > > + if (backup ? rtnl_dereference(vbi->backup_netdev) :
> >> > > > > > > > > > > > + rtnl_dereference(vbi->active_netdev)) {
> >> > > > > > > > > > > > + netdev_info(bypass_netdev,
> >> > > > > > > > > > > > + "%s attempting to join bypass dev when %s already present\n",
> >> > > > > > > > > > > > + child_netdev->name, backup ? "backup" : "active");
> >> > > > > > > > > > > Bypass module should check if there is already some other netdev
> >> > > > > > > > > > > enslaved and refuse right there.
> >> > > > > > > > > > This will work for virtio-net with 3 netdev model, but this check has to be done by netvsc
> >> > > > > > > > > > as its bypass_netdev is same as the backup_netdev.
> >> > > > > > > > > > Will add a flag while registering with the bypass module to indicate if the driver is doing
> >> > > > > > > > > > a 2 netdev or 3 netdev model and based on that flag this check can be done in bypass module
> >> > > > > > > > > > for 3 netdev scenario.
> >> > > > > > > > > Just let me undestand it clearly. What I expect the difference would be
> >> > > > > > > > > between 2netdev and3 netdev model is this:
> >> > > > > > > > > 2netdev:
> >> > > > > > > > > bypass_master
> >> > > > > > > > > /
> >> > > > > > > > > /
> >> > > > > > > > > VF_slave
> >> > > > > > > > >
> >> > > > > > > > > 3netdev:
> >> > > > > > > > > bypass_master
> >> > > > > > > > > / \
> >> > > > > > > > > / \
> >> > > > > > > > > VF_slave backup_slave
> >> > > > > > > > >
> >> > > > > > > > > Is that correct? If not, how does it look like?
> >> > > > > > > > >
> >> > > > > > > > >
> >> > > > > > > > Looks correct.
> >> > > > > > > > VF_slave and backup_slave are the original netdevs and are present in both the models.
> >> > > > > > > > In the 3 netdev model, bypass_master netdev is created and VF_slave and backup_slave are
> >> > > > > > > > marked as the 2 slaves of this new netdev.
> >> > > > > > > You say it looks correct and in another sentence you provide completely
> >> > > > > > > different description. Could you please look again?
> >> > > > > > >
> >> > > > > > To be exact, 2 netdev model with netvsc looks like this.
> >> > > > > >
> >> > > > > > netvsc_netdev
> >> > > > > > /
> >> > > > > > /
> >> > > > > > VF_slave
> >> > > > > >
> >> > > > > > With virtio_net, 3 netdev model
> >> > > > > >
> >> > > > > > bypass_netdev
> >> > > > > > / \
> >> > > > > > / \
> >> > > > > > VF_slave virtio_net netdev
> >> > > > > Could you also mark the original netdev which is there now? is it
> >> > > > > bypass_netdev or virtio_net_netdev ?
> >> > > > bypass_netdev
> >> > > > / \
> >> > > > / \
> >> > > > VF_slave virtio_net netdev (original)
> >> > > That does not make sense.
> >> > > 1) You diverge from the behaviour of the netvsc, where the original
> >> > > netdev is a master of the VF
> >> > > 2) If the original netdev is a slave, you cannot have any IP address
> >> > > configured on it (well you could, but the rx_handler would eat every
> >> > > incoming packet). So you will break the user bacause he would have to
> >> > > move the configuration to the new master device.
> >> > > This only makes sense that the original netdev becomes the master for both
> >> > > netvsc and virtio_net.
> >> > Forgot to mention that bypass_netdev takes over the name of the original
> >> > netdev and
> >> > virtio_net netdev will get the backup name.
> >> What do you mean by "name"?
> >
> >bypass_netdev also is associated with the same pci device as the original virtio_net
> >netdev via SET_NETDEV_DEV(). Also, we added ndo_get_phys_port_name() to virtio_net
> >that will return _bkup when BACKUP feature is enabled.
>
> Okay.
>
> >
> >So for ex: if virtio_net inteface was getting 'ens12' as the name assigned by udev
> >without BACKUP feature, when BACKUP feature is enabled, the bypass_netdev will be
> >named 'ens12' and the original virtio_net will get named as ens12n_bkup.
>
> Got it.
>
> I don't like the bypass_master to look differently in netvsc and
> virtio_net :/ The best would be to convert netvsc to 3 netdev model and
> treat them the same. The more I think about it, the more the 2 netdev
> model feels wrong.
If you believe that, then this patchset is a step in the right
direction.
With something like this patchset applied, converting netvsc to a 3
device model will presumably be just a flag flip away. Afterwards
we'll be able to drop dead code handling the bypass_master flag.
>
> >
> >
> >>
> >> > So the userspace network configuration doesn't need to change.
> >> >
> >> >
> >
^ permalink raw reply
* Re: [PATCH] vhost: Fix vhost_copy_to_user()
From: Michael S. Tsirkin @ 2018-04-11 13:51 UTC (permalink / raw)
To: Eric Auger
Cc: kvm, netdev, linux-kernel, virtualization, stefanha, kvmarm,
eric.auger.pro
In-Reply-To: <1523453438-4266-1-git-send-email-eric.auger@redhat.com>
On Wed, Apr 11, 2018 at 03:30:38PM +0200, Eric Auger wrote:
> vhost_copy_to_user is used to copy vring used elements to userspace.
> We should use VHOST_ADDR_USED instead of VHOST_ADDR_DESC.
>
> Fixes: f88949138058 ("vhost: introduce O(1) vq metadata cache")
> Signed-off-by: Eric Auger <eric.auger@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
> ---
>
> This fixes a stall observed when running an aarch64 guest with
> virtual smmu
> ---
> drivers/vhost/vhost.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index bec722e..f44aead 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -744,7 +744,7 @@ static int vhost_copy_to_user(struct vhost_virtqueue *vq, void __user *to,
> struct iov_iter t;
> void __user *uaddr = vhost_vq_meta_fetch(vq,
> (u64)(uintptr_t)to, size,
> - VHOST_ADDR_DESC);
> + VHOST_ADDR_USED);
>
> if (uaddr)
> return __copy_to_user(uaddr, from, size);
> --
> 2.5.5
^ permalink raw reply
* Re: [PATCH] vhost: Fix vhost_copy_to_user()
From: Auger Eric @ 2018-04-11 13:45 UTC (permalink / raw)
To: Jason Wang, eric.auger.pro, linux-kernel, virtualization, netdev,
kvm, mst, kvmarm
Cc: stefanha
In-Reply-To: <85c033b9-b230-7ef9-744c-4e2799684609@redhat.com>
Hi Jason,
On 11/04/18 15:44, Jason Wang wrote:
>
>
> On 2018年04月11日 21:30, Eric Auger wrote:
>> vhost_copy_to_user is used to copy vring used elements to userspace.
>> We should use VHOST_ADDR_USED instead of VHOST_ADDR_DESC.
>>
>> Fixes: f88949138058 ("vhost: introduce O(1) vq metadata cache")
>> Signed-off-by: Eric Auger <eric.auger@redhat.com>
>>
>> ---
>>
>> This fixes a stall observed when running an aarch64 guest with
>> virtual smmu
>> ---
>> drivers/vhost/vhost.c | 2 +-
>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
>> index bec722e..f44aead 100644
>> --- a/drivers/vhost/vhost.c
>> +++ b/drivers/vhost/vhost.c
>> @@ -744,7 +744,7 @@ static int vhost_copy_to_user(struct
>> vhost_virtqueue *vq, void __user *to,
>> struct iov_iter t;
>> void __user *uaddr = vhost_vq_meta_fetch(vq,
>> (u64)(uintptr_t)to, size,
>> - VHOST_ADDR_DESC);
>> + VHOST_ADDR_USED);
>> if (uaddr)
>> return __copy_to_user(uaddr, from, size);
>
> Acked-by: Jason Wang <jasowang@redhat.com>
>
> Thanks!
>
> Stable material I think.
yes I think so.
Thanks
Eric
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
^ permalink raw reply
* Re: [PATCH] vhost: Fix vhost_copy_to_user()
From: Jason Wang @ 2018-04-11 13:44 UTC (permalink / raw)
To: Eric Auger, eric.auger.pro, linux-kernel, virtualization, netdev,
kvm, mst, kvmarm
Cc: stefanha
In-Reply-To: <1523453438-4266-1-git-send-email-eric.auger@redhat.com>
On 2018年04月11日 21:30, Eric Auger wrote:
> vhost_copy_to_user is used to copy vring used elements to userspace.
> We should use VHOST_ADDR_USED instead of VHOST_ADDR_DESC.
>
> Fixes: f88949138058 ("vhost: introduce O(1) vq metadata cache")
> Signed-off-by: Eric Auger <eric.auger@redhat.com>
>
> ---
>
> This fixes a stall observed when running an aarch64 guest with
> virtual smmu
> ---
> drivers/vhost/vhost.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index bec722e..f44aead 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -744,7 +744,7 @@ static int vhost_copy_to_user(struct vhost_virtqueue *vq, void __user *to,
> struct iov_iter t;
> void __user *uaddr = vhost_vq_meta_fetch(vq,
> (u64)(uintptr_t)to, size,
> - VHOST_ADDR_DESC);
> + VHOST_ADDR_USED);
>
> if (uaddr)
> return __copy_to_user(uaddr, from, size);
Acked-by: Jason Wang <jasowang@redhat.com>
Thanks!
Stable material I think.
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
^ permalink raw reply
* [PATCH] vhost: Fix vhost_copy_to_user()
From: Eric Auger @ 2018-04-11 13:30 UTC (permalink / raw)
To: eric.auger.pro, eric.auger, linux-kernel, virtualization, netdev,
kvm, jasowang, mst, kvmarm
Cc: stefanha
vhost_copy_to_user is used to copy vring used elements to userspace.
We should use VHOST_ADDR_USED instead of VHOST_ADDR_DESC.
Fixes: f88949138058 ("vhost: introduce O(1) vq metadata cache")
Signed-off-by: Eric Auger <eric.auger@redhat.com>
---
This fixes a stall observed when running an aarch64 guest with
virtual smmu
---
drivers/vhost/vhost.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index bec722e..f44aead 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -744,7 +744,7 @@ static int vhost_copy_to_user(struct vhost_virtqueue *vq, void __user *to,
struct iov_iter t;
void __user *uaddr = vhost_vq_meta_fetch(vq,
(u64)(uintptr_t)to, size,
- VHOST_ADDR_DESC);
+ VHOST_ADDR_USED);
if (uaddr)
return __copy_to_user(uaddr, from, size);
--
2.5.5
^ permalink raw reply related
* Re: [PATCH v3 0/2] vhost: fix vhost_vq_access_ok() log check
From: Michael S. Tsirkin @ 2018-04-11 13:24 UTC (permalink / raw)
To: Stefan Hajnoczi
Cc: kvm, netdev, syzkaller-bugs, linux-kernel, virtualization,
Linus Torvalds
In-Reply-To: <20180411023541.15776-1-stefanha@redhat.com>
On Wed, Apr 11, 2018 at 10:35:39AM +0800, Stefan Hajnoczi wrote:
> v3:
> * Rebased onto net/master and resolved conflict [DaveM]
>
> v2:
> * Rewrote the conditional to make the vq access check clearer [Linus]
> * Added Patch 2 to make the return type consistent and harder to misuse [Linus]
>
> The first patch fixes the vhost virtqueue access check which was recently
> broken. The second patch replaces the int return type with bool to prevent
> future bugs.
Acked-by: Michael S. Tsirkin <mst@redhat.com>
We need the 1st one on stable I think.
> Stefan Hajnoczi (2):
> vhost: fix vhost_vq_access_ok() log check
> vhost: return bool from *_access_ok() functions
>
> drivers/vhost/vhost.h | 4 +--
> drivers/vhost/vhost.c | 70 ++++++++++++++++++++++++++-------------------------
> 2 files changed, 38 insertions(+), 36 deletions(-)
>
> --
> 2.14.3
^ permalink raw reply
* Re: [PATCH v3 2/2] vhost: return bool from *_access_ok() functions
From: Michael S. Tsirkin @ 2018-04-11 13:23 UTC (permalink / raw)
To: Stefan Hajnoczi
Cc: kvm, netdev, syzkaller-bugs, linux-kernel, virtualization,
Linus Torvalds
In-Reply-To: <20180411023541.15776-3-stefanha@redhat.com>
On Wed, Apr 11, 2018 at 10:35:41AM +0800, Stefan Hajnoczi wrote:
> Currently vhost *_access_ok() functions return int. This is error-prone
> because there are two popular conventions:
>
> 1. 0 means failure, 1 means success
> 2. -errno means failure, 0 means success
>
> Although vhost mostly uses #1, it does not do so consistently.
> umem_access_ok() uses #2.
>
> This patch changes the return type from int to bool so that false means
> failure and true means success. This eliminates a potential source of
> errors.
>
> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
> ---
> drivers/vhost/vhost.h | 4 ++--
> drivers/vhost/vhost.c | 66 +++++++++++++++++++++++++--------------------------
> 2 files changed, 35 insertions(+), 35 deletions(-)
>
> diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
> index d8ee85ae8fdc..6c844b90a168 100644
> --- a/drivers/vhost/vhost.h
> +++ b/drivers/vhost/vhost.h
> @@ -178,8 +178,8 @@ void vhost_dev_cleanup(struct vhost_dev *);
> void vhost_dev_stop(struct vhost_dev *);
> long vhost_dev_ioctl(struct vhost_dev *, unsigned int ioctl, void __user *argp);
> long vhost_vring_ioctl(struct vhost_dev *d, unsigned int ioctl, void __user *argp);
> -int vhost_vq_access_ok(struct vhost_virtqueue *vq);
> -int vhost_log_access_ok(struct vhost_dev *);
> +bool vhost_vq_access_ok(struct vhost_virtqueue *vq);
> +bool vhost_log_access_ok(struct vhost_dev *);
>
> int vhost_get_vq_desc(struct vhost_virtqueue *,
> struct iovec iov[], unsigned int iov_count,
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index fc805b7fad9d..0fcb51a9940c 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -641,14 +641,14 @@ void vhost_dev_cleanup(struct vhost_dev *dev)
> }
> EXPORT_SYMBOL_GPL(vhost_dev_cleanup);
>
> -static int log_access_ok(void __user *log_base, u64 addr, unsigned long sz)
> +static bool log_access_ok(void __user *log_base, u64 addr, unsigned long sz)
> {
> u64 a = addr / VHOST_PAGE_SIZE / 8;
>
> /* Make sure 64 bit math will not overflow. */
> if (a > ULONG_MAX - (unsigned long)log_base ||
> a + (unsigned long)log_base > ULONG_MAX)
> - return 0;
> + return false;
>
> return access_ok(VERIFY_WRITE, log_base + a,
> (sz + VHOST_PAGE_SIZE * 8 - 1) / VHOST_PAGE_SIZE / 8);
> @@ -661,30 +661,30 @@ static bool vhost_overflow(u64 uaddr, u64 size)
> }
>
> /* Caller should have vq mutex and device mutex. */
> -static int vq_memory_access_ok(void __user *log_base, struct vhost_umem *umem,
> - int log_all)
> +static bool vq_memory_access_ok(void __user *log_base, struct vhost_umem *umem,
> + int log_all)
> {
> struct vhost_umem_node *node;
>
> if (!umem)
> - return 0;
> + return false;
>
> list_for_each_entry(node, &umem->umem_list, link) {
> unsigned long a = node->userspace_addr;
>
> if (vhost_overflow(node->userspace_addr, node->size))
> - return 0;
> + return false;
>
>
> if (!access_ok(VERIFY_WRITE, (void __user *)a,
> node->size))
> - return 0;
> + return false;
> else if (log_all && !log_access_ok(log_base,
> node->start,
> node->size))
> - return 0;
> + return false;
> }
> - return 1;
> + return true;
> }
>
> static inline void __user *vhost_vq_meta_fetch(struct vhost_virtqueue *vq,
> @@ -701,13 +701,13 @@ static inline void __user *vhost_vq_meta_fetch(struct vhost_virtqueue *vq,
>
> /* Can we switch to this memory table? */
> /* Caller should have device mutex but not vq mutex */
> -static int memory_access_ok(struct vhost_dev *d, struct vhost_umem *umem,
> - int log_all)
> +static bool memory_access_ok(struct vhost_dev *d, struct vhost_umem *umem,
> + int log_all)
> {
> int i;
>
> for (i = 0; i < d->nvqs; ++i) {
> - int ok;
> + bool ok;
> bool log;
>
> mutex_lock(&d->vqs[i]->mutex);
> @@ -717,12 +717,12 @@ static int memory_access_ok(struct vhost_dev *d, struct vhost_umem *umem,
> ok = vq_memory_access_ok(d->vqs[i]->log_base,
> umem, log);
> else
> - ok = 1;
> + ok = true;
> mutex_unlock(&d->vqs[i]->mutex);
> if (!ok)
> - return 0;
> + return false;
> }
> - return 1;
> + return true;
> }
>
> static int translate_desc(struct vhost_virtqueue *vq, u64 addr, u32 len,
> @@ -959,21 +959,21 @@ static void vhost_iotlb_notify_vq(struct vhost_dev *d,
> spin_unlock(&d->iotlb_lock);
> }
>
> -static int umem_access_ok(u64 uaddr, u64 size, int access)
> +static bool umem_access_ok(u64 uaddr, u64 size, int access)
> {
> unsigned long a = uaddr;
>
> /* Make sure 64 bit math will not overflow. */
> if (vhost_overflow(uaddr, size))
> - return -EFAULT;
> + return false;
>
> if ((access & VHOST_ACCESS_RO) &&
> !access_ok(VERIFY_READ, (void __user *)a, size))
> - return -EFAULT;
> + return false;
> if ((access & VHOST_ACCESS_WO) &&
> !access_ok(VERIFY_WRITE, (void __user *)a, size))
> - return -EFAULT;
> - return 0;
> + return false;
> + return true;
> }
>
> static int vhost_process_iotlb_msg(struct vhost_dev *dev,
> @@ -988,7 +988,7 @@ static int vhost_process_iotlb_msg(struct vhost_dev *dev,
> ret = -EFAULT;
> break;
> }
> - if (umem_access_ok(msg->uaddr, msg->size, msg->perm)) {
> + if (!umem_access_ok(msg->uaddr, msg->size, msg->perm)) {
> ret = -EFAULT;
> break;
> }
> @@ -1135,10 +1135,10 @@ static int vhost_iotlb_miss(struct vhost_virtqueue *vq, u64 iova, int access)
> return 0;
> }
>
> -static int vq_access_ok(struct vhost_virtqueue *vq, unsigned int num,
> - struct vring_desc __user *desc,
> - struct vring_avail __user *avail,
> - struct vring_used __user *used)
> +static bool vq_access_ok(struct vhost_virtqueue *vq, unsigned int num,
> + struct vring_desc __user *desc,
> + struct vring_avail __user *avail,
> + struct vring_used __user *used)
>
> {
> size_t s = vhost_has_feature(vq, VIRTIO_RING_F_EVENT_IDX) ? 2 : 0;
> @@ -1161,8 +1161,8 @@ static void vhost_vq_meta_update(struct vhost_virtqueue *vq,
> vq->meta_iotlb[type] = node;
> }
>
> -static int iotlb_access_ok(struct vhost_virtqueue *vq,
> - int access, u64 addr, u64 len, int type)
> +static bool iotlb_access_ok(struct vhost_virtqueue *vq,
> + int access, u64 addr, u64 len, int type)
> {
> const struct vhost_umem_node *node;
> struct vhost_umem *umem = vq->iotlb;
> @@ -1220,7 +1220,7 @@ EXPORT_SYMBOL_GPL(vq_iotlb_prefetch);
>
> /* Can we log writes? */
> /* Caller should have device mutex but not vq mutex */
> -int vhost_log_access_ok(struct vhost_dev *dev)
> +bool vhost_log_access_ok(struct vhost_dev *dev)
> {
> return memory_access_ok(dev, dev->umem, 1);
> }
> @@ -1228,8 +1228,8 @@ EXPORT_SYMBOL_GPL(vhost_log_access_ok);
>
> /* Verify access for write logging. */
> /* Caller should have vq mutex and device mutex */
> -static int vq_log_access_ok(struct vhost_virtqueue *vq,
> - void __user *log_base)
> +static bool vq_log_access_ok(struct vhost_virtqueue *vq,
> + void __user *log_base)
> {
> size_t s = vhost_has_feature(vq, VIRTIO_RING_F_EVENT_IDX) ? 2 : 0;
>
> @@ -1242,14 +1242,14 @@ static int vq_log_access_ok(struct vhost_virtqueue *vq,
>
> /* Can we start vq? */
> /* Caller should have vq mutex and device mutex */
> -int vhost_vq_access_ok(struct vhost_virtqueue *vq)
> +bool vhost_vq_access_ok(struct vhost_virtqueue *vq)
> {
> if (!vq_log_access_ok(vq, vq->log_base))
> - return 0;
> + return false;
>
> /* Access validation occurs at prefetch time with IOTLB */
> if (vq->iotlb)
> - return 1;
> + return true;
>
> return vq_access_ok(vq, vq->num, vq->desc, vq->avail, vq->used);
> }
> --
> 2.14.3
^ permalink raw reply
* Re: [PATCH v3 1/2] vhost: fix vhost_vq_access_ok() log check
From: Michael S. Tsirkin @ 2018-04-11 13:23 UTC (permalink / raw)
To: Stefan Hajnoczi
Cc: kvm, netdev, syzkaller-bugs, linux-kernel, virtualization,
Linus Torvalds
In-Reply-To: <20180411023541.15776-2-stefanha@redhat.com>
On Wed, Apr 11, 2018 at 10:35:40AM +0800, Stefan Hajnoczi wrote:
> Commit d65026c6c62e7d9616c8ceb5a53b68bcdc050525 ("vhost: validate log
> when IOTLB is enabled") introduced a regression. The logic was
> originally:
>
> if (vq->iotlb)
> return 1;
> return A && B;
>
> After the patch the short-circuit logic for A was inverted:
>
> if (A || vq->iotlb)
> return A;
> return B;
>
> This patch fixes the regression by rewriting the checks in the obvious
> way, no longer returning A when vq->iotlb is non-NULL (which is hard to
> understand).
>
> Reported-by: syzbot+65a84dde0214b0387ccd@syzkaller.appspotmail.com
> Cc: Jason Wang <jasowang@redhat.com>
> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
> ---
> drivers/vhost/vhost.c | 8 +++++---
> 1 file changed, 5 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index bec722e41f58..fc805b7fad9d 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -1244,10 +1244,12 @@ static int vq_log_access_ok(struct vhost_virtqueue *vq,
> /* Caller should have vq mutex and device mutex */
> int vhost_vq_access_ok(struct vhost_virtqueue *vq)
> {
> - int ret = vq_log_access_ok(vq, vq->log_base);
> + if (!vq_log_access_ok(vq, vq->log_base))
> + return 0;
>
> - if (ret || vq->iotlb)
> - return ret;
> + /* Access validation occurs at prefetch time with IOTLB */
> + if (vq->iotlb)
> + return 1;
>
> return vq_access_ok(vq, vq->num, vq->desc, vq->avail, vq->used);
> }
> --
> 2.14.3
^ permalink raw reply
* Re: [RFC PATCH net-next v5 3/4] virtio_net: Extend virtio to use VF datapath when available
From: Jiri Pirko @ 2018-04-11 8:03 UTC (permalink / raw)
To: Samudrala, Sridhar
Cc: alexander.h.duyck, virtio-dev, mst, kubakici, netdev,
virtualization, loseweigh, davem
In-Reply-To: <5d56e572-24f1-745d-49ae-c2dada5db03c@intel.com>
Wed, Apr 11, 2018 at 08:24:43AM CEST, sridhar.samudrala@intel.com wrote:
>On 4/10/2018 11:03 PM, Jiri Pirko wrote:
>> Tue, Apr 10, 2018 at 05:59:02PM CEST, sridhar.samudrala@intel.com wrote:
>> > On 4/10/2018 8:43 AM, Jiri Pirko wrote:
>> > > Tue, Apr 10, 2018 at 05:27:48PM CEST, sridhar.samudrala@intel.com wrote:
>> > > > On 4/10/2018 8:22 AM, Jiri Pirko wrote:
>> > > > > Tue, Apr 10, 2018 at 05:13:40PM CEST, sridhar.samudrala@intel.com wrote:
>> > > > > > On 4/10/2018 3:55 AM, Jiri Pirko wrote:
>> > > > > > > Mon, Apr 09, 2018 at 08:47:06PM CEST, sridhar.samudrala@intel.com wrote:
>> > > > > > > > On 4/9/2018 1:07 AM, Jiri Pirko wrote:
>> > > > > > > > > Sat, Apr 07, 2018 at 12:59:14AM CEST, sridhar.samudrala@intel.com wrote:
>> > > > > > > > > > On 4/6/2018 5:48 AM, Jiri Pirko wrote:
>> > > > > > > > > > > Thu, Apr 05, 2018 at 11:08:22PM CEST, sridhar.samudrala@intel.com wrote:
>> > > > > > > > > [...]
>> > > > > > > > >
>> > > > > > > > > > > > +static int virtnet_bypass_join_child(struct net_device *bypass_netdev,
>> > > > > > > > > > > > + struct net_device *child_netdev)
>> > > > > > > > > > > > +{
>> > > > > > > > > > > > + struct virtnet_bypass_info *vbi;
>> > > > > > > > > > > > + bool backup;
>> > > > > > > > > > > > +
>> > > > > > > > > > > > + vbi = netdev_priv(bypass_netdev);
>> > > > > > > > > > > > + backup = (child_netdev->dev.parent == bypass_netdev->dev.parent);
>> > > > > > > > > > > > + if (backup ? rtnl_dereference(vbi->backup_netdev) :
>> > > > > > > > > > > > + rtnl_dereference(vbi->active_netdev)) {
>> > > > > > > > > > > > + netdev_info(bypass_netdev,
>> > > > > > > > > > > > + "%s attempting to join bypass dev when %s already present\n",
>> > > > > > > > > > > > + child_netdev->name, backup ? "backup" : "active");
>> > > > > > > > > > > Bypass module should check if there is already some other netdev
>> > > > > > > > > > > enslaved and refuse right there.
>> > > > > > > > > > This will work for virtio-net with 3 netdev model, but this check has to be done by netvsc
>> > > > > > > > > > as its bypass_netdev is same as the backup_netdev.
>> > > > > > > > > > Will add a flag while registering with the bypass module to indicate if the driver is doing
>> > > > > > > > > > a 2 netdev or 3 netdev model and based on that flag this check can be done in bypass module
>> > > > > > > > > > for 3 netdev scenario.
>> > > > > > > > > Just let me undestand it clearly. What I expect the difference would be
>> > > > > > > > > between 2netdev and3 netdev model is this:
>> > > > > > > > > 2netdev:
>> > > > > > > > > bypass_master
>> > > > > > > > > /
>> > > > > > > > > /
>> > > > > > > > > VF_slave
>> > > > > > > > >
>> > > > > > > > > 3netdev:
>> > > > > > > > > bypass_master
>> > > > > > > > > / \
>> > > > > > > > > / \
>> > > > > > > > > VF_slave backup_slave
>> > > > > > > > >
>> > > > > > > > > Is that correct? If not, how does it look like?
>> > > > > > > > >
>> > > > > > > > >
>> > > > > > > > Looks correct.
>> > > > > > > > VF_slave and backup_slave are the original netdevs and are present in both the models.
>> > > > > > > > In the 3 netdev model, bypass_master netdev is created and VF_slave and backup_slave are
>> > > > > > > > marked as the 2 slaves of this new netdev.
>> > > > > > > You say it looks correct and in another sentence you provide completely
>> > > > > > > different description. Could you please look again?
>> > > > > > >
>> > > > > > To be exact, 2 netdev model with netvsc looks like this.
>> > > > > >
>> > > > > > netvsc_netdev
>> > > > > > /
>> > > > > > /
>> > > > > > VF_slave
>> > > > > >
>> > > > > > With virtio_net, 3 netdev model
>> > > > > >
>> > > > > > bypass_netdev
>> > > > > > / \
>> > > > > > / \
>> > > > > > VF_slave virtio_net netdev
>> > > > > Could you also mark the original netdev which is there now? is it
>> > > > > bypass_netdev or virtio_net_netdev ?
>> > > > bypass_netdev
>> > > > / \
>> > > > / \
>> > > > VF_slave virtio_net netdev (original)
>> > > That does not make sense.
>> > > 1) You diverge from the behaviour of the netvsc, where the original
>> > > netdev is a master of the VF
>> > > 2) If the original netdev is a slave, you cannot have any IP address
>> > > configured on it (well you could, but the rx_handler would eat every
>> > > incoming packet). So you will break the user bacause he would have to
>> > > move the configuration to the new master device.
>> > > This only makes sense that the original netdev becomes the master for both
>> > > netvsc and virtio_net.
>> > Forgot to mention that bypass_netdev takes over the name of the original
>> > netdev and
>> > virtio_net netdev will get the backup name.
>> What do you mean by "name"?
>
>bypass_netdev also is associated with the same pci device as the original virtio_net
>netdev via SET_NETDEV_DEV(). Also, we added ndo_get_phys_port_name() to virtio_net
>that will return _bkup when BACKUP feature is enabled.
Okay.
>
>So for ex: if virtio_net inteface was getting 'ens12' as the name assigned by udev
>without BACKUP feature, when BACKUP feature is enabled, the bypass_netdev will be
>named 'ens12' and the original virtio_net will get named as ens12n_bkup.
Got it.
I don't like the bypass_master to look differently in netvsc and
virtio_net :/ The best would be to convert netvsc to 3 netdev model and
treat them the same. The more I think about it, the more the 2 netdev
model feels wrong.
>
>
>>
>> > So the userspace network configuration doesn't need to change.
>> >
>> >
>
^ permalink raw reply
* Re: [RFC PATCH net-next v6 4/4] netvsc: refactor notifier/event handling code to use the bypass framework
From: Jiri Pirko @ 2018-04-11 7:53 UTC (permalink / raw)
To: Stephen Hemminger
Cc: alexander.h.duyck, virtio-dev, mst, kubakici, Sridhar Samudrala,
virtualization, loseweigh, netdev, davem
In-Reply-To: <20180410142608.50f15b45@xeon-e3>
Tue, Apr 10, 2018 at 11:26:08PM CEST, stephen@networkplumber.org wrote:
>On Tue, 10 Apr 2018 11:59:50 -0700
>Sridhar Samudrala <sridhar.samudrala@intel.com> wrote:
>
>> Use the registration/notification framework supported by the generic
>> bypass infrastructure.
>>
>> Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
>> ---
>
>Thanks for doing this. Your current version has couple show stopper
>issues.
>
>First, the slave device is instantly taking over the slave.
>This doesn't allow udev/systemd to do its device rename of the slave
>device. Netvsc uses a delayed work to workaround this.
Wait. Why the fact a device is enslaved has to affect the udev in any
way? If it does, smells like a bug in udev.
>
>Secondly, the select queue needs to call queue selection in VF.
>The bonding/teaming logic doesn't work well for UDP flows.
>Commit b3bf5666a510 ("hv_netvsc: defer queue selection to VF")
>fixed this performance problem.
>
>Lastly, more indirection is bad in current climate.
>
>I am not completely adverse to this but it needs to be fast, simple
>and completely transparent.
^ permalink raw reply
* Re: [RFC PATCH net-next v6 4/4] netvsc: refactor notifier/event handling code to use the bypass framework
From: Jiri Pirko @ 2018-04-11 7:50 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: alexander.h.duyck, virtio-dev, kubakici, Sridhar Samudrala,
virtualization, loseweigh, netdev, davem
In-Reply-To: <20180411022807-mutt-send-email-mst@kernel.org>
Wed, Apr 11, 2018 at 01:28:51AM CEST, mst@redhat.com wrote:
>On Tue, Apr 10, 2018 at 02:26:08PM -0700, Stephen Hemminger wrote:
>> On Tue, 10 Apr 2018 11:59:50 -0700
>> Sridhar Samudrala <sridhar.samudrala@intel.com> wrote:
>>
>> > Use the registration/notification framework supported by the generic
>> > bypass infrastructure.
>> >
>> > Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
>> > ---
>>
>> Thanks for doing this. Your current version has couple show stopper
>> issues.
>>
>> First, the slave device is instantly taking over the slave.
>> This doesn't allow udev/systemd to do its device rename of the slave
>> device. Netvsc uses a delayed work to workaround this.
>
>Interesting. Does this mean udev must act within a specific time window
>then?
Yeah. That is scarry. Also, wrong.
>
>> Secondly, the select queue needs to call queue selection in VF.
>> The bonding/teaming logic doesn't work well for UDP flows.
>> Commit b3bf5666a510 ("hv_netvsc: defer queue selection to VF")
>> fixed this performance problem.
>>
>> Lastly, more indirection is bad in current climate.
>>
>> I am not completely adverse to this but it needs to be fast, simple
>> and completely transparent.
^ permalink raw reply
* Re: [RFC PATCH net-next v5 3/4] virtio_net: Extend virtio to use VF datapath when available
From: Samudrala, Sridhar @ 2018-04-11 6:24 UTC (permalink / raw)
To: Jiri Pirko
Cc: alexander.h.duyck, virtio-dev, mst, kubakici, netdev,
virtualization, loseweigh, davem
In-Reply-To: <20180411060327.GH2028@nanopsycho>
On 4/10/2018 11:03 PM, Jiri Pirko wrote:
> Tue, Apr 10, 2018 at 05:59:02PM CEST, sridhar.samudrala@intel.com wrote:
>> On 4/10/2018 8:43 AM, Jiri Pirko wrote:
>>> Tue, Apr 10, 2018 at 05:27:48PM CEST, sridhar.samudrala@intel.com wrote:
>>>> On 4/10/2018 8:22 AM, Jiri Pirko wrote:
>>>>> Tue, Apr 10, 2018 at 05:13:40PM CEST, sridhar.samudrala@intel.com wrote:
>>>>>> On 4/10/2018 3:55 AM, Jiri Pirko wrote:
>>>>>>> Mon, Apr 09, 2018 at 08:47:06PM CEST, sridhar.samudrala@intel.com wrote:
>>>>>>>> On 4/9/2018 1:07 AM, Jiri Pirko wrote:
>>>>>>>>> Sat, Apr 07, 2018 at 12:59:14AM CEST, sridhar.samudrala@intel.com wrote:
>>>>>>>>>> On 4/6/2018 5:48 AM, Jiri Pirko wrote:
>>>>>>>>>>> Thu, Apr 05, 2018 at 11:08:22PM CEST, sridhar.samudrala@intel.com wrote:
>>>>>>>>> [...]
>>>>>>>>>
>>>>>>>>>>>> +static int virtnet_bypass_join_child(struct net_device *bypass_netdev,
>>>>>>>>>>>> + struct net_device *child_netdev)
>>>>>>>>>>>> +{
>>>>>>>>>>>> + struct virtnet_bypass_info *vbi;
>>>>>>>>>>>> + bool backup;
>>>>>>>>>>>> +
>>>>>>>>>>>> + vbi = netdev_priv(bypass_netdev);
>>>>>>>>>>>> + backup = (child_netdev->dev.parent == bypass_netdev->dev.parent);
>>>>>>>>>>>> + if (backup ? rtnl_dereference(vbi->backup_netdev) :
>>>>>>>>>>>> + rtnl_dereference(vbi->active_netdev)) {
>>>>>>>>>>>> + netdev_info(bypass_netdev,
>>>>>>>>>>>> + "%s attempting to join bypass dev when %s already present\n",
>>>>>>>>>>>> + child_netdev->name, backup ? "backup" : "active");
>>>>>>>>>>> Bypass module should check if there is already some other netdev
>>>>>>>>>>> enslaved and refuse right there.
>>>>>>>>>> This will work for virtio-net with 3 netdev model, but this check has to be done by netvsc
>>>>>>>>>> as its bypass_netdev is same as the backup_netdev.
>>>>>>>>>> Will add a flag while registering with the bypass module to indicate if the driver is doing
>>>>>>>>>> a 2 netdev or 3 netdev model and based on that flag this check can be done in bypass module
>>>>>>>>>> for 3 netdev scenario.
>>>>>>>>> Just let me undestand it clearly. What I expect the difference would be
>>>>>>>>> between 2netdev and3 netdev model is this:
>>>>>>>>> 2netdev:
>>>>>>>>> bypass_master
>>>>>>>>> /
>>>>>>>>> /
>>>>>>>>> VF_slave
>>>>>>>>>
>>>>>>>>> 3netdev:
>>>>>>>>> bypass_master
>>>>>>>>> / \
>>>>>>>>> / \
>>>>>>>>> VF_slave backup_slave
>>>>>>>>>
>>>>>>>>> Is that correct? If not, how does it look like?
>>>>>>>>>
>>>>>>>>>
>>>>>>>> Looks correct.
>>>>>>>> VF_slave and backup_slave are the original netdevs and are present in both the models.
>>>>>>>> In the 3 netdev model, bypass_master netdev is created and VF_slave and backup_slave are
>>>>>>>> marked as the 2 slaves of this new netdev.
>>>>>>> You say it looks correct and in another sentence you provide completely
>>>>>>> different description. Could you please look again?
>>>>>>>
>>>>>> To be exact, 2 netdev model with netvsc looks like this.
>>>>>>
>>>>>> netvsc_netdev
>>>>>> /
>>>>>> /
>>>>>> VF_slave
>>>>>>
>>>>>> With virtio_net, 3 netdev model
>>>>>>
>>>>>> bypass_netdev
>>>>>> / \
>>>>>> / \
>>>>>> VF_slave virtio_net netdev
>>>>> Could you also mark the original netdev which is there now? is it
>>>>> bypass_netdev or virtio_net_netdev ?
>>>> bypass_netdev
>>>> / \
>>>> / \
>>>> VF_slave virtio_net netdev (original)
>>> That does not make sense.
>>> 1) You diverge from the behaviour of the netvsc, where the original
>>> netdev is a master of the VF
>>> 2) If the original netdev is a slave, you cannot have any IP address
>>> configured on it (well you could, but the rx_handler would eat every
>>> incoming packet). So you will break the user bacause he would have to
>>> move the configuration to the new master device.
>>> This only makes sense that the original netdev becomes the master for both
>>> netvsc and virtio_net.
>> Forgot to mention that bypass_netdev takes over the name of the original
>> netdev and
>> virtio_net netdev will get the backup name.
> What do you mean by "name"?
bypass_netdev also is associated with the same pci device as the original virtio_net
netdev via SET_NETDEV_DEV(). Also, we added ndo_get_phys_port_name() to virtio_net
that will return _bkup when BACKUP feature is enabled.
So for ex: if virtio_net inteface was getting 'ens12' as the name assigned by udev
without BACKUP feature, when BACKUP feature is enabled, the bypass_netdev will be
named 'ens12' and the original virtio_net will get named as ens12n_bkup.
>
>> So the userspace network configuration doesn't need to change.
>>
>>
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
^ permalink raw reply
* Re: [virtio-dev] Re: [PATCH v2] virtio_balloon: export hugetlb page allocation counts
From: Jason Wang @ 2018-04-11 3:12 UTC (permalink / raw)
To: Jonathan Helman, Michael S. Tsirkin
Cc: Mihai Carabas, virtio-dev, linux-kernel, virtualization
In-Reply-To: <b1414be4-84bc-5d02-afff-9e42c1fd7210@oracle.com>
On 2018年04月10日 05:11, Jonathan Helman wrote:
>
>
> On 03/22/2018 07:38 PM, Jason Wang wrote:
>>
>>
>> On 2018年03月22日 11:10, Michael S. Tsirkin wrote:
>>> On Thu, Mar 22, 2018 at 09:52:18AM +0800, Jason Wang wrote:
>>>> On 2018年03月20日 12:26, Jonathan Helman wrote:
>>>>>> On Mar 19, 2018, at 7:31 PM, Jason Wang<jasowang@redhat.com> wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 2018年03月20日 06:14, Jonathan Helman wrote:
>>>>>>> Export the number of successful and failed hugetlb page
>>>>>>> allocations via the virtio balloon driver. These 2 counts
>>>>>>> come directly from the vm_events HTLB_BUDDY_PGALLOC and
>>>>>>> HTLB_BUDDY_PGALLOC_FAIL.
>>>>>>>
>>>>>>> Signed-off-by: Jonathan Helman<jonathan.helman@oracle.com>
>>>>>> Reviewed-by: Jason Wang<jasowang@redhat.com>
>>>>> Thanks.
>>>>>
>>>>>>> ---
>>>>>>> drivers/virtio/virtio_balloon.c | 6 ++++++
>>>>>>> include/uapi/linux/virtio_balloon.h | 4 +++-
>>>>>>> 2 files changed, 9 insertions(+), 1 deletion(-)
>>>>>>>
>>>>>>> diff --git a/drivers/virtio/virtio_balloon.c
>>>>>>> b/drivers/virtio/virtio_balloon.c
>>>>>>> index dfe5684..6b237e3 100644
>>>>>>> --- a/drivers/virtio/virtio_balloon.c
>>>>>>> +++ b/drivers/virtio/virtio_balloon.c
>>>>>>> @@ -272,6 +272,12 @@ static unsigned int
>>>>>>> update_balloon_stats(struct virtio_balloon *vb)
>>>>>>> pages_to_bytes(events[PSWPOUT]));
>>>>>>> update_stat(vb, idx++, VIRTIO_BALLOON_S_MAJFLT,
>>>>>>> events[PGMAJFAULT]);
>>>>>>> update_stat(vb, idx++, VIRTIO_BALLOON_S_MINFLT,
>>>>>>> events[PGFAULT]);
>>>>>>> +#ifdef CONFIG_HUGETLB_PAGE
>>>>>>> + update_stat(vb, idx++, VIRTIO_BALLOON_S_HTLB_PGALLOC,
>>>>>>> + events[HTLB_BUDDY_PGALLOC]);
>>>>>>> + update_stat(vb, idx++, VIRTIO_BALLOON_S_HTLB_PGFAIL,
>>>>>>> + events[HTLB_BUDDY_PGALLOC_FAIL]);
>>>>>>> +#endif
>>>>>>> #endif
>>>>>>> update_stat(vb, idx++, VIRTIO_BALLOON_S_MEMFREE,
>>>>>>> pages_to_bytes(i.freeram));
>>>>>>> diff --git a/include/uapi/linux/virtio_balloon.h
>>>>>>> b/include/uapi/linux/virtio_balloon.h
>>>>>>> index 4e8b830..40297a3 100644
>>>>>>> --- a/include/uapi/linux/virtio_balloon.h
>>>>>>> +++ b/include/uapi/linux/virtio_balloon.h
>>>>>>> @@ -53,7 +53,9 @@ struct virtio_balloon_config {
>>>>>>> #define VIRTIO_BALLOON_S_MEMTOT 5 /* Total amount of
>>>>>>> memory */
>>>>>>> #define VIRTIO_BALLOON_S_AVAIL 6 /* Available memory as
>>>>>>> in /proc */
>>>>>>> #define VIRTIO_BALLOON_S_CACHES 7 /* Disk caches */
>>>>>>> -#define VIRTIO_BALLOON_S_NR 8
>>>>>>> +#define VIRTIO_BALLOON_S_HTLB_PGALLOC 8 /* Hugetlb page
>>>>>>> allocations */
>>>>>>> +#define VIRTIO_BALLOON_S_HTLB_PGFAIL 9 /* Hugetlb page
>>>>>>> allocation failures */
>>>>>>> +#define VIRTIO_BALLOON_S_NR 10
>>>>>>> /*
>>>>>>> * Memory statistics structure.
>>>>>> Not for this patch, but it looks to me that exporting such nr
>>>>>> through uapi is fragile.
>>>>> Sorry, can you explain what you mean here?
>>>>>
>>>>> Jon
>>>> Spec said "Within an output buffer submitted to the statsq, the
>>>> device MUST
>>>> ignore entries with tag values that it does not recognize". So
>>>> exporting
>>>> VIRTIO_BALLOON_S_NR seems useless and device implementation can not
>>>> depend
>>>> on such number in uapi.
>>>>
>>>> Thanks
>>> Suggestions? I don't like to break build for people ...
>>>
>>
>> Didn't have a good idea. But maybe we should keep VIRTIO_BALLOON_S_NR
>> unchanged, and add a comment here.
>>
>> Thanks
>
> I think Jason's comment is for a future patch. Didn't see this patch
> get applied, so wondering if it could be.
>
> Thanks,
> Jon
Hi Jon:
Have you tested new driver with old qemu?
Thanks
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
^ permalink raw reply
* [PATCH v3 2/2] vhost: return bool from *_access_ok() functions
From: Stefan Hajnoczi @ 2018-04-11 2:35 UTC (permalink / raw)
To: virtualization
Cc: kvm, mst, netdev, syzkaller-bugs, linux-kernel, Stefan Hajnoczi,
Linus Torvalds
In-Reply-To: <20180411023541.15776-1-stefanha@redhat.com>
Currently vhost *_access_ok() functions return int. This is error-prone
because there are two popular conventions:
1. 0 means failure, 1 means success
2. -errno means failure, 0 means success
Although vhost mostly uses #1, it does not do so consistently.
umem_access_ok() uses #2.
This patch changes the return type from int to bool so that false means
failure and true means success. This eliminates a potential source of
errors.
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
drivers/vhost/vhost.h | 4 ++--
drivers/vhost/vhost.c | 66 +++++++++++++++++++++++++--------------------------
2 files changed, 35 insertions(+), 35 deletions(-)
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index d8ee85ae8fdc..6c844b90a168 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -178,8 +178,8 @@ void vhost_dev_cleanup(struct vhost_dev *);
void vhost_dev_stop(struct vhost_dev *);
long vhost_dev_ioctl(struct vhost_dev *, unsigned int ioctl, void __user *argp);
long vhost_vring_ioctl(struct vhost_dev *d, unsigned int ioctl, void __user *argp);
-int vhost_vq_access_ok(struct vhost_virtqueue *vq);
-int vhost_log_access_ok(struct vhost_dev *);
+bool vhost_vq_access_ok(struct vhost_virtqueue *vq);
+bool vhost_log_access_ok(struct vhost_dev *);
int vhost_get_vq_desc(struct vhost_virtqueue *,
struct iovec iov[], unsigned int iov_count,
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index fc805b7fad9d..0fcb51a9940c 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -641,14 +641,14 @@ void vhost_dev_cleanup(struct vhost_dev *dev)
}
EXPORT_SYMBOL_GPL(vhost_dev_cleanup);
-static int log_access_ok(void __user *log_base, u64 addr, unsigned long sz)
+static bool log_access_ok(void __user *log_base, u64 addr, unsigned long sz)
{
u64 a = addr / VHOST_PAGE_SIZE / 8;
/* Make sure 64 bit math will not overflow. */
if (a > ULONG_MAX - (unsigned long)log_base ||
a + (unsigned long)log_base > ULONG_MAX)
- return 0;
+ return false;
return access_ok(VERIFY_WRITE, log_base + a,
(sz + VHOST_PAGE_SIZE * 8 - 1) / VHOST_PAGE_SIZE / 8);
@@ -661,30 +661,30 @@ static bool vhost_overflow(u64 uaddr, u64 size)
}
/* Caller should have vq mutex and device mutex. */
-static int vq_memory_access_ok(void __user *log_base, struct vhost_umem *umem,
- int log_all)
+static bool vq_memory_access_ok(void __user *log_base, struct vhost_umem *umem,
+ int log_all)
{
struct vhost_umem_node *node;
if (!umem)
- return 0;
+ return false;
list_for_each_entry(node, &umem->umem_list, link) {
unsigned long a = node->userspace_addr;
if (vhost_overflow(node->userspace_addr, node->size))
- return 0;
+ return false;
if (!access_ok(VERIFY_WRITE, (void __user *)a,
node->size))
- return 0;
+ return false;
else if (log_all && !log_access_ok(log_base,
node->start,
node->size))
- return 0;
+ return false;
}
- return 1;
+ return true;
}
static inline void __user *vhost_vq_meta_fetch(struct vhost_virtqueue *vq,
@@ -701,13 +701,13 @@ static inline void __user *vhost_vq_meta_fetch(struct vhost_virtqueue *vq,
/* Can we switch to this memory table? */
/* Caller should have device mutex but not vq mutex */
-static int memory_access_ok(struct vhost_dev *d, struct vhost_umem *umem,
- int log_all)
+static bool memory_access_ok(struct vhost_dev *d, struct vhost_umem *umem,
+ int log_all)
{
int i;
for (i = 0; i < d->nvqs; ++i) {
- int ok;
+ bool ok;
bool log;
mutex_lock(&d->vqs[i]->mutex);
@@ -717,12 +717,12 @@ static int memory_access_ok(struct vhost_dev *d, struct vhost_umem *umem,
ok = vq_memory_access_ok(d->vqs[i]->log_base,
umem, log);
else
- ok = 1;
+ ok = true;
mutex_unlock(&d->vqs[i]->mutex);
if (!ok)
- return 0;
+ return false;
}
- return 1;
+ return true;
}
static int translate_desc(struct vhost_virtqueue *vq, u64 addr, u32 len,
@@ -959,21 +959,21 @@ static void vhost_iotlb_notify_vq(struct vhost_dev *d,
spin_unlock(&d->iotlb_lock);
}
-static int umem_access_ok(u64 uaddr, u64 size, int access)
+static bool umem_access_ok(u64 uaddr, u64 size, int access)
{
unsigned long a = uaddr;
/* Make sure 64 bit math will not overflow. */
if (vhost_overflow(uaddr, size))
- return -EFAULT;
+ return false;
if ((access & VHOST_ACCESS_RO) &&
!access_ok(VERIFY_READ, (void __user *)a, size))
- return -EFAULT;
+ return false;
if ((access & VHOST_ACCESS_WO) &&
!access_ok(VERIFY_WRITE, (void __user *)a, size))
- return -EFAULT;
- return 0;
+ return false;
+ return true;
}
static int vhost_process_iotlb_msg(struct vhost_dev *dev,
@@ -988,7 +988,7 @@ static int vhost_process_iotlb_msg(struct vhost_dev *dev,
ret = -EFAULT;
break;
}
- if (umem_access_ok(msg->uaddr, msg->size, msg->perm)) {
+ if (!umem_access_ok(msg->uaddr, msg->size, msg->perm)) {
ret = -EFAULT;
break;
}
@@ -1135,10 +1135,10 @@ static int vhost_iotlb_miss(struct vhost_virtqueue *vq, u64 iova, int access)
return 0;
}
-static int vq_access_ok(struct vhost_virtqueue *vq, unsigned int num,
- struct vring_desc __user *desc,
- struct vring_avail __user *avail,
- struct vring_used __user *used)
+static bool vq_access_ok(struct vhost_virtqueue *vq, unsigned int num,
+ struct vring_desc __user *desc,
+ struct vring_avail __user *avail,
+ struct vring_used __user *used)
{
size_t s = vhost_has_feature(vq, VIRTIO_RING_F_EVENT_IDX) ? 2 : 0;
@@ -1161,8 +1161,8 @@ static void vhost_vq_meta_update(struct vhost_virtqueue *vq,
vq->meta_iotlb[type] = node;
}
-static int iotlb_access_ok(struct vhost_virtqueue *vq,
- int access, u64 addr, u64 len, int type)
+static bool iotlb_access_ok(struct vhost_virtqueue *vq,
+ int access, u64 addr, u64 len, int type)
{
const struct vhost_umem_node *node;
struct vhost_umem *umem = vq->iotlb;
@@ -1220,7 +1220,7 @@ EXPORT_SYMBOL_GPL(vq_iotlb_prefetch);
/* Can we log writes? */
/* Caller should have device mutex but not vq mutex */
-int vhost_log_access_ok(struct vhost_dev *dev)
+bool vhost_log_access_ok(struct vhost_dev *dev)
{
return memory_access_ok(dev, dev->umem, 1);
}
@@ -1228,8 +1228,8 @@ EXPORT_SYMBOL_GPL(vhost_log_access_ok);
/* Verify access for write logging. */
/* Caller should have vq mutex and device mutex */
-static int vq_log_access_ok(struct vhost_virtqueue *vq,
- void __user *log_base)
+static bool vq_log_access_ok(struct vhost_virtqueue *vq,
+ void __user *log_base)
{
size_t s = vhost_has_feature(vq, VIRTIO_RING_F_EVENT_IDX) ? 2 : 0;
@@ -1242,14 +1242,14 @@ static int vq_log_access_ok(struct vhost_virtqueue *vq,
/* Can we start vq? */
/* Caller should have vq mutex and device mutex */
-int vhost_vq_access_ok(struct vhost_virtqueue *vq)
+bool vhost_vq_access_ok(struct vhost_virtqueue *vq)
{
if (!vq_log_access_ok(vq, vq->log_base))
- return 0;
+ return false;
/* Access validation occurs at prefetch time with IOTLB */
if (vq->iotlb)
- return 1;
+ return true;
return vq_access_ok(vq, vq->num, vq->desc, vq->avail, vq->used);
}
--
2.14.3
^ permalink raw reply related
* [PATCH v3 0/2] vhost: fix vhost_vq_access_ok() log check
From: Stefan Hajnoczi @ 2018-04-11 2:35 UTC (permalink / raw)
To: virtualization
Cc: kvm, mst, netdev, syzkaller-bugs, linux-kernel, Stefan Hajnoczi,
Linus Torvalds
v3:
* Rebased onto net/master and resolved conflict [DaveM]
v2:
* Rewrote the conditional to make the vq access check clearer [Linus]
* Added Patch 2 to make the return type consistent and harder to misuse [Linus]
The first patch fixes the vhost virtqueue access check which was recently
broken. The second patch replaces the int return type with bool to prevent
future bugs.
Stefan Hajnoczi (2):
vhost: fix vhost_vq_access_ok() log check
vhost: return bool from *_access_ok() functions
drivers/vhost/vhost.h | 4 +--
drivers/vhost/vhost.c | 70 ++++++++++++++++++++++++++-------------------------
2 files changed, 38 insertions(+), 36 deletions(-)
--
2.14.3
^ permalink raw reply
* Re: [virtio-dev] Re: [RFC] vhost: introduce mdev based hardware vhost backend
From: Jason Wang @ 2018-04-11 2:18 UTC (permalink / raw)
To: Liang, Cunming, Michael S. Tsirkin
Cc: Duyck, Alexander H, virtio-dev@lists.oasis-open.org,
kvm@vger.kernel.org, netdev@vger.kernel.org,
linux-kernel@vger.kernel.org,
virtualization@lists.linux-foundation.org, Wang, Xiao W,
ddutile@redhat.com, Tan, Jianfeng, Wang, Zhihong, Paolo Bonzini
In-Reply-To: <D0158A423229094DA7ABF71CF2FA0DA34E9226AF@SHSMSX104.ccr.corp.intel.com>
On 2018年04月10日 22:23, Liang, Cunming wrote:
>> -----Original Message-----
>> From: Michael S. Tsirkin [mailto:mst@redhat.com]
>> Sent: Tuesday, April 10, 2018 9:36 PM
>> To: Liang, Cunming<cunming.liang@intel.com>
>> Cc: Paolo Bonzini<pbonzini@redhat.com>; Bie, Tiwei<tiwei.bie@intel.com>;
>> Jason Wang<jasowang@redhat.com>;alex.williamson@redhat.com;
>> ddutile@redhat.com; Duyck, Alexander H<alexander.h.duyck@intel.com>;
>> virtio-dev@lists.oasis-open.org;linux-kernel@vger.kernel.org;
>> kvm@vger.kernel.org;virtualization@lists.linux-foundation.org;
>> netdev@vger.kernel.org; Daly, Dan<dan.daly@intel.com>; Wang, Zhihong
>> <zhihong.wang@intel.com>; Tan, Jianfeng<jianfeng.tan@intel.com>; Wang, Xiao
>> W<xiao.w.wang@intel.com>
>> Subject: Re: [virtio-dev] Re: [RFC] vhost: introduce mdev based hardware vhost
>> backend
>>
>> On Tue, Apr 10, 2018 at 09:23:53AM +0000, Liang, Cunming wrote:
>>>> -----Original Message-----
>>>> From: Paolo Bonzini [mailto:pbonzini@redhat.com]
>>>> Sent: Tuesday, April 10, 2018 3:52 PM
>>>> To: Bie, Tiwei<tiwei.bie@intel.com>; Jason Wang
>>>> <jasowang@redhat.com>
>>>> Cc:mst@redhat.com;alex.williamson@redhat.com;ddutile@redhat.com;
>>>> Duyck, Alexander H<alexander.h.duyck@intel.com>;
>>>> virtio-dev@lists.oasis- open.org;linux-kernel@vger.kernel.org;
>>>> kvm@vger.kernel.org;virtualization@lists.linux-foundation.org;
>>>> netdev@vger.kernel.org; Daly, Dan<dan.daly@intel.com>; Liang,
>>>> Cunming<cunming.liang@intel.com>; Wang, Zhihong
>>>> <zhihong.wang@intel.com>; Tan, Jianfeng<jianfeng.tan@intel.com>;
>>>> Wang, Xiao W<xiao.w.wang@intel.com>
>>>> Subject: Re: [virtio-dev] Re: [RFC] vhost: introduce mdev based
>>>> hardware vhost backend
>>>>
>>>> On 10/04/2018 06:57, Tiwei Bie wrote:
>>>>>> So you just move the abstraction layer from qemu to kernel, and
>>>>>> you still need different drivers in kernel for different device
>>>>>> interfaces of accelerators. This looks even more complex than
>>>>>> leaving it in qemu. As you said, another idea is to implement
>>>>>> userspace vhost backend for accelerators which seems easier and
>>>>>> could co-work with other parts of qemu without inventing new type of
>> messages.
>>>>> I'm not quite sure. Do you think it's acceptable to add various
>>>>> vendor specific hardware drivers in QEMU?
>>>> I think so. We have vendor-specific quirks, and at some point there
>>>> was an idea of using quirks to implement (vendor-specific) live
>>>> migration support for assigned devices.
>>> Vendor-specific quirks of accessing VGA is a small portion. Other major portions
>> are still handled by guest driver.
>>> While in this case, when saying various vendor specific drivers in QEMU, it says
>> QEMU takes over and provides the entire user space device drivers. Some parts
>> are even not relevant to vhost, they're basic device function enabling. Moreover,
>> it could be different kinds of devices(network/block/...) under vhost. No matter
>> # of vendors or # of types, total LOC is not small.
>>> The idea is to avoid introducing these extra complexity out of QEMU, keeping
>> vhost adapter simple. As vhost protocol is de factor standard, it leverages kernel
>> device driver to provide the diversity. Changing once in QEMU, then it supports
>> multi-vendor devices whose drivers naturally providing kernel driver there.
>>> If QEMU is going to build a user space driver framework there, we're open mind
>> on that, even leveraging DPDK as the underlay library. Looking forward to more
>> others' comments from community.
>>> Steve
>> Dependency on a kernel driver is fine IMHO. It's the dependency on a DPDK
>> backend that makes people unhappy, since the functionality in question is setup-
>> time only.
> Agreed, we don't see dependency on kernel driver is a problem.
At engineering level, kernel driver is harder than userspace driver.
>
> mdev based vhost backend (this patch set) is independent with vhost-user extension patch set. In fact, there're a few vhost-user providers, DPDK librte_vhost is one of them. FD.IO/VPP and snabbswitch have their own vhost-user providers. So I can't agree on vhost-user extension patch depends on DPDK backend. But anyway, that's the topic of another mail thread.
>
Well we can treat mdev as another kind of transport of vhost-user. And
technically we can even implement a relay mdev than forward vhost-user
messages to dpdk.
Thanks
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
^ permalink raw reply
* Re: [virtio-dev] Re: [RFC] vhost: introduce mdev based hardware vhost backend
From: Jason Wang @ 2018-04-11 2:08 UTC (permalink / raw)
To: Liang, Cunming, Paolo Bonzini, Bie, Tiwei
Cc: Duyck, Alexander H, virtio-dev@lists.oasis-open.org,
kvm@vger.kernel.org, mst@redhat.com, netdev@vger.kernel.org,
linux-kernel@vger.kernel.org,
virtualization@lists.linux-foundation.org, Wang, Xiao W,
ddutile@redhat.com, Tan, Jianfeng, Wang, Zhihong
In-Reply-To: <D0158A423229094DA7ABF71CF2FA0DA34E9222F4@SHSMSX104.ccr.corp.intel.com>
On 2018年04月10日 17:23, Liang, Cunming wrote:
>
>> -----Original Message-----
>> From: Paolo Bonzini [mailto:pbonzini@redhat.com]
>> Sent: Tuesday, April 10, 2018 3:52 PM
>> To: Bie, Tiwei <tiwei.bie@intel.com>; Jason Wang <jasowang@redhat.com>
>> Cc: mst@redhat.com; alex.williamson@redhat.com; ddutile@redhat.com;
>> Duyck, Alexander H <alexander.h.duyck@intel.com>; virtio-dev@lists.oasis-
>> open.org; linux-kernel@vger.kernel.org; kvm@vger.kernel.org;
>> virtualization@lists.linux-foundation.org; netdev@vger.kernel.org; Daly, Dan
>> <dan.daly@intel.com>; Liang, Cunming <cunming.liang@intel.com>; Wang,
>> Zhihong <zhihong.wang@intel.com>; Tan, Jianfeng <jianfeng.tan@intel.com>;
>> Wang, Xiao W <xiao.w.wang@intel.com>
>> Subject: Re: [virtio-dev] Re: [RFC] vhost: introduce mdev based hardware
>> vhost backend
>>
>> On 10/04/2018 06:57, Tiwei Bie wrote:
>>>> So you just move the abstraction layer from qemu to kernel, and you
>>>> still need different drivers in kernel for different device
>>>> interfaces of accelerators. This looks even more complex than leaving
>>>> it in qemu. As you said, another idea is to implement userspace vhost
>>>> backend for accelerators which seems easier and could co-work with
>>>> other parts of qemu without inventing new type of messages.
>>> I'm not quite sure. Do you think it's acceptable to add various vendor
>>> specific hardware drivers in QEMU?
>> I think so. We have vendor-specific quirks, and at some point there was an
>> idea of using quirks to implement (vendor-specific) live migration support for
>> assigned devices.
> Vendor-specific quirks of accessing VGA is a small portion. Other major portions are still handled by guest driver.
>
> While in this case, when saying various vendor specific drivers in QEMU, it says QEMU takes over and provides the entire user space device drivers. Some parts are even not relevant to vhost, they're basic device function enabling. Moreover, it could be different kinds of devices(network/block/...) under vhost. No matter # of vendors or # of types, total LOC is not small.
>
> The idea is to avoid introducing these extra complexity out of QEMU, keeping vhost adapter simple. As vhost protocol is de factor standard, it leverages kernel device driver to provide the diversity. Changing once in QEMU, then it supports multi-vendor devices whose drivers naturally providing kernel driver there.
Let me clarify my question, it's not qemu vs kenrel but userspace vs
kernel. It could be a library which could be linked to qemu. Doing it in
userspace have the following obvious advantages:
- attack surface was limited to userspace
- easier to be maintained (compared to kernel driver)
- easier to be extended without introducing new userspace/kernel interfaces
- not tied to a specific operating system
If we want to do it in the kernel, need to consider to unify code
between mdev device driver and generic driver. For net driver, maybe we
can even consider to do it on top of exist drivers.
>
> If QEMU is going to build a user space driver framework there, we're open mind on that, even leveraging DPDK as the underlay library. Looking forward to more others' comments from community.
I'm doing this now by implementing vhost inside qemu IOThreads. Hope I
can post RFC in few months.
Thanks
> Steve
>
>> Paolo
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
^ permalink raw reply
* Re: [virtio-dev] Re: [RFC] vhost: introduce mdev based hardware vhost backend
From: Stefan Hajnoczi @ 2018-04-11 2:01 UTC (permalink / raw)
To: Liang, Cunming
Cc: Duyck, Alexander H, virtio-dev@lists.oasis-open.org,
kvm@vger.kernel.org, mst@redhat.com, netdev@vger.kernel.org,
linux-kernel@vger.kernel.org,
virtualization@lists.linux-foundation.org, Wang, Xiao W,
ddutile@redhat.com, Tan, Jianfeng, Wang, Zhihong, Paolo Bonzini
In-Reply-To: <D0158A423229094DA7ABF71CF2FA0DA34E9222F4@SHSMSX104.ccr.corp.intel.com>
[-- Attachment #1.1: Type: text/plain, Size: 401 bytes --]
On Tue, Apr 10, 2018 at 09:23:53AM +0000, Liang, Cunming wrote:
> If QEMU is going to build a user space driver framework there, we're open mind on that, even leveraging DPDK as the underlay library. Looking forward to more others' comments from community.
There is already an NVMe VFIO driver in QEMU (see block/nvme.c). So in
principle there's no reason against userspace drivers in QEMU.
Stefan
[-- Attachment #1.2: signature.asc --]
[-- Type: application/pgp-signature, Size: 455 bytes --]
[-- Attachment #2: Type: text/plain, Size: 183 bytes --]
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
^ permalink raw reply
* RE: [virtio-dev] Re: [RFC] vhost: introduce mdev based hardware vhost backend
From: Tian, Kevin @ 2018-04-11 1:38 UTC (permalink / raw)
To: Liang, Cunming, Michael S. Tsirkin
Cc: Duyck, Alexander H, virtio-dev@lists.oasis-open.org,
kvm@vger.kernel.org, netdev@vger.kernel.org,
linux-kernel@vger.kernel.org,
virtualization@lists.linux-foundation.org, Wang, Xiao W,
ddutile@redhat.com, Tan, Jianfeng, Wang, Zhihong, Paolo Bonzini
In-Reply-To: <D0158A423229094DA7ABF71CF2FA0DA34E9226AF@SHSMSX104.ccr.corp.intel.com>
> From: Liang, Cunming
> Sent: Tuesday, April 10, 2018 10:24 PM
>
[...]
> >
> > As others said, we do not need to go overeboard. A couple of small
> vendor-
> > specific quirks in qemu isn't a big deal.
>
> It's quite challenge to identify it's small or not, there's no uniform metric.
>
> It's only dependent on QEMU itself, that's the obvious benefit. Tradeoff is
> to build the entire device driver. We don't object to do that in QEMU, but
> wanna make sure to understand the boundary size clearly.
>
It might be helpful if you can post some sample code using proposed
framework - then people have a clear feeling about what size is talked
about here and whether it falls into the concept of 'small quirks'.
Thanks
Kevin
^ permalink raw reply
* Re: [PATCH v29 1/4] mm: support reporting free page blocks
From: Wei Wang @ 2018-04-11 1:22 UTC (permalink / raw)
To: Michael S. Tsirkin, Andrew Morton
Cc: yang.zhang.wz, virtio-dev, riel, quan.xu0, kvm, nilal,
liliang.opensource, linux-kernel, mhocko, linux-mm, huangzhichao,
pbonzini, virtualization
In-Reply-To: <20180411022440-mutt-send-email-mst@kernel.org>
On 04/11/2018 07:25 AM, Michael S. Tsirkin wrote:
> On Tue, Apr 10, 2018 at 01:54:29PM -0700, Andrew Morton wrote:
>> On Tue, 10 Apr 2018 21:19:31 +0300 "Michael S. Tsirkin" <mst@redhat.com> wrote:
>>
>>> Andrew, were your questions answered? If yes could I bother you for an ack on this?
>>>
>> Still not very happy that readers are told that "this function may
>> sleep" when it clearly doesn't do so. If we wish to be able to change
>> it to sleep in the future then that should be mentioned. And even put a
>> might_sleep() in there, to catch people who didn't read the comments...
>>
>> Otherwise it looks OK.
> Oh, might_sleep with a comment explaining it's for the future sounds
> good to me. I queued this - Wei, could you post a patch on top pls?
>
I'm just thinking if it would be necessary to add another might_sleep,
because we've had a cond_resched there which has wrapped a __might_sleep.
Best,
Wei
^ permalink raw reply
* Re: [PATCH v2 0/2] vhost: fix vhost_vq_access_ok() log check
From: Stefan Hajnoczi @ 2018-04-11 1:21 UTC (permalink / raw)
To: David Miller
Cc: kvm, mst, netdev, syzkaller-bugs, linux-kernel, virtualization,
torvalds
In-Reply-To: <20180410.105043.1798364865868187298.davem@davemloft.net>
[-- Attachment #1.1: Type: text/plain, Size: 1182 bytes --]
On Tue, Apr 10, 2018 at 10:50:43AM -0400, David Miller wrote:
> From: Jason Wang <jasowang@redhat.com>
> Date: Tue, 10 Apr 2018 14:40:10 +0800
>
> > On 2018年04月10日 13:26, Stefan Hajnoczi wrote:
> >> v2:
> >> * Rewrote the conditional to make the vq access check clearer [Linus]
> >> * Added Patch 2 to make the return type consistent and harder to misuse
> >> * [Linus]
> >>
> >> The first patch fixes the vhost virtqueue access check which was
> >> recently
> >> broken. The second patch replaces the int return type with bool to
> >> prevent
> >> future bugs.
> >>
> >> Stefan Hajnoczi (2):
> >> vhost: fix vhost_vq_access_ok() log check
> >> vhost: return bool from *_access_ok() functions
> >>
> >> drivers/vhost/vhost.h | 4 +--
> >> drivers/vhost/vhost.c | 70
> >> ++++++++++++++++++++++++++-------------------------
> >> 2 files changed, 38 insertions(+), 36 deletions(-)
> >>
> >
> > Acked-by: Jason Wang <jasowang@redhat.com>
>
> This sereis doesn't apply cleanly to the net tree, please respin
> and add the appropriate Acks and Fixes: tags that Michael and Jason
> have provided.
Sorry, will fix.
Stefan
[-- Attachment #1.2: signature.asc --]
[-- Type: application/pgp-signature, Size: 455 bytes --]
[-- Attachment #2: Type: text/plain, Size: 183 bytes --]
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox