* Re: [RFC PATCH net-next v6 2/4] net: Introduce generic bypass module
From: Samudrala, Sridhar @ 2018-04-11 19:13 UTC (permalink / raw)
To: Jiri Pirko
Cc: alexander.h.duyck, virtio-dev, mst, kubakici, netdev,
virtualization, loseweigh, davem
In-Reply-To: <20180411155127.GQ2028@nanopsycho>
On 4/11/2018 8:51 AM, Jiri Pirko wrote:
> Tue, Apr 10, 2018 at 08:59:48PM CEST, sridhar.samudrala@intel.com wrote:
>> This provides a generic interface for paravirtual drivers to listen
>> for netdev register/unregister/link change events from pci ethernet
>> devices with the same MAC and takeover their datapath. The notifier and
>> event handling code is based on the existing netvsc implementation.
>>
>> It exposes 2 sets of interfaces to the paravirtual drivers.
>> 1. existing netvsc driver that uses 2 netdev model. In this model, no
>> master netdev is created. The paravirtual driver registers each bypass
>> instance along with a set of ops to manage the slave events.
>> bypass_master_register()
>> bypass_master_unregister()
>> 2. new virtio_net based solution that uses 3 netdev model. In this model,
>> the bypass module provides interfaces to create/destroy additional master
>> netdev and all the slave events are managed internally.
>> bypass_master_create()
>> bypass_master_destroy()
>>
>> Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
>> ---
>> include/linux/netdevice.h | 14 +
>> include/net/bypass.h | 96 ++++++
>> net/Kconfig | 18 +
>> net/core/Makefile | 1 +
>> net/core/bypass.c | 844 ++++++++++++++++++++++++++++++++++++++++++++++
>> 5 files changed, 973 insertions(+)
>> create mode 100644 include/net/bypass.h
>> create mode 100644 net/core/bypass.c
>>
>> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
>> index cf44503ea81a..587293728f70 100644
>> --- a/include/linux/netdevice.h
>> +++ b/include/linux/netdevice.h
>> @@ -1430,6 +1430,8 @@ enum netdev_priv_flags {
>> IFF_PHONY_HEADROOM = 1<<24,
>> IFF_MACSEC = 1<<25,
>> IFF_NO_RX_HANDLER = 1<<26,
>> + IFF_BYPASS = 1 << 27,
>> + IFF_BYPASS_SLAVE = 1 << 28,
> I wonder, why you don't follow the existing coding style... Also, please
> add these to into the comment above.
To avoid checkpatch warnings. If it is OK to ignore these warnings, I can switch back
to the existing coding style to be consistent.
>
>
>> };
>>
>> #define IFF_802_1Q_VLAN IFF_802_1Q_VLAN
>> @@ -1458,6 +1460,8 @@ enum netdev_priv_flags {
>> #define IFF_RXFH_CONFIGURED IFF_RXFH_CONFIGURED
>> #define IFF_MACSEC IFF_MACSEC
>> #define IFF_NO_RX_HANDLER IFF_NO_RX_HANDLER
>> +#define IFF_BYPASS IFF_BYPASS
>> +#define IFF_BYPASS_SLAVE IFF_BYPASS_SLAVE
>>
>> /**
>> * struct net_device - The DEVICE structure.
>> @@ -4308,6 +4312,16 @@ static inline bool netif_is_rxfh_configured(const struct net_device *dev)
>> return dev->priv_flags & IFF_RXFH_CONFIGURED;
>> }
>>
>> +static inline bool netif_is_bypass_master(const struct net_device *dev)
>> +{
>> + return dev->priv_flags & IFF_BYPASS;
>> +}
>> +
>> +static inline bool netif_is_bypass_slave(const struct net_device *dev)
>> +{
>> + return dev->priv_flags & IFF_BYPASS_SLAVE;
>> +}
>> +
>> /* This device needs to keep skb dst for qdisc enqueue or ndo_start_xmit() */
>> static inline void netif_keep_dst(struct net_device *dev)
>> {
>> diff --git a/include/net/bypass.h b/include/net/bypass.h
>> new file mode 100644
>> index 000000000000..86b02cb894cf
>> --- /dev/null
>> +++ b/include/net/bypass.h
>> @@ -0,0 +1,96 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +/* Copyright (c) 2018, Intel Corporation. */
>> +
>> +#ifndef _NET_BYPASS_H
>> +#define _NET_BYPASS_H
>> +
>> +#include <linux/netdevice.h>
>> +
>> +struct bypass_ops {
>> + int (*slave_pre_register)(struct net_device *slave_netdev,
>> + struct net_device *bypass_netdev);
>> + int (*slave_join)(struct net_device *slave_netdev,
>> + struct net_device *bypass_netdev);
>> + int (*slave_pre_unregister)(struct net_device *slave_netdev,
>> + struct net_device *bypass_netdev);
>> + int (*slave_release)(struct net_device *slave_netdev,
>> + struct net_device *bypass_netdev);
>> + int (*slave_link_change)(struct net_device *slave_netdev,
>> + struct net_device *bypass_netdev);
>> + rx_handler_result_t (*handle_frame)(struct sk_buff **pskb);
>> +};
>> +
>> +struct bypass_master {
>> + struct list_head list;
>> + struct net_device __rcu *bypass_netdev;
>> + struct bypass_ops __rcu *ops;
>> +};
>> +
>> +/* bypass state */
>> +struct bypass_info {
>> + /* passthru netdev with same MAC */
>> + struct net_device __rcu *active_netdev;
> You still use "active"/"backup" names which is highly misleading as
> it has completely different meaning that in bond for example.
> I noted that in my previous review already. Please change it.
I guess the issue is with only the 'active' name. 'backup' should be fine as it also
matches with the BACKUP feature bit we are adding to virtio_net.
With regards to alternate names for 'active', you suggested 'stolen', but i
am not too happy with it.
netvsc uses vf_netdev, are you OK with this? Or another option is 'passthru'
>
>
>> +
>> + /* virtio_net netdev */
>> + struct net_device __rcu *backup_netdev;
>> +
>> + /* active netdev stats */
>> + struct rtnl_link_stats64 active_stats;
>> +
>> + /* backup netdev stats */
>> + struct rtnl_link_stats64 backup_stats;
>> +
>> + /* aggregated stats */
>> + struct rtnl_link_stats64 bypass_stats;
>> +
>> + /* spinlock while updating stats */
>> + spinlock_t stats_lock;
>> +};
>> +
>> +#if IS_ENABLED(CONFIG_NET_BYPASS)
>> +
>> +int bypass_master_create(struct net_device *backup_netdev,
>> + struct bypass_master **pbypass_master);
>> +void bypass_master_destroy(struct bypass_master *bypass_master);
>> +
>> +int bypass_master_register(struct net_device *dev, struct bypass_ops *ops,
>> + struct bypass_master **pbypass_master);
>> +void bypass_master_unregister(struct bypass_master *bypass_master);
>> +
>> +int bypass_slave_unregister(struct net_device *slave_netdev);
>> +
>> +#else
>> +
>> +static inline
>> +int bypass_master_create(struct net_device *backup_netdev,
>> + struct bypass_master **pbypass_master);
>> +{
>> + return 0;
>> +}
>> +
>> +static inline
>> +void bypass_master_destroy(struct bypass_master *bypass_master)
>> +{
>> +}
>> +
>> +static inline
>> +int bypass_master_register(struct net_device *dev, struct bypass_ops *ops,
>> + struct pbypass_master **pbypass_master);
>> +{
>> + return 0;
>> +}
>> +
>> +static inline
>> +void bypass_master_unregister(struct bypass_master *bypass_master)
>> +{
>> +}
>> +
>> +static inline
>> +int bypass_slave_unregister(struct net_device *slave_netdev)
>> +{
>> + return 0;
>> +}
>> +
>> +#endif
>> +
>> +#endif /* _NET_BYPASS_H */
>> diff --git a/net/Kconfig b/net/Kconfig
>> index 0428f12c25c2..994445f4a96a 100644
>> --- a/net/Kconfig
>> +++ b/net/Kconfig
>> @@ -423,6 +423,24 @@ config MAY_USE_DEVLINK
>> on MAY_USE_DEVLINK to ensure they do not cause link errors when
>> devlink is a loadable module and the driver using it is built-in.
>>
>> +config NET_BYPASS
>> + tristate "Bypass interface"
>> + ---help---
>> + This provides a generic interface for paravirtual drivers to listen
>> + for netdev register/unregister/link change events from pci ethernet
>> + devices with the same MAC and takeover their datapath. This also
>> + enables live migration of a VM with direct attached VF by failing
>> + over to the paravirtual datapath when the VF is unplugged.
>> +
>> +config MAY_USE_BYPASS
>> + tristate
>> + default m if NET_BYPASS=m
>> + default y if NET_BYPASS=y || NET_BYPASS=n
>> + help
>> + Drivers using the bypass infrastructure should have a dependency
>> + on MAY_USE_BYPASS to ensure they do not cause link errors when
>> + bypass is a loadable module and the driver using it is built-in.
>> +
>> endif # if NET
>>
>> # Used by archs to tell that they support BPF JIT compiler plus which flavour.
>> diff --git a/net/core/Makefile b/net/core/Makefile
>> index 6dbbba8c57ae..a9727ed1c8fc 100644
>> --- a/net/core/Makefile
>> +++ b/net/core/Makefile
>> @@ -30,3 +30,4 @@ obj-$(CONFIG_DST_CACHE) += dst_cache.o
>> obj-$(CONFIG_HWBM) += hwbm.o
>> obj-$(CONFIG_NET_DEVLINK) += devlink.o
>> obj-$(CONFIG_GRO_CELLS) += gro_cells.o
>> +obj-$(CONFIG_NET_BYPASS) += bypass.o
>> diff --git a/net/core/bypass.c b/net/core/bypass.c
>> new file mode 100644
>> index 000000000000..b5b9cb554c3f
>> --- /dev/null
>> +++ b/net/core/bypass.c
>> @@ -0,0 +1,844 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +/* Copyright (c) 2018, Intel Corporation. */
>> +
>> +/* A common module to handle registrations and notifications for paravirtual
>> + * drivers to enable accelerated datapath and support VF live migration.
>> + *
>> + * The notifier and event handling code is based on netvsc driver.
>> + */
>> +
>> +#include <linux/netdevice.h>
>> +#include <linux/etherdevice.h>
>> +#include <linux/ethtool.h>
>> +#include <linux/module.h>
>> +#include <linux/slab.h>
>> +#include <linux/netdevice.h>
>> +#include <linux/netpoll.h>
>> +#include <linux/rtnetlink.h>
>> +#include <linux/if_vlan.h>
>> +#include <linux/pci.h>
>> +#include <net/sch_generic.h>
>> +#include <uapi/linux/if_arp.h>
>> +#include <net/bypass.h>
>> +
>> +static LIST_HEAD(bypass_master_list);
>> +static DEFINE_SPINLOCK(bypass_lock);
>> +
>> +static int bypass_slave_pre_register(struct net_device *slave_netdev,
>> + struct net_device *bypass_netdev,
>> + struct bypass_ops *bypass_ops)
>> +{
>> + struct bypass_info *bi;
>> + bool backup;
>> +
>> + if (bypass_ops) {
>> + if (!bypass_ops->slave_pre_register)
>> + return -EINVAL;
>> +
>> + return bypass_ops->slave_pre_register(slave_netdev,
>> + bypass_netdev);
>> + }
>> +
>> + bi = netdev_priv(bypass_netdev);
>> + backup = (slave_netdev->dev.parent == bypass_netdev->dev.parent);
>> + if (backup ? rtnl_dereference(bi->backup_netdev) :
>> + rtnl_dereference(bi->active_netdev)) {
>> + netdev_err(bypass_netdev, "%s attempting to register as slave dev when %s already present\n",
>> + slave_netdev->name, backup ? "backup" : "active");
>> + return -EEXIST;
>> + }
>> +
>> + /* Avoid non pci devices as active netdev */
>> + if (!backup && (!slave_netdev->dev.parent ||
>> + !dev_is_pci(slave_netdev->dev.parent)))
>> + return -EINVAL;
>> +
>> + return 0;
>> +}
>> +
>> +static int bypass_slave_join(struct net_device *slave_netdev,
>> + struct net_device *bypass_netdev,
>> + struct bypass_ops *bypass_ops)
>> +{
>> + struct bypass_info *bi;
>> + bool backup;
>> +
>> + if (bypass_ops) {
>> + if (!bypass_ops->slave_join)
>> + return -EINVAL;
>> +
>> + return bypass_ops->slave_join(slave_netdev, bypass_netdev);
>> + }
>> +
>> + bi = netdev_priv(bypass_netdev);
>> + backup = (slave_netdev->dev.parent == bypass_netdev->dev.parent);
>> +
>> + dev_hold(slave_netdev);
>> +
>> + if (backup) {
>> + rcu_assign_pointer(bi->backup_netdev, slave_netdev);
>> + dev_get_stats(bi->backup_netdev, &bi->backup_stats);
>> + } else {
>> + rcu_assign_pointer(bi->active_netdev, slave_netdev);
>> + dev_get_stats(bi->active_netdev, &bi->active_stats);
>> + bypass_netdev->min_mtu = slave_netdev->min_mtu;
>> + bypass_netdev->max_mtu = slave_netdev->max_mtu;
>> + }
>> +
>> + netdev_info(bypass_netdev, "bypass slave:%s joined\n",
>> + slave_netdev->name);
>> +
>> + return 0;
>> +}
>> +
>> +/* Called when slave dev is injecting data into network stack.
>> + * Change the associated network device from lower dev to virtio.
>> + * note: already called with rcu_read_lock
>> + */
>> +static rx_handler_result_t bypass_handle_frame(struct sk_buff **pskb)
>> +{
>> + struct sk_buff *skb = *pskb;
>> + struct net_device *ndev = rcu_dereference(skb->dev->rx_handler_data);
>> +
>> + skb->dev = ndev;
>> +
>> + return RX_HANDLER_ANOTHER;
>> +}
>> +
>> +static struct net_device *bypass_master_get_bymac(u8 *mac,
>> + struct bypass_ops **ops)
>> +{
>> + struct bypass_master *bypass_master;
>> + struct net_device *bypass_netdev;
>> +
>> + spin_lock(&bypass_lock);
>> + list_for_each_entry(bypass_master, &bypass_master_list, list) {
> As I wrote the last time, you don't need this list, spinlock.
> You can do just something like:
> for_each_net(net) {
> for_each_netdev(net, dev) {
> if (netif_is_bypass_master(dev)) {
This function returns the upper netdev as well as the ops associated
with that netdev.
bypass_master_list is a list of 'struct bypass_master' that associates
'bypass_netdev' with 'bypass_ops' and gets added via bypass_master_register().
We need 'ops' only to support the 2 netdev model of netvsc. ops will be
NULL for 3-netdev model.
>
>
>
>
>> + bypass_netdev = rcu_dereference(bypass_master->bypass_netdev);
>> + if (ether_addr_equal(bypass_netdev->perm_addr, mac)) {
>> + *ops = rcu_dereference(bypass_master->ops);
> I don't see how rcu_dereference is ok here.
> 1) I don't see rcu_read_lock taken
> 2) Looks like bypass_master->ops has the same value across the whole
> existence.
We hold rtnl_lock(), i think i need to change this to rtnl_dereference.
Yes. ops doesn't change.
>
>
>> + spin_unlock(&bypass_lock);
>> + return bypass_netdev;
>> + }
>> + }
>> + spin_unlock(&bypass_lock);
>> + return NULL;
>> +}
>> +
>> +static int bypass_slave_register(struct net_device *slave_netdev)
>> +{
>> + struct net_device *bypass_netdev;
>> + struct bypass_ops *bypass_ops;
>> + int ret, orig_mtu;
>> +
>> + ASSERT_RTNL();
>> +
>> + bypass_netdev = bypass_master_get_bymac(slave_netdev->perm_addr,
>> + &bypass_ops);
> For master, could you use word "master" in the variables so it is clear?
> Also, "dev" is fine instead of "netdev".
> Something like "bpmaster_dev"
bypass_master is of type struct bypass_master, bypass_netdev is of type struct net_device.
I can change all _netdev suffixes to _dev to make the names shorter.
>
>
>> + if (!bypass_netdev)
>> + goto done;
>> +
>> + ret = bypass_slave_pre_register(slave_netdev, bypass_netdev,
>> + bypass_ops);
>> + if (ret != 0)
> Just "if (ret)" will do. You have this on more places.
OK.
>
>
>> + goto done;
>> +
>> + ret = netdev_rx_handler_register(slave_netdev,
>> + bypass_ops ? bypass_ops->handle_frame :
>> + bypass_handle_frame, bypass_netdev);
>> + if (ret != 0) {
>> + netdev_err(slave_netdev, "can not register bypass rx handler (err = %d)\n",
>> + ret);
>> + goto done;
>> + }
>> +
>> + ret = netdev_upper_dev_link(slave_netdev, bypass_netdev, NULL);
>> + if (ret != 0) {
>> + netdev_err(slave_netdev, "can not set master device %s (err = %d)\n",
>> + bypass_netdev->name, ret);
>> + goto upper_link_failed;
>> + }
>> +
>> + slave_netdev->priv_flags |= IFF_BYPASS_SLAVE;
>> +
>> + if (netif_running(bypass_netdev)) {
>> + ret = dev_open(slave_netdev);
>> + if (ret && (ret != -EBUSY)) {
>> + netdev_err(bypass_netdev, "Opening slave %s failed ret:%d\n",
>> + slave_netdev->name, ret);
>> + goto err_interface_up;
>> + }
>> + }
>> +
>> + /* Align MTU of slave with master */
>> + orig_mtu = slave_netdev->mtu;
>> + ret = dev_set_mtu(slave_netdev, bypass_netdev->mtu);
>> + if (ret != 0) {
>> + netdev_err(bypass_netdev, "unable to change mtu of %s to %u register failed\n",
>> + slave_netdev->name, bypass_netdev->mtu);
>> + goto err_set_mtu;
>> + }
>> +
>> + ret = bypass_slave_join(slave_netdev, bypass_netdev, bypass_ops);
>> + if (ret != 0)
>> + goto err_join;
>> +
>> + call_netdevice_notifiers(NETDEV_JOIN, slave_netdev);
>> +
>> + netdev_info(bypass_netdev, "bypass slave:%s registered\n",
>> + slave_netdev->name);
>> +
>> + goto done;
>> +
>> +err_join:
>> + dev_set_mtu(slave_netdev, orig_mtu);
>> +err_set_mtu:
>> + dev_close(slave_netdev);
>> +err_interface_up:
>> + netdev_upper_dev_unlink(slave_netdev, bypass_netdev);
>> + slave_netdev->priv_flags &= ~IFF_BYPASS_SLAVE;
>> +upper_link_failed:
>> + netdev_rx_handler_unregister(slave_netdev);
>> +done:
>> + return NOTIFY_DONE;
>> +}
>> +
>> +static int bypass_slave_pre_unregister(struct net_device *slave_netdev,
>> + struct net_device *bypass_netdev,
>> + struct bypass_ops *bypass_ops)
>> +{
>> + struct net_device *backup_netdev, *active_netdev;
>> + struct bypass_info *bi;
>> +
>> + if (bypass_ops) {
>> + if (!bypass_ops->slave_pre_unregister)
>> + return -EINVAL;
>> +
>> + return bypass_ops->slave_pre_unregister(slave_netdev,
>> + bypass_netdev);
>> + }
>> +
>> + bi = netdev_priv(bypass_netdev);
>> + active_netdev = rtnl_dereference(bi->active_netdev);
>> + backup_netdev = rtnl_dereference(bi->backup_netdev);
>> +
>> + if (slave_netdev != active_netdev && slave_netdev != backup_netdev)
>> + return -EINVAL;
>> +
>> + return 0;
>> +}
>> +
>> +static int bypass_slave_release(struct net_device *slave_netdev,
>> + struct net_device *bypass_netdev,
>> + struct bypass_ops *bypass_ops)
>> +{
>> + struct net_device *backup_netdev, *active_netdev;
>> + struct bypass_info *bi;
>> +
>> + if (bypass_ops) {
>> + if (!bypass_ops->slave_release)
>> + return -EINVAL;
> I think it would be good to make the API to the driver more strict and
> have a separate set of ops for "active" and "backup" netdevices.
> That should stop people thinking about extending this to more slaves in
> the future.
We have checks in slave_pre_register() that allows only 1 'backup' and 1
'active' slave.
>
>
>
>> +
>> + return bypass_ops->slave_release(slave_netdev, bypass_netdev);
>> + }
>> +
>> + bi = netdev_priv(bypass_netdev);
>> + active_netdev = rtnl_dereference(bi->active_netdev);
>> + backup_netdev = rtnl_dereference(bi->backup_netdev);
>> +
>> + if (slave_netdev == backup_netdev) {
>> + RCU_INIT_POINTER(bi->backup_netdev, NULL);
>> + } else {
>> + RCU_INIT_POINTER(bi->active_netdev, NULL);
>> + if (backup_netdev) {
>> + bypass_netdev->min_mtu = backup_netdev->min_mtu;
>> + bypass_netdev->max_mtu = backup_netdev->max_mtu;
>> + }
>> + }
>> +
>> + dev_put(slave_netdev);
>> +
>> + netdev_info(bypass_netdev, "bypass slave:%s released\n",
>> + slave_netdev->name);
>> +
>> + return 0;
>> +}
>> +
>> +int bypass_slave_unregister(struct net_device *slave_netdev)
>> +{
>> + struct net_device *bypass_netdev;
>> + struct bypass_ops *bypass_ops;
>> + int ret;
>> +
>> + if (!netif_is_bypass_slave(slave_netdev))
>> + goto done;
>> +
>> + ASSERT_RTNL();
>> +
>> + bypass_netdev = bypass_master_get_bymac(slave_netdev->perm_addr,
>> + &bypass_ops);
>> + if (!bypass_netdev)
>> + goto done;
>> +
>> + ret = bypass_slave_pre_unregister(slave_netdev, bypass_netdev,
>> + bypass_ops);
>> + if (ret != 0)
>> + goto done;
>> +
>> + netdev_rx_handler_unregister(slave_netdev);
>> + netdev_upper_dev_unlink(slave_netdev, bypass_netdev);
>> + slave_netdev->priv_flags &= ~IFF_BYPASS_SLAVE;
>> +
>> + bypass_slave_release(slave_netdev, bypass_netdev, bypass_ops);
>> +
>> + netdev_info(bypass_netdev, "bypass slave:%s unregistered\n",
>> + slave_netdev->name);
>> +
>> +done:
>> + return NOTIFY_DONE;
>> +}
>> +EXPORT_SYMBOL_GPL(bypass_slave_unregister);
>> +
>> +static bool bypass_xmit_ready(struct net_device *dev)
>> +{
>> + return netif_running(dev) && netif_carrier_ok(dev);
>> +}
>> +
>> +static int bypass_slave_link_change(struct net_device *slave_netdev)
>> +{
>> + struct net_device *bypass_netdev, *active_netdev, *backup_netdev;
>> + struct bypass_ops *bypass_ops;
>> + struct bypass_info *bi;
>> +
>> + if (!netif_is_bypass_slave(slave_netdev))
>> + goto done;
>> +
>> + ASSERT_RTNL();
>> +
>> + bypass_netdev = bypass_master_get_bymac(slave_netdev->perm_addr,
>> + &bypass_ops);
>> + if (!bypass_netdev)
>> + goto done;
>> +
>> + if (bypass_ops) {
>> + if (!bypass_ops->slave_link_change)
>> + goto done;
>> +
>> + return bypass_ops->slave_link_change(slave_netdev,
>> + bypass_netdev);
>> + }
>> +
>> + if (!netif_running(bypass_netdev))
>> + return 0;
>> +
>> + bi = netdev_priv(bypass_netdev);
>> +
>> + active_netdev = rtnl_dereference(bi->active_netdev);
>> + backup_netdev = rtnl_dereference(bi->backup_netdev);
>> +
>> + if (slave_netdev != active_netdev && slave_netdev != backup_netdev)
>> + goto done;
> You don't need this check. "if (!netif_is_bypass_slave(slave_netdev))"
> above is enough.
I think we need this check to not allow events from a slave that is not
attached to this master but has the same MAC.
>
>
>> +
>> + if ((active_netdev && bypass_xmit_ready(active_netdev)) ||
>> + (backup_netdev && bypass_xmit_ready(backup_netdev))) {
>> + netif_carrier_on(bypass_netdev);
>> + netif_tx_wake_all_queues(bypass_netdev);
>> + } else {
>> + netif_carrier_off(bypass_netdev);
>> + netif_tx_stop_all_queues(bypass_netdev);
>> + }
>> +
>> +done:
>> + return NOTIFY_DONE;
>> +}
>> +
>> +static bool bypass_validate_event_dev(struct net_device *dev)
>> +{
>> + /* Skip parent events */
>> + if (netif_is_bypass_master(dev))
>> + return false;
>> +
>> + /* Avoid non-Ethernet type devices */
>> + if (dev->type != ARPHRD_ETHER)
>> + return false;
>> +
>> + /* Avoid Vlan dev with same MAC registering as VF */
>> + if (is_vlan_dev(dev))
>> + return false;
>> +
>> + /* Avoid Bonding master dev with same MAC registering as slave dev */
>> + if ((dev->priv_flags & IFF_BONDING) && (dev->flags & IFF_MASTER))
> Yeah, this is certainly incorrect. One thing is, you should be using the
> helpers netif_is_bond_master().
> But what about the rest? macsec, macvlan, team, bridge, ovs and others?
>
> You need to do it not by blacklisting, but with whitelisting. You need
> to whitelist VF devices. My port flavours patchset might help with this.
May be i can use netdev_has_lower_dev() helper to make sure that the slave
device is not an upper dev.
Can you point to your port flavours patchset? Is it upstream?
>
>
>> + return false;
>> +
>> + return true;
>> +}
>> +
>> +static int
>> +bypass_event(struct notifier_block *this, unsigned long event, void *ptr)
>> +{
>> + struct net_device *event_dev = netdev_notifier_info_to_dev(ptr);
>> +
>> + if (!bypass_validate_event_dev(event_dev))
>> + return NOTIFY_DONE;
>> +
>> + switch (event) {
>> + case NETDEV_REGISTER:
>> + return bypass_slave_register(event_dev);
>> + case NETDEV_UNREGISTER:
>> + return bypass_slave_unregister(event_dev);
>> + case NETDEV_UP:
>> + case NETDEV_DOWN:
>> + case NETDEV_CHANGE:
>> + return bypass_slave_link_change(event_dev);
>> + default:
>> + return NOTIFY_DONE;
>> + }
>> +}
>> +
>> +static struct notifier_block bypass_notifier = {
>> + .notifier_call = bypass_event,
>> +};
>> +
>> +int bypass_open(struct net_device *dev)
>> +{
>> + struct bypass_info *bi = netdev_priv(dev);
>> + struct net_device *active_netdev, *backup_netdev;
>> + int err;
>> +
>> + netif_carrier_off(dev);
>> + netif_tx_wake_all_queues(dev);
>> +
>> + active_netdev = rtnl_dereference(bi->active_netdev);
>> + if (active_netdev) {
>> + err = dev_open(active_netdev);
>> + if (err)
>> + goto err_active_open;
>> + }
>> +
>> + backup_netdev = rtnl_dereference(bi->backup_netdev);
>> + if (backup_netdev) {
>> + err = dev_open(backup_netdev);
>> + if (err)
>> + goto err_backup_open;
>> + }
>> +
>> + return 0;
>> +
>> +err_backup_open:
>> + dev_close(active_netdev);
>> +err_active_open:
>> + netif_tx_disable(dev);
>> + return err;
>> +}
>> +EXPORT_SYMBOL_GPL(bypass_open);
>> +
>> +int bypass_close(struct net_device *dev)
>> +{
>> + struct bypass_info *vi = netdev_priv(dev);
> This should be probably "bi"
Yes.
>
>
>> + struct net_device *slave_netdev;
>> +
>> + netif_tx_disable(dev);
>> +
>> + slave_netdev = rtnl_dereference(vi->active_netdev);
>> + if (slave_netdev)
>> + dev_close(slave_netdev);
>> +
>> + slave_netdev = rtnl_dereference(vi->backup_netdev);
>> + if (slave_netdev)
>> + dev_close(slave_netdev);
>> +
>> + return 0;
>> +}
>> +EXPORT_SYMBOL_GPL(bypass_close);
>> +
>> +static netdev_tx_t bypass_drop_xmit(struct sk_buff *skb, struct net_device *dev)
>> +{
>> + atomic_long_inc(&dev->tx_dropped);
>> + dev_kfree_skb_any(skb);
>> + return NETDEV_TX_OK;
>> +}
>> +
>> +netdev_tx_t bypass_start_xmit(struct sk_buff *skb, struct net_device *dev)
>> +{
>> + struct bypass_info *bi = netdev_priv(dev);
> If you rename the other variable to "bpmaster_dev", it would be nice to
> rename this to bpinfo or something more descriptive. "bi" is too short
> to know what that is right away.
Will rename bypass_netdev to bypass_dev. bypass indicates that it is
an upper master dev.
>
>
>> + struct net_device *xmit_dev;
> Don't mix "dev" and "netdev" in one .c file. Just use "dev" for all.
OK.
>
>
>
>> +
>> + /* Try xmit via active netdev followed by backup netdev */
>> + xmit_dev = rcu_dereference_bh(bi->active_netdev);
>> + if (!xmit_dev || !bypass_xmit_ready(xmit_dev)) {
>> + xmit_dev = rcu_dereference_bh(bi->backup_netdev);
>> + if (!xmit_dev || !bypass_xmit_ready(xmit_dev))
>> + return bypass_drop_xmit(skb, dev);
>> + }
>> +
>> + skb->dev = xmit_dev;
>> + skb->queue_mapping = qdisc_skb_cb(skb)->slave_dev_queue_mapping;
>> +
>> + return dev_queue_xmit(skb);
>> +}
>> +EXPORT_SYMBOL_GPL(bypass_start_xmit);
>> +
>> +u16 bypass_select_queue(struct net_device *dev, struct sk_buff *skb,
>> + void *accel_priv, select_queue_fallback_t fallback)
>> +{
>> + /* This helper function exists to help dev_pick_tx get the correct
>> + * destination queue. Using a helper function skips a call to
>> + * skb_tx_hash and will put the skbs in the queue we expect on their
>> + * way down to the bonding driver.
>> + */
>> + u16 txq = skb_rx_queue_recorded(skb) ? skb_get_rx_queue(skb) : 0;
>> +
>> + /* Save the original txq to restore before passing to the driver */
>> + qdisc_skb_cb(skb)->slave_dev_queue_mapping = skb->queue_mapping;
>> +
>> + if (unlikely(txq >= dev->real_num_tx_queues)) {
>> + do {
>> + txq -= dev->real_num_tx_queues;
>> + } while (txq >= dev->real_num_tx_queues);
>> + }
>> +
>> + return txq;
>> +}
>> +EXPORT_SYMBOL_GPL(bypass_select_queue);
>> +
>> +/* fold stats, assuming all rtnl_link_stats64 fields are u64, but
>> + * that some drivers can provide 32bit values only.
>> + */
>> +static void bypass_fold_stats(struct rtnl_link_stats64 *_res,
>> + const struct rtnl_link_stats64 *_new,
>> + const struct rtnl_link_stats64 *_old)
>> +{
>> + const u64 *new = (const u64 *)_new;
>> + const u64 *old = (const u64 *)_old;
>> + u64 *res = (u64 *)_res;
>> + int i;
>> +
>> + for (i = 0; i < sizeof(*_res) / sizeof(u64); i++) {
>> + u64 nv = new[i];
>> + u64 ov = old[i];
>> + s64 delta = nv - ov;
>> +
>> + /* detects if this particular field is 32bit only */
>> + if (((nv | ov) >> 32) == 0)
>> + delta = (s64)(s32)((u32)nv - (u32)ov);
>> +
>> + /* filter anomalies, some drivers reset their stats
>> + * at down/up events.
>> + */
>> + if (delta > 0)
>> + res[i] += delta;
>> + }
>> +}
>> +
>> +void bypass_get_stats(struct net_device *dev, struct rtnl_link_stats64 *stats)
>> +{
>> + struct bypass_info *bi = netdev_priv(dev);
> You can WARN_ON and return in case the dev is not bypass master, just
> to catch buggy drivers. Same with other helpers.
I can make this static and not export this helper as well as all
bypass_netdev ops.
>
>
>> + const struct rtnl_link_stats64 *new;
>> + struct rtnl_link_stats64 temp;
>> + struct net_device *slave_netdev;
>> +
>> + spin_lock(&bi->stats_lock);
>> + memcpy(stats, &bi->bypass_stats, sizeof(*stats));
>> +
>> + rcu_read_lock();
>> +
>> + slave_netdev = rcu_dereference(bi->active_netdev);
>> + if (slave_netdev) {
>> + new = dev_get_stats(slave_netdev, &temp);
>> + bypass_fold_stats(stats, new, &bi->active_stats);
>> + memcpy(&bi->active_stats, new, sizeof(*new));
>> + }
>> +
>> + slave_netdev = rcu_dereference(bi->backup_netdev);
>> + if (slave_netdev) {
>> + new = dev_get_stats(slave_netdev, &temp);
>> + bypass_fold_stats(stats, new, &bi->backup_stats);
>> + memcpy(&bi->backup_stats, new, sizeof(*new));
>> + }
>> +
>> + rcu_read_unlock();
>> +
>> + memcpy(&bi->bypass_stats, stats, sizeof(*stats));
>> + spin_unlock(&bi->stats_lock);
>> +}
>> +EXPORT_SYMBOL_GPL(bypass_get_stats);
>> +
>> +int bypass_change_mtu(struct net_device *dev, int new_mtu)
>> +{
>> + struct bypass_info *bi = netdev_priv(dev);
>> + struct net_device *active_netdev, *backup_netdev;
>> + int ret = 0;
> Pointless initialization.
>
>
>> +
>> + active_netdev = rcu_dereference(bi->active_netdev);
>> + if (active_netdev) {
>> + ret = dev_set_mtu(active_netdev, new_mtu);
>> + if (ret)
>> + return ret;
>> + }
>> +
>> + backup_netdev = rcu_dereference(bi->backup_netdev);
>> + if (backup_netdev) {
>> + ret = dev_set_mtu(backup_netdev, new_mtu);
>> + if (ret) {
>> + dev_set_mtu(active_netdev, dev->mtu);
>> + return ret;
>> + }
>> + }
>> +
>> + dev->mtu = new_mtu;
>> + return 0;
>> +}
>> +EXPORT_SYMBOL_GPL(bypass_change_mtu);
>> +
>> +void bypass_set_rx_mode(struct net_device *dev)
>> +{
>> + struct bypass_info *bi = netdev_priv(dev);
>> + struct net_device *slave_netdev;
>> +
>> + rcu_read_lock();
>> +
>> + slave_netdev = rcu_dereference(bi->active_netdev);
>> + if (slave_netdev) {
>> + dev_uc_sync_multiple(slave_netdev, dev);
>> + dev_mc_sync_multiple(slave_netdev, dev);
>> + }
>> +
>> + slave_netdev = rcu_dereference(bi->backup_netdev);
>> + if (slave_netdev) {
>> + dev_uc_sync_multiple(slave_netdev, dev);
>> + dev_mc_sync_multiple(slave_netdev, dev);
>> + }
>> +
>> + rcu_read_unlock();
>> +}
>> +EXPORT_SYMBOL_GPL(bypass_set_rx_mode);
>> +
>> +static const struct net_device_ops bypass_netdev_ops = {
>> + .ndo_open = bypass_open,
>> + .ndo_stop = bypass_close,
>> + .ndo_start_xmit = bypass_start_xmit,
>> + .ndo_select_queue = bypass_select_queue,
>> + .ndo_get_stats64 = bypass_get_stats,
>> + .ndo_change_mtu = bypass_change_mtu,
>> + .ndo_set_rx_mode = bypass_set_rx_mode,
>> + .ndo_validate_addr = eth_validate_addr,
>> + .ndo_features_check = passthru_features_check,
>> +};
>> +
>> +#define BYPASS_DRV_NAME "bypass"
>> +#define BYPASS_DRV_VERSION "0.1"
>> +
>> +static void bypass_ethtool_get_drvinfo(struct net_device *dev,
>> + struct ethtool_drvinfo *drvinfo)
>> +{
>> + strlcpy(drvinfo->driver, BYPASS_DRV_NAME, sizeof(drvinfo->driver));
>> + strlcpy(drvinfo->version, BYPASS_DRV_VERSION, sizeof(drvinfo->version));
>> +}
>> +
>> +int bypass_ethtool_get_link_ksettings(struct net_device *dev,
>> + struct ethtool_link_ksettings *cmd)
>> +{
>> + struct bypass_info *bi = netdev_priv(dev);
>> + struct net_device *slave_netdev;
>> +
>> + slave_netdev = rtnl_dereference(bi->active_netdev);
>> + if (!slave_netdev || !bypass_xmit_ready(slave_netdev)) {
>> + slave_netdev = rtnl_dereference(bi->backup_netdev);
>> + if (!slave_netdev || !bypass_xmit_ready(slave_netdev)) {
>> + cmd->base.duplex = DUPLEX_UNKNOWN;
>> + cmd->base.port = PORT_OTHER;
>> + cmd->base.speed = SPEED_UNKNOWN;
>> +
>> + return 0;
>> + }
>> + }
>> +
>> + return __ethtool_get_link_ksettings(slave_netdev, cmd);
>> +}
>> +EXPORT_SYMBOL_GPL(bypass_ethtool_get_link_ksettings);
>> +
>> +static const struct ethtool_ops bypass_ethtool_ops = {
>> + .get_drvinfo = bypass_ethtool_get_drvinfo,
>> + .get_link = ethtool_op_get_link,
>> + .get_link_ksettings = bypass_ethtool_get_link_ksettings,
>> +};
>> +
>> +static void bypass_register_existing_slave(struct net_device *bypass_netdev)
>> +{
>> + struct net *net = dev_net(bypass_netdev);
>> + struct net_device *dev;
>> +
>> + rtnl_lock();
>> + for_each_netdev(net, dev) {
>> + if (dev == bypass_netdev)
>> + continue;
>> + if (!bypass_validate_event_dev(dev))
>> + continue;
>> + if (ether_addr_equal(bypass_netdev->perm_addr, dev->perm_addr))
>> + bypass_slave_register(dev);
>> + }
>> + rtnl_unlock();
>> +}
>> +
>> +int bypass_master_register(struct net_device *dev, struct bypass_ops *ops,
>> + struct bypass_master **pbypass_master)
>> +{
>> + struct bypass_master *bypass_master;
>> +
>> + bypass_master = kzalloc(sizeof(*bypass_master), GFP_KERNEL);
>> + if (!bypass_master)
>> + return -ENOMEM;
>> +
>> + rcu_assign_pointer(bypass_master->ops, ops);
>> + dev_hold(dev);
>> + dev->priv_flags |= IFF_BYPASS;
>> + rcu_assign_pointer(bypass_master->bypass_netdev, dev);
>> +
>> + spin_lock(&bypass_lock);
>> + list_add_tail(&bypass_master->list, &bypass_master_list);
>> + spin_unlock(&bypass_lock);
>> +
>> + bypass_register_existing_slave(dev);
>> +
>> + *pbypass_master = bypass_master;
>> + return 0;
>> +}
>> +EXPORT_SYMBOL_GPL(bypass_master_register);
>> +
>> +void bypass_master_unregister(struct bypass_master *bypass_master)
>> +{
>> + struct net_device *bypass_netdev;
>> +
>> + bypass_netdev = rcu_dereference(bypass_master->bypass_netdev);
>> +
>> + bypass_netdev->priv_flags &= ~IFF_BYPASS;
>> + dev_put(bypass_netdev);
>> +
>> + spin_lock(&bypass_lock);
>> + list_del(&bypass_master->list);
>> + spin_unlock(&bypass_lock);
>> +
>> + kfree(bypass_master);
>> +}
>> +EXPORT_SYMBOL_GPL(bypass_master_unregister);
>> +
>> +int bypass_master_create(struct net_device *backup_netdev,
>> + struct bypass_master **pbypass_master)
>> +{
>> + struct device *dev = backup_netdev->dev.parent;
>> + struct net_device *bypass_netdev;
>> + int err;
>> +
>> + /* Alloc at least 2 queues, for now we are going with 16 assuming
>> + * that most devices being bonded won't have too many queues.
>> + */
>> + bypass_netdev = alloc_etherdev_mq(sizeof(struct bypass_info), 16);
>> + if (!bypass_netdev) {
>> + dev_err(dev, "Unable to allocate bypass_netdev!\n");
>> + return -ENOMEM;
>> + }
>> +
>> + dev_net_set(bypass_netdev, dev_net(backup_netdev));
>> + SET_NETDEV_DEV(bypass_netdev, dev);
>> +
>> + bypass_netdev->netdev_ops = &bypass_netdev_ops;
>> + bypass_netdev->ethtool_ops = &bypass_ethtool_ops;
>> +
>> + /* Initialize the device options */
>> + bypass_netdev->priv_flags |= IFF_UNICAST_FLT | IFF_NO_QUEUE;
>> + bypass_netdev->priv_flags &= ~(IFF_XMIT_DST_RELEASE |
>> + IFF_TX_SKB_SHARING);
>> +
>> + /* don't acquire bypass netdev's netif_tx_lock when transmitting */
>> + bypass_netdev->features |= NETIF_F_LLTX;
>> +
>> + /* Don't allow bypass devices to change network namespaces. */
>> + bypass_netdev->features |= NETIF_F_NETNS_LOCAL;
>> +
>> + bypass_netdev->hw_features = NETIF_F_HW_CSUM | NETIF_F_SG |
>> + NETIF_F_FRAGLIST | NETIF_F_ALL_TSO |
>> + NETIF_F_HIGHDMA | NETIF_F_LRO;
>> +
>> + bypass_netdev->hw_features |= NETIF_F_GSO_ENCAP_ALL;
>> + bypass_netdev->features |= bypass_netdev->hw_features;
>> +
>> + memcpy(bypass_netdev->dev_addr, backup_netdev->dev_addr,
>> + bypass_netdev->addr_len);
>> +
>> + bypass_netdev->min_mtu = backup_netdev->min_mtu;
>> + bypass_netdev->max_mtu = backup_netdev->max_mtu;
>> +
>> + err = register_netdev(bypass_netdev);
>> + if (err < 0) {
>> + dev_err(dev, "Unable to register bypass_netdev!\n");
>> + goto err_register_netdev;
>> + }
>> +
>> + netif_carrier_off(bypass_netdev);
>> +
>> + err = bypass_master_register(bypass_netdev, NULL, pbypass_master);
>> + if (err < 0)
> just "if (err)" would do.
OK
>
>
>> + goto err_bypass;
>> +
>> + return 0;
>> +
>> +err_bypass:
>> + unregister_netdev(bypass_netdev);
>> +err_register_netdev:
>> + free_netdev(bypass_netdev);
>> +
>> + return err;
>> +}
>> +EXPORT_SYMBOL_GPL(bypass_master_create);
>> +
>> +void bypass_master_destroy(struct bypass_master *bypass_master)
>> +{
>> + struct net_device *bypass_netdev;
>> + struct net_device *slave_netdev;
>> + struct bypass_info *bi;
>> +
>> + if (!bypass_master)
>> + return;
>> +
>> + bypass_netdev = rcu_dereference(bypass_master->bypass_netdev);
>> + bi = netdev_priv(bypass_netdev);
>> +
>> + netif_device_detach(bypass_netdev);
>> +
>> + rtnl_lock();
>> +
>> + slave_netdev = rtnl_dereference(bi->active_netdev);
>> + if (slave_netdev)
>> + bypass_slave_unregister(slave_netdev);
>> +
>> + slave_netdev = rtnl_dereference(bi->backup_netdev);
>> + if (slave_netdev)
>> + bypass_slave_unregister(slave_netdev);
>> +
>> + bypass_master_unregister(bypass_master);
>> +
>> + unregister_netdevice(bypass_netdev);
>> +
>> + rtnl_unlock();
>> +
>> + free_netdev(bypass_netdev);
>> +}
>> +EXPORT_SYMBOL_GPL(bypass_master_destroy);
>> +
>> +static __init int
>> +bypass_init(void)
>> +{
>> + register_netdevice_notifier(&bypass_notifier);
>> +
>> + return 0;
>> +}
>> +module_init(bypass_init);
>> +
>> +static __exit
>> +void bypass_exit(void)
>> +{
>> + unregister_netdevice_notifier(&bypass_notifier);
>> +}
>> +module_exit(bypass_exit);
>> +
>> +MODULE_DESCRIPTION("Bypass infrastructure/interface for Paravirtual drivers");
>> +MODULE_LICENSE("GPL v2");
>> --
>> 2.14.3
>>
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
^ permalink raw reply
* [PULL] virtio: feature pull
From: Michael S. Tsirkin @ 2018-04-11 19:10 UTC (permalink / raw)
To: Linus Torvalds; +Cc: kvm, netdev, mst, linux-kernel, virtualization
The following changes since commit dc32bb678e103afbcfa4d814489af0566307f528:
vhost: add vsock compat ioctl (2018-03-20 03:17:42 +0200)
are available in the Git repository at:
git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git tags/for_linus
for you to fetch changes up to 6c64fe7f2adcee21d7c4247f1ec021fd18428fc4:
virtio_balloon: export hugetlb page allocation counts (2018-04-10 21:23:55 +0300)
----------------------------------------------------------------
virtio: feature
This adds reporting hugepage stats to virtio-balloon.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
----------------------------------------------------------------
Jonathan Helman (1):
virtio_balloon: export hugetlb page allocation counts
drivers/virtio/virtio_balloon.c | 6 ++++++
include/uapi/linux/virtio_balloon.h | 4 +++-
2 files changed, 9 insertions(+), 1 deletion(-)
^ permalink raw reply
* [PULL] virtio: feature pull
From: Michael S. Tsirkin @ 2018-04-11 19:10 UTC (permalink / raw)
To: Linus Torvalds; +Cc: kvm, netdev, mst, linux-kernel, virtualization
The following changes since commit dc32bb678e103afbcfa4d814489af0566307f528:
vhost: add vsock compat ioctl (2018-03-20 03:17:42 +0200)
are available in the Git repository at:
git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git tags/for_linus
for you to fetch changes up to 6c64fe7f2adcee21d7c4247f1ec021fd18428fc4:
virtio_balloon: export hugetlb page allocation counts (2018-04-10 21:23:55 +0300)
----------------------------------------------------------------
virtio: feature
This adds reporting hugepage stats to virtio-balloon.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
----------------------------------------------------------------
Jonathan Helman (1):
virtio_balloon: export hugetlb page allocation counts
drivers/virtio/virtio_balloon.c | 6 ++++++
include/uapi/linux/virtio_balloon.h | 4 +++-
2 files changed, 9 insertions(+), 1 deletion(-)
^ permalink raw reply
* Re: [PATCH 1/4] iommu: Add virtio-iommu driver
From: Jean-Philippe Brucker @ 2018-04-11 18:35 UTC (permalink / raw)
To: Tian, Kevin, Robin Murphy, iommu@lists.linux-foundation.org,
kvm@vger.kernel.org, virtualization@lists.linux-foundation.org,
virtio-dev@lists.oasis-open.org, kvmarm@lists.cs.columbia.edu
Cc: jayachandran.nair@cavium.com, tnowicki@caviumnetworks.com,
mst@redhat.com, Marc Zyngier, Will Deacon,
jintack@cs.columbia.edu, eric.auger.pro@gmail.com
In-Reply-To: <AADFC41AFE54684AB9EE6CBC0274A5D191090157@SHSMSX101.ccr.corp.intel.com>
On 23/03/18 08:27, Tian, Kevin wrote:
>>> The host kernel needs to have *some* MSI region in place before the
>>> guest can start configuring interrupts, otherwise it won't know what
>>> address to give to the underlying hardware. However, as soon as the host
>>> kernel has picked a region, host userspace needs to know that it can no
>>> longer use addresses in that region for DMA-able guest memory. It's a
>>> lot easier when the address is fixed in hardware and the host userspace
>>> will never be stupid enough to try and VFIO_IOMMU_DMA_MAP it, but in
>>> the
>>> more general case where MSI writes undergo IOMMU address translation
>>> so
>>> it's an arbitrary IOVA, this has the potential to conflict with stuff
>>> like guest memory hotplug.
>>>
>>> What we currently have is just the simplest option, with the host kernel
>>> just picking something up-front and pretending to host userspace that
>>> it's a fixed hardware address. There's certainly scope for it to be a
>>> bit more dynamic in the sense of adding an interface to let userspace
>>> move it around (before attaching any devices, at least), but I don't
>>> think it's feasible for the host kernel to second-guess userspace enough
>>> to make it entirely transparent like it is in the DMA API domain case.
>>>
>>> Of course, that's all assuming the host itself is using a virtio-iommu
>>> (e.g. in a nested virt or emulation scenario). When it's purely within a
>>> guest then an MSI reservation shouldn't matter so much, since the guest
>>> won't be anywhere near the real hardware configuration anyway.
>>>
>>> Robin.
>>
>> Curious since anyway we are defining a new iommu architecture
>> is it possible to avoid those ARM-specific burden completely?
>>
>
> OK, after some study around those tricks below is my learning:
>
> - MSI_IOVA window is used only on request (iommu_dma_get
> _msi_page), not meant to take effect on all architectures once
> initialized. e.g. ARM GIC does it but not x86. So it is reasonable
> for virtio-iommu driver to implement such capability;
>
> - I thought whether hardware MSI doorbell can be always reported
> on virtio-iommu since it's newly defined. Looks there is a problem
> if underlying IOMMU is sw-managed MSI style - valid mapping is
> expected in all level of translations, meaning guest has to manage
> stage-1 mapping in nested configuration since stage-1 is owned
> by guest.
>
> Then virtio-iommu is naturally expected to report the same MSI
> model as supported by underlying hardware. Below are some
> further thoughts along this route (use 'IOMMU' to represent the
> physical one and 'virtio-iommu' for virtual one):
>
> ----
>
> In the scope of current virtio-iommu spec v.6, there is no nested
> consideration yet. Guest driver is expected to use MAP/UNMAP
> interface on assigned endpoints. In this case the MAP requests
> (IOVA->GPA) is caught and maintained within Qemu which then
> further talks to VFIO to map IOVA->HPA in IOMMU.
>
> Qemu can learn the MSI model of IOMMU from sysfs.
>
> For hardware MSI doorbell (x86 and some ARM):
> * Host kernel reports to Qemu as IOMMU_RESV_MSI
> * Qemu report to guest as VIRTIO_IOMMU_RESV_MEM_T_MSI
> * Guest takes the range as IOMMU_RESV_MSI. reserved
> * Qemu MAP database has no mapping for the doorbell
> * Physical IOMMU page table has no mapping for the doorbell
> * MSI from passthrough device bypass IOMMU
> * MSI from emulated device bypass virtio-iommu
>
> For software MSI doorbell (most ARM):
> * Host kernel reports to Qemu as IOMMU_RESV_SW_MSI
> * Qemu report to guest as VIRTIO_IOMMU_RESV_MEM_T_RESERVED
> * Guest takes the range as IOMMU_RESV_RESERVED
> * vGIC requests to map 'GPA of the virtual doorbell'
> * a map request (IOVA->GPA) sent on endpoint
> * Qemu maintains the mapping in MAP database
> * but no VFIO_MAP request since it's purely virtual
> * GIC requests to map 'HPA of the physical doorbell'
> * e.g. triggered by VFIO enable msi
> * IOMMU now includes a valid mapping (IOVA->HPA)
> * MSI from emulated device go through Qemu MAP
> database (IOVA->'GPA of virtual doorbell') and then hit vGIC
> * MSI from passthrough device go through IOMMU
> (IOVA->'HPA of physical doorbell') and then hit GIC
>
> In this case, host doorbell is treated as reserved resource in
> guest side. Guest has its own sw-management for virtual
> doorbell which is only used for emulated device. two paths
> are completely separated.
>
> If above captures the right flow, current v0.6 spec is complete
> regarding to required function definition.
Yes I think this summarizes well the current state or SW/HW MSI
> Then comes nested case, with two level page tables (stage-1
> and stage-2) in IOMMU. stage-1 is for IOVA->GPA and stage-2
> is for GPA->HPA. VFIO map/unmap happens on stage-2,
> while stage-1 is directly managed by guest (and bound to
> IOMMU which enables nested translation from IOVA->GPA
> ->HPA).
>
> For hardware MSI, there is nothing special compared to
> previous requirement. Both host/guest treat the doorbell
> as reserved and guarantee no mapping in either stage-1 or
> stage-2.
>
> For software MSI, more consideration is required:
>
> * for emulated device it is just fine as long as guest keeps
> IOVA->'GPA of virtual doorbell' in stage-1. Qemu is expected
> to walk stage-1 page table upon MSI request from emulated
> device to hit vGIC;
>
> * for passthrough device however there is a problem. We
> need valid mapping in both stage-1 and stage-2, while host
> kernel is only responsible for stage-2:
>
> 1) if we expect to keep same isolation policy (i.e.
> host MSI fully managed by host kernel), then an identity
> mapping for host-reported MSI range is expected in stage-1.
> In such case we need a new type VIRTIO_IOMMU_RESV_
> MEM_T_DIRECT to teach guest setup identity mapping.
> it should be the right thing to add since anyway there might
> be true IOMMU_RESV_DIRECT range reported from host
> which also should be handled.
>
> 2) Alternatively we could instead allow Qemu to
> request dynamic change of physical doorbell mapping in
> stage2, e.g. from GPA of virtual doorbell to HPA of physical
> doorbell. But it doesn't like a good design - VFIO doesn't
> assign interrupt controller to user space then why should
> VFIO allow user mapping to doorbell...
>
> if 1) is agreed, looks the missing part in spec is just VIRTIO_
> IOMMU_RESV_MEM_T_DIRECT, though the whole story
> is lengthy and fully enabling nested require many other
> works. :-)
This is a great write-up, thanks. As said on the v0.6 thread [1], I also
prefer 1), because it doesn't require any additional interface in the
host kernel, and it doesn't force host userspace to guess which doorbell
address the guest is writing into the MSI-X table.
Thanks,
Jean
[1] https://www.mail-archive.com/virtualization@lists.linux-foundation.org/msg30104.html
^ permalink raw reply
* Re: [PATCH 2/4] iommu/virtio: Add probe request
From: Jean-Philippe Brucker @ 2018-04-11 18:33 UTC (permalink / raw)
To: Robin Murphy, iommu@lists.linux-foundation.org,
kvm@vger.kernel.org, virtualization@lists.linux-foundation.org,
virtio-dev@lists.oasis-open.org, kvmarm@lists.cs.columbia.edu
Cc: jayachandran.nair@cavium.com, Lorenzo Pieralisi,
tnowicki@caviumnetworks.com, mst@redhat.com, Marc Zyngier,
Will Deacon, jintack@cs.columbia.edu, eric.auger@redhat.com,
joro@8bytes.org, eric.auger.pro@gmail.com
In-Reply-To: <8156517f-74b1-2923-2838-402f09a5c488@arm.com>
On 23/03/18 15:00, Robin Murphy wrote:
[...]
>> + /*
>> + * Treat unknown subtype as RESERVED, but urge users to update their
>> + * driver.
>> + */
>> + if (mem->subtype != VIRTIO_IOMMU_RESV_MEM_T_RESERVED &&
>> + mem->subtype != VIRTIO_IOMMU_RESV_MEM_T_MSI)
>> + pr_warn("unknown resv mem subtype 0x%x\n", mem->subtype);
>
> Might as well avoid the extra comparisons by incorporating this into the
> switch statement, i.e.:
>
> default:
> dev_warn(vdev->viommu_dev->dev, ...);
> /* Fallthrough */
> case VIRTIO_IOMMU_RESV_MEM_T_RESERVED:
> ...
>
> (dev_warn is generally preferable to pr_warn when feasible)
Alright, that's nicer
[...]
>> /*
>> * Last step creates a default domain and attaches to it. Everything
>> * must be ready.
>> @@ -735,7 +849,19 @@ static int viommu_add_device(struct device *dev)
>>
>> static void viommu_remove_device(struct device *dev)
>> {
>> - kfree(dev->iommu_fwspec->iommu_priv);
>> + struct viommu_endpoint *vdev;
>> + struct iommu_resv_region *entry, *next;
>> + struct iommu_fwspec *fwspec = dev->iommu_fwspec;
>> +
>> + if (!fwspec || fwspec->ops != &viommu_ops)
>> + return;
>
> Oh good :) I guess that was just a patch-splitting issue. The group
> thing still applies, though.
Ok
Thanks,
Jean
^ permalink raw reply
* Re: [PATCH 1/4] iommu: Add virtio-iommu driver
From: Jean-Philippe Brucker @ 2018-04-11 18:33 UTC (permalink / raw)
To: Robin Murphy, iommu@lists.linux-foundation.org,
kvm@vger.kernel.org, virtualization@lists.linux-foundation.org,
virtio-dev@lists.oasis-open.org, kvmarm@lists.cs.columbia.edu
Cc: jayachandran.nair@cavium.com, Lorenzo Pieralisi, Tomasz.Nowicki,
tnowicki@caviumnetworks.com, mst@redhat.com, Marc Zyngier,
Will Deacon, jintack@cs.columbia.edu, eric.auger@redhat.com,
joro@8bytes.org, eric.auger.pro@gmail.com
In-Reply-To: <d0cfe602-f6e8-2d6e-0942-23012f3facef@arm.com>
On 23/03/18 14:48, Robin Murphy wrote:
[..]
>> + * Virtio driver for the paravirtualized IOMMU
>> + *
>> + * Copyright (C) 2018 ARM Limited
>> + * Author: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
>> + *
>> + * SPDX-License-Identifier: GPL-2.0
>
> This wants to be a // comment at the very top of the file (thankfully
> the policy is now properly documented in-tree since
> Documentation/process/license-rules.rst got merged)
Ok
[...]
>> +
>> +/*
>> + * viommu_del_mappings - remove mappings from the internal tree
>> + *
>> + * @vdomain: the domain
>> + * @iova: start of the range
>> + * @size: size of the range. A size of 0 corresponds to the entire address
>> + * space.
>> + * @out_mapping: if not NULL, the first removed mapping is returned in there.
>> + * This allows the caller to reuse the buffer for the unmap request. When
>> + * the returned size is greater than zero, if a mapping is returned, the
>> + * caller must free it.
>
> This "free multiple mappings except maybe hand one of them off to the
> caller" interface is really unintuitive. AFAICS it's only used by
> viommu_unmap() to grab mapping->req, but that doesn't seem to care about
> mapping itself, so I wonder whether it wouldn't make more sense to just
> have a global kmem_cache of struct virtio_iommu_req_unmap for that and
> avoid a lot of complexity...
Well it's a small complication for what I hoped would be a meanignful
performance difference, but more below.
>> +
>> +/* IOMMU API */
>> +
>> +static bool viommu_capable(enum iommu_cap cap)
>> +{
>> + return false;
>> +}
>
> The .capable callback is optional, so it's only worth implementing once
> you want it to do something beyond the default behaviour.
>
Ah, right
[...]
>> +static size_t viommu_unmap(struct iommu_domain *domain, unsigned long iova,
>> + size_t size)
>> +{
>> + int ret = 0;
>> + size_t unmapped;
>> + struct viommu_mapping *mapping = NULL;
>> + struct viommu_domain *vdomain = to_viommu_domain(domain);
>> +
>> + unmapped = viommu_del_mappings(vdomain, iova, size, &mapping);
>> + if (unmapped < size) {
>> + ret = -EINVAL;
>> + goto out_free;
>> + }
>> +
>> + /* Device already removed all mappings after detach. */
>> + if (!vdomain->endpoints)
>> + goto out_free;
>> +
>> + if (WARN_ON(!mapping))
>> + return 0;
>> +
>> + mapping->req.unmap = (struct virtio_iommu_req_unmap) {
>> + .head.type = VIRTIO_IOMMU_T_UNMAP,
>> + .domain = cpu_to_le32(vdomain->id),
>> + .virt_start = cpu_to_le64(iova),
>> + .virt_end = cpu_to_le64(iova + unmapped - 1),
>> + };
>
> ...In fact, the kmem_cache idea might be moot since it looks like with a
> bit of restructuring you could get away with just a single per-viommu
> virtio_iommu_req_unmap structure; this lot could be passed around on the
> stack until request_lock is taken, at which point it would be copied
> into the 'real' DMA-able structure. The equivalent might apply to
> viommu_map() too - now that I'm looking at it, it seems like it would
> take pretty minimal effort to encapsulate the whole business cleanly in
> viommu_send_req_sync(), which could do something like this instead of
> going through viommu_send_reqs_sync():
>
> ...
> spin_lock_irqsave(&viommu->request_lock, flags);
> viommu_copy_req(viommu->dma_req, req);
> ret = _viommu_send_reqs_sync(viommu, viommu->dma_req, 1, &sent);
> spin_unlock_irqrestore(&viommu->request_lock, flags);
> ...
I'll have to come back to that sorry, still conducting some experiments
with map/unmap.
I'd rather avoid introducing a single dma_req per viommu, because I'd
like to move to the iotlb_range_add/iotlb_sync interface as soon as
possible, and the request logic changes a lot when multiple threads are
susceptible to interleave map/unmap requests.
I ran some tests, and adding a kmem_cache (or simply using kmemdup, it
doesn't make a noticeable difference at our scale) reduces netperf
stream/maerts throughput by 1.1%/1.4% (+/- 0.5% over 30 tests). That's
for a virtio-net device (1 tx/rx vq), and with a vfio device the
difference isn't measurable. At this point I'm not fussy about such
small difference, so don't mind simplifying viommu_del_mapping.
[...]
>> + /*
>> + * Last step creates a default domain and attaches to it. Everything
>> + * must be ready.
>> + */
>> + group = iommu_group_get_for_dev(dev);
>> + if (!IS_ERR(group))
>> + iommu_group_put(group);
>
> Since you create the sysfs IOMMU device, maybe also create the links for
> the masters?
Ok
>> +
>> + return PTR_ERR_OR_ZERO(group);
>> +}
>> +
>> +static void viommu_remove_device(struct device *dev)
>> +{
>
> You need to remove dev from its group, too (basically, .remove_device
> should always undo everything .add_device did)
>
> It would also be good practice to verify that dev->iommu_fwspec exists
> and is one of yours before touching anything, although having checked
> the core code I see we do currently just about get away with it thanks
> to the horrible per-bus ops.
Ok
>
>> + kfree(dev->iommu_fwspec->iommu_priv);
>> +}
>> +
>> +static struct iommu_group *viommu_device_group(struct device *dev)
>> +{
>> + if (dev_is_pci(dev))
>> + return pci_device_group(dev);
>> + else
>> + return generic_device_group(dev);
>> +}
>> +
>> +static int viommu_of_xlate(struct device *dev, struct of_phandle_args *args)
>> +{
>> + return iommu_fwspec_add_ids(dev, args->args, 1);
>
> I'm sure a DT binding somewhere needs to document the appropriate value
> and meaning for #iommu-cells - I guess that probably falls under the
> virtio-mmio binding?
Yes I guess mmio.txt would be the best place for this.
[...]
>> +/*
>> + * Virtio-iommu definition v0.6
>> + *
>> + * Copyright (C) 2018 ARM Ltd.
>> + *
>> + * SPDX-License-Identifier: BSD-3-Clause
>
> Again, at the top, although in /* */ here since it's a header.
Right
Thanks for the review,
Jean
^ permalink raw reply
* Re: [RFC PATCH net-next v6 2/4] net: Introduce generic bypass module
From: Jiri Pirko @ 2018-04-11 15:51 UTC (permalink / raw)
To: Sridhar Samudrala
Cc: alexander.h.duyck, virtio-dev, mst, kubakici, netdev,
virtualization, loseweigh, davem
In-Reply-To: <1523386790-12396-3-git-send-email-sridhar.samudrala@intel.com>
Tue, Apr 10, 2018 at 08:59:48PM CEST, sridhar.samudrala@intel.com wrote:
>This provides a generic interface for paravirtual drivers to listen
>for netdev register/unregister/link change events from pci ethernet
>devices with the same MAC and takeover their datapath. The notifier and
>event handling code is based on the existing netvsc implementation.
>
>It exposes 2 sets of interfaces to the paravirtual drivers.
>1. existing netvsc driver that uses 2 netdev model. In this model, no
>master netdev is created. The paravirtual driver registers each bypass
>instance along with a set of ops to manage the slave events.
> bypass_master_register()
> bypass_master_unregister()
>2. new virtio_net based solution that uses 3 netdev model. In this model,
>the bypass module provides interfaces to create/destroy additional master
>netdev and all the slave events are managed internally.
> bypass_master_create()
> bypass_master_destroy()
>
>Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
>---
> include/linux/netdevice.h | 14 +
> include/net/bypass.h | 96 ++++++
> net/Kconfig | 18 +
> net/core/Makefile | 1 +
> net/core/bypass.c | 844 ++++++++++++++++++++++++++++++++++++++++++++++
> 5 files changed, 973 insertions(+)
> create mode 100644 include/net/bypass.h
> create mode 100644 net/core/bypass.c
>
>diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
>index cf44503ea81a..587293728f70 100644
>--- a/include/linux/netdevice.h
>+++ b/include/linux/netdevice.h
>@@ -1430,6 +1430,8 @@ enum netdev_priv_flags {
> IFF_PHONY_HEADROOM = 1<<24,
> IFF_MACSEC = 1<<25,
> IFF_NO_RX_HANDLER = 1<<26,
>+ IFF_BYPASS = 1 << 27,
>+ IFF_BYPASS_SLAVE = 1 << 28,
I wonder, why you don't follow the existing coding style... Also, please
add these to into the comment above.
> };
>
> #define IFF_802_1Q_VLAN IFF_802_1Q_VLAN
>@@ -1458,6 +1460,8 @@ enum netdev_priv_flags {
> #define IFF_RXFH_CONFIGURED IFF_RXFH_CONFIGURED
> #define IFF_MACSEC IFF_MACSEC
> #define IFF_NO_RX_HANDLER IFF_NO_RX_HANDLER
>+#define IFF_BYPASS IFF_BYPASS
>+#define IFF_BYPASS_SLAVE IFF_BYPASS_SLAVE
>
> /**
> * struct net_device - The DEVICE structure.
>@@ -4308,6 +4312,16 @@ static inline bool netif_is_rxfh_configured(const struct net_device *dev)
> return dev->priv_flags & IFF_RXFH_CONFIGURED;
> }
>
>+static inline bool netif_is_bypass_master(const struct net_device *dev)
>+{
>+ return dev->priv_flags & IFF_BYPASS;
>+}
>+
>+static inline bool netif_is_bypass_slave(const struct net_device *dev)
>+{
>+ return dev->priv_flags & IFF_BYPASS_SLAVE;
>+}
>+
> /* This device needs to keep skb dst for qdisc enqueue or ndo_start_xmit() */
> static inline void netif_keep_dst(struct net_device *dev)
> {
>diff --git a/include/net/bypass.h b/include/net/bypass.h
>new file mode 100644
>index 000000000000..86b02cb894cf
>--- /dev/null
>+++ b/include/net/bypass.h
>@@ -0,0 +1,96 @@
>+// SPDX-License-Identifier: GPL-2.0
>+/* Copyright (c) 2018, Intel Corporation. */
>+
>+#ifndef _NET_BYPASS_H
>+#define _NET_BYPASS_H
>+
>+#include <linux/netdevice.h>
>+
>+struct bypass_ops {
>+ int (*slave_pre_register)(struct net_device *slave_netdev,
>+ struct net_device *bypass_netdev);
>+ int (*slave_join)(struct net_device *slave_netdev,
>+ struct net_device *bypass_netdev);
>+ int (*slave_pre_unregister)(struct net_device *slave_netdev,
>+ struct net_device *bypass_netdev);
>+ int (*slave_release)(struct net_device *slave_netdev,
>+ struct net_device *bypass_netdev);
>+ int (*slave_link_change)(struct net_device *slave_netdev,
>+ struct net_device *bypass_netdev);
>+ rx_handler_result_t (*handle_frame)(struct sk_buff **pskb);
>+};
>+
>+struct bypass_master {
>+ struct list_head list;
>+ struct net_device __rcu *bypass_netdev;
>+ struct bypass_ops __rcu *ops;
>+};
>+
>+/* bypass state */
>+struct bypass_info {
>+ /* passthru netdev with same MAC */
>+ struct net_device __rcu *active_netdev;
You still use "active"/"backup" names which is highly misleading as
it has completely different meaning that in bond for example.
I noted that in my previous review already. Please change it.
>+
>+ /* virtio_net netdev */
>+ struct net_device __rcu *backup_netdev;
>+
>+ /* active netdev stats */
>+ struct rtnl_link_stats64 active_stats;
>+
>+ /* backup netdev stats */
>+ struct rtnl_link_stats64 backup_stats;
>+
>+ /* aggregated stats */
>+ struct rtnl_link_stats64 bypass_stats;
>+
>+ /* spinlock while updating stats */
>+ spinlock_t stats_lock;
>+};
>+
>+#if IS_ENABLED(CONFIG_NET_BYPASS)
>+
>+int bypass_master_create(struct net_device *backup_netdev,
>+ struct bypass_master **pbypass_master);
>+void bypass_master_destroy(struct bypass_master *bypass_master);
>+
>+int bypass_master_register(struct net_device *dev, struct bypass_ops *ops,
>+ struct bypass_master **pbypass_master);
>+void bypass_master_unregister(struct bypass_master *bypass_master);
>+
>+int bypass_slave_unregister(struct net_device *slave_netdev);
>+
>+#else
>+
>+static inline
>+int bypass_master_create(struct net_device *backup_netdev,
>+ struct bypass_master **pbypass_master);
>+{
>+ return 0;
>+}
>+
>+static inline
>+void bypass_master_destroy(struct bypass_master *bypass_master)
>+{
>+}
>+
>+static inline
>+int bypass_master_register(struct net_device *dev, struct bypass_ops *ops,
>+ struct pbypass_master **pbypass_master);
>+{
>+ return 0;
>+}
>+
>+static inline
>+void bypass_master_unregister(struct bypass_master *bypass_master)
>+{
>+}
>+
>+static inline
>+int bypass_slave_unregister(struct net_device *slave_netdev)
>+{
>+ return 0;
>+}
>+
>+#endif
>+
>+#endif /* _NET_BYPASS_H */
>diff --git a/net/Kconfig b/net/Kconfig
>index 0428f12c25c2..994445f4a96a 100644
>--- a/net/Kconfig
>+++ b/net/Kconfig
>@@ -423,6 +423,24 @@ config MAY_USE_DEVLINK
> on MAY_USE_DEVLINK to ensure they do not cause link errors when
> devlink is a loadable module and the driver using it is built-in.
>
>+config NET_BYPASS
>+ tristate "Bypass interface"
>+ ---help---
>+ This provides a generic interface for paravirtual drivers to listen
>+ for netdev register/unregister/link change events from pci ethernet
>+ devices with the same MAC and takeover their datapath. This also
>+ enables live migration of a VM with direct attached VF by failing
>+ over to the paravirtual datapath when the VF is unplugged.
>+
>+config MAY_USE_BYPASS
>+ tristate
>+ default m if NET_BYPASS=m
>+ default y if NET_BYPASS=y || NET_BYPASS=n
>+ help
>+ Drivers using the bypass infrastructure should have a dependency
>+ on MAY_USE_BYPASS to ensure they do not cause link errors when
>+ bypass is a loadable module and the driver using it is built-in.
>+
> endif # if NET
>
> # Used by archs to tell that they support BPF JIT compiler plus which flavour.
>diff --git a/net/core/Makefile b/net/core/Makefile
>index 6dbbba8c57ae..a9727ed1c8fc 100644
>--- a/net/core/Makefile
>+++ b/net/core/Makefile
>@@ -30,3 +30,4 @@ obj-$(CONFIG_DST_CACHE) += dst_cache.o
> obj-$(CONFIG_HWBM) += hwbm.o
> obj-$(CONFIG_NET_DEVLINK) += devlink.o
> obj-$(CONFIG_GRO_CELLS) += gro_cells.o
>+obj-$(CONFIG_NET_BYPASS) += bypass.o
>diff --git a/net/core/bypass.c b/net/core/bypass.c
>new file mode 100644
>index 000000000000..b5b9cb554c3f
>--- /dev/null
>+++ b/net/core/bypass.c
>@@ -0,0 +1,844 @@
>+// SPDX-License-Identifier: GPL-2.0
>+/* Copyright (c) 2018, Intel Corporation. */
>+
>+/* A common module to handle registrations and notifications for paravirtual
>+ * drivers to enable accelerated datapath and support VF live migration.
>+ *
>+ * The notifier and event handling code is based on netvsc driver.
>+ */
>+
>+#include <linux/netdevice.h>
>+#include <linux/etherdevice.h>
>+#include <linux/ethtool.h>
>+#include <linux/module.h>
>+#include <linux/slab.h>
>+#include <linux/netdevice.h>
>+#include <linux/netpoll.h>
>+#include <linux/rtnetlink.h>
>+#include <linux/if_vlan.h>
>+#include <linux/pci.h>
>+#include <net/sch_generic.h>
>+#include <uapi/linux/if_arp.h>
>+#include <net/bypass.h>
>+
>+static LIST_HEAD(bypass_master_list);
>+static DEFINE_SPINLOCK(bypass_lock);
>+
>+static int bypass_slave_pre_register(struct net_device *slave_netdev,
>+ struct net_device *bypass_netdev,
>+ struct bypass_ops *bypass_ops)
>+{
>+ struct bypass_info *bi;
>+ bool backup;
>+
>+ if (bypass_ops) {
>+ if (!bypass_ops->slave_pre_register)
>+ return -EINVAL;
>+
>+ return bypass_ops->slave_pre_register(slave_netdev,
>+ bypass_netdev);
>+ }
>+
>+ bi = netdev_priv(bypass_netdev);
>+ backup = (slave_netdev->dev.parent == bypass_netdev->dev.parent);
>+ if (backup ? rtnl_dereference(bi->backup_netdev) :
>+ rtnl_dereference(bi->active_netdev)) {
>+ netdev_err(bypass_netdev, "%s attempting to register as slave dev when %s already present\n",
>+ slave_netdev->name, backup ? "backup" : "active");
>+ return -EEXIST;
>+ }
>+
>+ /* Avoid non pci devices as active netdev */
>+ if (!backup && (!slave_netdev->dev.parent ||
>+ !dev_is_pci(slave_netdev->dev.parent)))
>+ return -EINVAL;
>+
>+ return 0;
>+}
>+
>+static int bypass_slave_join(struct net_device *slave_netdev,
>+ struct net_device *bypass_netdev,
>+ struct bypass_ops *bypass_ops)
>+{
>+ struct bypass_info *bi;
>+ bool backup;
>+
>+ if (bypass_ops) {
>+ if (!bypass_ops->slave_join)
>+ return -EINVAL;
>+
>+ return bypass_ops->slave_join(slave_netdev, bypass_netdev);
>+ }
>+
>+ bi = netdev_priv(bypass_netdev);
>+ backup = (slave_netdev->dev.parent == bypass_netdev->dev.parent);
>+
>+ dev_hold(slave_netdev);
>+
>+ if (backup) {
>+ rcu_assign_pointer(bi->backup_netdev, slave_netdev);
>+ dev_get_stats(bi->backup_netdev, &bi->backup_stats);
>+ } else {
>+ rcu_assign_pointer(bi->active_netdev, slave_netdev);
>+ dev_get_stats(bi->active_netdev, &bi->active_stats);
>+ bypass_netdev->min_mtu = slave_netdev->min_mtu;
>+ bypass_netdev->max_mtu = slave_netdev->max_mtu;
>+ }
>+
>+ netdev_info(bypass_netdev, "bypass slave:%s joined\n",
>+ slave_netdev->name);
>+
>+ return 0;
>+}
>+
>+/* Called when slave dev is injecting data into network stack.
>+ * Change the associated network device from lower dev to virtio.
>+ * note: already called with rcu_read_lock
>+ */
>+static rx_handler_result_t bypass_handle_frame(struct sk_buff **pskb)
>+{
>+ struct sk_buff *skb = *pskb;
>+ struct net_device *ndev = rcu_dereference(skb->dev->rx_handler_data);
>+
>+ skb->dev = ndev;
>+
>+ return RX_HANDLER_ANOTHER;
>+}
>+
>+static struct net_device *bypass_master_get_bymac(u8 *mac,
>+ struct bypass_ops **ops)
>+{
>+ struct bypass_master *bypass_master;
>+ struct net_device *bypass_netdev;
>+
>+ spin_lock(&bypass_lock);
>+ list_for_each_entry(bypass_master, &bypass_master_list, list) {
As I wrote the last time, you don't need this list, spinlock.
You can do just something like:
for_each_net(net) {
for_each_netdev(net, dev) {
if (netif_is_bypass_master(dev)) {
>+ bypass_netdev = rcu_dereference(bypass_master->bypass_netdev);
>+ if (ether_addr_equal(bypass_netdev->perm_addr, mac)) {
>+ *ops = rcu_dereference(bypass_master->ops);
I don't see how rcu_dereference is ok here.
1) I don't see rcu_read_lock taken
2) Looks like bypass_master->ops has the same value across the whole
existence.
>+ spin_unlock(&bypass_lock);
>+ return bypass_netdev;
>+ }
>+ }
>+ spin_unlock(&bypass_lock);
>+ return NULL;
>+}
>+
>+static int bypass_slave_register(struct net_device *slave_netdev)
>+{
>+ struct net_device *bypass_netdev;
>+ struct bypass_ops *bypass_ops;
>+ int ret, orig_mtu;
>+
>+ ASSERT_RTNL();
>+
>+ bypass_netdev = bypass_master_get_bymac(slave_netdev->perm_addr,
>+ &bypass_ops);
For master, could you use word "master" in the variables so it is clear?
Also, "dev" is fine instead of "netdev".
Something like "bpmaster_dev"
>+ if (!bypass_netdev)
>+ goto done;
>+
>+ ret = bypass_slave_pre_register(slave_netdev, bypass_netdev,
>+ bypass_ops);
>+ if (ret != 0)
Just "if (ret)" will do. You have this on more places.
>+ goto done;
>+
>+ ret = netdev_rx_handler_register(slave_netdev,
>+ bypass_ops ? bypass_ops->handle_frame :
>+ bypass_handle_frame, bypass_netdev);
>+ if (ret != 0) {
>+ netdev_err(slave_netdev, "can not register bypass rx handler (err = %d)\n",
>+ ret);
>+ goto done;
>+ }
>+
>+ ret = netdev_upper_dev_link(slave_netdev, bypass_netdev, NULL);
>+ if (ret != 0) {
>+ netdev_err(slave_netdev, "can not set master device %s (err = %d)\n",
>+ bypass_netdev->name, ret);
>+ goto upper_link_failed;
>+ }
>+
>+ slave_netdev->priv_flags |= IFF_BYPASS_SLAVE;
>+
>+ if (netif_running(bypass_netdev)) {
>+ ret = dev_open(slave_netdev);
>+ if (ret && (ret != -EBUSY)) {
>+ netdev_err(bypass_netdev, "Opening slave %s failed ret:%d\n",
>+ slave_netdev->name, ret);
>+ goto err_interface_up;
>+ }
>+ }
>+
>+ /* Align MTU of slave with master */
>+ orig_mtu = slave_netdev->mtu;
>+ ret = dev_set_mtu(slave_netdev, bypass_netdev->mtu);
>+ if (ret != 0) {
>+ netdev_err(bypass_netdev, "unable to change mtu of %s to %u register failed\n",
>+ slave_netdev->name, bypass_netdev->mtu);
>+ goto err_set_mtu;
>+ }
>+
>+ ret = bypass_slave_join(slave_netdev, bypass_netdev, bypass_ops);
>+ if (ret != 0)
>+ goto err_join;
>+
>+ call_netdevice_notifiers(NETDEV_JOIN, slave_netdev);
>+
>+ netdev_info(bypass_netdev, "bypass slave:%s registered\n",
>+ slave_netdev->name);
>+
>+ goto done;
>+
>+err_join:
>+ dev_set_mtu(slave_netdev, orig_mtu);
>+err_set_mtu:
>+ dev_close(slave_netdev);
>+err_interface_up:
>+ netdev_upper_dev_unlink(slave_netdev, bypass_netdev);
>+ slave_netdev->priv_flags &= ~IFF_BYPASS_SLAVE;
>+upper_link_failed:
>+ netdev_rx_handler_unregister(slave_netdev);
>+done:
>+ return NOTIFY_DONE;
>+}
>+
>+static int bypass_slave_pre_unregister(struct net_device *slave_netdev,
>+ struct net_device *bypass_netdev,
>+ struct bypass_ops *bypass_ops)
>+{
>+ struct net_device *backup_netdev, *active_netdev;
>+ struct bypass_info *bi;
>+
>+ if (bypass_ops) {
>+ if (!bypass_ops->slave_pre_unregister)
>+ return -EINVAL;
>+
>+ return bypass_ops->slave_pre_unregister(slave_netdev,
>+ bypass_netdev);
>+ }
>+
>+ bi = netdev_priv(bypass_netdev);
>+ active_netdev = rtnl_dereference(bi->active_netdev);
>+ backup_netdev = rtnl_dereference(bi->backup_netdev);
>+
>+ if (slave_netdev != active_netdev && slave_netdev != backup_netdev)
>+ return -EINVAL;
>+
>+ return 0;
>+}
>+
>+static int bypass_slave_release(struct net_device *slave_netdev,
>+ struct net_device *bypass_netdev,
>+ struct bypass_ops *bypass_ops)
>+{
>+ struct net_device *backup_netdev, *active_netdev;
>+ struct bypass_info *bi;
>+
>+ if (bypass_ops) {
>+ if (!bypass_ops->slave_release)
>+ return -EINVAL;
I think it would be good to make the API to the driver more strict and
have a separate set of ops for "active" and "backup" netdevices.
That should stop people thinking about extending this to more slaves in
the future.
>+
>+ return bypass_ops->slave_release(slave_netdev, bypass_netdev);
>+ }
>+
>+ bi = netdev_priv(bypass_netdev);
>+ active_netdev = rtnl_dereference(bi->active_netdev);
>+ backup_netdev = rtnl_dereference(bi->backup_netdev);
>+
>+ if (slave_netdev == backup_netdev) {
>+ RCU_INIT_POINTER(bi->backup_netdev, NULL);
>+ } else {
>+ RCU_INIT_POINTER(bi->active_netdev, NULL);
>+ if (backup_netdev) {
>+ bypass_netdev->min_mtu = backup_netdev->min_mtu;
>+ bypass_netdev->max_mtu = backup_netdev->max_mtu;
>+ }
>+ }
>+
>+ dev_put(slave_netdev);
>+
>+ netdev_info(bypass_netdev, "bypass slave:%s released\n",
>+ slave_netdev->name);
>+
>+ return 0;
>+}
>+
>+int bypass_slave_unregister(struct net_device *slave_netdev)
>+{
>+ struct net_device *bypass_netdev;
>+ struct bypass_ops *bypass_ops;
>+ int ret;
>+
>+ if (!netif_is_bypass_slave(slave_netdev))
>+ goto done;
>+
>+ ASSERT_RTNL();
>+
>+ bypass_netdev = bypass_master_get_bymac(slave_netdev->perm_addr,
>+ &bypass_ops);
>+ if (!bypass_netdev)
>+ goto done;
>+
>+ ret = bypass_slave_pre_unregister(slave_netdev, bypass_netdev,
>+ bypass_ops);
>+ if (ret != 0)
>+ goto done;
>+
>+ netdev_rx_handler_unregister(slave_netdev);
>+ netdev_upper_dev_unlink(slave_netdev, bypass_netdev);
>+ slave_netdev->priv_flags &= ~IFF_BYPASS_SLAVE;
>+
>+ bypass_slave_release(slave_netdev, bypass_netdev, bypass_ops);
>+
>+ netdev_info(bypass_netdev, "bypass slave:%s unregistered\n",
>+ slave_netdev->name);
>+
>+done:
>+ return NOTIFY_DONE;
>+}
>+EXPORT_SYMBOL_GPL(bypass_slave_unregister);
>+
>+static bool bypass_xmit_ready(struct net_device *dev)
>+{
>+ return netif_running(dev) && netif_carrier_ok(dev);
>+}
>+
>+static int bypass_slave_link_change(struct net_device *slave_netdev)
>+{
>+ struct net_device *bypass_netdev, *active_netdev, *backup_netdev;
>+ struct bypass_ops *bypass_ops;
>+ struct bypass_info *bi;
>+
>+ if (!netif_is_bypass_slave(slave_netdev))
>+ goto done;
>+
>+ ASSERT_RTNL();
>+
>+ bypass_netdev = bypass_master_get_bymac(slave_netdev->perm_addr,
>+ &bypass_ops);
>+ if (!bypass_netdev)
>+ goto done;
>+
>+ if (bypass_ops) {
>+ if (!bypass_ops->slave_link_change)
>+ goto done;
>+
>+ return bypass_ops->slave_link_change(slave_netdev,
>+ bypass_netdev);
>+ }
>+
>+ if (!netif_running(bypass_netdev))
>+ return 0;
>+
>+ bi = netdev_priv(bypass_netdev);
>+
>+ active_netdev = rtnl_dereference(bi->active_netdev);
>+ backup_netdev = rtnl_dereference(bi->backup_netdev);
>+
>+ if (slave_netdev != active_netdev && slave_netdev != backup_netdev)
>+ goto done;
You don't need this check. "if (!netif_is_bypass_slave(slave_netdev))"
above is enough.
>+
>+ if ((active_netdev && bypass_xmit_ready(active_netdev)) ||
>+ (backup_netdev && bypass_xmit_ready(backup_netdev))) {
>+ netif_carrier_on(bypass_netdev);
>+ netif_tx_wake_all_queues(bypass_netdev);
>+ } else {
>+ netif_carrier_off(bypass_netdev);
>+ netif_tx_stop_all_queues(bypass_netdev);
>+ }
>+
>+done:
>+ return NOTIFY_DONE;
>+}
>+
>+static bool bypass_validate_event_dev(struct net_device *dev)
>+{
>+ /* Skip parent events */
>+ if (netif_is_bypass_master(dev))
>+ return false;
>+
>+ /* Avoid non-Ethernet type devices */
>+ if (dev->type != ARPHRD_ETHER)
>+ return false;
>+
>+ /* Avoid Vlan dev with same MAC registering as VF */
>+ if (is_vlan_dev(dev))
>+ return false;
>+
>+ /* Avoid Bonding master dev with same MAC registering as slave dev */
>+ if ((dev->priv_flags & IFF_BONDING) && (dev->flags & IFF_MASTER))
Yeah, this is certainly incorrect. One thing is, you should be using the
helpers netif_is_bond_master().
But what about the rest? macsec, macvlan, team, bridge, ovs and others?
You need to do it not by blacklisting, but with whitelisting. You need
to whitelist VF devices. My port flavours patchset might help with this.
>+ return false;
>+
>+ return true;
>+}
>+
>+static int
>+bypass_event(struct notifier_block *this, unsigned long event, void *ptr)
>+{
>+ struct net_device *event_dev = netdev_notifier_info_to_dev(ptr);
>+
>+ if (!bypass_validate_event_dev(event_dev))
>+ return NOTIFY_DONE;
>+
>+ switch (event) {
>+ case NETDEV_REGISTER:
>+ return bypass_slave_register(event_dev);
>+ case NETDEV_UNREGISTER:
>+ return bypass_slave_unregister(event_dev);
>+ case NETDEV_UP:
>+ case NETDEV_DOWN:
>+ case NETDEV_CHANGE:
>+ return bypass_slave_link_change(event_dev);
>+ default:
>+ return NOTIFY_DONE;
>+ }
>+}
>+
>+static struct notifier_block bypass_notifier = {
>+ .notifier_call = bypass_event,
>+};
>+
>+int bypass_open(struct net_device *dev)
>+{
>+ struct bypass_info *bi = netdev_priv(dev);
>+ struct net_device *active_netdev, *backup_netdev;
>+ int err;
>+
>+ netif_carrier_off(dev);
>+ netif_tx_wake_all_queues(dev);
>+
>+ active_netdev = rtnl_dereference(bi->active_netdev);
>+ if (active_netdev) {
>+ err = dev_open(active_netdev);
>+ if (err)
>+ goto err_active_open;
>+ }
>+
>+ backup_netdev = rtnl_dereference(bi->backup_netdev);
>+ if (backup_netdev) {
>+ err = dev_open(backup_netdev);
>+ if (err)
>+ goto err_backup_open;
>+ }
>+
>+ return 0;
>+
>+err_backup_open:
>+ dev_close(active_netdev);
>+err_active_open:
>+ netif_tx_disable(dev);
>+ return err;
>+}
>+EXPORT_SYMBOL_GPL(bypass_open);
>+
>+int bypass_close(struct net_device *dev)
>+{
>+ struct bypass_info *vi = netdev_priv(dev);
This should be probably "bi"
>+ struct net_device *slave_netdev;
>+
>+ netif_tx_disable(dev);
>+
>+ slave_netdev = rtnl_dereference(vi->active_netdev);
>+ if (slave_netdev)
>+ dev_close(slave_netdev);
>+
>+ slave_netdev = rtnl_dereference(vi->backup_netdev);
>+ if (slave_netdev)
>+ dev_close(slave_netdev);
>+
>+ return 0;
>+}
>+EXPORT_SYMBOL_GPL(bypass_close);
>+
>+static netdev_tx_t bypass_drop_xmit(struct sk_buff *skb, struct net_device *dev)
>+{
>+ atomic_long_inc(&dev->tx_dropped);
>+ dev_kfree_skb_any(skb);
>+ return NETDEV_TX_OK;
>+}
>+
>+netdev_tx_t bypass_start_xmit(struct sk_buff *skb, struct net_device *dev)
>+{
>+ struct bypass_info *bi = netdev_priv(dev);
If you rename the other variable to "bpmaster_dev", it would be nice to
rename this to bpinfo or something more descriptive. "bi" is too short
to know what that is right away.
>+ struct net_device *xmit_dev;
Don't mix "dev" and "netdev" in one .c file. Just use "dev" for all.
>+
>+ /* Try xmit via active netdev followed by backup netdev */
>+ xmit_dev = rcu_dereference_bh(bi->active_netdev);
>+ if (!xmit_dev || !bypass_xmit_ready(xmit_dev)) {
>+ xmit_dev = rcu_dereference_bh(bi->backup_netdev);
>+ if (!xmit_dev || !bypass_xmit_ready(xmit_dev))
>+ return bypass_drop_xmit(skb, dev);
>+ }
>+
>+ skb->dev = xmit_dev;
>+ skb->queue_mapping = qdisc_skb_cb(skb)->slave_dev_queue_mapping;
>+
>+ return dev_queue_xmit(skb);
>+}
>+EXPORT_SYMBOL_GPL(bypass_start_xmit);
>+
>+u16 bypass_select_queue(struct net_device *dev, struct sk_buff *skb,
>+ void *accel_priv, select_queue_fallback_t fallback)
>+{
>+ /* This helper function exists to help dev_pick_tx get the correct
>+ * destination queue. Using a helper function skips a call to
>+ * skb_tx_hash and will put the skbs in the queue we expect on their
>+ * way down to the bonding driver.
>+ */
>+ u16 txq = skb_rx_queue_recorded(skb) ? skb_get_rx_queue(skb) : 0;
>+
>+ /* Save the original txq to restore before passing to the driver */
>+ qdisc_skb_cb(skb)->slave_dev_queue_mapping = skb->queue_mapping;
>+
>+ if (unlikely(txq >= dev->real_num_tx_queues)) {
>+ do {
>+ txq -= dev->real_num_tx_queues;
>+ } while (txq >= dev->real_num_tx_queues);
>+ }
>+
>+ return txq;
>+}
>+EXPORT_SYMBOL_GPL(bypass_select_queue);
>+
>+/* fold stats, assuming all rtnl_link_stats64 fields are u64, but
>+ * that some drivers can provide 32bit values only.
>+ */
>+static void bypass_fold_stats(struct rtnl_link_stats64 *_res,
>+ const struct rtnl_link_stats64 *_new,
>+ const struct rtnl_link_stats64 *_old)
>+{
>+ const u64 *new = (const u64 *)_new;
>+ const u64 *old = (const u64 *)_old;
>+ u64 *res = (u64 *)_res;
>+ int i;
>+
>+ for (i = 0; i < sizeof(*_res) / sizeof(u64); i++) {
>+ u64 nv = new[i];
>+ u64 ov = old[i];
>+ s64 delta = nv - ov;
>+
>+ /* detects if this particular field is 32bit only */
>+ if (((nv | ov) >> 32) == 0)
>+ delta = (s64)(s32)((u32)nv - (u32)ov);
>+
>+ /* filter anomalies, some drivers reset their stats
>+ * at down/up events.
>+ */
>+ if (delta > 0)
>+ res[i] += delta;
>+ }
>+}
>+
>+void bypass_get_stats(struct net_device *dev, struct rtnl_link_stats64 *stats)
>+{
>+ struct bypass_info *bi = netdev_priv(dev);
You can WARN_ON and return in case the dev is not bypass master, just
to catch buggy drivers. Same with other helpers.
>+ const struct rtnl_link_stats64 *new;
>+ struct rtnl_link_stats64 temp;
>+ struct net_device *slave_netdev;
>+
>+ spin_lock(&bi->stats_lock);
>+ memcpy(stats, &bi->bypass_stats, sizeof(*stats));
>+
>+ rcu_read_lock();
>+
>+ slave_netdev = rcu_dereference(bi->active_netdev);
>+ if (slave_netdev) {
>+ new = dev_get_stats(slave_netdev, &temp);
>+ bypass_fold_stats(stats, new, &bi->active_stats);
>+ memcpy(&bi->active_stats, new, sizeof(*new));
>+ }
>+
>+ slave_netdev = rcu_dereference(bi->backup_netdev);
>+ if (slave_netdev) {
>+ new = dev_get_stats(slave_netdev, &temp);
>+ bypass_fold_stats(stats, new, &bi->backup_stats);
>+ memcpy(&bi->backup_stats, new, sizeof(*new));
>+ }
>+
>+ rcu_read_unlock();
>+
>+ memcpy(&bi->bypass_stats, stats, sizeof(*stats));
>+ spin_unlock(&bi->stats_lock);
>+}
>+EXPORT_SYMBOL_GPL(bypass_get_stats);
>+
>+int bypass_change_mtu(struct net_device *dev, int new_mtu)
>+{
>+ struct bypass_info *bi = netdev_priv(dev);
>+ struct net_device *active_netdev, *backup_netdev;
>+ int ret = 0;
Pointless initialization.
>+
>+ active_netdev = rcu_dereference(bi->active_netdev);
>+ if (active_netdev) {
>+ ret = dev_set_mtu(active_netdev, new_mtu);
>+ if (ret)
>+ return ret;
>+ }
>+
>+ backup_netdev = rcu_dereference(bi->backup_netdev);
>+ if (backup_netdev) {
>+ ret = dev_set_mtu(backup_netdev, new_mtu);
>+ if (ret) {
>+ dev_set_mtu(active_netdev, dev->mtu);
>+ return ret;
>+ }
>+ }
>+
>+ dev->mtu = new_mtu;
>+ return 0;
>+}
>+EXPORT_SYMBOL_GPL(bypass_change_mtu);
>+
>+void bypass_set_rx_mode(struct net_device *dev)
>+{
>+ struct bypass_info *bi = netdev_priv(dev);
>+ struct net_device *slave_netdev;
>+
>+ rcu_read_lock();
>+
>+ slave_netdev = rcu_dereference(bi->active_netdev);
>+ if (slave_netdev) {
>+ dev_uc_sync_multiple(slave_netdev, dev);
>+ dev_mc_sync_multiple(slave_netdev, dev);
>+ }
>+
>+ slave_netdev = rcu_dereference(bi->backup_netdev);
>+ if (slave_netdev) {
>+ dev_uc_sync_multiple(slave_netdev, dev);
>+ dev_mc_sync_multiple(slave_netdev, dev);
>+ }
>+
>+ rcu_read_unlock();
>+}
>+EXPORT_SYMBOL_GPL(bypass_set_rx_mode);
>+
>+static const struct net_device_ops bypass_netdev_ops = {
>+ .ndo_open = bypass_open,
>+ .ndo_stop = bypass_close,
>+ .ndo_start_xmit = bypass_start_xmit,
>+ .ndo_select_queue = bypass_select_queue,
>+ .ndo_get_stats64 = bypass_get_stats,
>+ .ndo_change_mtu = bypass_change_mtu,
>+ .ndo_set_rx_mode = bypass_set_rx_mode,
>+ .ndo_validate_addr = eth_validate_addr,
>+ .ndo_features_check = passthru_features_check,
>+};
>+
>+#define BYPASS_DRV_NAME "bypass"
>+#define BYPASS_DRV_VERSION "0.1"
>+
>+static void bypass_ethtool_get_drvinfo(struct net_device *dev,
>+ struct ethtool_drvinfo *drvinfo)
>+{
>+ strlcpy(drvinfo->driver, BYPASS_DRV_NAME, sizeof(drvinfo->driver));
>+ strlcpy(drvinfo->version, BYPASS_DRV_VERSION, sizeof(drvinfo->version));
>+}
>+
>+int bypass_ethtool_get_link_ksettings(struct net_device *dev,
>+ struct ethtool_link_ksettings *cmd)
>+{
>+ struct bypass_info *bi = netdev_priv(dev);
>+ struct net_device *slave_netdev;
>+
>+ slave_netdev = rtnl_dereference(bi->active_netdev);
>+ if (!slave_netdev || !bypass_xmit_ready(slave_netdev)) {
>+ slave_netdev = rtnl_dereference(bi->backup_netdev);
>+ if (!slave_netdev || !bypass_xmit_ready(slave_netdev)) {
>+ cmd->base.duplex = DUPLEX_UNKNOWN;
>+ cmd->base.port = PORT_OTHER;
>+ cmd->base.speed = SPEED_UNKNOWN;
>+
>+ return 0;
>+ }
>+ }
>+
>+ return __ethtool_get_link_ksettings(slave_netdev, cmd);
>+}
>+EXPORT_SYMBOL_GPL(bypass_ethtool_get_link_ksettings);
>+
>+static const struct ethtool_ops bypass_ethtool_ops = {
>+ .get_drvinfo = bypass_ethtool_get_drvinfo,
>+ .get_link = ethtool_op_get_link,
>+ .get_link_ksettings = bypass_ethtool_get_link_ksettings,
>+};
>+
>+static void bypass_register_existing_slave(struct net_device *bypass_netdev)
>+{
>+ struct net *net = dev_net(bypass_netdev);
>+ struct net_device *dev;
>+
>+ rtnl_lock();
>+ for_each_netdev(net, dev) {
>+ if (dev == bypass_netdev)
>+ continue;
>+ if (!bypass_validate_event_dev(dev))
>+ continue;
>+ if (ether_addr_equal(bypass_netdev->perm_addr, dev->perm_addr))
>+ bypass_slave_register(dev);
>+ }
>+ rtnl_unlock();
>+}
>+
>+int bypass_master_register(struct net_device *dev, struct bypass_ops *ops,
>+ struct bypass_master **pbypass_master)
>+{
>+ struct bypass_master *bypass_master;
>+
>+ bypass_master = kzalloc(sizeof(*bypass_master), GFP_KERNEL);
>+ if (!bypass_master)
>+ return -ENOMEM;
>+
>+ rcu_assign_pointer(bypass_master->ops, ops);
>+ dev_hold(dev);
>+ dev->priv_flags |= IFF_BYPASS;
>+ rcu_assign_pointer(bypass_master->bypass_netdev, dev);
>+
>+ spin_lock(&bypass_lock);
>+ list_add_tail(&bypass_master->list, &bypass_master_list);
>+ spin_unlock(&bypass_lock);
>+
>+ bypass_register_existing_slave(dev);
>+
>+ *pbypass_master = bypass_master;
>+ return 0;
>+}
>+EXPORT_SYMBOL_GPL(bypass_master_register);
>+
>+void bypass_master_unregister(struct bypass_master *bypass_master)
>+{
>+ struct net_device *bypass_netdev;
>+
>+ bypass_netdev = rcu_dereference(bypass_master->bypass_netdev);
>+
>+ bypass_netdev->priv_flags &= ~IFF_BYPASS;
>+ dev_put(bypass_netdev);
>+
>+ spin_lock(&bypass_lock);
>+ list_del(&bypass_master->list);
>+ spin_unlock(&bypass_lock);
>+
>+ kfree(bypass_master);
>+}
>+EXPORT_SYMBOL_GPL(bypass_master_unregister);
>+
>+int bypass_master_create(struct net_device *backup_netdev,
>+ struct bypass_master **pbypass_master)
>+{
>+ struct device *dev = backup_netdev->dev.parent;
>+ struct net_device *bypass_netdev;
>+ int err;
>+
>+ /* Alloc at least 2 queues, for now we are going with 16 assuming
>+ * that most devices being bonded won't have too many queues.
>+ */
>+ bypass_netdev = alloc_etherdev_mq(sizeof(struct bypass_info), 16);
>+ if (!bypass_netdev) {
>+ dev_err(dev, "Unable to allocate bypass_netdev!\n");
>+ return -ENOMEM;
>+ }
>+
>+ dev_net_set(bypass_netdev, dev_net(backup_netdev));
>+ SET_NETDEV_DEV(bypass_netdev, dev);
>+
>+ bypass_netdev->netdev_ops = &bypass_netdev_ops;
>+ bypass_netdev->ethtool_ops = &bypass_ethtool_ops;
>+
>+ /* Initialize the device options */
>+ bypass_netdev->priv_flags |= IFF_UNICAST_FLT | IFF_NO_QUEUE;
>+ bypass_netdev->priv_flags &= ~(IFF_XMIT_DST_RELEASE |
>+ IFF_TX_SKB_SHARING);
>+
>+ /* don't acquire bypass netdev's netif_tx_lock when transmitting */
>+ bypass_netdev->features |= NETIF_F_LLTX;
>+
>+ /* Don't allow bypass devices to change network namespaces. */
>+ bypass_netdev->features |= NETIF_F_NETNS_LOCAL;
>+
>+ bypass_netdev->hw_features = NETIF_F_HW_CSUM | NETIF_F_SG |
>+ NETIF_F_FRAGLIST | NETIF_F_ALL_TSO |
>+ NETIF_F_HIGHDMA | NETIF_F_LRO;
>+
>+ bypass_netdev->hw_features |= NETIF_F_GSO_ENCAP_ALL;
>+ bypass_netdev->features |= bypass_netdev->hw_features;
>+
>+ memcpy(bypass_netdev->dev_addr, backup_netdev->dev_addr,
>+ bypass_netdev->addr_len);
>+
>+ bypass_netdev->min_mtu = backup_netdev->min_mtu;
>+ bypass_netdev->max_mtu = backup_netdev->max_mtu;
>+
>+ err = register_netdev(bypass_netdev);
>+ if (err < 0) {
>+ dev_err(dev, "Unable to register bypass_netdev!\n");
>+ goto err_register_netdev;
>+ }
>+
>+ netif_carrier_off(bypass_netdev);
>+
>+ err = bypass_master_register(bypass_netdev, NULL, pbypass_master);
>+ if (err < 0)
just "if (err)" would do.
>+ goto err_bypass;
>+
>+ return 0;
>+
>+err_bypass:
>+ unregister_netdev(bypass_netdev);
>+err_register_netdev:
>+ free_netdev(bypass_netdev);
>+
>+ return err;
>+}
>+EXPORT_SYMBOL_GPL(bypass_master_create);
>+
>+void bypass_master_destroy(struct bypass_master *bypass_master)
>+{
>+ struct net_device *bypass_netdev;
>+ struct net_device *slave_netdev;
>+ struct bypass_info *bi;
>+
>+ if (!bypass_master)
>+ return;
>+
>+ bypass_netdev = rcu_dereference(bypass_master->bypass_netdev);
>+ bi = netdev_priv(bypass_netdev);
>+
>+ netif_device_detach(bypass_netdev);
>+
>+ rtnl_lock();
>+
>+ slave_netdev = rtnl_dereference(bi->active_netdev);
>+ if (slave_netdev)
>+ bypass_slave_unregister(slave_netdev);
>+
>+ slave_netdev = rtnl_dereference(bi->backup_netdev);
>+ if (slave_netdev)
>+ bypass_slave_unregister(slave_netdev);
>+
>+ bypass_master_unregister(bypass_master);
>+
>+ unregister_netdevice(bypass_netdev);
>+
>+ rtnl_unlock();
>+
>+ free_netdev(bypass_netdev);
>+}
>+EXPORT_SYMBOL_GPL(bypass_master_destroy);
>+
>+static __init int
>+bypass_init(void)
>+{
>+ register_netdevice_notifier(&bypass_notifier);
>+
>+ return 0;
>+}
>+module_init(bypass_init);
>+
>+static __exit
>+void bypass_exit(void)
>+{
>+ unregister_netdevice_notifier(&bypass_notifier);
>+}
>+module_exit(bypass_exit);
>+
>+MODULE_DESCRIPTION("Bypass infrastructure/interface for Paravirtual drivers");
>+MODULE_LICENSE("GPL v2");
>--
>2.14.3
>
^ permalink raw reply
* Re: [PATCH v3 0/2] vhost: fix vhost_vq_access_ok() log check
From: David Miller @ 2018-04-11 14:55 UTC (permalink / raw)
To: mst
Cc: kvm, netdev, syzkaller-bugs, linux-kernel, virtualization,
stefanha, torvalds
In-Reply-To: <20180411162345-mutt-send-email-mst@kernel.org>
From: "Michael S. Tsirkin" <mst@redhat.com>
Date: Wed, 11 Apr 2018 16:24:02 +0300
> On Wed, Apr 11, 2018 at 10:35:39AM +0800, Stefan Hajnoczi wrote:
>> v3:
>> * Rebased onto net/master and resolved conflict [DaveM]
>>
>> v2:
>> * Rewrote the conditional to make the vq access check clearer [Linus]
>> * Added Patch 2 to make the return type consistent and harder to misuse [Linus]
>>
>> The first patch fixes the vhost virtqueue access check which was recently
>> broken. The second patch replaces the int return type with bool to prevent
>> future bugs.
>
> Acked-by: Michael S. Tsirkin <mst@redhat.com>
>
> We need the 1st one on stable I think.
Series applied and patch #1 queued up for -stable.
^ permalink raw reply
* Re: [RFC PATCH net-next v5 3/4] virtio_net: Extend virtio to use VF datapath when available
From: Jiri Pirko @ 2018-04-11 14:51 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: alexander.h.duyck, virtio-dev, kubakici, Samudrala, Sridhar,
virtualization, loseweigh, netdev, davem
In-Reply-To: <20180411174157-mutt-send-email-mst@kernel.org>
Wed, Apr 11, 2018 at 04:45:07PM CEST, mst@redhat.com wrote:
>On Wed, Apr 11, 2018 at 10:03:32AM +0200, Jiri Pirko wrote:
>> Wed, Apr 11, 2018 at 08:24:43AM CEST, sridhar.samudrala@intel.com wrote:
>> >On 4/10/2018 11:03 PM, Jiri Pirko wrote:
>> >> Tue, Apr 10, 2018 at 05:59:02PM CEST, sridhar.samudrala@intel.com wrote:
>> >> > On 4/10/2018 8:43 AM, Jiri Pirko wrote:
>> >> > > Tue, Apr 10, 2018 at 05:27:48PM CEST, sridhar.samudrala@intel.com wrote:
>> >> > > > On 4/10/2018 8:22 AM, Jiri Pirko wrote:
>> >> > > > > Tue, Apr 10, 2018 at 05:13:40PM CEST, sridhar.samudrala@intel.com wrote:
>> >> > > > > > On 4/10/2018 3:55 AM, Jiri Pirko wrote:
>> >> > > > > > > Mon, Apr 09, 2018 at 08:47:06PM CEST, sridhar.samudrala@intel.com wrote:
>> >> > > > > > > > On 4/9/2018 1:07 AM, Jiri Pirko wrote:
>> >> > > > > > > > > Sat, Apr 07, 2018 at 12:59:14AM CEST, sridhar.samudrala@intel.com wrote:
>> >> > > > > > > > > > On 4/6/2018 5:48 AM, Jiri Pirko wrote:
>> >> > > > > > > > > > > Thu, Apr 05, 2018 at 11:08:22PM CEST, sridhar.samudrala@intel.com wrote:
>> >> > > > > > > > > [...]
>> >> > > > > > > > >
>> >> > > > > > > > > > > > +static int virtnet_bypass_join_child(struct net_device *bypass_netdev,
>> >> > > > > > > > > > > > + struct net_device *child_netdev)
>> >> > > > > > > > > > > > +{
>> >> > > > > > > > > > > > + struct virtnet_bypass_info *vbi;
>> >> > > > > > > > > > > > + bool backup;
>> >> > > > > > > > > > > > +
>> >> > > > > > > > > > > > + vbi = netdev_priv(bypass_netdev);
>> >> > > > > > > > > > > > + backup = (child_netdev->dev.parent == bypass_netdev->dev.parent);
>> >> > > > > > > > > > > > + if (backup ? rtnl_dereference(vbi->backup_netdev) :
>> >> > > > > > > > > > > > + rtnl_dereference(vbi->active_netdev)) {
>> >> > > > > > > > > > > > + netdev_info(bypass_netdev,
>> >> > > > > > > > > > > > + "%s attempting to join bypass dev when %s already present\n",
>> >> > > > > > > > > > > > + child_netdev->name, backup ? "backup" : "active");
>> >> > > > > > > > > > > Bypass module should check if there is already some other netdev
>> >> > > > > > > > > > > enslaved and refuse right there.
>> >> > > > > > > > > > This will work for virtio-net with 3 netdev model, but this check has to be done by netvsc
>> >> > > > > > > > > > as its bypass_netdev is same as the backup_netdev.
>> >> > > > > > > > > > Will add a flag while registering with the bypass module to indicate if the driver is doing
>> >> > > > > > > > > > a 2 netdev or 3 netdev model and based on that flag this check can be done in bypass module
>> >> > > > > > > > > > for 3 netdev scenario.
>> >> > > > > > > > > Just let me undestand it clearly. What I expect the difference would be
>> >> > > > > > > > > between 2netdev and3 netdev model is this:
>> >> > > > > > > > > 2netdev:
>> >> > > > > > > > > bypass_master
>> >> > > > > > > > > /
>> >> > > > > > > > > /
>> >> > > > > > > > > VF_slave
>> >> > > > > > > > >
>> >> > > > > > > > > 3netdev:
>> >> > > > > > > > > bypass_master
>> >> > > > > > > > > / \
>> >> > > > > > > > > / \
>> >> > > > > > > > > VF_slave backup_slave
>> >> > > > > > > > >
>> >> > > > > > > > > Is that correct? If not, how does it look like?
>> >> > > > > > > > >
>> >> > > > > > > > >
>> >> > > > > > > > Looks correct.
>> >> > > > > > > > VF_slave and backup_slave are the original netdevs and are present in both the models.
>> >> > > > > > > > In the 3 netdev model, bypass_master netdev is created and VF_slave and backup_slave are
>> >> > > > > > > > marked as the 2 slaves of this new netdev.
>> >> > > > > > > You say it looks correct and in another sentence you provide completely
>> >> > > > > > > different description. Could you please look again?
>> >> > > > > > >
>> >> > > > > > To be exact, 2 netdev model with netvsc looks like this.
>> >> > > > > >
>> >> > > > > > netvsc_netdev
>> >> > > > > > /
>> >> > > > > > /
>> >> > > > > > VF_slave
>> >> > > > > >
>> >> > > > > > With virtio_net, 3 netdev model
>> >> > > > > >
>> >> > > > > > bypass_netdev
>> >> > > > > > / \
>> >> > > > > > / \
>> >> > > > > > VF_slave virtio_net netdev
>> >> > > > > Could you also mark the original netdev which is there now? is it
>> >> > > > > bypass_netdev or virtio_net_netdev ?
>> >> > > > bypass_netdev
>> >> > > > / \
>> >> > > > / \
>> >> > > > VF_slave virtio_net netdev (original)
>> >> > > That does not make sense.
>> >> > > 1) You diverge from the behaviour of the netvsc, where the original
>> >> > > netdev is a master of the VF
>> >> > > 2) If the original netdev is a slave, you cannot have any IP address
>> >> > > configured on it (well you could, but the rx_handler would eat every
>> >> > > incoming packet). So you will break the user bacause he would have to
>> >> > > move the configuration to the new master device.
>> >> > > This only makes sense that the original netdev becomes the master for both
>> >> > > netvsc and virtio_net.
>> >> > Forgot to mention that bypass_netdev takes over the name of the original
>> >> > netdev and
>> >> > virtio_net netdev will get the backup name.
>> >> What do you mean by "name"?
>> >
>> >bypass_netdev also is associated with the same pci device as the original virtio_net
>> >netdev via SET_NETDEV_DEV(). Also, we added ndo_get_phys_port_name() to virtio_net
>> >that will return _bkup when BACKUP feature is enabled.
>>
>> Okay.
>>
>> >
>> >So for ex: if virtio_net inteface was getting 'ens12' as the name assigned by udev
>> >without BACKUP feature, when BACKUP feature is enabled, the bypass_netdev will be
>> >named 'ens12' and the original virtio_net will get named as ens12n_bkup.
>>
>> Got it.
>>
>> I don't like the bypass_master to look differently in netvsc and
>> virtio_net :/ The best would be to convert netvsc to 3 netdev model and
>> treat them the same. The more I think about it, the more the 2 netdev
>> model feels wrong.
>
>If you believe that, then this patchset is a step in the right
>direction.
>
>With something like this patchset applied, converting netvsc to a 3
>device model will presumably be just a flag flip away. Afterwards
If done properly. Yes.
>we'll be able to drop dead code handling the bypass_master flag.
>
>>
>> >
>> >
>> >>
>> >> > So the userspace network configuration doesn't need to change.
>> >> >
>> >> >
>> >
^ permalink raw reply
* Re: [RFC PATCH net-next v5 3/4] virtio_net: Extend virtio to use VF datapath when available
From: Michael S. Tsirkin @ 2018-04-11 14:45 UTC (permalink / raw)
To: Jiri Pirko
Cc: alexander.h.duyck, virtio-dev, kubakici, Samudrala, Sridhar,
virtualization, loseweigh, netdev, davem
In-Reply-To: <20180411080332.GL2028@nanopsycho>
On Wed, Apr 11, 2018 at 10:03:32AM +0200, Jiri Pirko wrote:
> Wed, Apr 11, 2018 at 08:24:43AM CEST, sridhar.samudrala@intel.com wrote:
> >On 4/10/2018 11:03 PM, Jiri Pirko wrote:
> >> Tue, Apr 10, 2018 at 05:59:02PM CEST, sridhar.samudrala@intel.com wrote:
> >> > On 4/10/2018 8:43 AM, Jiri Pirko wrote:
> >> > > Tue, Apr 10, 2018 at 05:27:48PM CEST, sridhar.samudrala@intel.com wrote:
> >> > > > On 4/10/2018 8:22 AM, Jiri Pirko wrote:
> >> > > > > Tue, Apr 10, 2018 at 05:13:40PM CEST, sridhar.samudrala@intel.com wrote:
> >> > > > > > On 4/10/2018 3:55 AM, Jiri Pirko wrote:
> >> > > > > > > Mon, Apr 09, 2018 at 08:47:06PM CEST, sridhar.samudrala@intel.com wrote:
> >> > > > > > > > On 4/9/2018 1:07 AM, Jiri Pirko wrote:
> >> > > > > > > > > Sat, Apr 07, 2018 at 12:59:14AM CEST, sridhar.samudrala@intel.com wrote:
> >> > > > > > > > > > On 4/6/2018 5:48 AM, Jiri Pirko wrote:
> >> > > > > > > > > > > Thu, Apr 05, 2018 at 11:08:22PM CEST, sridhar.samudrala@intel.com wrote:
> >> > > > > > > > > [...]
> >> > > > > > > > >
> >> > > > > > > > > > > > +static int virtnet_bypass_join_child(struct net_device *bypass_netdev,
> >> > > > > > > > > > > > + struct net_device *child_netdev)
> >> > > > > > > > > > > > +{
> >> > > > > > > > > > > > + struct virtnet_bypass_info *vbi;
> >> > > > > > > > > > > > + bool backup;
> >> > > > > > > > > > > > +
> >> > > > > > > > > > > > + vbi = netdev_priv(bypass_netdev);
> >> > > > > > > > > > > > + backup = (child_netdev->dev.parent == bypass_netdev->dev.parent);
> >> > > > > > > > > > > > + if (backup ? rtnl_dereference(vbi->backup_netdev) :
> >> > > > > > > > > > > > + rtnl_dereference(vbi->active_netdev)) {
> >> > > > > > > > > > > > + netdev_info(bypass_netdev,
> >> > > > > > > > > > > > + "%s attempting to join bypass dev when %s already present\n",
> >> > > > > > > > > > > > + child_netdev->name, backup ? "backup" : "active");
> >> > > > > > > > > > > Bypass module should check if there is already some other netdev
> >> > > > > > > > > > > enslaved and refuse right there.
> >> > > > > > > > > > This will work for virtio-net with 3 netdev model, but this check has to be done by netvsc
> >> > > > > > > > > > as its bypass_netdev is same as the backup_netdev.
> >> > > > > > > > > > Will add a flag while registering with the bypass module to indicate if the driver is doing
> >> > > > > > > > > > a 2 netdev or 3 netdev model and based on that flag this check can be done in bypass module
> >> > > > > > > > > > for 3 netdev scenario.
> >> > > > > > > > > Just let me undestand it clearly. What I expect the difference would be
> >> > > > > > > > > between 2netdev and3 netdev model is this:
> >> > > > > > > > > 2netdev:
> >> > > > > > > > > bypass_master
> >> > > > > > > > > /
> >> > > > > > > > > /
> >> > > > > > > > > VF_slave
> >> > > > > > > > >
> >> > > > > > > > > 3netdev:
> >> > > > > > > > > bypass_master
> >> > > > > > > > > / \
> >> > > > > > > > > / \
> >> > > > > > > > > VF_slave backup_slave
> >> > > > > > > > >
> >> > > > > > > > > Is that correct? If not, how does it look like?
> >> > > > > > > > >
> >> > > > > > > > >
> >> > > > > > > > Looks correct.
> >> > > > > > > > VF_slave and backup_slave are the original netdevs and are present in both the models.
> >> > > > > > > > In the 3 netdev model, bypass_master netdev is created and VF_slave and backup_slave are
> >> > > > > > > > marked as the 2 slaves of this new netdev.
> >> > > > > > > You say it looks correct and in another sentence you provide completely
> >> > > > > > > different description. Could you please look again?
> >> > > > > > >
> >> > > > > > To be exact, 2 netdev model with netvsc looks like this.
> >> > > > > >
> >> > > > > > netvsc_netdev
> >> > > > > > /
> >> > > > > > /
> >> > > > > > VF_slave
> >> > > > > >
> >> > > > > > With virtio_net, 3 netdev model
> >> > > > > >
> >> > > > > > bypass_netdev
> >> > > > > > / \
> >> > > > > > / \
> >> > > > > > VF_slave virtio_net netdev
> >> > > > > Could you also mark the original netdev which is there now? is it
> >> > > > > bypass_netdev or virtio_net_netdev ?
> >> > > > bypass_netdev
> >> > > > / \
> >> > > > / \
> >> > > > VF_slave virtio_net netdev (original)
> >> > > That does not make sense.
> >> > > 1) You diverge from the behaviour of the netvsc, where the original
> >> > > netdev is a master of the VF
> >> > > 2) If the original netdev is a slave, you cannot have any IP address
> >> > > configured on it (well you could, but the rx_handler would eat every
> >> > > incoming packet). So you will break the user bacause he would have to
> >> > > move the configuration to the new master device.
> >> > > This only makes sense that the original netdev becomes the master for both
> >> > > netvsc and virtio_net.
> >> > Forgot to mention that bypass_netdev takes over the name of the original
> >> > netdev and
> >> > virtio_net netdev will get the backup name.
> >> What do you mean by "name"?
> >
> >bypass_netdev also is associated with the same pci device as the original virtio_net
> >netdev via SET_NETDEV_DEV(). Also, we added ndo_get_phys_port_name() to virtio_net
> >that will return _bkup when BACKUP feature is enabled.
>
> Okay.
>
> >
> >So for ex: if virtio_net inteface was getting 'ens12' as the name assigned by udev
> >without BACKUP feature, when BACKUP feature is enabled, the bypass_netdev will be
> >named 'ens12' and the original virtio_net will get named as ens12n_bkup.
>
> Got it.
>
> I don't like the bypass_master to look differently in netvsc and
> virtio_net :/ The best would be to convert netvsc to 3 netdev model and
> treat them the same. The more I think about it, the more the 2 netdev
> model feels wrong.
If you believe that, then this patchset is a step in the right
direction.
With something like this patchset applied, converting netvsc to a 3
device model will presumably be just a flag flip away. Afterwards
we'll be able to drop dead code handling the bypass_master flag.
>
> >
> >
> >>
> >> > So the userspace network configuration doesn't need to change.
> >> >
> >> >
> >
^ permalink raw reply
* Re: [PATCH] vhost: Fix vhost_copy_to_user()
From: Michael S. Tsirkin @ 2018-04-11 13:51 UTC (permalink / raw)
To: Eric Auger
Cc: kvm, netdev, linux-kernel, virtualization, stefanha, kvmarm,
eric.auger.pro
In-Reply-To: <1523453438-4266-1-git-send-email-eric.auger@redhat.com>
On Wed, Apr 11, 2018 at 03:30:38PM +0200, Eric Auger wrote:
> vhost_copy_to_user is used to copy vring used elements to userspace.
> We should use VHOST_ADDR_USED instead of VHOST_ADDR_DESC.
>
> Fixes: f88949138058 ("vhost: introduce O(1) vq metadata cache")
> Signed-off-by: Eric Auger <eric.auger@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
> ---
>
> This fixes a stall observed when running an aarch64 guest with
> virtual smmu
> ---
> drivers/vhost/vhost.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index bec722e..f44aead 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -744,7 +744,7 @@ static int vhost_copy_to_user(struct vhost_virtqueue *vq, void __user *to,
> struct iov_iter t;
> void __user *uaddr = vhost_vq_meta_fetch(vq,
> (u64)(uintptr_t)to, size,
> - VHOST_ADDR_DESC);
> + VHOST_ADDR_USED);
>
> if (uaddr)
> return __copy_to_user(uaddr, from, size);
> --
> 2.5.5
^ permalink raw reply
* Re: [PATCH] vhost: Fix vhost_copy_to_user()
From: Auger Eric @ 2018-04-11 13:45 UTC (permalink / raw)
To: Jason Wang, eric.auger.pro, linux-kernel, virtualization, netdev,
kvm, mst, kvmarm
Cc: stefanha
In-Reply-To: <85c033b9-b230-7ef9-744c-4e2799684609@redhat.com>
Hi Jason,
On 11/04/18 15:44, Jason Wang wrote:
>
>
> On 2018年04月11日 21:30, Eric Auger wrote:
>> vhost_copy_to_user is used to copy vring used elements to userspace.
>> We should use VHOST_ADDR_USED instead of VHOST_ADDR_DESC.
>>
>> Fixes: f88949138058 ("vhost: introduce O(1) vq metadata cache")
>> Signed-off-by: Eric Auger <eric.auger@redhat.com>
>>
>> ---
>>
>> This fixes a stall observed when running an aarch64 guest with
>> virtual smmu
>> ---
>> drivers/vhost/vhost.c | 2 +-
>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
>> index bec722e..f44aead 100644
>> --- a/drivers/vhost/vhost.c
>> +++ b/drivers/vhost/vhost.c
>> @@ -744,7 +744,7 @@ static int vhost_copy_to_user(struct
>> vhost_virtqueue *vq, void __user *to,
>> struct iov_iter t;
>> void __user *uaddr = vhost_vq_meta_fetch(vq,
>> (u64)(uintptr_t)to, size,
>> - VHOST_ADDR_DESC);
>> + VHOST_ADDR_USED);
>> if (uaddr)
>> return __copy_to_user(uaddr, from, size);
>
> Acked-by: Jason Wang <jasowang@redhat.com>
>
> Thanks!
>
> Stable material I think.
yes I think so.
Thanks
Eric
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
^ permalink raw reply
* Re: [PATCH] vhost: Fix vhost_copy_to_user()
From: Jason Wang @ 2018-04-11 13:44 UTC (permalink / raw)
To: Eric Auger, eric.auger.pro, linux-kernel, virtualization, netdev,
kvm, mst, kvmarm
Cc: stefanha
In-Reply-To: <1523453438-4266-1-git-send-email-eric.auger@redhat.com>
On 2018年04月11日 21:30, Eric Auger wrote:
> vhost_copy_to_user is used to copy vring used elements to userspace.
> We should use VHOST_ADDR_USED instead of VHOST_ADDR_DESC.
>
> Fixes: f88949138058 ("vhost: introduce O(1) vq metadata cache")
> Signed-off-by: Eric Auger <eric.auger@redhat.com>
>
> ---
>
> This fixes a stall observed when running an aarch64 guest with
> virtual smmu
> ---
> drivers/vhost/vhost.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index bec722e..f44aead 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -744,7 +744,7 @@ static int vhost_copy_to_user(struct vhost_virtqueue *vq, void __user *to,
> struct iov_iter t;
> void __user *uaddr = vhost_vq_meta_fetch(vq,
> (u64)(uintptr_t)to, size,
> - VHOST_ADDR_DESC);
> + VHOST_ADDR_USED);
>
> if (uaddr)
> return __copy_to_user(uaddr, from, size);
Acked-by: Jason Wang <jasowang@redhat.com>
Thanks!
Stable material I think.
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
^ permalink raw reply
* [PATCH] vhost: Fix vhost_copy_to_user()
From: Eric Auger @ 2018-04-11 13:30 UTC (permalink / raw)
To: eric.auger.pro, eric.auger, linux-kernel, virtualization, netdev,
kvm, jasowang, mst, kvmarm
Cc: stefanha
vhost_copy_to_user is used to copy vring used elements to userspace.
We should use VHOST_ADDR_USED instead of VHOST_ADDR_DESC.
Fixes: f88949138058 ("vhost: introduce O(1) vq metadata cache")
Signed-off-by: Eric Auger <eric.auger@redhat.com>
---
This fixes a stall observed when running an aarch64 guest with
virtual smmu
---
drivers/vhost/vhost.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index bec722e..f44aead 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -744,7 +744,7 @@ static int vhost_copy_to_user(struct vhost_virtqueue *vq, void __user *to,
struct iov_iter t;
void __user *uaddr = vhost_vq_meta_fetch(vq,
(u64)(uintptr_t)to, size,
- VHOST_ADDR_DESC);
+ VHOST_ADDR_USED);
if (uaddr)
return __copy_to_user(uaddr, from, size);
--
2.5.5
^ permalink raw reply related
* Re: [PATCH v3 0/2] vhost: fix vhost_vq_access_ok() log check
From: Michael S. Tsirkin @ 2018-04-11 13:24 UTC (permalink / raw)
To: Stefan Hajnoczi
Cc: kvm, netdev, syzkaller-bugs, linux-kernel, virtualization,
Linus Torvalds
In-Reply-To: <20180411023541.15776-1-stefanha@redhat.com>
On Wed, Apr 11, 2018 at 10:35:39AM +0800, Stefan Hajnoczi wrote:
> v3:
> * Rebased onto net/master and resolved conflict [DaveM]
>
> v2:
> * Rewrote the conditional to make the vq access check clearer [Linus]
> * Added Patch 2 to make the return type consistent and harder to misuse [Linus]
>
> The first patch fixes the vhost virtqueue access check which was recently
> broken. The second patch replaces the int return type with bool to prevent
> future bugs.
Acked-by: Michael S. Tsirkin <mst@redhat.com>
We need the 1st one on stable I think.
> Stefan Hajnoczi (2):
> vhost: fix vhost_vq_access_ok() log check
> vhost: return bool from *_access_ok() functions
>
> drivers/vhost/vhost.h | 4 +--
> drivers/vhost/vhost.c | 70 ++++++++++++++++++++++++++-------------------------
> 2 files changed, 38 insertions(+), 36 deletions(-)
>
> --
> 2.14.3
^ permalink raw reply
* Re: [PATCH v3 2/2] vhost: return bool from *_access_ok() functions
From: Michael S. Tsirkin @ 2018-04-11 13:23 UTC (permalink / raw)
To: Stefan Hajnoczi
Cc: kvm, netdev, syzkaller-bugs, linux-kernel, virtualization,
Linus Torvalds
In-Reply-To: <20180411023541.15776-3-stefanha@redhat.com>
On Wed, Apr 11, 2018 at 10:35:41AM +0800, Stefan Hajnoczi wrote:
> Currently vhost *_access_ok() functions return int. This is error-prone
> because there are two popular conventions:
>
> 1. 0 means failure, 1 means success
> 2. -errno means failure, 0 means success
>
> Although vhost mostly uses #1, it does not do so consistently.
> umem_access_ok() uses #2.
>
> This patch changes the return type from int to bool so that false means
> failure and true means success. This eliminates a potential source of
> errors.
>
> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
> ---
> drivers/vhost/vhost.h | 4 ++--
> drivers/vhost/vhost.c | 66 +++++++++++++++++++++++++--------------------------
> 2 files changed, 35 insertions(+), 35 deletions(-)
>
> diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
> index d8ee85ae8fdc..6c844b90a168 100644
> --- a/drivers/vhost/vhost.h
> +++ b/drivers/vhost/vhost.h
> @@ -178,8 +178,8 @@ void vhost_dev_cleanup(struct vhost_dev *);
> void vhost_dev_stop(struct vhost_dev *);
> long vhost_dev_ioctl(struct vhost_dev *, unsigned int ioctl, void __user *argp);
> long vhost_vring_ioctl(struct vhost_dev *d, unsigned int ioctl, void __user *argp);
> -int vhost_vq_access_ok(struct vhost_virtqueue *vq);
> -int vhost_log_access_ok(struct vhost_dev *);
> +bool vhost_vq_access_ok(struct vhost_virtqueue *vq);
> +bool vhost_log_access_ok(struct vhost_dev *);
>
> int vhost_get_vq_desc(struct vhost_virtqueue *,
> struct iovec iov[], unsigned int iov_count,
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index fc805b7fad9d..0fcb51a9940c 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -641,14 +641,14 @@ void vhost_dev_cleanup(struct vhost_dev *dev)
> }
> EXPORT_SYMBOL_GPL(vhost_dev_cleanup);
>
> -static int log_access_ok(void __user *log_base, u64 addr, unsigned long sz)
> +static bool log_access_ok(void __user *log_base, u64 addr, unsigned long sz)
> {
> u64 a = addr / VHOST_PAGE_SIZE / 8;
>
> /* Make sure 64 bit math will not overflow. */
> if (a > ULONG_MAX - (unsigned long)log_base ||
> a + (unsigned long)log_base > ULONG_MAX)
> - return 0;
> + return false;
>
> return access_ok(VERIFY_WRITE, log_base + a,
> (sz + VHOST_PAGE_SIZE * 8 - 1) / VHOST_PAGE_SIZE / 8);
> @@ -661,30 +661,30 @@ static bool vhost_overflow(u64 uaddr, u64 size)
> }
>
> /* Caller should have vq mutex and device mutex. */
> -static int vq_memory_access_ok(void __user *log_base, struct vhost_umem *umem,
> - int log_all)
> +static bool vq_memory_access_ok(void __user *log_base, struct vhost_umem *umem,
> + int log_all)
> {
> struct vhost_umem_node *node;
>
> if (!umem)
> - return 0;
> + return false;
>
> list_for_each_entry(node, &umem->umem_list, link) {
> unsigned long a = node->userspace_addr;
>
> if (vhost_overflow(node->userspace_addr, node->size))
> - return 0;
> + return false;
>
>
> if (!access_ok(VERIFY_WRITE, (void __user *)a,
> node->size))
> - return 0;
> + return false;
> else if (log_all && !log_access_ok(log_base,
> node->start,
> node->size))
> - return 0;
> + return false;
> }
> - return 1;
> + return true;
> }
>
> static inline void __user *vhost_vq_meta_fetch(struct vhost_virtqueue *vq,
> @@ -701,13 +701,13 @@ static inline void __user *vhost_vq_meta_fetch(struct vhost_virtqueue *vq,
>
> /* Can we switch to this memory table? */
> /* Caller should have device mutex but not vq mutex */
> -static int memory_access_ok(struct vhost_dev *d, struct vhost_umem *umem,
> - int log_all)
> +static bool memory_access_ok(struct vhost_dev *d, struct vhost_umem *umem,
> + int log_all)
> {
> int i;
>
> for (i = 0; i < d->nvqs; ++i) {
> - int ok;
> + bool ok;
> bool log;
>
> mutex_lock(&d->vqs[i]->mutex);
> @@ -717,12 +717,12 @@ static int memory_access_ok(struct vhost_dev *d, struct vhost_umem *umem,
> ok = vq_memory_access_ok(d->vqs[i]->log_base,
> umem, log);
> else
> - ok = 1;
> + ok = true;
> mutex_unlock(&d->vqs[i]->mutex);
> if (!ok)
> - return 0;
> + return false;
> }
> - return 1;
> + return true;
> }
>
> static int translate_desc(struct vhost_virtqueue *vq, u64 addr, u32 len,
> @@ -959,21 +959,21 @@ static void vhost_iotlb_notify_vq(struct vhost_dev *d,
> spin_unlock(&d->iotlb_lock);
> }
>
> -static int umem_access_ok(u64 uaddr, u64 size, int access)
> +static bool umem_access_ok(u64 uaddr, u64 size, int access)
> {
> unsigned long a = uaddr;
>
> /* Make sure 64 bit math will not overflow. */
> if (vhost_overflow(uaddr, size))
> - return -EFAULT;
> + return false;
>
> if ((access & VHOST_ACCESS_RO) &&
> !access_ok(VERIFY_READ, (void __user *)a, size))
> - return -EFAULT;
> + return false;
> if ((access & VHOST_ACCESS_WO) &&
> !access_ok(VERIFY_WRITE, (void __user *)a, size))
> - return -EFAULT;
> - return 0;
> + return false;
> + return true;
> }
>
> static int vhost_process_iotlb_msg(struct vhost_dev *dev,
> @@ -988,7 +988,7 @@ static int vhost_process_iotlb_msg(struct vhost_dev *dev,
> ret = -EFAULT;
> break;
> }
> - if (umem_access_ok(msg->uaddr, msg->size, msg->perm)) {
> + if (!umem_access_ok(msg->uaddr, msg->size, msg->perm)) {
> ret = -EFAULT;
> break;
> }
> @@ -1135,10 +1135,10 @@ static int vhost_iotlb_miss(struct vhost_virtqueue *vq, u64 iova, int access)
> return 0;
> }
>
> -static int vq_access_ok(struct vhost_virtqueue *vq, unsigned int num,
> - struct vring_desc __user *desc,
> - struct vring_avail __user *avail,
> - struct vring_used __user *used)
> +static bool vq_access_ok(struct vhost_virtqueue *vq, unsigned int num,
> + struct vring_desc __user *desc,
> + struct vring_avail __user *avail,
> + struct vring_used __user *used)
>
> {
> size_t s = vhost_has_feature(vq, VIRTIO_RING_F_EVENT_IDX) ? 2 : 0;
> @@ -1161,8 +1161,8 @@ static void vhost_vq_meta_update(struct vhost_virtqueue *vq,
> vq->meta_iotlb[type] = node;
> }
>
> -static int iotlb_access_ok(struct vhost_virtqueue *vq,
> - int access, u64 addr, u64 len, int type)
> +static bool iotlb_access_ok(struct vhost_virtqueue *vq,
> + int access, u64 addr, u64 len, int type)
> {
> const struct vhost_umem_node *node;
> struct vhost_umem *umem = vq->iotlb;
> @@ -1220,7 +1220,7 @@ EXPORT_SYMBOL_GPL(vq_iotlb_prefetch);
>
> /* Can we log writes? */
> /* Caller should have device mutex but not vq mutex */
> -int vhost_log_access_ok(struct vhost_dev *dev)
> +bool vhost_log_access_ok(struct vhost_dev *dev)
> {
> return memory_access_ok(dev, dev->umem, 1);
> }
> @@ -1228,8 +1228,8 @@ EXPORT_SYMBOL_GPL(vhost_log_access_ok);
>
> /* Verify access for write logging. */
> /* Caller should have vq mutex and device mutex */
> -static int vq_log_access_ok(struct vhost_virtqueue *vq,
> - void __user *log_base)
> +static bool vq_log_access_ok(struct vhost_virtqueue *vq,
> + void __user *log_base)
> {
> size_t s = vhost_has_feature(vq, VIRTIO_RING_F_EVENT_IDX) ? 2 : 0;
>
> @@ -1242,14 +1242,14 @@ static int vq_log_access_ok(struct vhost_virtqueue *vq,
>
> /* Can we start vq? */
> /* Caller should have vq mutex and device mutex */
> -int vhost_vq_access_ok(struct vhost_virtqueue *vq)
> +bool vhost_vq_access_ok(struct vhost_virtqueue *vq)
> {
> if (!vq_log_access_ok(vq, vq->log_base))
> - return 0;
> + return false;
>
> /* Access validation occurs at prefetch time with IOTLB */
> if (vq->iotlb)
> - return 1;
> + return true;
>
> return vq_access_ok(vq, vq->num, vq->desc, vq->avail, vq->used);
> }
> --
> 2.14.3
^ permalink raw reply
* Re: [PATCH v3 1/2] vhost: fix vhost_vq_access_ok() log check
From: Michael S. Tsirkin @ 2018-04-11 13:23 UTC (permalink / raw)
To: Stefan Hajnoczi
Cc: kvm, netdev, syzkaller-bugs, linux-kernel, virtualization,
Linus Torvalds
In-Reply-To: <20180411023541.15776-2-stefanha@redhat.com>
On Wed, Apr 11, 2018 at 10:35:40AM +0800, Stefan Hajnoczi wrote:
> Commit d65026c6c62e7d9616c8ceb5a53b68bcdc050525 ("vhost: validate log
> when IOTLB is enabled") introduced a regression. The logic was
> originally:
>
> if (vq->iotlb)
> return 1;
> return A && B;
>
> After the patch the short-circuit logic for A was inverted:
>
> if (A || vq->iotlb)
> return A;
> return B;
>
> This patch fixes the regression by rewriting the checks in the obvious
> way, no longer returning A when vq->iotlb is non-NULL (which is hard to
> understand).
>
> Reported-by: syzbot+65a84dde0214b0387ccd@syzkaller.appspotmail.com
> Cc: Jason Wang <jasowang@redhat.com>
> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
> ---
> drivers/vhost/vhost.c | 8 +++++---
> 1 file changed, 5 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index bec722e41f58..fc805b7fad9d 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -1244,10 +1244,12 @@ static int vq_log_access_ok(struct vhost_virtqueue *vq,
> /* Caller should have vq mutex and device mutex */
> int vhost_vq_access_ok(struct vhost_virtqueue *vq)
> {
> - int ret = vq_log_access_ok(vq, vq->log_base);
> + if (!vq_log_access_ok(vq, vq->log_base))
> + return 0;
>
> - if (ret || vq->iotlb)
> - return ret;
> + /* Access validation occurs at prefetch time with IOTLB */
> + if (vq->iotlb)
> + return 1;
>
> return vq_access_ok(vq, vq->num, vq->desc, vq->avail, vq->used);
> }
> --
> 2.14.3
^ permalink raw reply
* Re: [RFC PATCH net-next v5 3/4] virtio_net: Extend virtio to use VF datapath when available
From: Jiri Pirko @ 2018-04-11 8:03 UTC (permalink / raw)
To: Samudrala, Sridhar
Cc: alexander.h.duyck, virtio-dev, mst, kubakici, netdev,
virtualization, loseweigh, davem
In-Reply-To: <5d56e572-24f1-745d-49ae-c2dada5db03c@intel.com>
Wed, Apr 11, 2018 at 08:24:43AM CEST, sridhar.samudrala@intel.com wrote:
>On 4/10/2018 11:03 PM, Jiri Pirko wrote:
>> Tue, Apr 10, 2018 at 05:59:02PM CEST, sridhar.samudrala@intel.com wrote:
>> > On 4/10/2018 8:43 AM, Jiri Pirko wrote:
>> > > Tue, Apr 10, 2018 at 05:27:48PM CEST, sridhar.samudrala@intel.com wrote:
>> > > > On 4/10/2018 8:22 AM, Jiri Pirko wrote:
>> > > > > Tue, Apr 10, 2018 at 05:13:40PM CEST, sridhar.samudrala@intel.com wrote:
>> > > > > > On 4/10/2018 3:55 AM, Jiri Pirko wrote:
>> > > > > > > Mon, Apr 09, 2018 at 08:47:06PM CEST, sridhar.samudrala@intel.com wrote:
>> > > > > > > > On 4/9/2018 1:07 AM, Jiri Pirko wrote:
>> > > > > > > > > Sat, Apr 07, 2018 at 12:59:14AM CEST, sridhar.samudrala@intel.com wrote:
>> > > > > > > > > > On 4/6/2018 5:48 AM, Jiri Pirko wrote:
>> > > > > > > > > > > Thu, Apr 05, 2018 at 11:08:22PM CEST, sridhar.samudrala@intel.com wrote:
>> > > > > > > > > [...]
>> > > > > > > > >
>> > > > > > > > > > > > +static int virtnet_bypass_join_child(struct net_device *bypass_netdev,
>> > > > > > > > > > > > + struct net_device *child_netdev)
>> > > > > > > > > > > > +{
>> > > > > > > > > > > > + struct virtnet_bypass_info *vbi;
>> > > > > > > > > > > > + bool backup;
>> > > > > > > > > > > > +
>> > > > > > > > > > > > + vbi = netdev_priv(bypass_netdev);
>> > > > > > > > > > > > + backup = (child_netdev->dev.parent == bypass_netdev->dev.parent);
>> > > > > > > > > > > > + if (backup ? rtnl_dereference(vbi->backup_netdev) :
>> > > > > > > > > > > > + rtnl_dereference(vbi->active_netdev)) {
>> > > > > > > > > > > > + netdev_info(bypass_netdev,
>> > > > > > > > > > > > + "%s attempting to join bypass dev when %s already present\n",
>> > > > > > > > > > > > + child_netdev->name, backup ? "backup" : "active");
>> > > > > > > > > > > Bypass module should check if there is already some other netdev
>> > > > > > > > > > > enslaved and refuse right there.
>> > > > > > > > > > This will work for virtio-net with 3 netdev model, but this check has to be done by netvsc
>> > > > > > > > > > as its bypass_netdev is same as the backup_netdev.
>> > > > > > > > > > Will add a flag while registering with the bypass module to indicate if the driver is doing
>> > > > > > > > > > a 2 netdev or 3 netdev model and based on that flag this check can be done in bypass module
>> > > > > > > > > > for 3 netdev scenario.
>> > > > > > > > > Just let me undestand it clearly. What I expect the difference would be
>> > > > > > > > > between 2netdev and3 netdev model is this:
>> > > > > > > > > 2netdev:
>> > > > > > > > > bypass_master
>> > > > > > > > > /
>> > > > > > > > > /
>> > > > > > > > > VF_slave
>> > > > > > > > >
>> > > > > > > > > 3netdev:
>> > > > > > > > > bypass_master
>> > > > > > > > > / \
>> > > > > > > > > / \
>> > > > > > > > > VF_slave backup_slave
>> > > > > > > > >
>> > > > > > > > > Is that correct? If not, how does it look like?
>> > > > > > > > >
>> > > > > > > > >
>> > > > > > > > Looks correct.
>> > > > > > > > VF_slave and backup_slave are the original netdevs and are present in both the models.
>> > > > > > > > In the 3 netdev model, bypass_master netdev is created and VF_slave and backup_slave are
>> > > > > > > > marked as the 2 slaves of this new netdev.
>> > > > > > > You say it looks correct and in another sentence you provide completely
>> > > > > > > different description. Could you please look again?
>> > > > > > >
>> > > > > > To be exact, 2 netdev model with netvsc looks like this.
>> > > > > >
>> > > > > > netvsc_netdev
>> > > > > > /
>> > > > > > /
>> > > > > > VF_slave
>> > > > > >
>> > > > > > With virtio_net, 3 netdev model
>> > > > > >
>> > > > > > bypass_netdev
>> > > > > > / \
>> > > > > > / \
>> > > > > > VF_slave virtio_net netdev
>> > > > > Could you also mark the original netdev which is there now? is it
>> > > > > bypass_netdev or virtio_net_netdev ?
>> > > > bypass_netdev
>> > > > / \
>> > > > / \
>> > > > VF_slave virtio_net netdev (original)
>> > > That does not make sense.
>> > > 1) You diverge from the behaviour of the netvsc, where the original
>> > > netdev is a master of the VF
>> > > 2) If the original netdev is a slave, you cannot have any IP address
>> > > configured on it (well you could, but the rx_handler would eat every
>> > > incoming packet). So you will break the user bacause he would have to
>> > > move the configuration to the new master device.
>> > > This only makes sense that the original netdev becomes the master for both
>> > > netvsc and virtio_net.
>> > Forgot to mention that bypass_netdev takes over the name of the original
>> > netdev and
>> > virtio_net netdev will get the backup name.
>> What do you mean by "name"?
>
>bypass_netdev also is associated with the same pci device as the original virtio_net
>netdev via SET_NETDEV_DEV(). Also, we added ndo_get_phys_port_name() to virtio_net
>that will return _bkup when BACKUP feature is enabled.
Okay.
>
>So for ex: if virtio_net inteface was getting 'ens12' as the name assigned by udev
>without BACKUP feature, when BACKUP feature is enabled, the bypass_netdev will be
>named 'ens12' and the original virtio_net will get named as ens12n_bkup.
Got it.
I don't like the bypass_master to look differently in netvsc and
virtio_net :/ The best would be to convert netvsc to 3 netdev model and
treat them the same. The more I think about it, the more the 2 netdev
model feels wrong.
>
>
>>
>> > So the userspace network configuration doesn't need to change.
>> >
>> >
>
^ permalink raw reply
* Re: [RFC PATCH net-next v6 4/4] netvsc: refactor notifier/event handling code to use the bypass framework
From: Jiri Pirko @ 2018-04-11 7:53 UTC (permalink / raw)
To: Stephen Hemminger
Cc: alexander.h.duyck, virtio-dev, mst, kubakici, Sridhar Samudrala,
virtualization, loseweigh, netdev, davem
In-Reply-To: <20180410142608.50f15b45@xeon-e3>
Tue, Apr 10, 2018 at 11:26:08PM CEST, stephen@networkplumber.org wrote:
>On Tue, 10 Apr 2018 11:59:50 -0700
>Sridhar Samudrala <sridhar.samudrala@intel.com> wrote:
>
>> Use the registration/notification framework supported by the generic
>> bypass infrastructure.
>>
>> Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
>> ---
>
>Thanks for doing this. Your current version has couple show stopper
>issues.
>
>First, the slave device is instantly taking over the slave.
>This doesn't allow udev/systemd to do its device rename of the slave
>device. Netvsc uses a delayed work to workaround this.
Wait. Why the fact a device is enslaved has to affect the udev in any
way? If it does, smells like a bug in udev.
>
>Secondly, the select queue needs to call queue selection in VF.
>The bonding/teaming logic doesn't work well for UDP flows.
>Commit b3bf5666a510 ("hv_netvsc: defer queue selection to VF")
>fixed this performance problem.
>
>Lastly, more indirection is bad in current climate.
>
>I am not completely adverse to this but it needs to be fast, simple
>and completely transparent.
^ permalink raw reply
* Re: [RFC PATCH net-next v6 4/4] netvsc: refactor notifier/event handling code to use the bypass framework
From: Jiri Pirko @ 2018-04-11 7:50 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: alexander.h.duyck, virtio-dev, kubakici, Sridhar Samudrala,
virtualization, loseweigh, netdev, davem
In-Reply-To: <20180411022807-mutt-send-email-mst@kernel.org>
Wed, Apr 11, 2018 at 01:28:51AM CEST, mst@redhat.com wrote:
>On Tue, Apr 10, 2018 at 02:26:08PM -0700, Stephen Hemminger wrote:
>> On Tue, 10 Apr 2018 11:59:50 -0700
>> Sridhar Samudrala <sridhar.samudrala@intel.com> wrote:
>>
>> > Use the registration/notification framework supported by the generic
>> > bypass infrastructure.
>> >
>> > Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
>> > ---
>>
>> Thanks for doing this. Your current version has couple show stopper
>> issues.
>>
>> First, the slave device is instantly taking over the slave.
>> This doesn't allow udev/systemd to do its device rename of the slave
>> device. Netvsc uses a delayed work to workaround this.
>
>Interesting. Does this mean udev must act within a specific time window
>then?
Yeah. That is scarry. Also, wrong.
>
>> Secondly, the select queue needs to call queue selection in VF.
>> The bonding/teaming logic doesn't work well for UDP flows.
>> Commit b3bf5666a510 ("hv_netvsc: defer queue selection to VF")
>> fixed this performance problem.
>>
>> Lastly, more indirection is bad in current climate.
>>
>> I am not completely adverse to this but it needs to be fast, simple
>> and completely transparent.
^ permalink raw reply
* Re: [RFC PATCH net-next v5 3/4] virtio_net: Extend virtio to use VF datapath when available
From: Samudrala, Sridhar @ 2018-04-11 6:24 UTC (permalink / raw)
To: Jiri Pirko
Cc: alexander.h.duyck, virtio-dev, mst, kubakici, netdev,
virtualization, loseweigh, davem
In-Reply-To: <20180411060327.GH2028@nanopsycho>
On 4/10/2018 11:03 PM, Jiri Pirko wrote:
> Tue, Apr 10, 2018 at 05:59:02PM CEST, sridhar.samudrala@intel.com wrote:
>> On 4/10/2018 8:43 AM, Jiri Pirko wrote:
>>> Tue, Apr 10, 2018 at 05:27:48PM CEST, sridhar.samudrala@intel.com wrote:
>>>> On 4/10/2018 8:22 AM, Jiri Pirko wrote:
>>>>> Tue, Apr 10, 2018 at 05:13:40PM CEST, sridhar.samudrala@intel.com wrote:
>>>>>> On 4/10/2018 3:55 AM, Jiri Pirko wrote:
>>>>>>> Mon, Apr 09, 2018 at 08:47:06PM CEST, sridhar.samudrala@intel.com wrote:
>>>>>>>> On 4/9/2018 1:07 AM, Jiri Pirko wrote:
>>>>>>>>> Sat, Apr 07, 2018 at 12:59:14AM CEST, sridhar.samudrala@intel.com wrote:
>>>>>>>>>> On 4/6/2018 5:48 AM, Jiri Pirko wrote:
>>>>>>>>>>> Thu, Apr 05, 2018 at 11:08:22PM CEST, sridhar.samudrala@intel.com wrote:
>>>>>>>>> [...]
>>>>>>>>>
>>>>>>>>>>>> +static int virtnet_bypass_join_child(struct net_device *bypass_netdev,
>>>>>>>>>>>> + struct net_device *child_netdev)
>>>>>>>>>>>> +{
>>>>>>>>>>>> + struct virtnet_bypass_info *vbi;
>>>>>>>>>>>> + bool backup;
>>>>>>>>>>>> +
>>>>>>>>>>>> + vbi = netdev_priv(bypass_netdev);
>>>>>>>>>>>> + backup = (child_netdev->dev.parent == bypass_netdev->dev.parent);
>>>>>>>>>>>> + if (backup ? rtnl_dereference(vbi->backup_netdev) :
>>>>>>>>>>>> + rtnl_dereference(vbi->active_netdev)) {
>>>>>>>>>>>> + netdev_info(bypass_netdev,
>>>>>>>>>>>> + "%s attempting to join bypass dev when %s already present\n",
>>>>>>>>>>>> + child_netdev->name, backup ? "backup" : "active");
>>>>>>>>>>> Bypass module should check if there is already some other netdev
>>>>>>>>>>> enslaved and refuse right there.
>>>>>>>>>> This will work for virtio-net with 3 netdev model, but this check has to be done by netvsc
>>>>>>>>>> as its bypass_netdev is same as the backup_netdev.
>>>>>>>>>> Will add a flag while registering with the bypass module to indicate if the driver is doing
>>>>>>>>>> a 2 netdev or 3 netdev model and based on that flag this check can be done in bypass module
>>>>>>>>>> for 3 netdev scenario.
>>>>>>>>> Just let me undestand it clearly. What I expect the difference would be
>>>>>>>>> between 2netdev and3 netdev model is this:
>>>>>>>>> 2netdev:
>>>>>>>>> bypass_master
>>>>>>>>> /
>>>>>>>>> /
>>>>>>>>> VF_slave
>>>>>>>>>
>>>>>>>>> 3netdev:
>>>>>>>>> bypass_master
>>>>>>>>> / \
>>>>>>>>> / \
>>>>>>>>> VF_slave backup_slave
>>>>>>>>>
>>>>>>>>> Is that correct? If not, how does it look like?
>>>>>>>>>
>>>>>>>>>
>>>>>>>> Looks correct.
>>>>>>>> VF_slave and backup_slave are the original netdevs and are present in both the models.
>>>>>>>> In the 3 netdev model, bypass_master netdev is created and VF_slave and backup_slave are
>>>>>>>> marked as the 2 slaves of this new netdev.
>>>>>>> You say it looks correct and in another sentence you provide completely
>>>>>>> different description. Could you please look again?
>>>>>>>
>>>>>> To be exact, 2 netdev model with netvsc looks like this.
>>>>>>
>>>>>> netvsc_netdev
>>>>>> /
>>>>>> /
>>>>>> VF_slave
>>>>>>
>>>>>> With virtio_net, 3 netdev model
>>>>>>
>>>>>> bypass_netdev
>>>>>> / \
>>>>>> / \
>>>>>> VF_slave virtio_net netdev
>>>>> Could you also mark the original netdev which is there now? is it
>>>>> bypass_netdev or virtio_net_netdev ?
>>>> bypass_netdev
>>>> / \
>>>> / \
>>>> VF_slave virtio_net netdev (original)
>>> That does not make sense.
>>> 1) You diverge from the behaviour of the netvsc, where the original
>>> netdev is a master of the VF
>>> 2) If the original netdev is a slave, you cannot have any IP address
>>> configured on it (well you could, but the rx_handler would eat every
>>> incoming packet). So you will break the user bacause he would have to
>>> move the configuration to the new master device.
>>> This only makes sense that the original netdev becomes the master for both
>>> netvsc and virtio_net.
>> Forgot to mention that bypass_netdev takes over the name of the original
>> netdev and
>> virtio_net netdev will get the backup name.
> What do you mean by "name"?
bypass_netdev also is associated with the same pci device as the original virtio_net
netdev via SET_NETDEV_DEV(). Also, we added ndo_get_phys_port_name() to virtio_net
that will return _bkup when BACKUP feature is enabled.
So for ex: if virtio_net inteface was getting 'ens12' as the name assigned by udev
without BACKUP feature, when BACKUP feature is enabled, the bypass_netdev will be
named 'ens12' and the original virtio_net will get named as ens12n_bkup.
>
>> So the userspace network configuration doesn't need to change.
>>
>>
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
^ permalink raw reply
* Re: [virtio-dev] Re: [PATCH v2] virtio_balloon: export hugetlb page allocation counts
From: Jason Wang @ 2018-04-11 3:12 UTC (permalink / raw)
To: Jonathan Helman, Michael S. Tsirkin
Cc: Mihai Carabas, virtio-dev, linux-kernel, virtualization
In-Reply-To: <b1414be4-84bc-5d02-afff-9e42c1fd7210@oracle.com>
On 2018年04月10日 05:11, Jonathan Helman wrote:
>
>
> On 03/22/2018 07:38 PM, Jason Wang wrote:
>>
>>
>> On 2018年03月22日 11:10, Michael S. Tsirkin wrote:
>>> On Thu, Mar 22, 2018 at 09:52:18AM +0800, Jason Wang wrote:
>>>> On 2018年03月20日 12:26, Jonathan Helman wrote:
>>>>>> On Mar 19, 2018, at 7:31 PM, Jason Wang<jasowang@redhat.com> wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 2018年03月20日 06:14, Jonathan Helman wrote:
>>>>>>> Export the number of successful and failed hugetlb page
>>>>>>> allocations via the virtio balloon driver. These 2 counts
>>>>>>> come directly from the vm_events HTLB_BUDDY_PGALLOC and
>>>>>>> HTLB_BUDDY_PGALLOC_FAIL.
>>>>>>>
>>>>>>> Signed-off-by: Jonathan Helman<jonathan.helman@oracle.com>
>>>>>> Reviewed-by: Jason Wang<jasowang@redhat.com>
>>>>> Thanks.
>>>>>
>>>>>>> ---
>>>>>>> drivers/virtio/virtio_balloon.c | 6 ++++++
>>>>>>> include/uapi/linux/virtio_balloon.h | 4 +++-
>>>>>>> 2 files changed, 9 insertions(+), 1 deletion(-)
>>>>>>>
>>>>>>> diff --git a/drivers/virtio/virtio_balloon.c
>>>>>>> b/drivers/virtio/virtio_balloon.c
>>>>>>> index dfe5684..6b237e3 100644
>>>>>>> --- a/drivers/virtio/virtio_balloon.c
>>>>>>> +++ b/drivers/virtio/virtio_balloon.c
>>>>>>> @@ -272,6 +272,12 @@ static unsigned int
>>>>>>> update_balloon_stats(struct virtio_balloon *vb)
>>>>>>> pages_to_bytes(events[PSWPOUT]));
>>>>>>> update_stat(vb, idx++, VIRTIO_BALLOON_S_MAJFLT,
>>>>>>> events[PGMAJFAULT]);
>>>>>>> update_stat(vb, idx++, VIRTIO_BALLOON_S_MINFLT,
>>>>>>> events[PGFAULT]);
>>>>>>> +#ifdef CONFIG_HUGETLB_PAGE
>>>>>>> + update_stat(vb, idx++, VIRTIO_BALLOON_S_HTLB_PGALLOC,
>>>>>>> + events[HTLB_BUDDY_PGALLOC]);
>>>>>>> + update_stat(vb, idx++, VIRTIO_BALLOON_S_HTLB_PGFAIL,
>>>>>>> + events[HTLB_BUDDY_PGALLOC_FAIL]);
>>>>>>> +#endif
>>>>>>> #endif
>>>>>>> update_stat(vb, idx++, VIRTIO_BALLOON_S_MEMFREE,
>>>>>>> pages_to_bytes(i.freeram));
>>>>>>> diff --git a/include/uapi/linux/virtio_balloon.h
>>>>>>> b/include/uapi/linux/virtio_balloon.h
>>>>>>> index 4e8b830..40297a3 100644
>>>>>>> --- a/include/uapi/linux/virtio_balloon.h
>>>>>>> +++ b/include/uapi/linux/virtio_balloon.h
>>>>>>> @@ -53,7 +53,9 @@ struct virtio_balloon_config {
>>>>>>> #define VIRTIO_BALLOON_S_MEMTOT 5 /* Total amount of
>>>>>>> memory */
>>>>>>> #define VIRTIO_BALLOON_S_AVAIL 6 /* Available memory as
>>>>>>> in /proc */
>>>>>>> #define VIRTIO_BALLOON_S_CACHES 7 /* Disk caches */
>>>>>>> -#define VIRTIO_BALLOON_S_NR 8
>>>>>>> +#define VIRTIO_BALLOON_S_HTLB_PGALLOC 8 /* Hugetlb page
>>>>>>> allocations */
>>>>>>> +#define VIRTIO_BALLOON_S_HTLB_PGFAIL 9 /* Hugetlb page
>>>>>>> allocation failures */
>>>>>>> +#define VIRTIO_BALLOON_S_NR 10
>>>>>>> /*
>>>>>>> * Memory statistics structure.
>>>>>> Not for this patch, but it looks to me that exporting such nr
>>>>>> through uapi is fragile.
>>>>> Sorry, can you explain what you mean here?
>>>>>
>>>>> Jon
>>>> Spec said "Within an output buffer submitted to the statsq, the
>>>> device MUST
>>>> ignore entries with tag values that it does not recognize". So
>>>> exporting
>>>> VIRTIO_BALLOON_S_NR seems useless and device implementation can not
>>>> depend
>>>> on such number in uapi.
>>>>
>>>> Thanks
>>> Suggestions? I don't like to break build for people ...
>>>
>>
>> Didn't have a good idea. But maybe we should keep VIRTIO_BALLOON_S_NR
>> unchanged, and add a comment here.
>>
>> Thanks
>
> I think Jason's comment is for a future patch. Didn't see this patch
> get applied, so wondering if it could be.
>
> Thanks,
> Jon
Hi Jon:
Have you tested new driver with old qemu?
Thanks
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
^ permalink raw reply
* [PATCH v3 2/2] vhost: return bool from *_access_ok() functions
From: Stefan Hajnoczi @ 2018-04-11 2:35 UTC (permalink / raw)
To: virtualization
Cc: kvm, mst, netdev, syzkaller-bugs, linux-kernel, Stefan Hajnoczi,
Linus Torvalds
In-Reply-To: <20180411023541.15776-1-stefanha@redhat.com>
Currently vhost *_access_ok() functions return int. This is error-prone
because there are two popular conventions:
1. 0 means failure, 1 means success
2. -errno means failure, 0 means success
Although vhost mostly uses #1, it does not do so consistently.
umem_access_ok() uses #2.
This patch changes the return type from int to bool so that false means
failure and true means success. This eliminates a potential source of
errors.
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
drivers/vhost/vhost.h | 4 ++--
drivers/vhost/vhost.c | 66 +++++++++++++++++++++++++--------------------------
2 files changed, 35 insertions(+), 35 deletions(-)
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index d8ee85ae8fdc..6c844b90a168 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -178,8 +178,8 @@ void vhost_dev_cleanup(struct vhost_dev *);
void vhost_dev_stop(struct vhost_dev *);
long vhost_dev_ioctl(struct vhost_dev *, unsigned int ioctl, void __user *argp);
long vhost_vring_ioctl(struct vhost_dev *d, unsigned int ioctl, void __user *argp);
-int vhost_vq_access_ok(struct vhost_virtqueue *vq);
-int vhost_log_access_ok(struct vhost_dev *);
+bool vhost_vq_access_ok(struct vhost_virtqueue *vq);
+bool vhost_log_access_ok(struct vhost_dev *);
int vhost_get_vq_desc(struct vhost_virtqueue *,
struct iovec iov[], unsigned int iov_count,
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index fc805b7fad9d..0fcb51a9940c 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -641,14 +641,14 @@ void vhost_dev_cleanup(struct vhost_dev *dev)
}
EXPORT_SYMBOL_GPL(vhost_dev_cleanup);
-static int log_access_ok(void __user *log_base, u64 addr, unsigned long sz)
+static bool log_access_ok(void __user *log_base, u64 addr, unsigned long sz)
{
u64 a = addr / VHOST_PAGE_SIZE / 8;
/* Make sure 64 bit math will not overflow. */
if (a > ULONG_MAX - (unsigned long)log_base ||
a + (unsigned long)log_base > ULONG_MAX)
- return 0;
+ return false;
return access_ok(VERIFY_WRITE, log_base + a,
(sz + VHOST_PAGE_SIZE * 8 - 1) / VHOST_PAGE_SIZE / 8);
@@ -661,30 +661,30 @@ static bool vhost_overflow(u64 uaddr, u64 size)
}
/* Caller should have vq mutex and device mutex. */
-static int vq_memory_access_ok(void __user *log_base, struct vhost_umem *umem,
- int log_all)
+static bool vq_memory_access_ok(void __user *log_base, struct vhost_umem *umem,
+ int log_all)
{
struct vhost_umem_node *node;
if (!umem)
- return 0;
+ return false;
list_for_each_entry(node, &umem->umem_list, link) {
unsigned long a = node->userspace_addr;
if (vhost_overflow(node->userspace_addr, node->size))
- return 0;
+ return false;
if (!access_ok(VERIFY_WRITE, (void __user *)a,
node->size))
- return 0;
+ return false;
else if (log_all && !log_access_ok(log_base,
node->start,
node->size))
- return 0;
+ return false;
}
- return 1;
+ return true;
}
static inline void __user *vhost_vq_meta_fetch(struct vhost_virtqueue *vq,
@@ -701,13 +701,13 @@ static inline void __user *vhost_vq_meta_fetch(struct vhost_virtqueue *vq,
/* Can we switch to this memory table? */
/* Caller should have device mutex but not vq mutex */
-static int memory_access_ok(struct vhost_dev *d, struct vhost_umem *umem,
- int log_all)
+static bool memory_access_ok(struct vhost_dev *d, struct vhost_umem *umem,
+ int log_all)
{
int i;
for (i = 0; i < d->nvqs; ++i) {
- int ok;
+ bool ok;
bool log;
mutex_lock(&d->vqs[i]->mutex);
@@ -717,12 +717,12 @@ static int memory_access_ok(struct vhost_dev *d, struct vhost_umem *umem,
ok = vq_memory_access_ok(d->vqs[i]->log_base,
umem, log);
else
- ok = 1;
+ ok = true;
mutex_unlock(&d->vqs[i]->mutex);
if (!ok)
- return 0;
+ return false;
}
- return 1;
+ return true;
}
static int translate_desc(struct vhost_virtqueue *vq, u64 addr, u32 len,
@@ -959,21 +959,21 @@ static void vhost_iotlb_notify_vq(struct vhost_dev *d,
spin_unlock(&d->iotlb_lock);
}
-static int umem_access_ok(u64 uaddr, u64 size, int access)
+static bool umem_access_ok(u64 uaddr, u64 size, int access)
{
unsigned long a = uaddr;
/* Make sure 64 bit math will not overflow. */
if (vhost_overflow(uaddr, size))
- return -EFAULT;
+ return false;
if ((access & VHOST_ACCESS_RO) &&
!access_ok(VERIFY_READ, (void __user *)a, size))
- return -EFAULT;
+ return false;
if ((access & VHOST_ACCESS_WO) &&
!access_ok(VERIFY_WRITE, (void __user *)a, size))
- return -EFAULT;
- return 0;
+ return false;
+ return true;
}
static int vhost_process_iotlb_msg(struct vhost_dev *dev,
@@ -988,7 +988,7 @@ static int vhost_process_iotlb_msg(struct vhost_dev *dev,
ret = -EFAULT;
break;
}
- if (umem_access_ok(msg->uaddr, msg->size, msg->perm)) {
+ if (!umem_access_ok(msg->uaddr, msg->size, msg->perm)) {
ret = -EFAULT;
break;
}
@@ -1135,10 +1135,10 @@ static int vhost_iotlb_miss(struct vhost_virtqueue *vq, u64 iova, int access)
return 0;
}
-static int vq_access_ok(struct vhost_virtqueue *vq, unsigned int num,
- struct vring_desc __user *desc,
- struct vring_avail __user *avail,
- struct vring_used __user *used)
+static bool vq_access_ok(struct vhost_virtqueue *vq, unsigned int num,
+ struct vring_desc __user *desc,
+ struct vring_avail __user *avail,
+ struct vring_used __user *used)
{
size_t s = vhost_has_feature(vq, VIRTIO_RING_F_EVENT_IDX) ? 2 : 0;
@@ -1161,8 +1161,8 @@ static void vhost_vq_meta_update(struct vhost_virtqueue *vq,
vq->meta_iotlb[type] = node;
}
-static int iotlb_access_ok(struct vhost_virtqueue *vq,
- int access, u64 addr, u64 len, int type)
+static bool iotlb_access_ok(struct vhost_virtqueue *vq,
+ int access, u64 addr, u64 len, int type)
{
const struct vhost_umem_node *node;
struct vhost_umem *umem = vq->iotlb;
@@ -1220,7 +1220,7 @@ EXPORT_SYMBOL_GPL(vq_iotlb_prefetch);
/* Can we log writes? */
/* Caller should have device mutex but not vq mutex */
-int vhost_log_access_ok(struct vhost_dev *dev)
+bool vhost_log_access_ok(struct vhost_dev *dev)
{
return memory_access_ok(dev, dev->umem, 1);
}
@@ -1228,8 +1228,8 @@ EXPORT_SYMBOL_GPL(vhost_log_access_ok);
/* Verify access for write logging. */
/* Caller should have vq mutex and device mutex */
-static int vq_log_access_ok(struct vhost_virtqueue *vq,
- void __user *log_base)
+static bool vq_log_access_ok(struct vhost_virtqueue *vq,
+ void __user *log_base)
{
size_t s = vhost_has_feature(vq, VIRTIO_RING_F_EVENT_IDX) ? 2 : 0;
@@ -1242,14 +1242,14 @@ static int vq_log_access_ok(struct vhost_virtqueue *vq,
/* Can we start vq? */
/* Caller should have vq mutex and device mutex */
-int vhost_vq_access_ok(struct vhost_virtqueue *vq)
+bool vhost_vq_access_ok(struct vhost_virtqueue *vq)
{
if (!vq_log_access_ok(vq, vq->log_base))
- return 0;
+ return false;
/* Access validation occurs at prefetch time with IOTLB */
if (vq->iotlb)
- return 1;
+ return true;
return vq_access_ok(vq, vq->num, vq->desc, vq->avail, vq->used);
}
--
2.14.3
^ permalink raw reply related
* [PATCH v3 0/2] vhost: fix vhost_vq_access_ok() log check
From: Stefan Hajnoczi @ 2018-04-11 2:35 UTC (permalink / raw)
To: virtualization
Cc: kvm, mst, netdev, syzkaller-bugs, linux-kernel, Stefan Hajnoczi,
Linus Torvalds
v3:
* Rebased onto net/master and resolved conflict [DaveM]
v2:
* Rewrote the conditional to make the vq access check clearer [Linus]
* Added Patch 2 to make the return type consistent and harder to misuse [Linus]
The first patch fixes the vhost virtqueue access check which was recently
broken. The second patch replaces the int return type with bool to prevent
future bugs.
Stefan Hajnoczi (2):
vhost: fix vhost_vq_access_ok() log check
vhost: return bool from *_access_ok() functions
drivers/vhost/vhost.h | 4 +--
drivers/vhost/vhost.c | 70 ++++++++++++++++++++++++++-------------------------
2 files changed, 38 insertions(+), 36 deletions(-)
--
2.14.3
^ permalink raw reply
* Re: [virtio-dev] Re: [RFC] vhost: introduce mdev based hardware vhost backend
From: Jason Wang @ 2018-04-11 2:18 UTC (permalink / raw)
To: Liang, Cunming, Michael S. Tsirkin
Cc: Duyck, Alexander H, virtio-dev@lists.oasis-open.org,
kvm@vger.kernel.org, netdev@vger.kernel.org,
linux-kernel@vger.kernel.org,
virtualization@lists.linux-foundation.org, Wang, Xiao W,
ddutile@redhat.com, Tan, Jianfeng, Wang, Zhihong, Paolo Bonzini
In-Reply-To: <D0158A423229094DA7ABF71CF2FA0DA34E9226AF@SHSMSX104.ccr.corp.intel.com>
On 2018年04月10日 22:23, Liang, Cunming wrote:
>> -----Original Message-----
>> From: Michael S. Tsirkin [mailto:mst@redhat.com]
>> Sent: Tuesday, April 10, 2018 9:36 PM
>> To: Liang, Cunming<cunming.liang@intel.com>
>> Cc: Paolo Bonzini<pbonzini@redhat.com>; Bie, Tiwei<tiwei.bie@intel.com>;
>> Jason Wang<jasowang@redhat.com>;alex.williamson@redhat.com;
>> ddutile@redhat.com; Duyck, Alexander H<alexander.h.duyck@intel.com>;
>> virtio-dev@lists.oasis-open.org;linux-kernel@vger.kernel.org;
>> kvm@vger.kernel.org;virtualization@lists.linux-foundation.org;
>> netdev@vger.kernel.org; Daly, Dan<dan.daly@intel.com>; Wang, Zhihong
>> <zhihong.wang@intel.com>; Tan, Jianfeng<jianfeng.tan@intel.com>; Wang, Xiao
>> W<xiao.w.wang@intel.com>
>> Subject: Re: [virtio-dev] Re: [RFC] vhost: introduce mdev based hardware vhost
>> backend
>>
>> On Tue, Apr 10, 2018 at 09:23:53AM +0000, Liang, Cunming wrote:
>>>> -----Original Message-----
>>>> From: Paolo Bonzini [mailto:pbonzini@redhat.com]
>>>> Sent: Tuesday, April 10, 2018 3:52 PM
>>>> To: Bie, Tiwei<tiwei.bie@intel.com>; Jason Wang
>>>> <jasowang@redhat.com>
>>>> Cc:mst@redhat.com;alex.williamson@redhat.com;ddutile@redhat.com;
>>>> Duyck, Alexander H<alexander.h.duyck@intel.com>;
>>>> virtio-dev@lists.oasis- open.org;linux-kernel@vger.kernel.org;
>>>> kvm@vger.kernel.org;virtualization@lists.linux-foundation.org;
>>>> netdev@vger.kernel.org; Daly, Dan<dan.daly@intel.com>; Liang,
>>>> Cunming<cunming.liang@intel.com>; Wang, Zhihong
>>>> <zhihong.wang@intel.com>; Tan, Jianfeng<jianfeng.tan@intel.com>;
>>>> Wang, Xiao W<xiao.w.wang@intel.com>
>>>> Subject: Re: [virtio-dev] Re: [RFC] vhost: introduce mdev based
>>>> hardware vhost backend
>>>>
>>>> On 10/04/2018 06:57, Tiwei Bie wrote:
>>>>>> So you just move the abstraction layer from qemu to kernel, and
>>>>>> you still need different drivers in kernel for different device
>>>>>> interfaces of accelerators. This looks even more complex than
>>>>>> leaving it in qemu. As you said, another idea is to implement
>>>>>> userspace vhost backend for accelerators which seems easier and
>>>>>> could co-work with other parts of qemu without inventing new type of
>> messages.
>>>>> I'm not quite sure. Do you think it's acceptable to add various
>>>>> vendor specific hardware drivers in QEMU?
>>>> I think so. We have vendor-specific quirks, and at some point there
>>>> was an idea of using quirks to implement (vendor-specific) live
>>>> migration support for assigned devices.
>>> Vendor-specific quirks of accessing VGA is a small portion. Other major portions
>> are still handled by guest driver.
>>> While in this case, when saying various vendor specific drivers in QEMU, it says
>> QEMU takes over and provides the entire user space device drivers. Some parts
>> are even not relevant to vhost, they're basic device function enabling. Moreover,
>> it could be different kinds of devices(network/block/...) under vhost. No matter
>> # of vendors or # of types, total LOC is not small.
>>> The idea is to avoid introducing these extra complexity out of QEMU, keeping
>> vhost adapter simple. As vhost protocol is de factor standard, it leverages kernel
>> device driver to provide the diversity. Changing once in QEMU, then it supports
>> multi-vendor devices whose drivers naturally providing kernel driver there.
>>> If QEMU is going to build a user space driver framework there, we're open mind
>> on that, even leveraging DPDK as the underlay library. Looking forward to more
>> others' comments from community.
>>> Steve
>> Dependency on a kernel driver is fine IMHO. It's the dependency on a DPDK
>> backend that makes people unhappy, since the functionality in question is setup-
>> time only.
> Agreed, we don't see dependency on kernel driver is a problem.
At engineering level, kernel driver is harder than userspace driver.
>
> mdev based vhost backend (this patch set) is independent with vhost-user extension patch set. In fact, there're a few vhost-user providers, DPDK librte_vhost is one of them. FD.IO/VPP and snabbswitch have their own vhost-user providers. So I can't agree on vhost-user extension patch depends on DPDK backend. But anyway, that's the topic of another mail thread.
>
Well we can treat mdev as another kind of transport of vhost-user. And
technically we can even implement a relay mdev than forward vhost-user
messages to dpdk.
Thanks
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox