Linux virtualization list
 help / color / mirror / Atom feed
* Re: [PATCH net-next v8 2/4] net: Introduce generic failover module
From: Dan Carpenter @ 2018-04-28  8:23 UTC (permalink / raw)
  To: kbuild
  Cc: alexander.h.duyck, virtio-dev, jiri, mst, kubakici, netdev,
	virtualization, loseweigh, kbuild-all, sridhar.samudrala,
	aaron.f.brown, davem
In-Reply-To: <1524700768-38627-3-git-send-email-sridhar.samudrala@intel.com>

Hi Sridhar,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on net-next/master]
url:    https://github.com/0day-ci/linux/commits/Sridhar-Samudrala/Enable-virtio_net-to-act-as-a-standby-for-a-passthru-device/20180427-183842

smatch warnings:
net/core/net_failover.c:229 net_failover_change_mtu() error: we previously assumed 'primary_dev' could be null (see line 219)
net/core/net_failover.c:279 net_failover_vlan_rx_add_vid() error: we previously assumed 'primary_dev' could be null (see line 269)

# https://github.com/0day-ci/linux/commit/5a5f2e3efcb699867db79543dfebe764927b9c93
git remote add linux-review https://github.com/0day-ci/linux
git remote update linux-review
git checkout 5a5f2e3efcb699867db79543dfebe764927b9c93
vim +/primary_dev +229 net/core/net_failover.c

5a5f2e3e Sridhar Samudrala 2018-04-25  211  
5a5f2e3e Sridhar Samudrala 2018-04-25  212  static int net_failover_change_mtu(struct net_device *dev, int new_mtu)
5a5f2e3e Sridhar Samudrala 2018-04-25  213  {
5a5f2e3e Sridhar Samudrala 2018-04-25  214  	struct net_failover_info *nfo_info = netdev_priv(dev);
5a5f2e3e Sridhar Samudrala 2018-04-25  215  	struct net_device *primary_dev, *standby_dev;
5a5f2e3e Sridhar Samudrala 2018-04-25  216  	int ret = 0;
5a5f2e3e Sridhar Samudrala 2018-04-25  217  
5a5f2e3e Sridhar Samudrala 2018-04-25  218  	primary_dev = rcu_dereference(nfo_info->primary_dev);
5a5f2e3e Sridhar Samudrala 2018-04-25 @219  	if (primary_dev) {
5a5f2e3e Sridhar Samudrala 2018-04-25  220  		ret = dev_set_mtu(primary_dev, new_mtu);
5a5f2e3e Sridhar Samudrala 2018-04-25  221  		if (ret)
5a5f2e3e Sridhar Samudrala 2018-04-25  222  			return ret;
5a5f2e3e Sridhar Samudrala 2018-04-25  223  	}
5a5f2e3e Sridhar Samudrala 2018-04-25  224  
5a5f2e3e Sridhar Samudrala 2018-04-25  225  	standby_dev = rcu_dereference(nfo_info->standby_dev);
5a5f2e3e Sridhar Samudrala 2018-04-25  226  	if (standby_dev) {
5a5f2e3e Sridhar Samudrala 2018-04-25  227  		ret = dev_set_mtu(standby_dev, new_mtu);
5a5f2e3e Sridhar Samudrala 2018-04-25  228  		if (ret) {
5a5f2e3e Sridhar Samudrala 2018-04-25 @229  			dev_set_mtu(primary_dev, dev->mtu);
5a5f2e3e Sridhar Samudrala 2018-04-25  230  			return ret;
5a5f2e3e Sridhar Samudrala 2018-04-25  231  		}
5a5f2e3e Sridhar Samudrala 2018-04-25  232  	}
5a5f2e3e Sridhar Samudrala 2018-04-25  233  
5a5f2e3e Sridhar Samudrala 2018-04-25  234  	dev->mtu = new_mtu;
5a5f2e3e Sridhar Samudrala 2018-04-25  235  
5a5f2e3e Sridhar Samudrala 2018-04-25  236  	return 0;
5a5f2e3e Sridhar Samudrala 2018-04-25  237  }
5a5f2e3e Sridhar Samudrala 2018-04-25  238  
5a5f2e3e Sridhar Samudrala 2018-04-25  239  static void net_failover_set_rx_mode(struct net_device *dev)
5a5f2e3e Sridhar Samudrala 2018-04-25  240  {
5a5f2e3e Sridhar Samudrala 2018-04-25  241  	struct net_failover_info *nfo_info = netdev_priv(dev);
5a5f2e3e Sridhar Samudrala 2018-04-25  242  	struct net_device *slave_dev;
5a5f2e3e Sridhar Samudrala 2018-04-25  243  
5a5f2e3e Sridhar Samudrala 2018-04-25  244  	rcu_read_lock();
5a5f2e3e Sridhar Samudrala 2018-04-25  245  
5a5f2e3e Sridhar Samudrala 2018-04-25  246  	slave_dev = rcu_dereference(nfo_info->primary_dev);
5a5f2e3e Sridhar Samudrala 2018-04-25  247  	if (slave_dev) {
5a5f2e3e Sridhar Samudrala 2018-04-25  248  		dev_uc_sync_multiple(slave_dev, dev);
5a5f2e3e Sridhar Samudrala 2018-04-25  249  		dev_mc_sync_multiple(slave_dev, dev);
5a5f2e3e Sridhar Samudrala 2018-04-25  250  	}
5a5f2e3e Sridhar Samudrala 2018-04-25  251  
5a5f2e3e Sridhar Samudrala 2018-04-25  252  	slave_dev = rcu_dereference(nfo_info->standby_dev);
5a5f2e3e Sridhar Samudrala 2018-04-25  253  	if (slave_dev) {
5a5f2e3e Sridhar Samudrala 2018-04-25  254  		dev_uc_sync_multiple(slave_dev, dev);
5a5f2e3e Sridhar Samudrala 2018-04-25  255  		dev_mc_sync_multiple(slave_dev, dev);
5a5f2e3e Sridhar Samudrala 2018-04-25  256  	}
5a5f2e3e Sridhar Samudrala 2018-04-25  257  
5a5f2e3e Sridhar Samudrala 2018-04-25  258  	rcu_read_unlock();
5a5f2e3e Sridhar Samudrala 2018-04-25  259  }
5a5f2e3e Sridhar Samudrala 2018-04-25  260  
5a5f2e3e Sridhar Samudrala 2018-04-25  261  static int net_failover_vlan_rx_add_vid(struct net_device *dev, __be16 proto,
5a5f2e3e Sridhar Samudrala 2018-04-25  262  					u16 vid)
5a5f2e3e Sridhar Samudrala 2018-04-25  263  {
5a5f2e3e Sridhar Samudrala 2018-04-25  264  	struct net_failover_info *nfo_info = netdev_priv(dev);
5a5f2e3e Sridhar Samudrala 2018-04-25  265  	struct net_device *primary_dev, *standby_dev;
5a5f2e3e Sridhar Samudrala 2018-04-25  266  	int ret = 0;
5a5f2e3e Sridhar Samudrala 2018-04-25  267  
5a5f2e3e Sridhar Samudrala 2018-04-25  268  	primary_dev = rcu_dereference(nfo_info->primary_dev);
5a5f2e3e Sridhar Samudrala 2018-04-25 @269  	if (primary_dev) {
5a5f2e3e Sridhar Samudrala 2018-04-25  270  		ret = vlan_vid_add(primary_dev, proto, vid);
5a5f2e3e Sridhar Samudrala 2018-04-25  271  		if (ret)
5a5f2e3e Sridhar Samudrala 2018-04-25  272  			return ret;
5a5f2e3e Sridhar Samudrala 2018-04-25  273  	}
5a5f2e3e Sridhar Samudrala 2018-04-25  274  
5a5f2e3e Sridhar Samudrala 2018-04-25  275  	standby_dev = rcu_dereference(nfo_info->standby_dev);
5a5f2e3e Sridhar Samudrala 2018-04-25  276  	if (standby_dev) {
5a5f2e3e Sridhar Samudrala 2018-04-25  277  		ret = vlan_vid_add(standby_dev, proto, vid);
5a5f2e3e Sridhar Samudrala 2018-04-25  278  		if (ret)
5a5f2e3e Sridhar Samudrala 2018-04-25 @279  			vlan_vid_del(primary_dev, proto, vid);
5a5f2e3e Sridhar Samudrala 2018-04-25  280  	}
5a5f2e3e Sridhar Samudrala 2018-04-25  281  
5a5f2e3e Sridhar Samudrala 2018-04-25  282  	return ret;
5a5f2e3e Sridhar Samudrala 2018-04-25  283  }
5a5f2e3e Sridhar Samudrala 2018-04-25  284  

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

^ permalink raw reply

* Re: [PATCH net-next v9 2/4] net: Introduce generic failover module
From: Jiri Pirko @ 2018-04-28  8:15 UTC (permalink / raw)
  To: Sridhar Samudrala
  Cc: alexander.h.duyck, virtio-dev, mst, kubakici, netdev,
	virtualization, loseweigh, aaron.f.brown, davem
In-Reply-To: <1524848820-42258-3-git-send-email-sridhar.samudrala@intel.com>

Fri, Apr 27, 2018 at 07:06:58PM CEST, sridhar.samudrala@intel.com wrote:
>This provides a generic interface for paravirtual drivers to listen
>for netdev register/unregister/link change events from pci ethernet
>devices with the same MAC and takeover their datapath. The notifier and
>event handling code is based on the existing netvsc implementation.
>
>It exposes 2 sets of interfaces to the paravirtual drivers.
>1. For paravirtual drivers like virtio_net that use 3 netdev model, the
>   the failover module provides interfaces to create/destroy additional
>   master netdev and all the slave events are managed internally.
>        net_failover_create()
>        net_failover_destroy()
>   A failover netdev is created that acts a master device and controls 2
>   slave devices. The original virtio_net netdev is registered as 'standby'
>   netdev and a passthru/vf device with the same MAC gets registered as
>   'primary' netdev. Both 'standby' and 'primary' netdevs are associated
>   with the same 'pci' device.  The user accesses the network interface via

'standby' and 'primary' netdevs are not associated with the same 'pci'
device.
"Primary" is the VF netdevice and "standby" is virtio_net. Each
associated with different pci device.


>   'failover' netdev. The 'failover' netdev chooses 'primary' netdev as
>   default for transmits when it is available with link up and running.
>2. For existing netvsc driver that uses 2 netdev model, no master netdev
>   is created. The paravirtual driver registers each instance of netvsc
>   as a 'failover' netdev  along with a set of ops to manage the slave
>   events. There is no 'standby' netdev in this model. A passthru/vf device
>   with the same MAC gets registered as 'primary' netdev.
>        net_failover_register()
>        net_failover_unregister()

[...]

^ permalink raw reply

* Re: [PATCH net-next v9 1/4] virtio_net: Introduce VIRTIO_NET_F_STANDBY feature bit
From: Jiri Pirko @ 2018-04-28  7:50 UTC (permalink / raw)
  To: Sridhar Samudrala
  Cc: alexander.h.duyck, virtio-dev, mst, kubakici, netdev,
	virtualization, loseweigh, aaron.f.brown, davem
In-Reply-To: <1524848820-42258-2-git-send-email-sridhar.samudrala@intel.com>

Fri, Apr 27, 2018 at 07:06:57PM CEST, sridhar.samudrala@intel.com wrote:
>This feature bit can be used by hypervisor to indicate virtio_net device to
>act as a standby for another device with the same MAC address.
>
>VIRTIO_NET_F_STANDBY is defined as bit 62 as it is a device feature bit.
>
>Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
>---
> drivers/net/virtio_net.c        | 2 +-
> include/uapi/linux/virtio_net.h | 3 +++
> 2 files changed, 4 insertions(+), 1 deletion(-)
>
>diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
>index 3b5991734118..51a085b1a242 100644
>--- a/drivers/net/virtio_net.c
>+++ b/drivers/net/virtio_net.c
>@@ -2999,7 +2999,7 @@ static struct virtio_device_id id_table[] = {
> 	VIRTIO_NET_F_GUEST_ANNOUNCE, VIRTIO_NET_F_MQ, \
> 	VIRTIO_NET_F_CTRL_MAC_ADDR, \
> 	VIRTIO_NET_F_MTU, VIRTIO_NET_F_CTRL_GUEST_OFFLOADS, \
>-	VIRTIO_NET_F_SPEED_DUPLEX
>+	VIRTIO_NET_F_SPEED_DUPLEX, VIRTIO_NET_F_STANDBY


This is not part of current qemu master (head 6f0c4706b35dead265509115ddbd2a8d1af516c1)
Were I can find the qemu code?

Also, I think it makes sense to push HW (qemu HW in this case) first
and only then the driver.



> 
> static unsigned int features[] = {
> 	VIRTNET_FEATURES,
>diff --git a/include/uapi/linux/virtio_net.h b/include/uapi/linux/virtio_net.h
>index 5de6ed37695b..a3715a3224c1 100644
>--- a/include/uapi/linux/virtio_net.h
>+++ b/include/uapi/linux/virtio_net.h
>@@ -57,6 +57,9 @@
> 					 * Steering */
> #define VIRTIO_NET_F_CTRL_MAC_ADDR 23	/* Set MAC address */
> 
>+#define VIRTIO_NET_F_STANDBY	  62	/* Act as standby for another device
>+					 * with the same MAC.
>+					 */
> #define VIRTIO_NET_F_SPEED_DUPLEX 63	/* Device set linkspeed and duplex */
> 
> #ifndef VIRTIO_NET_NO_LEGACY
>-- 
>2.14.3
>

^ permalink raw reply

* Re: [RFC v3 0/5] virtio: support packed ring
From: Jason Wang @ 2018-04-28  2:45 UTC (permalink / raw)
  To: Tiwei Bie; +Cc: Michael S. Tsirkin, netdev, linux-kernel, virtualization, wexu
In-Reply-To: <20180427091234.n6vr3wcnbnqxd3so@debian>



On 2018年04月27日 17:12, Tiwei Bie wrote:
> On Fri, Apr 27, 2018 at 02:17:51PM +0800, Jason Wang wrote:
>> On 2018年04月27日 12:18, Michael S. Tsirkin wrote:
>>> On Fri, Apr 27, 2018 at 11:56:05AM +0800, Jason Wang wrote:
>>>> On 2018年04月25日 13:15, Tiwei Bie wrote:
>>>>> Hello everyone,
>>>>>
>>>>> This RFC implements packed ring support in virtio driver.
>>>>>
>>>>> Some simple functional tests have been done with Jason's
>>>>> packed ring implementation in vhost:
>>>>>
>>>>> https://lkml.org/lkml/2018/4/23/12
>>>>>
>>>>> Both of ping and netperf worked as expected (with EVENT_IDX
>>>>> disabled). But there are below known issues:
>>>>>
>>>>> 1. Reloading the guest driver will break the Tx/Rx;
>>>> Will have a look at this issue.
>>>>
>>>>> 2. Zeroing the flags when detaching a used desc will
>>>>>       break the guest -> host path.
>>>> I still think zeroing flags is unnecessary or even a bug. At host, I track
>>>> last observed avail wrap counter and detect avail like (what is suggested in
>>>> the example code in the spec):
>>>>
>>>> static bool desc_is_avail(struct vhost_virtqueue *vq, __virtio16 flags)
>>>> {
>>>>          bool avail = flags & cpu_to_vhost16(vq, DESC_AVAIL);
>>>>
>>>>          return avail == vq->avail_wrap_counter;
>>>> }
>>>>
>>>> So zeroing wrap can not work with this obviously.
>>>>
>>>> Thanks
>>> I agree. I think what one should do is flip the available bit.
>>>
>> But is this flipping a must?
>>
>> Thanks
> Yeah, that's my question too. It seems to be a requirement
> for driver that, the only change to the desc status that a
> driver can do during running is to mark the desc as avail,
> and any other changes to the desc status are not allowed.
> Similarly, the device can only mark the desc as used, and
> any other changes to the desc status are also not allowed.
> So the question is, are there such requirements?

Looks not, but I think we need clarify this in the spec.

Thanks

>
> Based on below contents in the spec:
>
> """
> Thus VIRTQ_DESC_F_AVAIL and VIRTQ_DESC_F_USED bits are different
> for an available descriptor and equal for a used descriptor.
>
> Note that this observation is mostly useful for sanity-checking
> as these are necessary but not sufficient conditions
> """
>
> It seems that, it's necessary for devices to check whether
> the AVAIL bit and USED bit are different.
>
> Best regards,
> Tiwei Bie

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply

* Re: [PATCH net] vhost: Use kzalloc() to allocate vhost_msg_node
From: Jason Wang @ 2018-04-28  2:23 UTC (permalink / raw)
  To: Kevin Easton, Michael S. Tsirkin
  Cc: netdev, syzkaller-bugs, linux-kernel, kvm, virtualization
In-Reply-To: <20180428015106.GA27738@la.guarana.org>



On 2018年04月28日 09:51, Kevin Easton wrote:
> On Fri, Apr 27, 2018 at 09:07:56PM -0400, Kevin Easton wrote:
>> On Fri, Apr 27, 2018 at 07:05:45PM +0300, Michael S. Tsirkin wrote:
>>> On Fri, Apr 27, 2018 at 11:45:02AM -0400, Kevin Easton wrote:
>>>> The struct vhost_msg within struct vhost_msg_node is copied to userspace,
>>>> so it should be allocated with kzalloc() to ensure all structure padding
>>>> is zeroed.
>>>>
>>>> Signed-off-by: Kevin Easton <kevin@guarana.org>
>>>> Reported-by: syzbot+87cfa083e727a224754b@syzkaller.appspotmail.com
>>> Does it help if a patch naming the padding is applied,
>>> and then we init just the relevant field?
>>> Just curious.
>> No, I don't believe that is sufficient to fix the problem.
> Scratch that, somehow I missed the "..and then we init just the
> relevant field" part, sorry.
>
> There's still the padding after the vhost_iotlb_msg to consider.  It's
> named in the union but I don't think that's guaranteed to be initialised
> when the iotlb member of the union is used to initialise things.
>
>> I didn't name the padding in my original patch because I wasn't sure
>> if the padding actually exists on 32 bit architectures?
> This might still be a conce

Yes.

print &((struct vhost_msg *)0)->iotlb
$3 = (struct vhost_iotlb_msg *) 0x4


>
> At the end of the day, zeroing 96 bytes (the full size of vhost_msg_node)
> is pretty quick.
>
>      - Kevin

Right, and even if it may be used heavily in the data-path, zeroing is 
not the main delay in that path.

Thanks
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply

* Re: [PATCH v7 net-next 4/4] netvsc: refactor notifier/event handling code to use the failover framework
From: Siwei Liu @ 2018-04-28  0:43 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Alexander Duyck, virtio-dev, Jiri Pirko, Jakub Kicinski,
	Sridhar Samudrala, virtualization, Netdev, David Miller
In-Reply-To: <20180427012530-mutt-send-email-mst@kernel.org>

On Thu, Apr 26, 2018 at 4:42 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> On Thu, Apr 26, 2018 at 03:14:46PM -0700, Siwei Liu wrote:
>> On Wed, Apr 25, 2018 at 7:28 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
>> > On Wed, Apr 25, 2018 at 03:57:57PM -0700, Siwei Liu wrote:
>> >> On Wed, Apr 25, 2018 at 3:22 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
>> >> > On Wed, Apr 25, 2018 at 02:38:57PM -0700, Siwei Liu wrote:
>> >> >> On Mon, Apr 23, 2018 at 1:06 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
>> >> >> > On Mon, Apr 23, 2018 at 12:44:39PM -0700, Siwei Liu wrote:
>> >> >> >> On Mon, Apr 23, 2018 at 10:56 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
>> >> >> >> > On Mon, Apr 23, 2018 at 10:44:40AM -0700, Stephen Hemminger wrote:
>> >> >> >> >> On Mon, 23 Apr 2018 20:24:56 +0300
>> >> >> >> >> "Michael S. Tsirkin" <mst@redhat.com> wrote:
>> >> >> >> >>
>> >> >> >> >> > On Mon, Apr 23, 2018 at 10:04:06AM -0700, Stephen Hemminger wrote:
>> >> >> >> >> > > > >
>> >> >> >> >> > > > >I will NAK patches to change to common code for netvsc especially the
>> >> >> >> >> > > > >three device model.  MS worked hard with distro vendors to support transparent
>> >> >> >> >> > > > >mode, ans we really can't have a new model; or do backport.
>> >> >> >> >> > > > >
>> >> >> >> >> > > > >Plus, DPDK is now dependent on existing model.
>> >> >> >> >> > > >
>> >> >> >> >> > > > Sorry, but nobody here cares about dpdk or other similar oddities.
>> >> >> >> >> > >
>> >> >> >> >> > > The network device model is a userspace API, and DPDK is a userspace application.
>> >> >> >> >> >
>> >> >> >> >> > It is userspace but are you sure dpdk is actually poking at netdevs?
>> >> >> >> >> > AFAIK it's normally banging device registers directly.
>> >> >> >> >> >
>> >> >> >> >> > > You can't go breaking userspace even if you don't like the application.
>> >> >> >> >> >
>> >> >> >> >> > Could you please explain how is the proposed patchset breaking
>> >> >> >> >> > userspace? Ignoring DPDK for now, I don't think it changes the userspace
>> >> >> >> >> > API at all.
>> >> >> >> >> >
>> >> >> >> >>
>> >> >> >> >> The DPDK has a device driver vdev_netvsc which scans the Linux network devices
>> >> >> >> >> to look for Linux netvsc device and the paired VF device and setup the
>> >> >> >> >> DPDK environment.  This setup creates a DPDK failsafe (bondingish) instance
>> >> >> >> >> and sets up TAP support over the Linux netvsc device as well as the Mellanox
>> >> >> >> >> VF device.
>> >> >> >> >>
>> >> >> >> >> So it depends on existing 2 device model. You can't go to a 3 device model
>> >> >> >> >> or start hiding devices from userspace.
>> >> >> >> >
>> >> >> >> > Okay so how does the existing patch break that? IIUC does not go to
>> >> >> >> > a 3 device model since netvsc calls failover_register directly.
>> >> >> >> >
>> >> >> >> >> Also, I am working on associating netvsc and VF device based on serial number
>> >> >> >> >> rather than MAC address. The serial number is how Windows works now, and it makes
>> >> >> >> >> sense for Linux and Windows to use the same mechanism if possible.
>> >> >> >> >
>> >> >> >> > Maybe we should support same for virtio ...
>> >> >> >> > Which serial do you mean? From vpd?
>> >> >> >> >
>> >> >> >> > I guess you will want to keep supporting MAC for old hypervisors?
>> >> >> >> >
>> >> >> >> > It all seems like a reasonable thing to support in the generic core.
>> >> >> >>
>> >> >> >> That's the reason why I chose explicit identifier rather than rely on
>> >> >> >> MAC address to bind/pair a device. MAC address can change. Even if it
>> >> >> >> can't, malicious guest user can fake MAC address to skip binding.
>> >> >> >>
>> >> >> >> -Siwei
>> >> >> >
>> >> >> > Address should be sampled at device creation to prevent this
>> >> >> > kind of hack. Not that it buys the malicious user much:
>> >> >> > if you can poke at MAC addresses you probably already can
>> >> >> > break networking.
>> >> >>
>> >> >> I don't understand why poking at MAC address may potentially break
>> >> >> networking.
>> >> >
>> >> > Set a MAC address to match another device on the same LAN,
>> >> > packets will stop reaching that MAC.
>> >>
>> >> What I meant was guest users may create a virtual link, say veth that
>> >> has exactly the same MAC address as that for the VF, which can easily
>> >> get around of the binding procedure.
>> >
>> > This patchset limits binding to PCI devices so it won't be affected
>> > by any hacks around virtual devices.
>>
>> Wait, I vaguely recall you seemed to like to generalize this feature
>> to non-PCI device.
>
> It's purely a layering thing.  It is cleaner not to have PCI specific
> data in the device-specific transport-independent section of the virtio
> spec.
>
OK. So looks like you think it's okay to include PCI specific concept
but not the data? Like a feature indicating the virtio device is
behind a (external) PCI bridge, and perhaps also includes the data
present in the PCI bridge/function's capability?

Sorry for asking tough questions. I still need to understand and
digest the boundary of this layering thing.

>
>> But now you're saying it should stick to PCI. It's
>> not that I'm reluctant with sticking to PCI. The fact is that I don't
>> think we can go with implementation until the semantics of the
>> so-called _F_STANDBY feature can be clearly defined into the spec.
>> Previously the boundary of using MAC address as the identifier for
>> bonding was quite confusing to me. And now PCI adds to the matrix.
>
> PCI is simply one way to exclude software NICs. It's not the most
> elegant one, but it will cover many setups.  We can add more types, but
> we do want to exclude software devices since these have
> not been supplied by the hypervisor.

I'm afraid it's a loose end. The real thing is there's no way to
indicate VF or passthrough device on Linux, even true on some other
OS. There's no such flag exists yet. Even the emulated e1000 and
rltk8139 device looks the same as PCI device. And as part of the
requirements of being a spec, the behaviour and expectation need to be
precisely described for implementations to follow. There's no point to
assume just one OS will implement this feature so it needs to depend
on specifics of that OS.

>
>> However it still does not gurantee uniqueness I think. It's almost
>> incorrect of choosing MAC address as the ID in the beginning since
>> that has the implication of breaking existing configs.
>
> IMO there's no chance it will break any existing config since
> no existing config sets _F_STANDBY.

True, but it breaks people's expectation that it has to rely on MAC
address being unique when turning it on for live migration, and once
it happens some configs with same MAC address would break (for e.g.
bonding setup can have it for cross subnet failover and site
replication). Unless this limitation is clearly documented in the spec
I don't think people will notice that until it breaks.

>
>> I don't think
>> libvirt or QEMU today retricts the MAC address to be unique per VM
>> instance. Neither the virtio spec mentions that.
>
> You really don't have to.
>
>> In addition, it's difficult to fake PCI device on Linux does not mean
>> the same applies to other OSes that is going to implement this VirtIO
>> feature. It's a fragile assumption IMHO.
>
> What an OS does internally is its own business.
>
> What we are telling the guest here is simply that the virtio NIC is
> actually the same device as some other NIC. At this point we do not
> specify this other NIC in any way. So how do you find it?  Well it has
> to have the same MAC clearly.

Well this condition is absolutely neccessary but not sufficient. There
should be some other unique key to help find the NIC as the MAC cannot
be unique as what people generally thought it be.

>
> You point out that there could be multiple NICs with the same
> MAC in theory. It's a broken config generally but since it
> kind of works in some setups maybe it's worth supporting.
> If so we can look for ways to make the matching more specific by e.g.
> adding more flags but I see that as a separate issue,
> and pretty narrow in scope.

Well there are precedents that people thought something broken but
soon find out users already depends on the "broken" behaviour.
Nowadays widely use of virtualization technology make MAC address
duplication really cheap. It's not that uncommon as one might think.

Unless the expectation can be explicitly documented in the spec, I
don't feel it's something users can easily infer from what the new
feature should target - live migration.

>
>> >
>> >> There's no explicit flag to
>> >> identify a VF or pass-through device AFAIK. And sometimes this happens
>> >> maybe due to user misconfiguring the link. This process should be
>> >> hardened to avoid from any potential configuration errors.
>> >
>> > They are still PCI devices though.
>> >
>> >> >
>> >> >> Unlike VF, passthrough PCI endpoint device has its freedom
>> >> >> to change the MAC address. Even on a VF setup it's not neccessarily
>> >> >> always safe to assume the VF's MAC address cannot or shouldn't be
>> >> >> changed. That depends on the specific need whether the host admin
>> >> >> wants to restrict guest from changing the MAC address, although in
>> >> >> most cases it's true.
>> >> >>
>> >> >> I understand we can use the perm_addr to distinguish. But as said,
>> >> >> this will pose limitation of flexible configuration where one can
>> >> >> assign VFs with identical MAC address at all while each VF belongs to
>> >> >> different PF and/or different subnet for e.g. load balancing.
>> >> >> And
>> >> >> furthermore, the QEMU device model never uses MAC address to be
>> >> >> interpreted as an identifier, which requires to be unique per VM
>> >> >> instance. Why we're introducing this inconsistency?
>> >> >>
>> >> >> -Siwei
>> >> >
>> >> > Because it addresses most of the issues and is simple.  That's already
>> >> > much better than what we have now which is nothing unless guest
>> >> > configures things manually.
>> >>
>> >> Did you see my QEMU patch for using BDF as the grouping identifier?
>> >
>> > Yes. And I don't think it can work because bus numbers are
>> > guest specified.
>>
>> I know it's not ideal but perhaps its the best one can do in the KVM
>> world without adding complex config e.g. PCI bridge.
>
> KVM is just a VMX/SVM driver. I think you mean QEMU.  And well -
> "best one can do" is a high bar to clear.
>
>

Glad you'd have to admit that there's no better way *without
introducing complex PCI bridge setup* in the KVM, oops, QEMU without
KVM? err, QEMU with KVM world.

>> Even if bus
>> number is guest specified, it's readily available in the guest and
>> recognizable by any OS, while on the QEMU configuration users specify
>> an id instead of the bus number. Unlike Hyper-V PCI bus, I don't think
>> there exists a para-virtual PCI bus in QEMU backend to expose VPD
>> capability to a passthrough device.
>
> We can always add more interfaces if we need them.  But let's be clear
> that we are adding an interface and what are we trying to fix by doing
> it. Let's not mix it as part of the failover discussion.

I'm sorry, I don't understand why this should not be part of the
failover discussion.

There's a lot of ambiguity about the semantics and the expectation of
the _F_STANDBY feature, and that should be recorded in virtio-dev. If
you think we should run it with a different thread, I can definitely
fork a new thread to continue.

As you may wonder, the other aspects unclear to me now are:
- does this feature imply the device model already? The 3-netdev?
- should clear the feature bit upon unsuccessful creation of the
failover interface or failure to enslave the VF?
- does the feature bit indicate migratability status for the
corresponding VF/PT device?
- does the feature expect automatic bonding by default or always?
- does the guest user have the freedom to disable/re-enable the
automatic bonding? such that they can use raw VF for DPDK or RDMA
after the migration
- ...

I hope the answer won't just be to look at what the current
implementation is doing. The discussion will be helpful, at least not
harmful, for people to understand the intention and definition
clearly, since live migration itself is just too complicated.

>
>> >
>> >> And there can be others like what you suggested, but the point is that
>> >> it's requried to support explicit grouping mechanism from day one,
>> >> before the backup property cast into stones.
>> >
>> > Let's start with addressing simple configs with just two NICs.
>> >
>> > Down the road I can see possible extensions that can work: for example,
>> > require that devices are on the same pci bridge. Or we could even make
>> > the virtio device actually include a pci bridge (as part of same
>> > or a child function), the PT would have to be
>> > behind it.
>> >
>> > As long as we are not breaking anything, adding more flags to fix
>> > non-working configurations is always fair game.
>>
>> While it may work, the PCI bridge has NUMA and IOMMU implications that
>> would restrict the current flexibility to group devices.
>
> It's interesting you should mention that.
>
> If you want to be flexible in placing the primary device WRT NUMA and
> IOMMU, and given that both IOMMU and NUMA are keyed by the bus address,
> then doesn't this completely break the idea of passing
> the bus address to the guest?

I'm confused. Isn't the NUMA and IOMMU disposition host admin should
explicitly define? In that case it's assumed that s/he understand the
implication and the bus address doesn't restrict the host admin from
placing the device according to the NUMA or IOMMU
consideration/constrait.

>
>> I'm not sure
>> if vIOMMU would have to be introduced inadvertently for
>> isolation/protection of devices under the PCI bridge which may cause
>> negative performance impact on the VF.
>
> No idea how do you introduce an IOMMU inadvertently.

If the virtio has to be behind a different bridge thus IOMMU domain
than that for VF (which does not actually need a guest IOMMU) then
your former proposal of grouping them *under the same bridge* would
come across hurtles.

>
>> >
>> >> This is orthogonal to
>> >> device model being proposed, be it 1-netdev or not. Delaying it would
>> >> just mean support and compatibility burden, appearing more like a
>> >> design flaw rather than a feature to add later on.
>> >
>> > Well it's mostly myself who gets to support it, and I see the device
>> > model as much more fundamental as userspace will come to depend
>> > on it. So I'm not too worried, let's take this one step at a time.
>> >
>> >> >
>> >> > I think ideally the infrastructure should suppport flexible matching of
>> >> > NICs - netvsc is already reported to be moving to some kind of serial
>> >> > address.
>> >> >
>> >> As Stephen said, Hyper-V supports the serial UUID thing from day-one.
>> >> It's just the Linux netvsc guest driver itself does not leverage that
>> >> ID from the very beginging.
>> >>
>> >> Regards,
>> >> -Siwei
>> >
>> > We could add something like this, too. For example,
>> > we could add a virtual VPD capability with a UUID.
>>
>> I'm not an expert on that and wonder how you could do this (add a
>> virtual VPD capability with a UUID to passthrough device) with
>> existing QEMU emulation model and native PCI bus.
>
>
> I think I see an elegant way to do that.
>
> You could put it in the port where you want to stick you PT device.
>
> Here's how it could work then:
>
>
> - standby virtio device is tied to a pci bridge.
>
>   Tied how? Well it could be
>   - behind this bridge

An external PCI bridge? This gets back to the first question I ask.
It's interesting a virtio feature should reference an externel object
which seems more like a layering problem at least to me.

>   - include a bridge internally
This internal one being a native PCI bridge or VirtIO PCI bridge? I'm
almost cerntain it should be the latter down the road. That determines
where the VPD or SN capability should reside.

>   - have the bridge as a PCI function
>   - include a bridge and the bridge as a PCI function
>   - have a VPD or serial capability with same UUID as the bridge
>
> - primary passthrough device is placed behind a bridge
>   *with the same ID*
>
>         - either simply behind the same bridge
>         - or behind another bridge with the same UUID.
>
Good. Decouple the concept of grouping to rely on same PCI bridge, and
another bridge with same UUID seems more flexible and promissing.

>
> The treatment could also be limited just to bridges which have a
> specific vendor/device id (maybe a good idea), or in any other arbitrary
> way.

I'd think anway VirtIO spec revision is unavoidable if you have to
involve PCI bridge. Not so complicated?

Regards,
-Siwei

>
>
>
>
>> >
>> > Do you know how exactly does hyperv pass the UUID for NICs?
>>
>> Stephen might know it more and can correct me. But my personal
>> interpretation is that the SN is a host generated 32 bit sequence
>> number which is unique per VM instance and gets propogated to guest
>> via the para-virtual Hyper-V PCI bus.
>>
>> Regards,
>> -Siwei
>
> Ah, so it's a Hyper-V thing.
>
>
>
>
>> >
>> >> >
>> >> >> >
>> >> >> >
>> >> >> >
>> >> >> >
>> >> >> >>
>> >> >> >> >
>> >> >> >> > --
>> >> >> >> > MST

^ permalink raw reply

* Re: [dm-devel] [PATCH v5] fault-injection: introduce kvmalloc fallback options
From: Mikulas Patocka @ 2018-04-27 23:20 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Andrew, eric.dumazet, linux-mm, edumazet, netdev, Randy Dunlap,
	John Stoffel, linux-kernel, Matthew Wilcox, Hocko,
	James Bottomley, Michal, dm-devel, David Rientjes, Morton,
	virtualization, David Miller, Vlastimil Babka
In-Reply-To: <20180427005213-mutt-send-email-mst@kernel.org>



On Fri, 27 Apr 2018, Michael S. Tsirkin wrote:

> 2. Ability to control this from a separate config
>    option.
> 
>    It's still not that clear to me why is this such a
>    hard requirement.  If a distro wants to force specific
>    boot time options, why isn't CONFIG_CMDLINE sufficient?

So, try this and turn it on with CONFIG_CMDLINE. But I'm not a 
blogger and I will not write a blog post about it as James Bottomley 
suggests :-)
- so very few users will use it. 


fault-injection: introduce kvmalloc fallback options

This patch introduces a fault-injection option "kvmalloc_fallback". This
option makes kvmalloc randomly fall back to vmalloc.

Unfortunately, some kernel code has bugs - it uses kvmalloc and then
uses DMA-API on the returned memory or frees it with kfree. Such bugs were
found in the virtio-net driver, dm-integrity or RHEL7 powerpc-specific
code. This options helps to test for these bugs.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>

---
 Documentation/fault-injection/fault-injection.txt |    7 +++++
 mm/util.c                                         |   30 ++++++++++++++++++++++
 2 files changed, 37 insertions(+)

Index: linux-2.6/Documentation/fault-injection/fault-injection.txt
===================================================================
--- linux-2.6.orig/Documentation/fault-injection/fault-injection.txt	2018-04-28 01:01:25.000000000 +0200
+++ linux-2.6/Documentation/fault-injection/fault-injection.txt	2018-04-28 01:01:25.000000000 +0200
@@ -15,6 +15,12 @@ o fail_page_alloc
 
   injects page allocation failures. (alloc_pages(), get_free_pages(), ...)
 
+o kvmalloc_fallback
+
+  makes the function kvmalloc randomly fall back to vmalloc. This could be used
+  to detects bugs such as using DMA-API on the result of kvmalloc or freeing
+  the result of kvmalloc with free.
+
 o fail_futex
 
   injects futex deadlock and uaddr fault errors.
@@ -167,6 +173,7 @@ use the boot option:
 
 	failslab=
 	fail_page_alloc=
+	kvmalloc_fallback=
 	fail_make_request=
 	fail_futex=
 	mmc_core.fail_request=<interval>,<probability>,<space>,<times>
Index: linux-2.6/mm/util.c
===================================================================
--- linux-2.6.orig/mm/util.c	2018-04-28 01:01:25.000000000 +0200
+++ linux-2.6/mm/util.c	2018-04-28 01:03:25.000000000 +0200
@@ -14,6 +14,7 @@
 #include <linux/hugetlb.h>
 #include <linux/vmalloc.h>
 #include <linux/userfaultfd_k.h>
+#include <linux/fault-inject.h>
 
 #include <asm/sections.h>
 #include <linux/uaccess.h>
@@ -377,6 +378,29 @@ unsigned long vm_mmap(struct file *file,
 }
 EXPORT_SYMBOL(vm_mmap);
 
+#ifdef CONFIG_FAULT_INJECTION
+
+static DECLARE_FAULT_ATTR(kvmalloc_fallback);
+
+static int __init setup_kvmalloc_fallback(char *str)
+{
+	kvmalloc_fallback.verbose = 0;
+	return setup_fault_attr(&kvmalloc_fallback, str);
+}
+
+__setup("kvmalloc_fallback=", setup_kvmalloc_fallback);
+
+#ifdef CONFIG_FAULT_INJECTION_DEBUG_FS
+static int __init kvmalloc_fallback_debugfs_init(void)
+{
+	fault_create_debugfs_attr("kvmalloc_fallback", NULL, &kvmalloc_fallback);
+	return 0;
+}
+late_initcall(kvmalloc_fallback_debugfs_init);
+#endif
+
+#endif
+
 /**
  * kvmalloc_node - attempt to allocate physically contiguous memory, but upon
  * failure, fall back to non-contiguous (vmalloc) allocation.
@@ -404,6 +428,11 @@ void *kvmalloc_node(size_t size, gfp_t f
 	 */
 	WARN_ON_ONCE((flags & GFP_KERNEL) != GFP_KERNEL);
 
+#ifdef CONFIG_FAULT_INJECTION
+	if (should_fail(&kvmalloc_fallback, size))
+		goto do_vmalloc;
+#endif
+
 	/*
 	 * We want to attempt a large physically contiguous block first because
 	 * it is less likely to fragment multiple larger blocks and therefore
@@ -427,6 +456,7 @@ void *kvmalloc_node(size_t size, gfp_t f
 	if (ret || size <= PAGE_SIZE)
 		return ret;
 
+do_vmalloc: __maybe_unused
 	return __vmalloc_node_flags_caller(size, node, flags,
 			__builtin_return_address(0));
 }

^ permalink raw reply

* Re: [PATCH net-next v9 0/4] Enable virtio_net to act as a standby for a passthru device
From: Jiri Pirko @ 2018-04-27 19:38 UTC (permalink / raw)
  To: Samudrala, Sridhar
  Cc: alexander.h.duyck, virtio-dev, mst, kubakici, netdev,
	virtualization, loseweigh, aaron.f.brown, davem
In-Reply-To: <a60d08cd-6c8b-ea1e-e76c-4decba563e99@intel.com>

Fri, Apr 27, 2018 at 07:53:01PM CEST, sridhar.samudrala@intel.com wrote:
>On 4/27/2018 10:45 AM, Jiri Pirko wrote:
>> Fri, Apr 27, 2018 at 07:06:56PM CEST, sridhar.samudrala@intel.com wrote:

[...]

>> 
>> No changes in v9?
>
>I listed v9 updates at the start of the message.

Hmm, odd. I expected that at the end, in the changelog among other Vs
changes.

Will review this patchset tomorrow.
Thanks!

^ permalink raw reply

* Re: [PATCH net] vhost: Use kzalloc() to allocate vhost_msg_node
From: Michael S. Tsirkin @ 2018-04-27 19:36 UTC (permalink / raw)
  To: Dmitry Vyukov
  Cc: Kevin Easton, KVM list, netdev, syzkaller-bugs, LKML,
	virtualization
In-Reply-To: <CACT4Y+b4fEOw=GjTsZMHVyQPOZmuqPWj6AJJshpS-UGRx=bhXA@mail.gmail.com>

On Fri, Apr 27, 2018 at 06:11:31PM +0200, Dmitry Vyukov wrote:
> On Fri, Apr 27, 2018 at 6:05 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> > On Fri, Apr 27, 2018 at 11:45:02AM -0400, Kevin Easton wrote:
> >> The struct vhost_msg within struct vhost_msg_node is copied to userspace,
> >> so it should be allocated with kzalloc() to ensure all structure padding
> >> is zeroed.
> >>
> >> Signed-off-by: Kevin Easton <kevin@guarana.org>
> >> Reported-by: syzbot+87cfa083e727a224754b@syzkaller.appspotmail.com
> >
> > Does it help if a patch naming the padding is applied,
> > and then we init just the relevant field?
> > Just curious.
> 
> Yes, it would help.

How about a Tested-by tag then?

> >> ---
> >>  drivers/vhost/vhost.c | 2 +-
> >>  1 file changed, 1 insertion(+), 1 deletion(-)
> >>
> >> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> >> index f3bd8e9..1b84dcff 100644
> >> --- a/drivers/vhost/vhost.c
> >> +++ b/drivers/vhost/vhost.c
> >> @@ -2339,7 +2339,7 @@ EXPORT_SYMBOL_GPL(vhost_disable_notify);
> >>  /* Create a new message. */
> >>  struct vhost_msg_node *vhost_new_msg(struct vhost_virtqueue *vq, int type)
> >>  {
> >> -     struct vhost_msg_node *node = kmalloc(sizeof *node, GFP_KERNEL);
> >> +     struct vhost_msg_node *node = kzalloc(sizeof *node, GFP_KERNEL);
> >>       if (!node)
> >>               return NULL;
> >>       node->vq = vq;
> >> --
> >> 2.8.1
> >
> > --
> > You received this message because you are subscribed to the Google Groups "syzkaller-bugs" group.
> > To unsubscribe from this group and stop receiving emails from it, send an email to syzkaller-bugs+unsubscribe@googlegroups.com.
> > To view this discussion on the web visit https://groups.google.com/d/msgid/syzkaller-bugs/20180427185501-mutt-send-email-mst%40kernel.org.
> > For more options, visit https://groups.google.com/d/optout.

^ permalink raw reply

* Re: [PATCH net-next v9 2/4] net: Introduce generic failover module
From: Jiri Pirko @ 2018-04-27 17:53 UTC (permalink / raw)
  To: Sridhar Samudrala
  Cc: alexander.h.duyck, virtio-dev, mst, kubakici, netdev,
	virtualization, loseweigh, aaron.f.brown, davem
In-Reply-To: <1524848820-42258-3-git-send-email-sridhar.samudrala@intel.com>

Fri, Apr 27, 2018 at 07:06:58PM CEST, sridhar.samudrala@intel.com wrote:
>This provides a generic interface for paravirtual drivers to listen
>for netdev register/unregister/link change events from pci ethernet
>devices with the same MAC and takeover their datapath. The notifier and
>event handling code is based on the existing netvsc implementation.
>
>It exposes 2 sets of interfaces to the paravirtual drivers.
>1. For paravirtual drivers like virtio_net that use 3 netdev model, the
>   the failover module provides interfaces to create/destroy additional
>   master netdev and all the slave events are managed internally.
>        net_failover_create()
>        net_failover_destroy()
>   A failover netdev is created that acts a master device and controls 2
>   slave devices. The original virtio_net netdev is registered as 'standby'
>   netdev and a passthru/vf device with the same MAC gets registered as
>   'primary' netdev. Both 'standby' and 'primary' netdevs are associated
>   with the same 'pci' device.  The user accesses the network interface via
>   'failover' netdev. The 'failover' netdev chooses 'primary' netdev as
>   default for transmits when it is available with link up and running.
>2. For existing netvsc driver that uses 2 netdev model, no master netdev
>   is created. The paravirtual driver registers each instance of netvsc
>   as a 'failover' netdev  along with a set of ops to manage the slave
>   events. There is no 'standby' netdev in this model. A passthru/vf device
>   with the same MAC gets registered as 'primary' netdev.
>        net_failover_register()
>        net_failover_unregister()
>
>Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
>---
> include/linux/netdevice.h  |  16 +
> include/net/net_failover.h |  62 ++++
> net/Kconfig                |  10 +
> net/core/Makefile          |   1 +
> net/core/net_failover.c    | 892 +++++++++++++++++++++++++++++++++++++++++++++
> 5 files changed, 981 insertions(+)
> create mode 100644 include/net/net_failover.h
> create mode 100644 net/core/net_failover.c

checkpatch says:

_exportax/0002-net-Introduce-generic-failover-module.patch
----------------------------------------------------------
WARNING: added, moved or deleted file(s), does MAINTAINERS need updating?
#92: 
new file mode 100644

Please add an entry to the MAINTAINERS file.

^ permalink raw reply

* Re: [PATCH net-next v9 0/4] Enable virtio_net to act as a standby for a passthru device
From: Samudrala, Sridhar @ 2018-04-27 17:53 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: alexander.h.duyck, virtio-dev, mst, kubakici, netdev,
	virtualization, loseweigh, aaron.f.brown, davem
In-Reply-To: <20180427174523.GE5632@nanopsycho.orion>

On 4/27/2018 10:45 AM, Jiri Pirko wrote:
> Fri, Apr 27, 2018 at 07:06:56PM CEST, sridhar.samudrala@intel.com wrote:
>> v9:
>> Select NET_FAILOVER automatically when VIRTIO_NET/HYPERV_NET
>> are enabled. (stephen)
>>
>> Tested live migration with virtio-net/AVF(i40evf) configured in
>> failover mode while running iperf in background.
>> Build tested netvsc module.
>>
>> The main motivation for this patch is to enable cloud service providers
>> to provide an accelerated datapath to virtio-net enabled VMs in a
>> transparent manner with no/minimal guest userspace changes. This also
>> enables hypervisor controlled live migration to be supported with VMs that
>> have direct attached SR-IOV VF devices.
>>
>> Patch 1 introduces a new feature bit VIRTIO_NET_F_STANDBY that can be
>> used by hypervisor to indicate that virtio_net interface should act as
>> a standby for another device with the same MAC address.
>>
>> Patch 2 introduces a failover module that provides a generic interface for
>> paravirtual drivers to listen for netdev register/unregister/link change
>> events from pci ethernet devices with the same MAC and takeover their
>> datapath. The notifier and event handling code is based on the existing
>> netvsc implementation. It provides 2 sets of interfaces to paravirtual
>> drivers to support 2-netdev(netvsc) and 3-netdev(virtio_net) models.
>>
>> Patch 3 extends virtio_net to use alternate datapath when available and
>> registered. When STANDBY feature is enabled, virtio_net driver creates
>> an additional 'failover' netdev that acts as a master device and controls
>> 2 slave devices.  The original virtio_net netdev is registered as
>> 'standby' netdev and a passthru/vf device with the same MAC gets
>> registered as 'primary' netdev. Both 'standby' and 'primary' netdevs are
>> associated with the same 'pci' device.  The user accesses the network
>> interface via 'failover' netdev. The 'failover' netdev chooses 'primary'
>> netdev as default for transmits when it is available with link up and
>> running.
>>
>> Patch 4 refactors netvsc to use the registration/notification framework
>> supported by failover module.
>>
>> As this patch series is initially focusing on usecases where hypervisor
>> fully controls the VM networking and the guest is not expected to directly
>> configure any hardware settings, it doesn't expose all the ndo/ethtool ops
>> that are supported by virtio_net at this time. To support additional usecases,
>> it should be possible to enable additional ops later by caching the state
>> in virtio netdev and replaying when the 'primary' netdev gets registered.
>>
>> The hypervisor needs to enable only one datapath at any time so that packets
>> don't get looped back to the VM over the other datapath. When a VF is
>> plugged, the virtio datapath link state can be marked as down.
>> At the time of live migration, the hypervisor needs to unplug the VF device
> >from the guest on the source host and reset the MAC filter of the VF to
>> initiate failover of datapath to virtio before starting the migration. After
>> the migration is completed, the destination hypervisor sets the MAC filter
>> on the VF and plugs it back to the guest to switch over to VF datapath.
>>
>> This patch is based on the discussion initiated by Jesse on this thread.
>> https://marc.info/?l=linux-virtualization&m=151189725224231&w=2
>
> No changes in v9?

I listed v9 updates at the start of the message.

v9:
Select NET_FAILOVER automatically when VIRTIO_NET/HYPERV_NET
are enabled. (stephen)

Tested live migration with virtio-net/AVF(i40evf) configured in
failover mode while running iperf in background.
Build tested netvsc module.



>
>> v8:
>> - Made the failover managment routines more robust by updating the feature
>>   bits/other fields in the failover netdev when slave netdevs are
>>   registered/unregistered. (mst)
>> - added support for handling vlans.
>> - Limited the changes in netvsc to only use the notifier/event/lookups
>>   from the failover module. The slave register/unregister/link-change
>>   handlers are only updated to use the getbymac routine to get the
>>   upper netdev. There is no change in their functionality. (stephen)
>> - renamed structs/function/file names to use net_failover prefix. (mst)
>>
>> v7
>> - Rename 'bypass/active/backup' terminology with 'failover/primary/standy'
>>   (jiri, mst)
>> - re-arranged dev_open() and dev_set_mtu() calls in the register routines
>>   so that they don't get called for 2-netdev model. (stephen)
>> - fixed select_queue() routine to do queue selection based on VF if it is
>>   registered as primary. (stephen)
>> -  minor bugfixes
>>
>> v6 RFC:
>>   Simplified virtio_net changes by moving all the ndo_ops of the
>>   bypass_netdev and create/destroy of bypass_netdev to 'bypass' module.
>>   avoided 2 phase registration(driver + instances).
>>   introduced IFF_BYPASS/IFF_BYPASS_SLAVE dev->priv_flags
>>   replaced mutex with a spinlock
>>
>> v5 RFC:
>>   Based on Jiri's comments, moved the common functionality to a 'bypass'
>>   module so that the same notifier and event handlers to handle child
>>   register/unregister/link change events can be shared between virtio_net
>>   and netvsc.
>>   Improved error handling based on Siwei's comments.
>> v4:
>> - Based on the review comments on the v3 version of the RFC patch and
>>   Jakub's suggestion for the naming issue with 3 netdev solution,
>>   proposed 3 netdev in-driver bonding solution for virtio-net.
>> v3 RFC:
>> - Introduced 3 netdev model and pointed out a couple of issues with
>>   that model and proposed 2 netdev model to avoid these issues.
>> - Removed broadcast/multicast optimization and only use virtio as
>>   backup path when VF is unplugged.
>> v2 RFC:
>> - Changed VIRTIO_NET_F_MASTER to VIRTIO_NET_F_BACKUP (mst)
>> - made a small change to the virtio-net xmit path to only use VF datapath
>>   for unicasts. Broadcasts/multicasts use virtio datapath. This avoids
>>   east-west broadcasts to go over the PCI link.
>> - added suppport for the feature bit in qemu
>>
>> Sridhar Samudrala (4):
>>   virtio_net: Introduce VIRTIO_NET_F_STANDBY feature bit
>>   net: Introduce generic failover module
>>   virtio_net: Extend virtio to use VF datapath when available
>>   netvsc: refactor notifier/event handling code to use the failover
>>     framework
>>
>> drivers/net/Kconfig             |   1 +
>> drivers/net/hyperv/Kconfig      |   1 +
>> drivers/net/hyperv/hyperv_net.h |   2 +
>> drivers/net/hyperv/netvsc_drv.c | 134 ++----
>> drivers/net/virtio_net.c        |  37 +-
>> include/linux/netdevice.h       |  16 +
>> include/net/net_failover.h      |  62 +++
>> include/uapi/linux/virtio_net.h |   3 +
>> net/Kconfig                     |  10 +
>> net/core/Makefile               |   1 +
>> net/core/net_failover.c         | 892 ++++++++++++++++++++++++++++++++++++++++
>> 11 files changed, 1046 insertions(+), 113 deletions(-)
>> create mode 100644 include/net/net_failover.h
>> create mode 100644 net/core/net_failover.c
>>
>> -- 
>> 2.14.3

^ permalink raw reply

* Re: [PATCH net-next v9 0/4] Enable virtio_net to act as a standby for a passthru device
From: Jiri Pirko @ 2018-04-27 17:45 UTC (permalink / raw)
  To: Sridhar Samudrala
  Cc: alexander.h.duyck, virtio-dev, mst, kubakici, netdev,
	virtualization, loseweigh, aaron.f.brown, davem
In-Reply-To: <1524848820-42258-1-git-send-email-sridhar.samudrala@intel.com>

Fri, Apr 27, 2018 at 07:06:56PM CEST, sridhar.samudrala@intel.com wrote:
>v9:
>Select NET_FAILOVER automatically when VIRTIO_NET/HYPERV_NET 
>are enabled. (stephen)
>
>Tested live migration with virtio-net/AVF(i40evf) configured in 
>failover mode while running iperf in background.
>Build tested netvsc module.
>
>The main motivation for this patch is to enable cloud service providers
>to provide an accelerated datapath to virtio-net enabled VMs in a 
>transparent manner with no/minimal guest userspace changes. This also
>enables hypervisor controlled live migration to be supported with VMs that
>have direct attached SR-IOV VF devices.
>
>Patch 1 introduces a new feature bit VIRTIO_NET_F_STANDBY that can be
>used by hypervisor to indicate that virtio_net interface should act as
>a standby for another device with the same MAC address.
>
>Patch 2 introduces a failover module that provides a generic interface for 
>paravirtual drivers to listen for netdev register/unregister/link change
>events from pci ethernet devices with the same MAC and takeover their
>datapath. The notifier and event handling code is based on the existing
>netvsc implementation. It provides 2 sets of interfaces to paravirtual 
>drivers to support 2-netdev(netvsc) and 3-netdev(virtio_net) models.
>
>Patch 3 extends virtio_net to use alternate datapath when available and
>registered. When STANDBY feature is enabled, virtio_net driver creates
>an additional 'failover' netdev that acts as a master device and controls
>2 slave devices.  The original virtio_net netdev is registered as
>'standby' netdev and a passthru/vf device with the same MAC gets
>registered as 'primary' netdev. Both 'standby' and 'primary' netdevs are
>associated with the same 'pci' device.  The user accesses the network
>interface via 'failover' netdev. The 'failover' netdev chooses 'primary'
>netdev as default for transmits when it is available with link up and
>running.
>
>Patch 4 refactors netvsc to use the registration/notification framework
>supported by failover module.
>
>As this patch series is initially focusing on usecases where hypervisor 
>fully controls the VM networking and the guest is not expected to directly 
>configure any hardware settings, it doesn't expose all the ndo/ethtool ops
>that are supported by virtio_net at this time. To support additional usecases,
>it should be possible to enable additional ops later by caching the state
>in virtio netdev and replaying when the 'primary' netdev gets registered. 
> 
>The hypervisor needs to enable only one datapath at any time so that packets
>don't get looped back to the VM over the other datapath. When a VF is
>plugged, the virtio datapath link state can be marked as down.
>At the time of live migration, the hypervisor needs to unplug the VF device
>from the guest on the source host and reset the MAC filter of the VF to
>initiate failover of datapath to virtio before starting the migration. After
>the migration is completed, the destination hypervisor sets the MAC filter
>on the VF and plugs it back to the guest to switch over to VF datapath.
>
>This patch is based on the discussion initiated by Jesse on this thread.
>https://marc.info/?l=linux-virtualization&m=151189725224231&w=2


No changes in v9?


>
>v8:
>- Made the failover managment routines more robust by updating the feature 
>  bits/other fields in the failover netdev when slave netdevs are 
>  registered/unregistered. (mst)
>- added support for handling vlans.
>- Limited the changes in netvsc to only use the notifier/event/lookups
>  from the failover module. The slave register/unregister/link-change 
>  handlers are only updated to use the getbymac routine to get the 
>  upper netdev. There is no change in their functionality. (stephen)
>- renamed structs/function/file names to use net_failover prefix. (mst)
>
>v7
>- Rename 'bypass/active/backup' terminology with 'failover/primary/standy'
>  (jiri, mst)
>- re-arranged dev_open() and dev_set_mtu() calls in the register routines
>  so that they don't get called for 2-netdev model. (stephen)
>- fixed select_queue() routine to do queue selection based on VF if it is
>  registered as primary. (stephen)
>-  minor bugfixes
>
>v6 RFC:
>  Simplified virtio_net changes by moving all the ndo_ops of the 
>  bypass_netdev and create/destroy of bypass_netdev to 'bypass' module.
>  avoided 2 phase registration(driver + instances).
>  introduced IFF_BYPASS/IFF_BYPASS_SLAVE dev->priv_flags 
>  replaced mutex with a spinlock
>
>v5 RFC:
>  Based on Jiri's comments, moved the common functionality to a 'bypass'
>  module so that the same notifier and event handlers to handle child
>  register/unregister/link change events can be shared between virtio_net
>  and netvsc.
>  Improved error handling based on Siwei's comments.
>v4:
>- Based on the review comments on the v3 version of the RFC patch and
>  Jakub's suggestion for the naming issue with 3 netdev solution,
>  proposed 3 netdev in-driver bonding solution for virtio-net.
>v3 RFC:
>- Introduced 3 netdev model and pointed out a couple of issues with
>  that model and proposed 2 netdev model to avoid these issues.
>- Removed broadcast/multicast optimization and only use virtio as
>  backup path when VF is unplugged.
>v2 RFC:
>- Changed VIRTIO_NET_F_MASTER to VIRTIO_NET_F_BACKUP (mst)
>- made a small change to the virtio-net xmit path to only use VF datapath
>  for unicasts. Broadcasts/multicasts use virtio datapath. This avoids
>  east-west broadcasts to go over the PCI link.
>- added suppport for the feature bit in qemu
>
>Sridhar Samudrala (4):
>  virtio_net: Introduce VIRTIO_NET_F_STANDBY feature bit
>  net: Introduce generic failover module
>  virtio_net: Extend virtio to use VF datapath when available
>  netvsc: refactor notifier/event handling code to use the failover
>    framework
>
> drivers/net/Kconfig             |   1 +
> drivers/net/hyperv/Kconfig      |   1 +
> drivers/net/hyperv/hyperv_net.h |   2 +
> drivers/net/hyperv/netvsc_drv.c | 134 ++----
> drivers/net/virtio_net.c        |  37 +-
> include/linux/netdevice.h       |  16 +
> include/net/net_failover.h      |  62 +++
> include/uapi/linux/virtio_net.h |   3 +
> net/Kconfig                     |  10 +
> net/core/Makefile               |   1 +
> net/core/net_failover.c         | 892 ++++++++++++++++++++++++++++++++++++++++
> 11 files changed, 1046 insertions(+), 113 deletions(-)
> create mode 100644 include/net/net_failover.h
> create mode 100644 net/core/net_failover.c
>
>-- 
>2.14.3

^ permalink raw reply

* [PATCH net-next v9 4/4] netvsc: refactor notifier/event handling code to use the failover framework
From: Sridhar Samudrala @ 2018-04-27 17:07 UTC (permalink / raw)
  To: mst, stephen, davem, netdev, virtualization, virtio-dev,
	jesse.brandeburg, alexander.h.duyck, kubakici, sridhar.samudrala,
	jasowang, loseweigh, jiri, aaron.f.brown
In-Reply-To: <1524848820-42258-1-git-send-email-sridhar.samudrala@intel.com>

Use the registration/notification framework supported by the generic
failover infrastructure.

Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
---
 drivers/net/hyperv/Kconfig      |   1 +
 drivers/net/hyperv/hyperv_net.h |   2 +
 drivers/net/hyperv/netvsc_drv.c | 134 +++++++---------------------------------
 3 files changed, 26 insertions(+), 111 deletions(-)

diff --git a/drivers/net/hyperv/Kconfig b/drivers/net/hyperv/Kconfig
index 0765d5f61714..1f8419fc7c7f 100644
--- a/drivers/net/hyperv/Kconfig
+++ b/drivers/net/hyperv/Kconfig
@@ -2,5 +2,6 @@ config HYPERV_NET
 	tristate "Microsoft Hyper-V virtual network driver"
 	depends on HYPERV
 	select UCS2_STRING
+	select NET_FAILOVER
 	help
 	  Select this option to enable the Hyper-V virtual network driver.
diff --git a/drivers/net/hyperv/hyperv_net.h b/drivers/net/hyperv/hyperv_net.h
index 6ebe39a3dde6..2ec18344c0e8 100644
--- a/drivers/net/hyperv/hyperv_net.h
+++ b/drivers/net/hyperv/hyperv_net.h
@@ -932,6 +932,8 @@ struct net_device_context {
 	u32 vf_alloc;
 	/* Serial number of the VF to team with */
 	u32 vf_serial;
+
+	struct net_failover *failover;
 };
 
 /* Per channel data */
diff --git a/drivers/net/hyperv/netvsc_drv.c b/drivers/net/hyperv/netvsc_drv.c
index ecc84954c511..fa446234bc11 100644
--- a/drivers/net/hyperv/netvsc_drv.c
+++ b/drivers/net/hyperv/netvsc_drv.c
@@ -43,6 +43,7 @@
 #include <net/pkt_sched.h>
 #include <net/checksum.h>
 #include <net/ip6_checksum.h>
+#include <net/net_failover.h>
 
 #include "hyperv_net.h"
 
@@ -1763,46 +1764,6 @@ static void netvsc_link_change(struct work_struct *w)
 	rtnl_unlock();
 }
 
-static struct net_device *get_netvsc_bymac(const u8 *mac)
-{
-	struct net_device *dev;
-
-	ASSERT_RTNL();
-
-	for_each_netdev(&init_net, dev) {
-		if (dev->netdev_ops != &device_ops)
-			continue;	/* not a netvsc device */
-
-		if (ether_addr_equal(mac, dev->perm_addr))
-			return dev;
-	}
-
-	return NULL;
-}
-
-static struct net_device *get_netvsc_byref(struct net_device *vf_netdev)
-{
-	struct net_device *dev;
-
-	ASSERT_RTNL();
-
-	for_each_netdev(&init_net, dev) {
-		struct net_device_context *net_device_ctx;
-
-		if (dev->netdev_ops != &device_ops)
-			continue;	/* not a netvsc device */
-
-		net_device_ctx = netdev_priv(dev);
-		if (!rtnl_dereference(net_device_ctx->nvdev))
-			continue;	/* device is removed */
-
-		if (rtnl_dereference(net_device_ctx->vf_netdev) == vf_netdev)
-			return dev;	/* a match */
-	}
-
-	return NULL;
-}
-
 /* Called when VF is injecting data into network stack.
  * Change the associated network device from VF to netvsc.
  * note: already called with rcu_read_lock
@@ -1914,24 +1875,15 @@ static void netvsc_vf_setup(struct work_struct *w)
 	rtnl_unlock();
 }
 
-static int netvsc_register_vf(struct net_device *vf_netdev)
+static int netvsc_register_vf(struct net_device *vf_netdev,
+			      struct net_device *ndev)
 {
-	struct net_device *ndev;
 	struct net_device_context *net_device_ctx;
 	struct netvsc_device *netvsc_dev;
 
 	if (vf_netdev->addr_len != ETH_ALEN)
 		return NOTIFY_DONE;
 
-	/*
-	 * We will use the MAC address to locate the synthetic interface to
-	 * associate with the VF interface. If we don't find a matching
-	 * synthetic interface, move on.
-	 */
-	ndev = get_netvsc_bymac(vf_netdev->perm_addr);
-	if (!ndev)
-		return NOTIFY_DONE;
-
 	net_device_ctx = netdev_priv(ndev);
 	netvsc_dev = rtnl_dereference(net_device_ctx->nvdev);
 	if (!netvsc_dev || rtnl_dereference(net_device_ctx->vf_netdev))
@@ -1948,17 +1900,13 @@ static int netvsc_register_vf(struct net_device *vf_netdev)
 }
 
 /* VF up/down change detected, schedule to change data path */
-static int netvsc_vf_changed(struct net_device *vf_netdev)
+static int netvsc_vf_changed(struct net_device *vf_netdev,
+			     struct net_device *ndev)
 {
 	struct net_device_context *net_device_ctx;
 	struct netvsc_device *netvsc_dev;
-	struct net_device *ndev;
 	bool vf_is_up = netif_running(vf_netdev);
 
-	ndev = get_netvsc_byref(vf_netdev);
-	if (!ndev)
-		return NOTIFY_DONE;
-
 	net_device_ctx = netdev_priv(ndev);
 	netvsc_dev = rtnl_dereference(net_device_ctx->nvdev);
 	if (!netvsc_dev)
@@ -1971,15 +1919,11 @@ static int netvsc_vf_changed(struct net_device *vf_netdev)
 	return NOTIFY_OK;
 }
 
-static int netvsc_unregister_vf(struct net_device *vf_netdev)
+static int netvsc_unregister_vf(struct net_device *vf_netdev,
+				struct net_device *ndev)
 {
-	struct net_device *ndev;
 	struct net_device_context *net_device_ctx;
 
-	ndev = get_netvsc_byref(vf_netdev);
-	if (!ndev)
-		return NOTIFY_DONE;
-
 	net_device_ctx = netdev_priv(ndev);
 	cancel_delayed_work_sync(&net_device_ctx->vf_takeover);
 
@@ -1993,6 +1937,12 @@ static int netvsc_unregister_vf(struct net_device *vf_netdev)
 	return NOTIFY_OK;
 }
 
+static struct net_failover_ops netvsc_failover_ops = {
+	.slave_register		= netvsc_register_vf,
+	.slave_unregister	= netvsc_unregister_vf,
+	.slave_link_change	= netvsc_vf_changed,
+};
+
 static int netvsc_probe(struct hv_device *dev,
 			const struct hv_vmbus_device_id *dev_id)
 {
@@ -2082,8 +2032,15 @@ static int netvsc_probe(struct hv_device *dev,
 		goto register_failed;
 	}
 
+	ret = net_failover_register(net, &netvsc_failover_ops,
+				    &net_device_ctx->failover);
+	if (ret != 0)
+		goto err_failover;
+
 	return ret;
 
+err_failover:
+	unregister_netdev(net);
 register_failed:
 	rndis_filter_device_remove(dev, nvdev);
 rndis_failed:
@@ -2124,13 +2081,15 @@ static int netvsc_remove(struct hv_device *dev)
 	rtnl_lock();
 	vf_netdev = rtnl_dereference(ndev_ctx->vf_netdev);
 	if (vf_netdev)
-		netvsc_unregister_vf(vf_netdev);
+		net_failover_slave_unregister(vf_netdev);
 
 	if (nvdev)
 		rndis_filter_device_remove(dev, nvdev);
 
 	unregister_netdevice(net);
 
+	net_failover_unregister(ndev_ctx->failover);
+
 	rtnl_unlock();
 	rcu_read_unlock();
 
@@ -2157,54 +2116,8 @@ static struct  hv_driver netvsc_drv = {
 	.remove = netvsc_remove,
 };
 
-/*
- * On Hyper-V, every VF interface is matched with a corresponding
- * synthetic interface. The synthetic interface is presented first
- * to the guest. When the corresponding VF instance is registered,
- * we will take care of switching the data path.
- */
-static int netvsc_netdev_event(struct notifier_block *this,
-			       unsigned long event, void *ptr)
-{
-	struct net_device *event_dev = netdev_notifier_info_to_dev(ptr);
-
-	/* Skip our own events */
-	if (event_dev->netdev_ops == &device_ops)
-		return NOTIFY_DONE;
-
-	/* Avoid non-Ethernet type devices */
-	if (event_dev->type != ARPHRD_ETHER)
-		return NOTIFY_DONE;
-
-	/* Avoid Vlan dev with same MAC registering as VF */
-	if (is_vlan_dev(event_dev))
-		return NOTIFY_DONE;
-
-	/* Avoid Bonding master dev with same MAC registering as VF */
-	if ((event_dev->priv_flags & IFF_BONDING) &&
-	    (event_dev->flags & IFF_MASTER))
-		return NOTIFY_DONE;
-
-	switch (event) {
-	case NETDEV_REGISTER:
-		return netvsc_register_vf(event_dev);
-	case NETDEV_UNREGISTER:
-		return netvsc_unregister_vf(event_dev);
-	case NETDEV_UP:
-	case NETDEV_DOWN:
-		return netvsc_vf_changed(event_dev);
-	default:
-		return NOTIFY_DONE;
-	}
-}
-
-static struct notifier_block netvsc_netdev_notifier = {
-	.notifier_call = netvsc_netdev_event,
-};
-
 static void __exit netvsc_drv_exit(void)
 {
-	unregister_netdevice_notifier(&netvsc_netdev_notifier);
 	vmbus_driver_unregister(&netvsc_drv);
 }
 
@@ -2224,7 +2137,6 @@ static int __init netvsc_drv_init(void)
 	if (ret)
 		return ret;
 
-	register_netdevice_notifier(&netvsc_netdev_notifier);
 	return 0;
 }
 
-- 
2.14.3

^ permalink raw reply related

* [PATCH net-next v9 3/4] virtio_net: Extend virtio to use VF datapath when available
From: Sridhar Samudrala @ 2018-04-27 17:06 UTC (permalink / raw)
  To: mst, stephen, davem, netdev, virtualization, virtio-dev,
	jesse.brandeburg, alexander.h.duyck, kubakici, sridhar.samudrala,
	jasowang, loseweigh, jiri, aaron.f.brown
In-Reply-To: <1524848820-42258-1-git-send-email-sridhar.samudrala@intel.com>

This patch enables virtio_net to switch over to a VF datapath when a VF
netdev is present with the same MAC address. It allows live migration
of a VM with a direct attached VF without the need to setup a bond/team
between a VF and virtio net device in the guest.

The hypervisor needs to enable only one datapath at any time so that
packets don't get looped back to the VM over the other datapath. When a VF
is plugged, the virtio datapath link state can be marked as down. The
hypervisor needs to unplug the VF device from the guest on the source host
and reset the MAC filter of the VF to initiate failover of datapath to
virtio before starting the migration. After the migration is completed,
the destination hypervisor sets the MAC filter on the VF and plugs it back
to the guest to switch over to VF datapath.

It uses the generic failover framework that provides 2 functions to create
and destroy a master failover netdev. When STANDBY feature is enabled, an
additional netdev(failover netdev) is created that acts as a master device
and tracks the state of the 2 lower netdevs. The original virtio_net netdev
is marked as 'standby' netdev and a passthru device with the same MAC is
registered as 'primary' netdev.

This patch is based on the discussion initiated by Jesse on this thread.
https://marc.info/?l=linux-virtualization&m=151189725224231&w=2

Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
---
 drivers/net/Kconfig      |  1 +
 drivers/net/virtio_net.c | 35 ++++++++++++++++++++++++++++++++++-
 2 files changed, 35 insertions(+), 1 deletion(-)

diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
index 891846655000..c4995625d9b1 100644
--- a/drivers/net/Kconfig
+++ b/drivers/net/Kconfig
@@ -331,6 +331,7 @@ config VETH
 config VIRTIO_NET
 	tristate "Virtio network driver"
 	depends on VIRTIO
+	select NET_FAILOVER
 	---help---
 	  This is the virtual network driver for virtio.  It can be used with
 	  QEMU based VMMs (like KVM or Xen).  Say Y or M.
diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 51a085b1a242..c326ee5344c0 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -30,8 +30,11 @@
 #include <linux/cpu.h>
 #include <linux/average.h>
 #include <linux/filter.h>
+#include <linux/netdevice.h>
+#include <linux/pci.h>
 #include <net/route.h>
 #include <net/xdp.h>
+#include <net/net_failover.h>
 
 static int napi_weight = NAPI_POLL_WEIGHT;
 module_param(napi_weight, int, 0444);
@@ -210,6 +213,9 @@ struct virtnet_info {
 	u32 speed;
 
 	unsigned long guest_offloads;
+
+	/* failover when STANDBY feature enabled */
+	struct net_failover *failover;
 };
 
 struct padded_vnet_hdr {
@@ -2306,6 +2312,22 @@ static int virtnet_xdp(struct net_device *dev, struct netdev_bpf *xdp)
 	}
 }
 
+static int virtnet_get_phys_port_name(struct net_device *dev, char *buf,
+				      size_t len)
+{
+	struct virtnet_info *vi = netdev_priv(dev);
+	int ret;
+
+	if (!virtio_has_feature(vi->vdev, VIRTIO_NET_F_STANDBY))
+		return -EOPNOTSUPP;
+
+	ret = snprintf(buf, len, "_sby");
+	if (ret >= len)
+		return -EOPNOTSUPP;
+
+	return 0;
+}
+
 static const struct net_device_ops virtnet_netdev = {
 	.ndo_open            = virtnet_open,
 	.ndo_stop   	     = virtnet_close,
@@ -2323,6 +2345,7 @@ static const struct net_device_ops virtnet_netdev = {
 	.ndo_xdp_xmit		= virtnet_xdp_xmit,
 	.ndo_xdp_flush		= virtnet_xdp_flush,
 	.ndo_features_check	= passthru_features_check,
+	.ndo_get_phys_port_name	= virtnet_get_phys_port_name,
 };
 
 static void virtnet_config_changed_work(struct work_struct *work)
@@ -2876,10 +2899,16 @@ static int virtnet_probe(struct virtio_device *vdev)
 
 	virtnet_init_settings(dev);
 
+	if (virtio_has_feature(vdev, VIRTIO_NET_F_STANDBY)) {
+		err = net_failover_create(vi->dev, &vi->failover);
+		if (err)
+			goto free_vqs;
+	}
+
 	err = register_netdev(dev);
 	if (err) {
 		pr_debug("virtio_net: registering device failed\n");
-		goto free_vqs;
+		goto free_failover;
 	}
 
 	virtio_device_ready(vdev);
@@ -2916,6 +2945,8 @@ static int virtnet_probe(struct virtio_device *vdev)
 	vi->vdev->config->reset(vdev);
 
 	unregister_netdev(dev);
+free_failover:
+	net_failover_destroy(vi->failover);
 free_vqs:
 	cancel_delayed_work_sync(&vi->refill);
 	free_receive_page_frags(vi);
@@ -2950,6 +2981,8 @@ static void virtnet_remove(struct virtio_device *vdev)
 
 	unregister_netdev(vi->dev);
 
+	net_failover_destroy(vi->failover);
+
 	remove_vq_common(vi);
 
 	free_netdev(vi->dev);
-- 
2.14.3

^ permalink raw reply related

* [PATCH net-next v9 2/4] net: Introduce generic failover module
From: Sridhar Samudrala @ 2018-04-27 17:06 UTC (permalink / raw)
  To: mst, stephen, davem, netdev, virtualization, virtio-dev,
	jesse.brandeburg, alexander.h.duyck, kubakici, sridhar.samudrala,
	jasowang, loseweigh, jiri, aaron.f.brown
In-Reply-To: <1524848820-42258-1-git-send-email-sridhar.samudrala@intel.com>

This provides a generic interface for paravirtual drivers to listen
for netdev register/unregister/link change events from pci ethernet
devices with the same MAC and takeover their datapath. The notifier and
event handling code is based on the existing netvsc implementation.

It exposes 2 sets of interfaces to the paravirtual drivers.
1. For paravirtual drivers like virtio_net that use 3 netdev model, the
   the failover module provides interfaces to create/destroy additional
   master netdev and all the slave events are managed internally.
        net_failover_create()
        net_failover_destroy()
   A failover netdev is created that acts a master device and controls 2
   slave devices. The original virtio_net netdev is registered as 'standby'
   netdev and a passthru/vf device with the same MAC gets registered as
   'primary' netdev. Both 'standby' and 'primary' netdevs are associated
   with the same 'pci' device.  The user accesses the network interface via
   'failover' netdev. The 'failover' netdev chooses 'primary' netdev as
   default for transmits when it is available with link up and running.
2. For existing netvsc driver that uses 2 netdev model, no master netdev
   is created. The paravirtual driver registers each instance of netvsc
   as a 'failover' netdev  along with a set of ops to manage the slave
   events. There is no 'standby' netdev in this model. A passthru/vf device
   with the same MAC gets registered as 'primary' netdev.
        net_failover_register()
        net_failover_unregister()

Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
---
 include/linux/netdevice.h  |  16 +
 include/net/net_failover.h |  62 ++++
 net/Kconfig                |  10 +
 net/core/Makefile          |   1 +
 net/core/net_failover.c    | 892 +++++++++++++++++++++++++++++++++++++++++++++
 5 files changed, 981 insertions(+)
 create mode 100644 include/net/net_failover.h
 create mode 100644 net/core/net_failover.c

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 14e0777ffcfb..b04dbf7dcf1b 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1401,6 +1401,8 @@ struct net_device_ops {
  *	entity (i.e. the master device for bridged veth)
  * @IFF_MACSEC: device is a MACsec device
  * @IFF_NO_RX_HANDLER: device doesn't support the rx_handler hook
+ * @IFF_FAILOVER: device is a failover master device
+ * @IFF_FAILOVER_SLAVE: device is lower dev of a failover master device
  */
 enum netdev_priv_flags {
 	IFF_802_1Q_VLAN			= 1<<0,
@@ -1430,6 +1432,8 @@ enum netdev_priv_flags {
 	IFF_PHONY_HEADROOM		= 1<<24,
 	IFF_MACSEC			= 1<<25,
 	IFF_NO_RX_HANDLER		= 1<<26,
+	IFF_FAILOVER			= 1<<27,
+	IFF_FAILOVER_SLAVE		= 1<<28,
 };
 
 #define IFF_802_1Q_VLAN			IFF_802_1Q_VLAN
@@ -1458,6 +1462,8 @@ enum netdev_priv_flags {
 #define IFF_RXFH_CONFIGURED		IFF_RXFH_CONFIGURED
 #define IFF_MACSEC			IFF_MACSEC
 #define IFF_NO_RX_HANDLER		IFF_NO_RX_HANDLER
+#define IFF_FAILOVER			IFF_FAILOVER
+#define IFF_FAILOVER_SLAVE		IFF_FAILOVER_SLAVE
 
 /**
  *	struct net_device - The DEVICE structure.
@@ -4308,6 +4314,16 @@ static inline bool netif_is_rxfh_configured(const struct net_device *dev)
 	return dev->priv_flags & IFF_RXFH_CONFIGURED;
 }
 
+static inline bool netif_is_failover(const struct net_device *dev)
+{
+	return dev->priv_flags & IFF_FAILOVER;
+}
+
+static inline bool netif_is_failover_slave(const struct net_device *dev)
+{
+	return dev->priv_flags & IFF_FAILOVER_SLAVE;
+}
+
 /* This device needs to keep skb dst for qdisc enqueue or ndo_start_xmit() */
 static inline void netif_keep_dst(struct net_device *dev)
 {
diff --git a/include/net/net_failover.h b/include/net/net_failover.h
new file mode 100644
index 000000000000..d852f17fb2af
--- /dev/null
+++ b/include/net/net_failover.h
@@ -0,0 +1,62 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright (c) 2018, Intel Corporation. */
+
+#ifndef _NET_FAILOVER_H
+#define _NET_FAILOVER_H
+
+#include <linux/netdevice.h>
+
+struct net_failover_ops {
+	int (*slave_register)(struct net_device *slave_dev,
+			      struct net_device *failover_dev);
+	int (*slave_unregister)(struct net_device *slave_dev,
+				struct net_device *failover_dev);
+	int (*slave_link_change)(struct net_device *slave_dev,
+				 struct net_device *failover_dev);
+};
+
+struct net_failover {
+	struct list_head list;
+	struct net_device __rcu *failover_dev;
+	struct net_failover_ops __rcu *ops;
+};
+
+/* failover state */
+struct net_failover_info {
+	/* primary netdev with same MAC */
+	struct net_device __rcu *primary_dev;
+
+	/* standby netdev */
+	struct net_device __rcu *standby_dev;
+
+	/* primary netdev stats */
+	struct rtnl_link_stats64 primary_stats;
+
+	/* standby netdev stats */
+	struct rtnl_link_stats64 standby_stats;
+
+	/* aggregated stats */
+	struct rtnl_link_stats64 failover_stats;
+
+	/* spinlock while updating stats */
+	spinlock_t stats_lock;
+};
+
+/* Paravirtual drivers supporting 3-netdev model can use these 2 interfaces to
+ * create and destroy a failover device.
+ */
+int net_failover_create(struct net_device *standby_dev,
+			struct net_failover **pfailover);
+void net_failover_destroy(struct net_failover *failover);
+
+/* Paravirtual drivers supporting 2-netdev model can use these 2 interfaces to
+ * register and unregister a failover device.
+ */
+int net_failover_register(struct net_device *standby_dev,
+			  struct net_failover_ops *ops,
+			  struct net_failover **pfailover);
+void net_failover_unregister(struct net_failover *failover);
+
+int net_failover_slave_unregister(struct net_device *slave_dev);
+
+#endif /* _NET_FAILOVER_H */
diff --git a/net/Kconfig b/net/Kconfig
index 6fa1a4493b8c..ebbc9b70418d 100644
--- a/net/Kconfig
+++ b/net/Kconfig
@@ -426,6 +426,16 @@ config MAY_USE_DEVLINK
 config PAGE_POOL
        bool
 
+config NET_FAILOVER
+	tristate "Failover interface"
+	default m
+	help
+	  This provides a generic interface for paravirtual drivers to listen
+	  for netdev register/unregister/link change events from pci ethernet
+	  devices with the same MAC and takeover their datapath. This also
+	  enables live migration of a VM with direct attached VF by failing
+	  over to the paravirtual datapath when the VF is unplugged.
+
 endif   # if NET
 
 # Used by archs to tell that they support BPF JIT compiler plus which flavour.
diff --git a/net/core/Makefile b/net/core/Makefile
index 7080417f8bc8..283ed9b0e581 100644
--- a/net/core/Makefile
+++ b/net/core/Makefile
@@ -31,3 +31,4 @@ obj-$(CONFIG_DST_CACHE) += dst_cache.o
 obj-$(CONFIG_HWBM) += hwbm.o
 obj-$(CONFIG_NET_DEVLINK) += devlink.o
 obj-$(CONFIG_GRO_CELLS) += gro_cells.o
+obj-$(CONFIG_NET_FAILOVER) += net_failover.o
diff --git a/net/core/net_failover.c b/net/core/net_failover.c
new file mode 100644
index 000000000000..ac164be17563
--- /dev/null
+++ b/net/core/net_failover.c
@@ -0,0 +1,892 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2018, Intel Corporation. */
+
+/* A common module to handle registrations and notifications for paravirtual
+ * drivers to enable accelerated datapath and support VF live migration.
+ *
+ * The notifier and event handling code is based on netvsc driver and failover
+ * netdev management routines are based on bond/team driver.
+ *
+ */
+
+#include <linux/netdevice.h>
+#include <linux/etherdevice.h>
+#include <linux/ethtool.h>
+#include <linux/module.h>
+#include <linux/slab.h>
+#include <linux/netdevice.h>
+#include <linux/netpoll.h>
+#include <linux/rtnetlink.h>
+#include <linux/if_vlan.h>
+#include <linux/pci.h>
+#include <net/sch_generic.h>
+#include <uapi/linux/if_arp.h>
+#include <net/net_failover.h>
+
+static LIST_HEAD(net_failover_list);
+static DEFINE_SPINLOCK(net_failover_lock);
+
+static int net_failover_open(struct net_device *dev)
+{
+	struct net_failover_info *nfo_info = netdev_priv(dev);
+	struct net_device *primary_dev, *standby_dev;
+	int err;
+
+	netif_carrier_off(dev);
+	netif_tx_wake_all_queues(dev);
+
+	primary_dev = rtnl_dereference(nfo_info->primary_dev);
+	if (primary_dev) {
+		err = dev_open(primary_dev);
+		if (err)
+			goto err_primary_open;
+	}
+
+	standby_dev = rtnl_dereference(nfo_info->standby_dev);
+	if (standby_dev) {
+		err = dev_open(standby_dev);
+		if (err)
+			goto err_standby_open;
+	}
+
+	return 0;
+
+err_standby_open:
+	dev_close(primary_dev);
+err_primary_open:
+	netif_tx_disable(dev);
+	return err;
+}
+
+static int net_failover_close(struct net_device *dev)
+{
+	struct net_failover_info *nfo_info = netdev_priv(dev);
+	struct net_device *slave_dev;
+
+	netif_tx_disable(dev);
+
+	slave_dev = rtnl_dereference(nfo_info->primary_dev);
+	if (slave_dev)
+		dev_close(slave_dev);
+
+	slave_dev = rtnl_dereference(nfo_info->standby_dev);
+	if (slave_dev)
+		dev_close(slave_dev);
+
+	return 0;
+}
+
+static netdev_tx_t net_failover_drop_xmit(struct sk_buff *skb,
+					  struct net_device *dev)
+{
+	atomic_long_inc(&dev->tx_dropped);
+	dev_kfree_skb_any(skb);
+	return NETDEV_TX_OK;
+}
+
+static bool net_failover_xmit_ready(struct net_device *dev)
+{
+	return netif_running(dev) && netif_carrier_ok(dev);
+}
+
+static netdev_tx_t net_failover_start_xmit(struct sk_buff *skb,
+					   struct net_device *dev)
+{
+	struct net_failover_info *nfo_info = netdev_priv(dev);
+	struct net_device *xmit_dev;
+
+	/* Try xmit via primary netdev followed by standby netdev */
+	xmit_dev = rcu_dereference_bh(nfo_info->primary_dev);
+	if (!xmit_dev || !net_failover_xmit_ready(xmit_dev)) {
+		xmit_dev = rcu_dereference_bh(nfo_info->standby_dev);
+		if (!xmit_dev || !net_failover_xmit_ready(xmit_dev))
+			return net_failover_drop_xmit(skb, dev);
+	}
+
+	skb->dev = xmit_dev;
+	skb->queue_mapping = qdisc_skb_cb(skb)->slave_dev_queue_mapping;
+
+	return dev_queue_xmit(skb);
+}
+
+static u16 net_failover_select_queue(struct net_device *dev,
+				     struct sk_buff *skb, void *accel_priv,
+				     select_queue_fallback_t fallback)
+{
+	struct net_failover_info *nfo_info = netdev_priv(dev);
+	struct net_device *primary_dev;
+	u16 txq;
+
+	rcu_read_lock();
+	primary_dev = rcu_dereference(nfo_info->primary_dev);
+	if (primary_dev) {
+		const struct net_device_ops *ops = primary_dev->netdev_ops;
+
+		if (ops->ndo_select_queue)
+			txq = ops->ndo_select_queue(primary_dev, skb,
+						    accel_priv, fallback);
+		else
+			txq = fallback(primary_dev, skb);
+
+		qdisc_skb_cb(skb)->slave_dev_queue_mapping = skb->queue_mapping;
+
+		return txq;
+	}
+
+	txq = skb_rx_queue_recorded(skb) ? skb_get_rx_queue(skb) : 0;
+
+	/* Save the original txq to restore before passing to the driver */
+	qdisc_skb_cb(skb)->slave_dev_queue_mapping = skb->queue_mapping;
+
+	if (unlikely(txq >= dev->real_num_tx_queues)) {
+		do {
+			txq -= dev->real_num_tx_queues;
+		} while (txq >= dev->real_num_tx_queues);
+	}
+
+	return txq;
+}
+
+/* fold stats, assuming all rtnl_link_stats64 fields are u64, but
+ * that some drivers can provide 32bit values only.
+ */
+static void net_failover_fold_stats(struct rtnl_link_stats64 *_res,
+				    const struct rtnl_link_stats64 *_new,
+				    const struct rtnl_link_stats64 *_old)
+{
+	const u64 *new = (const u64 *)_new;
+	const u64 *old = (const u64 *)_old;
+	u64 *res = (u64 *)_res;
+	int i;
+
+	for (i = 0; i < sizeof(*_res) / sizeof(u64); i++) {
+		u64 nv = new[i];
+		u64 ov = old[i];
+		s64 delta = nv - ov;
+
+		/* detects if this particular field is 32bit only */
+		if (((nv | ov) >> 32) == 0)
+			delta = (s64)(s32)((u32)nv - (u32)ov);
+
+		/* filter anomalies, some drivers reset their stats
+		 * at down/up events.
+		 */
+		if (delta > 0)
+			res[i] += delta;
+	}
+}
+
+static void net_failover_get_stats(struct net_device *dev,
+				   struct rtnl_link_stats64 *stats)
+{
+	struct net_failover_info *nfo_info = netdev_priv(dev);
+	const struct rtnl_link_stats64 *new;
+	struct rtnl_link_stats64 temp;
+	struct net_device *slave_dev;
+
+	spin_lock(&nfo_info->stats_lock);
+	memcpy(stats, &nfo_info->failover_stats, sizeof(*stats));
+
+	rcu_read_lock();
+
+	slave_dev = rcu_dereference(nfo_info->primary_dev);
+	if (slave_dev) {
+		new = dev_get_stats(slave_dev, &temp);
+		net_failover_fold_stats(stats, new, &nfo_info->primary_stats);
+		memcpy(&nfo_info->primary_stats, new, sizeof(*new));
+	}
+
+	slave_dev = rcu_dereference(nfo_info->standby_dev);
+	if (slave_dev) {
+		new = dev_get_stats(slave_dev, &temp);
+		net_failover_fold_stats(stats, new, &nfo_info->standby_stats);
+		memcpy(&nfo_info->standby_stats, new, sizeof(*new));
+	}
+
+	rcu_read_unlock();
+
+	memcpy(&nfo_info->failover_stats, stats, sizeof(*stats));
+	spin_unlock(&nfo_info->stats_lock);
+}
+
+static int net_failover_change_mtu(struct net_device *dev, int new_mtu)
+{
+	struct net_failover_info *nfo_info = netdev_priv(dev);
+	struct net_device *primary_dev, *standby_dev;
+	int ret = 0;
+
+	primary_dev = rcu_dereference(nfo_info->primary_dev);
+	if (primary_dev) {
+		ret = dev_set_mtu(primary_dev, new_mtu);
+		if (ret)
+			return ret;
+	}
+
+	standby_dev = rcu_dereference(nfo_info->standby_dev);
+	if (standby_dev) {
+		ret = dev_set_mtu(standby_dev, new_mtu);
+		if (ret) {
+			dev_set_mtu(primary_dev, dev->mtu);
+			return ret;
+		}
+	}
+
+	dev->mtu = new_mtu;
+
+	return 0;
+}
+
+static void net_failover_set_rx_mode(struct net_device *dev)
+{
+	struct net_failover_info *nfo_info = netdev_priv(dev);
+	struct net_device *slave_dev;
+
+	rcu_read_lock();
+
+	slave_dev = rcu_dereference(nfo_info->primary_dev);
+	if (slave_dev) {
+		dev_uc_sync_multiple(slave_dev, dev);
+		dev_mc_sync_multiple(slave_dev, dev);
+	}
+
+	slave_dev = rcu_dereference(nfo_info->standby_dev);
+	if (slave_dev) {
+		dev_uc_sync_multiple(slave_dev, dev);
+		dev_mc_sync_multiple(slave_dev, dev);
+	}
+
+	rcu_read_unlock();
+}
+
+static int net_failover_vlan_rx_add_vid(struct net_device *dev, __be16 proto,
+					u16 vid)
+{
+	struct net_failover_info *nfo_info = netdev_priv(dev);
+	struct net_device *primary_dev, *standby_dev;
+	int ret = 0;
+
+	primary_dev = rcu_dereference(nfo_info->primary_dev);
+	if (primary_dev) {
+		ret = vlan_vid_add(primary_dev, proto, vid);
+		if (ret)
+			return ret;
+	}
+
+	standby_dev = rcu_dereference(nfo_info->standby_dev);
+	if (standby_dev) {
+		ret = vlan_vid_add(standby_dev, proto, vid);
+		if (ret)
+			vlan_vid_del(primary_dev, proto, vid);
+	}
+
+	return ret;
+}
+
+static int net_failover_vlan_rx_kill_vid(struct net_device *dev, __be16 proto,
+					 u16 vid)
+{
+	struct net_failover_info *nfo_info = netdev_priv(dev);
+	struct net_device *slave_dev;
+
+	slave_dev = rcu_dereference(nfo_info->primary_dev);
+	if (slave_dev)
+		vlan_vid_del(slave_dev, proto, vid);
+
+	slave_dev = rcu_dereference(nfo_info->standby_dev);
+	if (slave_dev)
+		vlan_vid_del(slave_dev, proto, vid);
+
+	return 0;
+}
+
+static const struct net_device_ops failover_dev_ops = {
+	.ndo_open		= net_failover_open,
+	.ndo_stop		= net_failover_close,
+	.ndo_start_xmit		= net_failover_start_xmit,
+	.ndo_select_queue	= net_failover_select_queue,
+	.ndo_get_stats64	= net_failover_get_stats,
+	.ndo_change_mtu		= net_failover_change_mtu,
+	.ndo_set_rx_mode	= net_failover_set_rx_mode,
+	.ndo_vlan_rx_add_vid	= net_failover_vlan_rx_add_vid,
+	.ndo_vlan_rx_kill_vid	= net_failover_vlan_rx_kill_vid,
+	.ndo_validate_addr	= eth_validate_addr,
+	.ndo_features_check	= passthru_features_check,
+};
+
+#define FAILOVER_NAME "failover"
+#define FAILOVER_VERSION "0.1"
+
+static void nfo_ethtool_get_drvinfo(struct net_device *dev,
+				    struct ethtool_drvinfo *drvinfo)
+{
+	strlcpy(drvinfo->driver, FAILOVER_NAME, sizeof(drvinfo->driver));
+	strlcpy(drvinfo->version, FAILOVER_VERSION, sizeof(drvinfo->version));
+}
+
+static int nfo_ethtool_get_link_ksettings(struct net_device *dev,
+					  struct ethtool_link_ksettings *cmd)
+{
+	struct net_failover_info *nfo_info = netdev_priv(dev);
+	struct net_device *slave_dev;
+
+	slave_dev = rtnl_dereference(nfo_info->primary_dev);
+	if (!slave_dev || !net_failover_xmit_ready(slave_dev)) {
+		slave_dev = rtnl_dereference(nfo_info->standby_dev);
+		if (!slave_dev || !net_failover_xmit_ready(slave_dev)) {
+			cmd->base.duplex = DUPLEX_UNKNOWN;
+			cmd->base.port = PORT_OTHER;
+			cmd->base.speed = SPEED_UNKNOWN;
+
+			return 0;
+		}
+	}
+
+	return __ethtool_get_link_ksettings(slave_dev, cmd);
+}
+
+static const struct ethtool_ops failover_ethtool_ops = {
+	.get_drvinfo            = nfo_ethtool_get_drvinfo,
+	.get_link               = ethtool_op_get_link,
+	.get_link_ksettings     = nfo_ethtool_get_link_ksettings,
+};
+
+static struct net_device *net_failover_get_bymac(u8 *mac,
+						 struct net_failover_ops **ops)
+{
+	struct net_device *failover_dev;
+	struct net_failover *failover;
+
+	spin_lock(&net_failover_lock);
+	list_for_each_entry(failover, &net_failover_list, list) {
+		failover_dev = rtnl_dereference(failover->failover_dev);
+		if (ether_addr_equal(failover_dev->perm_addr, mac)) {
+			*ops = rtnl_dereference(failover->ops);
+			spin_unlock(&net_failover_lock);
+			return failover_dev;
+		}
+	}
+	spin_unlock(&net_failover_lock);
+	return NULL;
+}
+
+/* Called when slave dev is injecting data into network stack.
+ * Change the associated network device from lower dev to virtio.
+ * note: already called with rcu_read_lock
+ */
+static rx_handler_result_t net_failover_handle_frame(struct sk_buff **pskb)
+{
+	struct sk_buff *skb = *pskb;
+	struct net_device *ndev = rcu_dereference(skb->dev->rx_handler_data);
+
+	skb->dev = ndev;
+
+	return RX_HANDLER_ANOTHER;
+}
+
+#define FAILOVER_VLAN_FEATURES	(NETIF_F_HW_CSUM | NETIF_F_SG | \
+				 NETIF_F_FRAGLIST | NETIF_F_ALL_TSO | \
+				 NETIF_F_HIGHDMA | NETIF_F_LRO)
+
+#define FAILOVER_ENC_FEATURES	(NETIF_F_HW_CSUM | NETIF_F_SG | \
+				 NETIF_F_RXCSUM | NETIF_F_ALL_TSO)
+
+static void net_failover_compute_features(struct net_device *dev)
+{
+	u32 vlan_features = FAILOVER_VLAN_FEATURES & NETIF_F_ALL_FOR_ALL;
+	netdev_features_t enc_features  = FAILOVER_ENC_FEATURES;
+	unsigned short max_hard_header_len = ETH_HLEN;
+	unsigned int dst_release_flag = IFF_XMIT_DST_RELEASE |
+					IFF_XMIT_DST_RELEASE_PERM;
+	struct net_failover_info *nfo_info = netdev_priv(dev);
+	struct net_device *primary_dev, *standby_dev;
+
+	primary_dev = rcu_dereference(nfo_info->primary_dev);
+	if (primary_dev) {
+		vlan_features =
+			netdev_increment_features(vlan_features,
+						  primary_dev->vlan_features,
+						  FAILOVER_VLAN_FEATURES);
+		enc_features =
+			netdev_increment_features(enc_features,
+						  primary_dev->hw_enc_features,
+						  FAILOVER_ENC_FEATURES);
+
+		dst_release_flag &= primary_dev->priv_flags;
+		if (primary_dev->hard_header_len > max_hard_header_len)
+			max_hard_header_len = primary_dev->hard_header_len;
+	}
+
+	standby_dev = rcu_dereference(nfo_info->standby_dev);
+	if (standby_dev) {
+		vlan_features =
+			netdev_increment_features(vlan_features,
+						  standby_dev->vlan_features,
+						  FAILOVER_VLAN_FEATURES);
+		enc_features =
+			netdev_increment_features(enc_features,
+						  standby_dev->hw_enc_features,
+						  FAILOVER_ENC_FEATURES);
+
+		dst_release_flag &= standby_dev->priv_flags;
+		if (standby_dev->hard_header_len > max_hard_header_len)
+			max_hard_header_len = standby_dev->hard_header_len;
+	}
+
+	dev->vlan_features = vlan_features;
+	dev->hw_enc_features = enc_features | NETIF_F_GSO_ENCAP_ALL;
+	dev->hard_header_len = max_hard_header_len;
+
+	dev->priv_flags &= ~IFF_XMIT_DST_RELEASE;
+	if (dst_release_flag == (IFF_XMIT_DST_RELEASE |
+				 IFF_XMIT_DST_RELEASE_PERM))
+		dev->priv_flags |= IFF_XMIT_DST_RELEASE;
+
+	netdev_change_features(dev);
+}
+
+static int net_failover_slave_register(struct net_device *slave_dev)
+{
+	struct net_failover_info *nfo_info;
+	struct net_failover_ops *nfo_ops;
+	struct net_device *failover_dev;
+	bool slave_is_standby;
+	u32 orig_mtu;
+	int err;
+
+	ASSERT_RTNL();
+
+	failover_dev = net_failover_get_bymac(slave_dev->perm_addr, &nfo_ops);
+	if (!failover_dev)
+		goto done;
+
+	if (failover_dev->type != slave_dev->type)
+		goto done;
+
+	if (nfo_ops && nfo_ops->slave_register)
+		return nfo_ops->slave_register(slave_dev, failover_dev);
+
+	nfo_info = netdev_priv(failover_dev);
+	slave_is_standby = (slave_dev->dev.parent == failover_dev->dev.parent);
+	if (slave_is_standby ? rtnl_dereference(nfo_info->standby_dev) :
+			rtnl_dereference(nfo_info->primary_dev)) {
+		netdev_err(failover_dev, "%s attempting to register as slave dev when %s already present\n",
+			   slave_dev->name,
+			   slave_is_standby ? "standby" : "primary");
+		goto done;
+	}
+
+	/* We want to allow only a direct attached VF device as a primary
+	 * netdev. As there is no easy way to check for a VF device, restrict
+	 * this to a pci device.
+	 */
+	if (!slave_is_standby && (!slave_dev->dev.parent ||
+				  !dev_is_pci(slave_dev->dev.parent)))
+		goto done;
+
+	if (failover_dev->features & NETIF_F_VLAN_CHALLENGED &&
+	    vlan_uses_dev(failover_dev)) {
+		netdev_err(failover_dev, "Device %s is VLAN challenged and failover device has VLAN set up\n",
+			   failover_dev->name);
+		goto done;
+	}
+
+	/* Align MTU of slave with failover dev */
+	orig_mtu = slave_dev->mtu;
+	err = dev_set_mtu(slave_dev, failover_dev->mtu);
+	if (err) {
+		netdev_err(failover_dev, "unable to change mtu of %s to %u register failed\n",
+			   slave_dev->name, failover_dev->mtu);
+		goto done;
+	}
+
+	dev_hold(slave_dev);
+
+	if (netif_running(failover_dev)) {
+		err = dev_open(slave_dev);
+		if (err && (err != -EBUSY)) {
+			netdev_err(failover_dev, "Opening slave %s failed err:%d\n",
+				   slave_dev->name, err);
+			goto err_dev_open;
+		}
+	}
+
+	netif_addr_lock_bh(failover_dev);
+	dev_uc_sync_multiple(slave_dev, failover_dev);
+	dev_uc_sync_multiple(slave_dev, failover_dev);
+	netif_addr_unlock_bh(failover_dev);
+
+	err = vlan_vids_add_by_dev(slave_dev, failover_dev);
+	if (err) {
+		netdev_err(failover_dev, "Failed to add vlan ids to device %s err:%d\n",
+			   slave_dev->name, err);
+		goto err_vlan_add;
+	}
+
+	err = netdev_rx_handler_register(slave_dev, net_failover_handle_frame,
+					 failover_dev);
+	if (err) {
+		netdev_err(slave_dev, "can not register failover rx handler (err = %d)\n",
+			   err);
+		goto err_handler_register;
+	}
+
+	err = netdev_upper_dev_link(slave_dev, failover_dev, NULL);
+	if (err) {
+		netdev_err(slave_dev, "can not set failover device %s (err = %d)\n",
+			   failover_dev->name, err);
+		goto err_upper_link;
+	}
+
+	slave_dev->priv_flags |= IFF_FAILOVER_SLAVE;
+
+	if (slave_is_standby) {
+		rcu_assign_pointer(nfo_info->standby_dev, slave_dev);
+		dev_get_stats(nfo_info->standby_dev, &nfo_info->standby_stats);
+	} else {
+		rcu_assign_pointer(nfo_info->primary_dev, slave_dev);
+		dev_get_stats(nfo_info->primary_dev, &nfo_info->primary_stats);
+		failover_dev->min_mtu = slave_dev->min_mtu;
+		failover_dev->max_mtu = slave_dev->max_mtu;
+	}
+
+	net_failover_compute_features(failover_dev);
+
+	call_netdevice_notifiers(NETDEV_JOIN, slave_dev);
+
+	netdev_info(failover_dev, "failover %s slave:%s registered\n",
+		    slave_is_standby ? "standby" : "primary", slave_dev->name);
+
+	goto done;
+
+err_upper_link:
+	netdev_rx_handler_unregister(slave_dev);
+err_handler_register:
+	vlan_vids_del_by_dev(slave_dev, failover_dev);
+err_vlan_add:
+	dev_uc_unsync(slave_dev, failover_dev);
+	dev_mc_unsync(slave_dev, failover_dev);
+	dev_close(slave_dev);
+err_dev_open:
+	dev_put(slave_dev);
+	dev_set_mtu(slave_dev, orig_mtu);
+done:
+	return NOTIFY_DONE;
+}
+
+int net_failover_slave_unregister(struct net_device *slave_dev)
+{
+	struct net_device *standby_dev, *primary_dev;
+	struct net_failover_info *nfo_info;
+	struct net_failover_ops *nfo_ops;
+	struct net_device *failover_dev;
+	bool slave_is_standby;
+
+	if (!netif_is_failover_slave(slave_dev))
+		goto done;
+
+	ASSERT_RTNL();
+
+	failover_dev = net_failover_get_bymac(slave_dev->perm_addr, &nfo_ops);
+	if (!failover_dev)
+		goto done;
+
+	if (nfo_ops && nfo_ops->slave_unregister)
+		return nfo_ops->slave_unregister(slave_dev, failover_dev);
+
+	nfo_info = netdev_priv(failover_dev);
+	primary_dev = rtnl_dereference(nfo_info->primary_dev);
+	standby_dev = rtnl_dereference(nfo_info->standby_dev);
+
+	if (slave_dev != primary_dev && slave_dev != standby_dev)
+		goto done;
+
+	slave_is_standby = (slave_dev->dev.parent == failover_dev->dev.parent);
+
+	netdev_rx_handler_unregister(slave_dev);
+	netdev_upper_dev_unlink(slave_dev, failover_dev);
+	vlan_vids_del_by_dev(slave_dev, failover_dev);
+	dev_uc_unsync(slave_dev, failover_dev);
+	dev_mc_unsync(slave_dev, failover_dev);
+	dev_close(slave_dev);
+	slave_dev->priv_flags &= ~IFF_FAILOVER_SLAVE;
+
+	nfo_info = netdev_priv(failover_dev);
+	net_failover_get_stats(failover_dev, &nfo_info->failover_stats);
+
+	if (slave_is_standby) {
+		RCU_INIT_POINTER(nfo_info->standby_dev, NULL);
+	} else {
+		RCU_INIT_POINTER(nfo_info->primary_dev, NULL);
+		if (standby_dev) {
+			failover_dev->min_mtu = standby_dev->min_mtu;
+			failover_dev->max_mtu = standby_dev->max_mtu;
+		}
+	}
+
+	dev_put(slave_dev);
+
+	net_failover_compute_features(failover_dev);
+
+	netdev_info(failover_dev, "failover %s slave:%s unregistered\n",
+		    slave_is_standby ? "standby" : "primary", slave_dev->name);
+
+done:
+	return NOTIFY_DONE;
+}
+EXPORT_SYMBOL_GPL(net_failover_slave_unregister);
+
+static int net_failover_slave_link_change(struct net_device *slave_dev)
+{
+	struct net_device *failover_dev, *primary_dev, *standby_dev;
+	struct net_failover_info *nfo_info;
+	struct net_failover_ops *nfo_ops;
+
+	if (!netif_is_failover_slave(slave_dev))
+		goto done;
+
+	ASSERT_RTNL();
+
+	failover_dev = net_failover_get_bymac(slave_dev->perm_addr, &nfo_ops);
+	if (!failover_dev)
+		goto done;
+
+	if (nfo_ops && nfo_ops->slave_link_change)
+		return nfo_ops->slave_link_change(slave_dev, failover_dev);
+
+	if (!netif_running(failover_dev))
+		return 0;
+
+	nfo_info = netdev_priv(failover_dev);
+
+	primary_dev = rtnl_dereference(nfo_info->primary_dev);
+	standby_dev = rtnl_dereference(nfo_info->standby_dev);
+
+	if (slave_dev != primary_dev && slave_dev != standby_dev)
+		goto done;
+
+	if ((primary_dev && net_failover_xmit_ready(primary_dev)) ||
+	    (standby_dev && net_failover_xmit_ready(standby_dev))) {
+		netif_carrier_on(failover_dev);
+		netif_tx_wake_all_queues(failover_dev);
+	} else {
+		net_failover_get_stats(failover_dev, &nfo_info->failover_stats);
+		netif_carrier_off(failover_dev);
+		netif_tx_stop_all_queues(failover_dev);
+	}
+
+done:
+	return NOTIFY_DONE;
+}
+
+static int
+net_failover_event(struct notifier_block *this, unsigned long event, void *ptr)
+{
+	struct net_device *event_dev = netdev_notifier_info_to_dev(ptr);
+
+	/* Skip parent events */
+	if (netif_is_failover(event_dev))
+		return NOTIFY_DONE;
+
+	switch (event) {
+	case NETDEV_REGISTER:
+		return net_failover_slave_register(event_dev);
+	case NETDEV_UNREGISTER:
+		return net_failover_slave_unregister(event_dev);
+	case NETDEV_UP:
+	case NETDEV_DOWN:
+	case NETDEV_CHANGE:
+		return net_failover_slave_link_change(event_dev);
+	default:
+		return NOTIFY_DONE;
+	}
+}
+
+static struct notifier_block net_failover_notifier = {
+	.notifier_call = net_failover_event,
+};
+
+static void nfo_register_existing_slave(struct net_device *failover_dev)
+{
+	struct net *net = dev_net(failover_dev);
+	struct net_device *dev;
+
+	rtnl_lock();
+	for_each_netdev(net, dev) {
+		if (netif_is_failover(dev))
+			continue;
+		if (ether_addr_equal(failover_dev->perm_addr, dev->perm_addr))
+			net_failover_slave_register(dev);
+	}
+	rtnl_unlock();
+}
+
+int net_failover_register(struct net_device *dev, struct net_failover_ops *ops,
+			  struct net_failover **pfailover)
+{
+	struct net_failover *failover;
+
+	failover = kzalloc(sizeof(*failover), GFP_KERNEL);
+	if (!failover)
+		return -ENOMEM;
+
+	rcu_assign_pointer(failover->ops, ops);
+	dev_hold(dev);
+	dev->priv_flags |= IFF_FAILOVER;
+	rcu_assign_pointer(failover->failover_dev, dev);
+
+	spin_lock(&net_failover_lock);
+	list_add_tail(&failover->list, &net_failover_list);
+	spin_unlock(&net_failover_lock);
+
+	netdev_info(dev, "failover master:%s registered\n", dev->name);
+
+	nfo_register_existing_slave(dev);
+
+	*pfailover = failover;
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(net_failover_register);
+
+void net_failover_unregister(struct net_failover *failover)
+{
+	struct net_device *failover_dev;
+
+	failover_dev = rcu_dereference(failover->failover_dev);
+
+	netdev_info(failover_dev, "failover master:%s unregistered\n",
+		    failover_dev->name);
+
+	failover_dev->priv_flags &= ~IFF_FAILOVER;
+	dev_put(failover_dev);
+
+	spin_lock(&net_failover_lock);
+	list_del(&failover->list);
+	spin_unlock(&net_failover_lock);
+
+	kfree(failover);
+}
+EXPORT_SYMBOL_GPL(net_failover_unregister);
+
+int net_failover_create(struct net_device *standby_dev,
+			struct net_failover **pfailover)
+{
+	struct device *dev = standby_dev->dev.parent;
+	struct net_device *failover_dev;
+	int err;
+
+	/* Alloc at least 2 queues, for now we are going with 16 assuming
+	 * that VF devices being enslaved won't have too many queues.
+	 */
+	failover_dev = alloc_etherdev_mq(sizeof(struct net_failover_info), 16);
+	if (!failover_dev) {
+		dev_err(dev, "Unable to allocate failover_netdev!\n");
+		return -ENOMEM;
+	}
+
+	dev_net_set(failover_dev, dev_net(standby_dev));
+	SET_NETDEV_DEV(failover_dev, dev);
+
+	failover_dev->netdev_ops = &failover_dev_ops;
+	failover_dev->ethtool_ops = &failover_ethtool_ops;
+
+	/* Initialize the device options */
+	failover_dev->priv_flags |= IFF_UNICAST_FLT | IFF_NO_QUEUE;
+	failover_dev->priv_flags &= ~(IFF_XMIT_DST_RELEASE |
+				       IFF_TX_SKB_SHARING);
+
+	/* don't acquire failover netdev's netif_tx_lock when transmitting */
+	failover_dev->features |= NETIF_F_LLTX;
+
+	/* Don't allow failover devices to change network namespaces. */
+	failover_dev->features |= NETIF_F_NETNS_LOCAL;
+
+	failover_dev->hw_features = FAILOVER_VLAN_FEATURES |
+				    NETIF_F_HW_VLAN_CTAG_TX |
+				    NETIF_F_HW_VLAN_CTAG_RX |
+				    NETIF_F_HW_VLAN_CTAG_FILTER;
+
+	failover_dev->hw_features |= NETIF_F_GSO_ENCAP_ALL;
+	failover_dev->features |= failover_dev->hw_features;
+
+	memcpy(failover_dev->dev_addr, standby_dev->dev_addr,
+	       failover_dev->addr_len);
+
+	failover_dev->min_mtu = standby_dev->min_mtu;
+	failover_dev->max_mtu = standby_dev->max_mtu;
+
+	err = register_netdev(failover_dev);
+	if (err < 0) {
+		dev_err(dev, "Unable to register failover_dev!\n");
+		goto err_register_netdev;
+	}
+
+	netif_carrier_off(failover_dev);
+
+	err = net_failover_register(failover_dev, NULL, pfailover);
+	if (err < 0)
+		goto err_failover_register;
+
+	return 0;
+
+err_failover_register:
+	unregister_netdev(failover_dev);
+err_register_netdev:
+	free_netdev(failover_dev);
+
+	return err;
+}
+EXPORT_SYMBOL_GPL(net_failover_create);
+
+void net_failover_destroy(struct net_failover *failover)
+{
+	struct net_failover_info *nfo_info;
+	struct net_device *failover_dev;
+	struct net_device *slave_dev;
+
+	if (!failover)
+		return;
+
+	failover_dev = rcu_dereference(failover->failover_dev);
+	nfo_info = netdev_priv(failover_dev);
+
+	netif_device_detach(failover_dev);
+
+	rtnl_lock();
+
+	slave_dev = rtnl_dereference(nfo_info->primary_dev);
+	if (slave_dev)
+		net_failover_slave_unregister(slave_dev);
+
+	slave_dev = rtnl_dereference(nfo_info->standby_dev);
+	if (slave_dev)
+		net_failover_slave_unregister(slave_dev);
+
+	net_failover_unregister(failover);
+
+	unregister_netdevice(failover_dev);
+
+	rtnl_unlock();
+
+	free_netdev(failover_dev);
+}
+EXPORT_SYMBOL_GPL(net_failover_destroy);
+
+static __init int
+net_failover_init(void)
+{
+	register_netdevice_notifier(&net_failover_notifier);
+
+	return 0;
+}
+module_init(net_failover_init);
+
+static __exit
+void net_failover_exit(void)
+{
+	unregister_netdevice_notifier(&net_failover_notifier);
+}
+module_exit(net_failover_exit);
+
+MODULE_DESCRIPTION("Failover infrastructure/interface for Paravirtual drivers");
+MODULE_LICENSE("GPL v2");
-- 
2.14.3

^ permalink raw reply related

* [PATCH net-next v9 1/4] virtio_net: Introduce VIRTIO_NET_F_STANDBY feature bit
From: Sridhar Samudrala @ 2018-04-27 17:06 UTC (permalink / raw)
  To: mst, stephen, davem, netdev, virtualization, virtio-dev,
	jesse.brandeburg, alexander.h.duyck, kubakici, sridhar.samudrala,
	jasowang, loseweigh, jiri, aaron.f.brown
In-Reply-To: <1524848820-42258-1-git-send-email-sridhar.samudrala@intel.com>

This feature bit can be used by hypervisor to indicate virtio_net device to
act as a standby for another device with the same MAC address.

VIRTIO_NET_F_STANDBY is defined as bit 62 as it is a device feature bit.

Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
---
 drivers/net/virtio_net.c        | 2 +-
 include/uapi/linux/virtio_net.h | 3 +++
 2 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 3b5991734118..51a085b1a242 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -2999,7 +2999,7 @@ static struct virtio_device_id id_table[] = {
 	VIRTIO_NET_F_GUEST_ANNOUNCE, VIRTIO_NET_F_MQ, \
 	VIRTIO_NET_F_CTRL_MAC_ADDR, \
 	VIRTIO_NET_F_MTU, VIRTIO_NET_F_CTRL_GUEST_OFFLOADS, \
-	VIRTIO_NET_F_SPEED_DUPLEX
+	VIRTIO_NET_F_SPEED_DUPLEX, VIRTIO_NET_F_STANDBY
 
 static unsigned int features[] = {
 	VIRTNET_FEATURES,
diff --git a/include/uapi/linux/virtio_net.h b/include/uapi/linux/virtio_net.h
index 5de6ed37695b..a3715a3224c1 100644
--- a/include/uapi/linux/virtio_net.h
+++ b/include/uapi/linux/virtio_net.h
@@ -57,6 +57,9 @@
 					 * Steering */
 #define VIRTIO_NET_F_CTRL_MAC_ADDR 23	/* Set MAC address */
 
+#define VIRTIO_NET_F_STANDBY	  62	/* Act as standby for another device
+					 * with the same MAC.
+					 */
 #define VIRTIO_NET_F_SPEED_DUPLEX 63	/* Device set linkspeed and duplex */
 
 #ifndef VIRTIO_NET_NO_LEGACY
-- 
2.14.3

^ permalink raw reply related

* [PATCH net-next v9 0/4] Enable virtio_net to act as a standby for a passthru device
From: Sridhar Samudrala @ 2018-04-27 17:06 UTC (permalink / raw)
  To: mst, stephen, davem, netdev, virtualization, virtio-dev,
	jesse.brandeburg, alexander.h.duyck, kubakici, sridhar.samudrala,
	jasowang, loseweigh, jiri, aaron.f.brown

v9:
Select NET_FAILOVER automatically when VIRTIO_NET/HYPERV_NET 
are enabled. (stephen)

Tested live migration with virtio-net/AVF(i40evf) configured in 
failover mode while running iperf in background.
Build tested netvsc module.

The main motivation for this patch is to enable cloud service providers
to provide an accelerated datapath to virtio-net enabled VMs in a 
transparent manner with no/minimal guest userspace changes. This also
enables hypervisor controlled live migration to be supported with VMs that
have direct attached SR-IOV VF devices.

Patch 1 introduces a new feature bit VIRTIO_NET_F_STANDBY that can be
used by hypervisor to indicate that virtio_net interface should act as
a standby for another device with the same MAC address.

Patch 2 introduces a failover module that provides a generic interface for 
paravirtual drivers to listen for netdev register/unregister/link change
events from pci ethernet devices with the same MAC and takeover their
datapath. The notifier and event handling code is based on the existing
netvsc implementation. It provides 2 sets of interfaces to paravirtual 
drivers to support 2-netdev(netvsc) and 3-netdev(virtio_net) models.

Patch 3 extends virtio_net to use alternate datapath when available and
registered. When STANDBY feature is enabled, virtio_net driver creates
an additional 'failover' netdev that acts as a master device and controls
2 slave devices.  The original virtio_net netdev is registered as
'standby' netdev and a passthru/vf device with the same MAC gets
registered as 'primary' netdev. Both 'standby' and 'primary' netdevs are
associated with the same 'pci' device.  The user accesses the network
interface via 'failover' netdev. The 'failover' netdev chooses 'primary'
netdev as default for transmits when it is available with link up and
running.

Patch 4 refactors netvsc to use the registration/notification framework
supported by failover module.

As this patch series is initially focusing on usecases where hypervisor 
fully controls the VM networking and the guest is not expected to directly 
configure any hardware settings, it doesn't expose all the ndo/ethtool ops
that are supported by virtio_net at this time. To support additional usecases,
it should be possible to enable additional ops later by caching the state
in virtio netdev and replaying when the 'primary' netdev gets registered. 
 
The hypervisor needs to enable only one datapath at any time so that packets
don't get looped back to the VM over the other datapath. When a VF is
plugged, the virtio datapath link state can be marked as down.
At the time of live migration, the hypervisor needs to unplug the VF device
from the guest on the source host and reset the MAC filter of the VF to
initiate failover of datapath to virtio before starting the migration. After
the migration is completed, the destination hypervisor sets the MAC filter
on the VF and plugs it back to the guest to switch over to VF datapath.

This patch is based on the discussion initiated by Jesse on this thread.
https://marc.info/?l=linux-virtualization&m=151189725224231&w=2

v8:
- Made the failover managment routines more robust by updating the feature 
  bits/other fields in the failover netdev when slave netdevs are 
  registered/unregistered. (mst)
- added support for handling vlans.
- Limited the changes in netvsc to only use the notifier/event/lookups
  from the failover module. The slave register/unregister/link-change 
  handlers are only updated to use the getbymac routine to get the 
  upper netdev. There is no change in their functionality. (stephen)
- renamed structs/function/file names to use net_failover prefix. (mst)

v7
- Rename 'bypass/active/backup' terminology with 'failover/primary/standy'
  (jiri, mst)
- re-arranged dev_open() and dev_set_mtu() calls in the register routines
  so that they don't get called for 2-netdev model. (stephen)
- fixed select_queue() routine to do queue selection based on VF if it is
  registered as primary. (stephen)
-  minor bugfixes

v6 RFC:
  Simplified virtio_net changes by moving all the ndo_ops of the 
  bypass_netdev and create/destroy of bypass_netdev to 'bypass' module.
  avoided 2 phase registration(driver + instances).
  introduced IFF_BYPASS/IFF_BYPASS_SLAVE dev->priv_flags 
  replaced mutex with a spinlock

v5 RFC:
  Based on Jiri's comments, moved the common functionality to a 'bypass'
  module so that the same notifier and event handlers to handle child
  register/unregister/link change events can be shared between virtio_net
  and netvsc.
  Improved error handling based on Siwei's comments.
v4:
- Based on the review comments on the v3 version of the RFC patch and
  Jakub's suggestion for the naming issue with 3 netdev solution,
  proposed 3 netdev in-driver bonding solution for virtio-net.
v3 RFC:
- Introduced 3 netdev model and pointed out a couple of issues with
  that model and proposed 2 netdev model to avoid these issues.
- Removed broadcast/multicast optimization and only use virtio as
  backup path when VF is unplugged.
v2 RFC:
- Changed VIRTIO_NET_F_MASTER to VIRTIO_NET_F_BACKUP (mst)
- made a small change to the virtio-net xmit path to only use VF datapath
  for unicasts. Broadcasts/multicasts use virtio datapath. This avoids
  east-west broadcasts to go over the PCI link.
- added suppport for the feature bit in qemu

Sridhar Samudrala (4):
  virtio_net: Introduce VIRTIO_NET_F_STANDBY feature bit
  net: Introduce generic failover module
  virtio_net: Extend virtio to use VF datapath when available
  netvsc: refactor notifier/event handling code to use the failover
    framework

 drivers/net/Kconfig             |   1 +
 drivers/net/hyperv/Kconfig      |   1 +
 drivers/net/hyperv/hyperv_net.h |   2 +
 drivers/net/hyperv/netvsc_drv.c | 134 ++----
 drivers/net/virtio_net.c        |  37 +-
 include/linux/netdevice.h       |  16 +
 include/net/net_failover.h      |  62 +++
 include/uapi/linux/virtio_net.h |   3 +
 net/Kconfig                     |  10 +
 net/core/Makefile               |   1 +
 net/core/net_failover.c         | 892 ++++++++++++++++++++++++++++++++++++++++
 11 files changed, 1046 insertions(+), 113 deletions(-)
 create mode 100644 include/net/net_failover.h
 create mode 100644 net/core/net_failover.c

-- 
2.14.3

^ permalink raw reply

* Re: [PATCH net] vhost: Use kzalloc() to allocate vhost_msg_node
From: Dmitry Vyukov via Virtualization @ 2018-04-27 16:29 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Kevin Easton, KVM list, netdev, syzkaller-bugs, LKML,
	virtualization
In-Reply-To: <CACT4Y+bzWiPvV+pVvys4v8CwUhF7iYVskxn_yeo6ztN5uKA0VA@mail.gmail.com>

On Fri, Apr 27, 2018 at 6:25 PM, Dmitry Vyukov <dvyukov@google.com> wrote:
>>> >> The struct vhost_msg within struct vhost_msg_node is copied to userspace,
>>> >> so it should be allocated with kzalloc() to ensure all structure padding
>>> >> is zeroed.
>>> >>
>>> >> Signed-off-by: Kevin Easton <kevin@guarana.org>
>>> >> Reported-by: syzbot+87cfa083e727a224754b@syzkaller.appspotmail.com
>>> >
>>> > Does it help if a patch naming the padding is applied,
>>> > and then we init just the relevant field?
>>> > Just curious.
>>>
>>> Yes, it would help.
>>
>> I think it's slightly better that way then. node has a lot of internal
>> stuff we don't care to init. Would you mind taking my patch and building
>> on top of that then?
>
>
> But it's asking for more information leaks in future. This looks like
> work for compiler.


Modern compilers are perfectly capable of doing this:

#include <memory.h>
#include <unistd.h>
int main()
{
    int x[10];
    memset(&x, 0, sizeof(x));
    x[0] = 0;
    x[2] = 2;
    x[3] = 3;
    x[4] = 4;
    x[5] = 5;
    x[6] = 6;
    x[7] = 7;
    x[8] = 8;
    x[9] = 9;
    write(0, x, sizeof(x));
    return 0;
}

gcc 7.2 -O3

0000000000000540 <main>:
 540:   sub    $0x38,%rsp
 544:   mov    $0x28,%edx
 549:   xor    %edi,%edi
 54b:   movdqa 0x1cd(%rip),%xmm0        # 720 <_IO_stdin_used+0x10>
 553:   mov    %rsp,%rsi
 556:   movq   $0x0,(%rsp)
 55e:   movups %xmm0,0x8(%rsp)
 563:   movdqa 0x1c5(%rip),%xmm0        # 730 <_IO_stdin_used+0x20>
 56b:   movups %xmm0,0x18(%rsp)
 570:   callq  520 <write@plt>
 575:   xor    %eax,%eax
 577:   add    $0x38,%rsp
 57b:   retq
 57c:   nopl   0x0(%rax)


But they will not put a security hole next time fields are shuffled.




>>> >> ---
>>> >>  drivers/vhost/vhost.c | 2 +-
>>> >>  1 file changed, 1 insertion(+), 1 deletion(-)
>>> >>
>>> >> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
>>> >> index f3bd8e9..1b84dcff 100644
>>> >> --- a/drivers/vhost/vhost.c
>>> >> +++ b/drivers/vhost/vhost.c
>>> >> @@ -2339,7 +2339,7 @@ EXPORT_SYMBOL_GPL(vhost_disable_notify);
>>> >>  /* Create a new message. */
>>> >>  struct vhost_msg_node *vhost_new_msg(struct vhost_virtqueue *vq, int type)
>>> >>  {
>>> >> -     struct vhost_msg_node *node = kmalloc(sizeof *node, GFP_KERNEL);
>>> >> +     struct vhost_msg_node *node = kzalloc(sizeof *node, GFP_KERNEL);
>>> >>       if (!node)
>>> >>               return NULL;
>>> >>       node->vq = vq;
>>> >> --
>>> >> 2.8.1

^ permalink raw reply

* Re: [PATCH net] vhost: Use kzalloc() to allocate vhost_msg_node
From: Dmitry Vyukov via Virtualization @ 2018-04-27 16:25 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Kevin Easton, KVM list, netdev, syzkaller-bugs, LKML,
	virtualization
In-Reply-To: <20180427191430-mutt-send-email-mst@kernel.org>

On Fri, Apr 27, 2018 at 6:15 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
>> >> The struct vhost_msg within struct vhost_msg_node is copied to userspace,
>> >> so it should be allocated with kzalloc() to ensure all structure padding
>> >> is zeroed.
>> >>
>> >> Signed-off-by: Kevin Easton <kevin@guarana.org>
>> >> Reported-by: syzbot+87cfa083e727a224754b@syzkaller.appspotmail.com
>> >
>> > Does it help if a patch naming the padding is applied,
>> > and then we init just the relevant field?
>> > Just curious.
>>
>> Yes, it would help.
>
> I think it's slightly better that way then. node has a lot of internal
> stuff we don't care to init. Would you mind taking my patch and building
> on top of that then?


But it's asking for more information leaks in future. This looks like
work for compiler.


>> >> ---
>> >>  drivers/vhost/vhost.c | 2 +-
>> >>  1 file changed, 1 insertion(+), 1 deletion(-)
>> >>
>> >> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
>> >> index f3bd8e9..1b84dcff 100644
>> >> --- a/drivers/vhost/vhost.c
>> >> +++ b/drivers/vhost/vhost.c
>> >> @@ -2339,7 +2339,7 @@ EXPORT_SYMBOL_GPL(vhost_disable_notify);
>> >>  /* Create a new message. */
>> >>  struct vhost_msg_node *vhost_new_msg(struct vhost_virtqueue *vq, int type)
>> >>  {
>> >> -     struct vhost_msg_node *node = kmalloc(sizeof *node, GFP_KERNEL);
>> >> +     struct vhost_msg_node *node = kzalloc(sizeof *node, GFP_KERNEL);
>> >>       if (!node)
>> >>               return NULL;
>> >>       node->vq = vq;
>> >> --
>> >> 2.8.1

^ permalink raw reply

* Re: [PATCH net] vhost: Use kzalloc() to allocate vhost_msg_node
From: Michael S. Tsirkin @ 2018-04-27 16:15 UTC (permalink / raw)
  To: Dmitry Vyukov
  Cc: Kevin Easton, KVM list, netdev, syzkaller-bugs, LKML,
	virtualization
In-Reply-To: <CACT4Y+b4fEOw=GjTsZMHVyQPOZmuqPWj6AJJshpS-UGRx=bhXA@mail.gmail.com>

On Fri, Apr 27, 2018 at 06:11:31PM +0200, Dmitry Vyukov wrote:
> On Fri, Apr 27, 2018 at 6:05 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> > On Fri, Apr 27, 2018 at 11:45:02AM -0400, Kevin Easton wrote:
> >> The struct vhost_msg within struct vhost_msg_node is copied to userspace,
> >> so it should be allocated with kzalloc() to ensure all structure padding
> >> is zeroed.
> >>
> >> Signed-off-by: Kevin Easton <kevin@guarana.org>
> >> Reported-by: syzbot+87cfa083e727a224754b@syzkaller.appspotmail.com
> >
> > Does it help if a patch naming the padding is applied,
> > and then we init just the relevant field?
> > Just curious.
> 
> Yes, it would help.

I think it's slightly better that way then. node has a lot of internal
stuff we don't care to init. Would you mind taking my patch and building
on top of that then?

> >> ---
> >>  drivers/vhost/vhost.c | 2 +-
> >>  1 file changed, 1 insertion(+), 1 deletion(-)
> >>
> >> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> >> index f3bd8e9..1b84dcff 100644
> >> --- a/drivers/vhost/vhost.c
> >> +++ b/drivers/vhost/vhost.c
> >> @@ -2339,7 +2339,7 @@ EXPORT_SYMBOL_GPL(vhost_disable_notify);
> >>  /* Create a new message. */
> >>  struct vhost_msg_node *vhost_new_msg(struct vhost_virtqueue *vq, int type)
> >>  {
> >> -     struct vhost_msg_node *node = kmalloc(sizeof *node, GFP_KERNEL);
> >> +     struct vhost_msg_node *node = kzalloc(sizeof *node, GFP_KERNEL);
> >>       if (!node)
> >>               return NULL;
> >>       node->vq = vq;
> >> --
> >> 2.8.1
> >
> > --
> > You received this message because you are subscribed to the Google Groups "syzkaller-bugs" group.
> > To unsubscribe from this group and stop receiving emails from it, send an email to syzkaller-bugs+unsubscribe@googlegroups.com.
> > To view this discussion on the web visit https://groups.google.com/d/msgid/syzkaller-bugs/20180427185501-mutt-send-email-mst%40kernel.org.
> > For more options, visit https://groups.google.com/d/optout.

^ permalink raw reply

* Re: [PATCH net] vhost: Use kzalloc() to allocate vhost_msg_node
From: Dmitry Vyukov via Virtualization @ 2018-04-27 16:11 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Kevin Easton, KVM list, netdev, syzkaller-bugs, LKML,
	virtualization
In-Reply-To: <20180427185501-mutt-send-email-mst@kernel.org>

On Fri, Apr 27, 2018 at 6:05 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> On Fri, Apr 27, 2018 at 11:45:02AM -0400, Kevin Easton wrote:
>> The struct vhost_msg within struct vhost_msg_node is copied to userspace,
>> so it should be allocated with kzalloc() to ensure all structure padding
>> is zeroed.
>>
>> Signed-off-by: Kevin Easton <kevin@guarana.org>
>> Reported-by: syzbot+87cfa083e727a224754b@syzkaller.appspotmail.com
>
> Does it help if a patch naming the padding is applied,
> and then we init just the relevant field?
> Just curious.

Yes, it would help.

>> ---
>>  drivers/vhost/vhost.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
>> index f3bd8e9..1b84dcff 100644
>> --- a/drivers/vhost/vhost.c
>> +++ b/drivers/vhost/vhost.c
>> @@ -2339,7 +2339,7 @@ EXPORT_SYMBOL_GPL(vhost_disable_notify);
>>  /* Create a new message. */
>>  struct vhost_msg_node *vhost_new_msg(struct vhost_virtqueue *vq, int type)
>>  {
>> -     struct vhost_msg_node *node = kmalloc(sizeof *node, GFP_KERNEL);
>> +     struct vhost_msg_node *node = kzalloc(sizeof *node, GFP_KERNEL);
>>       if (!node)
>>               return NULL;
>>       node->vq = vq;
>> --
>> 2.8.1
>
> --
> You received this message because you are subscribed to the Google Groups "syzkaller-bugs" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to syzkaller-bugs+unsubscribe@googlegroups.com.
> To view this discussion on the web visit https://groups.google.com/d/msgid/syzkaller-bugs/20180427185501-mutt-send-email-mst%40kernel.org.
> For more options, visit https://groups.google.com/d/optout.

^ permalink raw reply

* Re: [PATCH net] vhost: Use kzalloc() to allocate vhost_msg_node
From: Michael S. Tsirkin @ 2018-04-27 16:05 UTC (permalink / raw)
  To: Kevin Easton; +Cc: kvm, netdev, syzkaller-bugs, linux-kernel, virtualization
In-Reply-To: <20180427154502.GA22544@la.guarana.org>

On Fri, Apr 27, 2018 at 11:45:02AM -0400, Kevin Easton wrote:
> The struct vhost_msg within struct vhost_msg_node is copied to userspace,
> so it should be allocated with kzalloc() to ensure all structure padding
> is zeroed.
> 
> Signed-off-by: Kevin Easton <kevin@guarana.org>
> Reported-by: syzbot+87cfa083e727a224754b@syzkaller.appspotmail.com

Does it help if a patch naming the padding is applied,
and then we init just the relevant field?
Just curious.

> ---
>  drivers/vhost/vhost.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index f3bd8e9..1b84dcff 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -2339,7 +2339,7 @@ EXPORT_SYMBOL_GPL(vhost_disable_notify);
>  /* Create a new message. */
>  struct vhost_msg_node *vhost_new_msg(struct vhost_virtqueue *vq, int type)
>  {
> -	struct vhost_msg_node *node = kmalloc(sizeof *node, GFP_KERNEL);
> +	struct vhost_msg_node *node = kzalloc(sizeof *node, GFP_KERNEL);
>  	if (!node)
>  		return NULL;
>  	node->vq = vq;
> -- 
> 2.8.1

^ permalink raw reply

* [PATCH] vhost: make msg padding explicit
From: Michael S. Tsirkin @ 2018-04-27 16:02 UTC (permalink / raw)
  To: linux-kernel; +Cc: netdev, Kevin Easton, kvm, virtualization

There's a 32 bit hole just after type. It's best to
give it a name, this way compiler is forced to initialize
it with rest of the structure.

Reported-by: Kevin Easton <kevin@guarana.org>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 include/uapi/linux/vhost.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/include/uapi/linux/vhost.h b/include/uapi/linux/vhost.h
index c51f8e5..5a8ad06 100644
--- a/include/uapi/linux/vhost.h
+++ b/include/uapi/linux/vhost.h
@@ -68,6 +68,7 @@ struct vhost_iotlb_msg {
 
 struct vhost_msg {
 	int type;
+	int padding0;
 	union {
 		struct vhost_iotlb_msg iotlb;
 		__u8 padding[64];
-- 
MST

^ permalink raw reply related

* Re: [PATCH net-next v8 2/4] net: Introduce generic failover module
From: kbuild test robot @ 2018-04-27 12:39 UTC (permalink / raw)
  Cc: alexander.h.duyck, virtio-dev, jiri, mst, kubakici, netdev,
	virtualization, loseweigh, kbuild-all, sridhar.samudrala,
	aaron.f.brown, davem
In-Reply-To: <1524700768-38627-3-git-send-email-sridhar.samudrala@intel.com>

Hi Sridhar,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on net-next/master]

url:    https://github.com/0day-ci/linux/commits/Sridhar-Samudrala/Enable-virtio_net-to-act-as-a-standby-for-a-passthru-device/20180427-183842
reproduce:
        # apt-get install sparse
        make ARCH=x86_64 allmodconfig
        make C=1 CF=-D__CHECK_ENDIAN__


sparse warnings: (new ones prefixed by >>)

>> net/core/net_failover.c:544:39: sparse: incorrect type in argument 1 (different address spaces) @@    expected struct net_device *dev @@    got struct net_devicestruct net_device *dev @@
   net/core/net_failover.c:544:39:    expected struct net_device *dev
   net/core/net_failover.c:544:39:    got struct net_device [noderef] <asn:4>*standby_dev
   net/core/net_failover.c:547:39: sparse: incorrect type in argument 1 (different address spaces) @@    expected struct net_device *dev @@    got struct net_devicestruct net_device *dev @@
   net/core/net_failover.c:547:39:    expected struct net_device *dev
   net/core/net_failover.c:547:39:    got struct net_device [noderef] <asn:4>*primary_dev
>> net/core/net_failover.c:112:12: sparse: context imbalance in 'net_failover_select_queue' - wrong count at exit

vim +544 net/core/net_failover.c

   446	
   447	static int net_failover_slave_register(struct net_device *slave_dev)
   448	{
   449		struct net_failover_info *nfo_info;
   450		struct net_failover_ops *nfo_ops;
   451		struct net_device *failover_dev;
   452		bool slave_is_standby;
   453		u32 orig_mtu;
   454		int err;
   455	
   456		ASSERT_RTNL();
   457	
   458		failover_dev = net_failover_get_bymac(slave_dev->perm_addr, &nfo_ops);
   459		if (!failover_dev)
   460			goto done;
   461	
   462		if (failover_dev->type != slave_dev->type)
   463			goto done;
   464	
   465		if (nfo_ops && nfo_ops->slave_register)
   466			return nfo_ops->slave_register(slave_dev, failover_dev);
   467	
   468		nfo_info = netdev_priv(failover_dev);
   469		slave_is_standby = (slave_dev->dev.parent == failover_dev->dev.parent);
   470		if (slave_is_standby ? rtnl_dereference(nfo_info->standby_dev) :
   471				rtnl_dereference(nfo_info->primary_dev)) {
   472			netdev_err(failover_dev, "%s attempting to register as slave dev when %s already present\n",
   473				   slave_dev->name,
   474				   slave_is_standby ? "standby" : "primary");
   475			goto done;
   476		}
   477	
   478		/* We want to allow only a direct attached VF device as a primary
   479		 * netdev. As there is no easy way to check for a VF device, restrict
   480		 * this to a pci device.
   481		 */
   482		if (!slave_is_standby && (!slave_dev->dev.parent ||
   483					  !dev_is_pci(slave_dev->dev.parent)))
   484			goto done;
   485	
   486		if (failover_dev->features & NETIF_F_VLAN_CHALLENGED &&
   487		    vlan_uses_dev(failover_dev)) {
   488			netdev_err(failover_dev, "Device %s is VLAN challenged and failover device has VLAN set up\n",
   489				   failover_dev->name);
   490			goto done;
   491		}
   492	
   493		/* Align MTU of slave with failover dev */
   494		orig_mtu = slave_dev->mtu;
   495		err = dev_set_mtu(slave_dev, failover_dev->mtu);
   496		if (err) {
   497			netdev_err(failover_dev, "unable to change mtu of %s to %u register failed\n",
   498				   slave_dev->name, failover_dev->mtu);
   499			goto done;
   500		}
   501	
   502		dev_hold(slave_dev);
   503	
   504		if (netif_running(failover_dev)) {
   505			err = dev_open(slave_dev);
   506			if (err && (err != -EBUSY)) {
   507				netdev_err(failover_dev, "Opening slave %s failed err:%d\n",
   508					   slave_dev->name, err);
   509				goto err_dev_open;
   510			}
   511		}
   512	
   513		netif_addr_lock_bh(failover_dev);
   514		dev_uc_sync_multiple(slave_dev, failover_dev);
   515		dev_uc_sync_multiple(slave_dev, failover_dev);
   516		netif_addr_unlock_bh(failover_dev);
   517	
   518		err = vlan_vids_add_by_dev(slave_dev, failover_dev);
   519		if (err) {
   520			netdev_err(failover_dev, "Failed to add vlan ids to device %s err:%d\n",
   521				   slave_dev->name, err);
   522			goto err_vlan_add;
   523		}
   524	
   525		err = netdev_rx_handler_register(slave_dev, net_failover_handle_frame,
   526						 failover_dev);
   527		if (err) {
   528			netdev_err(slave_dev, "can not register failover rx handler (err = %d)\n",
   529				   err);
   530			goto err_handler_register;
   531		}
   532	
   533		err = netdev_upper_dev_link(slave_dev, failover_dev, NULL);
   534		if (err) {
   535			netdev_err(slave_dev, "can not set failover device %s (err = %d)\n",
   536				   failover_dev->name, err);
   537			goto err_upper_link;
   538		}
   539	
   540		slave_dev->priv_flags |= IFF_FAILOVER_SLAVE;
   541	
   542		if (slave_is_standby) {
   543			rcu_assign_pointer(nfo_info->standby_dev, slave_dev);
 > 544			dev_get_stats(nfo_info->standby_dev, &nfo_info->standby_stats);
   545		} else {
   546			rcu_assign_pointer(nfo_info->primary_dev, slave_dev);
   547			dev_get_stats(nfo_info->primary_dev, &nfo_info->primary_stats);
   548			failover_dev->min_mtu = slave_dev->min_mtu;
   549			failover_dev->max_mtu = slave_dev->max_mtu;
   550		}
   551	
   552		net_failover_compute_features(failover_dev);
   553	
   554		call_netdevice_notifiers(NETDEV_JOIN, slave_dev);
   555	
   556		netdev_info(failover_dev, "failover %s slave:%s registered\n",
   557			    slave_is_standby ? "standby" : "primary", slave_dev->name);
   558	
   559		goto done;
   560	
   561	err_upper_link:
   562		netdev_rx_handler_unregister(slave_dev);
   563	err_handler_register:
   564		vlan_vids_del_by_dev(slave_dev, failover_dev);
   565	err_vlan_add:
   566		dev_uc_unsync(slave_dev, failover_dev);
   567		dev_mc_unsync(slave_dev, failover_dev);
   568		dev_close(slave_dev);
   569	err_dev_open:
   570		dev_put(slave_dev);
   571		dev_set_mtu(slave_dev, orig_mtu);
   572	done:
   573		return NOTIFY_DONE;
   574	}
   575	

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

^ permalink raw reply

* Re: [dm-devel] [PATCH v5] fault-injection: introduce kvmalloc fallback options
From: Mikulas Patocka @ 2018-04-27 10:20 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew, eric.dumazet, Michael S. Tsirkin, edumazet, netdev,
	Randy Dunlap, John Stoffel, linux-kernel, Matthew Wilcox,
	virtualization, James Bottomley, Michal, dm-devel, David Miller,
	David Rientjes, Morton, linux-mm, Vlastimil Babka
In-Reply-To: <20180427082555.GC17484@dhcp22.suse.cz>



On Fri, 27 Apr 2018, Michal Hocko wrote:

> On Thu 26-04-18 18:52:05, Mikulas Patocka wrote:
> > 
> > 
> > On Fri, 27 Apr 2018, Michael S. Tsirkin wrote:
> [...]
> > >    But assuming it's important to control this kind of
> > >    fault injection to be controlled from
> > >    a dedicated menuconfig option, why not the rest of
> > >    faults?
> > 
> > The injected faults cause damage to the user, so there's no point to 
> > enable them by default. vmalloc fallback should not cause any damage 
> > (assuming that the code is correctly written).
> 
> But you want to find those bugs which would BUG_ON easier, so there is a
> risk of harm IIUC

Yes, I want to harm them, but I only want to harm the users using the 
debugging kernel. Testers should be "harmed" by crashes - so that the 
users of production kernels are harmed less.

If someone hits this, he should report it, use the kernel parameter to 
turn it off and continue with the testing.

> and this is not much different than other fault injecting paths.

Fault injections causes misbehavior even on completely bug-free code (for 
example, syscalls randomly returning -ENOMEM). This won't cause 
misbehavior on bug-free code.

Mikulas

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox